Reporting bias arises from the provenance of training data: the data set reflects a bias in which data were included in the first place. Often overlooked, reporting bias is common wherever humans are involved in initiating, sampling, or recording the data eventually used to train a machine learning model.
A real-world example best illustrates why organizations need to be aware of potential reporting bias. Early in C3 AI's history, we developed machine learning algorithms to detect customer fraud. In one customer deployment, the algorithms significantly underperformed in a particular geography: a remote island. On further examination, we found substantial reporting bias in the data set from the island: every historical investigation performed there was a fraud case, skewing the data distributions for that region.
Because of the island's remoteness, investigators wanted to be confident a case was fraudulent before traveling there. The algorithm incorrectly maximized performance by assigning every customer on the island a high fraud score. Because the frequency of events, properties, and outcomes in the training set from the island differed from their real-world frequency, the model required adjustment to counteract the implicit bias caused by the selective fraud inspections there.
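One way to surface this kind of skew before training is to compare the observed label rate in each segment (here, each geography) against the overall rate. The sketch below is a minimal, hypothetical illustration of that check; the function names, the 0.5 threshold, and the toy data are assumptions for illustration, not C3 AI's actual implementation.

```python
from collections import Counter

def fraud_rate_by_region(records):
    """Observed fraud rate per region.

    records: iterable of (region, is_fraud) pairs, where is_fraud is 0 or 1.
    """
    totals, positives = Counter(), Counter()
    for region, is_fraud in records:
        totals[region] += 1
        positives[region] += int(is_fraud)
    return {r: positives[r] / totals[r] for r in totals}

def flag_skewed_regions(records, threshold=0.5):
    """Flag regions whose fraud rate deviates from the overall rate
    by more than `threshold` (absolute difference) -- a crude signal
    of possible reporting bias worth manual review.
    """
    rates = fraud_rate_by_region(records)
    overall = sum(int(f) for _, f in records) / len(records)
    return {r for r, rate in rates.items() if abs(rate - overall) > threshold}

# Toy data echoing the island scenario: mainland cases mix fraud and
# non-fraud, but every recorded island case is fraud.
records = [("mainland", 0)] * 90 + [("mainland", 1)] * 10 + [("island", 1)] * 5
print(flag_skewed_regions(records))  # → {'island'}
```

A segment flagged this way, such as the island where the observed fraud rate is 100%, tells the data scientist that the recorded cases are not a representative sample and that the model's inputs need reweighting or additional data collection before deployment.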
C3 AI Suite and C3 AI Applications provide sophisticated capabilities to explore training data sets and evaluate model performance prior to production deployment. In addition, based on C3 AI's extensive experience helping organizations solve large-scale problems with machine learning, we have codified best practices for detecting bias in data sets that we make available to the organizations we work with.