Information (or data) leakage is undesired behavior in machine learning in which information that should not be available at training time makes its way into the training data, inflating the model's apparent performance during development and causing poor performance at prediction time or in production. Models subject to information leakage do not generalize well to unseen data.
There are multiple types of data leakage, including target leakage (a feature encodes information about the label that would not be available at prediction time), train-test contamination (preprocessing statistics are computed on the full dataset before it is split), and temporal leakage (a model is trained on data from the future relative to the rows it is asked to predict).
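Train-test contamination is the easiest of these to demonstrate. The following sketch, using a hypothetical toy dataset in plain Python, shows how fitting a preprocessing step (here, mean-centering) on all rows lets test-set statistics leak into the training data:

```python
# Hypothetical toy data: the last value is a held-out test-set outlier.
data = [2.0, 4.0, 6.0, 8.0, 100.0]

# Wrong: the centering mean is computed over train AND test rows,
# so the test outlier influences how the training rows are transformed.
leaky_mean = sum(data) / len(data)

# Right: split first, then derive statistics from the training rows only.
train, test = data[:4], data[4:]
train_mean = sum(train) / len(train)

print(leaky_mean)  # 24.0 -- contaminated by the held-out outlier
print(train_mean)  # 5.0  -- computed from training data only
```

The same principle applies to any fitted preprocessing step (scalers, imputers, encoders): fit it on the training split only, then apply it unchanged to validation and test data.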
Avoiding or detecting information leakage early is important: it prevents models from learning the wrong signals and prevents teams from overestimating a model's value before it goes into production.
In addition to following data science best practices, model interpretability is a great tool to identify and fight information leakage.
At C3.ai, data scientists are well-versed in information leakage problems and how to detect them. C3.ai carefully splits the data into separate groups – training, validation, and test sets – and holds the test set out to report final performance only after the model has been optimized on the validation set. For time-series data, C3 AI applications always apply a cut-off timestamp or time-series cross-validation.
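The chronological split described above can be sketched as follows. This is an illustrative example with made-up timestamps and cut-off values, not C3.ai's actual implementation: all training rows must precede a training cut-off timestamp, and validation and test rows come strictly after it, so the model never sees the future.

```python
# Hypothetical (timestamp, value) rows, already sorted by time.
records = [(t, t * 0.5) for t in range(10)]

def chronological_split(rows, train_end, valid_end):
    """Split time-ordered rows at two cut-off timestamps:
    train < train_end <= validation < valid_end <= test."""
    train = [r for r in rows if r[0] < train_end]
    valid = [r for r in rows if train_end <= r[0] < valid_end]
    test  = [r for r in rows if r[0] >= valid_end]
    return train, valid, test

train, valid, test = chronological_split(records, train_end=6, valid_end=8)
print(len(train), len(valid), len(test))  # 6 2 2
```

A random shuffle-based split would scatter future rows into the training set, which is exactly the temporal leakage the cut-off prevents.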