Develop strategies to move AI models from the lab into production safely and effectively

Enterprises that deploy AI systems face unique challenges not presented by other software. Traditional enterprise software can be tested deterministically to verify its behavior, and it is only as fair as its engineers design it to be. Once deployed, traditional software’s performance remains consistent.

But AI applications are another story entirely. They evolve as underlying data and infrastructure change, making it difficult to maintain performance, transparency, auditability, and fairness.

Without careful planning, moving AI initiatives from the lab to production can end in disaster. Lacking proper governance structures, bad AI models could make it into production and good models might go bad — costing companies substantially in negative brand impacts, regulatory action, safety, and profits as customers, employees, and critical assets are lost.

Over the past decade of delivering enterprise AI applications to Fortune 500 companies across a wide range of industries, C3 AI has developed expertise in helping customers design comprehensive AI model governance strategies that address these challenges. With AI model governance in place, enterprises can press ahead with their most ambitious AI projects, confident that their businesses are protected and that their AI initiatives will succeed. C3 AI works with companies that are blazing new trails in the AI domain. Read on to learn how our customers are designing model governance strategies that ensure their enterprise AI applications are properly maintained and continue to perform well while remaining transparent, auditable, and fair.

Transparency

Complex AI models are opaque and difficult to interpret. Unlike traditional software, engineers can’t point to “if/then” logic to explain a software outcome to a business stakeholder, regulator, or customer. This lack of transparency can impair decision making and lead to economic damage and regulatory action if enterprises misunderstand, improperly apply, or blindly follow AI models. Alternatively, a lack of transparency can lead to user distrust and refusal to use AI applications at all.

Fortunately, options exist to make AI transparent. Option 1 is to simplify the AI algorithm itself, using a more straightforward model – for example, linear or tree-based – to reduce opacity, although this sometimes comes at the cost of reduced model performance. Option 2 is to deploy every AI model with an “interpreter” module that deduces which factors the model weighted most heavily when making predictions. Interpreter modules can take model-agnostic approaches, such as LIME or Shapley-value methods, or model-specific approaches, such as tree interpreters.
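To make the interpreter-module idea concrete, here is a minimal, self-contained sketch of perturbation-based attribution: each feature is replaced with a baseline value and the change in the model's output is reported as that feature's contribution. This is a simplified cousin of LIME/SHAP-style attribution, not the API of either library; the model, weights, and feature names are invented for illustration.

```python
# Illustrative model-agnostic interpreter: attribute a prediction to features
# by replacing each feature with a baseline value and measuring the change in
# the model's output. (Toy "credit risk" model; weights are hypothetical.)

def predict(features):
    # Stand-in black-box model: a linear score over named features.
    weights = {"income": -0.4, "debt_ratio": 0.7, "late_payments": 0.5}
    return sum(weights[name] * value for name, value in features.items())

def explain(predict_fn, features, baseline):
    """Return per-feature contributions relative to a baseline input."""
    contributions = {}
    for name in features:
        perturbed = dict(features, **{name: baseline[name]})
        contributions[name] = predict_fn(features) - predict_fn(perturbed)
    return contributions

borrower = {"income": 1.2, "debt_ratio": 0.8, "late_payments": 3.0}
baseline = {"income": 0.0, "debt_ratio": 0.0, "late_payments": 0.0}
print(explain(predict, borrower, baseline))
```

For a linear model like this toy one, the occlusion-based contributions coincide with exact Shapley values; for nonlinear models, libraries such as SHAP approximate them more carefully.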

Transparency is a top priority for one commercial lending customer using the C3 AI Smart Lending application. This multinational bank needs to be able to explain AI-driven lending decisions to regulators, which requires that each credit risk prediction be accompanied by a human-readable interpretation that allows credit officers to identify strengths and weaknesses for each potential borrower. To meet these requirements, the bank is using C3 AI’s out-of-the-box ML Pipeline technology for model training and inference, which integrates natively with leading open-source interpreters like ELI5, LIME, SHAP, and TreeInterpreter to explain each prediction in its loan approval application. The application preserves confidentiality and prevents gaming the system by aggregating local feature contributions above the feature level. ELI5, the interpreter module chosen for this application, also is used in the C3 AI® Anti-Money Laundering application to identify the specific money laundering typologies – such as structuring, shell accounts, or trade finance – that contribute most to each fraud prediction.

Auditability

Traditional enterprise software is largely static after it is released to production, evolving slowly through occasional enhancements and upgrades that are initiated and tracked through DevSecOps processes and code control. Enterprise AI applications are much more dynamic. Significant data changes can happen with minimal notice in production environments, which means that AI models need to evolve continuously and rapidly. Thousands of models, each with different parameters and dependencies, may be developed, tested, deployed, and used in parallel, requiring dynamic adjustments to changing data and business needs. Auditing system outcomes and tracing the many variants, both past and present, of AI models quickly can become hopelessly complex.

Smart ML model management is the necessary antidote that makes auditing of AI systems possible. An ML model management framework lets enterprises track both the AI models deployed to production (champions) and the candidate models that may replace them (challengers). To support an enterprise’s ability to trace back the details of model deployments, the framework captures when each model was deployed as well as its algorithm, libraries, and parameters. In conjunction with model management, a good framework tags results and associated data, establishing data lineage and allowing end-to-end traceability of each of a model’s thousands of results.
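As a rough sketch of the metadata such a framework captures at deployment time, the record below tracks the champion/challenger role, algorithm, library versions, parameters, and training-data version needed to trace any prediction back to its model. The field names and registry are illustrative, not the C3 AI schema.

```python
# Minimal sketch of a model-management record: enough metadata to audit
# which model produced which results, and from which data it was trained.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelRecord:
    model_id: str
    role: str                   # "champion" or "challenger"
    algorithm: str
    library_versions: dict      # pin dependencies for reproducibility
    hyperparameters: dict
    training_data_version: str  # ties predictions back to data lineage
    deployed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

registry = {}

def deploy(record):
    """Register a model version so every result can be traced to it."""
    registry[record.model_id] = record

deploy(ModelRecord(
    model_id="churn-clf-v7",
    role="challenger",
    algorithm="gradient_boosted_trees",
    library_versions={"xgboost": "1.7.6"},
    hyperparameters={"max_depth": 6, "eta": 0.1},
    training_data_version="customer-features-2023-09-01",
))
```

In a real deployment this record would live in a durable store and be stamped onto every prediction, so auditors can reconstruct exactly which model variant, parameters, and data produced a given outcome.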

Auditability is mandatory for a Fortune 100 customer using AI for churn management. Not only does this bank need to be ready to reproduce its AI decisions for regulators at a moment’s notice, but it also must be able to highlight the specific data inputs used to make recommendations. To meet these requirements, the bank is using C3 AI’s out-of-the-box ML Pipeline technology for model training and inferencing, combined with C3 AI ML Studio’s model deployment framework. These tools enable auditability of the full model lifecycle. As new challenger models are trained, approved, and promoted to the live production application, both the data and modeling lineage are stored to ensure auditability and reproducibility.

Fairness

“Software is not free of human influence. Algorithms are written and maintained by people, and ML algorithms adjust what they do based on people’s behavior. As a result, say researchers in computer science, ethics and law, algorithms can reinforce human prejudices.” [Miller 2015].

ML algorithms fueled by big data are used to make decisions about healthcare, employment, education, housing, and policing even as evidence accumulates that such models drive discrimination. Models developed with the best of intentions inadvertently may exhibit bias against historically disadvantaged groups, perform relatively worse for certain demographics, or promote inequality.

Discrimination is at the heart of machine learning, but enterprises must avoid making statistical discrimination the basis for unjustified differentiation. This can occur because of practical irrelevance – for example, incorporating race or gender in prediction tasks such as employment – or moral irrelevance despite statistical relevance – for example, incorporating disability into employment decisions.

Avoiding unfair differentiation is easier said than done. In cases where unjustified bias related to race is suspected, for example, the “easy fix” of removing race as a feature may not solve the problem; race may be correlated with other features, such as ZIP code, which may continue to propagate bias. Instead, the best practice would be to include race explicitly as a feature of the data set used to train the model, and then correct for bias.

Human bias that exists within the training data is the primary cause of unfairness in ML systems. AI models tend to amplify such biases. Several fairness criteria have been developed to quantify and correct for discriminatory bias in classification tasks, including demographic parity [Zemel 2013], equal opportunity, and equalized odds [Hardt et al. 2016].
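The three criteria just named can all be computed from per-group confusion-matrix counts. The sketch below does exactly that for two hypothetical groups (the counts are invented): demographic parity compares selection rates, equal opportunity compares true positive rates, and equalized odds requires both true and false positive rates to match.

```python
# Compute the fairness gaps for demographic parity, equal opportunity, and
# equalized odds from per-group confusion counts (tp, fp, tn, fn).
# The counts below are made up for illustration.

def rates(c):
    positive_rate = (c["tp"] + c["fp"]) / sum(c.values())  # selection rate
    tpr = c["tp"] / (c["tp"] + c["fn"])                    # true positive rate
    fpr = c["fp"] / (c["fp"] + c["tn"])                    # false positive rate
    return positive_rate, tpr, fpr

group_a = {"tp": 40, "fp": 10, "tn": 35, "fn": 15}
group_b = {"tp": 25, "fp": 5,  "tn": 55, "fn": 15}

(pr_a, tpr_a, fpr_a), (pr_b, tpr_b, fpr_b) = rates(group_a), rates(group_b)

demographic_parity_gap = abs(pr_a - pr_b)         # equal selection rates
equal_opportunity_gap  = abs(tpr_a - tpr_b)       # equal TPR only
equalized_odds_gap     = max(abs(tpr_a - tpr_b),  # equal TPR *and* FPR
                             abs(fpr_a - fpr_b))

print(demographic_parity_gap, equal_opportunity_gap, equalized_odds_gap)
```

A "fair" classifier under a given criterion is one whose corresponding gap is (near) zero; which gap to minimize is precisely the judgment call the following paragraphs discuss.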

Ultimately, because there is no consensus on how to define fairness, enterprises should avoid trying to guarantee fairness with a single technical framework. Rather, they should teach AI practitioners to critically examine the social stakes of each project. Data scientists can then adapt fairness frameworks to each project to avoid propagating systematic discrimination at scale.

At C3 AI, we leveraged the equalized odds approach on a recent project that involved classifying people based on historical behavior. When building “fair” AI models, it is crucial to understand when to use each fairness metric and what to consider when applying it. An acceptable trade-off between accuracy and fairness usually can be found.

Native C3 AI Suite functionality also helps data scientists ensure the fairness of their algorithms. The C3 AI Metrics Engine provides a machine learning feature store, which helps data scientists centrally store, version, and keep track of all features used in machine learning models, enabling subsequent audits for fairness.

Performance

Enterprises may choose to trade away some model performance to increase fairness. The equalized odds approach mentioned above, for example, constrains a model so that true positive and false positive rates are equal across groups, which can force the model to misclassify some outcomes it would otherwise predict correctly.

Considerations of fairness aside, model performance is at risk any time an enterprise prematurely pushes a project from the lab into production. Accuracy assessments conducted in the lab often do not apply to the real world, because the distribution, quality, and key characteristics of data in training sets do not match reality. An image recognition model that is trained on high-resolution images, for example, typically will experience performance reductions when deployed on photos taken in poor lighting in the field.

Data collection, transmission, and processing pipelines can also lead to performance degradations. A model that returns consistent results for batch predictions, for example, may have reduced performance when handling streaming predictions because of the differences in processing between streaming and batch data. Enterprises can overcome this challenge with a robust technology stack, such as the C3 AI Suite, that leverages standardized modules for both batch and stream processing and manages the entire end-to-end process from data ingestion to ML predictions.
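The design principle behind such standardized modules can be sketched simply: route batch and streaming data through the same transformation code, so the two paths cannot drift apart. The function names and the toy anomaly rule below are illustrative, not C3 AI Suite APIs.

```python
# Sketch: a single feature-engineering function shared by the batch and
# streaming scoring paths, so predictions cannot diverge between them.

def featurize(record):
    """Single source of truth for feature engineering."""
    return {"usage_norm": record["usage_kwh"] / 24.0, "meter": record["meter"]}

def predict(features):
    # Toy anomaly rule standing in for a trained model.
    return features["usage_norm"] > 1.0

def score_batch(records):
    # Batch path: score a full list of records at once.
    return [predict(featurize(r)) for r in records]

def score_stream(record_iter):
    # Streaming path: score events one at a time, reusing the same code.
    for record in record_iter:
        yield predict(featurize(record))

readings = [{"meter": "m1", "usage_kwh": 30}, {"meter": "m2", "usage_kwh": 12}]
assert score_batch(readings) == list(score_stream(iter(readings)))
```

The final assertion is the point: because both paths call the same `featurize` and `predict`, a record produces the same score whether it arrives in a nightly batch or on a live stream.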

Uncertainty is another factor that complicates model performance management. Classifier accuracy is often measured against a fixed label that does not reflect real-world complexity. Take, for example, a medical diagnosis classifier. Complex cases about which physicians disagree ultimately are labeled either true or false in the training data. Enterprises must account for this kind of uncertainty by placing AI systems within larger processes where sources of uncertainty are discussed and incorporated into decision making, rather than simply rejected and replaced by a binary true or false signal.

A final example from a customer illustrates the performance challenges that enterprise AI systems face. This customer needed to develop classifiers to predict electricity theft for a national utility. Models returned good results in development environments and were rolled out incrementally for testing in the field. Using automated model performance scoring and alerts provided by the C3 AI Suite, the data science team detected disproportionately high false positive rates for one specific region — a remote island. Further investigation revealed that this island historically had a high ratio of confirmed fraud cases to fraud inspections, which explained why the classifier predicted fraud there at higher rates. Did the islanders really steal at higher rates than their mainland neighbors? Likely not. Indeed, C3 AI ultimately found that fraud inspectors historically had visited the remote island only when they were highly confident of theft. In contrast, mainland customers were inspected more frequently on slimmer evidence. The high ratio of historical electricity theft, and the resulting challenges with model performance, resulted from sampling bias.

Examples like this demonstrate why enterprises must take care to ensure that model training results are replicated in production via data collection, modeling, and validation best practices. Integrated data science platforms like the C3 AI Suite encourage best practices by making the transition from experimentation to production seamless. Teams can roll out newly trained and validated models incrementally as randomized A/B experiments or shadow models, then automate monitoring and operation of all models in the live application to ensure careful management of performance and fairness metrics for all models.

The C3 AI Suite enables comprehensive model performance management via out-of-the-box AutoML capabilities. A fully managed, automated hyperparameter optimization framework takes a model and its tuning parameters and leverages Bayesian optimization as well as random and grid search to find the best set of hyperparameters. The C3 AI Model Deployment framework is the engine that allows users to operate multiple modeling projects per application – A/B experiments, champion-challenger deployments, shadow tests, and canary deployments – all built into production environments. Users can train and serve one model per population segment or per asset, and all of this can be set up in a no-code, low-code, or code-based fashion.

As part of the model deployment framework, users also get automated, event-based stream processing of data, as opposed to a cron job that makes predictions on data, say, once a day. For instance, this framework can automate the risk score prediction for a turbine so that it runs only when the temperature on a blade surpasses a certain threshold. Providing predictions only when needed optimizes performance and saves money, which is crucially important when dealing with hundreds of thousands or millions of assets, as some of our customers do. Once models are promoted to production, enterprises must track model performance to ensure model integrity. The framework provides the capability to set up operational dashboards, offering full auditability and traceability of models through the production phase. All of these capabilities are provided under the umbrella of C3 AI ML Studio, part of the C3 AI Integrated Development Studio (IDS).
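The event-based inference pattern in the turbine example above can be sketched in a few lines: the model is invoked only for events that satisfy the trigger condition, rather than on a fixed schedule over every asset. The threshold, scoring function, and field names below are hypothetical, not the C3 AI framework's API.

```python
# Illustrative event-based inference: score a turbine only when blade
# temperature crosses a threshold, instead of running a scheduled job over
# every asset. (Threshold, score formula, and names are invented.)

TEMP_THRESHOLD_C = 80.0

def risk_score(event):
    # Stand-in for the deployed model's inference call.
    return min(1.0, event["blade_temp_c"] / 120.0)

def on_sensor_event(event, scores):
    """Invoke the model only for events satisfying the trigger condition."""
    if event["blade_temp_c"] > TEMP_THRESHOLD_C:
        scores.append((event["turbine_id"], risk_score(event)))

scores = []
for event in [{"turbine_id": "t1", "blade_temp_c": 60.0},
              {"turbine_id": "t2", "blade_temp_c": 96.0}]:
    on_sensor_event(event, scores)

print(scores)  # only the hot turbine triggered inference
```

With millions of assets, skipping inference for the cool turbine in this example is exactly the compute saving the paragraph above describes.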

Monitoring and Maintenance

Technical teams typically start focusing on monitoring and maintenance after models are deployed to production. Monitoring and maintenance are required to ensure that performance does not drop over time and that models are meeting the risk tolerance of end consumers. But unlike a typical software application, it is not always obvious when an AI model has stopped performing as expected and requires maintenance or replacement.

Myriad performance monitoring methods are available. AI systems can compute and store time series metrics on various model KPIs, including simple statistics on model inputs/outputs such as distributions or frequencies (depending on the model type), counts of true or false positive predictions, latency of API endpoints, and memory/CPU utilization. Systems can also leverage event logging to capture more contextual information around key steps in workflows, including exceptions, which may not be captured easily in a time series metric. For microservice architectures, distributed network tracing can be used to track events that traverse multiple services. The key is to ensure that enough information is tracked to identify and debug predictable failures without degrading the performance of overall systems or overwhelming technical teams with data.
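One of the simplest monitors described above — tracking a statistic on model outputs over a rolling window and alerting when it drifts from the validation-time baseline — can be sketched as follows. The baseline rate, window size, and tolerance are illustrative choices, not recommended defaults.

```python
# Rolling-window monitor on a model-output statistic: flag the model when
# the live positive-prediction rate drifts far from the baseline rate
# observed at validation time. (All numbers are illustrative.)
from collections import deque

class PositiveRateMonitor:
    def __init__(self, baseline_rate, window=100, tolerance=0.15):
        self.baseline = baseline_rate
        self.window = deque(maxlen=window)   # last N predictions
        self.tolerance = tolerance

    def record(self, prediction):
        self.window.append(1 if prediction else 0)

    def out_of_spec(self):
        if len(self.window) < self.window.maxlen:
            return False                     # not enough data yet
        live_rate = sum(self.window) / len(self.window)
        return abs(live_rate - self.baseline) > self.tolerance

monitor = PositiveRateMonitor(baseline_rate=0.10, window=100)
for i in range(100):
    monitor.record(i % 2 == 0)               # live stream: 50% positives

print(monitor.out_of_spec())                 # drift versus 10% baseline
```

In practice the same pattern extends to latency, input distributions, and true/false positive counts, each stored as a time series metric and wired to alerts.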

Enterprises should establish an AI model management process prior to deploying any models to production. In order to ensure that model management remains transparent and auditable, processes should reproduce as closely as possible the steps used in model development for initial training and tuning.

One customer uses C3 AI ML Pipeline’s automated performance scoring feature to track linear correlation analyses, false positive rates, and other metrics that identify when AI models fall out of spec. Incumbent models are contested by newly trained challenger models, which are promoted to the live application using C3 AI ML Studio’s native multi-model deployment features. Champion models serve predictions to the application UI until model drift and other performance metrics signal a need to swap, promote, demote, or retrain models.
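The promotion step in that champion/challenger workflow reduces to a small piece of logic: swap roles when the challenger's monitored score beats the champion's by more than a margin. The margin, scores, and model names below are hypothetical, and a production system would gate the swap behind human approval.

```python
# Hedged sketch of champion/challenger promotion: swap roles when the
# challenger clearly outperforms the champion on a monitored metric.
# (Margin, scores, and model IDs are illustrative.)

def maybe_promote(deployment, champion_score, challenger_score, margin=0.02):
    """Promote the challenger if it beats the champion by more than margin."""
    if challenger_score > champion_score + margin:
        deployment["champion"], deployment["challenger"] = (
            deployment["challenger"], deployment["champion"])
        return True
    return False

deployment = {"champion": "churn-clf-v6", "challenger": "churn-clf-v7"}
promoted = maybe_promote(deployment,
                         champion_score=0.81,    # e.g., live AUC of champion
                         challenger_score=0.86)  # challenger on same window

print(promoted, deployment["champion"])
```

The margin guards against swapping models on noise: a challenger must outperform by a meaningful amount, measured over the same evaluation window, before it takes over serving predictions.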

The importance of model governance must not be underestimated. Make governance decisions early in the process – at the latest, before transitioning AI models from the lab to production. These decisions map out the critical questions to ask during development about auditability, transparency, fairness, and performance, and they guide what needs to happen to operationalize, maintain, and closely monitor AI models. Ensuring that a model governance strategy is in place from the start helps to drive confidence as AI expands across the enterprise. The C3 AI Suite is built with the comprehensive ML lifecycle in mind and designed to enable ML governance at scale.

About the authors

AJ Christensen is a product manager at C3 AI. He is focused on creating solutions for financial services, including AI for commercial credit, capital markets, and compliance. He holds an MBA from the Berkeley Haas School of Business and a bachelor’s degree in economics from Brigham Young University. AJ is a finance quant turned AI advocate who is passionate about using data and technology to reinvent the way we do business.

Mehdi Maasoumy is a principal data scientist at C3 AI, where he leads AI teams that develop machine learning, deep learning, and optimization algorithms to solve previously unsolvable business problems, including stochastic optimization of inventory and predictive maintenance, across a wide range of industries such as oil and gas, manufacturing, energy, and healthcare. Mehdi holds an M.Sc. and a Ph.D. in engineering from the University of California, Berkeley, and a B.Sc. from Sharif University of Technology. He is the recipient of three best paper awards from ACM and IEEE. In his free time, he enjoys running and reading.

Wilbur Tong is a delivery manager at C3 AI. He has been delivering enterprise AI applications for the past decade across major industries and geographies, with a focus on natural language processing (NLP) and machine vision use cases. Wilbur holds a Bachelor of Software Engineering from the University of Queensland, Australia.