December 21, 2018
A Design Pattern for Explainability and Reproducibility in Production ML

The past decade has seen tremendous growth in production deployments of machine learning algorithms across a range of applications such as targeted advertising, self-driving cars, speech translation, and medical diagnosis [1]. In these contexts, models make key decisions such as predicting the likelihood that a person will commit a future crime, assessing trustworthiness for loan approval, or producing a medical diagnosis [2]. Bias based on gender, geographical location, race, and other attributes, along with its negative impact, has been uncovered in several of these deployments [3], [4]. Industries and governments are reacting by enacting regulations requiring that decisions made by machine learning models be interpretable and explainable [5].

Explainability across the full range of ML and DL algorithms is an unsolved research problem, with many innovations over the last several years and entire conferences devoted to the topic. However, even simple explainability solutions that are considered established in development (training environments) run into additional difficulties when put into live production.

This week at the International Conference on Machine Learning Applications (IEEE ICMLA 2018), our Lead Architect Swami Sundararaman will be presenting a paper on production explainability and reproducibility. The paper covers a design pattern for production explainability that users can put into practice today with our MLOps solution, MCenter. The slides for his presentation can be found at [8]. In this blog post, we summarize the design pattern and how it can be used in production deployments.

Canary Models for Explainability, and Production Challenges

Our design pattern uses a well-known technique for explainability: the canary model (sometimes called a surrogate model) [6,7]. In this approach, a classically non-explainable technique, such as a neural network, is paired with an explainable model, such as a decision tree, that approximates the predictions of the non-explainable technique. As long as the predictions match, the canary model’s behavior can be used to provide human-understandable reasoning for the prediction. The primary (non-explainable) model typically has better accuracy than the explainable canary, since otherwise the canary could be used directly. However, the two are expected to be sufficiently close in prediction patterns that the number of deviations (non-explainable predictions) is small.
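The pairing described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `primary` is a hypothetical stand-in for an opaque model (not an actual neural network), the canary is a one-split decision stump rather than a full decision tree, and all names are invented for this example. The key point it shows is that the canary is fit to the primary's predictions, and that fidelity is the fraction of inputs on which the two agree.

```python
from typing import List, Sequence, Tuple

Row = Sequence[int]


def primary(row: Row) -> int:
    """Stand-in for an opaque primary model (hypothetical scoring rule)."""
    return 1 if 10 * row[0] + row[1] - 65 > 0 else 0


def fit_stump(rows: List[Row], targets: List[int]) -> Tuple[int, int]:
    """Fit a one-split stump that predicts 1 when row[feature] > threshold.

    Tries every feature and every observed value as a threshold and keeps
    the split that reproduces the most target labels.
    """
    best_feature, best_threshold, best_matches = 0, 0, -1
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            matches = sum(
                int(row[feature] > threshold) == target
                for row, target in zip(rows, targets)
            )
            if matches > best_matches:
                best_feature, best_threshold, best_matches = feature, threshold, matches
    return best_feature, best_threshold


# Training inputs: a small integer grid, so results are exactly reproducible.
train_rows = [(x, y) for x in range(11) for y in range(11)]

# Key step: the canary is trained on the primary's predictions, not ground truth.
primary_labels = [primary(row) for row in train_rows]
feature, threshold = fit_stump(train_rows, primary_labels)


def canary(row: Row) -> int:
    """Explainable approximation of the primary model."""
    return int(row[feature] > threshold)


def explain(row: Row) -> str:
    """Human-readable reason for the canary's prediction."""
    relation = ">" if row[feature] > threshold else "<="
    return (
        f"predicted {canary(row)} because "
        f"feature[{feature}]={row[feature]} {relation} {threshold}"
    )


# Fidelity: fraction of inputs on which canary and primary agree.
fidelity = sum(canary(r) == primary(r) for r in train_rows) / len(train_rows)
```

An explanation from `explain(row)` is only meaningful for rows where `canary(row) == primary(row)`; the remaining rows are the non-explainable deviations mentioned above.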

This technique has the advantage of simplicity and generality; it can be applied to explain many types of ML and DL models. However, its effectiveness depends entirely on whether the two models match. Using such techniques in production (where they are most needed) brings additional challenges, since two models that matched during training can easily deviate in production inference if data patterns change. Table 1 shows such an example. The primary algorithm in deployment is a multi-layer perceptron (MLP). A decision tree, which is explainable and mimics the predictions of the MLP, is used as the canary model. Both are trained on an open source TELCO dataset available at [9]. As the table shows, when the live production data matches training patterns, both models infer similarly and the canary can be used to explain the MLP. However, if production data deviates from training patterns, the canary model’s behavior deviates from that of the MLP and it can no longer be used for explanations.

Table 1: Illustration of how prediction patterns can change when inference input data patterns deviate from training data patterns.
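The effect can be reproduced in miniature with two toy models. The sketch below does not use the MLP, decision tree, or TELCO dataset from the paper; `primary` and `canary` are hypothetical rules chosen so that they agree on most of a training-like input grid but disagree in a narrow region. When the input distribution shifts toward that region, the agreement rate drops sharply.

```python
from typing import Iterable, Tuple

Row = Tuple[int, int]


def primary(row: Row) -> int:
    """Hypothetical opaque model: depends on both features."""
    return int(10 * row[0] + row[1] > 65)


def canary(row: Row) -> int:
    """Hypothetical explainable model: a single threshold on feature 0."""
    return int(row[0] > 6)


def agreement(rows: Iterable[Row]) -> float:
    """Fraction of rows on which primary and canary give the same output."""
    rows = list(rows)
    return sum(primary(r) == canary(r) for r in rows) / len(rows)


# Inputs that look like the training data: the full 11 x 11 grid.
training_like = [(x, y) for x in range(11) for y in range(11)]

# Shifted inputs: concentrated near the decision boundary with high feature 1.
shifted = [(x, y) for x in (5, 6, 7) for y in range(6, 11)]

training_agreement = agreement(training_like)  # 116 of 121 rows agree
shifted_agreement = agreement(shifted)         # 10 of 15 rows agree
```

Neither model changed between the two measurements; only the input distribution did, which is why agreement has to be monitored continuously in production rather than checked once at training time.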

A Production Design Pattern for Canary Explainability

With this in mind, a production design pattern for canary-based explainability needs to include several elements:
Two inference pipelines, one with the primary model and the second with a canary model. The same traffic is sent to both primary and canary inference pipelines. The models used in these pipelines may also need to be periodically retrained.
A pipeline to compare the results of primary and canary, and mechanisms to trigger when canary outputs and primary outputs mismatch so that potential issues in canary-based explainability can be detected.
A reproducibility mechanism that can replay prediction patterns and outputs so that alternative means can be used to track what happened in cases where the canary cannot be used to explain the primary.
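The three elements above can be sketched as a single in-process wrapper. This is a simplified illustration, not how MCenter implements it: `CanaryDeployment`, `PredictionRecord`, and the callback are hypothetical names, the two "pipelines" are plain functions, and retraining is omitted. It shows the same request being sent to both models, a comparison that triggers on mismatch, and a log that retains what is needed for later replay.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Features = Dict[str, float]
Model = Callable[[Features], int]


@dataclass
class PredictionRecord:
    """Everything needed to compare and later replay one prediction."""
    request_id: int
    features: Features
    primary: int
    canary: int

    @property
    def explainable(self) -> bool:
        # The canary can explain the prediction only when both models agree.
        return self.primary == self.canary


@dataclass
class CanaryDeployment:
    primary: Model
    canary: Model
    on_mismatch: Callable[[PredictionRecord], None]
    log: List[PredictionRecord] = field(default_factory=list)

    def predict(self, features: Features) -> int:
        # Element 1: the same traffic goes to both inference paths.
        record = PredictionRecord(
            request_id=len(self.log),
            features=dict(features),
            primary=self.primary(features),
            canary=self.canary(features),
        )
        # Element 3: keep inputs and outputs for reproducibility.
        self.log.append(record)
        # Element 2: compare outputs and trigger on mismatch.
        if not record.explainable:
            self.on_mismatch(record)
        # The primary model's output is what gets served.
        return record.primary


alerts: List[PredictionRecord] = []
deployment = CanaryDeployment(
    primary=lambda f: int(10 * f["x"] + f["y"] > 65),  # hypothetical opaque model
    canary=lambda f: int(f["x"] > 6),                  # hypothetical explainable model
    on_mismatch=alerts.append,
)
served = [deployment.predict({"x": x, "y": 8}) for x in (2, 6, 9)]
```

In this run the second request is served by the primary but flagged, because the canary disagrees and so cannot be used to explain it.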

In MCenter, this is done via the following steps:

Step 1: An ML Application is created that includes the primary inference pipeline, the canary inference pipeline, and re-training pipelines for each model (see Figure 1). Via this application, the user can specify how the pipelines should run (Batch, REST, etc.), how frequently to retrain, how to decide if new models are pushed to production inference for either primary or canary, and also how to compare the outputs and generate an alert if the primary and the canary mismatch.

Figure 1: ML Application for Canary-based Explainable Models in Production

Step 2: Once this ML Application runs in MCenter, statistics are collected on each pipeline and alerts are generated if the primary and canary display significant deviation. Figure 2 shows an example of this.

Figure 2: Canary Dashboard comparing the results of Primary and Canary Models
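One simple way to decide what counts as "significant deviation" is a mismatch rate over a sliding window of recent predictions. The sketch below assumes that approach; the class name, window size, and threshold are hypothetical choices for illustration and are not taken from MCenter.

```python
from collections import deque
from typing import Deque, List


class MismatchMonitor:
    """Alerts when the recent primary/canary mismatch rate exceeds a threshold."""

    def __init__(self, window: int, threshold: float) -> None:
        self.recent: Deque[bool] = deque(maxlen=window)  # oldest entries fall off
        self.threshold = threshold

    def rate(self) -> float:
        """Mismatch rate over the predictions currently in the window."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def observe(self, primary: int, canary: int) -> bool:
        """Record one prediction pair; return True if an alert should fire."""
        self.recent.append(primary != canary)
        # Wait for a full window so that a single early mismatch cannot alert.
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and self.rate() > self.threshold


monitor = MismatchMonitor(window=5, threshold=0.3)
primary_stream = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
canary_stream = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]  # disagrees at positions 5 and 6

alert_positions: List[int] = [
    i
    for i, (p, c) in enumerate(zip(primary_stream, canary_stream))
    if monitor.observe(p, c)
]
```

A single mismatch in a window of five stays below the threshold; the alert fires once the second mismatch arrives and persists until both have aged out of the window.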

Step 3: Finally, the MCenter system tracks and maintains a full lineage and temporal sequence of all datasets, computations, events, and results occurring in all pipelines. If there is an explainability event, i.e. the primary and canary models deviate in prediction patterns, the entire sequence can be reproduced and replayed “as-live” for examination. This level of reproducibility ensures that both the actual computation and any non-deterministic runtime factors are fully captured.

Figure 3 gives an example of how MCenter captures and correlates information sequences for complete reproducibility. As each pipeline executes, MCenter captures all configuration information, statistics, error events, generated predictions, models, etc. The relationships between pipelines (as specified in the ML application) are then used to combine all the information from each pipeline into a single coherent timeline, as shown in Figure 3.

Figure 3: Temporal sequencing and correlation between pipelines in the ML Application.
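The idea of combining per-pipeline information into one timeline can be sketched as a merge of time-ordered event streams. The `Event` fields and the sample events below are hypothetical and much simpler than what a real lineage system records; the sketch assumes each pipeline emits its events already sorted by timestamp.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    timestamp: float
    pipeline: str
    kind: str      # e.g. "model", "config", "prediction", "alert"
    detail: str


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge per-pipeline event streams (each sorted by time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda event: event.timestamp))


def history_before(timeline: List[Event], alert: Event) -> List[Event]:
    """Everything that happened up to and including the given alert."""
    return [event for event in timeline if event.timestamp <= alert.timestamp]


training = [Event(1.0, "training", "model", "model v2 produced")]
primary = [
    Event(2.0, "primary", "model", "loaded model v2"),
    Event(4.0, "primary", "prediction", "request 17 -> 1"),
]
canary = [
    Event(3.0, "canary", "config", "batch size changed"),
    Event(4.5, "canary", "prediction", "request 17 -> 0"),
]
comparator = [Event(5.0, "comparator", "alert", "mismatch on request 17")]

timeline = build_timeline(training, primary, canary, comparator)
alert = timeline[-1]
context = history_before(timeline, alert)
```

With the events interleaved this way, an alert can be traced back through the predictions, configuration changes, and model updates that preceded it across all pipelines.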

If an explainability alert occurs, and the two inference pipelines (primary and canary) are no longer providing similar predictions, the temporal sequence can be used to determine exactly how the primary pipeline generated any particular prediction and, if necessary, to replay that prediction. The replay can also be seen live, as shown in Figure 4. Using built-in replay dashboards, data scientists can examine in depth any execution time frame, including areas where explainability alerts have occurred.

Figure 4: A visual replay of the time sequence leading up to and including an alert event.
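Replaying a prediction requires, at minimum, the recorded input, the identity of the model version that served it, and the recorded output to check against. The sketch below assumes exactly that and nothing more; the record layout, the version names, and the in-memory `registry` are hypothetical, and a real system would also need to capture configuration and runtime factors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Tuple[int, ...]
Model = Callable[[Features], int]


@dataclass(frozen=True)
class LoggedPrediction:
    timestamp: float
    model_version: str
    features: Features
    output: int


def replay(
    log: List[LoggedPrediction],
    registry: Dict[str, Model],
    start: float,
    end: float,
) -> List[Tuple[LoggedPrediction, bool]]:
    """Re-run logged predictions in [start, end] and report whether each reproduces."""
    results = []
    for entry in log:
        if start <= entry.timestamp <= end:
            model = registry[entry.model_version]  # the version that served it
            reproduced = model(entry.features) == entry.output
            results.append((entry, reproduced))
    return results


# Hypothetical model versions, kept so that old predictions can be re-run.
registry: Dict[str, Model] = {
    "mlp-v1": lambda f: int(10 * f[0] + f[1] > 65),
    "mlp-v2": lambda f: int(10 * f[0] + f[1] > 55),
}

log = [
    LoggedPrediction(10.0, "mlp-v1", (6, 2), 0),
    LoggedPrediction(11.0, "mlp-v2", (6, 2), 1),
    # Deliberately inconsistent entry, to show a failed reproduction.
    LoggedPrediction(12.0, "mlp-v2", (5, 4), 1),
]

results = replay(log, registry, start=10.5, end=12.0)
reproduced_flags = [ok for _, ok in results]
```

A prediction that fails to reproduce, like the last entry here, indicates that something outside the recorded inputs influenced the original output, which is exactly what full lineage tracking is meant to surface.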


Explainability is critical for practical ML use as more and more industries rely on ML to make critical decisions and everyone from consumers to regulators tries to determine how each algorithm reaches its determination. Even straightforward explainability approaches (like canary models) can be challenging to put into production because of the additional complexity and variability of a production environment. We have described a design pattern for deploying canary explainability in production. It can be used with a wide range of ML and DL models, and it combines the canary-based explanation, reproducibility, and alerting needed to deliver explainability in production.


[1] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[2] G. F. Cooper, C. F. Aliferis, R. Ambrosino, J. M. Aronis, B. G. Buchanan, R. Caruana, M. J. Fine, C. Glymour, G. J. Gordon, B. H. Hanusa, J. E. Janosky, C. Meek, T. M. Mitchell, T. S. Richardson, and P. Spirtes, “An evaluation of machine-learning methods for predicting pneumonia mortality,” Artificial Intelligence in Medicine, vol. 9, no. 2, pp. 107–138, 1997.

[3] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. [Online]. Available:

[4] S. Buranyi. Rise of the racist robots – how AI is learning all our worst impulses. [Online]. Available:

[5] B. Goodman and S. Flaxman, “European Union regulations on algorithmic decision-making and a ‘right to explanation’,” 2016. [Online]. Available: arXiv:1606.08813

[6] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” CoRR, vol. abs/1602.04938, 2016. [Online]. Available:

[7] P. Hall, N. Gill, M. Kurka, and W. Phan, “Machine Learning Interpretability with H2O Driverless AI,” 2018. [Online]. Available:

[8] S. Ghanta, S. Subramaniam, S. Sundararaman, L. Khermosh, V. Sridhar, D. Arteaga, Q. Luo, D. Das, N. Talagala. Interpretability and Reproducibility in Production Machine Learning Applications. Presented at IEEE ICMLA 2018.

[9] R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, and R. Stadler, “Linux kernel statistics from a video server and service metrics from a video client,” 2014. [Online]. Available:
