The above is not a precisely measured statistic, but having engaged with hundreds of data scientists over the past few years, one thing is clear: while generating models has become easier, moving them into production remains challenging. What is making Machine Learning more accessible on the one hand, yet difficult to deploy broadly on the other?
Machine learning (ML) technologies have been around for many decades, with intermittent spikes of activity and interest. In the last few years, however, ML and Deep Learning (DL) technologies have proven their value and criticality to real-world use cases across many domains. This shift is driven by several factors:
- The Data: Devices ranging from sensors to robots are generating ever-larger volumes of ever-richer data (from simple value time series to images, sound, and video). While the data itself is valuable, its ultimate benefit to a business's bottom line comes from the analytics that extract the insights hidden within. Simple datasets (such as streams of individual values) can be analyzed via database queries or complex event processing techniques, but the increasing richness of data (multiple correlated mixed-type streams, images, sound, video) requires more sophisticated ML and DL approaches. The increased volumes of data also enable ML/DL algorithms to perform at their best.
- The Compute: The ubiquity of high-performance commodity computing, driven by both massive core-count increases in individual CPUs and low-cost cloud computing services, has made it possible to match data growth with similarly scalable ML and DL capabilities. Hardware innovations such as GPUs, custom FPGAs, and instruction-set support in modern CPUs have further improved ML algorithm performance, making it practical to train on massive datasets.
- The Algorithms: Open source implementations of ML and DL algorithms are now available through analytic engines and libraries such as Spark, TensorFlow, Caffe, NumPy, and scikit-learn, to name just a few. With these packages, a massive range of algorithmic techniques is available in the Data Scientist sandbox. Thanks to open source, even state-of-the-art research algorithms are frequently publicly available to test, tune, and use nearly as soon as they are invented.
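To make the last point concrete, here is roughly how little code it takes to build a working model in the sandbox today. This is an illustrative sketch using scikit-learn (one of the libraries named above) on its bundled iris dataset; the specific model and split are arbitrary choices, not a recommendation:

```python
# Train and evaluate a classifier in a few lines, using scikit-learn.
# Dataset, model choice, and split are illustrative only.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on held-out data -- the "exciting sandbox result"
print(round(model.score(X_test, y_test), 2))
```

Ten lines from raw data to a trained, evaluated model: this is exactly the accessibility the trends above have created, and it is a big part of why model creation now outpaces model deployment.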
The above trends addressed the first issues impeding real-world ML: the data, the compute, and quality algorithmic implementations. The next problem was finding a data scientist to match the specific business problem and dataset to a suitable algorithm. A lot has been written about the shortage of data scientists. This issue, while real, has been actively addressed in the last several years with online Data Science courses, specialty university programs in Data Science, and tools that simplify model creation (the democratization of data science). The latest approach to mitigating this problem, AutoML, promises to automate the process of model creation and selection, further multiplying the productivity of a single data scientist or business analyst.
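The core idea behind AutoML can be sketched in miniature: fit several candidate models and automatically keep whichever scores best on held-out data. Real AutoML systems search far larger model and hyperparameter spaces; the two toy candidates below (a constant predictor and a closed-form linear fit) and all function names are illustrative assumptions, using only the Python standard library:

```python
# Minimal sketch of automated model selection ("AutoML" in miniature):
# fit each candidate on training data, pick the lowest validation error.
import statistics

def fit_constant(xs, ys):
    # Baseline: always predict the mean of the training targets.
    c = statistics.mean(ys)
    return lambda x: c

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b, in closed form.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    return statistics.mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def auto_select(candidates, train, valid):
    # "AutoML" loop: train every candidate, keep the best on validation data.
    fitted = [(name, fit(*train)) for name, fit in candidates]
    return min(fitted, key=lambda nf: mse(nf[1], *valid))

train = ([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])   # roughly y = 2x
valid = ([5, 6], [10.2, 11.9])

name, best = auto_select(
    [("constant", fit_constant), ("linear", fit_linear)], train, valid)
print(name)  # the linear candidate wins on validation error
```

The selection loop is the whole trick: no human chose the winning model, which is precisely what makes it easier to flood the sandbox with candidate models awaiting production.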
Getting an ML/DL algorithmic model to deliver exciting results in the Data Scientist sandbox is only part of the puzzle. To deliver business value, the model has to be deployed in production and its outputs (recommendations, classifications, etc.) acted upon. Deploying, managing, and optimizing ML/DL in production introduces additional challenges not addressed above:
- Real-World Dynamism: Depending on the use case, incoming data feeds can change dramatically, possibly beyond anything evaluated in the Data Scientist sandbox.
- Expertise Mismatch: On one side, IT operations administrators are experts in deployment and management of software and services in production. On the other side, data scientists are experts in the algorithms and associated mathematics. Operating ML/DL in production requires the combined skills of both groups.
- Non-Intuitive Complexity: In contrast to other analytics such as rule-based, relational database, or pattern-matching key-value systems, the core of an ML/DL algorithm is a mathematical function whose data-dependent behavior is not intuitive to most humans.
- Reproducibility and Diagnostics Challenges: Since ML/DL algorithms can be probabilistic in nature, there is no consistently “correct” result. For example, even for the same data input, many different outputs are possible depending on what recent training occurred and other factors.
- Inherent Heterogeneity: Many classes of algorithms exist (classical Machine Learning, Deep Learning, and Reinforcement Learning, for example), and specialized analytic engines have emerged focusing on each. Practical ML solutions frequently combine different algorithmic techniques, requiring the production deployment to leverage multiple engines. This is uncommon in other application spaces; in databases, for example, standardizing on a single type of DB for a workflow can be a useful production norm.
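The first challenge above, real-world dynamism, is one place where a concrete sketch helps. A common mitigation is to monitor live data against the statistics of the training baseline and raise an alert when they diverge. The z-score test, window sizes, and threshold below are illustrative assumptions, not a prescription; production drift detection is typically more elaborate:

```python
# Hedged sketch of drift monitoring: flag when a live window's mean drifts
# far (in baseline standard deviations) from the training baseline mean.
import statistics

def drift_alert(baseline, live, z_threshold=3.0):
    """Return True when live data looks unlike the sandbox baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    live_mu = statistics.mean(live)
    z = abs(live_mu - mu) / sigma
    return z > z_threshold

baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]   # what the model was trained on

ok = drift_alert(baseline, [10.1, 9.9, 10.0])    # looks like training data
bad = drift_alert(baseline, [14.8, 15.2, 15.0])  # far outside the sandbox distribution
print(ok, bad)
```

Even this toy monitor illustrates why production ML needs both groups from the "Expertise Mismatch" bullet: a data scientist to choose the statistic and threshold, and an operations team to wire the alert into production tooling.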
The term “Cambrian explosion” has already been used in several contexts to describe the growth of AI (examples here and here). Within this trend, what we are seeing now is an explosion of models in the data scientist sandbox; models that cannot generate business returns until they deliver on their promise in production. As the number of data scientists increases, as democratization and AutoML tools improve data science productivity, and as growing compute power makes it easier to test new algorithms in the sandbox, more and more models will be developed, each one awaiting the move into production.
In the next posts in this blog series, we will discuss the challenges of production ML/DL in more depth, along with possible solutions.