Almost every industry has begun adopting machine learning (ML), and some have even become reliant on it for their operations. Deploying ML models and applications in production settings, where real-world decisions are made by AI, requires robust development, testing, and staging cycles. In this post, we introduce MLOps and how it eases ML adoption in production settings, and describe how ParallelM’s MLOps software allows an organization to construct production ML workflows that manage these cycles – from the first training run on structured data all the way to evaluating a deployed model’s real-world behavior.
Typical ML Deployment Lifecycle
- In data-centric organizations (which today means almost every organization), most analytics or ML is performed only once data has been organized, structured, and made accessible to data scientists, engineers, and modelers. The first step toward constructing an ML model that drives a particular business impact is the data preparation stage. This is mostly a large batch operation that involves multiple tools to create schemas and structure around the data. Feature selection and engineering are also performed here, in preparation for the training stage.
- Once the data is ready, Data Scientists typically develop ML applications that read this data and create a ‘Model’ specific to a business need (for example, a recommendation system needs a recommendation model). This stage normally involves multiple iterations over algorithm choices and tweaks, datasets, feature engineering, and evaluation. The outcome is a model, or set of models, that a data scientist is reasonably confident in, given the limited evaluation performed.
- The next stage is a set of testing processes that take the model through various combinations of datasets and configuration parameters. To make the ML application production-ready, this stage also involves a test strategy covering scale and robustness, typically designed with one or more business needs in mind. If a model does not meet certain criteria, the training phase must be revisited. Once a model (and its associated parameters) has cleared the testing process, the next step is to deploy it in a limited setting: the model, along with the business application it accompanies, is rolled out to a subset of users. Unlike a full production deployment, the liabilities and business impacts of the rolled-out model are minimal. This stage also involves evaluation methods that decide the model’s production readiness; if these criteria aren’t met, the training stage is repeated.
- The final stage is deploying the model in production. Once it is out in the field, businesses usually deploy analytics or reporting applications that track its impact on the end goals. Continuing the recommendation example, a useful metric would be the change in click-through rate with the new recommendation system. The feedback gathered can drive the training phase again.
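The gated train/test/deploy loop described above can be sketched in a few lines of Python. This is a minimal illustration only, not ParallelM’s API: the trivial mean predictor, the `error_threshold` gate, and all function names are hypothetical stand-ins for real pipeline stages.

```python
def prepare_data(raw):
    """Data preparation stage: structure raw records into features and labels."""
    return [{"feature": r["x"], "label": r["y"]} for r in raw]

def train(dataset):
    """Training stage: a trivial mean predictor stands in for a real model."""
    mean = sum(row["label"] for row in dataset) / len(dataset)
    return {"predict": lambda _x, m=mean: m}

def evaluate(model, dataset):
    """Testing stage: mean absolute error (for brevity, no held-out split)."""
    errors = [abs(model["predict"](row["feature"]) - row["label"])
              for row in dataset]
    return sum(errors) / len(errors)

def lifecycle(raw, error_threshold, max_iterations=3):
    """Gated loop: retrain until the model clears the error threshold,
    mirroring the 'revisit training if criteria are not met' cycle."""
    dataset = prepare_data(raw)
    for i in range(max_iterations):
        model = train(dataset)
        if evaluate(model, dataset) <= error_threshold:
            return {"status": "deployed", "iterations": i + 1}
    return {"status": "rejected", "iterations": max_iterations}
```

A real pipeline would replace each stage with an independently running application; the point of the sketch is the gating structure, where a failed evaluation sends control back to training.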
Challenges Faced by Organizations
Organizing such an end-to-end lifecycle is challenging for multiple reasons, ranging from logistical to technical.
- Diverse Skill Sets: Given the wide range of processes involved, the entire lifecycle can span multiple units within a company. Most modern enterprises have a dedicated R&D group, QA, Business Org, etc., and each unit consists of personnel with varied skill sets and modes of operation.
- Parallelism: Applying ML to a complex business problem involves a great deal of experimentation. In many cases, multiple Data Scientists work in parallel on many algorithms, trying to find the best fit for the data and the end goal. In such situations, the ability to launch parallel workflows and track each independently goes a long way toward accelerating model deployment.
- Resource Utilization: In terms of resource allocation and planning, it helps to anticipate the resource consumption using historical data. Unfortunately, fine-grained tracking of resources across these multiple stages is extremely hard, owing to the disconnected nature of these stages.
- Closing the Loop: The primary purpose of most ML initiatives is to convert insight into business value (whether through recommendations, optimizations, fault detection, etc.). These final business outcomes need to be tied back to decisions made in the development phases, such as data processing and training, in order to iteratively improve future models.
How MLOps Manages Machine Learning Lifecycles
As noted earlier, ParallelM has introduced an MLOps software package that automates production machine learning. Using a software management platform, ParallelM manages multiple machine learning applications running in parallel on standard ML engines and platforms.
A key aspect of the MLOps software is its ability to link all stages of an ML application (feature selection, training, production sandbox, and deployment) within a single platform, and to provide collaborative features to people playing different roles in the ML lifecycle. ParallelM software can be used to define a pattern (or overlay) on top of independent, commodity systems that are linked together by the role they play in the ML ecosystem.
This ML-focused, end-to-end linkage enables a user playing the role of Data Scientist, Operations, Business Analyst, or a combination of these to define workflows and ML cycles as described in Figure 1. For example, a Feature Selection → Training → Production Sandbox → Training cycle can be defined on the MLOps platform as a single directed, cyclic flow in which each participating application runs independently. Unlike traditional pipeline schedulers, the MLOps system connects parallel executing entities through their ML linkage and enables ML objects (models, predictions) to flow end-to-end. This flow can then be managed and controlled entirely within the MLOps system. Similarly, other cycles can be defined that form a subset, or the entirety, of the lifecycle described in Figure 1.
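As a rough illustration of such a directed, cyclic flow, the sketch below wires independent stage functions into a Feature Selection → Training → Production Sandbox cycle, with the sandbox looping back to training until the model clears a readiness gate. The `Flow` class, the stage functions, and the quality scores are all hypothetical; this is not ParallelM’s actual platform or API.

```python
class Flow:
    """A toy directed, cyclic flow: nodes are stages, edges carry ML objects."""
    def __init__(self):
        self.nodes = {}
        self.edges = {}  # stage name -> next stage name

    def add(self, name, fn):
        self.nodes[name] = fn
        return self

    def connect(self, src, dst):
        self.edges[src] = dst
        return self

    def run(self, start, payload, max_steps=10):
        """Walk the flow, passing the payload (model state) along the edges
        until a stage signals completion or the step budget is exhausted."""
        trace, node = [], start
        for _ in range(max_steps):
            payload, done = self.nodes[node](payload)
            trace.append(node)
            if done or node not in self.edges:
                break
            node = self.edges[node]
        return trace, payload

def feature_selection(data):
    return ({"features": data, "quality": 0.0}, False)

def training(state):
    state["quality"] += 0.4  # pretend each retraining iteration improves the model
    return (state, False)

def sandbox(state):
    return (state, state["quality"] >= 0.8)  # readiness gate: deploy or retrain

flow = (Flow()
        .add("feature_selection", feature_selection)
        .add("training", training)
        .add("sandbox", sandbox)
        .connect("feature_selection", "training")
        .connect("training", "sandbox")
        .connect("sandbox", "training"))  # the cycle back to training

trace, final = flow.run("feature_selection", ["clicks", "views"])
```

Here the flow visits training twice before the sandbox gate passes, which is the shape of the cycle described above; in a real MLOps deployment each node would be a separately running application rather than a local function.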