November 16, 2017
Why your ML should be deployed as a Micro-service

Machine Learning (ML) and Deep Learning (DL) are popping up everywhere. Their common use cases (such as making recommendations, classifying users, and detecting violations and anomalies) can be applied in virtually every commercial industry. But ML and DL are rarely the whole solution; they usually support it by enabling informed choices based on insight and learning. This leads to the question: how should ML and DL be integrated into applications? In this blog, we describe why the popular micro-services architecture model works well for this problem.


ML Basics

Let’s consider a simple application flow that can use ML. A user visits a web page. Based on user information and context, the web page recommends additional links to visit.

To add ML into this application, you need to deploy two primary functions:

  1. Training (also known as model build or model develop). The training pipeline analyzes the data using your chosen ML algorithm and produces a model that maps the available user information to the desired recommendation.
  2. Inference (also known as model serving, prediction, or sometimes scoring). This is where input data from your application is run through the model, and the ML prediction (or ML answer) is fed back into your application.

Variants of this structure (such as online ML, ensemble models, etc.) exist, but the most basic structure above is the one you need to integrate to support most ML functions in your application.
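To make the two functions concrete, here is a minimal sketch in plain Python. It is an illustration only: the "algorithm" is a simple frequency count, and the names `train`, `infer`, `visit_log`, and the user segments are assumptions made up for this example, not part of any particular ML library.

```python
from collections import Counter, defaultdict


def train(visit_log, top_k=2):
    """Training: turn historical (segment, link) visits into a model.

    The model here is just a dict mapping each user segment to its
    top_k most-visited links (a stand-in for a real ML algorithm).
    """
    counts = defaultdict(Counter)
    for segment, link in visit_log:
        counts[segment][link] += 1
    return {
        segment: [link for link, _ in links.most_common(top_k)]
        for segment, links in counts.items()
    }


def infer(model, segment):
    """Inference: run one piece of application input through the model."""
    return list(model.get(segment, []))


# Hypothetical historical data: which links each kind of user visited.
visit_log = [
    ("student", "/tutorials"), ("student", "/tutorials"), ("student", "/tutorials"),
    ("student", "/pricing"), ("student", "/pricing"), ("student", "/docs"),
    ("engineer", "/docs"), ("engineer", "/docs"), ("engineer", "/api"),
]

model = train(visit_log)                   # done offline, possibly on a schedule
recommendations = infer(model, "student")  # done per page view
```

The point of the split is that `train` sees the historical data and `infer` sees only the model plus the current request, so the two can run at different times and in different places.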


Integrating ML into your application

Let’s now look at what it would take to integrate ML into our example application.

  • The primary ML module to integrate is the inference function. The input is the context-specific parameters (user info, possibly other context such as time) and the output is the recommendation.
  • However, the output of the inference function is highly dependent on the training and the model that the training has provided to the inference.
    • To keep up with changing trends, the model may need to be retrained regularly, perhaps frequently.
    • The outputs of the training (i.e., new models) need to be provided to the inference module.
  • Additional complexities can exist. For example:
    • The training program may be compute intensive and may need to run on an analytic engine (for example, Spark).
    • The training program may be shared across multiple instances of the inference program. For example, if the application is running as multiple independent instances, there is no need to train each instance’s model separately.
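The hand-off between training and inference described above could look roughly like the following sketch. Everything here is hypothetical (the `InferenceService` class, the `publish` helper, and the version labels are invented for illustration); it only shows one trained model being pushed to several independent inference instances, each of which swaps it in without restarting.

```python
import threading


class InferenceService:
    """One inference instance that can accept a new model while serving."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def update_model(self, model, version):
        # Called whenever training produces a new model.
        with self._lock:
            self._model = model
            self._version = version

    def recommend(self, segment):
        # Take a consistent snapshot of the current model and its version.
        with self._lock:
            model, version = self._model, self._version
        return {"links": list(model.get(segment, [])), "model_version": version}


def publish(model, version, instances):
    """Share a single training result with every inference instance."""
    for instance in instances:
        instance.update_model(model, version)


# Two independent application instances start with the same initial model.
initial_model = {"student": ["/tutorials"]}
instances = [InferenceService(initial_model, "v1") for _ in range(2)]
before = instances[0].recommend("student")

# A retraining run produces a new model once; both instances receive it.
retrained_model = {"student": ["/pricing", "/tutorials"]}
publish(retrained_model, "v2", instances)
after = [instance.recommend("student") for instance in instances]
```

Returning the model version alongside the prediction is a design choice in this sketch; it makes it easier to tell which training run produced a given recommendation.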


What is a Micro-service architecture?

Many online descriptions are available for micro-services (see examples here [1] and here [2]). Micro-services is a design and deployment architecture pattern in which an application is composed of a set of functional services rather than being implemented as a monolithic entity. Micro-services benefit complex applications by enabling each independent function to be designed, implemented, scaled, and upgraded as its own service.


Benefits of a Micro-service approach for ML

Making the ML operation a micro-service for the primary application has many benefits:

  • The training and inference execution can be managed independently of the calling application.
  • ML algorithms and strategies can be changed without impacting the primary application. For example, the primary model can be replaced with an alternate model or an ensemble without having to release an update to the primary application.
  • Resource management becomes simpler. Depending on the ML algorithms used, significant CPU resources or special hardware such as GPUs may be needed for efficient training and/or inference. These can be deployed without changing the primary application’s resource configuration, and can be shared across multiple calling applications.
  • Models can be shared (and model trainings leveraged) across multiple applications.
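As a rough illustration of the second benefit, the sketch below keeps the application code fixed while the recommender behind it changes from a single model to an ensemble. It runs in one process for simplicity, so the function call stands in for the service boundary; the names `render_sidebar`, `segment_links`, `popular_links`, and `ensemble` are all assumptions invented for this example.

```python
from typing import Callable, Dict, List

# The only contract the application relies on: user context in, links out.
Recommender = Callable[[Dict[str, str]], List[str]]


def segment_links(user: Dict[str, str]) -> List[str]:
    """Primary model (stand-in): recommend by user segment."""
    if user.get("segment") == "student":
        return ["/tutorials", "/docs"]
    return ["/api"]


def popular_links(user: Dict[str, str]) -> List[str]:
    """Alternate model (stand-in): recommend globally popular links."""
    return ["/docs", "/pricing"]


def ensemble(*members: Recommender) -> Recommender:
    """Combine several recommenders, keeping order and dropping duplicates."""
    def combined(user: Dict[str, str]) -> List[str]:
        merged: List[str] = []
        for member in members:
            for link in member(user):
                if link not in merged:
                    merged.append(link)
        return merged
    return combined


def render_sidebar(user: Dict[str, str], recommend: Recommender) -> str:
    """Primary application code: unchanged no matter which model is used."""
    return "Recommended: " + ", ".join(recommend(user))


user = {"segment": "student"}
with_primary = render_sidebar(user, segment_links)
with_ensemble = render_sidebar(user, ensemble(segment_links, popular_links))
```

In a real micro-service deployment the swap would happen behind a network endpoint rather than a function argument, but the principle is the same: the application depends only on the request and response shape.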


Figure 1 shows how ParallelM’s MLOps Center supports applications that integrate production ML as a micro-service.  With ParallelM’s MLOps Center Solution, all the ML pipelines for such an application can be managed as a coherent entity, with the predictions available for subscription as a service:

  • The web service application receives predictions via REST calls to the inference/prediction pipeline. Alternatively, predictions can also be processed in batch if latency tolerances allow.
  • The prediction pipeline is fed via a periodic retraining pipeline which generates models based on past data. Models can be exchanged between training and inference pipelines using standard model formats such as PMML.
  • Training and inference pipelines can be scaled independently of each other and of the primary application. They can be run on standard analytic engines (Spark, Flink, TensorFlow, etc.).
  • All aspects of production ML management (see earlier blog post about specific ML production challenges [3]) can now be managed separately from the primary application, and commonalities between different ML production pipelines (such as shared models, shared algorithms, and shared GPUs) can be exploited without requiring changes to the primary applications.


Figure 1: Example deployment of ML micro-service with ParallelM MLOps Center
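For readers who want to see the REST interaction in miniature, the following sketch uses only the Python standard library. It is not the MLOps Center API: the `/predict` path, the JSON request and response shapes, and the in-memory `MODEL` are assumptions chosen for illustration. A tiny prediction service runs in a background thread, and the web application side calls it over HTTP.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# Stand-in for a model delivered by the training pipeline.
MODEL = {"student": ["/tutorials", "/pricing"], "engineer": ["/docs", "/api"]}


class PredictHandler(BaseHTTPRequestHandler):
    """Inference micro-service: POST /predict with {"segment": ...}."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        links = MODEL.get(payload.get("segment"), [])
        body = json.dumps({"links": links}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_recommendations(base_url, segment):
    """Web application side: ask the prediction service for links."""
    request = Request(
        base_url + "/predict",
        data=json.dumps({"segment": segment}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=5) as response:
        return json.loads(response.read())["links"]


# Port 0 asks the OS for any free port, so the sketch can run anywhere.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = "http://127.0.0.1:%d" % server.server_address[1]

links = fetch_recommendations(base_url, "student")

server.shutdown()
server.server_close()
```

Because the application only knows the URL and the JSON shape, the service behind it can be retrained, scaled, or moved to different hardware without a change to `fetch_recommendations`.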
