Last week, ParallelM participated in the Spark + AI Summit in San Francisco, where we presented ‘Operationalizing Edge Machine Learning with Apache Spark’. The talk centered on our MLOps platform, MCenter, and how it helps deploy and manage ML at scale.
In our talk, we showcased trends we are seeing among our customers that point to growing incidental complexity in incorporating AI into business applications, and a consequent need for MLOps (the operation and management of ML throughout its production lifecycle). We discussed issues primarily around managing models, ML training and ML prediction applications in production environments. Specifically, we observed three customer needs: monitoring how effectively their models score and how that affects business decisions; auditing and diagnosis, which depend on fine-grained governance metadata; and collaboration among the data scientists, IT operations staff and stakeholders responsible for ML-driven projects.
We demonstrated an Edge/Cloud scenario in which an ML model is trained on a Spark cluster hosted in the cloud, while several simulated edge devices periodically receive the newly trained model and use it to run inference on data local to each device.
The issues highlighted include:
- Representation and control of the ML application from cloud to edge nodes with a single dashboard
- Packaging and deploying custom ML applications for training or scoring at each of the nodes
- Monitoring the progress and quality of the ML training and scoring at each of the nodes
- Capturing a trace of the operations involved in training a model and running inference with it over a period of time, and diagnosing issues offline
With our ION technology, we built a graph-like structure that captures the network spanning the cloud-based Spark cluster and each edge device running a standalone Spark stack. Each node in this graph encapsulates the ML task, operational policies such as human approval, the execution schedule, and the resource requirements for that instance of the ML pipeline. On deploying this graph, which we refer to as “launching an ION”, the cloud-based Spark cluster begins training the model, and the ION’s mechanisms handle the model distribution, monitoring, governance, and timeline capture that we demonstrated in our talk.
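To make the shape of such a graph concrete, here is a minimal Python sketch of one way it could be represented. It is purely illustrative and is not MCenter's API: the class names `PipelineNode` and `PipelineGraph`, the field names, the cron-style schedule strings and the resource figures are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PipelineNode:
    """One node of the graph (hypothetical structure, not MCenter's API)."""
    name: str
    ml_task: str                           # e.g. "training" or "inference"
    schedule: str                          # assumed cron-style schedule string
    requires_human_approval: bool = False  # example of an operational policy
    resources: Dict[str, int] = field(default_factory=dict)


@dataclass
class PipelineGraph:
    """Directed graph: an edge means 'models produced here flow to there'."""
    nodes: Dict[str, PipelineNode] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_node(self, node: PipelineNode) -> None:
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, [])

    def connect(self, producer: str, consumer: str) -> None:
        # Both endpoints must already be registered as nodes.
        if producer not in self.nodes or consumer not in self.nodes:
            raise KeyError("both nodes must be added before connecting them")
        self.edges[producer].append(consumer)

    def model_recipients(self, producer: str) -> List[str]:
        """Names of the nodes that receive models trained at `producer`."""
        return list(self.edges.get(producer, []))


def build_demo_graph(num_edge_devices: int = 3) -> PipelineGraph:
    """One cloud training node feeding several simulated edge nodes."""
    graph = PipelineGraph()
    graph.add_node(PipelineNode(
        name="cloud-training",
        ml_task="training",
        schedule="0 * * * *",              # assumed: retrain hourly
        requires_human_approval=True,      # assumed: approve before rollout
        resources={"executors": 8},
    ))
    for i in range(num_edge_devices):
        edge_name = f"edge-{i}"
        graph.add_node(PipelineNode(
            name=edge_name,
            ml_task="inference",
            schedule="*/5 * * * *",        # assumed: score every 5 minutes
            resources={"executors": 1},
        ))
        graph.connect("cloud-training", edge_name)
    return graph


if __name__ == "__main__":
    demo = build_demo_graph()
    print(demo.model_recipients("cloud-training"))
```

In this sketch, deploying the graph would amount to walking the edges from the training node to find which devices should receive each newly trained model; the approval flag, schedule and resource fields stand in for the per-node policies described above.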