This post was co-written by Nisha Talagala of Parallel Machines and Ramnath Sagar of Mellanox Technologies.
Tesla’s semi-autonomous Autopilot system has drawn a lot of attention in the automotive industry. Tesla’s ability to push a smarter Autopilot service with each Over-The-Air (OTA) update helps it maintain a competitive edge in the autonomous vehicle era. But using AI, especially High Performance Deep Learning (DL), to gain a competitive edge is relevant not just for Tesla but for any enterprise looking to build intelligent software.
In today’s world, where rapid innovation is no longer optional but mandatory, DevOps becomes critical, bringing software developers and operations staff together to work closely. However, in a DL-powered software world, effectively deploying a DevOps-style application to production remains a major challenge. This is due to the complexities of configuration, the need for efficient hardware to scale training and inference performance, and the complexities of continuously managing and supporting deep learning in production.
Mellanox and ParallelM have teamed up to solve this challenge using MLOps (DevOps for Machine Learning) and have defined a reference architecture for a production-scale High Performance Deep Learning solution. We demonstrate how our technologies (Mellanox for high performance deep learning and ParallelM for production DL management), coupled with state-of-the-art technologies from the open source community, can enable AI-first enterprises to maintain their competitive edge.
For our reference design, we chose TensorFlow, one of the most popular ML/DL frameworks, but the solution can be easily extended to other frameworks such as SparkML, Caffe, and Torch.
This reference design accomplishes two key objectives:
- Fastest Time-to-Train with support for leading tools and frameworks out-of-the-box
- Fastest Time-to-Inference with the ability to rapidly train & retrain, and move to & manage/optimize in production while ensuring prediction quality in a dynamic environment
For more details, refer to our reference design: https://community.mellanox.com/docs/DOC-3001
Ramnath Sai Sagar is a Marketing Manager at Mellanox Technologies, heading market development for Big Data, Enterprise AI and Web2.0. He has an extensive background in both R&D and Marketing. Prior to joining Mellanox, he worked as a Performance & Solutions Architect at Emulex Corporation, and on some of the premier research projects at European labs including the Brain Mind Institute (BMI) at EPFL, Switzerland and the Barcelona Supercomputing Center (BSC), Spain. He has been published in a number of leading conferences and journals in scientific computing and holds a Bachelor of Science in Computer Engineering from Anna University, India.
Nisha Talagala is CTO and vice president of engineering at Parallel Machines, where she focuses on production machine learning and deep learning solutions from the edge to the cloud. Nisha has more than 15 years of expertise in software development, distributed systems, I/O solutions, persistent memory, and flash. Previously, Nisha was a fellow at SanDisk; a fellow and lead architect at Fusion-io, where she drove innovation in nonvolatile memory, including the industry’s first persistent memory solution; technology lead for server flash at Intel, where she led server platform nonvolatile memory technology development, storage-memory convergence, and technical partner engagements; and CTO of Gear6, where she designed and built clustered computing caches for high-performance I/O environments. Nisha holds 48 patents in distributed systems, networking, storage, performance, and nonvolatile memory. She has authored many technical and research publications and serves on multiple academic and industry conference program committees. Nisha holds a PhD from UC Berkeley, where her research focused on software clustering and distributed storage.