MLOps — Why You’d Want It In Your Company Today

How an additional role greatly improved our development process

Roy Peleg
Nanit Engineering

--


This story might sound familiar: a new startup is born, and a handful of ambitious people try to make an idea come to life. Since time to market is critical and resources are scarce, sub-optimal choices are sometimes made and technical debt piles up.

Nanit was no different in that sense, and since our product is based on Machine Learning (ML) algorithms, we faced challenges that traditional software development doesn't have to deal with. Instead of trying to solve them with external tools managed by our algorithm developers, we took a different approach and built our very own MLOps team.

What is MLOps?

In 2015, a paper by Google engineers pointed out the challenges and the technical debt incurred when running ML models in production. Around that time, DevOps was being adopted by many companies, and its impact and significance were gaining recognition.

The combination of DevOps best practices and the complexities of ML formed this new field. It has gained a lot of traction and recognition in the past few years, as more and more companies incorporate ML into their systems and face the challenges that come with it.

If you’re already working in a company that uses AI, you might know some of the pain points that are particular to this field. Let me tell you about one specific issue we had here at Nanit.

The required infrastructure for a real-world ML system. Image taken from Hidden Technical Debt in Machine Learning Systems

How we used to work — AKA why it hurt

As a small company with fewer than 30 developers, we had a big issue: tight coupling between the Algorithms and Backend teams. The former developed the ML models, while the latter helped deploy them to production. In our case, we had three different services running in Kubernetes, serving classification, detection, and segmentation models. Both teams wrote the code for these services, each contributing its own specialization.

The reason for that codebase mix was the inherent knowledge gap between the teams. The Algorithms team lacked the experience to set up a production service and serve model inferences in a stable, reliable way, while the Backend team had no experience with the algorithmic aspects of an ML model.

Although this setup gave us a significantly shorter development time at first, it raised several problems in the long run. The major one was the dependency of the Algorithms team on the Backend team — and vice versa — whenever new models or algorithmic code changes were deployed. Other issues were related to the visibility of our models: we didn't have appropriate monitoring of our models in production, and we couldn't tell how well they performed overall.

As time went by, the company grew to a few dozen developers, and as we scaled, we decided to form a new MLOps team to cover these gaps.

First Project

The first team member came on board, learned about all the different systems we had, and planned the first major project — rewriting the services that served our DL (Deep Learning) models in production. The idea was to slowly decouple the Algorithms and Backend teams, while adding the logging and monitoring that an ML model should have.

There were many areas in our ML pipeline that could benefit from development or refactoring, and all of them would have brought significant value. At first glance, it might seem that rewriting code doesn't help much, since it doesn't directly improve algorithm development. But choosing to focus on rewriting our production services helped us in a few important ways.

First, taking full responsibility for the services means we can develop them without depending on other people to integrate our changes and make sure we're not breaking anything. Since the services are the main friction point between our ML models and production, defining them before anything else helps standardize an API that we can build on in later stages, knowing with high certainty that it will stay as it is unless we decide to change it.
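To make the idea of a standardized API more concrete, here is a minimal sketch of what a request/response contract for a detection service could look like. The schema, field names, and the use of pydantic are illustrative assumptions, not our actual definitions.

```python
# A hypothetical inference contract for a detection service.
# Field names and types are illustrative, not Nanit's actual schema.
from typing import List
from pydantic import BaseModel


class Detection(BaseModel):
    label: str          # class name predicted by the model
    confidence: float   # model confidence in [0, 1]
    bbox: List[float]   # [x_min, y_min, x_max, y_max], normalized coordinates


class InferenceRequest(BaseModel):
    image_url: str      # where the service fetches the input image from
    model_version: str  # lets callers pin a specific model release


class InferenceResponse(BaseModel):
    model_version: str        # which model actually produced the result
    detections: List[Detection]
```

Once a contract like this is agreed on, both the serving code and its consumers can evolve independently as long as the schema stays stable.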

Second, it enabled us to design the new infrastructure while thinking ahead about what we wanted to accomplish. When Nanit first started using DL in production, it was quite experimental and not planned very far ahead. By this point, though, we were mature enough: we were serving a couple of DL models, we knew our pain points, and we could tackle them with a better design.

Another important point was that we got to know the current state of the code and how the different processes actually work. We could have settled for reading the code and hearing explanations while developing another part of the ML pipeline, but nothing beats knowing a system inside out after rebuilding it yourself and understanding the small nuances that aren't visible at first sight.

The impact

It took several iterations and almost a full year until the first service rolled out to production. That might seem like a long time, especially considering we were merely replacing an existing feature without bringing anything new to the customer. But working through all those roadblocks made the rest of the services easier and quicker to deploy: using the infrastructure we'd already built, we deployed the other two services within five months. We also had a meaningful impact in a few other areas.

Deployment time

One of the major pain points we had before was the long time it took to deploy a new model to production. Deploying a new model version could take a few weeks, since it was a cross-team task that required resource allocation from both sides, and if a completely new capability was planned, it could take much longer.

Now that we've built infrastructure that serves our models in an easy and accessible way, we can deploy a model within a few hours and build a completely new service in a few days: much quicker than before, and with a lot more control.

Model monitoring

We used to monitor our ML services in production for basic liveness and performance metrics — how many requests we receive, what the response time is, etc.
Once we deployed the new services, we had more flexibility to change and add metrics without depending on anyone else. We added advanced monitoring of the model results, for example the confidence level of the detected objects or the number of objects detected. This helps us see actual changes in production when we update our models, and it also points us to the sources of issues when they arise, so we know where to focus our efforts.
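As a rough illustration of this kind of model-level monitoring, the sketch below records two such metrics with the Prometheus Python client. The metric names, buckets, and helper function are hypothetical; the post doesn't describe our exact setup, so this only shows how confidence and object-count metrics could be tracked per model version.

```python
# Hypothetical model-result metrics exported via the Prometheus Python client.
from prometheus_client import Histogram

# Confidence of each detection the model emits, labeled by model version.
DETECTION_CONFIDENCE = Histogram(
    "model_detection_confidence",
    "Confidence score of detected objects",
    ["model_version"],
    buckets=[0.1, 0.25, 0.5, 0.75, 0.9, 0.99],
)

# Number of objects detected per request, labeled by model version.
OBJECTS_PER_REQUEST = Histogram(
    "model_objects_per_request",
    "Number of objects detected in a single request",
    ["model_version"],
    buckets=[0, 1, 2, 5, 10, 20],
)


def record_inference_metrics(model_version, detections):
    """Record per-result metrics after each inference call.

    `detections` is assumed to be a list of objects with a `confidence` attribute.
    """
    OBJECTS_PER_REQUEST.labels(model_version=model_version).observe(len(detections))
    for det in detections:
        DETECTION_CONFIDENCE.labels(model_version=model_version).observe(det.confidence)
```

Comparing these distributions before and after a model update is one way to spot real behavioral changes in production, rather than relying only on request counts and latency.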

Exposing the API

By leveraging the framework we chose to work with, our current services have interactive documentation that is accessible to all our developers, even outside the AI group.
That means anyone can easily see what each service is capable of and think about new features and ideas. It also helps with integrations: we simply direct other teams to the exposed API, which makes the process seamless.
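As an example of how a self-documenting service can be wired up (the post doesn't name our framework, so the use of FastAPI here is an assumption), the sketch below defines a typed endpoint, and the framework generates interactive OpenAPI docs for it automatically at /docs.

```python
# Minimal sketch of a typed inference endpoint with auto-generated docs.
# Endpoint path, schema, and field names are illustrative assumptions.
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Detection Service", version="1.0.0")


class InferenceRequest(BaseModel):
    image_url: str


class Detection(BaseModel):
    label: str
    confidence: float


class InferenceResponse(BaseModel):
    model_version: str
    detections: List[Detection]


@app.post("/detect", response_model=InferenceResponse, summary="Run object detection")
def detect(request: InferenceRequest) -> InferenceResponse:
    # Placeholder: a real service would run the model on the fetched image here.
    return InferenceResponse(model_version="stub", detections=[])
```

With typed request and response models like these, other teams can browse the generated documentation and try the endpoint without reading the service code.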

Present day

Nowadays, we have a couple of new services in production, enhancing our product's capabilities with even more complex algorithms.
We’ve also started using new frameworks and upgraded the existing frameworks of our models — all thanks to a solid design that took into account the different needs of an ML model.
However, we still face many challenges, since so far we’ve focused on the serving infrastructure and monitoring aspects of our ML models.

Our models’ quality relies heavily on the data we use to train them, and as a data company, we need to handle huge amounts of data.
That’s why one of our current big projects is redesigning the data pipeline of the AI group. We aim to streamline our data collection and annotation process, to have a better picture of the models’ performance in production.
That, coupled with an easier model-training process, can help us constantly improve our models' quality, provide better predictions over time, and know when performance degrades.

Another issue we have is legacy code, since core components of our product were written more than five years ago, and they’re getting hard to maintain.
That’s why another major project involves rewriting the core algorithms. Now that we have an MLOps team that can bridge the Algorithms and Backend teams and brings an understanding of both worlds, we can design a new architecture that will help us scale and develop new features more efficiently.

We still have much to work on, but we're seeing how every step we take has a significant impact: it shortens our development time, increases our confidence in the models we deploy, and leads to new ideas we want to explore, showing we made the right decision by introducing MLOps into our company.
