Behind the Build: Kubernetes

This is part of our ongoing Behind the Build series, where we share how we’re building and evolving our ML platform at zally. In the past, we’ve explored how we adopted tools like MLflow and lakeFS to bring structure and traceability to our workflows.

Today, we’re discussing Kubernetes and how it became the backbone of our infrastructure for training, serving, and scaling ML models.

For those unfamiliar, Kubernetes (often abbreviated as K8s) is an open-source system for automating the deployment, scaling, and management of containerised applications.

What We Had Before

Before Kubernetes, our infrastructure prioritised simplicity and speed. We deployed models as pickle files, stored them in S3, and loaded them at inference time using Lambda functions. Pre- and post-processing logic was also handled via Lambdas, stitched together into lightweight services.
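To make that concrete, here is a minimal sketch of the old pattern, with a hypothetical bucket, key, and payload shape (not our actual code): a Lambda handler that pulls a pickled model out of S3 and deserialises it at inference time.

```python
import json
import pickle

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key, purely illustrative of the old pattern.
MODEL_BUCKET = "ml-models"
MODEL_KEY = "churn-model/v3/model.pkl"

_model = None  # cached across warm invocations


def handler(event, context):
    global _model
    if _model is None:
        obj = s3.get_object(Bucket=MODEL_BUCKET, Key=MODEL_KEY)
        # Deserialising the pickle also executes any code embedded in it.
        _model = pickle.loads(obj["Body"].read())

    features = json.loads(event["body"])["features"]
    prediction = float(_model.predict([features])[0])
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```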

This setup worked well, particularly for smaller, classical ML models, but as our use cases became more complex, several problems began to surface.

One of the biggest concerns was security, specifically around how we were saving and loading our models. In machine learning, we often serialise a model, which means converting it into a file so it can be stored or shared. Later, we deserialise it to bring it back into memory and use it again.

With Python’s pickle format, this process doesn’t just save data — it also saves code. And when you load a pickle file, that code runs automatically. So every time a Lambda function loaded a model, it was actually executing code from inside that file. Even though we generated those models ourselves, it was difficult to fully inspect or control what was being run. That’s not just messy — it’s a real security risk, especially as the system grows. It was a foundational flaw we knew we couldn’t scale safely.
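For anyone who hasn't seen why this matters, here is a generic, self-contained Python example (not taken from our codebase) showing that unpickling is effectively code execution: a class can define __reduce__ so that simply loading the bytes runs an arbitrary command.

```python
import pickle


class NotReallyAModel:
    # __reduce__ tells pickle how to rebuild the object on load.
    # Whatever callable it returns is executed during deserialisation.
    def __reduce__(self):
        import os
        return (os.system, ("echo 'this ran during pickle.loads'",))


payload = pickle.dumps(NotReallyAModel())

# Loading the bytes runs os.system before you ever see the object.
pickle.loads(payload)
```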

Another issue was dependency management. Each Lambda function needed to bundle all the required dependencies for the model it was serving. As models became larger and more complex, these deployment packages ballooned in size, leading to slow build times and more brittle deployments.

Our processing logic was also tightly coupled to the serving infrastructure. Pre-inference transformation and post-inference processing (reshaping inputs, applying business logic to model outputs, formatting response payloads) were often embedded directly in the same Lambda function that loaded and ran the model. Updating any single piece of that logic meant redeploying the entire function, making iteration slower and more error-prone.
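In the hypothetical handler above, that coupling looks something like this: the feature reshaping, the decision threshold, and the response format all sit next to the model call, so changing any one of them meant rebuilding and redeploying the whole function. The field names and the 0.7 threshold are made up for illustration.

```python
import json


def handler(event, context):
    # 1. Pre-inference transformation, inlined with the serving code
    raw = json.loads(event["body"])
    features = [raw["tenure_months"] / 12.0, raw["monthly_spend"]]  # hypothetical fields

    # 2. Inference, using the same cached pickled model as in the earlier sketch
    score = float(_model.predict([features])[0])

    # 3. Post-inference business logic and response formatting
    decision = "approve" if score > 0.7 else "review"
    return {"statusCode": 200, "body": json.dumps({"score": score, "decision": decision})}
```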

What We Have Now

Kubernetes is now the backbone of our ML platform. It’s where we run everything from production model endpoints to internal tooling.

Kubernetes orchestrates containers: small, portable packages that include everything an application needs to run, such as code and libraries. It ensures that our applications (in our case, ML models and services) run reliably, can recover from failure, and can scale up or down based on demand.

We’ve moved away from loading pickled models inside Lambda functions. Instead, models are packaged as containers and deployed as dedicated services on Kubernetes. These containers are built with their dependencies included, versioned through MLflow, and served via stable, reproducible endpoints. Because each model now runs in its own isolated environment, this approach has greatly reduced the security risks and deployment friction associated with dynamic deserialisation, giving us a much safer and more robust foundation for deployments.
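As a rough illustration of what one of those model containers can look like on the inside (the model name, version, and request fields here are assumptions, not our actual services): a small HTTP app that resolves its model from the MLflow registry at startup and exposes a predict endpoint plus a health check for Kubernetes probes.

```python
import os

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Illustrative: the image is built with pinned dependencies, and the model
# version is resolved from the MLflow registry when the container starts.
MODEL_URI = os.getenv("MODEL_URI", "models:/churn-classifier/3")
model = mlflow.pyfunc.load_model(MODEL_URI)


class PredictRequest(BaseModel):
    features: dict


@app.post("/predict")
def predict(request: PredictRequest):
    df = pd.DataFrame([request.features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}


@app.get("/healthz")
def healthz():
    # Target for Kubernetes liveness/readiness probes.
    return {"status": "ok"}
```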

This containerised approach has also made model experimentation much more flexible. We can now deploy multiple model versions simultaneously and route traffic between them for A/B testing, canary deployments, or gradual rollouts. Switching between models or rolling back problematic deployments is now a matter of updating service configurations rather than rebuilding and redeploying entire Lambda functions.

Pre- and post-inference logic is no longer embedded in the same runtime as the model. Instead, it lives in separate FastAPI microservices, each focused on a specific task — transforming inputs, validating data, interpreting outputs, or applying business logic. These services are deployed independently and can be updated without touching the model itself.
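A preprocessing service in this style can be as small as the following sketch; the input schema and feature logic are invented for illustration. The point is that it deploys, scales, and rolls back independently of the model container.

```python
import math

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()


class RawEvent(BaseModel):
    # Hypothetical input schema; validation happens at the edge of this service.
    tenure_months: int = Field(ge=0)
    monthly_spend: float = Field(ge=0)


@app.post("/transform")
def transform(event: RawEvent):
    # Purely illustrative feature engineering, owned by this service alone.
    return {
        "features": {
            "tenure_years": event.tenure_months / 12.0,
            "log_spend": math.log1p(event.monthly_spend),
        }
    }
```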

Beyond model serving and processing, Kubernetes is also where we host our internal ML tooling. MLflow, ZenML, and lakeFS dashboards all run in the cluster, giving us a centralised way to manage models, pipelines, and datasets.

Kubernetes isn't simple, and the initial setup was challenging. But it's one of those things where once you get it right, it just works. The investment in learning and tooling has made our ML platform significantly more robust and easier to iterate on. Since adopting Kubernetes, our average model deployment lead time has dropped from days to just minutes.
