How we scale with Kubernetes at zally

At zally, Kubernetes is the core of how we run and scale everything from authentication to machine learning workloads. In this blog, we’ll walk through our approach to container orchestration, explain the core abstractions behind Kubernetes, and show how we use node groups, taints, tolerations, and other scheduling strategies to host services such as Keycloak, our MLOps stack, machine learning models, and real-time inference endpoints.

How Kubernetes Works

To understand how Kubernetes runs workloads, we’re going to break it down into three core concepts.

Control Plane

The control plane is the “brain” of the cluster. It’s responsible for making global decisions — for example, deciding where a new application container should run or what to do if one fails. It’s made up of several components:

  • API Server – the front door to Kubernetes, where all requests come in.

  • Scheduler – decides which node should run a given workload.

  • Controller Manager – watches the state of the cluster and makes changes to match the desired state.

  • etcd – a distributed key-value database that stores all cluster configuration and state. The name comes from the /etc directory on Linux (where configuration lives), with the “d” standing for distributed.

Nodes

Nodes are the machines — physical or virtual — that run your workloads. 

Each node has:

  • Kubelet – the agent that talks to the control plane and ensures containers are running as instructed.

  • Kube-Proxy – maintains the networking rules that let your applications communicate with each other.

Nodes come in different shapes and sizes; in our case at zally, we group them into node groups optimised for specific workloads, such as GPU nodes for machine learning models.

Pods

A pod is the smallest deployable unit in Kubernetes. It wraps one or more containers along with their storage and network settings. Pods are ephemeral — if they fail, Kubernetes automatically replaces them. You rarely run a single pod on its own; instead, higher-level objects like Deployments manage pods for you.
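To make that concrete, here is a minimal sketch of a Deployment; the names and image are placeholders rather than one of our real services:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api            # placeholder name, not one of our real services
spec:
  replicas: 3                  # desired state: Kubernetes keeps three pods running
  selector:
    matchLabels:
      app: example-api
  template:                    # the pod template the Deployment creates pods from
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: example-registry/example-api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
```

If a pod crashes or its node disappears, the Deployment spots the gap between desired and actual state and creates a replacement, which is exactly the self-healing behaviour described above.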

Shaping Our Cluster for Different Workloads

At zally, we run a wide range of services on Kubernetes — from low-latency inference endpoints to stateful authentication systems and our full MLOps stack. These workloads differ in performance characteristics, resource needs, and operational criticality. Running them all on a single, undifferentiated pool of nodes would be inefficient and could cause noisy-neighbour issues.

To address this, we use node groups — separate sets of worker nodes optimised for specific workloads. Each node group has its own instance type, sizing, and scaling policy, ensuring that the right workload runs on the right hardware.
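This post stays tool-agnostic about how node groups are provisioned, but to give a feel for what "instance type, sizing, and scaling policy" looks like in practice, here is a rough sketch in the style of an eksctl config. The tool, instance types, and sizes are illustrative assumptions, not our actual setup:

```yaml
# Illustrative only: instance types, sizes, names, and labels are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster        # placeholder
  region: eu-west-2            # placeholder
managedNodeGroups:
  - name: models
    instanceType: g5.xlarge    # GPU instance for model serving (example value)
    minSize: 1
    maxSize: 4
    labels:
      workload: models
  - name: general
    instanceType: m5.large     # general-purpose instance (example value)
    minSize: 2
    maxSize: 10
    labels:
      workload: general
```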

Our Node Groups

  1. Inference – Runs real-time preprocessing FastAPI apps. These need predictable CPU performance and low-latency networking to serve incoming requests quickly (see the resource sketch after this list).

  2. Models – Hosts machine learning models for inference. These often require GPUs or high-memory nodes for optimal performance.

  3. Auth – Dedicated to Keycloak, our authentication service. This node group prioritises reliability for secure identity management.

  4. Monitoring – Contains Grafana and Prometheus, ensuring we can observe and alert on the health of the cluster without interference from application workloads.

  5. General – A flexible pool for everyday application services that don’t have specialised requirements.

  6. MLOps – Runs MLflow, ZenML, and LakeFS, supporting our experimentation, model tracking, and data versioning needs.
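As flagged under the Inference group, predictable CPU performance usually comes down to resource requests and limits. Below is a minimal sketch with placeholder values, not our production settings; setting requests equal to limits puts the pod in the Guaranteed QoS class, which protects it from eviction and keeps its CPU allocation stable:

```yaml
# Illustrative only: names and numbers are placeholders, not our production values.
apiVersion: v1
kind: Pod
metadata:
  name: example-preprocessing-api
  labels:
    app: example-preprocessing-api
spec:
  containers:
    - name: api
      image: example-registry/preprocessing-api:1.0.0   # placeholder image
      resources:
        requests:            # what the scheduler reserves on the node
          cpu: "2"
          memory: 4Gi
        limits:              # hard cap; equal to requests => Guaranteed QoS class
          cpu: "2"
          memory: 4Gi
```

In practice a block like this sits inside a Deployment’s pod template rather than a standalone Pod.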

By separating workloads in this way, we can:

  • Avoid resource contention between unrelated services.

  • Apply tailored autoscaling rules to each workload type.

  • Optimise costs by matching the right instance type to the workload.

This separation is further reinforced with taints and tolerations, which ensure that only the intended workloads are scheduled to each node group.

Keeping Workloads in Their Lane with Taints and Tolerations

Node groups provide physical separation for different workloads, but by default, Kubernetes does not prevent a pod from landing on the wrong node. To enforce the separation, we use taints and tolerations.

  • Taints mark a node as restricted, preventing pods from being scheduled there unless they explicitly tolerate the taint.

  • Tolerations let specific pods run on those tainted nodes.

We taint our specialised node groups so only the right workloads can run on them:

  • Models – only ML model pods can tolerate the GPU node taint.

  • Auth – only Keycloak pods tolerate the authentication node taint.

This ensures that expensive resources aren’t wasted, noisy neighbours are avoided, and each workload stays in the environment it was designed for. 
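Here is a hedged sketch of the pattern for the Models group; the taint key, labels, and image are placeholders rather than our actual configuration:

```yaml
# Illustrative only: the taint key/value and labels are placeholders.
# The taint itself is applied by the node group configuration or with:
#   kubectl taint nodes <gpu-node> workload=models:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-model-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-model-server
  template:
    metadata:
      labels:
        app: example-model-server
    spec:
      nodeSelector:
        workload: models           # attract the pod to the GPU node group
      tolerations:
        - key: workload            # permit scheduling onto the tainted nodes
          operator: Equal
          value: models
          effect: NoSchedule
      containers:
        - name: model-server
          image: example-registry/example-model:1.0.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1    # assumes the NVIDIA device plugin is installed
```

A toleration on its own only permits scheduling onto the tainted nodes; pairing it with a nodeSelector or node affinity is the usual way to make sure the pods land there and nowhere else.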

Observability and Monitoring

Reliable operations depend on seeing what’s happening inside the cluster — and catching problems before they impact users. We run a dedicated Monitoring node group that hosts Prometheus for metrics collection and Grafana for visualisation and alerting.

By isolating these workloads from application nodes, we ensure monitoring remains accurate even during traffic spikes or resource contention elsewhere in the cluster. Prometheus scrapes metrics from every service, node, and system component, giving us detailed insight into:

  • Resource usage by node group (CPU, memory, GPU).

  • Service-level performance for inference endpoints, models, and MLOps tools.

  • Cluster health metrics such as pod restarts, failed deployments, and node availability.

Grafana dashboards turn these metrics into actionable views, while alerts notify us of anomalies — from latency increases in inference services to unexpected memory growth in MLOps workloads. This visibility helps us react quickly, plan capacity, and keep the platform running smoothly.
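How targets get into Prometheus depends on how it is deployed, which this post doesn’t spell out. As one common pattern (the prometheus-operator / kube-prometheus-stack), a ServiceMonitor with placeholder names might look like this:

```yaml
# Illustrative only: assumes the prometheus-operator CRDs are installed.
# Names, namespaces, and the metrics port are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-inference-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-inference-api
  namespaceSelector:
    matchNames:
      - inference
  endpoints:
    - port: metrics        # named port on the Service exposing /metrics
      interval: 30s
```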

What We Learned

When we first started, everything ran on a single pool of nodes. It got the job done, but it often meant services were competing for resources, and a big training job could easily slow down our inference endpoints. We were mostly reacting to issues as they came up, rather than shaping the cluster around our needs.

Over time, separating workloads into node groups and tightening up scheduling with taints and tolerations gave us much more control. Combined with dedicated monitoring, we reached a point where we could deploy new models and experiment freely, without worrying that Keycloak, MLflow, ZenML, or other critical systems would be affected.

That shift gave us real confidence in the platform. What once felt like adapting to Kubernetes has become building on it, and that’s what lets us keep growing and improving.

Words by Adil Said (Machine Learning Engineer)
