Behind the Build: ZenML
At zally, robust machine learning (ML) pipelines are central to how we deliver data products. For a long time, our approach to running these pipelines was simple and pragmatic: we used SageMaker notebooks scheduled via cron jobs.
This setup allowed us to easily prototype and iterate directly in SageMaker notebooks while using cron to automate runs, avoiding the need for complex orchestration early on.
The Problem We Wanted To Solve
So what exactly was our setup? We had a typical pattern. Pipeline code was written directly in a SageMaker notebook, and execution was scheduled via a cron job. Each time the cron job triggered, the notebook ran end-to-end, executing all steps sequentially within a single environment.
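To make that concrete, a setup like ours reduces to a single crontab entry that executes a notebook end-to-end on a schedule. The entry below is a hypothetical sketch using papermill (a common tool for parameterised notebook execution); the paths, notebook names, and schedule are illustrative, not our real ones:

```shell
# Hypothetical crontab entry: run the whole training notebook
# end-to-end every night at 02:00 via papermill. Note that cron
# requires '%' to be escaped as '\%' inside a crontab line.
0 2 * * * papermill /home/ec2-user/pipelines/train_model.ipynb \
    /home/ec2-user/runs/train_model_$(date +\%F).ipynb
```

Everything, from data loading to model training, runs as one monolithic execution; if any cell fails, the whole run fails, and there is no record of intermediate state.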
While functional, our notebook-based approach lacked true pipeline structure: there were no clearly defined steps, no explicit inputs and outputs, no dependency management, and no directed acyclic graph (DAG) visualisation.
Beyond that, there was no orchestration abstraction in place. Our orchestration was effectively just “run this notebook on a schedule.” There was no separation between pipeline logic (what the ML workflow does) and orchestration logic (how it runs, where it runs, and how steps are managed). This limited independent step management, task parallelism, and component reuse.
Another issue was that our pipelines were tightly coupled to SageMaker notebooks. Running them locally for fast iteration was impractical, and running them in different environments, such as Kubernetes, required significant reimplementation effort. This reduced both the portability and flexibility of our workflows.
Finally, tracking and reproducibility were limited. While we could view execution logs from the notebooks, there was no structured tracking of pipeline runs as unified executions. We couldn’t easily see inputs, outputs, and parameters for each step, nor did we have metadata lineage that would enable reproducibility, effective debugging, or robust audits of our ML workflows.
Why We Adopted ZenML
We adopted ZenML to overcome these limitations and embrace a pipeline-first workflow.
What is ZenML?
ZenML is an open-source MLOps framework designed to simplify the development, orchestration, and tracking of pipelines. It allows us to define pipelines as sequences of steps, each with clear inputs and outputs. These pipelines can then be run on any orchestrator, whether that's locally on a developer's machine, in SageMaker for scalable production workloads, or on Kubernetes, all without changing the pipeline code. Importantly, ZenML tracks all runs, parameters, and artefacts in a structured and reproducible way, providing full transparency into pipeline execution.
Pipeline abstraction
With ZenML, we define our pipelines purely as Python functions grouped into modular steps. This transforms our workflows from linear notebook scripts into true pipelines, where each step is clearly defined and independently testable. This structure improves maintainability, readability, and reusability across different workflows.
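In ZenML's real API, each step is a function wrapped with its `@step` decorator and steps are composed with `@pipeline`; since exact decorator behaviour varies by ZenML version, here is a minimal plain-Python sketch of the same pipeline-as-steps idea. The step names and logic are hypothetical, chosen only to show how each step's typed output becomes the next step's input:

```python
# Minimal sketch of the "pipeline as modular steps" structure.
# In real ZenML these functions would carry @step, and
# training_pipeline would carry @pipeline; the bodies here are
# illustrative placeholders, not our production code.

def load_data() -> list[int]:
    """Step 1: produce raw data (hard-coded for the sketch)."""
    return [3, 1, 4, 1, 5]

def preprocess(raw: list[int]) -> list[int]:
    """Step 2: clean the data; here, drop duplicates and sort."""
    return sorted(set(raw))

def train(features: list[int]) -> float:
    """Step 3: 'train' a model; here, just compute a mean."""
    return sum(features) / len(features)

def training_pipeline() -> float:
    """Wires the steps together. Because each step's output feeds
    the next step's input, the structure is an explicit DAG, and
    each step can be unit-tested on its own."""
    raw = load_data()
    features = preprocess(raw)
    return train(features)

if __name__ == "__main__":
    print(training_pipeline())
```

The key shift is that every step is an ordinary, independently testable function with explicit inputs and outputs, rather than a cell buried in a notebook.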
Stacks decouple pipelines from infrastructure
ZenML’s 'stack' abstraction defines where and how a pipeline runs via components like the orchestrator, artefact store, and other tools like experiment trackers and container registries.
The orchestrator determines where the pipeline executes, whether that is locally, on SageMaker, or within a Kubernetes cluster. The artefact store defines where data generated by the pipeline is saved, which could be an S3 bucket or a local filesystem, depending on the environment. Finally, additional components such as experiment trackers or container registries integrate seamlessly to provide broader MLOps capabilities.
This decoupling has been transformative for our workflows. We can write pipeline logic once and then decide at runtime where it will execute. For instance, we can run pipelines locally during development for fast iteration and debugging, then switch to running the exact same pipeline on SageMaker for production-scale jobs, or on Kubernetes (EKS) for flexible orchestration within our cluster – all simply by changing the active stack configuration. No code changes are required to switch environments, which keeps our pipelines clean and infrastructure-agnostic.
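In practice, switching environments looks roughly like the following. The `zenml stack set` command is ZenML's CLI for changing the active stack, but the stack names and script name below are hypothetical examples, and exact flags vary by version:

```shell
# Stack names ("local", "sagemaker-prod") and the entrypoint
# script are illustrative assumptions, not our real config.

# Iterate locally with a local orchestrator and artefact store:
zenml stack set local
python run_pipeline.py

# Promote the exact same pipeline code to SageMaker by switching
# the active stack - no code changes required:
zenml stack set sagemaker-prod
python run_pipeline.py
```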
Unified tracking and visualisation
Another major benefit of ZenML is its built-in tracking and visualisation. Every pipeline run is tracked as a DAG, giving us a clear visual representation of the workflow and its dependencies. Inputs, outputs, and parameters for each step are logged in a structured manner, and full metadata lineage is maintained. This level of tracking provides robust reproducibility, makes debugging easier, and enables thorough audits of our ML workflows when needed.
What This Enables For Us
Together, these improvements have made a big difference in how we build and run our ML pipelines. With ZenML, our workflows are easier to develop, easier to maintain, and can run in any environment without extra effort. We spend less time on setup and more on building useful models, moving faster and staying focused on delivering value.