Behind the Build: lakeFS

Data Versioning at Scale with lakeFS — How We're Bringing Version Control to Behavioural Data

At zally, we use high-frequency sensor data to power user authentication. That means our machine learning pipelines depend on time-sensitive, high-volume behavioural data, and reproducibility isn’t just a nice-to-have; it’s a requirement.

As our data infrastructure matured, we ran into a familiar challenge:
How do you manage, test, and reproduce evolving datasets reliably?

We found the answer with lakeFS, and it fundamentally changed how we think about data versioning. But to appreciate what it gave us, it’s worth talking about what life was like before.

Before Data Versioning

Before adopting lakeFS, we didn’t have formal data versioning in place. Like a lot of teams working with S3, our workflows relied on convention, caution, and a healthy respect for the risks involved.

When we needed to backfill data (say, to recompute a feature or rerun a transformation), we'd start by manually creating a backup of the existing dataset.

That usually meant copying the relevant S3 folder to a new location, timestamping it, and making sure it was stored somewhere safe just in case we needed to roll back. 
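To make the old workflow concrete, here's roughly what that manual backup step looked like. This is a simplified sketch with hypothetical bucket and prefix names, not our actual backfill tooling:

```python
# Illustrative only: a manual "backup before backfill" in the pre-lakeFS world.
# Bucket and prefix names are hypothetical; real datasets are far larger.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "zally-features"      # hypothetical bucket
SOURCE_PREFIX = "touch-events/v1/"    # dataset we are about to overwrite
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
BACKUP_PREFIX = f"backups/touch-events/v1-{timestamp}/"

# Copy every object under the prefix to a timestamped backup location,
# one object at a time, before the backfill job is allowed to run.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        backup_key = BACKUP_PREFIX + key[len(SOURCE_PREFIX):]
        s3.copy_object(
            Bucket=SOURCE_BUCKET,
            Key=backup_key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )
```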

There wasn’t a centralised policy around this; it was more of an understood practice: “don’t touch anything unless you’ve got a copy you can restore from.”

For experimentation, we’d often download a slice of the data locally to work with. This gave us flexibility, but it also introduced drift. If someone else was testing a related change, their data snapshot might be slightly different. And since everyone managed their own copies, it was easy to lose track of which version was used for which experiment.

The system worked, but it was fragile. Backups were manual, rollback was difficult, and collaboration relied heavily on communication and discipline. We made it work because we had to. But as the team scaled and experiments got more complex, we needed something more robust and appropriate for production.

So How Has Introducing lakeFS Changed That?

lakeFS gave us structure where we previously relied on caution. It turned our manual safeguards into proper, versioned workflows — with branching, rollbacks, and traceability built in from the start.

Before, testing a new feature meant downloading a local copy or duplicating data in S3 to avoid impacting production. Now, we simply create a data branch, just like in Git.

We can test, iterate, and roll back confidently, knowing the main branch remains untouched. The main branch represents production-ready data, while feature development happens in isolated branches. This provides stability whether we're exploring a new touch-derived signal or adjusting a transformation.
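To make that concrete, here's roughly what creating and working on a data branch looks like with the lakeFS Python SDK. The repository, branch, and file names below are illustrative rather than lifted from our pipelines:

```python
# A minimal sketch with the lakeFS Python SDK (pip install lakefs).
# Repository, branch, and object names here are illustrative, not our
# production layout.
import lakefs

repo = lakefs.repository("behavioural-data")

# Branch off production data, exactly like branching in Git.
experiment = repo.branch("exp-touch-pressure-feature").create(source_reference="main")

# Write the recomputed feature to the experiment branch only;
# "main" stays untouched while we iterate.
with open("touch_pressure.parquet", "rb") as f:
    experiment.object("features/touch_pressure.parquet").upload(data=f.read())

# Every change is a versioned commit we can trace or roll back later.
experiment.commit(message="Recompute touch-pressure feature")
```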

These branches act as safe sandboxes: versioned, revertible, and fully disposable. No more risk of inconsistencies or data drift from manual copies.

If an experiment doesn't work, we delete the branch. If it does, we merge it — no folder duplication, no path rewrites, no guesswork. Promotion to production is just a merge, and lineage stays intact.
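Promoting or discarding an experiment is just as small. Continuing the illustrative sketch above (same assumed names, plus a hypothetical experiment_succeeded flag):

```python
# Continuing the illustrative example: promote or discard the experiment branch.
import lakefs

repo = lakefs.repository("behavioural-data")       # illustrative name
experiment = repo.branch("exp-touch-pressure-feature")
main = repo.branch("main")

experiment_succeeded = True  # hypothetical outcome of validating the experiment

if experiment_succeeded:
    # Promotion to production is just a merge; history and lineage stay intact.
    experiment.merge_into(main)
else:
    # A failed experiment is simply deleted; main was never touched.
    experiment.delete()
```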

Every dataset change is a versioned commit. That means we can trace exactly what changed, when, and why — with full visibility into our data’s evolution over time. For behavioral data that changes fast and matters deeply, that kind of traceability is critical. And the best part? It didn’t add friction — it removed it.
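That history is also queryable. Here's a short sketch of reading the commit log back with the same SDK, again with illustrative names:

```python
# Reading the commit history back: every dataset change on "main" is a commit
# we can inspect. The metadata is whatever the writing job attached at commit time.
import lakefs

repo = lakefs.repository("behavioural-data")
for commit in repo.branch("main").log(max_amount=10):
    print(commit.id, commit.committer, commit.message, commit.metadata)
```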

The Effect?

The impact is hard to capture in a single metric, but the biggest effect is that we don’t have to think about versioning most of the time, and that’s exactly the point. Data versioning doesn’t get in the way, but it’s there when we need it. It gives us the confidence to move quickly, test ideas, and make changes without the fear of breaking something. And on the rare occasions when things do go wrong, we know we can recover.

In a domain like behavioural biometrics, where even subtle differences in sensor data can affect outcomes, versioning at the data layer is a foundational capability. lakeFS gives us confidence that every model, metric, and prediction is built on a well-defined, reproducible dataset.

Post

How do you bring Git-style versioning to massive behavioural datasets?

At zally, our ML pipelines depend on fast, high-frequency sensor data, and reproducibility isn’t optional.

But trying to version training data in S3? Welcome to a world of careful conventions, manual steps, and a lot of coordination!

That is, until we adopted lakeFS.

It gave us branching, rollback, and traceability for our data — just like Git does for code.

In our latest article, we share how lakeFS fundamentally changed how we manage and experiment with behavioural data at scale.

What kind of data disasters have you seen first hand?
