TensorFlow over PyTorch: Choosing Our ML Framework

In one of our previous Behind the Build articles, we explored the architecture and philosophy behind ZenML. In this article, we’ll dive into another crucial decision we had to make: choosing between TensorFlow and PyTorch for ZenML development. Both are open-source machine learning frameworks widely used for building and training deep learning models, providing a comprehensive set of tools, libraries, and resources that enable developers and researchers to create, deploy, and manage AI systems.

ZenML is framework-agnostic, seamlessly integrating with both TensorFlow and PyTorch. This flexibility is a core tenet of ZenML, allowing teams to focus on building the best models without being constrained by their ML tooling. But what made TensorFlow the right choice for zally’s ML needs?

We’re going to be exploring this below. 

The Choice of TensorFlow for zally

When considering our short- and long-term goals, TensorFlow's ecosystem offered strategic advantages that were impossible to ignore.

1. Gentle Learning Curve and Abundant Resources

From a team development perspective, TensorFlow is highly accessible. With the high-level Keras API fully integrated, new engineers can get up to speed quickly. Building complex neural networks can be as simple as stacking layers, lowering the barrier to entry and accelerating development. Furthermore, TensorFlow’s historic popularity with developers means a wealth of high-quality tutorials, official documentation, and online courses are available. This rich ecosystem makes onboarding new team members easier and problem-solving faster, a huge benefit for a growing team.
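To make the "stacking layers" point concrete, here is a minimal sketch using the Keras Sequential API. The layer sizes and ten-class output are illustrative only, not a real zally model:

```python
import tensorflow as tf

# A small feed-forward classifier built by stacking Keras layers.
# The input width, layer sizes, and 10-class output are arbitrary.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),                      # 32 input features
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer
    tf.keras.layers.Dropout(0.2),                     # regularisation
    tf.keras.layers.Dense(10, activation="softmax"),  # 10-class output
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

A new engineer can read the architecture top to bottom, and training is then a single `model.fit(...)` call, which is exactly the accessibility we valued.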

2. Rapid Prototyping

While TensorFlow is often associated with production-grade deployments, it also supports fast experimentation and prototyping. Features like eager execution and the high-level Keras API allow engineers to iterate quickly on model architectures, test ideas, and validate approaches without getting bogged down in low-level implementation details. This speed of iteration enabled our team at zally to experiment efficiently, accelerating the path from concept to working prototype while keeping production-scale deployment options open.
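A quick sketch of what eager execution buys you during prototyping (the tensors here are toy values): operations run immediately and can be inspected like ordinary Python objects, and the same code can later be graph-compiled with `tf.function` once it stabilises:

```python
import tensorflow as tf

# Eager execution is on by default in TF 2.x: operations run
# immediately and produce ordinary, inspectable values.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)        # evaluated right away, no session required
print(y.numpy())           # a plain NumPy array: [[7. 10.] [15. 22.]]

# When iteration settles, the same code can be traced into a graph.
@tf.function
def square(t):
    return tf.matmul(t, t)

print(square(x).numpy())   # identical result, now graph-compiled
```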

3. Scalability for Large-Scale Models

TensorFlow is built to scale. Its robust support for distributed training, GPU/TPU acceleration, and production-level serving ensures that models can grow alongside your data and user base without requiring major architectural changes. This scalability is critical for teams like zally that anticipate increasing model complexity over time.
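As a minimal sketch of what "without major architectural changes" means in practice, assuming `tf.distribute.MirroredStrategy` (one of several built-in strategies, and not necessarily the one zally uses): the model definition itself is unchanged; only where it is created moves inside a strategy scope:

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all visible GPUs.
# With no GPUs it falls back to a single replica, so this sketch
# also runs on a plain CPU machine.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables and optimiser state created here are mirrored
    # across replicas; the model code is identical either way.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) would now shard each batch across the replicas.
```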

4. Independent Deployment and Mature Ecosystem

ZenML helps standardise the path to deployment, but the final deployment target is a strategic decision with long-term consequences. We had to ask: if we ever moved away from our current MLOps stack, would we still have the flexibility to serve our models anywhere? TensorFlow provides a decisive “yes.”

Its ecosystem is engineered for deployment independence and production readiness. TensorFlow Extended (TFX) offers a complete, end-to-end platform for every stage of the ML lifecycle, from data validation and transformation to model analysis and serving. TensorFlow Lite enables mobile and edge deployment, while TensorFlow.js brings models to the browser, providing a well-supported path to diverse endpoints outside any single MLOps framework.
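As a small, hedged example of the edge path (the model here is a throwaway two-unit network, not a production one), converting a Keras model into a TensorFlow Lite flatbuffer takes only a few lines:

```python
import tensorflow as tf

# A trivial stand-in for a trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(2),
])

# Convert to a TFLite flatbuffer, ready to ship to a mobile
# or edge device and run with the TFLite interpreter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
print(f"TFLite model size: {len(tflite_bytes)} bytes")
```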

Additionally, TensorFlow Serving is a high-performance serving system designed specifically for production environments. With optimisations for latency and throughput, model versioning, and seamless integration with infrastructure like Kubernetes, we had the confidence to deploy and manage our models at scale. Choosing TensorFlow gave zally a strategic safety net, ensuring deployment flexibility for the long term.
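A sketch of the serving hand-off, assuming a toy model and a temporary directory (the `my_model` name is hypothetical): TensorFlow Serving consumes the SavedModel format, and numbered subdirectories (`1`, `2`, ...) provide the version history it manages:

```python
import os
import tempfile

import tensorflow as tf

# A trivial stand-in for a trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])

# TensorFlow Serving watches a model directory whose numbered
# subdirectories are versions; "1" is the first version here.
export_dir = os.path.join(tempfile.mkdtemp(), "my_model", "1")
model.export(export_dir)  # writes the SavedModel format

print(sorted(os.listdir(export_dir)))  # includes saved_model.pb
```

Pointing a TensorFlow Serving instance (for example the official `tensorflow/serving` Docker image) at the parent `my_model` directory would then serve version 1 over REST/gRPC and pick up new numbered versions automatically.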

Wrapping Up Our TensorFlow Decision

Taken together, these factors (accessibility and learning resources, rapid prototyping, scalability, and independent deployment backed by a mature ecosystem) made TensorFlow the clear choice for zally. It allowed our team to move quickly from experimentation to production while ensuring long-term flexibility and robustness. TensorFlow’s cohesive, battle-tested ecosystem gives us the confidence that our models can scale, be deployed anywhere, and remain maintainable as our needs grow.

Challenges with TensorFlow

Despite its strengths, TensorFlow isn’t without drawbacks. Its vast ecosystem can feel overwhelming when deciding which tools or APIs to use, and debugging low-level operations is often less straightforward than in PyTorch’s more Pythonic design. Eager execution speeds up prototyping but can introduce performance trade-offs at large scale, and integrating TensorFlow with libraries outside its ecosystem sometimes takes extra effort.

These challenges haven’t outweighed the benefits for zally, but acknowledging them helps us plan workflows realistically and set expectations when building production-grade systems.

Acknowledging the Contender: The Appeal of PyTorch

The decision to standardise on a machine learning framework wasn’t one we took lightly. PyTorch presents a compelling case. Its Pythonic design makes development intuitive, and its dynamic computation graph is a massive advantage for research and prototyping novel architectures. Over the past few years, PyTorch has also significantly strengthened its production capabilities with tools like TorchServe, TorchScript, and integrations with major cloud MLOps platforms. This makes it increasingly suitable for end-to-end production workflows, not just research.
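To make the dynamic-graph advantage concrete (this is a contrived module, not anything we run in production): ordinary Python control flow can drive the forward pass, because PyTorch rebuilds the graph on every call:

```python
import torch

# PyTorch's dynamic graph: plain Python control flow inside forward().
class DepthVaries(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 8)

    def forward(self, x, depth):
        # The graph is rebuilt on every call, so the number of
        # layer applications can differ per input.
        for _ in range(depth):
            x = torch.relu(self.layer(x))
        return x

m = DepthVaries()
out = m(torch.randn(1, 8), depth=3)
print(out.shape)
```

This kind of per-call flexibility is why the framework is so well-suited to research on novel architectures.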

We have immense respect for the PyTorch ecosystem and understand why it has become the framework of choice for many in academia and research. However, for a company like zally, whose focus is on building robust, scalable, and production-grade AI features, the criteria for an internal framework extend beyond initial development experience. We needed a framework that combined speed of experimentation with a battle-tested path to deployment. For our team, TensorFlow provided the strongest combination of development accessibility, production support, and long-term strategic flexibility.

Looking Ahead

While TensorFlow has proven to be the right choice for zally today, we continue to keep an open mind about other frameworks. PyTorch, JAX, and emerging tools each bring unique strengths, and the landscape of deep learning frameworks evolves quickly. Thanks to ZenML’s framework-agnostic design, we can experiment with new technologies without committing our entire stack, minimising the weight of this decision. This flexibility ensures that as the field grows and new opportunities arise, zally can continue to adopt the best tools for both experimentation and production, while maintaining robust, scalable, and deployable AI systems.

Next

How we use Cursor for speed optimisation