What Happens When a Vision Model Meets Physics

Article August 22, 2025

Inside a sealed enclosure on a stadium gantry, a processor runs six camera feeds through a detection model. It has no fan. It has no cloud connection. It has a fraction of a second to process each frame before the next one arrives, and outside the enclosure, it is raining. Dr. Bhagyashree Lad, OZ's Head of AI Vision, designs the models that run inside those enclosures. Her background is in computer vision research, where models are evaluated on clean datasets with consistent lighting and unlimited compute. Production perception, she has learned, is not a harder version of that world. It is a different discipline entirely, one where the constraints are physical, the ground truth is live, and the only test that matters is whether the system works for ninety continuous minutes.

From Research to the Edge

Q: Your background is in computer vision research. What drew you to production perception at the edge?

Research gives you tight feedback loops. You design a model, train it, measure accuracy against a known test set. The metrics improve. The papers get published. The progress is real, but it exists inside a bounded world. Consistent lighting. Clean annotations. Known distributions.

The question I could never answer from inside that world was: what happens when you deploy this outside the lab? Not to a cloud server with elastic compute, but to a physical device at a physical venue where the conditions change constantly and the system cannot pause to think.

Q: And OZ was the answer to that question?

OZ was the first place where the question was the entire job. Every model we build ships to venue hardware: sealed enclosures on stadium gantries with fixed compute, fixed power, fixed memory. The model either works under those constraints or it does not. There is no fallback. There is no cloud. The match is live.

That combination (physical constraints, real-time requirements, continuous operation) creates a discipline that research does not prepare you for. Research teaches you to make models accurate. Production teaches you where accuracy is not enough.

What Makes Edge Perception a Different Discipline

Q: You say it is a different discipline, not just a harder version of the same thing. What do you mean?

In research, the variable you optimise is accuracy on a test set. The compute is elastic. The time budget is flexible. Failure means a worse number in a table.

At the edge, you optimise for many things simultaneously. The model must run within a strict power envelope because the hardware is sealed and fanless, so there's no way to dissipate extra heat. It must process multiple high-resolution camera feeds at frame rate, because even a single dropped frame means the tracking loses continuity. It must maintain identity consistency across long sequences: not thirty-second clips, but ninety continuous minutes where players sprint, cluster, collide, and sometimes wear kits that are nearly the same colour under certain lighting.

The failure mode is not a worse number. It is a camera that points at the wrong part of the pitch during a live broadcast. The feedback is not an accuracy score. It is a production director who sees the wrong shot.
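The arithmetic behind that time budget is unforgiving. As a rough sketch of why a single slow frame matters (the feed count, frame rate, and overhead figure here are illustrative assumptions, not OZ's specification):

```python
# Illustrative sketch (assumed numbers, not OZ's actual specs): how tight the
# per-feed time budget gets when one fixed processor handles several camera
# feeds at frame rate.

def per_feed_budget_ms(num_feeds: int, fps: float, overhead_ms: float = 2.0) -> float:
    """Milliseconds available to process one frame from one feed, assuming
    the feeds share compute sequentially and a fixed per-frame pipeline overhead."""
    frame_interval_ms = 1000.0 / fps
    return (frame_interval_ms - overhead_ms) / num_feeds

# Six feeds at 50 fps under these assumptions leave only a few milliseconds
# per feed; a model that occasionally runs longer than that drops frames.
budget_ms = per_feed_budget_ms(num_feeds=6, fps=50.0)
```

Under these assumed numbers the budget works out to 3 ms per feed, which is why architecture and compression choices are dictated by the hardware rather than by benchmark accuracy alone.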

Q: How does that change the way you design models?

Every model at OZ has a production specification before it has a training plan. We define the speed requirement, the accuracy floor, the recovery time after an occlusion, and the power budget before we write the first line of training code. The model is designed to meet the specification, not to maximise a benchmark score.

That inversion (specification first, training second) changes everything downstream. The architecture choices, the compression strategy, the training data, the validation criteria: all of it follows from the production requirements. In research, you start with a model and ask how accurate it can be. Here, you start with what the production demands and ask what model can meet those demands on the hardware you have.

Production perception inverts the typical research workflow. Instead of designing a model and measuring its accuracy, OZ defines the production specification first (speed, accuracy floor, power budget, recovery time) and designs the model to meet it. The hardware is fixed. The constraints are physical. The only variable is the engineering.
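Specification-first design can be sketched as a hard gate: the spec is fixed data, and a candidate model either satisfies every requirement or is rejected. The field names and threshold values below are hypothetical illustrations, not OZ's actual specification:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductionSpec:
    max_latency_ms: float      # per-frame inference budget on the venue hardware
    min_accuracy: float        # accuracy floor on the venue test suite
    max_power_w: float         # sealed, fanless enclosure power envelope
    max_recovery_frames: int   # frames allowed to re-acquire identity after occlusion

def meets_spec(spec: ProductionSpec, measured: dict) -> bool:
    """Every requirement must hold; missing any single one blocks the release."""
    return (measured["latency_ms"] <= spec.max_latency_ms
            and measured["accuracy"] >= spec.min_accuracy
            and measured["power_w"] <= spec.max_power_w
            and measured["recovery_frames"] <= spec.max_recovery_frames)

# Hypothetical spec, written before any training code exists.
spec = ProductionSpec(max_latency_ms=15.0, min_accuracy=0.92,
                      max_power_w=25.0, max_recovery_frames=10)
```

The point of the sketch is the direction of the dependency: the spec never bends to fit the model; the model is engineered until `meets_spec` returns true.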

What Venue Conditions Do to a Model

Q: What do real venue conditions do to a model that performs well on benchmarks?

Benchmark datasets are captured under consistent conditions: controlled lighting, well-separated subjects, static backgrounds. A model that trains on those datasets learns the patterns of those conditions. It does not learn what rain does to pixel contrast, or how a setting sun turning into floodlights changes the colour temperature across the pitch in the space of twenty minutes, or how lens condensation at halftime can degrade an entire camera array.

The gap is not about model quality. A model can be excellent and still struggle with conditions it has never seen. The gap is about training distribution: the conditions your model has learned from versus the conditions it faces in the field.

Q: How does OZ close that gap?

By training on real venue data. Our models train on footage from actual matches: the variable lighting, the weather, the occlusion patterns, the crowd conditions that our system actually faces. Every deployment generates new training material. Conditions that cause difficulty become test cases that gate future model releases.

Dr. Bhagyashree Lad

Head of AI Vision

Perception & Computer Vision

“Research teaches you to make models accurate. Production teaches you where accuracy is not enough.”

The practical effect is that our test suites grow with deployment experience. A model doesn't ship unless it handles the specific conditions that previous versions struggled with, not just on aggregate scores, but on the particular scenarios that matter in production.
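That release-gating logic is simple to express: per-scenario floors, checked individually, with the aggregate score deliberately absent. A minimal sketch (the scenario names and thresholds are invented for illustration):

```python
def gate_release(per_scenario_scores: dict, scenario_floors: dict) -> list:
    """Return the scenarios that fail their floor; ship only if the list is
    empty. A scenario with no recorded score counts as a failure, and the
    aggregate score is never consulted."""
    return sorted(s for s, floor in scenario_floors.items()
                  if per_scenario_scores.get(s, 0.0) < floor)

# Hypothetical scenarios drawn from the conditions described above; each one
# entered the suite because an earlier model version struggled with it.
floors = {"rain_floodlights": 0.90,
          "halftime_condensation": 0.88,
          "sunset_transition": 0.90}
scores = {"rain_floodlights": 0.93,
          "halftime_condensation": 0.85,
          "sunset_transition": 0.91}
failing = gate_release(scores, floors)  # non-empty list blocks the release
```

A model with a strong average can still fail the gate on a single scenario, which is exactly the behaviour the text describes.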

Multi-Camera Fusion: The Next Frontier

Q: What is the hardest perception problem you are working on now?

Multi-camera fusion. Taking what six cameras see and building a unified understanding of the scene, where every player has one identity, one position, one trajectory, regardless of which cameras can see them at any moment.

Each camera has a different focal length, a different exposure, a different perspective. A player visible in three cameras must be recognised as the same person. When they disappear behind a referee in one camera and reappear in another, the system must maintain continuity. No duplicates. No identity swaps.

Q: Why is it harder than single-camera tracking?

Single-camera tracking is a well-studied problem with mature techniques. Multi-camera fusion is a different problem class. You are not just tracking; you are reconciling different perspectives on the same physical reality. The cameras disagree about what a player looks like because they see different angles, different lighting, different levels of occlusion. Resolving those disagreements in real time, at frame rate, under production conditions: that is the frontier.

And it gets harder with player density. Open play with well-separated players is manageable. Corner kicks, where a dozen players crowd a small area and every camera sees a mass of overlapping bodies: that is where the problem becomes genuinely hard.
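One way to picture the core reconciliation step: once each camera's detections are projected into shared pitch coordinates, fusion becomes a question of which detections are the same player. The greedy nearest-centroid sketch below is a deliberately simplified stand-in (production systems also use appearance features, camera confidence, and temporal consistency, none of which is shown here), and the merge radius is an assumption:

```python
import math

def fuse_detections(per_camera_positions, merge_radius_m: float = 1.0):
    """Toy fusion sketch. Input: per-camera lists of (x, y) detections already
    projected to shared pitch coordinates in metres. Detections within
    merge_radius_m of an existing cluster centroid merge into it; otherwise
    they start a new cluster. Returns one averaged position per player."""
    clusters = []  # each cluster: [sum_x, sum_y, count]
    for camera in per_camera_positions:
        for (x, y) in camera:
            for c in clusters:
                cx, cy = c[0] / c[2], c[1] / c[2]
                if math.hypot(x - cx, y - cy) <= merge_radius_m:
                    c[0] += x; c[1] += y; c[2] += 1
                    break
            else:
                clusters.append([x, y, 1])
    return [(c[0] / c[2], c[1] / c[2]) for c in clusters]

# Three cameras, two players; the third camera sees only one of them.
cams = [[(10.0, 5.0), (30.0, 20.0)],
        [(10.2, 5.1), (30.1, 19.9)],
        [(9.9, 4.9)]]
fused = fuse_detections(cams)  # two positions, no duplicates
```

The sketch also shows why corner kicks are the hard case: once players stand closer together than the cameras' projection error, proximity alone can no longer tell two players apart, and the system must fall back on appearance and trajectory.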

What Research Does Not Prepare You For

Q: What is the most important thing you have learned that research did not teach you?

That the model is not the product. The model is one component of a system that must operate continuously, under conditions that change constantly, with no room for failure. The perception pipeline includes the model, but also the tracking logic, the identity management, the confidence estimation, the graceful handling of conditions where visibility degrades. The model alone solves nothing.

In research, the model is the deliverable. You publish the architecture, the training method, the accuracy numbers. In production, the model is a component. The system around it (the engineering that handles the cases the model gets wrong) is what determines whether the product works.

That perspective changes how you evaluate your own work. You stop asking "how accurate is the model?" and start asking "how does the system behave when the model is uncertain?" The second question is harder, and more important.
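The difference between the two questions shows up directly in system design: the output the rest of the pipeline consumes is a behaviour chosen as a function of model confidence, not just the model's best guess. A toy sketch (the thresholds and action names are illustrative assumptions, not OZ's control logic):

```python
def choose_camera_action(confidence: float, detection, last_known, wide_shot,
                         hold_threshold: float = 0.5,
                         track_threshold: float = 0.8):
    """Graceful degradation sketch: track the detection when the model is
    confident, hold the last good framing when it is unsure, and fall back
    to a safe wide shot when visibility has effectively degraded."""
    if confidence >= track_threshold:
        return ("track", detection)
    if confidence >= hold_threshold:
        return ("hold", last_known)
    return ("widen", wide_shot)
```

Notice that the model's accuracy never appears in this function; what matters is that every confidence level maps to a defined, safe behaviour, which is what "how does the system behave when the model is uncertain?" asks for.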


This interview is part of the OZ Interview Series, profiling the team building the world model for the physical world.

All Interviews · All with Bhagyashree · Learn more about OZ