What Breaks When Computer Vision Meets a Live Stadium
Multi-Camera Fusion in Theory vs Production
Q: You described multi-camera fusion in your first interview as the next frontier. What does it actually look like in production?
On a whiteboard, multi-camera fusion is elegant. Six cameras, overlapping fields of view, unified scene reconstruction. One slide, clean geometry.
In production, those six cameras have different focal lengths, different exposures, and different perspectives on the same player. One faces the setting sun. Another faces LED advertising boards cycling at 10 Hz. One sees a player's full body; another sees a torso obscured by a referee. And the players are moving at full sprint, changing direction constantly, in kits whose colours sometimes nearly match the opponent's.
Q: How does OZ approach the problem?
We treat all six feeds as one unified scene rather than six independent streams. The key design choice was to reason about where people actually are in physical space, not where they appear in each camera image. Early iterations that analysed each camera separately and tried to reconcile results created a matching problem that grew exponentially with player density. Fusing in world coordinates from the start eliminated that.
It still took several design iterations. Approaches that worked in controlled testing fell apart under match conditions, typically when unusual player arrangements exposed assumptions in the camera alignment model.
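The world-coordinate fusion idea can be sketched in a few lines. This is an illustrative simplification, not OZ's production pipeline: each camera's detections are projected onto the pitch plane with a per-camera homography, and points that land close together in world space are merged into one physical player, so cross-camera matching never happens in image space. The function names and the 0.75 m merge radius are assumptions.

```python
import numpy as np

def to_pitch_plane(H, cx, cy):
    """Project an image point (cx, cy) onto the pitch plane using the
    camera's 3x3 homography H (image -> world), then dehomogenise."""
    p = H @ np.array([cx, cy, 1.0])
    return p[:2] / p[2]

def fuse_detections(detections_by_camera, homographies, radius=0.75):
    """Map every detection into world coordinates, then greedily merge
    points within `radius` metres of each other. The result is one fused
    position per physical player, however many cameras saw them."""
    world_points = [
        to_pitch_plane(homographies[cam], cx, cy)
        for cam, dets in detections_by_camera.items()
        for (cx, cy) in dets
    ]
    fused = []
    for p in world_points:
        for cluster in fused:
            if np.linalg.norm(p - cluster["mean"]) < radius:
                cluster["points"].append(p)
                cluster["mean"] = np.mean(cluster["points"], axis=0)
                break
        else:
            fused.append({"points": [p], "mean": p})
    return [c["mean"] for c in fused]
```

Because merging happens once, in metres on the pitch, the cost stays linear-ish in the number of detections rather than blowing up with every pairwise cross-camera comparison.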
Why Academic Benchmarks Understate Production Difficulty
Q: What specifically is different about production conditions?
Standard tracking benchmarks evaluate on pre-recorded clips, typically 30 to 60 seconds, consistent lighting, pedestrians walking at normal speed. Production is a continuous 90-minute tracking session where lighting shifts as the sun sets and floodlights activate, players sprint and cluster and collide, and corner kicks pack a dozen bodies into a small area where most players partially occlude one another.
No public dataset covers these conditions. So our models train on real venue data, footage from actual matches under the variable lighting, weather, and crowd conditions our system faces. Every deployment generates new training material. Every failure we diagnose in production becomes a test case we gate future models against.
The practical effect is that our test suites grow with deployment experience. A model doesn't ship on an aggregate score alone: it must also resolve the specific failures flagged since the last release, the particular tracking cases where the previous version struggled.
Training on real venue data inverts the typical research workflow. Instead of designing models around benchmark performance, OZ's perception pipeline starts with real production failures, builds test cases from them, and gates every model release against those cases. The dataset grows with every deployment, and it covers conditions no public benchmark reproduces.
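A release gate of this kind is simple to express. The sketch below is illustrative, not OZ's actual tooling; the function name, the case names, and the 0.90 threshold are assumptions. The point it demonstrates is the policy stated above: a strong aggregate score never excuses a known production failure recurring.

```python
def release_gate(aggregate_score, regression_results, min_aggregate=0.90):
    """Gate a model release: the aggregate benchmark score must clear the
    bar AND every production-derived regression case must pass.
    `regression_results` maps case name -> bool (did the model pass it)."""
    failed = [name for name, ok in regression_results.items() if not ok]
    return {
        "passed": aggregate_score >= min_aggregate and not failed,
        "failed_cases": failed,
    }
```

A model scoring 0.95 overall still fails the gate if a single named case, say a sunset-glare clip harvested from a past match, regresses.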
Running Compressed Models on Edge Hardware
Q: You run compressed models on venue hardware. What does compression buy, and what does it cost?
Compression lets us run more sophisticated models within the power and thermal constraints of sealed, fanless edge enclosures. Our hardware processes multiple high-resolution camera feeds simultaneously; without compression, the models physically cannot keep up with the incoming video.
The cost shows up in edge cases, exactly the moments where you need the most precision. A player partially occluded by a referee, a ball against a white pitch line, a goalkeeper in a dark kit against shadowed advertising boards. Compression reduces the model's ability to resolve these subtle distinctions.
Dr. Bhagyashree Lad
Head of AI Vision
“The hardest part of production AI is not making models accurate. It is making them honest about when they cannot see.”
Q: How do you mitigate that?
We keep full numerical precision at the critical decision points (the final layers that determine "this is a player at this position" and "this is still the same player from the previous frame") while compressing everything else. It is a hybrid approach shaped by specific failure modes, not by benchmark optimisation. We give back a portion of the compute savings from full compression, but eliminate the critical failures it introduced.
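The hybrid scheme can be sketched with plain int8 weight quantisation. This is a minimal illustration of the idea, assuming symmetric per-tensor quantisation; real deployments would use a framework's quantisation toolchain, and the layer names here are hypothetical. Layers listed in `keep_full_precision` (the detection and re-identification heads) stay in fp32; everything else is stored as int8 plus a scale.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric int8 quantisation: int8 weights plus one fp32 scale."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def compress_model(layers, keep_full_precision):
    """Quantise every layer except those in `keep_full_precision`
    (e.g. the final detection and re-identification heads)."""
    out = {}
    for name, w in layers.items():
        if name in keep_full_precision:
            out[name] = ("fp32", w)          # critical decisions: exact
        else:
            out[name] = ("int8", quantize_int8(w))  # backbone: compressed
    return out
```

The backbone weights round-trip through int8 with small error, which is tolerable for feature extraction but, per the failure modes above, not for the final "same player, same position" decisions.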
What Happens When Vision Degrades
Q: What happens when cameras lose visibility: smoke, fog, condensation?
This is the problem that changed our architecture the most. Perception is probabilistic, but the broadcast is continuous. When camera feeds degrade, the downstream systems (camera automation, graphics overlays, the Venue Graph) still expect continuous spatial output. You cannot pause the broadcast while you wait for visibility to return.
Q: What did you build?
A degradation-aware tracking pipeline. When a camera feed quality drops (whether from smoke, lens condensation during a temperature swing, fog at a coastal venue, or a floodlight flicker) the system doesn't simply stop tracking. It maintains trajectory estimates using recent motion history while flagging those positions as lower confidence. It leans harder on the remaining cameras with clear sightlines. And when the degraded feeds recover, it reconciles the estimates with actual observations and corrects any drift.
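The coast-and-reconcile behaviour can be sketched for a single track. This is an illustrative constant-velocity model, not OZ's production tracker; the decay and blend factors are assumptions. While the feed is healthy the track follows observations; while it is degraded it coasts on its last velocity estimate and decays its confidence; on recovery it blends the coasted estimate back toward the real observation to correct drift.

```python
import numpy as np

class DegradationAwareTrack:
    """One player's pitch position under intermittent visibility."""

    def __init__(self, position, dt=1 / 25):
        self.position = np.asarray(position, dtype=float)
        self.velocity = np.zeros(2)
        self.confidence = 1.0
        self.degraded = False
        self.dt = dt  # frame interval in seconds

    def update(self, observation=None, decay=0.9, blend=0.5):
        if observation is None:
            # Feed degraded: coast on recent motion, admit uncertainty.
            self.position = self.position + self.velocity * self.dt
            self.confidence *= decay
            self.degraded = True
        else:
            obs = np.asarray(observation, dtype=float)
            # On recovery, reconcile the coasted estimate with reality
            # instead of snapping, which would look like teleportation.
            new_pos = obs if not self.degraded else (
                blend * self.position + (1 - blend) * obs)
            self.velocity = (new_pos - self.position) / self.dt
            self.position = new_pos
            self.confidence = 1.0
            self.degraded = False
        return self.position, self.confidence
```

The essential property is that the output never stops, but its confidence value honestly reflects how long the system has been guessing.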
The system knows when it is running in degraded mode, and it communicates that uncertainty downstream. Honest uncertainty propagation matters more than false confidence.
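What downstream consumption of that uncertainty might look like, sketched for one hypothetical consumer (a graphics overlay; the function name and thresholds are illustrative, not OZ's API): draw normally when confident, mark the position as estimated while vision is degraded, and drop the element entirely rather than render a guess.

```python
def overlay_policy(position, confidence, show=0.6, hold=0.3):
    """Decide how a graphics overlay renders one tracked position,
    given the confidence the perception layer propagates with it."""
    if confidence >= show:
        return {"draw": True, "style": "normal", "position": position}
    if confidence >= hold:
        return {"draw": True, "style": "estimated", "position": position}
    return {"draw": False, "style": None, "position": None}
```

Each consumer can choose its own thresholds; the contract is only that confidence arrives alongside every position.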
Because all processing runs on-venue (not in the cloud), there is no remote fallback when conditions degrade. The system must handle degradation locally. This constraint forces a level of built-in resilience that cloud-dependent architectures never develop, because cloud systems always have the option of scaling up remote compute. At the venue, you either handle it or you fail on live broadcast.
What Compounds in Production Perception
Q: After working through these production challenges, what advantage compounds over time?
Institutional knowledge encoded in the system. Every degradation event, every compression artefact, every corner-kick occlusion pattern becomes a test case that gates future releases. The graceful-degradation pipeline exists because of specific visibility events. The hybrid precision approach exists because of specific detection failures. Venue-tuned model variants exist because lighting and conditions vary more across stadiums than any synthetic augmentation captures.
A team starting fresh would need to discover these failure modes independently, and survive them during live production without damaging venue relationships. The knowledge does not sit in a document. It sits in the pipeline as test gates, fallback logic, and confidence thresholds calibrated against real match conditions.
That is what accumulates. Not the architecture; architectures are published. Not the raw data; data can be collected. The production-hardened understanding of what actually fails, and the engineering that prevents it from failing again.