
The Gap Between an AI Demo and AI Infrastructure

Article · November 2, 2025


The model update improved accuracy by 2.3 percentage points. It passed every benchmark. It cleared every lab validation. It shipped to a venue, and within twenty minutes, dropped frames appeared. The culprit took seventy-two hours to find: a specific type of LED floodlight created reflections that triggered a memory bottleneck, causing 15-to-20-millisecond slowdowns, invisible in averages, fatal to a live broadcast where half a million viewers see every stutter. That failure is the story of OZ Cortex in miniature. Suresh Gohane, OZ's AI Stack Lead, built the runtime that executes every AI model on-site at every venue, and his measure of success is closing the gap between a demo that runs once to applause and infrastructure that runs a thousand times without anyone noticing.

The Promise of On-Site Intelligence

The promise was irresistible and, on the surface, straightforward. Take the extraordinary progress in AI (models that detect, track, classify, and predict with superhuman accuracy in controlled settings) and deploy them to the physical world. Run them on-site, where the data originates, where the system must react in milliseconds, where the difference between a correct decision and a missed one determines whether a robotic camera tracks a striker's run or loses them behind a cluster of defenders.

Every AI company begins with this promise. The conference talks are compelling. The demos are remarkable. A model runs on a powerful workstation, processes a pre-recorded video clip, and produces beautiful bounding boxes, smooth trajectories, and accurate classifications. The audience applauds. Funding follows.

What the audience doesn't see (what the demo carefully obscures) is the gap between that moment and what happens when you deploy the same model to a sealed box on a stadium rooftop, processing six high-resolution camera streams simultaneously, in rain, under floodlights bouncing off wet surfaces, for ninety consecutive minutes, with zero tolerance for glitches because 500,000 viewers are watching the broadcast feed.

That gap is where I live. Bridging it is what Cortex was built to do.

Cloud-Dependent AI

Before OZ, the default approach to AI processing was cloud-dependent. Capture video at the venue. Compress it. Stream it to a cloud data center. Run the AI models. Return results. The cloud offered seemingly unlimited compute, elastic scaling, and centralized model management. For recommendation engines, content moderation, and batch analytics, this architecture works well.

For real-time physical-world applications, it is fundamentally broken.

The arithmetic is unforgiving. A robotic camera tracking a player at full sprint (changing direction, accelerating, decelerating unpredictably) needs to see and react faster than a human blink. That's the physics.

A round-trip to the nearest cloud data center eats up most of that time budget in network transit alone, before any useful computation even begins. And this assumes the network connection is stable, which in a stadium with 40,000 fans consuming bandwidth alongside broadcast uplinks, it demonstrably isn't.
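The arithmetic is easy to sketch. The numbers below are illustrative assumptions (a blink of roughly 100 ms, optimistic stadium-to-cloud transit times), not OZ's measured figures, but they show how little of the budget survives the round trip:

```python
# Illustrative latency budget. All numbers are assumptions for the sake of
# the arithmetic, not OZ's published measurements.
BLINK_MS = 100.0              # a human blink is roughly 100-150 ms

encode_ms = 15.0              # capture and compress at the venue
network_round_trip_ms = 60.0  # optimistic transit to the nearest cloud region and back
decode_ms = 10.0              # unpack results on-site

transit_total = encode_ms + network_round_trip_ms + decode_ms
remaining_for_inference = BLINK_MS - transit_total

print(f"Budget consumed before any AI runs: {transit_total:.0f} ms")
print(f"Left for inference and actuation:   {remaining_for_inference:.0f} ms")
```

Under these assumptions, 85 of the 100 milliseconds are gone before a single model runs, and that is the stable-network best case.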

Cloud-dependent AI works when speed is a nice-to-have. It fails when speed is a physics constraint. Every real-time physical-world application (autonomous vehicles, industrial robotics, live broadcast automation, security response) hits this same wall. The cloud is too far away. Processing on-site is the only option.

"Just Run It in the Cloud"

"Just run it in the cloud." I heard this from every advisor, every industry contact, every engineer who had built their career on cloud architectures. The resistance was genuine and well-intentioned. Cloud infrastructure is mature, well-understood, and scalable. On-site AI is none of those things. On-site, you have a fixed processing budget. You have heat limits. You have power limits. You have no operations team present. And when something fails, you cannot just restart a server; you have to reason about the physical state of hardware you cannot touch.

The resistance also came from within. Building a purpose-built on-site AI runtime is, by any honest assessment, an extraordinarily ambitious undertaking. We could have used off-the-shelf AI serving software and adapted it to our on-site constraints. It would have been faster to prototype. It would have been easier to hire for.

But adaptation isn't the same as purpose-building. Software designed for cloud data centers carries assumptions (about available memory, about cooling, about failure recovery, about priorities) that don't hold on-site at a stadium. We tried. The gap between "working in the lab" and "working at the venue" wasn't a matter of tweaking settings. It was architectural.

Building Cortex

Cortex began as an answer to a specific question: what would an AI runtime look like if it were designed from scratch to guarantee consistent performance at the venue?

The key word is deterministic: every operation completes on time, every time. In the cloud, performance is typically described in averages: "usually fast, occasionally slow." On-site, averages are meaningless. A single slow frame means the robotic camera misses a tracking update. The camera lurches. The viewer sees a jump. The broadcast director loses trust. Deterministic means every frame, every time, within budget.
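A toy calculation shows why averages mislead. A handful of stalls in ten thousand frames barely move the mean, yet each one blows a 60 fps frame budget; the numbers here are made up for illustration:

```python
# 10,000 frames at a steady 8 ms, plus five 25 ms stalls: the mean barely
# moves, but the 16.6 ms frame budget (60 fps) is blown five times.
latencies = [8.0] * 9995 + [25.0] * 5
budget_ms = 16.6

mean = sum(latencies) / len(latencies)
worst = max(latencies)
misses = sum(1 for t in latencies if t > budget_ms)

print(f"mean={mean:.2f} ms  worst={worst:.1f} ms  missed frames={misses}")
# The mean says "healthy"; the worst case says five visible stutters.
```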

Cortex achieves this through three design decisions. First, strict time budgeting. Every AI model in the pipeline has a defined time budget, and the system enforces it. If a model threatens to exceed its budget, the pipeline degrades gracefully, reducing quality on non-critical tasks rather than allowing the entire pipeline to stall.
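The budget-enforcement idea can be sketched as a per-stage loop. This is a hypothetical simplification, not Cortex's actual scheduler; the stage names, budgets, and criticality flags are illustrative assumptions:

```python
import time

# Hypothetical sketch of per-stage time budgeting with graceful degradation.
# When the remaining frame budget cannot cover a stage's declared cost, a
# non-critical stage is skipped for this frame; critical stages always run.
FRAME_BUDGET_MS = 16.6

def run_frame(stages, frame):
    """stages: list of (name, fn, budget_ms, critical). Returns per-stage results."""
    deadline = time.monotonic() + FRAME_BUDGET_MS / 1000.0
    results = {}
    for name, fn, budget_ms, critical in stages:
        remaining_ms = (deadline - time.monotonic()) * 1000.0
        if remaining_ms < budget_ms and not critical:
            results[name] = None      # degrade gracefully: drop non-critical work
            continue
        results[name] = fn(frame)     # the critical path always runs
    return results

stages = [
    ("detection", lambda f: "boxes", 6.0, True),
    ("pose",      lambda f: "skeletons", 5.0, False),
    ("events",    lambda f: "highlights", 4.0, False),
]
print(run_frame(stages, frame=None))
```

The important property is that a slow stage costs itself, not the pipeline: the frame still ships on time, just with less non-critical intelligence attached.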

Second, all memory is pre-allocated. Nothing is allocated on the fly during operation. Every piece of working memory is reserved at startup. This eliminates the unpredictable slowdowns that plague long-running AI systems, the kind of micro-stutters that ruin a broadcast.
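The pre-allocation pattern is a fixed buffer pool: everything is reserved at startup and recycled, so the steady-state loop never allocates. A minimal sketch, with illustrative sizes (the pool class and its names are not OZ's actual API):

```python
from collections import deque

class BufferPool:
    """Fixed pool of frame buffers, all allocated once at startup."""

    def __init__(self, count, size_bytes):
        self._free = deque(bytearray(size_bytes) for _ in range(count))

    def acquire(self):
        if not self._free:
            # Exhaustion is a sizing bug caught at integration time,
            # never an excuse to allocate during a live match.
            raise RuntimeError("buffer pool exhausted")
        return self._free.popleft()

    def release(self, buf):
        self._free.append(buf)

pool = BufferPool(count=8, size_bytes=1920 * 1080 * 3)  # e.g. one HD RGB frame
buf = pool.acquire()
# ... fill buf with pixels, run inference on it ...
pool.release(buf)
```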

Third, heat awareness is built in. The venue node reports its temperature continuously. When the hardware approaches heat limits (a common scenario on a summer afternoon in direct sunlight), Cortex doesn't wait for the hardware to throttle itself. It proactively adjusts the workload, running lighter-weight versions of models and deferring non-critical tasks, while keeping the mission-critical path at full quality.
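The proactive policy can be sketched as a simple temperature-to-workload mapping. The thresholds and task names below are illustrative assumptions, not Cortex's real limits; the point is that detection, the mission-critical path, stays at full quality in every regime:

```python
# Hypothetical thermal policy: below a soft limit run everything at full
# quality; between the soft and hard limit lighten non-critical tasks;
# above the hard limit defer them entirely. Thresholds are illustrative.
SOFT_LIMIT_C = 75.0
HARD_LIMIT_C = 85.0

def plan_workload(temp_c):
    if temp_c < SOFT_LIMIT_C:
        return {"detection": "full", "pose": "full", "events": "full"}
    if temp_c < HARD_LIMIT_C:
        return {"detection": "full", "pose": "light", "events": "light"}
    return {"detection": "full", "pose": "light", "events": "deferred"}

print(plan_workload(70.0))   # everything at full quality
print(plan_workload(80.0))   # non-critical tasks switch to lighter variants
print(plan_workload(90.0))   # non-critical work deferred; detection still full
```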

Suresh Gohane

OZ Cortex / AI Stack Lead

AI Runtime & Cortex

“A demo runs once and everyone claps. Infrastructure runs a thousand times and nobody notices. That is the standard.”

Optimization Under Constraint

Once Cortex existed as an architecture, the real work began: making production AI models run within the fixed processing power of the OZ VI Venue node, the kind of GPU power that usually fills a data center rack, packed into a single weatherproof box. Six high-resolution camera streams. All AI models (detection, tracking, body positioning, event recognition) running simultaneously, continuously, for hours.

New compression techniques were the first lever. By reducing the numerical precision of AI models in carefully chosen layers, we can run them twice as efficiently, but the accuracy trade-off must be managed surgically. The critical layers that spot players maintain full precision. The less-critical layers accept lower precision. Every trade-off is validated against real venue conditions, not just academic benchmarks.
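The mixed-precision idea can be sketched as a per-layer policy: critical layers keep float32, everything else is fake-quantized to symmetric int8. The layer names, the critical set, and the quantization scheme here are illustrative assumptions, not OZ's actual compression pipeline:

```python
import numpy as np

# Layers that spot players keep full precision (illustrative name).
CRITICAL = {"detect_head"}

def quantize_int8(w):
    """Symmetric per-tensor int8 fake quantization: quantize, then dequantize."""
    scale = max(float(np.abs(w).max()), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale   # dequantized weights used at runtime

def compress(layers):
    """layers: {name: float32 weight array}. Critical layers pass through."""
    return {name: (w if name in CRITICAL else quantize_int8(w))
            for name, w in layers.items()}
```

The surgical part is the policy, not the math: which names land in `CRITICAL` is decided by validating each trade-off against real venue footage.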

Model distillation was the second lever. Training smaller, faster "student" models that learn from larger, slower "teacher" models. The student doesn't need to learn everything, just what actually appears in the real world. A model trained specifically on venue data (real lighting conditions, real camera angles, real player behaviors) outperforms a larger general-purpose model at that venue, at a fraction of the processing cost.
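The standard distillation objective mixes a soft-target term (the student matches the teacher's softened output distribution) with the usual hard-label term. A minimal numpy sketch; the temperature and mixing weight are conventional knobs with illustrative values, not OZ's training recipe:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """KL(teacher || student) on temperature-softened outputs, plus cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))   # soft-target term
    ce = -np.log(softmax(student_logits)[label])     # hard-label term
    # T*T rescales gradients of the softened term, per the usual formulation.
    return alpha * (T * T) * kl + (1 - alpha) * ce
```

Minimizing this pulls the student toward the teacher's behavior on exactly the distribution it is trained on, which is why a venue-specific student can beat a general-purpose giant at that venue.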

Smart scheduling was the third lever. Not every AI model needs to run on every frame. Player detection runs at full speed. Body position analysis can run at half speed with interpolation between frames. Event recognition can sample key moments. Cortex schedules each model at its optimal pace, extracting maximum intelligence from fixed processing power.
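The cadence idea reduces to each model declaring how often it runs. A sketch with illustrative model names and rates (not Cortex's actual schedule):

```python
# Each model declares its cadence in frames: detection every frame, body
# position every second frame (interpolated between), event recognition
# sampling every tenth frame. All cadences here are illustrative.
CADENCE = {"detection": 1, "pose": 2, "events": 10}

def models_due(frame_idx):
    """Return the models scheduled to run on this frame."""
    return [name for name, every in CADENCE.items() if frame_idx % every == 0]

for i in range(4):
    print(i, models_due(i))
# Frame 0 runs all three; odd frames run detection only; frame 2 adds pose.
```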

On-site AI isn't cloud AI with less horsepower. It's a discipline built on constraint: fixed heat limits, fixed power, fixed memory, zero cloud fallback. The constraint isn't a limitation. It's the forcing function that produces systems reliable enough to be called infrastructure.

When Benchmarks Lie

The moment that defined Cortex's development was a model update that passed every benchmark and failed in production.

The new model improved accuracy by 2.3 percentage points on our test data. It passed every lab validation. It met speed targets in testing. It shipped to a venue for the next match. And within twenty minutes, dropped frames appeared. Not many: one or two per minute. Imperceptible in dashboards that report averages. Devastating to a broadcast feed where a single dropped frame creates a visible stutter.

The root cause took seventy-two hours to identify. The new model used the hardware's memory in a subtly different way that, under the specific lighting conditions of that venue's floodlights (a type of LED that created unusual reflections on the pitch), triggered a processing bottleneck that never appeared under normal lab lighting. The bottleneck caused intermittent slowdowns of just 15 to 20 milliseconds, invisible in averages, fatal for a system that must hit its timing on every single frame.

This experience fundamentally changed how we test. We no longer validate models against static test data. We validate against full-match recordings from every deployed venue, under the heat conditions those venues produce, with the exact system configuration that production uses. If a model introduces even a single timing hiccup on any venue's test scenario, it doesn't ship.
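The shape of that release gate can be sketched simply: replay recorded scenarios per venue and fail the candidate if any single frame overruns its budget. The worst case decides, never the average. The function name, budget, and data shape are illustrative assumptions:

```python
# Hypothetical release gate in the spirit described above: a candidate model
# ships only if every frame of every venue's replay stays within budget.
FRAME_BUDGET_MS = 16.6

def gate(per_venue_frame_times):
    """per_venue_frame_times: {venue: [frame_ms, ...]} from full-match replays."""
    for venue, times in per_venue_frame_times.items():
        worst = max(times)
        if worst > FRAME_BUDGET_MS:
            return False, f"{venue}: worst frame {worst:.1f} ms exceeds budget"
    return True, "every frame at every venue within budget"
```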

A demo runs once and everyone claps. Infrastructure runs a thousand times and nobody notices. The gap between those two sentences is where the real engineering lives.

Published Guarantees, Every Match

The system sees and reacts faster than a human blink, from the moment light hits the camera sensor to structured data appearing in the Venue Graph. Not an average. Not a best case. Contractual. Published. Measured per venue, per match, with financial penalties if we miss.

That guarantee isn't a marketing number. It's the consequence of every design decision in Cortex: the strict time budgeting, the pre-allocated memory, the heat-aware workload adjustment, the venue-specific validation. Each decision removes a source of unpredictability. The guarantee is what remains when unpredictability has been systematically eliminated.

Published performance guarantees with financial penalties are the difference between a technology vendor and an infrastructure partner. When you attach real money to your performance promises, you discover very quickly whether your system actually delivers what you claim.

We publish these numbers because accountability is the foundation of trust. A venue operator evaluating AI infrastructure doesn't care about benchmark scores. They care about what happens during the eighty-seventh match of the season, in January, when the temperature has dropped and a software update shipped that morning. They care about consistency. And consistency is what Cortex delivers, not because it is extraordinary, but because it is systematically engineered to be ordinary. Reliably, repeatably, invisibly ordinary.

A Different Discipline Entirely

On-site AI isn't cloud AI with less compute. I say this to every engineer who joins the team, and I say it to every external collaborator who asks about Cortex. It's a different discipline entirely.

In cloud AI, the solution to performance problems is throwing more hardware at them: add more servers, add more processing power, add more bandwidth. The fundamental assumption is that resources are elastic. In on-site AI, resources are fixed. The solution to performance problems is deeper understanding: of the hardware, of the heat dynamics, of the memory limits, of the scheduling constraints, of the specific conditions at each deployment site.

This constraint produces a different kind of engineering culture. Cloud engineers optimize for how much they can process. On-site engineers optimize for consistency. Cloud engineers study average behavior. On-site engineers study worst-case behavior. Cloud engineers celebrate beating a benchmark. On-site engineers celebrate surviving a season.

Cortex is the embodiment of that discipline. It runs every day, at every venue, under every condition. It meets its performance guarantees not because it is powerful, but because it is precise. And every match it runs successfully makes the next match more certain, because the operational data feeds back into validation, the heat profiles refine the scheduling, and the venue-specific learnings compound.

Latency isn't a metric. It's a promise. And Cortex keeps it.


This interview is part of the OZ Interview Series, profiling the team building the world model for the physical world.

All Interviews · All with Suresh · Learn more about OZ