One JSON File Runs Six Cameras for Ninety Minutes
The System Sees Everything. Now What?
Q: Your previous interviews covered the perception pipeline, how the system detects, tracks, and fuses everything happening in a venue. What changed?
We solved the seeing part. Not perfectly (perception is never perfect) but reliably enough that every match, the system produces a continuous, fused model of every entity on the pitch. Twenty-two players, the ball, the referees, the goalposts. Positions updated sixty times per second. Velocities. Trajectories. Which player has the ball. Which team is attacking which goal. The Venue Graph holds all of it.
But knowing where everyone is does not tell a camera what to do.
Q: What do you mean?
Think about what a broadcast director does during a match. They are not just watching the game. They are making hundreds of decisions per minute. When the ball crosses the halfway line and play transitions from midfield to attack, the main camera should widen to show the tactical picture. The action camera should tighten on the ball carrier. The goal camera behind the net should prepare, because the ball is now heading toward its end. And the framing isn't just "point at the ball." There is leading space, extra space in the frame ahead of the direction of play, toward the goal being attacked, so the viewer's eye has somewhere to go. There is composition, the goalpost should be visible in the boundary camera to give the viewer spatial context. There is smoothing, the camera movement should be fluid, not jerky, even when players change direction at full sprint.
All of this is happening simultaneously, across six cameras, sixty times per second, for ninety minutes. A human director does it on instinct built over decades. But we needed the system to do it, and we needed it to be configurable, not hardcoded.
What If Behavior Was a Document?
Q: How was it done before?
Code. Behavior was hardcoded in the software. If you wanted the attack camera to behave differently (tighter framing, more leading space), an engineer had to change the code, recompile, redeploy. Every venue was slightly different, so every venue had its own configuration buried in source files. If a producer wanted to adjust how corner kicks were covered, they had to file a request, wait for engineering, test, deploy. Weeks.
And the logic was opaque. A producer could not read the code and understand what the cameras would do. They had to watch the output and give feedback. It was like directing a play by watching a recording and sending notes to someone who speaks a different language.
Q: So what did you build?
The Playbook. A single JSON file that declares the complete behavior specification for every camera in a venue.
Open it in any text editor. Read it. Understand what every camera will do in every situation: when the ball is in midfield, when it enters the attacking third, when there is a corner kick, when the director overrides. One file. Human-readable. Machine-executable. No compilation. No engineering team required to modify it.
Q: A JSON file that controls six broadcast cameras producing a live football match?
One file. And it is not code. It is a document. A declaration of intent. It says what should happen and when. It never says how. It never mentions motor speeds, PTZ angles, or pixel coordinates. The lower layers of the system handle all of that. The Playbook operates at the level a director thinks at: zones, entities, situations, and shots.
The Playbook JSON is a declarative behavior specification. It describes what cameras should do and when, without encoding how the underlying hardware executes it. The format is designed to be readable by a producer, interpretable by a machine, and portable between venues.
The Space Comes First
Q: Walk us through what is actually in the file. What does a producer see when they open it?
The first section is the environment. Before the system can reason about what happens in a space, it needs to know the shape of that space.
A football pitch is 105 meters by 68 meters. The origin (the zero point) sits at the center circle. Everything is measured in meters from there. The pitch is divided into zones: the middle zone from roughly the 35-meter lines inward, and two primary zones (the attacking thirds) from the 35-meter lines out to each goal line.
Then there are landmarks. The goal lines. The center line. These are reference points that the system uses for spatial reasoning: how far is the ball from the goal? Which boundary is closest?
And there are phases. Pre-match. First half. Half time. Second half. Full time. The system behaves differently in each phase. You don't run attack coverage during half time.
Q: Can you show me what that looks like in the file?
Here, this is the environment block for a football pitch:
{
"environment": {
"origin": {
"type": "center",
"coordinate_system": "cartesian_meters"
},
"bounds": {
"x": [-52.5, 52.5],
"z": [-34.0, 34.0]
},
"zones": [
{
"id": "zone_primary_neg",
"name": "Primary zone (negative X)",
"bounds": { "x_min": -52.5, "x_max": -17.5 }
},
{
"id": "zone_middle",
"name": "Middle zone",
"bounds": { "x_min": -17.5, "x_max": 17.5 }
},
{
"id": "zone_primary_pos",
"name": "Primary zone (positive X)",
"bounds": { "x_min": 17.5, "x_max": 52.5 }
}
],
"phases": ["pre_event", "active_event", "intermission", "post_event"]
}
}

The zones describe the space in general terms. "Primary zone negative X" is the attacking third on one side. "Middle zone" is midfield. "Primary zone positive X" is the attacking third on the other side. The phases describe the temporal structure: "active event" covers the two halves, "intermission" is half time. The same vocabulary describes any environment: a basketball court, a shipping port, or an airport terminal. The zones change shape and meaning, but the structure stays the same. A domain extension layer maps these general terms to the specific language of each use case, so a football producer sees "attacking third" in their interface, but the underlying format is universal.
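As a concrete illustration of how a runtime might consume this environment block, here is a minimal Python sketch (the helper name and structure are assumptions for illustration, not part of the spec) that maps the focus entity's x-coordinate to a zone id:

```python
def resolve_zone(x, zones):
    """Return the id of the first zone whose x-bounds contain the position."""
    for zone in zones:
        b = zone["bounds"]
        if b["x_min"] <= x <= b["x_max"]:
            return zone["id"]
    return None  # outside all zones (e.g. ball out of play)

# Zones copied from the environment block above
zones = [
    {"id": "zone_primary_neg", "bounds": {"x_min": -52.5, "x_max": -17.5}},
    {"id": "zone_middle",      "bounds": {"x_min": -17.5, "x_max": 17.5}},
    {"id": "zone_primary_pos", "bounds": {"x_min": 17.5,  "x_max": 52.5}},
]

resolve_zone(-30.0, zones)  # ball deep in the negative attacking third
```

A real runtime would also apply the stability settings the Playbook declares before letting a zone change retrigger anything.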
Entities: What the World Contains
Q: After the environment?
Entity types. What lives inside that space.
The first is the focus entity, the primary thing the system is tracking. In football, that is the ball. It has a position, a velocity, a status. The system knows where it is, how fast it is moving, and in which direction.
The second is the dynamic agent, any autonomous entity in the environment. In football, those are the players. Each one has a position, velocity, classification, and group membership. The domain extension adds team, jersey number, player name. The system knows not just that there are twenty-two people on the pitch, but which team each belongs to, which one has the ball, which one is the goalkeeper, which one is the last defender.
The third is the static landmark, fixed objects that matter for spatial reasoning. The goalposts. The corner flags. These do not move, but their positions are critical. When the ball is near the goal, the boundary camera needs to frame the shot with the goalpost visible. That is goalpost awareness, and it comes from the static landmark definition in the Playbook.
Q: So the Playbook tells the system what to pay attention to?
It tells the system what exists and what matters. Here, this is how a football Playbook declares its entities:
{
"entity_types": {
"focus_entity": {
"description": "Primary entity of interest.",
"_domain_extensions": {
"broadcasting": { "maps_to": "ball" }
}
},
"dynamic_agent": {
"description": "Autonomous agents in the environment.",
"properties": ["position", "velocity", "classification", "group_id"],
"_domain_extensions": {
"broadcasting": {
"maps_to": "player",
"extra_props": ["team", "jersey_number", "player_name"]
}
}
},
"static_landmark": {
"description": "Fixed reference objects in the environment.",
"_domain_extensions": {
"broadcasting": { "maps_to": ["goalpost", "corner_flag"] }
}
}
}
}

The focus_entity is the ball, the primary thing the cameras follow. The dynamic_agent is a player, with team, jersey number, and name added through the _domain_extensions layer. The static_landmark is the goalposts and corner flags. In a different domain (say, port security) the focus entity becomes the alert trigger, the dynamic agents become persons and vehicles, and the static landmarks become gates and dock edges. Same structure, different meaning.
The focus entity is what the cameras track by default. The dynamic agents provide the context: who is near the ball, who is making a run, who is the last defender. The static landmarks are the spatial anchors: the goals, the boundaries, the reference points that give the viewer a sense of place.
Dr. Bhagyashree Lad
Head of AI Vision
“A football match is one of the most demanding spatial orchestration problems you can find. Twenty-two agents, a focus entity moving at 120 km/h, complex rules, and a viewing audience that expects broadcast framing perfection, sixty times per second.”
Why Football Is the Hardest Problem
Q: You have said before that football is uniquely difficult for AI. Why?
Because football is one of the most demanding spatial orchestration problems you can find.
Start with speed. The ball can travel at 120 km/h. Players reach 35 km/h and change direction in fractions of a second. The entire dynamic of the game (who is attacking, who is defending, where the danger is) can invert in a single pass. A long ball from a defender puts the striker through on goal, and in two seconds the relevant camera shifts from a wide midfield view to a tight boundary shot of the goalkeeper facing a one-on-one. Every camera needs to react before the action, not after.
Then add complexity. Twenty-two players. Two teams. Each team has a formation, but formations are fluid. They compress, expand, rotate. The goalkeeper has different rules from the outfield players. The last defender defines the offside line, which matters for camera positioning. The player in possession is the most important dynamic agent at any moment, but possession changes dozens of times per minute.
Then add rules. Football has restarts: goal kicks, corner kicks, free kicks, throw-ins, penalty kicks. Each restart has a different spatial pattern. A corner kick clusters twelve players in a twenty-square-meter area. A free kick creates a wall of defenders ten meters from the ball. A penalty isolates the shooter and goalkeeper with everyone else behind the ball. Each requires a completely different camera behavior.
And then add the broadcasting standard. Professional broadcast has a visual grammar that has been refined over decades. The main camera sits at center pitch, elevated, showing the tactical picture. When play moves toward a goal, the frame needs leading space, extra room in the direction of travel, toward the goal being attacked, so the viewer's eye is drawn forward into the action. The goal camera needs to include the goalpost in frame to anchor the viewer spatially. Shot transitions need to be smooth, no jarring cuts, no sudden zooms. A close-up of a player's reaction needs to arrive before the moment passes, not after.
All of this must happen at sixty frames per second, on edge hardware, with zero cloud dependency, in real time. No second chances.
Q: And the Playbook encodes all of this?
Every bit of it. In a JSON file that a human can read.
Contexts: When Everything Changes
Q: How does the Playbook express "the ball just entered the attacking third, change everything"?
Through contexts. A context is a declared situation with a trigger, a priority, and a set of camera assignments.
Here is the attacking context, exactly as it appears in the Playbook:
{
"id": "ctx_primary_zone_action",
"name": "Primary Zone Action",
"priority": 80,
"trigger": {
"type": "composite",
"operator": "OR",
"conditions": [
{ "type": "zone", "subject": "focus_entity", "zone_ref": "zone_primary_neg" },
{ "type": "zone", "subject": "focus_entity", "zone_ref": "zone_primary_pos" }
],
"stability": {
"dwell_ms": 150,
"hysteresis_m": 1.0
}
},
"assignments": {
"PRIMARY_TACTICAL_VIEW": "intent_primary_wide_focus",
"CLOSE_ACTION_FOLLOW": "intent_primary_tight_follow",
"BOUNDARY_THREAT": "intent_boundary_action_focus",
"LATERAL_VIEW": "intent_defensive_hold"
}
}

Read it from top to bottom. The trigger says: when the focus entity enters zone primary negative OR zone primary positive (meaning the ball is in either attacking third) with a stability dwell of 150 milliseconds and a spatial hysteresis buffer of one meter, activate this context at priority 80. Then the assignments block maps each sensor capability to an intent: the primary tactical camera gets one behavior, the close action camera gets another, the boundary camera gets another.
One hundred and fifty milliseconds. That is the dwell time, how long the condition must be true before the context activates. It exists to prevent flickering. If the ball brushes the edge of the zone and bounces back, you do not want all six cameras to switch modes for a split second. But 150 milliseconds is fast enough that the system reacts before the viewer notices any delay.
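The dwell behavior can be sketched in a few lines. This is a hypothetical debounce helper for illustration, not the production trigger evaluator; the hysteresis_m buffer, which widens the zone boundary on exit, is omitted for brevity:

```python
class StableTrigger:
    """Debounce a boolean condition: it must hold continuously
    for dwell_ms before the trigger is considered stable."""

    def __init__(self, dwell_ms):
        self.dwell_ms = dwell_ms
        self._true_since = None  # timestamp when the condition became true

    def update(self, condition, now_ms):
        if not condition:
            self._true_since = None  # condition broke; reset the clock
            return False
        if self._true_since is None:
            self._true_since = now_ms
        return (now_ms - self._true_since) >= self.dwell_ms

trig = StableTrigger(dwell_ms=150)
trig.update(True, 0)    # ball just entered the zone: not yet stable
trig.update(True, 100)  # still inside, 100 ms elapsed: still waiting
trig.update(True, 160)  # 160 ms elapsed: the context can activate
```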
Q: And when the ball is in midfield?
The midfield context activates instead, at priority 60. Lower priority than the attack context, which means if both triggers are somehow true simultaneously, the attack context wins. The midfield context uses wider framing, more tactical. Camera one gets a very wide view, a coverage target of 0.15, meaning the subjects fill only 15% of the frame and the viewer sees the full tactical shape. The action camera still follows the ball carrier, but with more breathing room.
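The midfield context itself isn't reproduced in this interview. Following the same schema as the attack context above, a plausible sketch looks like this (the id and intent names here are illustrative, not quoted from the actual Playbook):

```json
{
  "id": "ctx_middle_zone_play",
  "name": "Middle Zone Play",
  "priority": 60,
  "trigger": {
    "type": "zone",
    "subject": "focus_entity",
    "zone_ref": "zone_middle",
    "stability": { "dwell_ms": 150, "hysteresis_m": 1.0 }
  },
  "assignments": {
    "PRIMARY_TACTICAL_VIEW": "intent_tactical_very_wide",
    "CLOSE_ACTION_FOLLOW": "intent_midfield_follow"
  }
}
```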
Q: What happens when the ball goes from midfield to attack? How does the system transition?
The attack context's trigger evaluates to true. The stability dwell of 150 milliseconds passes. The attack context activates at priority 80, suppressing the midfield context at priority 60. Each camera receives its new intent assignment from the attack context. The transition happens with a configured easing (200 milliseconds, ease-out) so the cameras move smoothly to their new framing, not abruptly.
And there is an exit trigger. The attack context stays active until the ball returns past the 16-meter line back toward midfield, at which point the midfield context takes over again.
Context transitions in football happen in fractions of a second, but must feel invisible to the viewer. The Playbook controls this through stability dwell times (preventing false triggers), priority ordering (ensuring the right context wins), and transition easing (smoothing camera movements between states). All of these are declared parameters, not code.
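The priority ordering is simple to express. A minimal sketch (hypothetical structure; the real triggers are the declarative condition trees shown above, not lambdas): among all contexts whose triggers currently evaluate true, the highest priority wins.

```python
def select_context(contexts, world):
    """Pick the highest-priority context whose trigger fires this tick."""
    firing = [c for c in contexts if c["trigger"](world)]
    return max(firing, key=lambda c: c["priority"]) if firing else None

contexts = [
    {"id": "ctx_middle_zone", "priority": 60,
     "trigger": lambda w: abs(w["ball_x"]) <= 17.5},
    {"id": "ctx_primary_zone_action", "priority": 80,
     "trigger": lambda w: abs(w["ball_x"]) > 17.5},
]

select_context(contexts, {"ball_x": -30.0})["id"]  # attack context wins
```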
Intents: What Each Camera Actually Does
Q: Once a context activates, what happens to each individual camera?
The context maps sensor capabilities to intents. An intent is a declared behavior for a single camera: what it should frame, how tightly, with how much leading space, and how smooth its movements should be.
Take the attack context. It assigns four intents. Look at the difference between the wide tactical shot and the tight action shot:
{
"intent_primary_wide_focus": {
"display_name": "Primary Wide Focus",
"method": "dynamic_compositional_bounding_box",
"subjects": ["focus_entity", "nearest_dynamic_agent_to_focus", "possessing_agent"],
"target_coverage": 0.20,
"leading_space_factor": 0.12,
"smoothing_alpha": 0.15
}
}

{
"intent_primary_tight_follow": {
"display_name": "Primary Tight Follow",
"method": "dynamic_compositional_bounding_box",
"subjects": ["focus_entity", "possessing_agent"],
"target_coverage": 0.40,
"leading_space_factor": 0.05,
"smoothing_alpha": 0.20
}
}

The wide shot frames three subjects (the ball, the nearest player, and the player in possession) with a target coverage of 0.20. That 0.20 means the subjects fill 20% of the frame. It is a wide shot. The viewer sees the tactical picture: the attackers, the defensive line, the space between them. The leading space factor is 0.12, meaning 12% of the frame is reserved as empty space ahead of the direction of play, pulling the viewer's eye toward the goal being attacked.
The tight shot frames only two subjects (the ball and the possessing player) at 0.40 coverage. Twice as tight. The ball carrier fills a meaningful portion of the frame. Leading space drops to 0.05, less look-ahead because this shot is about the player, not the space. And smoothing_alpha is higher (0.20 versus 0.15) which means the camera tracks more responsively. A tight shot cannot afford to lag behind a player who changes direction.
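One plausible reading of these parameters, reduced to one dimension (an illustrative sketch, not the actual dynamic_compositional_bounding_box method): size the frame so the subjects fill target_coverage of it, then shift it so a leading_space_factor share of the frame lies ahead of play. smoothing_alpha would then blend each new window toward the previous one, EMA-style.

```python
def frame_window(subject_xs, target_coverage, leading_space_factor, attack_dir):
    """1-D framing sketch: returns the (left, right) edges of the frame."""
    lo, hi = min(subject_xs), max(subject_xs)
    span = max(hi - lo, 1.0)        # guard against a zero-width subject box
    width = span / target_coverage  # subjects occupy target_coverage of the frame
    center = (lo + hi) / 2.0
    # Shift the frame toward the goal being attacked to create leading space
    center += attack_dir * leading_space_factor * width / 2.0
    return (center - width / 2.0, center + width / 2.0)

# Tight follow: ball at x=20, possessing agent at x=18, attacking +X
frame_window([20.0, 18.0], target_coverage=0.40,
             leading_space_factor=0.05, attack_dir=+1)
```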
Then the boundary camera (the goal camera on the side being attacked) gets a third intent:
{
"intent_boundary_action_focus": {
"display_name": "Boundary Action Focus",
"method": "dynamic_compositional_bounding_box",
"subjects": ["focus_entity", "last_line_agent", "nearest_static_landmark"],
"target_coverage": 0.35
}
}

Three subjects: the ball, the last defender, and the nearest static landmark, which resolves to the goalpost. The goalpost is in frame. The ball is in frame. The last defender is in frame. The viewer sees the spatial relationship between all three: how close the attack is to the goal, where the defensive line is set, whether there is a gap. That is goalpost awareness, expressed as a parameter.
Q: And the goal camera on the other side?
It gets a different intent entirely: keeper context. It is watching the goalkeeper and the defensive line on the opposite side, ready in case play suddenly switches direction with a long ball.
Q: How does the system know which goal camera is on the attacking side?
Parametric resolvers. This is one of the most elegant parts of the Playbook:
{
"parametric_resolvers": {
"side": {
"derive": "nearest_boundary_to_focus_entity",
"values": ["NEGATIVE", "POSITIVE"],
"tie_breakers": ["focus_entity_velocity_x_sign", "last_known_side"]
},
"boundary_threat_role": {
"derive": "alias",
"source": "side",
"map": {
"NEGATIVE": "NEGATIVE_BOUNDARY_THREAT",
"POSITIVE": "POSITIVE_BOUNDARY_THREAT"
}
}
}
}

Every tick (every sixtieth of a second) the system resolves which side the play is on. It checks which boundary the focus entity is nearest to. If the ball is closer to the negative-X boundary, side resolves to NEGATIVE. If it is closer to the positive-X boundary, POSITIVE. And when the ball is near the center? The tie-breakers kick in: first the ball's velocity direction, then the last known side. No ambiguity, no flicker.
The beauty is that the context assignment never says "left goal camera" or "right goal camera." It says BOUNDARY_THREAT. And the boundary_threat_role alias resolver maps that to NEGATIVE_BOUNDARY_THREAT or POSITIVE_BOUNDARY_THREAT depending on where the ball is. The same Playbook works regardless of which team is attacking which goal. Swap sides at halftime? The resolver handles it. The Playbook does not change.
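The resolver cascade is easy to sketch. A minimal version in Python (the center_band threshold is an assumption introduced here; the Playbook only declares the derive rule and the tie-breaker order):

```python
def resolve_side(ball_x, ball_vx, last_side, center_band=0.5):
    """Resolve the play side, falling through the declared tie-breakers."""
    if ball_x < -center_band:
        return "NEGATIVE"
    if ball_x > center_band:
        return "POSITIVE"
    # Tie-breaker 1: the sign of the ball's velocity along X
    if ball_vx < 0:
        return "NEGATIVE"
    if ball_vx > 0:
        return "POSITIVE"
    # Tie-breaker 2: stick with the last known side
    return last_side

resolve_side(ball_x=0.0, ball_vx=-3.0, last_side="POSITIVE")  # -> "NEGATIVE"
```

The boundary_threat_role alias then reduces to a dictionary lookup on the result, exactly as the map block declares.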
Q: What about team awareness? The system needs to know who is attacking and who is defending.
The dynamic agent entity type carries a group ID, team membership. The domain extension adds which team is "home" and which is "away." Combined with the parametric resolver for side, the system knows: the home team is attacking the positive-X goal, the ball is in zone primary positive, therefore the boundary threat is on the positive side, and the designated agent on that side (the defending goalkeeper) should be tracked by the goal camera.
Every camera assignment takes all of this into account. Which team has the ball. Which direction they are attacking. Where the defensive line is set. Where the goalposts are. The Playbook doesn't recalculate this logic; it declares the relationships, and the runtime resolves them sixty times per second against the live world state.
The Director Always Has the Last Word
Q: What about the human director? Do they just watch?
They have the final word. Always.
The Playbook has a section called operator intents. These are overrides that a human director can trigger at any moment through the OZ Studio interface.
The most common is follow_entity:
{
"follow_entity": {
"action": "track_entity_at_coordinates",
"default_ttl_level": "L7",
"payload_schema": {
"target_x": "float, World X coordinate from operator UI",
"target_z": "float, World Z coordinate from operator UI"
},
"max_search_distance_m": 5.0,
"coverage_ratio": 0.6,
"lock_on_first_tick": true
}
}

The director sees a player making an interesting run, maybe a substitute warming up, maybe a striker arguing with the referee. They tap on the player in the top-down view on their screen. The system receives the world coordinates from the tap, finds the nearest dynamic agent within 5 meters, and publishes a follow-entity intent with a TTL level of L7, which maps to 15 seconds. The camera locks onto that player at a coverage ratio of 0.6, tight framing, filling 60% of the frame.
After 15 seconds, the TTL expires. The intent status changes from active to expired. And the camera smoothly returns to whatever the active automated context was assigning it to do. The transition back is gentler (350 milliseconds with ease-in-out) because the shift from manual override back to automated should feel graceful, not abrupt.
If the director taps again before the TTL expires, the new intent supersedes the old one. The TTL resets. The camera stays under manual control as long as the director keeps engaging.
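The lifecycle described here can be sketched as a small state machine (class and method names are hypothetical, chosen for illustration):

```python
class OperatorIntent:
    """Operator-intent lifecycle sketch: active -> expired or superseded."""

    def __init__(self, target_id, ttl_s, now_s):
        self.target_id = target_id
        self.expires_at = now_s + ttl_s
        self.status = "active"

    def tick(self, now_s):
        if self.status == "active" and now_s >= self.expires_at:
            self.status = "expired"  # camera returns to the automated context
        return self.status

    def supersede(self):
        self.status = "superseded"   # a newer operator tap replaced this one

intent = OperatorIntent("player_9", ttl_s=15.0, now_s=0.0)
intent.tick(10.0)  # "active": still under manual control
intent.tick(15.0)  # "expired": TTL elapsed, automated context resumes
```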
Q: So the system runs autonomously, and the director intervenes only when they want something specific?
Exactly. The Playbook handles the 95%, the standard coverage that professional broadcasting requires. Wide tactical for midfield play. Tight follow during attacks. Goal camera with boundary context. Keeper coverage. Side-aware framing. Leading space. All of it automated, all of it running on the Playbook's declared behavior.
The director focuses on the 5% that requires human judgment. The emotional close-up after a missed chance. The crowd reaction. The manager on the touchline. The moments that make a broadcast compelling aren't algorithmic; they're editorial. The Playbook frees the director to be an editor, not an operator.
Operator intents follow a lifecycle: proposed, active, expired, or superseded. The TTL system ensures that manual overrides are always temporary. The system never gets stuck in a manual state. If the director walks away, every camera returns to its automated context when the last override expires.
Why Not Just Reinforcement Learning?
Q: With all the AI you've described (the perception, the tracking, the fusion) why not just do direct reinforcement learning on all of this? Train a model end to end: here is a live match, here is what a great broadcast looks like, learn.
That is the most important question, and the answer explains why the Playbook has to exist.
Think about autonomous driving. Before you can train a car to drive itself through reinforcement learning (learning how a human reacts to traffic, pedestrians, weather) you first have to build the foundational layer. The accelerator pedal has to work. The brakes have to respond. The steering wheel has to turn the wheels predictably. The car has to know what a lane is, what a traffic light means, what a speed limit sign looks like. Those are not things you learn through reinforcement learning. Those are the preconditions that make reinforcement learning possible.
If you skip that foundation and try to learn everything from scratch, from raw sensor data to steering angle, you get a system that sometimes works brilliantly and sometimes drives into a wall, and you have no idea why. You cannot debug it. You cannot adjust it. You cannot tell it "be more cautious at intersections" without retraining the entire model.
The Playbook is that foundational layer for spatial orchestration. It defines what the space is: zones, boundaries, landmarks. It defines what the entities are (the focus entity, the dynamic agents, the static references). It defines the logic of the environment: when the ball enters the attacking third, that is a different situation from midfield play, and the cameras should respond differently. These are the accelerator, the brakes, the steering wheel. They are the structured, declarative foundation that makes everything above it possible.
Q: So the Playbook comes first, and reinforcement learning comes on top?
Exactly. Once the foundational logic is established (the system knows what zones are, what entities matter, what a context transition means, how intents map to camera behavior) then reinforcement learning has something meaningful to train on. The RL learns the storytelling: the timing of a camera cut, the instinct to widen before a counter-attack develops, the feeling that a close-up is needed after a near-miss. It learns how a human director uses the controls. But the controls themselves (the zones, the intents, the coverage parameters) are defined by the Playbook.
Without the Playbook, you are asking the RL to learn physics, spatial logic, broadcast grammar, and editorial style all at once from raw pixels. That is like asking a neural network to learn driving by showing it dashboard camera footage without telling it what a steering wheel is. It might eventually work. But it will be brittle, opaque, and impossible to adjust.
Q: And different leagues need different adjustments.
That is the second reason. There is no single correct way to produce football.
Every league has a production identity. Every broadcaster has a visual language. Women's football coverage uses significantly more close-ups, tighter framing on individual players, more reaction shots, more emotional storytelling. The numbers are different: higher target_coverage values, more frequent context switches to single-entity framing.
Men's football in Europe tends toward strategic framing: wider tactical shots, more time on the formation, letting the viewer read the shape of the game. In the US, the same sport is produced with notably more focus on the goalkeeper, more cuts to the keeper's positioning during build-up play, tighter boundary camera shots. A domestic league match is not produced like a cup final. Each has different pacing, different emphasis, different framing conventions.
The Playbook makes all of this configurable. Look at the difference. Same intent, same structure, two completely different production identities:
Strategic coverage, wide tactical framing:
{
"intent_tactical_wide": {
"subjects": ["focus_entity", "nearest_dynamic_agent_to_focus", "possessing_agent"],
"target_coverage": 0.15,
"leading_space_factor": 0.10,
"smoothing_alpha": 0.15
}
}

Intimate coverage, more close-ups, tighter framing:
{
"intent_tactical_wide": {
"subjects": ["focus_entity", "possessing_agent"],
"target_coverage": 0.30,
"leading_space_factor": 0.16,
"smoothing_alpha": 0.18
}
}

Same intent name. Same structure. But the first Playbook produces wide, strategic coverage: subjects fill only 15% of the frame, three entities tracked, the viewer reads the shape of the game. The second produces tighter, more intimate coverage: subjects fill 30% of the frame, focused on two entities, more leading space to draw the eye forward. A completely different visual experience from changing a few numbers.
You do not retrain the RL for each league. You do not retrain it for each broadcaster. You change the Playbook. Swap a JSON file. The same AI, the same perception pipeline, the same edge hardware, producing a completely different visual identity because the Playbook's parameters have changed. The RL provides the instinct. The Playbook provides the identity.
The Playbook is the foundational logic that makes reinforcement learning possible, like building the steering, brakes, and accelerator before training a car to drive. RL learns the storytelling instinct; the Playbook defines the spatial structure, the entities, and the configurable parameters that give each league and broadcaster a distinct production identity. Foundation first, then learning.
The Moment Everything Clicks
Q: You mentioned earlier that the core schema avoids football terminology. You said the format is not about football. What did you mean?
Let me show you something.
She opens a second JSON file on her laptop.
This is a different Playbook. Port dock surveillance. A shipping port with container yards, two docks, an entry corridor, perimeter fencing. Six cameras: PTZ sensors on dock cranes, at the gates, on the perimeter.
Look at the structure. It is identical. Environment with an origin, bounds, zones, landmarks. Entity types: focus entity, dynamic agent, static landmark. Contexts with triggers and priorities. Intents with methods and coverage targets. Operator intents with TTL lifecycle.
But the domain extensions are different. Look at the entity types side by side. Football:
{
"focus_entity": {
"_domain_extensions": {
"broadcasting": { "maps_to": "ball" }
}
},
"dynamic_agent": {
"_domain_extensions": {
"broadcasting": { "maps_to": "player", "extra_props": ["team", "jersey_number"] }
}
}
}

Port surveillance:
{
"focus_entity": {
"_domain_extensions": {
"surveillance": { "maps_to": "alert_trigger_entity" }
}
},
"dynamic_agent": {
"_domain_extensions": {
"surveillance": {
"classification_labels": ["person", "vehicle", "forklift", "unknown"],
"extra_props": ["badge_detected", "threat_level", "dwell_time_s"]
}
}
}
}

The core is identical. focus_entity and dynamic_agent. Only the extensions differ. The zones aren't attacking thirds; they're restricted dock areas, container yards, entry corridors. The phases aren't match halves; they're day shift, night shift, lockdown.
Q: The same format?
The same format. The same runtime. The same system.
And look at the contexts:
{
"id": "ctx_intrusion_alert",
"name": "Intrusion Alert",
"priority": 95,
"trigger": {
"type": "composite",
"operator": "AND",
"conditions": [
{ "type": "zone", "subject": "dynamic_agent", "zone_ref": "zone_dock_1" },
{ "type": "threshold", "subject": "dynamic_agent",
"condition": "classification_not_equals", "value": "forklift" }
],
"stability": { "dwell_ms": 3000, "hysteresis_m": 5.0 }
},
"assignments": {
"NEAREST_OVERVIEW": "intent_alert_wide_context",
"NEAREST_PTZ": "intent_suspect_tight_track",
"GATE_MONITOR": "intent_gate_lockdown_view"
}
}

"Intrusion Alert" at priority 95, higher than any football context, because an unauthorized person in a restricted dock zone is more urgent than a ball entering the attacking third. The trigger: a dynamic agent enters the restricted dock zone, classified as not-forklift (because forklifts are authorized) with a stability dwell of 3,000 milliseconds. Three full seconds before the alert fires.
Q: Three seconds? In football you use 150 milliseconds.
Because the operating tempo is completely different. Football changes in fractions of a second. A port intrusion unfolds over minutes. A person walking toward a restricted dock doesn't require instant reaction; it requires confirmation. Three seconds of confirmed presence in the zone before escalation. That prevents false alerts from someone walking near the boundary and turning away.
And the TTLs are longer. The follow-entity intent in the surveillance playbook has a TTL of 120 seconds (two minutes) because tracking a suspect across a port takes longer than tracking a player during a set piece. The patrol sweep intent cycles continuously through waypoints with 15-second dwell at each position. Different tempo. Different domain. Same format.
Q: And someone building a port surveillance system would never see a football term?
Never. The core loader ignores domain extensions it does not recognize. A surveillance operator's Playbook shows zones called "Dock 1: Restricted" and entities classified as "person" or "vehicle." The word "ball" doesn't exist in their system. But the underlying architecture (contexts, intents, triggers, resolvers, TTL lifecycle) is identical.
That is the design. Domain-neutral core. Domain-specific extensions. One format for every sensor array in every environment.
The Playbook's core vocabulary (focus_entity, dynamic_agent, primary zone) describes any spatial orchestration scenario. The _domain_extensions layer maps these universal concepts to the specific language of each use case: football, surveillance, robotics, sports analytics. The core runtime ignores extensions it doesn't recognize, enabling new domains to be added without changing the format.
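The loader behavior described above, ignoring extensions the core does not recognize, might look like this. The key list is illustrative; the real core vocabulary is defined by the Playbook schema, not by this sketch:

```python
import json

# Keys the domain-neutral core understands. Everything else, including
# the _domain_extensions layer, is invisible to this loader.
# (Illustrative list; the actual core sections come from the schema.)
CORE_KEYS = {"environment", "entities", "contexts", "intents", "operator_overrides"}

def load_core(playbook_json: str) -> dict:
    """Parse a Playbook and keep only the domain-neutral core sections."""
    doc = json.loads(playbook_json)
    return {k: v for k, v in doc.items() if k in CORE_KEYS}

surveillance = """{
  "environment": {},
  "contexts": [],
  "_domain_extensions": {
    "surveillance": {"zone_labels": {"zone_a": "Dock 1: Restricted"}}
  }
}"""
core = load_core(surveillance)
# The core runtime never sees the surveillance-specific vocabulary;
# the extension layer only matters to tools that opt into that domain.
```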
A Skill You Can Copy and Paste#
Q: If the format is portable across domains, is it also portable across venues?
That is the whole point.
A Playbook is a skill. A self-contained, portable unit of behavior. Like a recipe. A league that develops exceptional attack coverage (beautiful framing, perfect leading space, smooth transitions) can export that Playbook as a JSON file. Another league, another continent, can import it. Load it. Validate it against the schema. Test it in shadow mode (the system runs the Playbook's logic alongside the live production, comparing outputs without controlling the cameras). If it performs well, deploy it. One file. No code. No engineering.
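Shadow mode, as described, reduces to computing the imported Playbook's assignments every tick without actuating them, then logging divergence from the live production. A minimal sketch (the intent names and the function are invented for illustration):

```python
def shadow_compare(live_assignments: dict, shadow_assignments: dict) -> dict:
    """Return, per sensor role, where the shadow Playbook's proposed
    assignment diverges from the live production. The shadow side is
    never actuated; this is purely an evaluation log. Illustrative only."""
    return {
        role: (live_assignments.get(role), shadow_assignments.get(role))
        for role in set(live_assignments) | set(shadow_assignments)
        if live_assignments.get(role) != shadow_assignments.get(role)
    }

# Hypothetical tick: the imported Playbook agrees on the action camera
# but proposes a different main-camera intent.
diff = shadow_compare(
    {"ACTION": "intent_primary_tight_follow", "MAIN": "intent_wide_tactical"},
    {"ACTION": "intent_primary_tight_follow", "MAIN": "intent_wide_cinematic"},
)
```

Accumulating these diffs over a full match is what tells a producer whether the imported skill is safe to deploy.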
Q: Could different sports use the same format?
A basketball playbook would define a smaller environment (28 by 15 meters instead of 105 by 68). Different zones. Faster transitions, since basketball possession changes are even more rapid than football. Different entity counts. But the structure is identical: environment, entities, contexts, intents, operator overrides.
A robotics mission uses the same format. The focus entity becomes the lead robot. The dynamic agents become obstacles or targets. The contexts become mission phases. The intents become patrol patterns, inspection routines, return-to-base behaviors. Same architecture. Different domain extensions.
Q: You described this as "HTML for world behaviors" in the spec. What does that analogy mean?
HTML is a document that a browser renders visually. The HTML does not know what browser will render it, what screen size, what operating system. It declares structure and content. The browser handles the rendering.
The Playbook is a document that the OZ runtime executes physically. The Playbook does not know what cameras are installed, what lenses they have, what gimbals they use. It declares situations and behaviors. The runtime (the perception layer, the world model, the camera control system) handles the execution.
And just like HTML enabled a network effect where anyone could create a web page and anyone else could view it, the Playbook format is designed to enable a network effect where anyone can create a behavior specification and any OZ-compatible venue can execute it. Playbooks become shareable skills. A library of behaviors. A community of producers who trade coverage strategies the way developers share code.
Three Layers, One System#
Q: Where does the Playbook sit in the broader system architecture?
Three layers.
The bottom layer is OpenUSD, the physical world model. Venue geometry. Camera positions. Lens characteristics. Physics. This is where the system knows that Camera 1 is mounted at 7.2 meters height, 36.6 meters from center, with a 25x optical zoom. The Playbook never touches this layer. It does not know where the cameras are physically mounted.
The middle layer is Arcflow, our graph engine. This is where the live world state lives. Every entity (every player, the ball, every spatial relationship) exists as a node in the graph. Causal edges connect them: this player possesses the ball. This player is nearest to the ball. The ball is in this zone. The graph updates sixty times per second from our perception pipeline.
The top layer is the Playbook. It reads the world state from Arcflow, evaluates its context triggers, resolves its parametric values, assigns intents to sensor roles, and publishes those assignments to the camera control system. The Playbook is the behavior overlay, the part that says "given what the world looks like right now, here is what each camera should do."
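One evaluation tick of that top layer can be sketched in a few lines. The trigger shape and field names below are simplified stand-ins for the format the text describes, not the real schema:

```python
def evaluate_tick(contexts: list, world: dict) -> dict:
    """One Playbook tick, sketched: find the contexts whose trigger
    matches the current world state, pick the highest-priority winner,
    and return its sensor-role assignments. Zone-only triggers here;
    the real format also supports thresholds and composite conditions."""
    active = [c for c in contexts if world.get("zone") == c["trigger_zone"]]
    if not active:
        return {}
    winner = max(active, key=lambda c: c["priority"])
    return winner["assignments"]

# Hypothetical contexts: a default midfield picture and a higher-priority
# attacking-third context that tightens the action coverage.
contexts = [
    {"trigger_zone": "midfield", "priority": 50,
     "assignments": {"MAIN": "intent_wide_tactical"}},
    {"trigger_zone": "attacking_third", "priority": 80,
     "assignments": {"MAIN": "intent_leading_space",
                     "ACTION": "intent_tight_follow"}},
]
out = evaluate_tick(contexts, {"zone": "attacking_third"})
```

The published `out` mapping is what the camera control system consumes; the Playbook itself never addresses a physical camera.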
When a Playbook is loaded, its contents are materialized as nodes and edges in the Arcflow graph:
┌─────────────────────────────────────────────────┐
│ Playbook JSON (Behavior Overlay) │
│ - Declarative intents, triggers, contexts │
│ - Portable "skill" files │
├─────────────────────────────────────────────────┤
│ Arcflow LPG (Causal / Semantic) │
│ - Entity state, relationships, causality │
│ - Queried via graph queries │
├─────────────────────────────────────────────────┤
│ OpenUSD (Physical World) │
│ - Geometry, sensor transforms, physics │
│ - Venue shape, camera positions, lens models │
└─────────────────────────────────────────────────┘
Each context becomes a graph node. Each intent becomes a node. The causal relationships (this context triggers this intent for this sensor) become edges. You can query the graph and ask: why is Camera 3 currently in tight follow mode? The answer is a causal chain: the ball is in zone primary positive, which activated context primary zone action, which assigned intent primary tight follow to the action sensor role, which maps to Camera 3.
Every decision is traceable. Every behavior is explainable. That matters, not just for debugging, but for trust. A producer can ask the system why it made a specific camera choice, and the system can answer in terms of zones, entities, and contexts.
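That explainability query amounts to walking cause edges backward from a decision to its root. A toy sketch, with a dictionary standing in for the Arcflow graph and node names adapted from the example above:

```python
# Toy stand-in for the causal edges in the graph: each decision node
# points at the node that caused it. Names are illustrative.
CAUSES = {
    "camera_3:intent_primary_tight_follow": "ctx_primary_zone_action",
    "ctx_primary_zone_action": "ball_in_zone_primary_pos",
}

def explain(node: str) -> list:
    """Walk cause edges backward from a decision to its root world fact,
    yielding the full causal chain a producer would be shown."""
    chain = [node]
    while chain[-1] in CAUSES:
        chain.append(CAUSES[chain[-1]])
    return chain
```

Asking "why is Camera 3 in tight follow mode?" is then a single traversal ending at the world fact that started the chain.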
Everything We Have Built Leads Here#
Q: How does this connect to everything you described in your previous interviews (the perception models, the compression, the graceful degradation)?
Every AI model I've described in previous interviews (the detection, the tracking, the multi-camera fusion, the compression, the graceful degradation under smoke and fog) exists to produce one thing: the world state that the Playbook evaluates.
Every frame, sixty times per second, our perception pipeline updates the position and velocity of every entity in the venue. The fusion layer merges what six cameras see into a single coherent model. The tracking layer maintains identity continuity across occlusions. The compression layer keeps all of this running within the thermal and power constraints of our edge hardware. The graceful-degradation layer ensures the world state remains continuous even when cameras are temporarily compromised.
And then the Playbook reads that world state and decides what every camera should do next.
Perception to world model to Playbook to action. That is the full loop. It runs at the venue, on the edge, with zero cloud dependency. And every match, every single match, generates training data. Every context activation, every intent assignment, every operator override is logged with timestamps and world state snapshots. That data feeds back into model training, intent prediction, and eventually, Playbook generation.
Q: Playbook generation?
The format is simple enough that an AI can learn to write it. A vision-language model that has observed thousands of matches (seen the world state, seen the operator decisions, seen which framings the directors chose) can learn to author Playbooks. Not today. But the format is designed for it. Every parameter is named. Every relationship is declared. The entire behavior specification is a structured document that a machine can read, understand, and eventually create.
The Road Ahead#
Q: Where does the Playbook go from here?
Three directions. Each one deepens what the system can understand and act on.
The first is embedding LLM reasoning directly into the Playbook evaluation loop. Today, the context triggers are rule-based: zone checks, threshold comparisons, composite conditions. They are fast and deterministic, which is essential for real-time orchestration. But they are also limited to what you can express as spatial logic. An LLM can reason about situations that are hard to encode as zone triggers: "the tempo of the match has slowed in the last three minutes" or "one team is defending with ten players behind the ball." These are patterns that emerge from the sequence of world states over time, not from a single tick's spatial snapshot. The roadmap includes an LLM evaluation layer that can process recent world state history and propose context activations that the rule-based system cannot express.
Q: Would that not be too slow for real-time?
The LLM layer operates on a different timescale. The core Playbook evaluation runs every tick, sixteen milliseconds. The LLM layer evaluates every few seconds, looking at the broader pattern. It does not control individual camera movements. It proposes higher-level context shifts ("this is a sustained period of pressure, consider tightening coverage") which the Playbook's real-time engine then executes at full speed. Two timescales, working together.
Q: You mentioned three directions.
The second is graph-native agents. Arcflow is a causal graph. It tracks not just where things are, but why things happen. Today, the Playbook reads the graph and evaluates triggers. In the next generation, autonomous agents live inside the graph. A World Agent that continuously reasons about the spatial relationships between entities, predicts what is likely to happen next, and pre-positions cameras before the action occurs, not reacting to events, but anticipating them.
Think of a corner kick. Today, the context activates when the ball enters the set-piece zone. An agent inside the graph could recognize the attacking team's corner-kick routine from their formation pattern (where each player positions themselves relative to the goal and each other) and adjust camera framing before the corner is even taken. The agent has access to the full causal history: every corner this team has taken this season, where the delivery typically lands, which player attacks the near post, which player pulls back for the edge of the box.
Q: And the third direction?
Physics and biomechanics. The world model today tracks positions and velocities. The next layer adds physical understanding: body pose, joint angles, force estimation, contact dynamics. When a striker is about to shoot, the system doesn't just know their position. It reads their body shape: the planting foot, the angle of the hip rotation, the backswing of the striking leg. The Playbook can trigger a high-framerate biomechanics capture intent, not because the ball is in the penalty area, but because the body mechanics of a shot are detected two hundred milliseconds before the foot contacts the ball.
{
"id": "ctx_shot_biomechanics",
"name": "Shot Biomechanics Capture",
"priority": 90,
"trigger": {
"type": "composite",
"operator": "AND",
"conditions": [
{ "type": "zone", "subject": "focus_entity", "zone_ref": "zone_penalty_pos" },
{ "type": "threshold", "subject": "possessing_agent",
"condition": "pose_indicates_shot", "confidence": 0.85 }
]
},
"assignments": {
"BIOMECHANICS_PRIMARY": "intent_shooter_pose",
"BIOMECHANICS_SECONDARY": "intent_keeper_dive"
}
}
The trigger isn't just spatial ("ball in the penalty area"). It includes a pose condition: the possessing agent's body configuration indicates a shot is imminent with 85% confidence. The system catches the moment before it happens. Two cameras switch to high-framerate skeletal capture. The striker's leg swing, the goalkeeper's dive reaction, the contact point on the ball, all captured at 120 frames per second for biomechanics analysis. The same data that powers the broadcast also powers sports science.
Q: This sounds like the Playbook is evolving from a behavior specification into a reasoning system.
It is evolving from describing what to do into understanding why. And that is exactly where the entire AI industry is heading.
Physical AI and What Comes After Language Models#
Q: How does the Playbook fit into the broader direction of AI?
Most of the AI progress the world has seen in the last few years has been language: text generation, reasoning about words and code. That is extraordinary work. But there is a fundamental shift underway. The next generation of AI does not just process language. It perceives physical spaces, reasons about spatial relationships, understands physics, and acts in the real world. Physical AI.
Large language models reason about tokens. Physical AI reasons about space, time, matter, and motion. It needs to understand that a ball travels in a parabolic arc under gravity. That a player's center of mass shifts before they change direction. That a camera at seven meters height sees occluded players differently than one at twelve meters. That smoke disperses, rain changes surface reflectance, and a goalkeeper's dive takes 400 milliseconds.
This is what our World Model does. The Playbook is not a chatbot prompt. It is a behavior specification for Physical AI agents: sensors that perceive real environments, reason about what is happening in three-dimensional space, and act physically by moving cameras, adjusting focal lengths, and tracking entities through a venue.
Q: Where does OZ sit in this shift?
The three-layer architecture we built is a Physical AI stack. OpenUSD gives us the physical world representation: geometry, physics, sensor models. Arcflow gives us the causal reasoning layer: spatial relationships, temporal patterns, entity behavior. The Playbook gives us the behavior policy layer: what the system should do given what it perceives and understands.
This mirrors what the robotics industry is building for autonomous machines. A robot in a warehouse needs the same three layers: a model of the physical space, a reasoning engine that understands entity relationships, and a behavior policy that tells it what to do. The difference is that our "robot" is a camera array, our "warehouse" is a stadium, and our behavior policy runs live on broadcast television.
And the compounding advantage is the same. Every match generates training data, not just for perception models, but for the Physical AI reasoning itself. Every world state snapshot, every context activation, every operator override teaches the system more about how the physical world works and how humans want to interact with it. The simulation layer lets us replay those scenarios, generate synthetic variations, and train the next generation of agents in a controlled environment before they ever touch a live broadcast.
Physical AI is the shift from language-based reasoning to spatial, temporal, and physics-aware intelligence that perceives and acts in real environments. The OZ World Model is a Physical AI system: OpenUSD for physical world representation, Arcflow for causal reasoning, and the Playbook for behavior policy, running at the edge, in real time, on every match.
Q: Is the Playbook format relevant beyond OZ's own system?
That is the ambition. The format is designed to be proposed as an open standard, an "Embodied Intent Layer" for any Physical AI system that needs to specify behavior over a model of the physical world. Today it orchestrates camera arrays. The same format, with different domain extensions, can specify behavior for any sensor array, any autonomous agent, any system that needs to perceive a physical space and act within it.
The format stays the same: contexts, intents, triggers, extensions. But the intelligence behind the triggers deepens with each generation. Rule-based spatial logic today. LLM-assisted pattern recognition next. Graph-native causal agents after that. Physics-aware biomechanics reasoning alongside all of it. Each layer adds depth without breaking the format.
And all of it runs at the edge. All of it runs in real time. All of it runs on the same platform, the same Playbook format, the same three-layer architecture. The system that produces a football broadcast today is the same system that will reason about biomechanics and anticipate tactical patterns tomorrow. The Playbook just gets smarter.
That is the direction. A format so simple that a producer can read it, a machine can execute it, and, increasingly, an AI can reason about it. But today, the format exists. The spec is real. And every match we produce is running on it.