# Governance as Code: How We Operate Hundreds of Autonomous Edge Nodes

Management at scale fails when it depends on people remembering rules. Policies drift. Escalation chains break. Override decisions are made without context. And when you ask "who approved that change?", nobody can answer.
OZ treats management as code: SLOs, override policies, escalation rules, and audit trails are versioned, executable, and enforced by the system, not by memory.
## The problem with tribal management
In traditional operations, management knowledge lives in three places: somebody's head, a PDF nobody reads, and a Slack thread from six months ago.
This creates predictable failure modes:
- Policy drift. Each venue interprets rules slightly differently
- Escalation gaps. The on-call person does not know the threshold for waking the next tier
- Override chaos. Operators make real-time decisions without a framework, then no one can audit what happened
- Scaling drag. Every new venue needs management training because the rules are not encoded
These failures are invisible at one venue. At ten venues, they create variance. At fifty, they create risk.
## What executable governance looks like
At OZ, every management decision has a codified structure:
```yaml
# Example: Override policy for capture priority
policy:
  name: capture-priority-override
  version: 2.4.1
  scope: venue
  conditions:
    - trigger: operator-requests-manual-override
    - requires: active-session
    - timeout: 300s  # auto-revert after 5 min
  actions:
    - log: override-initiated
    - notify: noc-dashboard
    - enforce: revert-on-timeout
  audit:
    - record: operator-id, timestamp, reason, duration
    - retain: 90-days
    - accessible-by: [ops-lead, compliance]
```

This is not theoretical. Every OZ VI Venue runs with policies like this governing:
- Capture priorities. Which zones get camera attention under contention
- Override windows. How long a human override stays active before auto-revert
- Escalation thresholds. When the system alerts the NOC vs. auto-recovers
- Health boundaries. What telemetry values trigger degraded-mode vs. full-stop
## SLOs are contracts, not targets
Most organizations treat SLOs as aspirations. At OZ, they are contractual:
| SLO | Target | Enforcement |
|---|---|---|
| p99 latency | ≤120 ms | Measured per venue, published per deployment |
| Uptime | ≥99.9% | Continuous monitoring, automated recovery |
| MTTR | <5 min | Self-healing loops, remote diagnostics |
| Override audit | 100% logged | Every human intervention recorded with context |
When a venue fails an SLO, the system does not send a notification and hope someone acts. It triggers the codified response: automated recovery first, escalation second, human dispatch only when remote resolution fails.
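That response ladder can be expressed as a small, testable function. This is a hedged sketch of the pattern, not OZ's production code; the hook callables (`try_recover`, `alert_noc`, `dispatch_tech`) are hypothetical names for whatever recovery, escalation, and dispatch systems sit behind them.

```python
def handle_slo_breach(venue, slo, try_recover, alert_noc, dispatch_tech):
    """Codified breach response: automated recovery first, escalation
    second, human dispatch only when remote resolution fails."""
    if try_recover(venue, slo):            # step 1: self-healing loop
        return "recovered"
    ticket = alert_noc(venue, slo)         # step 2: escalate to the NOC
    if ticket.resolved_remotely:
        return "resolved-remotely"
    dispatch_tech(venue, ticket)           # step 3: last resort, human on site
    return "dispatched"
```

Because the ladder is code, every breach follows the same three steps in the same order, and the return value doubles as the audit outcome.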
## Versioned policies, zero drift
Every policy in the OZ operating model is versioned. When a policy changes:
- The new version is tested against historical telemetry
- It deploys to a subset of venues first (canary rollout)
- It carries a rollback path
- The change is auditable: who changed it, when, why, and what the previous version was
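The canary-plus-rollback steps above can be sketched as a single rollout function. This is a simplified illustration under assumed data shapes (venues as dicts carrying a `policy` field, a `healthy` predicate standing in for telemetry checks); it is not the actual OZ deployment pipeline.

```python
def canary_rollout(policy, venues, canary_fraction=0.1, healthy=lambda v: True):
    """Deploy a new policy version to a canary subset first; if any canary
    venue turns unhealthy, restore the previous version and stop."""
    n = max(1, int(len(venues) * canary_fraction))
    canary, rest = venues[:n], venues[n:]
    for v in canary:                           # canary deploy, keeping rollback path
        v["previous"], v["policy"] = v.get("policy"), policy
    if not all(healthy(v) for v in canary):
        for v in canary:                       # rollback path: restore prior version
            v["policy"] = v.pop("previous")
        return {"status": "rolled-back", "canary": n}
    for v in rest:                             # canaries healthy: full rollout
        v["previous"], v["policy"] = v.get("policy"), policy
    return {"status": "deployed", "canary": n}
```

Storing the prior version alongside the new one is what makes the rollback path cheap: reverting is a field swap, not a redeploy.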
This means the 100th venue runs the same governance as the 10th venue. Policy drift does not accumulate. Management quality does not degrade with scale.
## Why this changes the operating model
When management is executable:
- New venues inherit governance automatically. No training lag, no interpretation variance
- Operators focus on exceptions, not routine. The system handles the known; humans handle the novel
- Compliance is built in. Audit trails are automatic, not reconstructed after the fact
- Management scales without managers. More venues do not require more management layers
This is not about removing humans from decisions. It is about encoding the decisions humans have already made and applying them consistently: across every venue, every shift, every season.
## From documents to deployment
Traditional management produces documents. OZ produces deployable policies.
The difference is execution. A document describes what should happen. A policy enforces what does happen. When your management layer is code, governance stops being overhead and starts being infrastructure.
That is what scales.