# Governance as Code: How We Operate Hundreds of Autonomous Edge Nodes

Management at scale fails when it depends on people remembering rules. Policies drift. Escalation chains break. Override decisions are made without context. And when you ask "who approved that change?", nobody can answer.
OZ treats management as code: SLOs, override policies, escalation rules, and audit trails are versioned, executable, and enforced by the system, not by memory.
## The problem with tribal management
In traditional operations, management knowledge lives in three places: somebody's head, a PDF nobody reads, and a Slack thread from six months ago.
This creates predictable failure modes:
- Policy drift. Each venue interprets rules slightly differently
- Escalation gaps. The on-call person does not know the threshold for waking the next tier
- Override chaos. Operators make real-time decisions without a framework, then no one can audit what happened
- Scaling drag. Every new venue needs management training because the rules are not encoded
These failures are invisible at one venue. At ten venues, they create variance. At fifty, they create risk.
## What executable governance looks like
At OZ, every management decision has a codified structure:
```yaml
# Example: Override policy for capture priority
policy:
  name: capture-priority-override
  version: 2.4.1
  scope: venue
  conditions:
    - trigger: operator-requests-manual-override
    - requires: active-session
    - timeout: 300s  # auto-revert after 5 min
  actions:
    - log: override-initiated
    - notify: noc-dashboard
    - enforce: revert-on-timeout
  audit:
    - record: operator-id, timestamp, reason, duration
    - retain: 90-days
    - accessible-by: [ops-lead, compliance]
```

This is not theoretical. Every OZ VI Venue runs with policies like this governing:
- Capture priorities. Which zones get camera attention under contention
- Override windows. How long a human override stays active before auto-revert
- Escalation thresholds. When the system alerts the NOC vs. auto-recovers
- Health boundaries. What telemetry values trigger degraded-mode vs. full-stop
## SLOs are contracts, not targets
Most organizations treat SLOs as aspirations. At OZ, they are contractual:
| SLO | Target | Enforcement |
|---|---|---|
| p99 latency | ≤120 ms | Measured per venue, published per deployment |
| Uptime | ≥99.9% | Continuous monitoring, automated recovery |
| MTTR | <5 min | Self-healing loops, remote diagnostics |
| Override audit | 100% logged | Every human intervention recorded with context |
When a venue fails an SLO, the system does not send a notification and hope someone acts. It triggers the codified response: automated recovery first, escalation second, human dispatch only when remote resolution fails.
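That response ladder can be expressed as a small, testable function. This is a hedged sketch of the pattern, not OZ's production code; the hook callables (`try_recover`, `alert_noc`, `dispatch_tech`) are hypothetical names for whatever recovery, escalation, and dispatch systems sit behind them.

```python
def handle_slo_breach(venue, slo, try_recover, alert_noc, dispatch_tech):
    """Codified breach response: automated recovery first, escalation
    second, human dispatch only when remote resolution fails."""
    if try_recover(venue, slo):            # step 1: self-healing loop
        return "recovered"
    ticket = alert_noc(venue, slo)         # step 2: escalate to the NOC
    if ticket.resolved_remotely:
        return "resolved-remotely"
    dispatch_tech(venue, ticket)           # step 3: last resort, human on site
    return "dispatched"
```

Because the ladder is code, every breach follows the same three steps in the same order, and the return value doubles as the audit outcome.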
## Versioned policies, zero drift
Every policy in the OZ operating model is versioned. When a policy changes:
- The new version is tested against historical telemetry
- It deploys to a subset of venues first (canary rollout)
- It carries a rollback path
- The change is auditable: who changed it, when, why, and what the previous version was
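The canary-plus-rollback steps above can be sketched as a single rollout function. This is a simplified illustration under assumed data shapes (venues as dicts carrying a `policy` field, a `healthy` predicate standing in for telemetry checks); it is not the actual OZ deployment pipeline.

```python
def canary_rollout(policy, venues, canary_fraction=0.1, healthy=lambda v: True):
    """Deploy a new policy version to a canary subset first; if any canary
    venue turns unhealthy, restore the previous version and stop."""
    n = max(1, int(len(venues) * canary_fraction))
    canary, rest = venues[:n], venues[n:]
    for v in canary:                           # canary deploy, keeping rollback path
        v["previous"], v["policy"] = v.get("policy"), policy
    if not all(healthy(v) for v in canary):
        for v in canary:                       # rollback path: restore prior version
            v["policy"] = v.pop("previous")
        return {"status": "rolled-back", "canary": n}
    for v in rest:                             # canaries healthy: full rollout
        v["previous"], v["policy"] = v.get("policy"), policy
    return {"status": "deployed", "canary": n}
```

Storing the prior version alongside the new one is what makes the rollback path cheap: reverting is a field swap, not a redeploy.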
This means the 100th venue runs the same governance as the 10th venue. Policy drift does not accumulate. Management quality does not degrade with scale.
## Why this changes the operating model
When management is executable:
- New venues inherit governance automatically. No training lag, no interpretation variance
- Operators focus on exceptions, not routine. The system handles the known; humans handle the novel
- Compliance is built in. Audit trails are automatic, not reconstructed after the fact
- Management scales without managers. More venues do not require more management layers
This is not about removing humans from decisions. It is about encoding the decisions humans have already made and applying them consistently: across every venue, every shift, every season.
## From documents to deployment
Traditional management produces documents. OZ produces deployable policies.
The difference is execution. A document describes what should happen. A policy enforces what does happen. When your management layer is code, governance stops being overhead and starts being infrastructure.
That is what scales.