Graph Algorithms

Canonical surface for the Algorithm Library layer.

37 graph algorithms built into the world model — no projection lifecycle, no separate analytics service, no plugin installation. Every algorithm operates over the same confidence-scored, temporally-versioned graph that your queries run against. Centrality rankings respect observation class. Community detection propagates across fact confidence. Contradiction detection surfaces conflicting beliefs in the world model. Causal lineage walks CAUSED_BY provenance edges with cumulative confidence decay. Multi-source disagreement reconciles contested observations at the substrate level — the first such primitive in any graph engine. Trajectory analytics compose nearest-at-frame, leverage-gain, release-point, and shadow detection in a single Cypher block for sports / tracking / autonomous-vehicle workloads. Counterfactual branching forks the World Graph at a chosen WAL sequence so swarm rollouts score in isolation against the canonical timeline. Six algorithms run incrementally via LIVE CALL, maintaining results as standing queries that update with every mutation.

GPU acceleration is automatic when hardware is available: same query, same syntax — ArcFlow Adaptive Dispatch routes to the fastest backend at runtime (see GPU Acceleration).

Centrality#

PageRank#

Iterative eigenvector centrality. 20 iterations, damping factor 0.85. GPU-accelerated when supported hardware is present.

CALL algo.pageRank()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| name | string | Node label or name property |
| score | float | PageRank score (sums to 1.0) |

-- PageRank on a social network
CREATE (a:Person {name: 'Alice'})-[:FOLLOWS]->(b:Person {name: 'Bob'})
CREATE (b)-[:FOLLOWS]->(c:Person {name: 'Carol'})
CREATE (c)-[:FOLLOWS]->(a)
 
CALL algo.pageRank()

| nodeId | name  | score    |
|--------|-------|----------|
| 1      | Alice | 0.333333 |
| 2      | Bob   | 0.333333 |
| 3      | Carol | 0.333333 |
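
As a rough illustration of what the iteration above does (20 rounds, damping 0.85), the following Python sketch runs plain power-iteration PageRank on the same three-node cycle. It is not ArcFlow's implementation; the function name `pagerank` and the uniform redistribution of rank from nodes without outgoing edges are assumptions made for the example.

```python
def pagerank(edges, iterations=20, damping=0.85):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    nxt[dst] += share
        rank = nxt
    return rank


scores = pagerank([("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Alice")])
```

On a symmetric cycle every node receives the same rank, which is why the table above shows one third for each person.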

Confidence-Weighted PageRank#

PageRank weighted by edge confidence values. Higher-confidence edges contribute more to rank propagation.

CALL algo.confidencePageRank()

Betweenness Centrality#

Measures how often a node lies on shortest paths between other nodes. Identifies bridges and brokers in the graph.

CALL algo.betweenness()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| name | string | Node label or name property |
| betweenness | float | Betweenness centrality score |

CREATE (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person {name: 'Bob'})
CREATE (b)-[:KNOWS]->(c:Person {name: 'Carol'})
CREATE (b)-[:KNOWS]->(d:Person {name: 'Dave'})
 
CALL algo.betweenness()

| nodeId | name | betweenness |
|--------|------|-------------|
| 2      | Bob  | 4.000000    |
| 1      | Alice| 0.000000    |
| 3      | Carol| 0.000000    |
| 4      | Dave | 0.000000    |

Closeness Centrality#

Inverse of the average shortest path distance from a node to all reachable nodes. Higher values indicate more central position.

CALL algo.closeness()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| name | string | Node label or name property |
| closeness | float | Closeness centrality score |

Degree Centrality#

Count of relationships per node, optionally filtered by direction.

CALL algo.degreeCentrality()

Community Detection#

Louvain#

Hierarchical community detection via modularity optimization. GPU-accelerated on large graphs.

CALL algo.louvain()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| name | string | Node label or name property |
| community | integer | Community assignment ID |

-- Detect communities in a social network
CREATE (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person {name: 'Bob'})
CREATE (a)-[:KNOWS]->(c:Person {name: 'Carol'})
CREATE (d:Person {name: 'Dave'})-[:KNOWS]->(e:Person {name: 'Eve'})
 
CALL algo.louvain()

| nodeId | name  | community |
|--------|-------|-----------|
| 1      | Alice | 0         |
| 2      | Bob   | 0         |
| 3      | Carol | 0         |
| 4      | Dave  | 1         |
| 5      | Eve   | 1         |

Leiden#

Refined community detection that guarantees well-connected communities. It improves on Louvain by avoiding poorly connected intermediate communities.

CALL algo.leiden()

Returns the same schema as Louvain (nodeId, name, community).

Connected Components (WCC)#

Weakly connected components. Identifies disconnected subgraphs.

CALL algo.connectedComponents()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| name | string | Node label or name property |
| component | integer | Component assignment ID |

Community Detection (Label Propagation)#

Fast community detection using label propagation. Near-linear time complexity.

CALL algo.communityDetection()

Returns nodeId, name, and community columns. GPU-accelerated on large graphs.

K-Core Decomposition#

Computes the k-core number for each node. A k-core is a maximal subgraph in which every node has degree >= k within that subgraph; a node's core number is the largest k for which it belongs to a k-core.

CALL algo.kCore()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| name | string | Node label or name property |
| core | integer | Core number |
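
To make the definition concrete, here is a minimal Python sketch of the standard peeling approach to core numbers: repeatedly remove the node of lowest remaining degree. The function name `core_numbers` and the adjacency-dict input are assumptions for illustration; nothing here describes how `algo.kCore()` is implemented internally.

```python
def core_numbers(adjacency):
    """Core number per node for an undirected graph given as {node: set_of_neighbors}."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree (ties broken by node id).
        v = min(remaining, key=lambda u: (degree[u], u))
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adjacency[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# A triangle (a, b, c) with one pendant node d attached to a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
cores = core_numbers(graph)
```

The pendant node ends up with core number 1, while the triangle members share core number 2. This simple version rescans for the minimum on every step; bucketed variants reach linear time.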

Path Finding#

BFS Shortest Path#

Unweighted shortest path via breadth-first search. GPU-accelerated on large graphs.

MATCH p = shortestPath((a:Person {name: 'Alice'})-[*..10]->(b:Person {name: 'Dave'}))
RETURN p

Variable-length traversal with depth bounds:

MATCH (a:Person {name: 'Alice'})-[:KNOWS*1..5]->(b)
RETURN b.name

Dijkstra Shortest Path#

Weighted shortest path with optimized memory reuse. Pass the property key to use as edge weight.

CALL algo.dijkstra(startNodeId, endNodeId, 'weight')

| Column | Type | Description |
|--------|------|-------------|
| path | string | Ordered node IDs along the shortest path |
| distance | float | Total weighted path cost |

A* Pathfinding#

Heuristic-guided shortest path. Faster than Dijkstra when a distance estimate is available — the heuristic guides search toward the goal. Pass a node property to use as the heuristic estimate.

CALL algo.astar(startNodeId, endNodeId, 'travel_time', 'estimated_distance')

| Parameter | Description |
|-----------|-------------|
| start | Source node ID |
| end | Target node ID |
| weight_key | Edge property to use as cost |
| heuristic_key | Node property to use as heuristic estimate |

Confidence-Weighted Shortest Path#

Shortest path weighted by edge confidence. Finds the most trusted route between two nodes.

CALL algo.confidencePath(1, 2)

| Column | Type | Description |
|--------|------|-------------|
| path | string | Ordered node IDs in the path |
| confidence | float | Aggregate path confidence |

All-Pairs Shortest Path#

Computes shortest path distances between all node pairs. GPU-accelerated for large graphs.

CALL algo.allPairsShortestPath()

| Column | Type | Description |
|--------|------|-------------|
| source | integer | Source node ID |
| target | integer | Target node ID |
| distance | integer | Shortest path distance |

Similarity#

Jaccard Similarity#

Available through node similarity computation. Measures overlap of neighbor sets between node pairs.
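
For reference, the Jaccard measure itself is the size of the intersection of two neighbor sets divided by the size of their union. The Python sketch below shows that arithmetic only; the helper name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions, not ArcFlow behavior.

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard similarity of two neighbor collections: |A & B| / |A | B|."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    # Assumption: two nodes with no neighbors at all are scored 0.0.
    return len(a & b) / len(union) if union else 0.0


similarity = jaccard({1, 2, 3}, {2, 3, 4})  # two shared neighbors out of four distinct
```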

Node Similarity#

Pairwise similarity between all nodes based on shared neighbors.

CALL algo.similarNodes()

Clustering Coefficient#

Local clustering coefficient for each node. Measures how close a node's neighbors are to forming a clique.

CALL algo.clusteringCoefficient()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| name | string | Node label or name property |
| coefficient | float | Local clustering coefficient (0.0 to 1.0) |

GPU-accelerated: dispatches to CUDA/Metal when graph exceeds 50 nodes.

Graph Metrics#

Triangle Count#

Counts the number of triangles in the graph. GPU-accelerated on large graphs.

CALL algo.triangleCount()

| Column | Type | Description |
|--------|------|-------------|
| triangles | integer | Total triangle count |

Density#

Graph density: ratio of actual edges to possible edges.

CALL algo.density()

| Column | Type | Description |
|--------|------|-------------|
| density | float | Graph density (0.0 to 1.0) |

Diameter#

Longest shortest path in the graph.

CALL algo.diameter()

| Column | Type | Description |
|--------|------|-------------|
| diameter | integer | Graph diameter |
| source_id | integer | Source node of the longest path |
| target_id | integer | Target node of the longest path |

Biconnected Components#

Biconnected components — maximal subgraphs with no articulation points. O(V+E) time complexity.

CALL algo.biconnectedComponents()

| Column | Type | Description |
|--------|------|-------------|
| nodeId | integer | Internal node ID |
| component_id | integer | Biconnected component assignment |

Embedding Algorithms#

Node2Vec#

Generate node embeddings via structural random walk. Captures structural position in the graph.

CALL algo.node2vec(128, 80, 10)

| Parameter | Description |
|-----------|-------------|
| dimensions | Embedding size (e.g., 128) |
| walk_length | Steps per random walk (e.g., 80) |
| num_walks | Walks per node (e.g., 10) |

Sets node.embedding on every node. Use with vector search for structural similarity.

Struc2Vec#

Structural equivalence embeddings — nodes with similar local structure get similar vectors, regardless of where they are in the graph.

CALL algo.struc2vec(64)

GraphSAGE#

Inductive node embedding via neighborhood aggregation. Handles unseen nodes (no retraining needed).

CALL algo.graphSAGE(128)

Stale Embedding Detection#

Find nodes whose embeddings are out of date (property changed after last embedding).

CALL algo.staleEmbeddings()

Returns (nodeId, name) for nodes needing re-embedding.

Embedding Classification#

Classify nodes by similarity to labeled examples using a vector index.

CALL algo.classify($embedding, 'index_name', {k: 10})

Returns (nodeId, label, confidence).

Knowledge Graph Algorithms#

Entity Resolution#

Find nodes that refer to the same real-world entity despite different IDs or properties.

CALL algo.entityResolution()

Returns (nodeId1, nodeId2, similarity) pairs above the resolution threshold.

Fact Contradiction Detection#

Find pairs of facts in the graph that contradict each other based on predicates and confidence.

CALL algo.factContradiction()

Returns (nodeId1, nodeId2, contradiction_type, confidence).

Relationship Strength#

Score how strong a relationship is based on frequency, recency, and mutual connections.

CALL algo.relationshipStrength()

Returns (fromId, toId, strength) for all relationships.

Compounding Score#

Composite confidence score that propagates through chains of facts.

CALL algo.compoundingScore()

Entity Freshness#

How recently each entity was observed or corroborated. Low freshness = stale facts.

CALL algo.entityFreshness()

Returns (nodeId, name, freshness, last_observed_at).

Semantic Deduplication#

Find and score near-duplicate nodes using embedding similarity.

CALL algo.semanticDedup()

Multi-Source Disagreement#

A substrate-level cross-source resolution primitive: given the same logical entity reported by multiple sources, return a typed verdict about whether they agree, disagree, or stand alone. Three modes cover the common shapes of real-world observation disagreement; the algorithm is deterministic with lexicographic tie-break, weights each source's claim by a per-source prior (typically _confidence), and surfaces both a per-source assignment and a per-group consensus.

The TVF is the natural composition partner for Causal Reasoning — once disagreement is detected, causalLineage walks back to the observations that produced each conflicting claim.

CALL arcflow.multi_source_disagreement(
  entity_label:          "Charting",       -- the typed entity carrying source-reported values
  group_property:        "play_id",        -- groups rows that should agree
  source_property:       "source",         -- distinguishes reporters
  value_property:        "run_pass",       -- the column being resolved
  prior_weight_property: "_confidence",    -- per-source trust weight
  disagreement_kind:     "categorical",    -- "categorical" | "numeric" | "spatial"
  agreement_threshold:   1.0               -- mode-specific (see below)
)
YIELD source, value, agreement_class, group_consensus, dispute_score
RETURN group_consensus, dispute_score, source, value, agreement_class

Three modes#

Mode	Algorithm	`dispute_score` semantics
`categorical`	Weighted-majority over distinct values; Shannon entropy of the weighted distribution, normalized to `[0, 1]`	`0.0` when all sources agree; `1.0` at maximum disagreement (entropy ≈ log₂(n_sources))
`numeric`	Weighted median across reported values; max pairwise delta among reported values	`clamp(max_pairwise_delta / agreement_threshold, 0.0, 1.0)`
`spatial`	Weiszfeld geometric median (iterative; converges on the L1-optimal 3D point); max pairwise Euclidean distance	Same clamped-delta shape as numeric, against the 3D distance
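
The `numeric` row can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's code: it uses the lower weighted median (the smallest value whose cumulative weight reaches half the total), and the function names `weighted_median` and `numeric_dispute` are invented for the example.

```python
def weighted_median(values, weights):
    """Lower weighted median: smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("no values supplied")


def numeric_dispute(values, agreement_threshold):
    """clamp(max_pairwise_delta / agreement_threshold, 0.0, 1.0) for scalar readings."""
    # For scalars the largest pairwise delta is simply max minus min.
    max_delta = max(values) - min(values)
    return min(max(max_delta / agreement_threshold, 0.0), 1.0)


readings = [10.0, 12.0, 30.0]
priors = [0.9, 0.8, 0.2]
consensus = weighted_median(readings, priors)
score = numeric_dispute(readings, agreement_threshold=40.0)
```

With these inputs the low-trust reading of 30.0 does not move the consensus away from 12.0, but it still widens the spread that drives the dispute score.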

Output rows#

Each row carries:

Column	What it is
`source`	The reporter this row is about (or `null` for the group-level consensus row).
`value`	The value this source reported (categorical: the value; numeric: the float; spatial: the 3D point as WKT).
`agreement_class`	One of `consensus` (all sources agree within threshold), `outlier` (this source disagrees with the majority), `disputed` (no clear majority), `single_source` (only one source reported).
`group_consensus`	The resolved value for the group — the weighted-majority winner (categorical), the weighted median (numeric), or the geometric median (spatial).
`dispute_score`	`[0.0, 1.0]` per the mode-specific formula above.

Worked example — NFL play-call resolution#

Four vendors charted the same play; three say "run", one says "pass":

CALL arcflow.multi_source_disagreement(
  entity_label: "Charting",
  group_property: "play_id",
  source_property: "source",
  value_property: "run_pass",
  prior_weight_property: "_confidence",
  disagreement_kind: "categorical"
) YIELD source, value, agreement_class, group_consensus, dispute_score
WHERE source IS NOT NULL
RETURN source, value, agreement_class, dispute_score
ORDER BY dispute_score DESC

Output (illustrative):

[
  {"source": "xos",       "value": "pass", "agreement_class": "outlier",   "dispute_score": 0.8},
  {"source": "ngs",       "value": "run",  "agreement_class": "consensus", "dispute_score": 0.0},
  {"source": "pff",       "value": "run",  "agreement_class": "consensus", "dispute_score": 0.0},
  {"source": "pro_fb",    "value": "run",  "agreement_class": "consensus", "dispute_score": 0.0}
]

The group_consensus row reports value: "run", dispute_score: 0.4 (a single dissenter pulls the entropy up but not to the disputed threshold).
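
One reading of the categorical formula that reproduces the group-level figure is sketched below in Python. It assumes all four sources carry equal prior weight and that the entropy is normalized by log₂ of the number of sources; both are assumptions for illustration, as is the function name `categorical_dispute`.

```python
import math
from collections import defaultdict


def categorical_dispute(claims):
    """claims: non-empty list of (value, weight) pairs with positive weights.

    Returns (consensus_value, dispute_score).
    """
    totals = defaultdict(float)
    for value, weight in claims:
        totals[value] += weight
    total = sum(totals.values())
    # Weighted-majority winner; ties fall back to lexicographic order.
    consensus = min(totals, key=lambda v: (-totals[v], v))
    if len(claims) < 2:
        return consensus, 0.0
    probs = [w / total for w in totals.values()]
    entropy = sum(p * math.log2(1.0 / p) for p in probs if p > 0)
    # Assumption: normalize by log2(number of sources).
    return consensus, entropy / math.log2(len(claims))


claims = [("run", 1.0), ("run", 1.0), ("run", 1.0), ("pass", 1.0)]
consensus, score = categorical_dispute(claims)
```

For three "run" votes and one "pass" vote the weighted entropy is about 0.811 bits; divided by log₂(4) = 2 that gives roughly 0.41, in line with the 0.4 reported on the consensus row.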

Why this matters#

Most cross-source resolution in industry is either bespoke per-pipeline or built on top of a row-level conflict-resolution toolkit (LWW, FWW, vector clocks). Multi-source disagreement is substrate-level — the graph engine itself knows how to reconcile contested observations, with the resolution carrying back through arcflow.causalLineage to the typed entities that produced the conflict. The output rows can be written back as :Charting nodes with agreement_class and dispute_score properties, becoming queryable themselves.

The three modes cover most observation shapes: discrete labels (play type, classification), continuous values (sensor readings, scores), and 3D positions (tracking coordinates, sensor locations). For other shapes, the same framing extends through additional disagreement_kind values as the substrate grows.

Causal Reasoning#

Two algorithms walk CAUSED_BY provenance edges to answer "why did this fact land in the graph?" — the question every audit, every counterfactual, and every explanation needs. They compose with the World Graph's mission tiers (observed / inferred / predicted) and confidence scoring to produce cumulative-confidence-decayed lineages rather than flat parent lists.

The CAUSED_BY edge is a first-class provenance primitive. Every inferred or predicted fact in the graph can declare which observations or earlier inferences justified it; the engine's optional CAUSED_BY mandatory edge constraint enforces "every write must be justified" at the schema level.

Causal Lineage#

BFS over CAUSED_BY edges from a starting node, emitting every node visited with its hop distance, parent pointer, and cumulative confidence (the product of per-node _confidence along the path).

MATCH (s:Sack {play_id: 'X-vs-Y-q2-3:47'})
CALL arcflow.causalLineage(start_node: id(s), depth: 4)
YIELD node_id, hop, parent_id, node_label, node_confidence,
      cumulative_confidence, direction
RETURN hop, node_label, cumulative_confidence
ORDER BY hop, cumulative_confidence DESC

Parameters:

Parameter	Default	What it does
`start_node`	required	Node ID to walk from.
`depth`	required	Maximum BFS depth in hops.
`direction`	`"upstream"` (alias: `"causes"`)	`"upstream"` walks outgoing `CAUSED_BY` edges, toward causes. The incoming direction, toward consequences, is selected with `"downstream"` (aliases: `"caused"`, `"consequences"`).

Returns: one row per visited node with the visit metadata.
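
The walk can be pictured with a small Python sketch: a breadth-first search that carries the running confidence product along with each node. The dict-based inputs and the name `causal_lineage` are hypothetical, and whether the start node's own confidence is folded into the product is an assumption made here.

```python
from collections import deque


def causal_lineage(caused_by, confidence, start, depth):
    """BFS over cause edges.

    caused_by:  {node: [cause, ...]}  (outgoing CAUSED_BY edges)
    confidence: {node: _confidence}
    """
    # Assumption: the start node's confidence is part of the product.
    rows = [{"node_id": start, "hop": 0, "parent_id": None,
             "cumulative_confidence": confidence[start]}]
    seen = {start}
    queue = deque([(start, 0, confidence[start])])
    while queue:
        node, hop, cumulative = queue.popleft()
        if hop == depth:
            continue  # depth bound reached; do not expand further
        for cause in caused_by.get(node, []):
            if cause in seen:
                continue
            seen.add(cause)
            product = cumulative * confidence[cause]
            rows.append({"node_id": cause, "hop": hop + 1, "parent_id": node,
                         "cumulative_confidence": product})
            queue.append((cause, hop + 1, product))
    return rows


edges = {"sack": ["pressure"], "pressure": ["snap"]}
conf = {"sack": 0.9, "pressure": 0.8, "snap": 1.0}
lineage = causal_lineage(edges, conf, "sack", depth=4)
```

Here the chain sack → pressure → snap yields cumulative confidences of 0.9, 0.72, and 0.72: trust can only shrink or hold as the walk moves away from the starting fact.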

Use cases:

Audit trail for any inferred or predicted fact ("show me what justified this Sack designation, four hops back").
Confidence floor for derived facts (the cumulative-confidence column shows how much trust survives the inference chain).
Counterfactual base — pair with AS OF to ask "what would have been justified at time T?"

Causal Path#

Shortest CAUSED_BY chain between two specific nodes via BFS + parent-pointer backtrack. Returns one row per hop in the discovered path, or a single path_found="false" row if no chain exists within the depth bound.

MATCH (sack:Sack), (snap:Snap)
WHERE sack.id = $sack_id AND snap.id = $snap_id
CALL arcflow.causalPath(from: id(sack), to: id(snap), depth: 8)
YIELD hop, node_id, node_label, path_found
RETURN hop, node_label
ORDER BY hop

Parameters:

Parameter	Default	What it does
`from`	required	Source node ID.
`to`	required	Target node ID.
`depth`	required	Maximum BFS depth in hops.
`direction`	`"upstream"`	Same direction semantics as `causalLineage`.

Returns: the discovered path as (hop, node_id, node_label, path_found) rows, or path_found="false" when no chain exists.
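
The "BFS + parent-pointer backtrack" idea is sketched below in Python: the search records which node discovered each newly reached node, then follows those pointers back from the target. The function name `causal_path` and the adjacency-dict input are assumptions; returning `None` stands in for the `path_found="false"` row.

```python
from collections import deque


def causal_path(caused_by, source, target, depth):
    """Shortest chain from source to target over cause edges, or None within the depth bound."""
    parent = {source: None}
    queue = deque([(source, 0)])
    while queue:
        node, hop = queue.popleft()
        if node == target:
            # Backtrack through parent pointers, then reverse into source-to-target order.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if hop == depth:
            continue
        for cause in caused_by.get(node, []):
            if cause not in parent:
                parent[cause] = node
                queue.append((cause, hop + 1))
    return None


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
path = causal_path(graph, "a", "e", depth=8)
```

Because BFS reaches nodes in order of hop count, the first time the target is dequeued the recorded chain is already a shortest one.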

Use cases:

Targeted "why does X depend on Y?" queries when you have specific endpoints.
Disambiguating between multiple plausible causal chains by picking the shortest.
Failure-mode diagnostics — causalPath from a low-confidence fact back to a known-bad source.

Counterfactual Branching#

A substrate primitive for counterfactual exploration — branch the World Graph at a specific WAL sequence point, apply hypothetical edits to the branch in isolation, score the alternative outcome, then drop or keep the branch. The substrate that makes "what if we'd called the play differently?" / "what if this sensor had reported X instead of Y?" / "what if the model had predicted this branch first?" queries answerable without polluting the canonical timeline.

The CALL wrapper translates to ArcFlow's existing canonical BRANCH AT SEQ DDL substrate — branches created via CALL behave identically to those created via DDL (DROP BRANCH, DIFF BRANCH … AGAINST HEAD, AS BRANCH 'name' MATCH … routing all work uniformly).

Branch At#

Create a named branch of the World Graph at a specific WAL sequence point.

CALL arcflow.counterfactual.branchAt(name: 'rollout-1', seq: 42)
YIELD branch, base_seq, status

Returns one row: {branch: "rollout-1", base_seq: "42", status: "created"}.

Argument forms (all equivalent):

Named: name: 'r1', seq: 42 or name = 'r1', seq = 42
Positional: 'r1', 42
Single or double quotes on the name

Boundary errors (all typed, with recovery_suggestion payloads):

Error code	Cause
`COUNTERFACTUAL_BRANCH_AT_MISSING_NAME`	No `name` argument provided
`COUNTERFACTUAL_BRANCH_AT_MISSING_SEQ`	No `seq` argument provided
`COUNTERFACTUAL_BRANCH_AT_EMPTY_NAME`	Empty string passed as `name`
`COUNTERFACTUAL_BRANCH_AT_UNKNOWN_ARG`	Unexpected kwarg

Fan-out swarm pattern#

Counterfactual exploration usually fans out: N rollouts at the same base seq, each with a different hypothetical edit, scored independently, then dropped or kept based on the score:

def counterfactual_swarm(db, play_id: int, n_rollouts: int = 10) -> list:
    base_seq = db.current_seq()
    rollouts = []
    results = []

    try:
        # Fan out N branches at the same base seq.
        for i in range(n_rollouts):
            rollout_id = f"rollout-{play_id}-{i}"
            db.execute(
                f"CALL arcflow.counterfactual.branchAt("
                f"name: '{rollout_id}', seq: {base_seq})"
            )
            rollouts.append(rollout_id)

        # Per-rollout scenario edit (your domain logic) + scoring.
        for rollout_id in rollouts:
            # Apply the counterfactual edit here via an AS BRANCH-targeted
            # execute (domain-specific, omitted), then score the branch.
            score = db.execute(
                f"AS BRANCH '{rollout_id}' "
                f"MATCH (p:Play {{play_id: {play_id}}}) "
                f"CALL algo.confidencePageRank('SHADOWS') YIELD rank "
                f"RETURN sum(rank) AS score"
            )
            results.append((rollout_id, score.rows[0]['score']))
    finally:
        # Cleanup: drop every branch that was created, even if scoring failed.
        for rollout_id in rollouts:
            db.execute(f"DROP BRANCH '{rollout_id}'")

    return results

For write-side parallelism over the fan-out (multiple branches created concurrently), use the multi-handle pattern from the Threading Model — each handle holds its own write_mutex, so N handles fan out N branch creations in parallel without HANDLE_BUSY_CONCURRENT_WRITER. Reads against created branches don't hit the guard regardless.

Why a CALL surface (not just DDL)#

The DDL form (BRANCH AT SEQ 42 AS 'rollout-1') and the CALL form return the same branch. The CALL form exists because:

It composes with YIELD — CALL … YIELD branch, base_seq, status returns a typed row that downstream Cypher can consume directly (filter, transform, drive a MATCH against the branch).
It composes with Python orchestration — db.execute("CALL …") is the natural shape for swarm fan-out where the rollout names are computed in code.
It composes with EXPLAIN / PROFILE — the CALL surface is a planner-level operator with introspectable cost.
Typed errors with recovery_suggestion — the DDL form returns parse errors; the CALL form returns the structured boundary errors named above.

Use cases#

Sports analytics counterfactual swarm — "what if the QB had thrown to receiver Y instead of Z at frame 1024?" Fan out N branches at the snap frame, each with a different receiver target, score by composite pressure / completion-probability / yards-after-catch heuristics, surface the highest-EV alternative.
Sensor-failure replay — branch at the sensor-loss event, replay with the lost sensor's last-known reading held constant, compare downstream fusion outputs against the canonical timeline to bound the failure's blast radius.
Decision audit — when a low-confidence inferred fact lands, branch backward, apply the alternative inference, compare what would have followed. Pair with arcflow.causalLineage to trace which downstream facts depended on the original inference.
A/B replay against historical data — branch at a historical seq, apply the alternative policy, replay the same observation stream against the modified state, compare metrics.

The branch's lifetime is the caller's responsibility — DROP BRANCH 'name' releases the WAL state when the analysis is done. Branches not dropped accumulate; long-running counterfactual analytics should pair every branchAt with a DROP BRANCH in a try/finally or context manager.
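
A context manager is one way to keep that pairing honest. The Python sketch below assumes only a `db.execute(str)` method like the one used in the swarm example; the helper name `counterfactual_branch` and its quote check are hypothetical additions, not part of the ArcFlow client.

```python
from contextlib import contextmanager


@contextmanager
def counterfactual_branch(db, name, seq):
    """Create a branch on entry and always drop it on exit."""
    if not name or "'" in name:
        # Hypothetical guard: keep the name safe to embed in a quoted literal.
        raise ValueError("branch name must be non-empty and contain no single quotes")
    db.execute(
        f"CALL arcflow.counterfactual.branchAt(name: '{name}', seq: {int(seq)})"
    )
    try:
        yield name
    finally:
        # Runs on normal exit and when the body raises.
        db.execute(f"DROP BRANCH '{name}'")


class RecordingDb:
    """Stand-in for a database handle; it only records the statements it is given."""

    def __init__(self):
        self.statements = []

    def execute(self, query):
        self.statements.append(query)


db = RecordingDb()
with counterfactual_branch(db, "rollout-1", 42) as branch:
    db.execute(f"AS BRANCH '{branch}' MATCH (n) RETURN count(n)")
```

If `branchAt` itself fails, nothing was created and no `DROP BRANCH` is issued, because the `try` block is entered only after the branch exists.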

GraphRAG Pipeline#

GraphRAG#

Full retrieval-augmented generation pipeline: query decomposition, graph traversal, evidence collection with confidence scoring.

CALL algo.graphRAG('What connections exist between Alice and the project?')

GraphRAG Context#

Retrieves structured context from the graph for a natural language query, with token budgeting for LLM context windows.

CALL algo.graphRAGContext('Tell me about Alice')

Trusted GraphRAG#

Confidence-filtered GraphRAG — only returns evidence above the trust threshold. Facts outrank inferences outrank predictions.

CALL algo.graphRAGTrusted('high-confidence relationships for Alice')

Sensor Fusion#

Weighted Centroid#

Multi-sample weighted-mean position + per-axis variance + total weight, computed in-engine over every node carrying a 3D position property and a numeric weight. Replaces the per-service Python np.average(positions, weights=...) round-trip with one CALL.

CALL algo.fusion.weighted_centroid('PlayerObs', 'pos_world', 'confidence')
YIELD x, y, z, var_x, var_y, var_z, total_weight, n

Arguments:

Arg	Type	Meaning
`label`	string	Node label to scan
`posProp`	string	Property carrying `Point3d` / `Vector3d` (also accepts a 2D `Point` — treated as z=0)
`weightProp`	string	Property carrying `Float` or `Int` (numeric weight; must be ≥ 0)

Returned columns: x, y, z (centroid), var_x, var_y, var_z (per-axis variance), total_weight (Σwᵢ), n (samples summed). Nodes missing either property are silently skipped; NaN slots are treated as missing. Aliased as algo.fusion.weightedCentroid for camelCase parity with the other graph algorithms.

Errors:

Code	Cause
`EMPTY_SAMPLE_SET`	zero usable nodes after the property gate
`WEIGHT_NOT_NUMERIC`	a `weightProp` value isn't `Float` or `Int`
`NEGATIVE_WEIGHT`	a weight < 0
`INVALID_WEIGHT_SUM`	Σwᵢ ≤ 0
`NAN_IN_INPUT`	NaN slipped through the property filter
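
The arithmetic behind the returned columns can be sketched in plain Python. This is an illustrative reading, not the engine's code: the variance here is the weight-normalized population variance Σwᵢ(xᵢ − μ)² / Σwᵢ, which is an assumption, and the function name `weighted_centroid` and tuple-based input are invented for the example. The error strings reuse the codes from the table above.

```python
import math


def weighted_centroid(samples):
    """samples: iterable of ((x, y, z), weight).

    Returns (centroid, per_axis_variance, total_weight, n).
    """
    usable = []
    for pos, weight in samples:
        # NaN slots are treated as missing and skipped.
        if math.isnan(weight) or any(math.isnan(c) for c in pos):
            continue
        if weight < 0:
            raise ValueError("NEGATIVE_WEIGHT")
        usable.append((pos, weight))
    if not usable:
        raise ValueError("EMPTY_SAMPLE_SET")
    total = sum(w for _, w in usable)
    if total <= 0:
        raise ValueError("INVALID_WEIGHT_SUM")
    centroid = tuple(sum(w * p[i] for p, w in usable) / total for i in range(3))
    # Assumption: population-style variance normalized by total weight.
    variance = tuple(
        sum(w * (p[i] - centroid[i]) ** 2 for p, w in usable) / total
        for i in range(3)
    )
    return centroid, variance, total, len(usable)


obs = [((0.0, 0.0, 0.0), 1.0), ((2.0, 0.0, 0.0), 1.0), ((4.0, 0.0, 0.0), 2.0)]
centroid, variance, total_weight, n = weighted_centroid(obs)
```

For these three observations the heavier sample at x = 4 pulls the centroid to x = 2.5, with an x-variance of 2.75 and zero variance on the other two axes.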

The same procedure can be called from Python through db.execute:

from arcflow import ArcFlow
db = ArcFlow()
row = db.execute(
    "CALL algo.fusion.weighted_centroid('PlayerObs', 'pos_world', 'confidence')"
).rows[0]

The pure-function algebra (weighted_centroid_2d, weighted_centroid_3d, weighted_centroid_nd) is also exposed on the Rust crate as arcflow_core::algo_fusion.

Trajectory Analytics#

A family of Cypher CALL surfaces over the arcflow-runtime::trajectory kernel primitives — designed for sports analytics, tracking, autonomous-vehicle telemetry, and any workload where entities sweep through 2D space across frames. Each procedure is a thin graph→Sample2 { frame, x, y } wrapper around a pure-function kernel (PAT-0049 humble-object shape); the kernels themselves are tested independently and reachable from Rust as arcflow_core::trajectory::*.

The four primitives compose: a single play's coverage analysis can chain releasePoint → nearestAtFrame → shadowedBy → leverageGain in one Cypher block, no SKILL ceremony, no Python round-trip.

Nearest at Frame#

Top-k entities by 2D distance from a query point at a specific frame. The single-frame slice of a "who was closest when" question.

CALL arcflow.trajectory.nearestAtFrame(
  entity_label: "Player",
  frame_property: "frame",
  x_property: "x",
  y_property: "y",
  frame: 1024,
  qx: 50.0, qy: 23.5,
  k: 5
) YIELD other_node_id, other_label, distance, frame

Returned columns: other_node_id, other_label, distance, frame. Rows are ordered by ascending distance with deterministic tie-break by other_node_id. Entities missing the frame, x, or y properties at the requested frame are silently skipped.
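
The selection logic amounts to filter, measure, sort, truncate. A minimal Python sketch follows, with rows represented as hypothetical `(node_id, frame, x, y)` tuples rather than graph nodes; the function name `nearest_at_frame` is likewise an assumption.

```python
import math


def nearest_at_frame(samples, frame, qx, qy, k):
    """Top-k (node_id, distance) pairs at one frame, nearest first."""
    hits = [
        (math.hypot(x - qx, y - qy), node_id)
        for node_id, f, x, y in samples
        if f == frame
    ]
    # Tuples sort by distance first, then by node_id, giving a deterministic tie-break.
    hits.sort()
    return [(node_id, dist) for dist, node_id in hits[:k]]


rows = [
    (1, 1024, 50.0, 23.5),
    (2, 1024, 53.0, 27.5),
    (3, 1024, 47.0, 19.5),
    (4, 1000, 50.0, 23.5),  # different frame, ignored
]
nearest = nearest_at_frame(rows, frame=1024, qx=50.0, qy=23.5, k=2)
```

Nodes 2 and 3 are both exactly 5.0 units from the query point, so with `k = 2` the tie-break on node id decides that node 2 makes the cut.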

Leverage Gain#

Per-frame closing-or-falling-behind delta between a chaser and a target trajectory. Positive delta means the chaser is closing the distance; negative means falling behind. Frames where either side is missing get dropped (not zeroed).

CALL arcflow.trajectory.leverageGain(
  entity_label: "Player",
  chaser_filter_property: "player_id", chaser_filter_value: 28,
  target_filter_property: "player_id", target_filter_value: 12
) YIELD frame, delta

Returned columns: frame, delta (one row per frame that exists in both trajectories, ordered ascending). N-1 rows for a span of N matched frames — the first frame has no delta-from-previous to report.
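
One plausible reading of the delta, sketched in Python: compute the chaser-to-target distance at each shared frame and report how much it shrank since the previous shared frame. The formula `previous_distance - current_distance` is an assumption consistent with "positive means closing"; the function name `leverage_gain` and the `{frame: (x, y)}` inputs are hypothetical.

```python
import math


def leverage_gain(chaser, target):
    """chaser, target: {frame: (x, y)}. Returns [(frame, delta)] for shared frames."""
    # Frames missing on either side are dropped, not zero-filled.
    frames = sorted(set(chaser) & set(target))
    dist = [math.dist(chaser[f], target[f]) for f in frames]
    # The first shared frame has no predecessor, so N frames give N-1 rows.
    return [(frames[i], dist[i - 1] - dist[i]) for i in range(1, len(frames))]


chaser = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (1.0, 0.0), 5: (9.0, 9.0)}
target = {1: (4.0, 0.0), 2: (4.0, 0.0), 3: (5.0, 0.0)}
deltas = leverage_gain(chaser, target)
```

Distances across the three shared frames are 4, 3, and 4, so the chaser closes by one unit at frame 2 and loses one unit at frame 3; frame 5 is dropped because the target has no sample there.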

Release Point#

Single-row event detection: the frame at which forward x-velocity peaks. A throw-release heuristic for quarterbacks; generalises to any "fastest forward motion" event over a 2D trajectory.

CALL arcflow.trajectory.releasePoint(
  entity_label: "Player",
  filter_property: "player_id", filter_value: 5
) YIELD frame

Returned columns: frame (one row, or zero if the entity has fewer than 2 frame samples). Ties take the earliest frame. Unsorted frame input is handled by the kernel — no caller-side ordering required.
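
A minimal Python sketch of the heuristic: sort the samples by frame, take the finite-difference x-velocity between consecutive samples, and keep the earliest frame at which it peaks. Attributing each velocity to the later frame of the pair is an assumption, as are the function name `release_point` and the `(frame, x, y)` tuple input.

```python
def release_point(samples):
    """samples: list of (frame, x, y) in any order. Returns the peak-velocity frame or None."""
    ordered = sorted(samples)  # the caller does not need to pre-sort
    if len(ordered) < 2:
        return None
    best_frame, best_velocity = None, float("-inf")
    for (f0, x0, _), (f1, x1, _) in zip(ordered, ordered[1:]):
        if f1 == f0:
            continue  # duplicate frame, no velocity defined
        velocity = (x1 - x0) / (f1 - f0)
        # Strict comparison keeps the earliest frame when velocities tie.
        if velocity > best_velocity:
            best_frame, best_velocity = f1, velocity
    return best_frame


track = [(3, 3.0, 0.0), (1, 0.0, 0.0), (2, 1.0, 0.0), (4, 5.0, 0.0), (5, 7.0, 0.0)]
frame = release_point(track)
```

In this track the forward velocity reaches 2.0 at frame 3 and stays there, so the earliest of the tied frames is reported.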

Shadowed By#

Three-trajectory geometric primitive: which frames have the defender obstructing the attacker→target line of sight, within an angular tolerance? The "is the receiver covered?" question for any pursuit-shadow scenario.

CALL arcflow.trajectory.shadowedBy(
  entity_label: "Player",
  attacker_filter_property: "player_id", attacker_filter_value: 5,
  target_filter_property: "player_id",   target_filter_value: 12,
  defender_filter_property: "player_id", defender_filter_value: 28,
  angle_tol_rad: 0.1
) YIELD frame

Returned columns: frame (one row per frame where the defender is "shadowing" — between the attacker and target within the angular tolerance, and not past the target). Frames where any of the three entities is missing get dropped.
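
Geometrically, the test compares the attacker→defender direction with the attacker→target direction. The Python sketch below is one way to express it, treating "not past the target" as "no farther from the attacker than the target is"; that interpretation, the function name `shadowed_frames`, and the `{frame: (x, y)}` inputs are assumptions.

```python
import math


def shadowed_frames(attacker, target, defender, angle_tol_rad):
    """Frames where the defender sits on the attacker-to-target line within a tolerance."""
    # Frames missing for any of the three entities are dropped.
    frames = sorted(set(attacker) & set(target) & set(defender))
    shadowed = []
    for f in frames:
        ax, ay = attacker[f]
        tx, ty = target[f][0] - ax, target[f][1] - ay
        dx, dy = defender[f][0] - ax, defender[f][1] - ay
        t_len, d_len = math.hypot(tx, ty), math.hypot(dx, dy)
        if t_len == 0 or d_len == 0 or d_len > t_len:
            continue  # degenerate geometry, or defender beyond the target
        # Unsigned angle between the two direction vectors (cross and dot products).
        angle = abs(math.atan2(tx * dy - ty * dx, tx * dx + ty * dy))
        if angle <= angle_tol_rad:
            shadowed.append(f)
    return shadowed


qb = {1: (0.0, 0.0), 2: (0.0, 0.0), 3: (0.0, 0.0)}
wr = {1: (10.0, 0.0), 2: (10.0, 0.0), 3: (10.0, 0.0)}
cb = {1: (5.0, 0.1), 2: (5.0, 3.0), 3: (12.0, 0.0)}
frames = shadowed_frames(qb, wr, cb, angle_tol_rad=0.1)
```

Only frame 1 qualifies: at frame 2 the defender is roughly 0.54 rad off the line, and at frame 3 the defender is behind the target.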

NFL coverage descriptor — composing all four#

A single Cypher block computes a complete per-play coverage descriptor: release frame, nearest defender at snap, line-of-sight shadow, and closing speed:

// 1. Find QB's nearest defender at snap (frame 1024)
CALL arcflow.trajectory.nearestAtFrame(
  entity_label: "Player", frame_property: "frame",
  x_property: "x", y_property: "y",
  frame: 1024, qx: 30.0, qy: 22.0, k: 3
) YIELD other_node_id AS defender_id, distance
 
// 2. Compute the throw release frame for the QB
CALL arcflow.trajectory.releasePoint(
  entity_label: "Player",
  filter_property: "player_id", filter_value: 5
) YIELD frame AS release_frame
 
// 3. Check if CB28 shadowed the throw line to WR12
CALL arcflow.trajectory.shadowedBy(
  entity_label: "Player",
  attacker_filter_property: "player_id", attacker_filter_value: 5,
  target_filter_property: "player_id",   target_filter_value: 12,
  defender_filter_property: "player_id", defender_filter_value: 28,
  angle_tol_rad: 0.1
) YIELD frame AS shadow_frame
 
// 4. Closing speed between CB28 + WR12 over the same window
CALL arcflow.trajectory.leverageGain(
  entity_label: "Player",
  chaser_filter_property: "player_id", chaser_filter_value: 28,
  target_filter_property: "player_id",   target_filter_value: 12
) YIELD frame, delta AS closing

Four composable signals for a single play, each a single CALL. The kernels operate on the in-memory dense trajectory arrays the engine already maintains; the graph layer's only contribution is selecting which entities to feed in.

Why a separate procedure family#

Trajectory analytics has a different shape from generic graph algorithms:

Inputs are dense per-entity trajectories, not graph adjacency. The kernels operate on Sample2[frame, x, y] arrays; the graph layer's job is selecting which entities' arrays to feed in via label + property filters.
Output is per-frame or per-event, not per-node. Most return one row per matched frame; releasePoint returns at most one row total.
DenseStore-aware: reads go through the same store.effective_property path that bulk-create-from-Arrow ingestion writes to, so vectorised ingestion and trajectory analytics share the same column layouts.
Typed errors for the predictable failure modes: missing entity_label, missing filter_property, missing filter_value all return validation errors with recovery suggestions, never silent zero-row results.

For workloads beyond sports analytics — robotics tracking, autonomous-vehicle telemetry, drone formation analysis, animal motion studies — the same four primitives extend without modification. The "Player" / "QB" / "CB" semantics are entirely customer-side; the kernels only see Sample2 arrays.

Streaming functions#

ArcFlow ships three per-stream stateful scalar functions for the per-step-derivative patterns common in sensor pipelines. They are sibling to the sliding-window aggregators — instead of emitting one aggregate per window completion, they emit one value per feed.

Function	Definition	First-output offset
`lag(value, n)`	the value from `n` feeds ago	n+1
`lead(value, n)`	retroactive output for event `F-n` when feed `F` arrives	n+1
`delta(value)`	`current - previous` (equivalent to `value - lag(value, 1)`)	2

Today these are reachable via the Rust + C ABI surface (register_stream_fn and feed_stream_fn). A Python wrapper is shipped under arcflow.stream_fn; the Cypher window-expression surface (RETURN lag(p.pan, 1) OVER (ORDER BY p.captured_at)) is follow-up work.

Rust#

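use arcflow::stream_fn::StreamFnKind;
 
// One Delta per channel: the pan angle and the capture timestamp.
store.register_stream_fn("pan_delta", StreamFnKind::Delta)?;
store.register_stream_fn("ts_delta", StreamFnKind::Delta)?;
for (pan, ts) in samples {
    let pan_out = store.feed_stream_fn("pan_delta", pan)?;
    let ts_out = store.feed_stream_fn("ts_delta", ts)?;
    // Delta emits nothing on the first feed, so both outputs start empty.
    if let (Some(p), Some(t)) = (pan_out.first(), ts_out.first()) {
        let pan_rate = p.fn_value / t.fn_value; // pan change per unit time
    }
}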

Python#

from arcflow import ArcFlow
from arcflow.stream_fn import StreamFn, Kind
 
db = ArcFlow()
pan_delta = StreamFn(db, "pan_delta", Kind.Delta)
ts_delta  = StreamFn(db, "ts_delta",  Kind.Delta)
 
for pan, ts in samples:
    pan_out = pan_delta.feed(pan)
    ts_out  = ts_delta.feed(ts)
    if pan_out and ts_out:
        pan_rate = pan_out[0].fn_value / ts_out[0].fn_value

State is computational — NOT WAL-journaled. Callers re-feed pending events from the source (typically a topic whose events ARE journaled) after restart.
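The first-output offsets in the table can be checked against a small pure-Python model of lag and delta. This is a sketch of the semantics only: the Lag and Delta classes below are hypothetical stand-ins, not the arcflow.stream_fn API, and the list-valued return simply mirrors the feed results used in the examples above. lead is omitted for brevity.

```python
from collections import deque


class Lag:
    """Hypothetical model of lag(value, n): emits the value from n feeds ago."""

    def __init__(self, n):
        self.n = n
        self.buf = deque(maxlen=n + 1)  # keeps the current value plus n predecessors

    def feed(self, value):
        self.buf.append(value)
        if len(self.buf) <= self.n:
            return []            # still warming up: fewer than n+1 feeds seen
        return [self.buf[0]]     # oldest retained value is exactly n feeds old


class Delta:
    """Hypothetical model of delta(value) = value - lag(value, 1)."""

    def __init__(self):
        self.lag = Lag(1)

    def feed(self, value):
        return [value - prev for prev in self.lag.feed(value)]


lag2 = Lag(2)
lag_outputs = [lag2.feed(v) for v in [1, 2, 3, 4]]

delta = Delta()
delta_outputs = [delta.feed(v) for v in [10, 13, 11]]
```

With n = 2 the first non-empty output arrives on the third feed (offset n+1), and delta first emits on the second feed, matching the table.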

Incremental Algorithms (LIVE CALL)#

Six algorithms can run as standing queries via LIVE CALL. Instead of recomputing from scratch on every mutation, they maintain state incrementally. Cost per update is O(affected neighborhood), not O(V+E).

-- Register incremental PageRank -- updates automatically on graph mutations
LIVE CALL algo.pageRank({ damping: 0.85 })
 
-- Register incremental community detection
LIVE CALL algo.connectedComponents()
LIVE CALL algo.louvain()
 
-- Register incremental pathfinding and metrics
LIVE CALL algo.bfs({ source: 42 })
LIVE CALL algo.triangleCount()
LIVE CALL algo.clusteringCoefficient()

Inspecting live algorithms#

-- List all running incremental algorithms
CALL db.liveAlgorithms()
 
-- Check a specific algorithm's memory and timing
CALL db.viewStats('pageRank')

Built on the Live Delta Stack and Z-set algebra. Every mutation produces a delta (weighted +1/-1) that propagates through the algorithm's state. Results are mathematically identical to a full recompute.
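To make the delta idea concrete, here is a minimal sketch of incremental triangle counting on an undirected graph, where each edge mutation arrives with a weight of +1 (insert) or -1 (delete). The LiveTriangleCount class and full_recount helper are hypothetical names for illustration and do not reflect ArcFlow internals; the point is that each update touches only the two endpoints' neighborhoods while the running total stays equal to a full recompute.

```python
from itertools import combinations


class LiveTriangleCount:
    """Maintains a triangle count under weighted edge deltas (illustrative only)."""

    def __init__(self):
        self.adj = {}        # node -> set of neighbours
        self.triangles = 0   # running result, never recomputed from scratch

    def apply(self, u, v, weight):
        """Apply one edge delta: weight=+1 inserts (u, v), weight=-1 deletes it."""
        if u == v:
            return
        nu = self.adj.setdefault(u, set())
        nv = self.adj.setdefault(v, set())
        if weight > 0 and v not in nu:
            # Every common neighbour closes one new triangle.
            self.triangles += len(nu & nv)
            nu.add(v)
            nv.add(u)
        elif weight < 0 and v in nu:
            nu.discard(v)
            nv.discard(u)
            # Every common neighbour loses the triangle this edge closed.
            self.triangles -= len(nu & nv)


def full_recount(adj):
    """Reference implementation: count triangles from scratch."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )


live = LiveTriangleCount()
for u, v in [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]:
    live.apply(u, v, +1)
after_inserts = live.triangles   # triangles {1,2,3} and {1,3,4}
live.apply(1, 3, -1)             # deleting the shared edge removes both
after_delete = live.triangles
```

Comparing live.triangles with full_recount(live.adj) after any sequence of updates is a cheap way to test the "identical to a full recompute" property on small graphs.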

Algorithm Performance Summary#

Benchmarked on a MacBook Air M4 (10-core, 24GB) with 10K-node graphs, single-threaded CPU baseline. GPU speedups measured with CUDA on equivalent hardware. These are laptop numbers — server hardware with more cores and memory will scale further.

Algorithm	Acceleration	GPU Dispatch Threshold
PageRank	GPU-accelerated	500 nodes
BFS Shortest Path	GPU-accelerated	100 nodes
Triangle Count	GPU-accelerated	100 nodes
Community Detection	GPU-accelerated	200 nodes
Louvain	GPU-accelerated	200 nodes
Vector Search	GPU-accelerated	100 vectors
All-Pairs Shortest Path	GPU-accelerated	500 nodes
Clustering Coefficient	GPU-accelerated	100 nodes
Node Similarity	GPU-accelerated	100 nodes
Leiden	CUDA compute capability 9.0+ (H100 and newer) only	10M nodes

GPU dispatch is handled by ArcFlow Adaptive Dispatch — which measures graph size at query time and routes to the fastest available backend automatically. No configuration required.

To measure throughput on your hardware, run cargo bench --bench algo from the ozinc/arcflow repo.

When multiple CUDA devices are present, Adaptive Dispatch selects the least-loaded device and gates dispatch on a per-algorithm cost model. Use CALL db.gpuStatus() to inspect per-device load and CALL db.capabilities() for the engine's wired-capability surface.

The parallel execution itself runs through the ArcFlow Graph Kernel — a single-pass parallel structure that maps directly to GPU thread blocks rather than walking the graph one edge at a time.

Six algorithms also support LIVE CALL for incremental maintenance -- see the Incremental Algorithms section above.

Prediction Quality#

Two procedures for monitoring prediction accuracy against observed ground truth.

Prediction Drift#

Compares predicted positions against observed entity positions within a time horizon. One row per matching :Prediction node.

CALL algo.predictionDrift('robot-01', 2000)
  YIELD entity_id, horizon_ms,
        predicted_x, observed_x, delta_x,
        predicted_y, observed_y, delta_y,
        delta_magnitude, prediction_source,
        observation_seq, predicted_seq

delta_magnitude is the Euclidean distance between predicted and observed (x, y). horizon_ms defaults to 1000 (1 second) — only :Prediction nodes with _timestamp_ms >= (now - horizon_ms) are included.

Confidence Calibration#

Expected Calibration Error (ECE) for an entity's predictions. Measures whether predicted confidence scores match observed accuracy.

CALL algo.confidenceCalibration('robot-01', '1h')
  YIELD ece_score, sample_count, calibration_curve, calibration_status

`calibration_status`	Meaning
`insufficient-data`	Not enough predictions for a reliable ECE score
`well-calibrated`	ECE < 0.05 — predictions are trustworthy
`overconfident`	Model assigns higher confidence than observed accuracy justifies
`underconfident`	Model assigns lower confidence than observed accuracy justifies

window accepts duration strings: "1h", "30m", "24h". Agreement tolerance: 1.0 meter. 10 confidence buckets.
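As a rough illustration of the metric, the sketch below computes ECE with 10 equal-width confidence buckets. It assumes the standard ECE definition (the sample-weighted mean of |accuracy - mean confidence| per bucket) and that a prediction counts as a hit when it lands within the 1.0 meter agreement tolerance; the function name and input shape are hypothetical, and the engine's exact bucketing may differ.

```python
def expected_calibration_error(samples, n_buckets=10):
    """samples: list of (confidence in [0, 1], hit) pairs, where hit is a bool.

    Returns (ece, curve); curve lists (mean_confidence, accuracy) per non-empty bucket.
    """
    buckets = [[] for _ in range(n_buckets)]
    for conf, hit in samples:
        # Equal-width buckets; confidence 1.0 falls into the last bucket.
        idx = min(int(conf * n_buckets), n_buckets - 1)
        buckets[idx].append((conf, hit))

    total = len(samples)
    ece = 0.0
    curve = []
    for bucket in buckets:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        curve.append((mean_conf, accuracy))
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece, curve


# Ten predictions at 0.8 confidence, only half of which were hits.
ece, curve = expected_calibration_error([(0.8, True)] * 5 + [(0.8, False)] * 5)
```

Here the gap between stated confidence (0.8) and observed accuracy (0.5) gives an ECE of about 0.3, well above the 0.05 bound for well-calibrated, with confidence exceeding accuracy.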

Flywheel Tuning#

Dry-run query analysis with actionable remediation proposals. Accepts variadic Cypher query strings, compiles and executes each against an ephemeral graph, and produces optimization proposals.

CALL arcflow.flywheel.tune(
  'MATCH (p:Player) WHERE p.name = $name RETURN p',
  'CALL algo.pageRank()',
  'MATCH (e:Entity)-[:TRACKED]->(f:Frame) RETURN e, f'
) YIELD action, rationale, bounded

Three analysis passes:

Missing indexes — identifies label+property pairs without indexes in slow or failed queries. Proposes CREATE INDEX.
Cache pressure — detects high latency variance (p99/p50 ratio > 10x). Proposes CALL db.warmCache().
Failure remediation — maps error strings to fixes: unknown labels → db.labels(), constraint violations → MERGE, syntax errors → db.help().

Centrality#

Confidence-Weighted PageRank#

PageRank weighted by edge confidence values. Higher-confidence edges contribute more to rank propagation.

CALL algo.confidencePageRank()

Betweenness Centrality#

Measures how often a node lies on shortest paths between other nodes. Identifies bridges and brokers in the graph.

CALL algo.betweenness()

Column	Type	Description
`nodeId`	integer	Internal node ID
`name`	string	Node label or name property
`betweenness`	float	Betweenness centrality score

CREATE (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person {name: 'Bob'})
CREATE (b)-[:KNOWS]->(c:Person {name: 'Carol'})
CREATE (b)-[:KNOWS]->(d:Person {name: 'Dave'})
 
CALL algo.betweenness()

| nodeId | name  | betweenness |
|--------|-------|-------------|
| 2      | Bob   | 4.000000    |
| 1      | Alice | 0.000000    |
| 3      | Carol | 0.000000    |
| 4      | Dave  | 0.000000    |

Closeness Centrality#

Inverse of the average shortest path distance from a node to all reachable nodes. Higher values indicate more central position.

CALL algo.closeness()

Column	Type	Description
`nodeId`	integer	Internal node ID
`name`	string	Node label or name property
`closeness`	float	Closeness centrality score

Degree Centrality#

Count of relationships per node, optionally filtered by direction.

CALL algo.degreeCentrality()

Community Detection#

Louvain#

Hierarchical community detection via modularity optimization. GPU-accelerated on large graphs.

CALL algo.louvain()

Column	Type	Description
`nodeId`	integer	Internal node ID
`name`	string	Node label or name property
`community`	integer	Community assignment ID

-- Detect communities in a social network
CREATE (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person {name: 'Bob'})
CREATE (a)-[:KNOWS]->(c:Person {name: 'Carol'})
CREATE (d:Person {name: 'Dave'})-[:KNOWS]->(e:Person {name: 'Eve'})
 
CALL algo.louvain()

| nodeId | name  | community |
|--------|-------|-----------|
| 1      | Alice | 0         |
| 2      | Bob   | 0         |
| 3      | Carol | 0         |
| 4      | Dave  | 1         |
| 5      | Eve   | 1         |

Leiden#

Refined community detection that guarantees well-connected communities. Improvement over Louvain that avoids poorly connected intermediate communities.

CALL algo.leiden()

Returns the same schema as Louvain (nodeId, name, community).

Connected Components (WCC)#

Weakly connected components. Identifies disconnected subgraphs.

CALL algo.connectedComponents()

Column	Type	Description
`nodeId`	integer	Internal node ID
`name`	string	Node label or name property
`component`	integer	Component assignment ID

Community Detection (Label Propagation)#

Fast community detection using label propagation. Near-linear time complexity.

CALL algo.communityDetection()

Returns nodeId, name, and community columns. GPU-accelerated on large graphs.

K-Core Decomposition#

Computes the k-core number for each node. A k-core is a maximal subgraph where every node has degree >= k.

CALL algo.kCore()

Column	Type	Description
`nodeId`	integer	Internal node ID
`name`	string	Node label or name property
`core`	integer	Core number

Path Finding#

BFS Shortest Path#

Unweighted shortest path via breadth-first search. GPU-accelerated on large graphs.

MATCH p = shortestPath((a:Person {name: 'Alice'})-[*..10]->(b:Person {name: 'Dave'}))
RETURN p

Variable-length traversal with depth bounds:

MATCH (a:Person {name: 'Alice'})-[:KNOWS*1..5]->(b)
RETURN b.name

Dijkstra Shortest Path#

Weighted shortest path with optimized memory reuse. Pass the property key to use as edge weight.

CALL algo.dijkstra(startNodeId, endNodeId, 'weight')

Column	Type	Description
`path`	string	Ordered node IDs along the shortest path
`distance`	float	Total weighted path cost

A* Pathfinding#

Heuristic-guided shortest path. Faster than Dijkstra when a distance estimate is available — the heuristic guides search toward the goal. Pass a node property to use as the heuristic estimate.

CALL algo.astar(startNodeId, endNodeId, 'travel_time', 'estimated_distance')

Parameter	Description
`start`	Source node ID
`end`	Target node ID
`weight_key`	Edge property to use as cost
`heuristic_key`	Node property to use as heuristic estimate

Confidence-Weighted Shortest Path#

Shortest path weighted by edge confidence. Finds the most trusted route between two nodes.

CALL algo.confidencePath(1, 2)

Column	Type	Description
`path`	string	Ordered node IDs in the path
`confidence`	float	Aggregate path confidence

All-Pairs Shortest Path#

Computes shortest path distances between all node pairs. GPU-accelerated for large graphs.

CALL algo.allPairsShortestPath()

Column	Type	Description
`source`	integer	Source node ID
`target`	integer	Target node ID
`distance`	integer	Shortest path distance

Similarity#

Jaccard Similarity#

Available through node similarity computation. Measures overlap of neighbor sets between node pairs.
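For reference, the Jaccard measure itself is the size of the intersection of two neighbor sets divided by the size of their union. The helper below is a plain-Python sketch of that formula, not the engine's implementation; returning 0.0 when both sets are empty is an assumed convention.

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard similarity of two neighbour collections: |A & B| / |A | B|."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    # Assumed convention: two nodes with no neighbours at all score 0.0.
    return len(a & b) / len(union) if union else 0.0


# Alice knows Bob and Carol; Eve knows Carol and Dave: one shared neighbour of three.
similarity = jaccard({"Bob", "Carol"}, {"Carol", "Dave"})
```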

Node Similarity#

Pairwise similarity between all nodes based on shared neighbors.

CALL algo.similarNodes()

Clustering Coefficient#

Local clustering coefficient for each node. Measures how close a node's neighbors are to forming a clique.

CALL algo.clusteringCoefficient()

Column	Type	Description
`nodeId`	integer	Internal node ID
`name`	string	Node label or name property
`coefficient`	float	Local clustering coefficient (0.0 to 1.0)

GPU-accelerated: dispatches to CUDA/Metal when graph exceeds 50 nodes.

Graph Metrics#

Triangle Count#

Counts the number of triangles in the graph. GPU-accelerated on large graphs.

CALL algo.triangleCount()

Column	Type	Description
`triangles`	integer	Total triangle count

Density#

Graph density: ratio of actual edges to possible edges.

CALL algo.density()

Column	Type	Description
`density`	float	Graph density (0.0 to 1.0)

Diameter#

Longest shortest path in the graph.

CALL algo.diameter()

Column	Type	Description
`diameter`	integer	Graph diameter
`source_id`	integer	Source node of the longest path
`target_id`	integer	Target node of the longest path

Biconnected Components#

Biconnected components — maximal subgraphs with no articulation points. O(V+E) time complexity.

CALL algo.biconnectedComponents()

Column	Type	Description
`nodeId`	integer	Internal node ID
`component_id`	integer	Biconnected component assignment

Embedding Algorithms#

Node2Vec#

Generate node embeddings via structural random walk. Captures structural position in the graph.

CALL algo.node2vec(128, 80, 10)

Parameter	Description
`dimensions`	Embedding size (e.g., 128)
`walk_length`	Steps per random walk (e.g., 80)
`num_walks`	Walks per node (e.g., 10)

Sets node.embedding on every node. Use with vector search for structural similarity.

Struc2Vec#

Structural equivalence embeddings — nodes with similar local structure get similar vectors, regardless of where they are in the graph.

CALL algo.struc2vec(64)

GraphSAGE#

Inductive node embedding via neighborhood aggregation. Handles unseen nodes (no retraining needed).

CALL algo.graphSAGE(128)

Stale Embedding Detection#

Find nodes whose embeddings are out of date (property changed after last embedding).

CALL algo.staleEmbeddings()

Returns (nodeId, name) for nodes needing re-embedding.

Embedding Classification#

Classify nodes by similarity to labeled examples using a vector index.

CALL algo.classify($embedding, 'index_name', {k: 10})

Returns (nodeId, label, confidence).

Knowledge Graph Algorithms#

Entity Resolution#

Find nodes that refer to the same real-world entity despite different IDs or properties.

CALL algo.entityResolution()

Returns (nodeId1, nodeId2, similarity) pairs above the resolution threshold.

Fact Contradiction Detection#

Find pairs of facts in the graph that contradict each other based on predicates and confidence.

CALL algo.factContradiction()

Returns (nodeId1, nodeId2, contradiction_type, confidence).

Relationship Strength#

Score how strong a relationship is based on frequency, recency, and mutual connections.

CALL algo.relationshipStrength()

Returns (fromId, toId, strength) for all relationships.

Compounding Score#

Composite confidence score that propagates through chains of facts.

CALL algo.compoundingScore()

Entity Freshness#

How recently each entity was observed or corroborated. Low freshness = stale facts.

CALL algo.entityFreshness()

Returns (nodeId, name, freshness, last_observed_at).

Semantic Deduplication#

Find and score near-duplicate nodes using embedding similarity.

CALL algo.semanticDedup()

Multi-Source Disagreement#

This table-valued function (TVF) is the natural composition partner for Causal Reasoning: once disagreement is detected, causalLineage walks back to the observations that produced each conflicting claim.

CALL arcflow.multi_source_disagreement(
  entity_label:          "Charting",       -- the typed entity carrying source-reported values
  group_property:        "play_id",        -- groups rows that should agree
  source_property:       "source",         -- distinguishes reporters
  value_property:        "run_pass",       -- the column being resolved
  prior_weight_property: "_confidence",    -- per-source trust weight
  disagreement_kind:     "categorical",    -- "categorical" | "numeric" | "spatial"
  agreement_threshold:   1.0               -- mode-specific (see below)
)
YIELD source, value, agreement_class, group_consensus, dispute_score
RETURN group_consensus, dispute_score, source, value, agreement_class

Three modes#

Mode	Algorithm	`dispute_score` semantics
`categorical`	Weighted-majority over distinct values; Shannon entropy of the weighted distribution, normalized to `[0, 1]`	`0.0` when all sources agree; `1.0` at maximum disagreement (entropy ≈ log₂(n_sources))
`numeric`	Weighted median across reported values; max pairwise delta among reported values	`clamp(max_pairwise_delta / agreement_threshold, 0.0, 1.0)`
`spatial`	Weiszfeld geometric median (iterative; converges on the L1-optimal 3D point); max pairwise Euclidean distance	Same clamped-delta shape as numeric, against the 3D distance

Output rows#

Each row carries:

Column	What it is
`source`	The reporter this row is about (or `null` for the group-level consensus row).
`value`	The value this source reported (categorical: the value; numeric: the float; spatial: the 3D point as WKT).
`agreement_class`	One of `consensus` (all sources agree within threshold), `outlier` (this source disagrees with the majority), `disputed` (no clear majority), `single_source` (only one source reported).
`group_consensus`	The resolved value for the group — the weighted-majority winner (categorical), the weighted median (numeric), or the geometric median (spatial).
`dispute_score`	`[0.0, 1.0]` per the mode-specific formula above.

Worked example — NFL play-call resolution#

Four vendors charted the same play; three say "run", one says "pass":

CALL arcflow.multi_source_disagreement(
  entity_label: "Charting",
  group_property: "play_id",
  source_property: "source",
  value_property: "run_pass",
  prior_weight_property: "_confidence",
  disagreement_kind: "categorical"
) YIELD source, value, agreement_class, group_consensus, dispute_score
WHERE source IS NOT NULL
RETURN source, value, agreement_class, dispute_score
ORDER BY dispute_score DESC

Output (illustrative):

[
  {"source": "ngs",       "value": "run",  "agreement_class": "consensus", "dispute_score": 0.0},
  {"source": "pff",       "value": "run",  "agreement_class": "consensus", "dispute_score": 0.0},
  {"source": "pro_fb",    "value": "run",  "agreement_class": "consensus", "dispute_score": 0.0},
  {"source": "xos",       "value": "pass", "agreement_class": "outlier",   "dispute_score": 0.8}
]

The group_consensus row reports value: "run", dispute_score: 0.4 (a single dissenter pulls the entropy up but not to the disputed threshold).
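The 0.4 group score can be reproduced by hand. The sketch below assumes all four vendors carry equal prior weight and that the entropy is normalized by log₂(n_sources), as the categorical row of the modes table suggests; the function name is hypothetical and the engine's exact normalizer may differ.

```python
import math


def normalized_entropy(weighted_votes, n_sources):
    """Shannon entropy of a weighted vote distribution, scaled by log2(n_sources)."""
    if n_sources < 2:
        return 0.0
    total = sum(weighted_votes.values())
    entropy = sum(
        (w / total) * math.log2(total / w)   # p * log2(1 / p)
        for w in weighted_votes.values()
        if w > 0
    )
    return entropy / math.log2(n_sources)


# Three vendors say "run", one says "pass", each with weight 1.0.
group_score = normalized_entropy({"run": 3.0, "pass": 1.0}, n_sources=4)
unanimous = normalized_entropy({"run": 4.0}, n_sources=4)
```

The raw entropy of a 3:1 split is about 0.811 bits; dividing by log₂(4) = 2 gives roughly 0.41, which rounds to the 0.4 shown for the group row.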

Why this matters#

Causal Reasoning#

Causal Lineage#

BFS over CAUSED_BY edges from a starting node, emitting every node visited with its hop distance, parent pointer, and cumulative confidence (the product of per-node _confidence along the path).

MATCH (s:Sack {play_id: 'X-vs-Y-q2-3:47'})
CALL arcflow.causalLineage(start_node: id(s), depth: 4)
YIELD node_id, hop, parent_id, node_label, node_confidence,
      cumulative_confidence, direction
RETURN hop, node_label, cumulative_confidence
ORDER BY hop, cumulative_confidence DESC

Parameters:

Parameter	Default	What it does
`start_node`	required	Node ID to walk from.
`depth`	required	Maximum BFS depth in hops.
`direction`	`"upstream"` (alias: `"causes"`)	Walk outgoing `CAUSED_BY` edges toward causes (`"upstream"`), or incoming edges toward consequences (`"downstream"`; aliases: `"caused"`, `"consequences"`).

Returns: one row per visited node with the visit metadata.

Use cases:

Audit trail for any inferred or predicted fact ("show me what justified this Sack designation, four hops back").
Confidence floor for derived facts (the cumulative-confidence column shows how much trust survives the inference chain).
Counterfactual base — pair with AS OF to ask "what would have been justified at time T?"
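The traversal described above can be sketched in a few lines of Python: a depth-bounded BFS over an adjacency map of CAUSED_BY edges that carries the running confidence product along each path. The data layout and function name are hypothetical, and two details are assumptions rather than documented behavior: the start node's own confidence is included in the product, and a node reached by several paths is reported once, on first visit.

```python
from collections import deque


def causal_lineage(caused_by, confidence, start, depth):
    """Return (node, hop, parent, cumulative_confidence) rows in BFS order.

    caused_by:  dict mapping a node to the list of nodes it was caused by
    confidence: dict mapping a node to its own confidence in [0, 1]
    """
    rows = [(start, 0, None, confidence[start])]
    seen = {start}
    queue = deque(rows)
    while queue:
        node, hop, _parent, cumulative = queue.popleft()
        if hop == depth:
            continue  # depth bound reached; do not expand further
        for cause in caused_by.get(node, []):
            if cause in seen:
                continue
            seen.add(cause)
            row = (cause, hop + 1, node, cumulative * confidence[cause])
            rows.append(row)
            queue.append(row)
    return rows


edges = {"sack": ["pressure", "coverage"], "pressure": ["missed_block"]}
conf = {"sack": 1.0, "pressure": 0.9, "coverage": 0.5, "missed_block": 0.8}
rows = causal_lineage(edges, conf, "sack", depth=4)
```

In this toy graph the two-hop cause missed_block ends with a cumulative confidence of 0.9 × 0.8 = 0.72, which is the "how much trust survives the chain" reading of the cumulative_confidence column.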

Causal Path#

MATCH (sack:Sack), (snap:Snap)
WHERE sack.id = $sack_id AND snap.id = $snap_id
CALL arcflow.causalPath(from: id(sack), to: id(snap), depth: 8)
YIELD hop, node_id, node_label, path_found
RETURN hop, node_label
ORDER BY hop

Parameters:

Parameter	Default	What it does
`from`	required	Source node ID.
`to`	required	Target node ID.
`depth`	required	Maximum BFS depth in hops.
`direction`	`"upstream"`	Same direction semantics as `causalLineage`.

Returns: the discovered path as (hop, node_id, node_label, path_found) rows, or path_found="false" when no chain exists.

Use cases:

Targeted "why does X depend on Y?" queries when you have specific endpoints.
Disambiguating between multiple plausible causal chains by picking the shortest.
Failure-mode diagnostics — causalPath from a low-confidence fact back to a known-bad source.

Counterfactual Branching#

Branch At#

Create a named branch of the World Graph at a specific WAL sequence point.

CALL arcflow.counterfactual.branchAt(name: 'rollout-1', seq: 42)
YIELD branch, base_seq, status

Returns one row: {branch: "rollout-1", base_seq: "42", status: "created"}.

Argument forms (all equivalent):

Named: name: 'r1', seq: 42 or name = 'r1', seq = 42
Positional: 'r1', 42
Single or double quotes on the name

Boundary errors (all typed, with recovery_suggestion payloads):

Error code	Cause
`COUNTERFACTUAL_BRANCH_AT_MISSING_NAME`	No `name` argument provided
`COUNTERFACTUAL_BRANCH_AT_MISSING_SEQ`	No `seq` argument provided
`COUNTERFACTUAL_BRANCH_AT_EMPTY_NAME`	Empty string passed as `name`
`COUNTERFACTUAL_BRANCH_AT_UNKNOWN_ARG`	Unexpected kwarg

Fan-out swarm pattern#

Counterfactual exploration usually fans out: N rollouts at the same base seq, each with a different hypothetical edit, scored independently, then dropped or kept based on the score:

def counterfactual_swarm(db, play_id: int, n_rollouts: int = 10) -> list:
    base_seq = db.current_seq()
    rollouts = []
 
    # Fan out N branches at the same base seq.
    for i in range(n_rollouts):
        rollout_id = f"rollout-{play_id}-{i}"
        db.execute(
            f"CALL arcflow.counterfactual.branchAt("
            f"name: '{rollout_id}', seq: {base_seq})"
        )
        rollouts.append(rollout_id)
 
    # Per-rollout scenario edit (your domain logic) + scoring.
    results = []
    for rollout_id in rollouts:
        # Apply the counterfactual edit via AS BRANCH-targeted execute.
        # Then score.
        score = db.execute(
            f"AS BRANCH '{rollout_id}' "
            f"MATCH (p:Play {{play_id: {play_id}}}) "
            f"CALL algo.confidencePageRank('SHADOWS') YIELD rank "
            f"RETURN sum(rank) AS score"
        )
        results.append((rollout_id, score.rows[0]['score']))
 
    # Cleanup.
    for rollout_id in rollouts:
        db.execute(f"DROP BRANCH '{rollout_id}'")
 
    return results

Why a CALL surface (not just DDL)#

The DDL form (BRANCH AT SEQ 42 AS 'rollout-1') and the CALL form return the same branch. The CALL form exists because:

It composes with YIELD — CALL … YIELD branch, base_seq, status returns a typed row that downstream Cypher can consume directly (filter, transform, drive a MATCH against the branch).
It composes with Python orchestration — db.execute("CALL …") is the natural shape for swarm fan-out where the rollout names are computed in code.
It composes with EXPLAIN / PROFILE — the CALL surface is a planner-level operator with introspectable cost.
Typed errors with recovery_suggestion — the DDL form returns parse errors; the CALL form returns the structured boundary errors named above.

Use cases#

Sports analytics counterfactual swarm — "what if the QB had thrown to receiver Y instead of Z at frame 1024?" Fan out N branches at the snap frame, each with a different receiver target, score by composite pressure / completion-probability / yards-after-catch heuristics, surface the highest-EV alternative.
Sensor-failure replay — branch at the sensor-loss event, replay with the lost sensor's last-known reading held constant, compare downstream fusion outputs against the canonical timeline to bound the failure's blast radius.
Decision audit — when a low-confidence inferred fact lands, branch backward, apply the alternative inference, compare what would have followed. Pair with arcflow.causalLineage to trace which downstream facts depended on the original inference.
A/B replay against historical data — branch at a historical seq, apply the alternative policy, replay the same observation stream against the modified state, compare metrics.

GraphRAG Pipeline#

GraphRAG#

Full retrieval-augmented generation pipeline: query decomposition, graph traversal, evidence collection with confidence scoring.

CALL algo.graphRAG('What connections exist between Alice and the project?')

GraphRAG Context#

Retrieves structured context from the graph for a natural language query, with token budgeting for LLM context windows.

CALL algo.graphRAGContext('Tell me about Alice')

Trusted GraphRAG#

Confidence-filtered GraphRAG — only returns evidence above the trust threshold. Facts outrank inferences outrank predictions.

CALL algo.graphRAGTrusted('high-confidence relationships for Alice')

Sensor Fusion#

Weighted Centroid#

CALL algo.fusion.weighted_centroid('PlayerObs', 'pos_world', 'confidence')
YIELD x, y, z, var_x, var_y, var_z, total_weight, n

Arguments:

Arg	Type	Meaning
`label`	string	Node label to scan
`posProp`	string	Property carrying `Point3d` / `Vector3d` (also accepts a 2D `Point` — treated as z=0)
`weightProp`	string	Property carrying `Float` or `Int` (numeric weight; must be ≥ 0)

Errors:

Code	Cause
`EMPTY_SAMPLE_SET`	zero usable nodes after the property gate
`WEIGHT_NOT_NUMERIC`	a `weightProp` value isn't `Float` or `Int`
`NEGATIVE_WEIGHT`	a weight < 0
`INVALID_WEIGHT_SUM`	Σwᵢ ≤ 0
`NAN_IN_INPUT`	NaN slipped through the property filter

From Python, the same procedure can be called through db.execute:

from arcflow import ArcFlow
db = ArcFlow()
row = db.execute(
    "CALL algo.fusion.weighted_centroid('PlayerObs', 'pos_world', 'confidence')"
).rows[0]

The pure-function algebra (weighted_centroid_2d, weighted_centroid_3d, weighted_centroid_nd) is also exposed on the Rust crate as arcflow_core::algo_fusion.
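For intuition, a weighted centroid with the output columns listed above can be sketched in pure Python. This is not the arcflow_core implementation: the function name is reused only for readability, the variance is assumed to be the weighted population variance Σwᵢ(xᵢ − μ)² / Σwᵢ, and the error codes are raised as plain ValueError messages.

```python
import math


def weighted_centroid_3d(points, weights):
    """points: list of (x, y, z) tuples; weights: one non-negative number per point."""
    if not points:
        raise ValueError("EMPTY_SAMPLE_SET")
    for w in weights:
        if isinstance(w, bool) or not isinstance(w, (int, float)):
            raise ValueError("WEIGHT_NOT_NUMERIC")
        if w < 0:
            raise ValueError("NEGATIVE_WEIGHT")
    if any(math.isnan(v) for p in points for v in p) or any(math.isnan(w) for w in weights):
        raise ValueError("NAN_IN_INPUT")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("INVALID_WEIGHT_SUM")

    pairs = list(zip(points, weights))
    mean = [sum(w * p[i] for p, w in pairs) / total for i in range(3)]
    # Assumed definition: weighted population variance per axis.
    var = [sum(w * (p[i] - mean[i]) ** 2 for p, w in pairs) / total for i in range(3)]
    return {
        "x": mean[0], "y": mean[1], "z": mean[2],
        "var_x": var[0], "var_y": var[1], "var_z": var[2],
        "total_weight": total, "n": len(points),
    }


# Two observations on the x axis; the second is trusted three times as much.
fused = weighted_centroid_3d([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 3.0])
```

The fused x lands at 1.5, pulled toward the higher-weight observation, with var_x = 0.75 describing how far the inputs spread around it.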

Trajectory Analytics#

Nearest at Frame#

Top-k entities by 2D distance from a query point at a specific frame. The single-frame slice of a "who was closest when" question.

CALL arcflow.trajectory.nearestAtFrame(
  entity_label: "Player",
  frame_property: "frame",
  x_property: "x",
  y_property: "y",
  frame: 1024,
  qx: 50.0, qy: 23.5,
  k: 5
) YIELD other_node_id, other_label, distance, frame

Leverage Gain#

CALL arcflow.trajectory.leverageGain(
  entity_label: "Player",
  chaser_filter_property: "player_id", chaser_filter_value: 28,
  target_filter_property: "player_id", target_filter_value: 12
) YIELD frame, delta

Release Point#

Single-row event detection: the frame at which forward x-velocity peaks. A throw-release heuristic for quarterbacks; generalises to any "fastest forward motion" event over a 2D trajectory.

CALL arcflow.trajectory.releasePoint(
  entity_label: "Player",
  filter_property: "player_id", filter_value: 5
) YIELD frame
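The heuristic can be sketched as a finite-difference scan over one entity's (frame, x) samples. The function below is a hypothetical illustration, not the engine kernel: it assumes velocity is the x change per frame between consecutive samples, attributes each velocity to the later frame, and keeps the earliest frame on ties.

```python
def release_point(samples):
    """samples: list of (frame, x) pairs sorted by frame.

    Returns the frame where forward x-velocity peaks, or None when fewer
    than two usable samples exist (mirroring "at most one row").
    """
    best_frame = None
    best_velocity = float("-inf")
    for (f0, x0), (f1, x1) in zip(samples, samples[1:]):
        if f1 == f0:
            continue  # duplicate frame: no time elapsed, skip
        velocity = (x1 - x0) / (f1 - f0)
        if velocity > best_velocity:
            best_frame, best_velocity = f1, velocity
    return best_frame


# x advances by 1, then 2, then 1 units per frame: the peak is at frame 2.
frame = release_point([(0, 0.0), (1, 1.0), (2, 3.0), (3, 4.0)])
```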

Shadowed By#

CALL arcflow.trajectory.shadowedBy(
  entity_label: "Player",
  attacker_filter_property: "player_id", attacker_filter_value: 5,
  target_filter_property: "player_id",   target_filter_value: 12,
  defender_filter_property: "player_id", defender_filter_value: 28,
  angle_tol_rad: 0.1
) YIELD frame

NFL coverage descriptor — composing all four#

A single Cypher block computes a complete per-play coverage descriptor: release frame, nearest defender at snap, line-of-sight shadow, and closing speed:

// 1. Find QB's nearest defender at snap (frame 1024)
CALL arcflow.trajectory.nearestAtFrame(
  entity_label: "Player", frame_property: "frame",
  x_property: "x", y_property: "y",
  frame: 1024, qx: 30.0, qy: 22.0, k: 3
) YIELD other_node_id AS defender_id, distance
 
// 2. Compute the throw release frame for the QB
CALL arcflow.trajectory.releasePoint(
  entity_label: "Player",
  filter_property: "player_id", filter_value: 5
) YIELD frame AS release_frame
 
// 3. Check if CB28 shadowed the throw line to WR12
CALL arcflow.trajectory.shadowedBy(
  entity_label: "Player",
  attacker_filter_property: "player_id", attacker_filter_value: 5,
  target_filter_property: "player_id",   target_filter_value: 12,
  defender_filter_property: "player_id", defender_filter_value: 28,
  angle_tol_rad: 0.1
) YIELD frame AS shadow_frame
 
// 4. Closing speed between CB28 + WR12 over the same window
CALL arcflow.trajectory.leverageGain(
  entity_label: "Player",
  chaser_filter_property: "player_id", chaser_filter_value: 28,
  target_filter_property: "player_id",   target_filter_value: 12
) YIELD frame, delta AS closing

Why a separate procedure family#

Trajectory analytics has different shape from generic graph algorithms:

Inputs are dense per-entity trajectories, not graph adjacency. The kernels operate on Sample2[frame, x, y] arrays; the graph layer's job is selecting which entities' arrays to feed in via label + property filters.
Output is per-frame or per-event, not per-node. Most return one row per matched frame; releasePoint returns at most one row total.
DenseStore-aware: reads go through the same store.effective_property path that bulk-create-from-Arrow ingestion writes to, so vectorised ingestion and trajectory analytics share the same column layouts.
Typed errors for the predictable failure modes: missing entity_label, missing filter_property, missing filter_value all return validation errors with recovery suggestions, never silent zero-row results.

Streaming functions#

Function	Definition	First-output offset
`lag(value, n)`	the value from `n` feeds ago	n+1
`lead(value, n)`	retroactive output for event `F-n` when feed `F` arrives	n+1
`delta(value)`	`current - previous` (shorthand for `lag(value, 1)`)	2

Today these are reachable via the Rust + C ABI surface (register_stream_fn

feed_stream_fn). A Python wrapper is shipped under arcflow.stream_fn; the Cypher window-expression surface (RETURN lag(p.pan, 1) OVER (ORDER BY p.captured_at)) is follow-up work.

Rust#

use arcflow::stream_fn::{StreamFnKind};
 
store.register_stream_fn("pan_delta", StreamFnKind::Delta)?;
for (pan, ts) in samples {
    let outputs = store.feed_stream_fn("pan_delta", pan)?;
    if let Some(out) = outputs.first() {
        let pan_rate = out.fn_value / dt; // dt from a sibling Delta on the timestamp
    }
}

Python#

from arcflow import ArcFlow
from arcflow.stream_fn import StreamFn, Kind
 
db = ArcFlow()
pan_delta = StreamFn(db, "pan_delta", Kind.Delta)
ts_delta  = StreamFn(db, "ts_delta",  Kind.Delta)
 
for pan, ts in samples:
    pan_out = pan_delta.feed(pan)
    ts_out  = ts_delta.feed(ts)
    if pan_out and ts_out:
        pan_rate = pan_out[0].fn_value / ts_out[0].fn_value

State is computational — NOT WAL-journaled. Callers re-feed pending events from the source (typically a topic whose events ARE journaled) after restart.

Incremental Algorithms (LIVE CALL)#

-- Register incremental PageRank -- updates automatically on graph mutations
LIVE CALL algo.pageRank({ damping: 0.85 })
 
-- Register incremental community detection
LIVE CALL algo.connectedComponents()
LIVE CALL algo.louvain()
 
-- Register incremental pathfinding and metrics
LIVE CALL algo.bfs({ source: 42 })
LIVE CALL algo.triangleCount()
LIVE CALL algo.clusteringCoefficient()

Inspecting live algorithms#

-- List all running incremental algorithms
CALL db.liveAlgorithms()
 
-- Check a specific algorithm's memory and timing
CALL db.viewStats('pageRank')

Algorithm Performance Summary#

Algorithm	Acceleration	GPU Dispatch Threshold
PageRank	GPU-accelerated	500 nodes
BFS Shortest Path	GPU-accelerated	100 nodes
Triangle Count	GPU-accelerated	100 nodes
Community Detection	GPU-accelerated	200 nodes
Louvain	GPU-accelerated	200 nodes
Vector Search	GPU-accelerated	100 vectors
All-Pairs Shortest Path	GPU-accelerated	500 nodes
Clustering Coefficient	GPU-accelerated	100 nodes
Node Similarity	GPU-accelerated	100 nodes
Leiden	CUDA compute capability 9.0+ (H100 or newer) only	10M nodes

GPU dispatch is handled by ArcFlow Adaptive Dispatch — which measures graph size at query time and routes to the fastest available backend automatically. No configuration required.
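The size-based routing can be pictured with a short Python sketch. The thresholds are copied from the table above; everything else is an assumption made for illustration: the `choose_backend` function, the `"gpu"`/`"cpu"` labels, and treating the threshold as an inclusive lower bound. Leiden is left out because the table lists a hardware requirement for it as well as a size threshold.

```python
# Thresholds from the performance table (nodes, or vectors for Vector Search).
GPU_THRESHOLDS = {
    "PageRank": 500,
    "BFS Shortest Path": 100,
    "Triangle Count": 100,
    "Community Detection": 200,
    "Louvain": 200,
    "Vector Search": 100,
    "All-Pairs Shortest Path": 500,
    "Clustering Coefficient": 100,
    "Node Similarity": 100,
}


def choose_backend(algorithm: str, size: int, gpu_available: bool) -> str:
    """Pick a backend from graph size; hypothetical stand-in for the dispatcher."""
    threshold = GPU_THRESHOLDS.get(algorithm)
    # Assumption: the threshold is inclusive, and unknown algorithms stay on CPU.
    if gpu_available and threshold is not None and size >= threshold:
        return "gpu"
    return "cpu"


small = choose_backend("PageRank", 499, gpu_available=True)
large = choose_backend("PageRank", 500, gpu_available=True)
no_gpu = choose_backend("PageRank", 10_000, gpu_available=False)
```

Small graphs stay on the CPU in this sketch because kernel launch and transfer overhead typically outweigh the parallel speedup below some size.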

To measure throughput on your hardware, run cargo bench --bench algo from the ozinc/arcflow repo.

The parallel execution itself runs through the ArcFlow Graph Kernel — a single-pass parallel structure that maps directly to GPU thread blocks rather than walking the graph one edge at a time.

Six algorithms also support LIVE CALL for incremental maintenance -- see the Incremental Algorithms section above.

Prediction Quality#

Two procedures for monitoring prediction accuracy against observed ground truth.

Prediction Drift#

Compares predicted positions against observed entity positions within a time horizon. One row per matching :Prediction node.

CALL algo.predictionDrift('robot-01', 2000)
  YIELD entity_id, horizon_ms,
        predicted_x, observed_x, delta_x,
        predicted_y, observed_y, delta_y,
        delta_magnitude, prediction_source,
        observation_seq, predicted_seq
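As a rough guide to how the delta columns relate to each other, the following Python sketch derives them from one predicted and one observed position. The sign convention (observed minus predicted) and the use of Euclidean distance for delta_magnitude are assumptions; the source does not state either, and `drift_row` is a hypothetical helper.

```python
import math
from typing import Dict


def drift_row(predicted_x: float, predicted_y: float,
              observed_x: float, observed_y: float) -> Dict[str, float]:
    """Build the delta columns for one prediction/observation pair."""
    delta_x = observed_x - predicted_x  # assumption: observed minus predicted
    delta_y = observed_y - predicted_y
    return {
        "predicted_x": predicted_x, "observed_x": observed_x, "delta_x": delta_x,
        "predicted_y": predicted_y, "observed_y": observed_y, "delta_y": delta_y,
        # Assumption: magnitude is the straight-line distance between the points.
        "delta_magnitude": math.hypot(delta_x, delta_y),
    }


row = drift_row(predicted_x=1.0, predicted_y=2.0, observed_x=4.0, observed_y=6.0)
```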

Confidence Calibration#

Expected Calibration Error (ECE) for an entity's predictions. Measures whether predicted confidence scores match observed accuracy.

CALL algo.confidenceCalibration('robot-01', '1h')
  YIELD ece_score, sample_count, calibration_curve, calibration_status

`calibration_status`	Meaning
`insufficient-data`	Not enough predictions for a reliable ECE score
`well-calibrated`	ECE < 0.05 — predictions are trustworthy
`overconfident`	Model assigns higher confidence than observed accuracy justifies
`underconfident`	Model assigns lower confidence than observed accuracy justifies

The window argument accepts duration strings such as "1h", "30m", or "24h". A prediction agrees with an observation when it lands within 1.0 meter, and ECE is computed over 10 confidence buckets.
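For readers unfamiliar with ECE, here is a minimal Python sketch of the standard bucketed formula using the 1.0 meter tolerance, 10 buckets, and 0.05 threshold stated above. It is not the procedure's source: the function names, the `min_samples` cutoff for `insufficient-data`, and the rule that splits overconfident from underconfident by the sign of the weighted gap are all assumptions.

```python
from typing import List, Tuple


def expected_calibration_error(samples: List[Tuple[float, float]],
                               buckets: int = 10,
                               tolerance_m: float = 1.0) -> Tuple[float, float]:
    """samples are (confidence in [0, 1], error distance in meters).

    Returns (ece, signed_gap); signed_gap > 0 means confidence exceeds accuracy.
    """
    bins = [[0, 0.0, 0] for _ in range(buckets)]  # count, confidence sum, hits
    for confidence, error_m in samples:
        index = min(int(confidence * buckets), buckets - 1)
        bins[index][0] += 1
        bins[index][1] += confidence
        bins[index][2] += 1 if error_m <= tolerance_m else 0  # within tolerance
    total = len(samples)
    ece = signed_gap = 0.0
    for count, confidence_sum, hits in bins:
        if count == 0:
            continue
        gap = confidence_sum / count - hits / count  # mean confidence - accuracy
        ece += count / total * abs(gap)
        signed_gap += count / total * gap
    return ece, signed_gap


def calibration_status(ece: float, signed_gap: float, sample_count: int,
                       min_samples: int = 30) -> str:
    # min_samples is a hypothetical cutoff; the source does not give one.
    if sample_count < min_samples:
        return "insufficient-data"
    if ece < 0.05:
        return "well-calibrated"
    return "overconfident" if signed_gap > 0 else "underconfident"


# Ten predictions at 0.9 confidence, only half of them within 1.0 m.
samples = [(0.9, 0.5)] * 5 + [(0.9, 3.0)] * 5
ece, signed_gap = expected_calibration_error(samples)
status = calibration_status(ece, signed_gap, len(samples), min_samples=10)
```

In this example the model claims 90% confidence but is right half the time, so the gap of roughly 0.4 is well above the 0.05 threshold.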

Flywheel Tuning#

Dry-run query analysis with actionable remediation proposals. Accepts variadic Cypher query strings, compiles and executes each against an ephemeral graph, and produces optimization proposals.

CALL arcflow.flywheel.tune(
  'MATCH (p:Player) WHERE p.name = $name RETURN p',
  'CALL algo.pageRank()',
  'MATCH (e:Entity)-[:TRACKED]->(f:Frame) RETURN e, f'
) YIELD action, rationale, bounded

Three analysis passes:

Missing indexes — identifies label+property pairs without indexes in slow or failed queries. Proposes CREATE INDEX.
Cache pressure — detects high latency variance (p99/p50 ratio > 10x). Proposes CALL db.warmCache().
Failure remediation — maps error strings to fixes: unknown labels → db.labels(), constraint violations → MERGE, syntax errors → db.help().
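The cache-pressure pass above reduces to a ratio test on latency percentiles. The Python sketch below shows one way that check might look; the nearest-rank percentile method and the `cache_pressure_proposal` helper are assumptions, while the 10x ratio and the db.warmCache() proposal come from the description above.

```python
import math
from typing import List, Optional


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cache_pressure_proposal(latencies_ms: List[float],
                            max_ratio: float = 10.0) -> Optional[str]:
    """Propose a cache warm-up when tail latency dwarfs the median."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    if p50 > 0 and p99 / p50 > max_ratio:
        return "CALL db.warmCache()"
    return None


steady = [2.0] * 100                 # uniform latency: no proposal expected
spiky = [2.0] * 98 + [50.0, 60.0]    # cold-cache outliers inflate the tail
steady_action = cache_pressure_proposal(steady)
spiky_action = cache_pressure_proposal(spiky)
```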
