GPU Acceleration
ArcFlow provides three distinct execution innovations for high-performance world model queries:
- ArcFlow Graph Kernel — executes graph algorithms as a single parallel pass across all nodes simultaneously, not as sequential edge traversals
- ArcFlow Adaptive Dispatch — routes every operation to the fastest available hardware at runtime, cost-model driven, zero configuration
- ArcFlow GPU Index — a pointer-free spatial index designed for direct GPU traversal without transformation
The developer writes one query. These three layers work together to pick the fastest path.
CALL algo.pageRank()
ArcFlow Graph Kernel#
Most graph databases walk the graph one edge at a time: visit a node, follow an edge, visit the next. That is sequential, cache-hostile, and does not map to GPU hardware.
The ArcFlow Graph Kernel processes algorithms differently. The world model is held as a compact parallel structure — every algorithm executes as a single pass across all nodes simultaneously, not a recursive walk. PageRank, BFS, connected components, community detection, triangle counting — each runs as one parallel operation. This maps directly to GPU thread blocks and enables the speedups below.
The same kernel runs on CPU when no GPU is present. The parallel structure is inherently more efficient than pointer-chasing traversal on any hardware.
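The "one pass across all nodes" idea can be sketched as level-synchronous BFS over a flat CSR adjacency layout. The function and variable names below are illustrative, not ArcFlow internals; on a GPU, the inner loops become one thread per frontier edge, while here they are plain Python loops.

```python
def bfs_levels(offsets, targets, source):
    """Neighbors of node v are targets[offsets[v]:offsets[v+1]] (CSR form).
    Each while-iteration expands the entire frontier at once, instead of
    walking the graph one edge at a time."""
    n = len(offsets) - 1
    level = [-1] * n          # -1 marks "not yet visited"
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:    # parallel across frontier nodes on a GPU
            for u in targets[offsets[v]:offsets[v + 1]]:
                if level[u] == -1:   # first visit wins
                    level[u] = depth
                    next_frontier.append(u)
        frontier = next_frontier
    return level
```

Because the adjacency lives in two flat arrays rather than linked node objects, the same layout serves both the CPU path and GPU thread blocks.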
ArcFlow Adaptive Dispatch#
ArcFlow Adaptive Dispatch measures available hardware at startup and routes each operation to the fastest available backend based on a live cost model:
- Small graphs (< 200 nodes) — CPU path (dispatch overhead exceeds GPU benefit at this scale)
- Apple Silicon (macOS / iOS) — Metal GPU. Unified memory means no CPU→GPU copy overhead — CPU and GPU read the same physical memory.
- NVIDIA GPU (Linux / Windows) — CUDA GPU. Dynamic driver loading — no compile-time GPU dependency, same binary runs everywhere.
- CPU fallback — ArcFlow's parallel CPU implementation when no GPU is present.
The cost model accounts for kernel launch overhead, memory bandwidth per device, and algorithm parallelism characteristics. There are no hardcoded thresholds — routing adapts to the actual hardware measured at runtime.
Zero configuration. The same GQL query runs identically on a laptop, a workstation, or a GPU cluster.
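A cost-model dispatch of this shape can be sketched as a simple wall-time comparison. The rates, overheads, and function name below are illustrative placeholders, not ArcFlow's measured values:

```python
def pick_backend(n_nodes, bytes_per_node, cpu_rate, gpu_rate,
                 gpu_launch_us=20.0, pcie_gb_s=25.0, unified_memory=False):
    """Estimate wall time per backend and pick the cheaper one.
    cpu_rate / gpu_rate are nodes processed per microsecond (assumed inputs).
    On unified-memory hardware the host-to-device transfer term vanishes."""
    cpu_us = n_nodes / cpu_rate
    transfer_us = 0.0 if unified_memory else \
        (n_nodes * bytes_per_node) / (pcie_gb_s * 1e3)  # GB/s -> bytes/us
    gpu_us = gpu_launch_us + transfer_us + n_nodes / gpu_rate
    return "cpu" if cpu_us <= gpu_us else "gpu"
```

At small sizes the fixed launch and transfer terms dominate and the CPU wins; as the graph grows, the per-node throughput advantage takes over, which is why no hardcoded threshold is needed.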
Benchmark Results#
Performance measured against ArcFlow's own CPU path on the same hardware:
| Algorithm | CPU Throughput | GPU Speedup |
|---|---|---|
| PageRank | 154M nodes/sec | 2.4x |
| BFS Frontier | 6.3M edges/sec | 3.5x |
| Vector Distance | 25K queries/sec | 4.2x |
| Triangle Count | 943K nodes/sec | 19.8x |
| Community Detection | 185K nodes/sec | 29.6x |
Speedup is most pronounced for algorithms with high parallelism (community detection, triangle counting). For simpler traversals, the CPU path is already fast — GPU dispatch adds value when the graph is large and the algorithm is inherently parallel.
ArcFlow GPU Index#
Spatial queries require a different execution path from graph algorithms — a spatial index, not graph traversal. ArcFlow Adaptive Dispatch routes spatial queries across four lanes based on candidate count and GPU transfer cost:
| Lane | When | Typical use |
|---|---|---|
| CpuLive | ≤ 500 candidates | LIVE queries, real-time tracking |
| CpuBatch | > 500 candidates (CPU faster than transfer) | Analytics, replay |
| GpuLocal | > 50K candidates, fits single GPU | High-density spatial |
| GpuMulti | Exceeds single GPU memory | Stadium-scale entity tracking |
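As a rough sketch, the routing above reduces to a threshold cascade. The function name and the `single_gpu_capacity` input are assumptions for illustration; the real dispatcher uses the live cost model rather than fixed cutoffs:

```python
def choose_lane(candidates, single_gpu_capacity):
    """candidates: estimated candidate count for the spatial query.
    single_gpu_capacity: max candidates fitting one device's memory
    (an assumed input here)."""
    if candidates <= 500:
        return "CpuLive"      # dispatch overhead exceeds GPU benefit
    if candidates <= 50_000:
        return "CpuBatch"     # CPU still beats the GPU transfer cost
    if candidates <= single_gpu_capacity:
        return "GpuLocal"
    return "GpuMulti"         # partition across devices
```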
The ArcFlow GPU Index is the structure that makes GpuLocal and GpuMulti lanes possible. It is a pointer-free spatial index designed specifically for GPU traversal — transferring directly to GPU memory without transformation. Traditional pointer-based spatial indexes contain virtual memory addresses that are meaningless to GPU threads; the ArcFlow GPU Index eliminates this boundary entirely.
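A pointer-free layout of this kind can be illustrated with a uniform grid stored as flat arrays of integer offsets: the whole structure survives a raw copy to device memory, whereas host pointers would not. This is a minimal sketch, not ArcFlow's actual index format:

```python
def build_grid(points, cell_size):
    """Sort point ids by integer cell key so each cell's members are
    contiguous; a cell is then just an (offset, count) pair."""
    def key(i):
        x, y = points[i]
        return (int(x // cell_size), int(y // cell_size))
    ids = sorted(range(len(points)), key=key)
    cells, offsets = [], []
    prev = None
    for pos, i in enumerate(ids):
        k = key(i)
        if k != prev:
            cells.append(k)
            offsets.append(pos)
            prev = k
    offsets.append(len(ids))
    return cells, offsets, ids

def query_cell(cells, offsets, ids, cell):
    """A GPU thread can do this lookup with integer indexing only,
    no pointer dereference."""
    for c, start, end in zip(cells, offsets, offsets[1:]):
        if c == cell:
            return ids[start:end]
    return []
```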
Multi-GPU Partitioning#
Spatial data is partitioned across GPU devices by a stable hash of node_id. Queries spanning partition boundaries fan out to all relevant GPUs and merge results. Devices connected via high-bandwidth GPU interconnects form peer islands — within an island, work-stealing happens without PCIe transfer cost.
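A stable hash assignment of this kind can be sketched as follows; the use of SHA-256 here is an illustrative choice (any deterministic hash that is stable across processes works, unlike Python's built-in `hash()`):

```python
import hashlib

def device_for(node_id: str, num_gpus: int) -> int:
    """Map a node id to a GPU partition deterministically, so every
    process and restart agrees on the placement."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_gpus
```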
-- Same spatial query — Adaptive Dispatch routes to GpuMulti when warranted
CALL algo.nearestNodes($center, 'Entity', 100) YIELD node, distance RETURN node.name
Instanced Geometry#
For scenes with thousands of identical geometry instances (seats, sensors, obstacles), the ArcFlow GPU Index is shared — one allocation serves all instances. Queries transform coordinates into instance-local space rather than rebuilding the index per instance.
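The instance-local transform amounts to applying the inverse of each instance's placement to the query point. A minimal 2D sketch, assuming a rotate-then-translate placement (names are illustrative):

```python
def to_local(query, instance_origin, cos_t, sin_t):
    """Inverse of placement world = origin + R(theta) * local:
    translate the query back to the origin, then rotate by -theta.
    One shared index in local space then serves every instance."""
    dx = query[0] - instance_origin[0]
    dy = query[1] - instance_origin[1]
    return (cos_t * dx + sin_t * dy, -sin_t * dx + cos_t * dy)
```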
Metal GPU (Apple Silicon)#
On macOS and iOS, ArcFlow Adaptive Dispatch routes to Metal compute shaders. Apple Silicon's unified memory means the ArcFlow Graph Kernel and GPU Index operate in the same physical memory as the CPU — zero copy overhead.
PageRank on 10K nodes — CPU: 0.6ms / GPU: 0.25ms (2.4x)
If Metal is unavailable or the graph is below the dispatch threshold, routing falls back to CPU transparently.
CUDA GPU (Linux / Windows)#
On NVIDIA hardware, Adaptive Dispatch routes to CUDA via dynamic driver loading. No compile-time GPU dependency — the driver is discovered at runtime. If CUDA is unavailable, the same binary falls back to CPU. The CUDA path covers graph algorithms, vector search, and spatial operations at their respective workload scales.
GPU Introspection#
CALL db.gpuStatus()#
Returns one row per CUDA device. Use this to check availability and load before submitting large GPU workloads.
CALL db.gpuStatus() YIELD device_id, inflight, sm_count, vram_mib, status
| Column | Type | Description |
|---|---|---|
| device_id | int | CUDA device index |
| inflight | int | Currently executing GPU kernels |
| sm_count | int | Streaming multiprocessor count |
| vram_mib | int | Device VRAM in MiB |
| status | string | "available" (inflight < 8) or "saturated" |
Returns {device_id: "N/A", status: "no CUDA devices"} when no CUDA hardware is present.
CALL dbms.gpuThresholds()#
Returns the Adaptive Dispatch registry — minimum requirements for each algorithm before GPU routing is considered.
CALL dbms.gpuThresholds()
YIELD algorithm, min_input_size, bytes_per_element, validated, cuda_min_cc
| Column | Description |
|---|---|
| algorithm | Algorithm name (e.g. "pageRank", "leiden") |
| min_input_size | Minimum node count before GPU dispatch is considered |
| bytes_per_element | Per-node memory estimate for transfer cost calculation |
| validated | Whether the kernel has been validated on this hardware |
| cuda_min_cc | Minimum CUDA compute capability required (e.g. "9.0") or "none" |
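A registry row of this shape might gate GPU routing roughly as follows. The field names follow the table above, but the gating logic and the `free_vram_bytes` input are illustrative assumptions, not ArcFlow's code:

```python
def gpu_eligible(row, node_count, device_cc, free_vram_bytes):
    """row: one dbms.gpuThresholds() record as a dict.
    device_cc: the device's CUDA compute capability as a float."""
    if node_count < row["min_input_size"]:
        return False                      # too small: dispatch overhead wins
    if not row["validated"]:
        return False                      # kernel unproven on this hardware
    if row["cuda_min_cc"] != "none" and device_cc < float(row["cuda_min_cc"]):
        return False                      # device too old for this kernel
    # estimated transfer footprint must fit in free device memory
    return node_count * row["bytes_per_element"] <= free_vram_bytes
```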
CALL arcflow.spatial.dispatch_stats()#
Observability for ArcFlow GPU Index routing decisions.
CALL arcflow.spatial.dispatch_stats()
YIELD lane_chosen, estimated_candidates, actual_candidates,
prefilter_us, rtree_us, gpu_transfer_us, kernel_us, total_us
gpu_transfer_us and kernel_us are non-zero only when a GPU lane was chosen.
See Also#
- Graph Algorithms — full algorithm catalog with signatures and output schemas
- Algorithms Reference — GQL syntax for all 27 procedures
- Architecture — how Graph Kernel, Adaptive Dispatch, and GPU Index share memory
- Spatial Queries — GPU-dispatched spatial queries