GPU Acceleration
ArcFlow provides three distinct execution innovations for high-performance world model queries:
- ArcFlow Graph Kernel — executes graph algorithms as a single parallel pass across all nodes simultaneously, not as sequential edge traversals
- ArcFlow Adaptive Dispatch — routes every operation to the fastest available hardware at runtime, cost-model driven, zero configuration
- ArcFlow GPU Index — a pointer-free spatial index designed for direct GPU traversal without transformation
The developer writes one query. These three layers work together to pick the fastest path.
CALL algo.pageRank()
ArcFlow Graph Kernel#
Most graph databases walk the graph one edge at a time: visit a node, follow an edge, visit the next. That is sequential, cache-hostile, and does not map to GPU hardware.
The ArcFlow Graph Kernel processes algorithms differently. The world model is held as a compact parallel structure — every algorithm executes as a single pass across all nodes simultaneously, not a recursive walk. PageRank, BFS, connected components, community detection, triangle counting — each runs as one parallel operation. This maps directly to GPU thread blocks and enables the speedups below.
The same kernel runs on CPU when no GPU is present. The parallel structure is inherently more efficient than pointer-chasing traversal on any hardware.
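To make the "single pass across all nodes" idea concrete, here is a minimal sketch of PageRank over a compact array layout. It assumes a CSR-style structure (an `offsets` array plus a `targets` array); the source does not name ArcFlow's internal format, so `build_csr` and `pagerank` are hypothetical helpers for illustration, not ArcFlow APIs.

```python
# Sketch only: the CSR layout is an assumption, not ArcFlow's documented format.

def build_csr(num_nodes, edges):
    """Pack (src, dst) edges into offsets/targets arrays grouped by source node."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1
    offsets = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        offsets[node + 1] = offsets[node] + out_degree[node]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each source node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def pagerank(num_nodes, offsets, targets, damping=0.85, iterations=50):
    """Each iteration is one sweep over every node: no recursion, no walk state."""
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        nxt = [(1.0 - damping) / num_nodes] * num_nodes
        dangling = 0.0
        # The body below only reads rank[node] and adds into nxt, so a parallel
        # backend could run it for many nodes at once (with atomic adds).
        for node in range(num_nodes):
            start, end = offsets[node], offsets[node + 1]
            if start == end:
                dangling += rank[node]  # node has no out-edges
                continue
            share = damping * rank[node] / (end - start)
            for i in range(start, end):
                nxt[targets[i]] += share
        # Redistribute rank held by nodes without out-edges uniformly.
        spread = damping * dangling / num_nodes
        rank = [r + spread for r in nxt]
    return rank
```

The neighbours of a node sit in one contiguous slice of `targets`, which is why this layout is friendlier to caches than following per-node pointers.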
ArcFlow Adaptive Dispatch#
ArcFlow Adaptive Dispatch measures available hardware at startup and routes each operation to the fastest available backend based on a live cost model:
- Small graphs (< 200 nodes) — CPU path (dispatch overhead exceeds GPU benefit at this scale)
- Apple Silicon (macOS / iOS) — Metal GPU. Unified memory means no CPU→GPU copy overhead — CPU and GPU read the same physical memory.
- NVIDIA GPU (Linux / Windows) — CUDA GPU. Dynamic driver loading — no compile-time GPU dependency, same binary runs everywhere.
- CPU fallback — ArcFlow's parallel CPU implementation when no GPU is present.
The cost model accounts for kernel launch overhead, memory bandwidth per device, and algorithm parallelism characteristics. There are no hardcoded thresholds: routing adapts to the actual hardware measured at runtime, so figures such as the 200-node cutoff above are indicative rather than fixed.
Zero configuration. The same GQL query runs identically on a laptop, a workstation, or a GPU cluster.
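The routing decision described above can be sketched as a small cost comparison. This is an illustration under stated assumptions, not ArcFlow's actual model: the `Backend` fields, the linear cost formula, and every number used with it are hypothetical, chosen only to show how launch overhead, transfer cost, and parallelism could trade off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    launch_overhead_us: float     # fixed cost paid once per dispatch
    transfer_us_per_mb: float     # 0.0 for CPU and for unified-memory GPUs
    serial_items_per_us: float    # rate for work that cannot be parallelised
    parallel_items_per_us: float  # rate for work that can


def estimate_us(backend, items, payload_mb, parallel_fraction):
    """Estimated wall time in microseconds for one operation on one backend."""
    serial = items * (1.0 - parallel_fraction) / backend.serial_items_per_us
    parallel = items * parallel_fraction / backend.parallel_items_per_us
    transfer = payload_mb * backend.transfer_us_per_mb
    return backend.launch_overhead_us + transfer + serial + parallel


def choose_backend(backends, items, payload_mb, parallel_fraction):
    """Pick the backend with the lowest estimate; ties go to the first listed."""
    return min(
        backends,
        key=lambda b: estimate_us(b, items, payload_mb, parallel_fraction),
    )
```

With a model of this shape, a small graph stays on the CPU because the fixed launch overhead dominates, while a large, highly parallel workload amortises that overhead and moves to the GPU.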
Measuring on your hardware#
Performance depends on host CPU/GPU/memory characteristics and graph shape. Rather than quote per-host numbers that decay, ArcFlow ships a benchmark harness so you can measure on the hardware you'll actually deploy on:
# From the ozinc/arcflow repo:
cargo bench --bench algo          # CPU + GPU comparisons across algorithms
cargo run --bin metal_baseline    # Apple Silicon Metal-specific baseline
GPU speedup is most pronounced for algorithms with high parallelism (community detection, triangle counting). For simpler traversals the CPU path is already fast; GPU dispatch adds value when the graph is large and the algorithm is inherently parallel.
ArcFlow GPU Index#
Spatial queries require a different execution path from graph algorithms — a spatial index, not graph traversal. ArcFlow Adaptive Dispatch routes spatial queries across four lanes based on candidate count and GPU transfer cost:
| Lane | When | Typical use |
|---|---|---|
| CpuLive | ≤ 500 candidates | LIVE queries, real-time tracking |
| CpuBatch | > 500 candidates (CPU faster than transfer) | Analytics, replay |
| GpuLocal | > 50K candidates, fits single GPU | High-density spatial |
| GpuMulti | Exceeds single GPU memory | Stadium-scale entity tracking |
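Read as a decision procedure, the lane table could look like the sketch below. It treats the table's figures as fixed cutoffs, which is a simplification: the source describes routing as cost-model driven. `choose_lane`, its parameters, and the bytes-per-candidate memory estimate are hypothetical.

```python
from enum import Enum


class Lane(Enum):
    CPU_LIVE = "CpuLive"
    CPU_BATCH = "CpuBatch"
    GPU_LOCAL = "GpuLocal"
    GPU_MULTI = "GpuMulti"


LIVE_MAX_CANDIDATES = 500    # "≤ 500 candidates" row
GPU_MIN_CANDIDATES = 50_000  # "> 50K candidates" row


def choose_lane(candidates, bytes_per_candidate, single_gpu_bytes):
    """Map a candidate count to a lane; single_gpu_bytes is None without a GPU."""
    if candidates <= LIVE_MAX_CANDIDATES:
        return Lane.CPU_LIVE
    if single_gpu_bytes is None or candidates <= GPU_MIN_CANDIDATES:
        return Lane.CPU_BATCH
    if candidates * bytes_per_candidate > single_gpu_bytes:
        return Lane.GPU_MULTI  # working set does not fit on one device
    return Lane.GPU_LOCAL
```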
The ArcFlow GPU Index is the structure that makes GpuLocal and GpuMulti lanes possible. It is a pointer-free spatial index designed specifically for GPU traversal — transferring directly to GPU memory without transformation. Traditional pointer-based spatial indexes contain virtual memory addresses that are meaningless to GPU threads; the ArcFlow GPU Index eliminates this boundary entirely.
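The difference between a pointer-based and a pointer-free index is easiest to see in a small example. The sketch below stores a 2D kd-tree in parallel arrays whose child links are integer slots rather than memory addresses, so the arrays could be copied to another address space unchanged. A kd-tree is used here only as a stand-in; the source does not say which structure the ArcFlow GPU Index uses, and both function names are hypothetical.

```python
def build_flat_kdtree(points):
    """Return (point_idx, left, right); children are array slots, -1 means none."""
    point_idx, left, right = [], [], []

    def build(indices, depth):
        if not indices:
            return -1
        axis = depth % 2
        indices.sort(key=lambda i: points[i][axis])
        mid = len(indices) // 2
        slot = len(point_idx)
        point_idx.append(indices[mid])
        left.append(-1)
        right.append(-1)
        left[slot] = build(indices[:mid], depth + 1)
        right[slot] = build(indices[mid + 1:], depth + 1)
        return slot

    build(list(range(len(points))), 0)  # the root, if any, lands in slot 0
    return point_idx, left, right


def range_query(points, tree, lo, hi):
    """Indices of points inside the axis-aligned box [lo, hi], via an explicit stack."""
    point_idx, left, right = tree
    hits = []
    stack = [(0, 0)] if point_idx else []
    while stack:
        slot, depth = stack.pop()
        p = points[point_idx[slot]]
        if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
            hits.append(point_idx[slot])
        axis = depth % 2
        if left[slot] != -1 and lo[axis] <= p[axis]:
            stack.append((left[slot], depth + 1))
        if right[slot] != -1 and p[axis] <= hi[axis]:
            stack.append((right[slot], depth + 1))
    return hits
```

The traversal uses an explicit stack instead of recursion, which is the usual shape for tree walks on hardware without a deep call stack.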
Multi-GPU Partitioning#
Spatial data is partitioned across GPU devices by a stable hash of node_id. Queries spanning partition boundaries fan out to all relevant GPUs and merge results. Devices connected via high-bandwidth GPU interconnects form peer islands — within an island, work-stealing happens without PCIe transfer cost.
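A minimal sketch of the partition-then-merge pattern described above, assuming BLAKE2b as the stable hash and `(distance, node_id)` tuples as per-device results. The source specifies neither, so `device_for`, `partition`, and `merge_nearest` are hypothetical names for illustration.

```python
import hashlib
import heapq


def device_for(node_id, num_devices):
    """Stable across processes and runs, unlike Python's salted built-in hash()."""
    digest = hashlib.blake2b(str(node_id).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_devices


def partition(node_ids, num_devices):
    """Group node ids into one bucket per device."""
    parts = [[] for _ in range(num_devices)]
    for node_id in node_ids:
        parts[device_for(node_id, num_devices)].append(node_id)
    return parts


def merge_nearest(per_device_results, k):
    """Each device returns its own (distance, node_id) hits; keep the global k nearest."""
    all_hits = (hit for hits in per_device_results for hit in hits)
    return heapq.nsmallest(k, all_hits)
```

Because the hash depends only on `node_id`, a node maps to the same device on every run, and a k-nearest query only needs each device's local top k before the merge.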
-- Same spatial query — Adaptive Dispatch routes to GpuMulti when warranted
CALL algo.nearestNodes($center, 'Entity', 100) YIELD node, distance RETURN node.name
Instanced Geometry#
For scenes with thousands of identical geometry instances (seats, sensors, obstacles), the ArcFlow GPU Index is shared — one allocation serves all instances. Queries transform coordinates into instance-local space rather than rebuilding the index per instance.
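The coordinate transform can be sketched in 2D as follows. Each instance is assumed to be a rigid placement (a rotation followed by a translation) of one shared set of local-space points; the brute-force inner loop stands in for a lookup in the shared index. All names and the instance representation are hypothetical.

```python
import math


def to_local(point, origin, angle_rad):
    """Inverse of 'rotate by angle_rad, then translate by origin' in 2D."""
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    c, s = math.cos(-angle_rad), math.sin(-angle_rad)
    return (dx * c - dy * s, dx * s + dy * c)


def nearest_in_instances(query, shared_points, instances):
    """Return (distance, instance_name, point_index) of the closest shared point.

    shared_points is built once; each instance is (name, origin, angle_rad).
    """
    best = None
    for name, origin, angle in instances:
        local_q = to_local(query, origin, angle)
        # Stand-in for one query against the shared index in local space.
        for idx, p in enumerate(shared_points):
            d = math.dist(local_q, p)
            if best is None or d < best[0]:
                best = (d, name, idx)
    return best
```

Rigid transforms preserve distances, so a distance measured in instance-local space equals the world-space distance and results from different instances can be compared directly.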
Metal GPU (Apple Silicon)#
On macOS and iOS, ArcFlow Adaptive Dispatch routes to Metal compute shaders. Apple Silicon's unified memory means the ArcFlow Graph Kernel and GPU Index operate in the same physical memory as the CPU — zero copy overhead.
If Metal is unavailable or the graph is below the dispatch threshold, routing falls back to CPU transparently.
Per-family kernel selection#
ArcFlow selects the optimal Metal Shading Language primitive for your Apple GPU family automatically. The integrated loop branches inside the same in-process call — no separate code path, no configuration, no cross-architecture abstraction tax:
| Apple GPU family | Primitive routed |
|---|---|
| Apple7 (A14 / M1) and newer | Pipeline-state caching for sub-frame cold start; transient buffer-heap allocations |
| Apple8 (A15 / A16 / M2) and newer | simd_sum / simd_min / simd_max cross-lane reductions for graph aggregates |
| Apple9 (A17 / M3) and newer | Native atomic_float for scatter-accumulate kernels |
| M3 family and newer | simdgroup_matrix tile operations for dense linear-algebra inner loops |
arcflow status --json reports the detected GPU family and the selected primitive set. Same binary runs across every Apple device generation; the dispatch decision is per-host.
CPU-side integration#
Vector and dense-numeric paths route through Apple's Accelerate framework (AMX / SME on Apple Silicon, NEON elsewhere) for batched dot products and reductions — including the brute-force fallback path in vector search. CPU work runs on QOS_CLASS_USER_INITIATED for foreground-quality scheduling alongside the GPU loop.
CUDA GPU (Linux / Windows)#
On NVIDIA hardware, Adaptive Dispatch routes to CUDA via dynamic driver loading. There is no compile-time GPU dependency: the driver is discovered at runtime, and if CUDA is unavailable the same binary falls back to CPU. The CUDA path covers graph algorithms, vector search, and spatial operations at different workload scales.
GPU Introspection#
CALL db.gpuStatus()#
Returns one row per CUDA device. Use this to check availability and load before submitting large GPU workloads.
CALL db.gpuStatus() YIELD device_id, inflight, sm_count, vram_mib, status
| Column | Type | Description |
|---|---|---|
| device_id | int | CUDA device index |
| inflight | int | Currently executing GPU kernels |
| sm_count | int | Streaming multiprocessor count |
| vram_mib | int | Device VRAM in MiB |
| status | string | "available" (inflight < 8) or "saturated" |
Returns {device_id: "N/A", status: "no CUDA devices"} when no CUDA hardware is present.
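As a sketch of how a client might act on these rows before submitting a large workload, the helper below picks the least-loaded available device. It assumes the rows arrive as plain dictionaries keyed by the column names above; the source does not describe a client library, so `pick_device` and the row representation are hypothetical.

```python
def pick_device(rows):
    """Return the device_id of the least-loaded available device, or None.

    rows: dicts shaped like db.gpuStatus() output.
    """
    usable = [r for r in rows if r.get("status") == "available"]
    if not usable:
        # Covers both the all-saturated case and the "no CUDA devices" row.
        return None
    # Prefer fewer in-flight kernels; break ties by larger VRAM.
    best = min(usable, key=lambda r: (r["inflight"], -r["vram_mib"]))
    return best["device_id"]
```

Returning `None` for both the saturated and the no-CUDA case lets the caller treat either one as a signal to wait or to stay on the CPU path.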
On Apple Silicon (macOS / iOS), db.gpuStatus() currently enumerates CUDA devices only, so it returns the "no CUDA devices" shape even when the Metal GPU is fully active. To confirm Metal presence on Mac, inspect db.capabilities() instead:
CALL db.capabilities()
YIELD capability, value
WHERE capability IN ['gpu_spmv_semirings', 'gpu_spgemm', 'gpu_deterministic_f64']
RETURN capability, value
Non-zero gpu_spmv_semirings and a populated gpu_spgemm value indicate Metal kernels are linked and dispatchable. The gpu_backend field reports the backend chosen for the current workload (Adaptive Dispatch keeps small graphs on CPU), not Metal availability.
CALL db.capabilities()#
Returns the engine capability surface, including GPU presence, GPU family flags, and the spgemm/dispatch wiring status. Use this to check whether GPU acceleration is available before submitting large workloads.
CALL db.capabilities()
YIELD gpu_status, gpu_spgemm, gpu_family
| Column | Description |
|---|---|
| gpu_status | "available", "saturated", or "no CUDA devices" |
| gpu_spgemm | Whether GPU sparse-matrix dispatch is wired for this build |
| gpu_family | Apple GPU family identifier when running on Metal |
| validated | Whether the kernel has been validated on this hardware |
| cuda_min_cc | Minimum CUDA compute capability required (e.g. "9.0") or "none" |
CALL arcflow.spatial.dispatch_stats()#
Observability for ArcFlow GPU Index routing decisions.
CALL arcflow.spatial.dispatch_stats()
YIELD lane_chosen, estimated_candidates, actual_candidates,
prefilter_us, rtree_us, gpu_transfer_us, kernel_us, total_us
gpu_transfer_us and kernel_us are non-zero only when a GPU lane was chosen.
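One way to read a stats row is to derive a few ratios from it, as in the sketch below. It assumes a row is available as a dictionary keyed by the yielded column names, and it infers GPU use from the non-zero timing columns as described above. `summarize_dispatch` and its output keys are hypothetical.

```python
def summarize_dispatch(row):
    """Derive simple health indicators from one dispatch_stats row (a dict)."""
    estimated = row["estimated_candidates"]
    actual = row["actual_candidates"]
    total_us = row["total_us"]
    gpu_us = row["gpu_transfer_us"] + row["kernel_us"]
    return {
        "lane": row["lane_chosen"],
        # GPU timing columns are non-zero only when a GPU lane was chosen.
        "used_gpu": gpu_us > 0,
        # A ratio above 1.0 means the planner over-estimated the candidate count.
        "estimate_ratio": estimated / actual if actual else None,
        # Share of wall time spent moving data rather than computing.
        "transfer_share": row["gpu_transfer_us"] / total_us if total_us else 0.0,
    }
```

A consistently high `transfer_share` on a GPU lane, or an `estimate_ratio` far from 1.0, would be the first things to check when a spatial query is slower than expected.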
See Also#
- Graph Algorithms — full algorithm catalog with signatures and output schemas
- Algorithms Reference — GQL syntax for all 27 procedures
- Architecture — how Graph Kernel, Adaptive Dispatch, and GPU Index share memory
- Spatial Queries — GPU-dispatched spatial queries