Low-latency order-matching engine in Rust targeting Apple Silicon with inline ARM64 NEON assembly, lock-free SPSC ring buffers, and measured sub-100 ns round-trip latency. Zero external dependencies.
A high-frequency trading engine built from scratch in Rust, targeting Apple Silicon (ARM64 / M-series) with an AVX2 path for x86_64 Linux. Zero external dependencies. Every architectural decision is evaluated in nanoseconds.
> New here? CLAUDE.md is the full architecture & design reference — the why behind every decision below. Want to contribute? See CONTRIBUTING.md.
Measured latency — in-process simulation:
| Platform                      | Min       | p50    | p95    | p99      | p99.9     | Max        |
|-------------------------------|-----------|--------|--------|----------|-----------|------------|
| macOS — M3 Max (signal)       | **41 ns** | 125 ns | 250 ns | 458 ns   | 1,917 ns  | 1,917 ns   |
| macOS — M3 Max (round-trip)   | **41 ns** | 84 ns  | 375 ns | 3,458 ns | 10,001 ns | 105,083 ns |
| Linux — i9-9900K (signal)     | 89 ns     | 118 ns | 143 ns | 150 ns   | 1,317 ns  | 1,317 ns   |
| Linux — i9-9900K (round-trip) | 92 ns     | 108 ns | 140 ns | 199 ns   | 1,263 ns  | 1,263 ns   |
The two platforms make different tradeoffs. The M3 Max achieves a lower floor (41 ns vs 89 ns), which is attributed to ARM64 NEON and the P-core cluster's memory subsystem. Linux on x86_64 delivers tighter tail discipline: p99 round-trip is 199 ns vs 3,458 ns on macOS. The Mac's scheduling spikes are rarer (6,868 stalls/run) but longer when they happen; Linux stalls more frequently (21,741/run) but more uniformly. Neither is "better"; they are different OS scheduling personalities running the same spin-poll workload.
External UDP mode (3-process, kernel boundaries): 43–135 µs — ~163× higher than in-process. That gap is the architectural thesis.
The engine ingests market tick data, runs a momentum signal over a sliding 8-price window, submits orders to an in-process exchange, and records per-trade latency at nanosecond resolution. The current simulation is self-contained — one binary spawns the market feed, the exchange, and the strategy thread internally.
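To make the sliding-window step concrete, here is a minimal sketch of a momentum signal over the last 8 prices. The source does not spell out the formula, so this assumes the simplest definition (newest price minus oldest price in the window, compared against a threshold). The names `MomentumWindow`, `Signal`, and the `threshold` parameter are hypothetical, and prices are assumed to be integer ticks.

```rust
/// Number of prices in the sliding window (8, per the description above).
const WINDOW: usize = 8;

/// Hypothetical output of the strategy for one tick.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    Buy,
    Sell,
    Hold,
}

/// Fixed-size circular window of prices; no heap allocation on the hot path.
pub struct MomentumWindow {
    prices: [i64; WINDOW],
    next: usize,   // slot that the next price will overwrite
    filled: usize, // number of valid prices, saturates at WINDOW
}

impl MomentumWindow {
    pub const fn new() -> Self {
        Self { prices: [0; WINDOW], next: 0, filled: 0 }
    }

    /// Push one price and return a signal.
    /// ASSUMPTION: momentum = newest price - oldest price in the window.
    pub fn push(&mut self, price: i64, threshold: i64) -> Signal {
        self.prices[self.next] = price;
        self.next = (self.next + 1) % WINDOW;
        if self.filled < WINDOW {
            self.filled += 1;
        }
        if self.filled < WINDOW {
            return Signal::Hold; // not enough history yet
        }
        // With a full window, `next` points at the oldest surviving price.
        let delta = price - self.prices[self.next];
        if delta > threshold {
            Signal::Buy
        } else if delta < -threshold {
            Signal::Sell
        } else {
            Signal::Hold
        }
    }
}
```

The window only emits `Buy` or `Sell` once all 8 slots hold real prices; before that, every tick yields `Hold`.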
Three threads share elevated priority (macOS QOS_USER_INTERACTIVE / P-core bias; Linux sched_setaffinity equivalent is a planned addition):
A watchdog (default priority) monitors for idle/feed-loss conditions and shuts down after 10 s idle or 30 s without a feed packet.
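The two watchdog timeouts can be expressed as a small pure function, sketched below. Only the 10 s and 30 s values come from the text; `ShutdownReason`, `check`, and the reading of "idle" as "no strategy activity" are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Timeouts taken from the description above.
const IDLE_TIMEOUT: Duration = Duration::from_secs(10);
const FEED_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical reason codes for a watchdog-initiated shutdown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShutdownReason {
    Idle,
    FeedLoss,
}

/// Decide whether the watchdog should trigger a shutdown.
/// ASSUMPTION: "idle" means no activity was recorded since `last_activity`.
pub fn check(now: Instant, last_activity: Instant, last_feed: Instant) -> Option<ShutdownReason> {
    // saturating_duration_since returns zero if a timestamp is ahead of `now`.
    if now.saturating_duration_since(last_feed) >= FEED_TIMEOUT {
        Some(ShutdownReason::FeedLoss)
    } else if now.saturating_duration_since(last_activity) >= IDLE_TIMEOUT {
        Some(ShutdownReason::Idle)
    } else {
        None
    }
}
```

Because the function takes `now` as a parameter, it can be polled from a default-priority thread and tested without sleeping.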
The design eliminates every source of unpredictable latency on the hot path. Each decision below has a measurable consequence.
The exchange thread shares memory with the strategy thread. The round-trip path — order submission → confirmation write → timestamp read — crosses zero kernel boundaries. The OrderRing SPSC buffer connects them via UnsafeCell<[OrderEntry; 1024]> and a single AtomicU64 cursor.
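The single-cursor handoff described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's actual code: the `OrderEntry` fields, the `Producer` and `Consumer` handles, and the `order_ring` constructor are hypothetical. With only one shared cursor there is no backpressure, so the sketch assumes the consumer never falls a full 1024 entries behind, which is why `push` is marked `unsafe`.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

const CAPACITY: usize = 1024;

/// Hypothetical order payload; the real field layout is not given in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OrderEntry {
    pub id: u64,
    pub price: i64,
    pub qty: u32,
}

pub struct OrderRing {
    slots: UnsafeCell<[OrderEntry; CAPACITY]>,
    cursor: AtomicU64, // total number of entries published so far
}

// SAFETY: access is split between exactly one Producer and one Consumer,
// and slot contents are published through the Release/Acquire cursor.
unsafe impl Sync for OrderRing {}

pub struct Producer { ring: Arc<OrderRing>, written: u64 }
pub struct Consumer { ring: Arc<OrderRing>, read: u64 }

pub fn order_ring() -> (Producer, Consumer) {
    let ring = Arc::new(OrderRing {
        slots: UnsafeCell::new([OrderEntry { id: 0, price: 0, qty: 0 }; CAPACITY]),
        cursor: AtomicU64::new(0),
    });
    (Producer { ring: Arc::clone(&ring), written: 0 }, Consumer { ring, read: 0 })
}

impl Producer {
    /// # Safety
    /// The consumer must be fewer than CAPACITY entries behind; otherwise this
    /// write races with a slot the consumer may still be reading.
    pub unsafe fn push(&mut self, entry: OrderEntry) {
        let base = self.ring.slots.get() as *mut OrderEntry;
        let idx = (self.written as usize) % CAPACITY;
        unsafe { base.add(idx).write(entry) };
        self.written += 1;
        // Release: the slot write above becomes visible before the new cursor.
        self.ring.cursor.store(self.written, Ordering::Release);
    }
}

impl Consumer {
    /// Non-blocking poll; returns None when nothing new has been published.
    pub fn pop(&mut self) -> Option<OrderEntry> {
        // Acquire pairs with the producer's Release store.
        let published = self.ring.cursor.load(Ordering::Acquire);
        if self.read == published {
            return None;
        }
        let base = self.ring.slots.get() as *const OrderEntry;
        let idx = (self.read as usize) % CAPACITY;
        let entry = unsafe { base.add(idx).read() };
        self.read += 1;
        Some(entry)
    }
}
```

The consumer keeps its read position in a thread-local field, so the only shared mutable word is the cursor. A spin-polling consumer would call `pop` in a loop; neither side makes a system call.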
The standalone fake-exchange binary exists for external UDP measurement when kernel-path characterisation is needed (43–135 µs — the cost of 4× EL0→EL1 crossings plus 2 process wakeups on loopback).
| Structure          | Pattern                           | Purpose                                      |
|--------------------|-----------------------------------|----------------------------------------------|
| `RingBuffer`       | SPSC, `AtomicU64` write head      | Ingestor → strategy tick delivery            |
| `TradeLog`         | Single-writer, `AtomicU64` cursor | Strategy commits; exchange reads slot index  |
| `OrderRing`        | SPSC, `AtomicU64` cursor          | Strategy → exchange order submission         |
| `LatencyHistogram` | Per-thread sole-writer buckets    | ns-resolution recording, no sort at shutdown |
No mutex, no condvar, no blocking synchronisation on the hot path.
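As an illustration of how bucketed recording avoids a sort at shutdown, the sketch below counts samples in 1 ns buckets and derives percentiles with a single cumulative scan. The bucket layout (linear, with a final overflow bucket) and the `new`, `record`, and `percentile` signatures are assumptions; the source only states that each thread is the sole writer of its own buckets.

```rust
/// Sketch of a per-thread histogram: one instance per thread, so `record`
/// takes `&mut self` and needs no atomics.
pub struct LatencyHistogram {
    /// buckets[i] counts samples of exactly i ns; the last bucket also
    /// absorbs every sample at or above the cap (ASSUMED layout).
    buckets: Vec<u64>,
    count: u64,
}

impl LatencyHistogram {
    /// `max_ns` is a hypothetical cap on the exactly-resolved range.
    pub fn new(max_ns: usize) -> Self {
        Self { buckets: vec![0; max_ns + 1], count: 0 }
    }

    /// O(1) recording: clamp into range and bump one counter.
    pub fn record(&mut self, ns: u64) {
        let last = (self.buckets.len() - 1) as u64;
        let idx = ns.min(last) as usize;
        self.buckets[idx] += 1;
        self.count += 1;
    }

    /// Smallest latency such that at least `pct` percent of samples are at
    /// or below it. One linear scan over the buckets, no sorting.
    pub fn percentile(&self, pct: f64) -> Option<u64> {
        if self.count == 0 {
            return None;
        }
        let target = ((pct / 100.0) * self.count as f64).ceil().max(1.0) as u64;
        let mut seen = 0u64;
        for (ns, &c) in self.buckets.iter().enumerate() {
            seen += c;
            if seen >= target {
                return Some(ns as u64);
            }
        }
        None
    }
}
```

Samples above the cap are reported as the cap itself, so a real deployment would size `max_ns` to cover the tail it cares about or add coarser buckets beyond it.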