HFT Engine


Low-latency order-matching engine in Rust targeting Apple Silicon, with inline ARM64 NEON assembly, lock-free SPSC ring buffers, and a measured 41 ns minimum round-trip latency (84 ns median on an M3 Max). Zero external dependencies.


STATUS: active

LANGUAGE: Rust

METRIC: 41 ns · min round-trip

STACK: Rust · ARM64 NEON · Lock-Free · Systems Programming · HFT · Low-Latency

REPO: miki-przygoda/hft-engine

COMMITS: 92

STARS: 0



rust-hft-software

A high-frequency trading engine built from scratch in Rust, targeting Apple Silicon (ARM64 / M-series) with an AVX2 path for x86_64 Linux. Zero external dependencies. Every architectural decision is evaluated in nanoseconds.

> New here? CLAUDE.md is the full architecture & design reference — the why behind every decision below. Want to contribute? See CONTRIBUTING.md.

Measured latency — in-process simulation:

| Platform                      | Min       | p50    | p95    | p99      | p99.9     | Max        |
|-------------------------------|-----------|--------|--------|----------|-----------|------------|
| macOS — M3 Max (signal)       | **41 ns** | 125 ns | 250 ns | 458 ns   | 1,917 ns  | 1,917 ns   |
| macOS — M3 Max (round-trip)   | **41 ns** | 84 ns  | 375 ns | 3,458 ns | 10,001 ns | 105,083 ns |
| Linux — i9-9900K (signal)     | 89 ns     | 118 ns | 143 ns | 150 ns   | 1,317 ns  | 1,317 ns   |
| Linux — i9-9900K (round-trip) | 92 ns     | 108 ns | 140 ns | 199 ns   | 1,263 ns  | 1,263 ns   |

The two platforms make different tradeoffs. The M3 Max reaches a lower floor (41 ns vs 89 ns for signal and 92 ns for round-trip on Linux), which the project attributes to the ARM64 NEON path and the P-core cluster's memory subsystem. Linux on x86_64 has a much tighter tail: p99 round-trip is 199 ns vs 3,458 ns on macOS. The Mac's scheduling stalls are rarer (6,868 per run) but longer when they happen; Linux stalls more often (21,741 per run) but more uniformly. Neither is "better"; they are different OS scheduling behaviours under the same spin-poll workload.

External UDP mode (three processes, crossing kernel boundaries): 43–135 µs, roughly 163× higher than in-process. That gap is the architectural thesis.

What it does

The engine ingests market tick data, runs a momentum signal over a sliding 8-price window, submits orders to an in-process exchange, and records per-trade latency at nanosecond resolution. The current simulation is self-contained — one binary spawns the market feed, the exchange, and the strategy thread internally.
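The README does not spell out the momentum formula, so the following is only a minimal scalar sketch of a sliding 8-price window. It assumes momentum means "newest price minus oldest price in the window" and that prices are fixed-point integers; `MomentumWindow`, `Signal`, and `threshold` are hypothetical names, not taken from the repository. The real hot path reportedly keeps the window in NEON registers rather than in memory.

```rust
/// Hypothetical signal type; the repository's actual variants are not documented here.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Signal {
    Buy,
    Sell,
    Hold,
}

/// Sliding window over the last 8 prices (assumed to be fixed-point integers).
pub struct MomentumWindow {
    prices: [i64; 8],
    next: usize,   // slot the next push will overwrite
    filled: usize, // number of valid prices, saturates at 8
}

impl MomentumWindow {
    pub const fn new() -> Self {
        Self { prices: [0; 8], next: 0, filled: 0 }
    }

    /// Push one price and evaluate the (assumed) momentum rule.
    /// Returns `Hold` until the window holds 8 prices.
    pub fn push(&mut self, price: i64, threshold: i64) -> Signal {
        self.prices[self.next] = price;
        self.next = (self.next + 1) & 7; // wrap without a division
        if self.filled < 8 {
            self.filled += 1;
        }
        if self.filled < 8 {
            return Signal::Hold;
        }
        // After the increment, `next` points at the oldest price in the window.
        let momentum = price - self.prices[self.next];
        if momentum > threshold {
            Signal::Buy
        } else if momentum < -threshold {
            Signal::Sell
        } else {
            Signal::Hold
        }
    }
}
```

Because the window size is a power of two, the index wraps with a bit mask, which keeps the per-tick work branch-light even in this scalar form.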

Three threads share elevated priority (macOS QOS_USER_INTERACTIVE / P-core bias; Linux sched_setaffinity equivalent is a planned addition):

Ingestor — binds UDP 34254, spin-polls, writes ticks into a lock-free ring buffer
Strategy — spin-polls the ring buffer, evaluates the momentum signal, commits trades
Exchange — spin-polls the order ring, writes round-trip timestamps back to the trade log

A watchdog (default priority) monitors for idle/feed-loss conditions and shuts down after 10 s idle or 30 s without a feed packet.
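One way the watchdog check could be expressed is as a pure function over heartbeat timestamps that the hot threads publish. This is a sketch under assumptions: the `Heartbeats` struct, the `Verdict` enum, and the interpretation of "idle" as "no activity recorded" are illustrative and not confirmed by the README; only the 10 s and 30 s limits come from the text above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Limits taken from the README: 10 s idle, 30 s without a feed packet.
pub const IDLE_LIMIT_NS: u64 = 10_000_000_000;
pub const FEED_LIMIT_NS: u64 = 30_000_000_000;

/// Hypothetical heartbeat block: each hot thread stores a timestamp,
/// the watchdog only reads.
pub struct Heartbeats {
    pub last_activity_ns: AtomicU64,
    pub last_packet_ns: AtomicU64,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Verdict {
    Healthy,
    IdleTimeout,
    FeedLoss,
}

/// Decide whether to shut down, given the current time in nanoseconds.
/// Relaxed loads are enough here because the watchdog only needs an
/// eventually-fresh value, not ordering against other memory.
pub fn check(hb: &Heartbeats, now_ns: u64) -> Verdict {
    let idle = now_ns.saturating_sub(hb.last_activity_ns.load(Ordering::Relaxed));
    let silent = now_ns.saturating_sub(hb.last_packet_ns.load(Ordering::Relaxed));
    if silent >= FEED_LIMIT_NS {
        Verdict::FeedLoss
    } else if idle >= IDLE_LIMIT_NS {
        Verdict::IdleTimeout
    } else {
        Verdict::Healthy
    }
}
```

Keeping the decision separate from the sleep loop means the watchdog thread can poll at a coarse interval without touching any cache line the hot threads write more often than once per event.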

Architecture

The design aims to remove sources of unpredictable latency from the hot path. Each decision below has a measurable consequence.

In-process exchange over external UDP

The exchange thread shares memory with the strategy thread. The round-trip path — order submission → confirmation write → timestamp read — crosses zero kernel boundaries. The OrderRing SPSC buffer connects them via UnsafeCell<[OrderEntry; 1024]> and a single AtomicU64 cursor.
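A single-cursor SPSC ring of this shape might look roughly like the sketch below. Only `OrderRing`, `OrderEntry`, the 1024-slot `UnsafeCell` array, and the `AtomicU64` cursor come from the description above; the `OrderEntry` fields, the method names, and the consumer-owned `read_seq` counter are assumptions. With one shared cursor the producer cannot see how far the consumer has read, so this sketch assumes the spin-polling consumer never falls 1024 entries behind.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering};

const CAPACITY: usize = 1024; // power of two, so indexing is a mask

/// Hypothetical payload; the real field layout is not given in the README.
#[derive(Clone, Copy, Default, Debug, PartialEq, Eq)]
pub struct OrderEntry {
    pub id: u64,
    pub price: i64,
    pub submit_ns: u64,
}

pub struct OrderRing {
    slots: UnsafeCell<[OrderEntry; CAPACITY]>,
    cursor: AtomicU64, // total number of orders published so far
}

// SAFETY: sound only under the SPSC contract: exactly one thread calls
// `push`, exactly one thread calls `pop`, and the consumer is never lapped.
unsafe impl Sync for OrderRing {}

impl OrderRing {
    pub fn new() -> Self {
        Self {
            slots: UnsafeCell::new([OrderEntry::default(); CAPACITY]),
            cursor: AtomicU64::new(0),
        }
    }

    /// Producer side: write the slot, then publish it with a Release store.
    pub fn push(&self, order: OrderEntry) {
        let seq = self.cursor.load(Ordering::Relaxed);
        let base = self.slots.get() as *mut OrderEntry;
        unsafe { base.add((seq as usize) & (CAPACITY - 1)).write(order) };
        self.cursor.store(seq + 1, Ordering::Release);
    }

    /// Consumer side: `read_seq` is the consumer's private counter.
    /// The Acquire load pairs with the producer's Release store, so the
    /// slot contents are visible once the cursor has moved past them.
    pub fn pop(&self, read_seq: &mut u64) -> Option<OrderEntry> {
        if *read_seq == self.cursor.load(Ordering::Acquire) {
            return None; // nothing new; the caller spin-polls again
        }
        let base = self.slots.get() as *const OrderEntry;
        let order = unsafe { base.add((*read_seq as usize) & (CAPACITY - 1)).read() };
        *read_seq += 1;
        Some(order)
    }
}
```

Neither side issues a system call or takes a lock, which is the property the paragraph above relies on: a push followed by a successful pop is two atomic operations and one slot copy.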

The standalone fake-exchange binary exists for external UDP measurement when kernel-path characterisation is needed (43–135 µs — the cost of 4× EL0→EL1 crossings plus 2 process wakeups on loopback).

Lock-free data structures throughout

| Structure          | Pattern                           | Purpose                                      |
|--------------------|-----------------------------------|----------------------------------------------|
| `RingBuffer`       | SPSC, `AtomicU64` write head      | Ingestor → strategy tick delivery            |
| `TradeLog`         | Single-writer, `AtomicU64` cursor | Strategy commits; exchange reads slot index  |
| `OrderRing`        | SPSC, `AtomicU64` cursor          | Strategy → exchange order submission         |
| `LatencyHistogram` | Per-thread sole-writer buckets    | ns-resolution recording, no sort at shutdown |

No mutex, no condvar, no blocking synchronisation on the hot path.
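To illustrate why a bucketed histogram needs no sort at shutdown, here is a minimal sketch with one bucket per nanosecond and a cumulative walk for percentiles. The bucket layout, the overflow bucket, and the `record` / `percentile` method names are assumptions; the README only states that `LatencyHistogram` uses per-thread sole-writer buckets at ns resolution.

```rust
/// One bucket per nanosecond up to `max_ns`; the last bucket also absorbs
/// anything larger. Owned by a single thread, so plain `u64` counters suffice.
pub struct LatencyHistogram {
    buckets: Vec<u64>, // allocated once up front, never resized
    count: u64,
}

impl LatencyHistogram {
    pub fn new(max_ns: usize) -> Self {
        Self { buckets: vec![0; max_ns + 1], count: 0 }
    }

    /// O(1) on the hot path: clamp, increment, done.
    pub fn record(&mut self, latency_ns: u64) {
        let idx = (latency_ns as usize).min(self.buckets.len() - 1);
        self.buckets[idx] += 1;
        self.count += 1;
    }

    /// Smallest latency L such that at least `pct` percent of samples are <= L.
    /// A single pass over the buckets replaces sorting the raw samples.
    pub fn percentile(&self, pct: f64) -> Option<u64> {
        if self.count == 0 {
            return None;
        }
        let target = ((pct / 100.0) * self.count as f64).ceil().max(1.0) as u64;
        let mut seen = 0u64;
        for (ns, &n) in self.buckets.iter().enumerate() {
            seen += n;
            if seen >= target {
                return Some(ns as u64);
            }
        }
        None
    }
}
```

With per-thread instances, a shutdown report would sum the bucket arrays element-wise before walking them; that merge step is omitted here.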

Cache-line alignment everywhere

MEASURED LATENCY

41 ns minimum round-trip on M3 Max (ARM64 NEON path).
p50 84 ns, p99 3,458 ns on macOS M3 Max.
p99 199 ns on Linux i9-9900K — tighter tail discipline.
External UDP mode (three processes, crossing kernel boundaries): 43–135 µs, roughly 163× higher than in-process. That gap is the architectural thesis.

ARCHITECTURE

In-process exchange over external UDP — order submission crosses zero kernel boundaries via SPSC OrderRing (UnsafeCell + AtomicU64 cursor).
ARM64 NEON inline assembly — 8-price momentum window lives in two NEON registers (v28/v29) across loop iterations; ~6 NEON instructions per tick, zero window memory accesses between ticks.
Cache-line-aligned structs — every cross-thread struct is #[repr(C, align(64))]; start_time co-located with latest_idx in the same cache line for free L1 warmth.
Lock-free data structures throughout — no mutex, condvar, or blocking sync on the hot path.
Page pre-touch, NEON warmup, and 10 warmup packets before measurement: together these remove page-fault and branch-predictor noise from the benchmarks.
Zero external dependencies — JSON output, Gregorian calendar, nanosecond timing all hand-rolled.
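The cache-line alignment point in the list above can be sketched as follows. The `#[repr(C, align(64))]` attribute and the `latest_idx` / `start_time` pairing come from the text; the struct names `TradeLogHeader` and `RingHead`, the field types, and the `same_cache_line` helper are hypothetical and exist only to make the layout checkable.

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

/// Hypothetical header: the cursor and the start timestamp share one
/// 64-byte line, so reading the cursor also pulls `start_time` into L1.
#[repr(C, align(64))]
pub struct TradeLogHeader {
    pub latest_idx: AtomicU64,
    pub start_time: u64,
}

/// Hypothetical write head, padded so that neighbouring instances never
/// share a line (avoids false sharing between threads).
#[repr(C, align(64))]
pub struct RingHead {
    pub write_head: AtomicU64,
}

// Compile-time layout checks: alignment rounds the size up to a full line.
const _: () = assert!(align_of::<TradeLogHeader>() == 64);
const _: () = assert!(size_of::<TradeLogHeader>() == 64);
const _: () = assert!(size_of::<RingHead>() == 64);

/// True if both references fall inside the same 64-byte line.
pub fn same_cache_line<A, B>(a: &A, b: &B) -> bool {
    (a as *const A as usize) / 64 == (b as *const B as usize) / 64
}
```

The `const` assertions fail the build if a later field addition pushes a struct past one line, which is cheaper than discovering the regression in a latency histogram.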