# 13 — Benchmark & Validation Methodology > This document defines how the architecture's latency and "no allocations > after warmup" claims are measured, so that results are reproducible, > statistically honest, and comparable across code changes. It complements the > enforcement tooling (`tools/AllocationGate`, doc 08) with the *measurement* > discipline that the migration guide's acceptance criteria reference. ## 13.1 What we measure For each release candidate, on the reference hardware profile (doc 12 §12.5): 1. **Wire-to-wire latency**: market-data packet arrival (NIC hardware or kernel timestamp) → order bytes handed to the transmit path. This is the number that matters commercially. 2. **Stage latencies**: decode → book update → signal → risk → encode, timestamped at ring-buffer hand-offs (doc 05) using `Stopwatch.GetTimestamp()`. 3. **Hiccups**: scheduling/jitter events on pinned threads, independent of message flow (§13.5). 4. **Allocation behaviour**: AllocationGate over both short (CI) and soak (nightly) windows. 5. **GC activity**: collection counts and pause durations, which must be **zero after warmup** (`GC.CollectionCount(g)` deltas and `GC.GetGCMemoryInfo().PauseDurations` snapshots taken at window boundaries). ## 13.2 Timestamping rules * Use `Stopwatch.GetTimestamp()` (a `long`; allocation-free) at instrumentation points; convert to nanoseconds only at report time: `ns = ticks * 1_000_000_000 / Stopwatch.Frequency`. Verify at startup that `Stopwatch.IsHighResolution` is true and frequency implies sub-µs resolution; abort the benchmark otherwise (HPET fallback invalidates results — doc 12 §12.5). * Timestamps ride **inside the message structs** (doc 03 reserves the `IngressTimestamp` field) so stage latency needs no side lookups. * Never timestamp with `DateTime.UtcNow` in the hot path; it is coarser and on some platforms slower. * For wire-to-wire numbers, prefer NIC hardware timestamps (or capture-card timestamps on a mirrored port) over application timestamps; application receive timestamps systematically *understate* latency because they miss kernel/NIC queueing. ## 13.3 Recording without allocating: histograms Latencies are recorded into **pre-allocated histograms**, never into growing lists: * Use HdrHistogram-style log-bucketed histograms. The recommended package is **HdrHistogram (.NET port, 2.5.x line)**; construct all histogram instances during warmup, sized `lowestDiscernible=1ns, highest=10s, 3 significant digits`. `RecordValue(long)` on a pre-built histogram is allocation-free and therefore legal in the hot path. If the dependency is unwanted in the engine process, the same structure (~23 KB of `long[]` buckets per histogram) is trivially implemented in-repo; either way the *recording* path must be verified by AllocationGate like any other hot-path code. * One histogram **per stage per thread** (no cross-thread sharing; merge at report time on the control plane). * Snapshot/reset happens on the control-plane thread via the command ring (doc 05), never by the hot thread taking a lock. ## 13.4 Coordinated omission — mandatory correction Replay benchmarks that send the next message only after the previous one completes will silently hide stalls: during a 5 ms hiccup, an open-loop production feed would have delivered thousands of messages that all experienced the stall, but a closed-loop test records *one* slow sample. This is **coordinated omission** and it can understate p99.9 by orders of magnitude. Rules: 1. Replay harnesses are **open-loop**: messages are injected on the schedule of the recorded feed's original timestamps (or a fixed target rate), regardless of whether the system under test has finished the previous message. 2. Latency is measured from the **intended** send time, not the actual send time, so backpressure shows up as latency rather than disappearing. 3. When closed-loop measurement is unavoidable, apply HdrHistogram's `CopyCorrectedForCoordinatedOmission(expectedInterval)` and report both raw and corrected distributions, clearly labelled. ## 13.5 Hiccup meter Independent of message load, each pinned core runs (in benchmark builds) a jHiccup-style probe: a loop that sleeps/spins for a fixed short interval, timestamps before and after, and records `observed − expected` into a histogram. This isolates *platform* jitter (SMIs, IRQs that escaped affinity, frequency transitions, GC threads straying onto the core) from *application* latency. A clean platform shows hiccup p99.99 in the low microseconds; anything worse means doc 12 §12.5's checklist was not actually applied, and message-latency results from that run are quarantined. ## 13.6 Reporting standards * Report **p50 / p90 / p99 / p99.9 / p99.99 / max** and sample count. Never report a mean or standard deviation as a headline number — latency distributions are heavy-tailed and means are misleading. * Plot full percentile curves (HDR "percentile distribution" output) for before/after comparisons; a single percentile can hide a crossed-over tail. * Every report records: commit hash, runtime version, full doc-12 configuration dump, hardware profile, NIC/driver versions, message rate, and run duration. Results without provenance are not accepted in review. * Minimum run lengths: 5 minutes for smoke comparisons; 1 hour for release qualification; 24-hour soak nightly. Tail percentiles need samples: p99.99 requires ≥ 10⁶ measurements to have ~100 samples beyond it. ## 13.7 Run protocol 1. **Environment assert**: run the deployment validation script (doc 12); abort on any mismatch. 2. **Cold start** the engine; let the full warmup protocol complete; confirm the `WarmupComplete` marker. 3. **Attach AllocationGate** for the entire measurement window (`--max-ticks 0`). 4. **Quarantine first N seconds** after the marker (default 30 s) from latency statistics — caches, branch predictors, and frequency residency settle even after functional warmup. 5. **Replay** the canonical captured market-data day(s) open-loop at recorded rate, then at 2× and 5× rate for headroom characterisation. 6. **Collect**: histograms, hiccup histograms, AllocationGate JSON, GC counters (must show zero collections post-marker), working set and handle-count time series (flat lines expected — FMEA F-7/F-9 detection). 7. **Repeat ≥ 5 runs**; report the distribution *across* runs at each percentile. A regression is flagged when the new build's p99.9 exceeds the baseline's p99.9 by more than the cross-run spread (no single-run comparisons). ## 13.8 CI integration tiers | Tier | Trigger | Duration | Gates | |---|---|---|---| | **PR gate** | every PR | ~2 min | AllocationGate zero-tick window on replay smoke; benchmark *compiles and runs*; no latency gating (CI hardware is too noisy for ns-level assertions). | | **Nightly** | scheduled, dedicated bare-metal runner | 1 h replay + 1 h soak gate window | AllocationGate zero-tick over the soak window; GC-count delta == 0; p99.9 regression check vs. rolling baseline with cross-run spread tolerance. | | **Release qualification** | per release | 24 h soak + full replay matrix | All of the above; working-set/handle flatness; hiccup-meter platform validation; signed report archived with the release. | Latency assertions run **only** on the dedicated bare-metal runner with the doc-12 profile applied; asserting nanoseconds on shared cloud CI produces flaky gates and teaches the team to ignore red builds, which is worse than not gating. ## 13.9 Honest-measurement checklist (reviewer crib sheet) * [ ] Open-loop injection, or CO-corrected and labelled? * [ ] Percentiles, not means? Max reported? * [ ] Enough samples for the claimed percentile? * [ ] Warmup quarantined; marker confirmed; gate attached and green? * [ ] Environment validation log attached? Hiccup meter clean? * [ ] ≥ 5 runs; cross-run spread shown? * [ ] Configuration identical to production profile? A benchmark result missing any item is returned to the author. The point of this milestone's architecture is a *provable* property; the proof is only as good as the measurement discipline behind it.