# 08 — The "No Allocations After Warmup" Contract & Enforcement Strategy
Status: Design — Milestone 2
Depends on: 01 (ownership), 02 (pooling), 05 (ring buffers), 06 (arenas), 07 (threading)
---
## 1. Purpose
Every preceding document describes *how* to avoid allocations. This document defines the
**binding contract** that the system is allocation-free in steady state, and — critically —
the machinery that makes the contract *enforceable* rather than aspirational. A
zero-allocation architecture that is not continuously verified decays within weeks: one
innocent `string.Format` in a log call, one LINQ expression in a code review that nobody
flagged, one library upgrade that starts boxing internally, and the GC is back in the hot
path.
The contract has three legs:
1. **A precise definition** of what "no allocations" means, where it applies, and when it
begins (the warmup boundary).
2. **Runtime enforcement** — in-process guards that detect violations in production and in
soak tests, with type-level attribution via EventPipe.
3. **Build-time enforcement** — Roslyn analyzers, banned-API lists, and CI gates that fail
pull requests which would violate the contract, *before* they reach a soak environment.
---
## 2. Contract Definition
### 2.1 Formal statement
> After the process signals `LifecyclePhase.SteadyState`, no thread registered as a
> **hot-path thread** shall cause any managed heap allocation (SOH, LOH, or POH), for the
> remainder of the trading session, except inside an explicitly registered
> **allocation amnesty scope** (§2.5).
Notes on scope:
- The contract applies **per thread**, not per process. Cold-path threads (admin RPC,
EOD reporting, config reload) may allocate freely. This is what makes the contract
achievable: we never try to make the *whole process* allocation-free, only the pinned
hot-path threads defined in doc 07 (`PinnedThreadRegistry`).
- "Allocation" means any managed heap allocation, including:
- `new` of a reference type, arrays included;
- boxing (implicit or explicit), including boxing through interface dispatch on
structs, `params object[]`, string interpolation, and enum-keyed dictionaries;
- closure environment classes and delegate instances created by lambda capture;
- async/iterator state machines that escape to the heap;
- hidden runtime allocations: `string.Concat`, `Enum.ToString`, exception objects,
`GCHandle`-backed allocations, params-array creation.
- Stack allocation (`stackalloc`), arena allocation (doc 06), pool *checkout* (doc 02),
and ring-buffer claims (doc 05) are **not** allocations under this contract — they are
the sanctioned alternatives.
- Native allocations (`NativeMemory.Alloc`) are forbidden in steady state by a *separate*
clause (§2.4) because, while they don't trigger GC, they indicate an unbounded-memory
bug. All native memory is reserved during `Init`.
### 2.2 Lifecycle phases
The process moves through an explicit, one-way state machine. Phase is held in a single
`static int` (volatile) and exposed via `Lifecycle.Phase`:
```
┌──────┐ ┌────────┐ ┌────────┐ ┌─────────────┐ ┌──────────┐
│ Boot │───▶│ Init │───▶│ Warmup │───▶│ SteadyState │───▶│ Teardown │
└──────┘ └────────┘ └────────┘ └─────────────┘ └──────────┘
CLR/host allocate JIT & GC CONTRACT IN contract
startup, everything: settle: FORCE: zero released;
config, pools, ring exercise managed alloc drain, free
DI wiring buffers, every hot on hot threads native mem
arenas, path ≥ N
threads times
```
| Phase | Allocation policy (hot threads) | Typical duration |
|---|---|---|
| `Boot` | Unrestricted | seconds |
| `Init` | Unrestricted — this is when pools/arenas/buffers are sized and allocated | seconds–minutes |
| `Warmup` | Unrestricted but **measured** (baseline capture, §4.3) | 30 s – 5 min |
| `SteadyState` | **Zero** managed allocation; zero native allocation | the trading session |
| `Teardown` | Unrestricted | seconds |
Transition into `SteadyState` is gated by the **warmup completion criteria** (§3.3). The
transition is performed by the control thread; hot threads observe it via a volatile read
each loop iteration (one predictable branch, ~0 cost).
### 2.3 The warmup obligation
Warmup is not optional and not "run traffic for a while." It is a deterministic procedure
with completion criteria, because two CLR mechanisms allocate or jitter *lazily*:
1. **Tiered compilation / dynamic PGO.** Methods start at Tier 0, get instrumented, and
are recompiled at Tier 1 in the background. Recompilation itself doesn't allocate on
our threads, but Tier-0 code is slow and OSR transitions cause latency noise; worse,
some BCL paths take allocating slow paths until caches warm (e.g., reflection caches,
`Utf8Formatter` lookup tables, culture data).
2. **Lazy statics and caches.** First-touch of static constructors, `Encoding` objects,
`TimeZoneInfo`, generic instantiations over new type arguments — all allocate on first
use only.
Therefore warmup **must execute every hot-path code route**, with representative data, at
least `WarmupIterations` times (default 200,000 per route — enough to cross tiering
thresholds with margin even with dynamic PGO enabled). Doc 10 §6 covers building the
warmup driver from recorded market data.
### 2.4 Native-memory clause
All `Arena`, ring-buffer, and pool backing memory is reserved and committed during `Init`
(doc 06 §4 — pre-touch policy). In `SteadyState`:
- `NativeMemory.Alloc/Realloc/Free` is forbidden on hot threads (banned-API enforced, §6.2).
- Arena `Reset()` is permitted (it is pointer arithmetic, not allocation).
- Arena/pool **growth** is forbidden — exhaustion is a fault, handled per the failure-mode
policy in doc 09 §4, never by allocating more.
### 2.5 Allocation amnesty scopes
Reality clause: some unavoidable events allocate — a `SocketException` on disconnect, a
fatal-path log message. Rather than pretend these don't exist, they must be wrapped:
```csharp
using (AllocationGuard.Amnesty(AmnestyReason.SessionDisconnect))
{
// exceptional, latency-irrelevant code; allocations recorded but not faulted
}
```
Rules:
- An amnesty scope **must** carry a reason code from a closed enum (auditable).
- Amnesty scopes are counted and exported as metrics; a scope entered more than
`MaxAmnestyPerSession` times (default 10) trips the same alarm as a contract violation,
because "exceptional" code running frequently means the design is wrong.
- Amnesty does not silence EventPipe capture — allocations inside amnesty are still
attributed and reported, just not faulted.
---
## 3. Runtime Enforcement — In-Process Guards
Defense in depth: three independent detectors with different precision/cost/attribution
trade-offs. All three run simultaneously; they answer different questions.
| Detector | Precision | Attribution | Cost on hot thread | Question answered |
|---|---|---|---|---|
| Per-thread byte counter (§3.1) | Exact (byte-accurate) | None (count only) | ~2 ns per check | *Did this thread allocate at all?* |
| GC collection sentinel (§3.2) | Coarse | None | Zero (control thread polls) | *Did a GC happen at all?* |
| EventPipe `GCAllocationTick` (§4) | Sampled (~100 KB granularity) | **Type name + size + thread** | Zero on hot thread (out-of-band) | *What allocated, and from where?* |
### 3.1 Per-thread allocation counter — the primary tripwire
`GC.GetAllocatedBytesForCurrentThread()` reads the thread's allocation-context counter.
It is exact (includes every managed allocation the thread performed), costs a few
nanoseconds, and requires no events or sessions. Each hot thread checks it once per loop
iteration:
```csharp
namespace ZeroAlloc.Enforcement;
///
/// Per-thread allocation tripwire. One instance per hot thread, created during Init,
/// armed at the SteadyState transition. Zero-allocation itself: holds only longs.
///
public struct AllocationGuard // mutable struct, lives in the thread's loop frame
{
private long _baseline; // bytes allocated by this thread at arming time
private long _amnestyBytes; // bytes excused inside amnesty scopes
private int _armed; // 0 = warmup, 1 = enforcing
public void Arm()
{
_baseline = GC.GetAllocatedBytesForCurrentThread();
_armed = 1;
}
/// Called once per event-loop iteration. Branch-predictable; ~2 ns.
public void Check()
{
if (_armed == 0) return;
long now = GC.GetAllocatedBytesForCurrentThread();
long leaked = now - _baseline - _amnestyBytes;
if (leaked > 0)
ContractViolation.Raise(leaked); // see §3.3 — does NOT throw on hot path
}
// Amnesty bookkeeping: scope records the counter on entry/exit and credits the delta.
internal void Credit(long bytes) => _amnestyBytes += bytes;
}
```
**Violation policy** (`ContractViolation.Raise`) is configurable per environment:
| Environment | Policy |
|---|---|
| CI / soak | `FailFast` — write violation record to pre-allocated diagnostic buffer, `Environment.FailFast`. The trace artifact (§4) attributes the type. |
| UAT | `Quarantine` — raise alarm, keep trading, re-baseline so each violation is reported once. |
| Production | `AlarmOnce` — alarm via pre-allocated ring-buffer message to the telemetry thread; never crash a live trading session over an allocation. |
`Raise` itself must not allocate: the violation record (thread id, leaked bytes,
timestamp) is written into a pre-allocated `SpscRingBuffer` consumed by
the telemetry thread.
### 3.2 GC collection sentinel
The control thread polls every 100 ms:
```csharp
int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
```
Any increase after `SteadyState` is a contract violation *somewhere* in the process —
possibly a cold thread allocating so heavily it threatens hot threads via GC suspension
(even cold-thread allocation triggers stop-the-world phases that suspend hot threads;
non-concurrent gen-0 collections suspend everything). Policy: cold threads have a
*budget* (default: no gen-2 ever; gen-0 rate < 1/min), enforced as warnings, because the
process-wide design goal (docs 02/06) is that even cold paths are mostly pooled.
As belt-and-braces, sessions may optionally run inside
`GC.TryStartNoGCRegion(totalSize)` sized to the cold-path budget, with
`GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency` as the fallback if the
region cannot be established. The no-GC region converts "a GC happened" into a hard,
unambiguous signal: the region ends, the sentinel sees it, the alarm fires.
### 3.3 Warmup completion criteria (arming the guards)
The control thread arms `SteadyState` only when **all** of:
1. Every registered warmup route reports ≥ `WarmupIterations` executions.
2. JIT activity has quiesced: no `MethodJitting` events (observed via the same EventPipe
session, CLR JIT keyword) for `JitQuietWindow` (default 10 s).
3. A forced full compaction has run: `GC.Collect(2, GCCollectionMode.Aggressive, blocking: true, compacting: true)`
followed by `GC.WaitForPendingFinalizers()` and a second collect — so the heap entering
the session is minimal and compacted, and any finalizer-driven allocation is flushed.
4. Each hot thread has performed one full loop iteration *after* the collect with its
`AllocationGuard` in rehearsal mode (checks but logs rather than faults) and reported
zero bytes — a dry run of the contract before it becomes binding.
---
## 4. EventPipe Allocation Tracking — Attribution
The per-thread counter says *that* you allocated; it cannot say *what*. Attribution comes
from the runtime's allocation events over EventPipe, captured **out-of-process** (CI,
soak) or by a **sidecar in-process session** (production), so the hot threads pay nothing.
### 4.1 The event: `GCAllocationTick`
Provider `Microsoft-Windows-DotNETRuntime`, keyword `GC` (`0x1`), level `Verbose`.
The runtime emits `GCAllocationTick` approximately once per **100 KB** allocated (per
heap), with payload including `TypeName`, `AllocationAmount`/`AllocationAmount64`,
`AllocationKind` (Small/Large/Pinned), and `HeapIndex`; the event metadata carries the OS
thread id, letting us filter to registered hot threads.
Implications of the 100 KB sampling granularity:
- A *sustained* leak (the realistic failure: a per-message allocation) crosses 100 KB
within milliseconds at trading message rates and is attributed almost immediately.
- A *single tiny* allocation may not produce a tick. That is why §3.1 (byte-exact
counter) is the tripwire and EventPipe is the *attributor*: counter fires → CI harness
replays the workload under a full-verbosity trace to capture the type.
- .NET 9+ adds a configurable `AllocationSampled` event (provider keyword
`0x80000000000`, Poisson-sampled, default mean 100 KB, tunable down) that yields
per-sample type + stack with bounded overhead; the harness uses it when available and
falls back to `GCAllocationTick` otherwise. **Maintainers should verify the keyword
value and event name against the runtime version in use** — this surface is newer than
`GCAllocationTick`.
### 4.2 Out-of-process capture (CI and soak) — `dotnet-trace`
```bash
dotnet-trace collect \
--process-id $PID \
--providers "Microsoft-Windows-DotNETRuntime:0x1:5" \
--buffersize 512 \
--output steadystate.nettrace \
--duration 00:10:00
```
`0x1` = GC keyword, `5` = Verbose (required for allocation ticks). The resulting
`.nettrace` is parsed by the CI analyzer (§5.3) using the
`Microsoft.Diagnostics.Tracing.TraceEvent` library:
```csharp
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Etlx;
using Microsoft.Diagnostics.Tracing.Parsers.Clr;
using var source = new EventPipeEventSource("steadystate.nettrace");
var perType = new Dictionary();
source.Clr.GCAllocationTick += (GCAllocationTickTraceData e) =>
{
if (e.TimeStamp < steadyStateMark) return; // ignore warmup
if (!hotThreadIds.Contains(e.ThreadID)) return; // hot threads only
perType.TryGetValue(e.TypeName, out long n);
perType[e.TypeName] = n + e.AllocationAmount64;
};
source.Process();
// any entry in perType => gate failure, with type names in the report
```
For call-stack attribution (which line allocated), the harness adds a second provider
spec capturing stacks: in practice, run the failing workload once more under
`dotnet-trace collect --profile gc-verbose`, open in PerfView/Visual Studio, and the
allocation stacks identify the call site. The CI report includes the exact reproduction
command.
### 4.3 In-process sidecar (production)
Production hosts often forbid attaching diagnostic tools. The sidecar is a *cold*,
unpinned, low-priority thread inside the process running an `EventListener`:
```csharp
internal sealed class AllocationAttributor : EventListener
{
// Pre-allocated, bounded storage; the listener thread MAY allocate (it is a cold
// thread) but is written pool-friendly to avoid disturbing the GC sentinel budget.
private readonly ViolationSink _sink; // ring buffer to telemetry
protected override void OnEventSourceCreated(EventSource src)
{
if (src.Name == "Microsoft-Windows-DotNETRuntime")
EnableEvents(src, EventLevel.Verbose, (EventKeywords)0x1 /* GC */);
}
protected override void OnEventWritten(EventWrittenEventArgs e)
{
if (e.EventName != "GCAllocationTick") return;
if (Lifecycle.Phase != LifecyclePhase.SteadyState) return;
if (!PinnedThreadRegistry.IsHotOsThread(e.OSThreadId)) return;
// Payload order per ClrEtwAll manifest: AllocationAmount, AllocationKind,
// ClrInstanceID, AllocationAmount64, TypeID, TypeName, HeapIndex, Address...
// Resolve by name for robustness across runtime versions:
int typeIdx = e.PayloadNames!.IndexOf("TypeName");
int amtIdx = e.PayloadNames!.IndexOf("AllocationAmount64");
_sink.Report(e.OSThreadId,
(string)e.Payload![typeIdx]!,
(long)e.Payload![amtIdx]!);
}
}
```
Caveats (documented because they bite):
- The `EventListener` callback runs on dispatcher threads and itself allocates (payload
arrays, strings). That is acceptable: it is a cold thread, registered as such, and
its allocations are excluded from the cold-thread budget by thread id.
- Enabling the GC keyword at Verbose adds the runtime's event-writing cost to allocating
threads. Hot threads don't allocate in steady state, so they pay **nothing**; the cost
lands only on the (allowed-to-allocate) cold threads and on the violator — exactly
where we want the evidence.
- `EventWrittenEventArgs.OSThreadId` and name-based payload lookup are used instead of
positional indices to survive payload-schema additions across runtime versions.
### 4.4 What EventPipe gives us per environment
| Environment | Mechanism | Output |
|---|---|---|
| Dev inner loop | `dotnet-counters monitor` (alloc rate, GC counts) + unit gates | immediate feedback |
| CI PR gate | Harness + `dotnet-trace` + TraceEvent analyzer | pass/fail + per-type table + repro command |
| Nightly soak | 8-hour run, rotating 10-min `.nettrace` captures | trend report; catches slow leaks & rare paths |
| Production | In-process sidecar `EventListener` | alarm with type name within ~1 s of violation |
---
## 5. CI Gates
Three gates, run in order of cost. A PR must pass all three to merge into the hot-path
projects (cold-path projects are exempt by directory).
### 5.1 Gate A — Static analysis (seconds, every build)
Compile-time prevention via Roslyn:
1. **Banned APIs** — `Microsoft.CodeAnalysis.BannedApiAnalyzers` with a
`BannedSymbols.txt` checked into the hot-path projects:
```
T:System.Linq.Enumerable; LINQ allocates enumerators/closures — use indexed loops
M:System.String.Format(System.String,System.Object); boxes & allocates — use Utf8Formatter into pooled buffers
M:System.String.Concat(System.String,System.String); allocates — use pooled Utf8 builders
T:System.Collections.Generic.List`1; growth allocates — use FixedList (doc 03 §7)
T:System.Collections.Generic.Dictionary`2; resizing/boxing hazards — use FixedDictionary
M:System.GC.Collect; forbidden outside Lifecycle transitions
M:System.Runtime.InteropServices.NativeMemory.Alloc(System.UIntPtr); Init-phase only — use Arena
T:System.Threading.Tasks.Task; async state machines escape to heap — use the event-loop model (doc 07)
M:System.Enum.ToString; reflection + allocation — use precomputed name tables
T:System.Text.StringBuilder; allocates — use Utf8Writer over pooled buffers
```
(Full list maintained in `eng/BannedSymbols.txt`; the excerpt shows the shape.
Init/Teardown code that legitimately needs these lives in separate projects or uses
`#pragma warning disable RS0030` with a mandatory justification comment, grep-audited.)
2. **Heap-allocation analyzer** — an allocation-detection analyzer (e.g., the
`ClrHeapAllocationAnalyzer` family / `ErrorProne.NET` analyzers) configured as
**error** severity in hot-path projects for: explicit `new` of reference types,
boxing, closure captures, delegate allocations, params-array creation, and
value-type-to-interface conversions. *Maintainers should pin whichever analyzer
package the org standardizes on; the design requirement is the rule set, not the
specific package.*
3. **In-house analyzers** (specified in doc 11 §9): enforce ownership annotations from
doc 01 (`[Borrowed]`, `[Owns]`, `[PoolReturned]`), forbid `ref struct` escape via
captured spans, and require `readonly` on message structs (doc 03).
`.editorconfig` scoping keeps all of this **error** in `src/HotPath/**` and **suggestion**
elsewhere, so the cold path stays productive.
### 5.2 Gate B — Micro-gates via BenchmarkDotNet (minutes, every PR)
Each hot-path component ships allocation benchmarks with `[MemoryDiagnoser]`. The gate
asserts the `Allocated` column is **0 B** per op (BDN measures via
`GC.GetTotalAllocatedBytes`, byte-exact):
```csharp
[MemoryDiagnoser]
public class OrderPathBenchmarks
{
private OrderGateway _gw = null!;
private MarketDataReplay _replay = null!;
[GlobalSetup]
public void Setup()
{
_gw = TestHost.BuildGateway(); // Init-phase allocation is fine here
_replay = MarketDataReplay.Load("nasdaq-sample.bin");
TestHost.Warmup(_gw, _replay); // drives the real warmup procedure
}
[Benchmark]
public void TickToOrder() => _gw.ProcessOne(_replay.Next());
}
```
A small runner parses BDN's JSON exporter output and fails the build if any benchmark in
the `HotPath` category reports `Allocated > 0`. This catches per-operation allocations
with **exact** attribution to the benchmarked component, far faster than a soak run.
### 5.3 Gate C — Integration gate with EventPipe (10–20 min, every PR to main)
The full-system gate, run by `eng/ci/allocation-gate.sh`:
1. Launch the trading host in **CI mode** (`ZEROALLOC_POLICY=FailFast`) against the
exchange simulator with recorded market data (3 replay profiles: quiet, normal,
stressed — the stressed profile includes opening-auction bursts and feed gaps,
exercising recovery paths).
2. Wait for the host to log `SteadyState` (it performs the §3.3 procedure itself).
3. Attach `dotnet-trace` with the GC-verbose provider (command in §4.2) for the full run.
4. Drive 10 minutes of replay per profile.
5. **Pass criteria:**
- process exited 0 (no `FailFast` from an `AllocationGuard`);
- `GC.CollectionCount` deltas across `SteadyState` = 0 for all generations
(exported by the host on shutdown);
- TraceEvent analysis of the `.nettrace` shows **zero** `GCAllocationTick` events
attributed to hot thread ids after the steady-state marker event (the host emits a
custom `ZeroAlloc/SteadyStateEntered` EventSource event so the analyzer can find
the boundary in the same trace);
- amnesty-scope count within budget.
6. On failure, the job publishes: the per-type allocation table, the violating thread
names, the `.nettrace` artifact, and the one-line repro command.
Pipeline sketch (GitHub Actions; Azure DevOps equivalent in `eng/ci/`):
```yaml
allocation-gate:
runs-on: [self-hosted, hft-bench] # isolated, core-pinnable runner (doc 07 §8)
steps:
- uses: actions/checkout@v4
- run: dotnet build -c Release
- run: dotnet test -c Release --filter Category=HotPathUnit
- run: dotnet run -c Release --project bench/HotPath.Benchmarks -- --filter '*' --exporters json
- run: eng/ci/check-bdn-zero-alloc.sh bench/BenchmarkDotNet.Artifacts/results
- run: eng/ci/allocation-gate.sh --profiles quiet,normal,stressed --duration 600
- uses: actions/upload-artifact@v4
if: failure()
with: { name: nettrace, path: artifacts/*.nettrace }
```
### 5.4 Nightly soak gate
Identical to Gate C but 8 hours, with live-like jittered replay, plus:
- memory ceiling assertion (working set flat ±2% after hour 1 — catches native leaks the
managed gates can't see);
- p99.9/p99.99 latency regression check against the stored baseline;
- amnesty-reason histogram diffed against the previous night.
---
## 6. Configuration Baseline
Runtime configuration that the contract assumes (checked at startup by
`Lifecycle.ValidateRuntimeConfig()`, which fails Boot if violated):
```xml
false
false
true
true
true
true
false
```
Rationale notes: with a genuinely allocation-free hot path the *choice* of GC barely
matters in steady state — the configuration above optimizes the failure case (if a GC
does happen, workstation non-concurrent gen-0 on a tiny heap is microseconds) and the
warmup case. Teams preferring `TieredCompilation=false` (or R2R + composite images) for
deterministic first-call latency may do so; the warmup criteria in §3.3 make either
choice safe.
---
## 7. Summary of Responsibilities
| Actor | Obligation |
|---|---|
| Hot-path developer | Code passes Gates A–C; allocating code wrapped in amnesty with reason, or moved to cold thread |
| Component owner | Ships BDN zero-alloc benchmarks (Gate B) for every public hot-path API |
| Platform team | Maintains banned list, analyzers, gate scripts, warmup driver, replay profiles |
| Control thread (runtime) | Executes warmup criteria, arms guards, runs GC sentinel, hosts EventPipe sidecar |
| Hot thread (runtime) | One `AllocationGuard.Check()` per loop iteration |
| Ops | Treats production violation alarms as sev-2: trade on, fix before next session |