# Reef Storage Engine Architecture This document describes the durable write path and storage engine delivered in milestone 2: the write-ahead log (WAL), segments, manifests, the background indexer, compaction, recovery, and the consistency model. It is the authoritative reference for the on-store format. Higher layers (query execution, caching, branching) build on the invariants defined here. > **Scope note.** Reef is an original, clean-room project. Object storage as > the source of truth with stateless compute is a publicly known architectural > pattern; everything below is Reef's own design. ## 1. Design goals 1. **Object storage is the only durable state.** A compute node can be destroyed at any moment and a fresh node must be able to serve the namespace from the bucket alone. Local disk and memory are caches, never sources of truth. 2. **Append-only, immutable objects.** No object is ever modified in place. Visibility changes happen only through *manifest publication*. 3. **Crash-safety by construction.** Every multi-step operation is ordered so that a crash at any point leaves either (a) the old state, or (b) the new state, or (c) the old state plus unreachable orphan objects that GC removes later. There is no crash point that produces a state requiring manual intervention to serve reads. 4. **Minimal backend requirements.** The engine needs only whole-object atomic PUT, conditional create (`If-None-Match: *` / put-if-absent), GET (with ranges), DELETE, and prefix listing. No rename, no append, no conditional overwrite. This keeps every S3-compatible service viable. ## 2. On-store layout All objects for a namespace live under a single prefix: ``` namespaces// manifest/ .manifest.json # versioned manifests, generation = zero-padded u64 wal/ .wal # one object per committed write batch segments/ .seg # immutable queryable segments quarantine/ ... # objects moved aside by `repair` ``` Key properties of the naming scheme: * WAL objects are named by their **log sequence number (LSN)**, zero-padded so lexicographic listing order equals log order. * Manifests are named by **generation**, also zero-padded, so the latest manifest is the last key under `manifest/`. * Segment ids are unique (ULID-style), so a re-run of a crashed indexing or compaction job can never collide with a previously published segment. ## 3. Write path ### 3.1 Commit protocol A write batch (upserts, patches, deletes — row- or column-oriented at the API boundary, normalized to row operations internally) commits as follows: 1. The writer validates the batch (id syntax, vector dimensions against the namespace schema, value types) and evaluates any **conditional write** predicates against current visible state under the writer lock. 2. The batch is checked against the **idempotency window** (§3.4). A replayed key returns the original receipt without re-writing. 3. The batch is serialized into a WAL object and PUT at `wal/.wal` using **put-if-absent**. Success of this conditional PUT *is the commit point*: the data is durable and the LSN is allocated in one atomic step. If the conditional PUT loses (another writer holds the namespace), the writer re-syncs from the manifest and either retries or surfaces a fencing error. 4. The batch is applied to the in-memory **memtable** overlay so the writing node immediately serves read-your-writes. A WAL object is fully self-describing: ``` [ magic "RFWAL1\0\0" | format version u32 ] [ header: lsn, batch op count, created_at, idempotency key?, condition record? ] [ record 0: u32 length | u32 crc32c | payload ] [ record 1: ... ] [ footer: total record count | crc32c of header+records ] ``` Per-record CRCs detect partial corruption; the footer detects truncation. A WAL object that fails any check is treated as **never committed** (see §7.3 — because the PUT is whole-object atomic, a corrupt WAL object can only result from storage-layer damage, not from a writer crash). ### 3.2 Batched upserts Clients are encouraged to batch. One WAL object per batch amortizes the PUT round-trip; the engine additionally coalesces concurrent small batches from the same process into a single WAL object when they arrive within the group window (default 5 ms), preserving per-batch receipts. ### 3.3 Patches and deletes * **Patch** records carry only the changed attributes; the indexer resolves them against the newest prior version of the document (memtable → newer WAL → segments) when folding. * **Delete** records become tombstones. Tombstones shadow older versions at read time and are physically dropped during compaction once no older segment can contain the id. ### 3.4 Idempotent writes A batch may carry an `idempotency_key`. The manifest stores a bounded, sliding **dedup window** mapping recent keys → (LSN, receipt digest). Replays inside the window are acknowledged without new WAL objects. The window is bounded by count and age (configurable; default 10 000 keys / 24 h) and is advanced by the indexer when it publishes manifests — so deduplication survives restarts, because the window lives in the manifest, not in memory. ### 3.5 Conditional writes Supported predicates: *create-only* (fail if the id exists), *must-exist* (fail if absent), and *attribute equality* guards. Predicates are evaluated under the single-writer lock against the visible state at commit time, and a copy of the predicate is recorded in the WAL header so recovery replays reach the same accept/reject decision deterministically. ## 4. Manifest The manifest is the **root of visibility**. Generation `g` is published by `put_if_absent("manifest/.manifest.json")` — a compare-and-swap on plain object storage. A manifest contains: * format version and namespace schema fingerprint * the **segment list**: id, level, row count, tombstone count, key/LSN range, size, and per-file checksum for every live segment * `wal_floor`: the first LSN **not yet folded** into segments * `last_applied_lsn`: the newest LSN reflected in the segment set * the idempotency dedup window (§3.4) * a writer **fencing token** (epoch) — a writer that fails a manifest CAS knows it has been superseded and must re-open the namespace * GC bookkeeping: tombstoned object keys awaiting grace-period deletion Manifests are small (KBs) and cached aggressively. Old manifest generations are retained for a configurable horizon, which gives time-travel-style debug reads and is the anchor for namespace branching in a later milestone. ## 5. Read path inside the engine A read resolves a document id through three layers, newest first: 1. **Memtable** — writes applied on this node since the last fold. 2. **Unfolded WAL range** — `[wal_floor, head]`, replayed into the memtable at open time (so after restart, layer 1 already contains layer 2). 3. **Segments** — consulted newest-to-oldest by LSN range; the first hit (value or tombstone) wins. Segments carry bloom filters over ids so point reads usually touch one segment. Range/scan reads merge iterators across layers with the same newest-wins rule. ## 6. Background indexer The indexer folds WAL entries into segments. One pass: 1. Load the latest manifest; list `wal/` keys in `[wal_floor, head]`. 2. Stream and validate those WAL objects, building sorted row blocks, resolving patches, and materializing tombstones. 3. Write one or more **segment objects** with fresh unique ids: ``` [ magic "RFSEG1\0\0" | format version ] [ row blocks: sorted by id, each block length-prefixed + crc32c ] [ column blocks: vectors / attribute columns, each crc32c'd ] [ id bloom filter ] [ block index: id ranges → block offsets ] [ footer: offsets of all sections, stats (min/max id, lsn range, row & tombstone counts), footer crc32c, magic again ] ``` The footer-last layout means a truncated upload is detectable from the tail alone, and a single ranged GET of the footer is enough to plan reads. 4. Publish a new manifest (CAS) that adds the segments and advances `wal_floor`. **Only this step changes visibility.** 5. Move folded WAL keys into the GC pending list (deleted later, after the grace period). If the indexer crashes after step 3 but before step 4, the uploaded segments are unreachable orphans: no manifest references them, `wal_floor` is unchanged, and the next pass re-folds the same WAL range under new segment ids. Correctness is unaffected; GC reclaims the orphans. ## 7. Compaction ### 7.1 Strategy Size-tiered: the planner picks the smallest N segments in a level whose combined size crosses the level threshold, merges them (newest-wins, dropping shadowed versions and expirable tombstones), writes the merged segment(s) under fresh ids, and publishes a manifest that atomically swaps inputs for outputs. Inputs go onto the GC pending list with a timestamp. ### 7.2 Atomicity Compaction has exactly one visibility step — the manifest CAS — so a crash at any other point leaves the previous manifest fully serving and at worst some orphan objects. Readers holding the old manifest keep working because input segments are not physically deleted until the GC grace period (default 15 minutes) has passed *and* no retained manifest generation references them. ### 7.3 Corruption and incompleteness handling | Condition | Detection | Handling | |---|---|---| | Truncated segment upload (crash mid-PUT) | impossible to observe — object stores expose whole-object PUTs atomically; a partial upload never becomes visible | n/a | | Segment bytes damaged at rest | footer/block crc32c mismatch, size mismatch vs manifest | `verify` flags it; `repair` quarantines the segment and `rebuild` reconstructs from WAL if still retained, else from the most recent healthy manifest generation | | WAL object damaged at rest | record/footer crc mismatch | treated as never-committed if it is the head; `repair` quarantines and reports the affected LSN range otherwise | | Manifest damaged | JSON/schema/checksum failure | open falls back to the previous generation; `repair --apply` republishes it as the head | | Orphan segments/WAL (crash between upload and publish) | referenced by no retained manifest | `gc` deletes after grace period | ## 8. Recovery Opening a namespace: 1. List `manifest/`, take the highest generation that validates (falling back generation-by-generation on damage, which is logged loudly). 2. List `wal/` from `wal_floor`; validate and replay each object into the memtable in LSN order. The head WAL object, if invalid, is discarded as uncommitted; an invalid object *followed by valid ones* is a hard error surfaced to `repair` (it indicates at-rest damage, not a crash). 3. Resume background indexing/compaction from exactly the manifest state. Crash matrix (all covered by tests): | Crash point | State after restart | |---|---| | After WAL PUT, before memtable apply | Batch is durable; replay at open restores it. Read-your-writes holds. | | Mid-indexing, after segment upload, before manifest CAS | Old manifest serves; WAL re-folded under new ids; orphans GC'd. | | Mid-compaction, before manifest CAS | Inputs still live; merge re-run; orphans GC'd. | | After manifest CAS, before WAL/old-segment cleanup | New state serves; cleanup completes via GC pending list. | | During GC | Pending list is re-derived from manifests; deletes are idempotent. | ## 9. Consistency model * **Durability.** A write is durable when its WAL object PUT succeeds. The client receipt is returned only after that. * **Single-writer per namespace.** Exactly one engine instance may accept writes for a namespace at a time, enforced by LSN put-if-absent and the manifest fencing epoch: a superseded writer deterministically fails its next WAL or manifest CAS and steps down. Split-brain cannot publish. * **Read-your-writes** on the writing node, immediately (memtable overlay). * **Other readers** (multi-node mode) are *eventually consistent*: they see a write once they observe a manifest at or past it, or by tailing the WAL. Staleness is bounded by the reader's manifest/WAL poll interval (configurable; this tradeoff buys stateless, horizontally scalable readers). * **No cross-namespace transactions.** A WAL batch is atomic within its namespace: all of its operations become visible together. ## 10. Garbage collection GC is two-phase. Phase 1 (performed by indexer/compaction): superseded object keys are added to the manifest's pending list with timestamps. Phase 2 (`reef gc`): keys are deleted only when (a) the grace period has elapsed and (b) no retained manifest generation references them. Orphans never listed in any manifest are detected by prefix-listing diff and given the same grace period keyed off their `LastModified` time. Deletes are idempotent, so a crash mid-GC is harmless. ## 11. Operational commands * `reef verify [--deep]` — manifest chain, WAL framing, segment footers; `--deep` re-checksums every block. Exit code 2 on problems. * `reef rebuild` — reconstructs all derived state from the bucket alone, proving disaster recovery works. * `reef repair [--apply]` — quarantines damaged objects, rolls the manifest head back past invalid generations; dry-run by default. * `reef gc [--apply]` — phase-2 deletion; dry-run by default. * `reef index` / `reef compact` — run one background pass in the foreground (useful in tests and cron-style deployments without a resident worker). ## 12. Tradeoffs and future work * **PUT-per-commit latency.** Commit latency is one object-store PUT (~10–80 ms on S3, ~1–5 ms on MinIO/SSD). Group commit amortizes throughput; latency-sensitive deployments should batch. We deliberately did not add a secondary low-latency log (e.g., a replicated disk log) in v1 to preserve the "bucket is everything" property. * **Single writer.** Per-namespace write throughput is bounded by one writer. Namespaces are the sharding unit; this matches the intended workload (many isolated tenant namespaces). * **List-based discovery.** Recovery and GC rely on prefix listing, which is rate-limited on some providers; the zero-padded naming keeps listings ranged and cheap. * Planned on this foundation: filter/vector/text index blocks inside segments (milestone 3–4), copy-on-write branching anchored on manifest generations (milestone 6), and a distributed cache tier (milestone 5).