# Benchmark guide & methodology

This guide explains how to run the Lagoon benchmark suite, what each number
means, and — most importantly — how to report results honestly.

## 1. Principles

1. **Measure what users experience.** Primary numbers are client-observed
   latencies (HTTP + JSON included). Server-internal `took_ms` is recorded
   alongside for diagnosis, never substituted silently.
2. **Disclose the environment.** Every result file embeds timestamp, host,
   platform, CPU count, and the server URL. You must additionally record the
   storage backend (local filesystem / MinIO / real S3 + region), instance
   type, and whether the server was a release build. A number without its
   environment is not a result.
3. **Deterministic datasets.** `datagen.py` produces seeded data: vectors
   from a 64-component Gaussian mixture (L2-normalised) and text from a
   256-word vocabulary with Zipf(1.1) term frequencies. Identical parameters
   ⇒ identical bytes, on any machine.
4. **No unverified comparisons.** See § 7.
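
The seeding idea behind principle 3 can be sketched as follows. This is not the code in `datagen.py`: the function name, the word forms, and the centre/noise scales are illustrative assumptions, and only the 64 components, the 256-word vocabulary, and the Zipf exponent of 1.1 come from this guide. The sketch shows same-seed repeatability within one Python build; it does not by itself demonstrate byte-identical output across machines.

```python
import math
import random

N_COMPONENTS = 64     # from the guide
VOCAB_SIZE = 256      # from the guide
ZIPF_EXPONENT = 1.1   # from the guide


def make_generators(seed, dim):
    """Return (next_vector, next_text) closures driven by one seeded RNG."""
    rng = random.Random(seed)
    # Mixture centres; the unit centre scale is an illustrative assumption.
    centres = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(N_COMPONENTS)]
    vocab = [f"w{i:03d}" for i in range(VOCAB_SIZE)]  # hypothetical word forms
    # Zipf: the weight of the rank-r word is proportional to 1 / r**s.
    weights = [1.0 / (rank ** ZIPF_EXPONENT)
               for rank in range(1, VOCAB_SIZE + 1)]

    def next_vector():
        centre = rng.choice(centres)
        # The 0.1 noise scale is an illustrative assumption.
        raw = [c + rng.gauss(0.0, 0.1) for c in centre]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]  # L2-normalised

    def next_text(n_words=20):
        return " ".join(rng.choices(vocab, weights=weights, k=n_words))

    return next_vector, next_text
```

Two generators built with the same `seed` and `dim` yield the same sequence of vectors and texts, which is the property the benchmarks rely on.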

## 2. Test matrix

Recommended configurations, smallest first:

| Tier | Docs | Dim | Backend | Purpose |
|---|---|---|---|---|
| smoke | 5,000 | 64 | filesystem | CI sanity, seconds to run |
| dev | 50,000 | 256 | MinIO (Docker) | default; realistic object I/O on one box |
| full | 500,000+ | 768 | real S3, same region | publishable numbers |

Numbers from the filesystem backend say nothing about object-storage latency;
label them as such. MinIO on localhost has near-zero network latency — it
exercises the request *path*, not S3's latency profile. Only same-region S3
results should be quoted as "object-storage performance".

## 3. What each benchmark measures

### 3.1 Ingest (`bench_ingest.py`)

- A batch "counts" when the upsert API returns success, which by Lagoon's
  write contract means the batch is durable in the WAL on the storage
  backend. This is **durable-write throughput**, not buffer throughput.
- **Index catch-up** is reported separately: the time after the last write
  until background indexing reports zero pending WAL entries. Documents are
  queryable through the WAL overlay before catch-up; catch-up affects how
  much of the corpus is served from optimised segments.
- `--concurrency N` runs N closed-loop writer threads. This is a simple
  saturation probe, not an open-loop load test.
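
A closed-loop writer pool of the kind described above might look like this minimal sketch. `upsert_batch` is a hypothetical stand-in for whatever client call `bench_ingest.py` makes; the only assumption about it is that it blocks until the server acknowledges the batch.

```python
import threading
import time


def run_closed_loop(upsert_batch, batches, concurrency):
    """Run `concurrency` writer threads. Each thread sends its next batch
    only after the previous call has returned (closed loop).
    Returns (acked_docs, elapsed_seconds)."""
    lock = threading.Lock()
    pending = list(batches)
    acked = 0

    def writer():
        nonlocal acked
        while True:
            with lock:
                if not pending:
                    return
                batch = pending.pop()
            upsert_batch(batch)        # hypothetical call; blocks until acked
            with lock:
                acked += len(batch)    # count only acknowledged batches

    threads = [threading.Thread(target=writer) for _ in range(concurrency)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acked, time.perf_counter() - start
```

Durable-write throughput is then `acked_docs / elapsed_seconds`. Because each writer waits for its own acknowledgement, the offered load falls whenever the server slows down, which is why this is a saturation probe rather than an open-loop load test.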

### 3.2 Recall (`bench_recall.py`)

- Ground truth is the server's own `mode=exact` brute-force kNN over the same
  namespace — so recall isolates the **ANN index's approximation error**, not
  differences in distance arithmetic.
- `recall@k = |ANN top-k ∩ exact top-k| / k`, averaged over queries (k ∈
  {1, 10, 100}).
- Query vectors are drawn from the same mixture as the corpus (different
  noise), giving realistic difficulty. Uniform-random queries do not follow
  the corpus distribution and would inflate or deflate recall depending on
  dimension; don't substitute them.
- Always report recall **with** the ANN/exact latency pair. Recall without
  its latency cost (or vice versa) is half a number.
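
The recall formula above translates directly into code. The sketch below assumes each result is an ordered list of document ids and that both lists contain at least `k` entries; the function names are illustrative, not taken from `bench_recall.py`.

```python
def recall_at_k(ann_ids, exact_ids, k):
    """|ANN top-k ∩ exact top-k| / k for a single query."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k


def mean_recall_at_k(results, k):
    """Average recall@k over (ann_ids, exact_ids) pairs, one pair per query."""
    scores = [recall_at_k(ann, exact, k) for ann, exact in results]
    return sum(scores) / len(scores)
```

Order within the top-k does not matter here: an ANN result that returns the right ten documents in a different order still scores 1.0 at k = 10.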

### 3.3 Latency (`bench_latency.py`)

Definitions used throughout:

- **Cold**: the very first query against a namespace after the server process
  starts (or its local cache directory is cleared), forcing manifest and
  segment fetches from object storage. **Only the first query is a true cold
  sample.** Collecting N cold samples requires N restarts/clears; the suite
  refuses to fake this by averaging warm-ish queries into a "cold" figure.
- **Warm**: the namespace has been preloaded via the warm-cache endpoint (or
  prior queries); segments are served from local disk/memory cache.

Warm workloads (default 200 queries each, 5 warmup queries excluded):
`vector_ann`, `vector_exact`, `bm25` (title^2 + body), `hybrid_rrf`,
`filtered_vector` (equality + range filter). Reported: mean, min, p50, p90,
p99, max.
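
One way to compute the reported statistics is sketched below. The percentile rule used by `bench_latency.py` is not stated in this guide, so the nearest-rank method here is an assumption, as are the function names.

```python
import statistics


def percentile(sorted_samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    rank = (p * len(sorted_samples) + 99) // 100   # ceil(p * n / 100)
    return sorted_samples[max(rank, 1) - 1]


def summarise(latencies_ms, warmup=5):
    """Drop the first `warmup` samples, then summarise the rest."""
    measured = sorted(latencies_ms[warmup:])
    return {
        "mean": statistics.mean(measured),
        "min": measured[0],
        "p50": percentile(measured, 50),
        "p90": percentile(measured, 90),
        "p99": percentile(measured, 99),
        "max": measured[-1],
    }
```

With 200 measured samples, p99 under this rule is the 198th smallest value, so it is set by only a few slow requests. Reporting `max` next to it keeps that visible.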

**Coordinated omission caveat:** these are closed-loop, single-in-flight
measurements. They characterise per-request service latency at low load.
They are **not** throughput-under-load numbers and must not be converted to
QPS claims ("p50 is 5 ms therefore 200 QPS" is wrong).
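
The caveat can be illustrated with a toy simulation that is not part of the suite. It models a single FIFO server in integer milliseconds, with one artificial 1000 ms stall among 5 ms requests; every name and number in it is illustrative.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


def closed_loop_latencies(service_ms):
    """One request in flight: each latency is just that request's service time."""
    return list(service_ms)


def open_loop_latencies(service_ms, interval_ms):
    """Requests are *intended* to start every `interval_ms`, whether or not
    the server is free; latency is measured from the intended start."""
    latencies = []
    server_free_at = 0
    for i, service in enumerate(service_ms):
        intended_start = i * interval_ms
        start = max(intended_start, server_free_at)   # queue behind a stall
        server_free_at = start + service
        latencies.append(server_free_at - intended_start)
    return latencies


service = [5] * 1000
service[100] = 1000          # one stalled request

closed_p99 = percentile(closed_loop_latencies(service), 99)
open_p99 = percentile(open_loop_latencies(service, interval_ms=10), 99)
```

In this toy model the closed-loop p99 is 5 ms, because the stall shows up as a single sample. At a steady 100 requests per second the open-loop p99 is 950 ms, because every request queued behind the stall is delayed too. The closed-loop figures from `bench_latency.py` are the first kind of number.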

### 3.4 Cache (`bench_cache.py`)

- Scrapes Prometheus `/metrics` before and after a workload and reports the
  **delta** in `lagoon_cache_hits_total` / `lagoon_cache_misses_total` per
  tier, plus object-store GET/PUT counts — so the hit rate is attributable
  to the workload, not the server's lifetime.
- `--hot-set N` cycles through N distinct queries to simulate skewed access;
  with the default (all-distinct queries) hit rates reflect segment reuse
  only, which is the pessimistic case.
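
The before/after delta can be sketched as below, working on the text returned by `/metrics`. The metric names come from this guide; the label name `tier` and the function names are assumptions, so check them against the server's actual output.

```python
import re
from collections import defaultdict

_SAMPLE = re.compile(r'^lagoon_cache_(hits|misses)_total\{([^}]*)\}\s+(\S+)')
_TIER = re.compile(r'tier="([^"]*)"')   # label name `tier` is an assumption


def parse_counters(metrics_text):
    """Map tier -> {'hits': x, 'misses': y} from Prometheus text format."""
    counters = defaultdict(lambda: {"hits": 0.0, "misses": 0.0})
    for line in metrics_text.splitlines():
        sample = _SAMPLE.match(line)
        if sample is None:
            continue                     # comments and unrelated metrics
        tier = _TIER.search(sample.group(2))
        if tier is None:
            continue
        counters[tier.group(1)][sample.group(1)] += float(sample.group(3))
    return counters


def hit_rate_delta(before_text, after_text):
    """Per-tier hit rate over the interval between the two scrapes."""
    before, after = parse_counters(before_text), parse_counters(after_text)
    rates = {}
    for tier, now in after.items():
        prev = before.get(tier, {"hits": 0.0, "misses": 0.0})
        hits = now["hits"] - prev["hits"]
        misses = now["misses"] - prev["misses"]
        total = hits + misses
        rates[tier] = hits / total if total > 0 else None
    return rates
```

Counters reset when the server restarts, so a restart between the two scrapes makes the delta meaningless. Take both scrapes within one server lifetime.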

## 4. Running the full suite

```bash
cd benchmarks && pip install -r requirements.txt
export LAGOON_URL=http://localhost:8484 LAGOON_API_KEY=dev-key
python run_all.py --docs 50000 --dim 256
# cold samples (repeat 5–10×, restarting the server each time):
docker compose restart api
python bench_latency.py --phase cold --namespace bench-ingest --dim 256
python run_all.py --summary-only
```

Outputs: one JSON file per benchmark under `benchmarks/results/`, plus
`results/SUMMARY.md`. The summary template includes mandatory "FILL IN"
fields (storage backend, server build SHA); a summary with those fields
unfilled is incomplete.
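
A pre-publication check for leftover placeholders could be as small as the sketch below. It assumes the template marks unfilled fields with the literal text `FILL IN`, as quoted above; the function name is illustrative.

```python
def unfilled_fields(summary_text, marker="FILL IN"):
    """Return (line_number, line) for every line still holding the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(summary_text.splitlines(), start=1)
        if marker in line
    ]
```

An empty return value means no placeholder text remains. It does not verify that the values entered are accurate.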

## 5. Run hygiene checklist

Before recording publishable numbers:

- [ ] Release build of the server (not debug).
- [ ] No other heavy processes on the machine; frequency scaling noted if on
      a laptop.
- [ ] Storage backend and network locality recorded (e.g. "S3, us-east-1,
      same-AZ EC2 c6i.2xlarge").
- [ ] Background indexing caught up before latency/recall phases
      (`bench_ingest.py` waits for this automatically).
- [ ] ≥ 5 cold samples collected via real restarts, reported individually or
      as a range — never as a single-sample "average".
- [ ] Each benchmark run at least twice; if p50 differs between runs by
      more than ~10 %, investigate before reporting.
- [ ] Git SHA of the server build recorded in the summary.
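
Two of the checks above are easy to automate. In this sketch, "run-to-run variance" is read as the spread between the lowest and highest p50 relative to the lowest, which is one reasonable interpretation rather than a definition from the suite. The function names are illustrative.

```python
def relative_spread(p50_values):
    """(max - min) / min across repeated runs of the same benchmark."""
    lowest, highest = min(p50_values), max(p50_values)
    return (highest - lowest) / lowest


def cold_report(samples_ms, minimum=5):
    """Report cold samples individually and as a range, never as one average."""
    if len(samples_ms) < minimum:
        raise ValueError(f"need at least {minimum} cold samples, got {len(samples_ms)}")
    return {
        "samples_ms": list(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }
```

A `relative_spread` above roughly 0.10 is the signal to investigate before reporting.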

## 6. Interpreting results

- **Cold vs warm gap** is dominated by object-store round trips. Expect cold
  latency to scale with the number of segments/manifest objects fetched;
  warming or pinning a namespace removes this from the query path. A large
  gap is the *design working as intended* (cheap cold storage, fast warm
  serving), not a defect — but report both numbers.
- **Recall tuning**: IVF recall rises with the number of probed centroid
  lists, at roughly proportional latency cost. If you change `nprobe`-style
  settings from the server defaults, say so in the results.
- **BM25 latency** depends on term frequency: queries made of common (Zipf
  head) terms touch long posting lists. The generated query terms follow the
  same Zipf distribution as the corpus, which is the realistic case.
- **Index catch-up** is a freshness metric, not a correctness one: documents
  are queryable from the WAL overlay immediately after a successful upsert.

## 7. Honest reporting policy

- Report only numbers actually produced by these scripts (or clearly marked
  derivatives), with their environment blocks.
- **No comparisons with other products** — including the commercial systems
  that inspired Lagoon — unless you measured both systems yourself, on the
  same machine, same dataset, same query set, same day, and you publish both
  raw result sets. This repository makes **no performance-parity claims**
  with any proprietary system, and contributions adding such claims without
  the above evidence will be declined.
- Never present a single cold sample as an average or a distribution; never
  relabel closed-loop latency as throughput; never quote MinIO-on-localhost
  numbers as "S3 performance".
- This repository deliberately ships **no committed benchmark numbers**:
  hardware varies too much for repository-blessed figures to be honest. Run
  the suite on your target environment; `results/` is gitignored.
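
The environment block that principle 2 and the policy above call for could be assembled as in this sketch. The automatic fields use the Python standard library; the key names and the function signature are assumptions, not the suite's actual result schema.

```python
import datetime
import os
import platform
import socket


def environment_block(server_url, storage_backend, instance_type,
                      release_build, server_git_sha):
    """Combine auto-detected fields with the ones a human must supply."""
    manual = {
        "storage_backend": storage_backend,   # e.g. "S3, us-east-1"
        "instance_type": instance_type,
        "server_git_sha": server_git_sha,
    }
    missing = [name for name, value in manual.items() if not value.strip()]
    if missing:
        raise ValueError(f"environment fields not filled in: {missing}")
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "host": socket.gethostname(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "server_url": server_url,
        "release_build": bool(release_build),
        **manual,
    }
```

Refusing to build the block when a manual field is empty makes "a number without its environment is not a result" a hard failure rather than a convention.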

## 8. CI smoke benchmark

For regression detection (not absolute performance), CI can run the smoke
tier against the filesystem backend:

```bash
python run_all.py --docs 5000 --dim 64 --queries 50 --recall-queries 25
```

Treat CI numbers as *relative trend* signals only — shared CI runners have
noisy neighbours. A sensible CI gate is on **recall@10** (a correctness-like
property, stable across hardware), not on latency.
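
A recall gate of that kind might be sketched as follows. The baseline and tolerance are values a project would choose for itself, and how the measured recall@10 is read from the result JSON depends on a schema this guide does not specify. All of these are assumptions.

```python
def recall_gate(measured, baseline, tolerance=0.02):
    """Return (passed, message); fail only if recall@10 drops below
    baseline - tolerance."""
    floor = baseline - tolerance
    passed = measured >= floor
    verdict = "ok" if passed else "REGRESSION"
    return passed, f"recall@10 {measured:.3f} vs floor {floor:.3f}: {verdict}"
```

A CI step would print the message and exit non-zero when `passed` is false. The tolerance absorbs small shifts from index-build nondeterminism, if any exists, without masking a real regression.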