# Scorecard Formats and Aggregate Generation

This document specifies the two scorecard artifacts the Incumbent Benchmark produces, the **per-event scorecard** and the **aggregate scorecard**, and the contract between them. The scoring rubric itself (what the four dimensions mean and how points are assigned) is defined in `RUBRIC.md`; this document covers formats, generation, and interpretation.

## 1. The two-stage pipeline

```
dossiers/*.yaml ──▶ replay harness ──▶ per-event scorecard JSON ──▶ aggregate
                    (harness.py)       (scorecard.py)               (aggregate.py)
```

Aggregation is deliberately decoupled from simulation:

- per-event scorecards are durable artifacts you can inspect, diff, and re-aggregate without re-running simulations;
- a partial benchmark run (say, 12 of 30 events) still aggregates cleanly, so an interrupted run is never wasted;
- the aggregate is a *pure function* of the per-event JSON files, which makes the headline numbers independently auditable.

## 2. Per-event scorecard JSON (the contract)

The aggregator accepts the per-event JSON emitted by the scorecard module. The canonical shape is:

```json
{
  "event": {
    "id": "us-2020-certification",
    "title": "US 2020 Presidential Certification",
    "category": "contested-certification",
    "country": "United States",
    "year": 2020
  },
  "incumbent": {
    "worst_off": 38.0,
    "commons_integrity": 55.0,
    "trust_preservation": 31.0,
    "latency": 62.0
  },
  "kernel": {
    "worst_off": 71.0,
    "commons_integrity": 68.0,
    "trust_preservation": 74.0,
    "latency": 66.0
  }
}
```

Rules:

- **Sides.** Both an `incumbent` side and a `kernel` side are required. Accepted aliases: `incumbent` / `baseline` / `historical`, and `kernel` / `fable` / `kernel_v0_1` / `counterfactual`. Sides may appear at the top level or nested inside a `scores` / `scorecard` / `results` / `comparison` block.
- **Dimensions.** All four rubric dimensions are required on each side.
  Accepted aliases: `worst_off` (also `worst-off`, `worst_off_participant`, `empathy`), `commons_integrity` (also `commons`), `trust_preservation` (also `trust`), `latency` (also `latency_to_resolution`). A dimension value may be a bare number or an object with a `score`/`value` field (extra fields such as evidence notes are ignored by the aggregator but preserved in the per-event file).
- **Scale.** Scores are 0–100, higher is better, including latency, which is scored as *latency quality* (fast, legitimate resolution scores high; see `RUBRIC.md` §3). Files scored on a 0–1 scale are auto-detected and rescaled; out-of-range values are clamped, and the per-event stage never suppresses the resulting warning.
- **Failure mode.** Any missing side or dimension makes the aggregator fail loudly with the offending event id and the keys it found. A contract mismatch must never produce a silently wrong headline number.

## 3. Generating the aggregate

From per-event results already on disk (preferred):

```bash
python -m incumbent_benchmark.aggregate --results-dir out/ \
    -o out/AGGREGATE.md --json out/aggregate.json
```

Or end-to-end, replaying every dossier in-process:

```bash
python -m incumbent_benchmark.aggregate --dossier-dir dossiers/ \
    -o out/AGGREGATE.md --json out/aggregate.json
```

`--tie-epsilon` (default `1.0`) controls how large a composite delta must be before an event counts as a win rather than a tie. The default exists because the rubric resolves to integer-ish point bands; a sub-point delta is rubric noise, not signal, and reporting it as a "win" would overstate precision the methodology does not have.

## 4. Aggregate scorecard contents

### Markdown (`AGGREGATE.md`)

1. **Headline**: mean composite for each constitution, the delta, and the win/tie/loss tally across all events.
2. **The floor**: the single worst worst-off-participant score either constitution produced, and at which event.
   This sits in the headline section by design: the project's first scoring rule is that a constitution is judged by its floor before its average. A kernel that wins 25 of 30 events but lets one population crater has failed the test that matters.
3. **By rubric dimension**: means, mean deltas, minima, and win counts per dimension, with the rubric weight shown beside each.
4. **By event category**: composite comparison per category (contested certifications, shutdowns/deadlock, emergency powers, court capture, secession crises, self-coups/institutional rupture), so a reader can see *where* the kernel's wins and losses concentrate instead of seeing only a single blended number.
5. **Per-event table**: every event with both composites and the two deltas that matter most (composite, worst-off), sorted kernel-favourable first so the kernel's worst events are at the bottom where reviewers should start reading.
6. **Interpretation caveat**: every rendered aggregate ends with a pointer to `METHODOLOGY.md`. The caveat is part of the artifact, not optional framing.

### JSON (`aggregate.json`)

Machine-readable mirror of the above: weights, tie epsilon, composite and per-dimension statistics, category breakdowns, the floor, and the full per-event score matrix with computed composites. Intended consumers: the project website's live scorecard, CI checks that fail if a proposed kernel amendment regresses the floor, and external replication.

## 5. Composite definition

```
composite = Σ_d weight(d) × score(d)
            d ∈ {worst_off, commons_integrity, trust_preservation, latency}
```

Weights are imported from the rubric module so the aggregate can never drift from per-event scoring; the documented values are **worst-off 40%, commons integrity 25%, trust preservation 20%, latency 15%**. Weights are normalised to sum to 1 at computation time.

The composite is a *summary*, not the verdict.
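
As a minimal sketch of the composite formula and the `--tie-epsilon` rule, the snippet below recomputes both composites for the canonical example in §2. The `WEIGHTS` table copies the documented percentages; it does not import the rubric module. The helper names `composite` and `outcome` are hypothetical, and treating a delta strictly smaller than the epsilon as a tie is an assumption about the boundary case.

```python
# Documented rubric weights from section 5 (copied here for illustration;
# the real aggregator imports them from the rubric module).
WEIGHTS = {
    "worst_off": 0.40,
    "commons_integrity": 0.25,
    "trust_preservation": 0.20,
    "latency": 0.15,
}


def composite(scores, weights=WEIGHTS):
    """Weighted sum of the four dimension scores, weights normalised to sum to 1."""
    total = sum(weights.values())
    return sum(weights[dim] / total * scores[dim] for dim in weights)


def outcome(incumbent, kernel, tie_epsilon=1.0):
    """Classify one event from the kernel's side (hypothetical helper)."""
    delta = composite(kernel) - composite(incumbent)
    # Assumption: a delta strictly below the epsilon counts as a tie.
    if abs(delta) < tie_epsilon:
        return "tie"
    return "win" if delta > 0 else "loss"


# Scores from the canonical per-event example in section 2.
incumbent = {"worst_off": 38.0, "commons_integrity": 55.0,
             "trust_preservation": 31.0, "latency": 62.0}
kernel = {"worst_off": 71.0, "commons_integrity": 68.0,
          "trust_preservation": 74.0, "latency": 66.0}

print(round(composite(incumbent), 2), round(composite(kernel), 2))
print(outcome(incumbent, kernel))
```

With the documented weights, the example works out to 44.45 for the incumbent and 70.10 for the kernel, a delta far above the default epsilon, so the event counts as a kernel win.
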
Reviewers should read, in order: (1) the floor, (2) the worst-off dimension row, (3) the category breakdown, (4) the composite. The Markdown layout enforces this order.

## 6. What the aggregate does and does not claim

The aggregate compares **what each constitutional text permitted, forbade, and incentivised** under the pressures recorded in each dossier, scored by a fixed rubric. It does **not** claim that any real population governed by kernel v0.1 would have produced these outcomes; culture, enforcement capacity, and bad-faith creativity beyond the dossier's actor model are all outside the simulation.

Do not quote a headline number from `AGGREGATE.md` without the accompanying limits discussion in `METHODOLOGY.md` §7.
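
For readers writing their own per-event files, the alias and failure-mode rules of §2 can be sketched as below. This is an illustrative reading of the contract, not the aggregator's actual code: `normalise` and the other helper names are hypothetical, and 0–1 scale detection and clamping are left out because this document does not specify the detection rule.

```python
SIDE_ALIASES = {
    "incumbent": ("incumbent", "baseline", "historical"),
    "kernel": ("kernel", "fable", "kernel_v0_1", "counterfactual"),
}
DIMENSION_ALIASES = {
    "worst_off": ("worst_off", "worst-off", "worst_off_participant", "empathy"),
    "commons_integrity": ("commons_integrity", "commons"),
    "trust_preservation": ("trust_preservation", "trust"),
    "latency": ("latency", "latency_to_resolution"),
}
WRAPPERS = ("scores", "scorecard", "results", "comparison")


def _first(mapping, names):
    """Return the value stored under the first alias present, else None."""
    for name in names:
        if name in mapping:
            return mapping[name]
    return None


def _number(value):
    """Accept a bare number or an object carrying a score/value field."""
    if isinstance(value, dict):
        value = _first(value, ("score", "value"))
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"not a numeric score: {value!r}")
    return float(value)


def normalise(doc):
    """Map one per-event document onto canonical side and dimension keys."""
    event = doc.get("event")
    event_id = event.get("id", "<unknown>") if isinstance(event, dict) else "<unknown>"
    # Sides may sit at the top level or inside one of the wrapper blocks.
    blocks = [doc] + [doc[w] for w in WRAPPERS if isinstance(doc.get(w), dict)]
    result = {}
    for side, side_names in SIDE_ALIASES.items():
        raw = None
        for block in blocks:
            raw = _first(block, side_names)
            if raw is not None:
                break
        if not isinstance(raw, dict):
            # Fail loudly, naming the event and the keys that were found.
            raise ValueError(f"{event_id}: missing side {side!r}; found {sorted(doc)}")
        result[side] = {}
        for dim, dim_names in DIMENSION_ALIASES.items():
            value = _first(raw, dim_names)
            if value is None:
                raise ValueError(f"{event_id}: {side} lacks {dim!r}; found {sorted(raw)}")
            result[side][dim] = _number(value)
    return result
```

Raising on the first missing side or dimension, with the event id in the message, mirrors the stated rule that a contract mismatch must never yield a silently wrong headline number.
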