# Scorecard Formats and Aggregate Generation

This document specifies the two scorecard artifacts the Incumbent Benchmark
produces — the **per-event scorecard** and the **aggregate scorecard** —
and the contract between them. The scoring rubric itself (what the four
dimensions mean and how points are assigned) is defined in `RUBRIC.md`;
this document covers formats, generation, and interpretation.

## 1. The two-stage pipeline

```
dossiers/*.yaml ──▶ replay harness ──▶ per-event scorecard JSON ──▶ aggregate
                    (harness.py)        (scorecard.py)              (aggregate.py)
```

Aggregation is deliberately decoupled from simulation:

- per-event scorecards are durable artifacts you can inspect, diff, and
  re-aggregate without re-running simulations;
- a partial benchmark run (say, 12 of 30 events) still aggregates cleanly,
  so an interrupted run is never wasted;
- the aggregate is a *pure function* of the per-event JSON files, which
  makes the headline numbers independently auditable.

## 2. Per-event scorecard JSON (the contract)

The aggregator accepts the per-event JSON emitted by the scorecard module.
The canonical shape is:

```json
{
  "event": {
    "id": "us-2020-certification",
    "title": "US 2020 Presidential Certification",
    "category": "contested-certification",
    "country": "United States",
    "year": 2020
  },
  "incumbent": {
    "worst_off": 38.0,
    "commons_integrity": 55.0,
    "trust_preservation": 31.0,
    "latency": 62.0
  },
  "kernel": {
    "worst_off": 71.0,
    "commons_integrity": 68.0,
    "trust_preservation": 74.0,
    "latency": 66.0
  }
}
```

Rules:

- **Sides.** Both an `incumbent` side and a `kernel` side are required.
  Accepted aliases: `incumbent` / `baseline` / `historical`, and
  `kernel` / `fable` / `kernel_v0_1` / `counterfactual`. Sides may appear
  at the top level or nested inside a `scores` / `scorecard` / `results` /
  `comparison` block.
- **Dimensions.** All four rubric dimensions are required on each side.
  Accepted aliases: `worst_off` (also `worst-off`, `worst_off_participant`,
  `empathy`), `commons_integrity` (also `commons`), `trust_preservation`
  (also `trust`), `latency` (also `latency_to_resolution`). A dimension
  value may be a bare number or an object with a `score`/`value` field
  (extra fields such as evidence notes are ignored by the aggregator but
  preserved in the per-event file).
- **Scale.** Scores are 0–100, higher is better — including latency, which
  is scored as *latency quality* (fast, legitimate resolution scores high;
  see `RUBRIC.md` §3). Files scored on a 0–1 scale are auto-detected and
  rescaled; out-of-range values are clamped, and the resulting warning is
  never suppressed in the per-event stage.
- **Failure mode.** Any missing side or dimension makes the aggregator fail
  loudly with the offending event id and the keys it found. A contract
  mismatch must never produce a silently wrong headline number.
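
The alias and failure-mode rules above can be sketched as a small normaliser. This is a minimal illustration, not the code in `aggregate.py`: the names `normalise_scorecard`, `ContractError`, and the helper functions are hypothetical, and the sketch leaves out 0–1 rescaling and clamping because the detection rule is not specified here.

```python
from typing import Any, Mapping, Optional

# Alias tables copied from the rules above: canonical name -> accepted keys.
SIDE_ALIASES = {
    "incumbent": ("incumbent", "baseline", "historical"),
    "kernel": ("kernel", "fable", "kernel_v0_1", "counterfactual"),
}
DIMENSION_ALIASES = {
    "worst_off": ("worst_off", "worst-off", "worst_off_participant", "empathy"),
    "commons_integrity": ("commons_integrity", "commons"),
    "trust_preservation": ("trust_preservation", "trust"),
    "latency": ("latency", "latency_to_resolution"),
}
WRAPPER_KEYS = ("scores", "scorecard", "results", "comparison")


class ContractError(ValueError):
    """Hypothetical error type: a per-event file violates the contract."""


def _lookup(mapping: Mapping[str, Any], aliases: tuple) -> Any:
    """Return the value stored under the first alias present, else None."""
    for key in aliases:
        if key in mapping:
            return mapping[key]
    return None


def _as_number(raw: Any) -> Optional[float]:
    """Accept a bare number or an object carrying a score/value field."""
    if isinstance(raw, Mapping):
        raw = _lookup(raw, ("score", "value"))
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        return None
    return float(raw)


def normalise_scorecard(doc: Mapping[str, Any]) -> dict:
    """Map one per-event document to {side: {dimension: score}} or fail loudly."""
    event = doc.get("event")
    event_id = event.get("id", "<unknown>") if isinstance(event, Mapping) else "<unknown>"
    # Sides may sit at the top level or inside one of the wrapper blocks.
    containers = [doc] + [doc[k] for k in WRAPPER_KEYS if isinstance(doc.get(k), Mapping)]
    out: dict = {}
    for side, side_aliases in SIDE_ALIASES.items():
        block = None
        for container in containers:
            candidate = _lookup(container, side_aliases)
            if isinstance(candidate, Mapping):
                block = candidate
                break
        if block is None:
            found = sorted({key for c in containers for key in c})
            raise ContractError(f"{event_id}: missing side {side!r}; found keys {found}")
        out[side] = {}
        for dim, dim_aliases in DIMENSION_ALIASES.items():
            value = _as_number(_lookup(block, dim_aliases))
            if value is None:
                raise ContractError(
                    f"{event_id}: side {side!r} is missing dimension {dim!r}; "
                    f"found keys {sorted(block)}"
                )
            out[side][dim] = value
    return out
```

Raising with the event id and the keys actually found follows the failure-mode rule: a file that uses an unrecognised alias stops the run instead of contributing a silently wrong number.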

## 3. Generating the aggregate

From per-event results already on disk (preferred):

```bash
python -m incumbent_benchmark.aggregate --results-dir out/ \
    -o out/AGGREGATE.md --json out/aggregate.json
```

Or end-to-end, replaying every dossier in-process:

```bash
python -m incumbent_benchmark.aggregate --dossier-dir dossiers/ \
    -o out/AGGREGATE.md --json out/aggregate.json
```

`--tie-epsilon` (default `1.0`) sets the minimum absolute composite delta
an event needs to count as a win or a loss rather than a tie. The default
exists because the rubric resolves to roughly whole-point bands; a
sub-point delta is rubric noise, not signal, and reporting it as a "win"
would overstate precision the methodology does not have.
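
As a rough sketch of how the threshold might be applied, the function below classifies one event from the kernel's point of view. The name `classify_event`, the return labels, and the handling of a delta exactly equal to epsilon (counted as decisive here) are assumptions, not behaviour confirmed by `aggregate.py`.

```python
def classify_event(kernel_composite: float,
                   incumbent_composite: float,
                   tie_epsilon: float = 1.0) -> str:
    """Return 'win', 'tie', or 'loss' for the kernel on a single event."""
    delta = kernel_composite - incumbent_composite
    # Assumption: a delta strictly smaller than epsilon is treated as noise.
    if abs(delta) < tie_epsilon:
        return "tie"
    return "win" if delta > 0 else "loss"
```

With the default epsilon, composites of 50.4 and 50.0 are a tie, while 52.0 against 50.0 is a win.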

## 4. Aggregate scorecard contents

### Markdown (`AGGREGATE.md`)

1. **Headline** — mean composite for each constitution, the delta, and the
   win/tie/loss tally across all events.
2. **The floor** — the single worst worst-off-participant score either
   constitution produced, and at which event. This sits in the headline
   section by design: the project's first scoring rule is that a
   constitution is judged by its floor before its average. A kernel that
   wins 25 of 30 events but lets one population crater has failed the test
   that matters.
3. **By rubric dimension** — means, mean deltas, minima, and win counts
   per dimension, with the rubric weight shown beside each.
4. **By event category** — composite comparison per category
   (contested certifications, shutdowns/deadlock, emergency powers, court
   capture, secession crises, self-coups/institutional rupture), so a
   reader can see *where* the kernel's wins and losses concentrate rather
   than a single blended number.
5. **Per-event table** — every event with both composites and the two
   deltas that matter most (composite, worst-off), sorted kernel-favourable
   first so the kernel's worst events are at the bottom where reviewers
   should start reading.
6. **Interpretation caveat** — every rendered aggregate ends with a pointer
   to `METHODOLOGY.md`. The caveat is part of the artifact, not optional
   framing.
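
The floor in item 2 reduces to a minimum over the worst-off dimension. The sketch below assumes per-event scores have already been normalised to `{event_id: {side: {dimension: score}}}` and reports one floor per constitution; the function name `floor`, that input shape, and the event ids in the usage example are made up for illustration.

```python
def floor(events: dict) -> dict:
    """Return {side: (event_id, lowest worst_off score)} for both sides."""
    result = {}
    for side in ("incumbent", "kernel"):
        # min() keeps the first event in iteration order when scores tie.
        event_id = min(events, key=lambda e: events[e][side]["worst_off"])
        result[side] = (event_id, events[event_id][side]["worst_off"])
    return result


# Placeholder data, not benchmark results.
example = {
    "event-a": {"incumbent": {"worst_off": 38.0}, "kernel": {"worst_off": 64.0}},
    "event-b": {"incumbent": {"worst_off": 22.0}, "kernel": {"worst_off": 71.0}},
}
print(floor(example))
```

Averaging plays no part in this calculation, which is why a single cratered population shows up in the floor regardless of how many other events were won.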

### JSON (`aggregate.json`)

Machine-readable mirror of the above: weights, tie epsilon, composite and
per-dimension statistics, category breakdowns, the floor, and the full
per-event score matrix with computed composites. Intended consumers: the
project website's live scorecard, CI checks that fail if a proposed kernel
amendment regresses the floor, and external replication.
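
A CI floor check of the kind mentioned above could look roughly like the following. The schema of `aggregate.json` is not specified in this document, so the key path `floor -> kernel -> score` is purely hypothetical, as are the function names and the `tolerance` parameter; adjust them to the real file layout.

```python
import json
from pathlib import Path


def kernel_floor(aggregate: dict) -> float:
    """Read the kernel's floor score (HYPOTHETICAL key path)."""
    return float(aggregate["floor"]["kernel"]["score"])


def floor_regressed(baseline: dict, candidate: dict, tolerance: float = 0.0) -> bool:
    """True if the candidate's kernel floor is lower than the baseline's."""
    return kernel_floor(candidate) < kernel_floor(baseline) - tolerance


def load_aggregate(path: str) -> dict:
    """Parse an aggregate JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A CI job might load the aggregate from the main branch and the one from the proposed amendment, then exit non-zero when `floor_regressed` returns `True`.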

## 5. Composite definition

```
composite = Σ_d weight(d) × score(d)        d ∈ {worst_off, commons_integrity,
                                                  trust_preservation, latency}
```

Weights are imported from the rubric module so the aggregate can never
drift from per-event scoring; the documented values are **worst-off 40%,
commons integrity 25%, trust preservation 20%, latency 15%**. Weights are
normalised to sum to 1 at computation time.
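
As a worked example, the sketch below applies the documented weights to the sample scorecard from §2. The `WEIGHTS` dictionary is a hand-copied stand-in for the values the real code imports from the rubric module, and `composite` is a hypothetical name.

```python
# Documented weights, copied here for illustration only.
WEIGHTS = {
    "worst_off": 0.40,
    "commons_integrity": 0.25,
    "trust_preservation": 0.20,
    "latency": 0.15,
}


def composite(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the four dimensions, with weights normalised to sum to 1."""
    total = sum(weights.values())
    return sum(weights[d] / total * scores[d] for d in weights)


incumbent = {"worst_off": 38.0, "commons_integrity": 55.0,
             "trust_preservation": 31.0, "latency": 62.0}
kernel = {"worst_off": 71.0, "commons_integrity": 68.0,
          "trust_preservation": 74.0, "latency": 66.0}

print(f"{composite(incumbent):.2f}")  # 44.45
print(f"{composite(kernel):.2f}")     # 70.10
```

By hand, the incumbent composite is 0.40 × 38 + 0.25 × 55 + 0.20 × 31 + 0.15 × 62 = 15.2 + 13.75 + 6.2 + 9.3 = 44.45. Because the weights are normalised at computation time, expressing them as 40/25/20/15 gives the same result.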

The composite is a *summary*, not the verdict. Reviewers should read, in
order: (1) the floor, (2) the worst-off dimension row, (3) the category
breakdown, (4) the composite. The Markdown layout supports this order by
placing the floor in the headline section, ahead of the dimension and
category tables.

## 6. What the aggregate does and does not claim

The aggregate compares **what each constitutional text permitted, forbade,
and incentivised** under the pressures recorded in each dossier, scored by
a fixed rubric. It does **not** claim that any real population governed by
kernel v0.1 would have produced these outcomes — culture, enforcement
capacity, and bad-faith creativity beyond the dossier's actor model are
all outside the simulation. Do not quote a headline number from
`AGGREGATE.md` without the accompanying limits discussion in
`METHODOLOGY.md` §7.