# The Incumbent Benchmark

Benchmark governance the way we benchmark models. Thirty real constitutional
stress events (contested certifications, shutdowns, emergency powers, court
capture, secession crises, self-coups) are encoded as structured dossiers and
replayed under the FablePool constitutional kernel v0.1, producing a
side-by-side scorecard against the outcome the incumbent constitution
actually produced.

## Layout

```
benchmark/
├── dossiers/          # 30 structured event dossiers (YAML) + INDEX.md
├── src/incumbent_benchmark/
│   ├── schema.py      # Pydantic dossier schema
│   ├── kernel.py      # Kernel v0.1 as a machine-readable rule table
│   ├── rubric.py      # Four-dimension lexicographic scoring
│   ├── harness.py     # Deterministic rule-trace replay
│   ├── scorecard.py   # Per-event side-by-side scorecards
│   ├── aggregate.py   # Cross-event aggregate report
│   └── cli.py         # Command-line entry points
├── docs/
│   ├── RUBRIC.md      # Normative scoring anchors
│   ├── SCORECARD.md   # How to read a scorecard
│   └── METHODOLOGY.md # Methodology paper, including limits of text-only simulation
└── tests/             # Harness self-tests
```
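
For readers unfamiliar with the term in the `rubric.py` comment, lexicographic scoring means the first dimension decides the comparison, and later dimensions only break ties. The sketch below illustrates that idea with plain tuple comparison. It is not the code in `rubric.py`: the dimension names (`dim_a` through `dim_d`), the integer scale, the "higher is better" direction, and the sample values are all placeholders invented for illustration; the real anchors are in `docs/RUBRIC.md`.

```python
from typing import NamedTuple


class Score(NamedTuple):
    # Hypothetical dimension names, listed from highest to lowest priority.
    # The real dimensions and their scales are defined in docs/RUBRIC.md.
    dim_a: int
    dim_b: int
    dim_c: int
    dim_d: int


def better(left: Score, right: Score) -> Score:
    """Return the lexicographically larger score (left wins exact ties).

    Assumes higher values are better on every dimension.
    """
    # Tuples compare element by element, so dim_a dominates dim_b, and so on.
    return left if left >= right else right


# Made-up values: a lead on the first dimension outweighs any deficit
# on the remaining three.
incumbent = Score(2, 5, 5, 5)
kernel = Score(3, 0, 0, 0)
assert better(incumbent, kernel) == kernel
```

The practical consequence is that lower-priority dimensions can never compensate for a loss on a higher-priority one, unlike a weighted sum.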

## Quickstart

Requires Python 3.10+.

```bash
cd benchmark
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Validate all dossiers against the schema
python -m incumbent_benchmark.cli validate dossiers/

# Run the full replay and emit per-event scorecards
python -m incumbent_benchmark.cli run dossiers/ -o results/

# Build the aggregate report
python -m incumbent_benchmark.cli aggregate results/ -o results/AGGREGATE.md

# Run the test suite
pytest
```

The pipeline is fully deterministic: no network, no model inference, no
randomness. Two machines on the same commit, with the same resolved
dependencies, produce byte-identical output.
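
One way to check the byte-identical claim yourself is to hash the whole `results/` tree on each machine and compare the digests. The helper below is a minimal sketch, not part of the package: `tree_digest` is a hypothetical name, and it assumes `results/` contains only the files written by the `run` and `aggregate` commands.

```python
import hashlib
from pathlib import Path


def tree_digest(root: str) -> str:
    """Return one SHA-256 digest covering every file under root.

    Hypothetical helper, not part of incumbent_benchmark.
    """
    base = Path(root)
    digest = hashlib.sha256()
    # Sort by POSIX-style relative path so filesystem traversal order
    # and platform path separators cannot affect the result.
    files = sorted(
        (p.relative_to(base).as_posix(), p)
        for p in base.rglob("*")
        if p.is_file()
    )
    for rel, path in files:
        # Hash the relative path as well as the contents, so a renamed
        # file changes the digest.
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Calling `tree_digest("results")` after a full run on two machines should give the same hex string if the outputs match byte for byte.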

## Lockfile

No lockfile is committed. To pin your environment, generate one with your
preferred tool, for example pip-tools:

```bash
pip install pip-tools
pip-compile pyproject.toml -o requirements.lock
```

## Read this before citing any number

Read [`docs/METHODOLOGY.md`](docs/METHODOLOGY.md) first, especially §9 (limits
of text-only simulation) and §11 (valid and invalid readings of the scorecard).
The short version: the incumbent column is historical fact mapped to rubric
anchors; the kernel column is a deterministic consequence of published
encoding choices. They are not epistemically symmetric, and the scorecard
says so on every page.

## Disputing a result

Every score traces to a reviewable line: a dossier's `permitted_moves`
coding, a `kernel_mapping`, or a rubric anchor. If you think an encoding is
wrong, open a pull request against the file. Changes to dossiers and the
rubric flow through the same amendment pipeline as the constitution itself.