# The Incumbent Benchmark

Benchmark governance the way we benchmark models.

Thirty real constitutional stress events (contested certifications, shutdowns, emergency powers, court capture, secession crises, self-coups) encoded as structured dossiers and replayed under the FablePool constitutional kernel v0.1, producing a side-by-side scorecard against the outcome the incumbent constitution actually produced.

## Layout

```
benchmark/
├── dossiers/                  # 30 structured event dossiers (YAML) + INDEX.md
├── src/incumbent_benchmark/
│   ├── schema.py              # Pydantic dossier schema
│   ├── kernel.py              # Kernel v0.1 as a machine-readable rule table
│   ├── rubric.py              # Four-dimension lexicographic scoring
│   ├── harness.py             # Deterministic rule-trace replay
│   ├── scorecard.py           # Per-event side-by-side scorecards
│   ├── aggregate.py           # Cross-event aggregate report
│   └── cli.py                 # Command-line entry points
├── docs/
│   ├── RUBRIC.md              # Normative scoring anchors
│   ├── SCORECARD.md           # How to read a scorecard
│   └── METHODOLOGY.md         # Methodology paper, including limits of text-only simulation
└── tests/                     # Harness self-tests
```

## Quickstart

Requires Python 3.10+.

```bash
cd benchmark
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Validate all dossiers against the schema
python -m incumbent_benchmark.cli validate dossiers/

# Run the full replay and emit per-event scorecards
python -m incumbent_benchmark.cli run dossiers/ -o results/

# Build the aggregate report
python -m incumbent_benchmark.cli aggregate results/ -o results/AGGREGATE.md

# Run the test suite
pytest
```

The pipeline is fully deterministic: no network, no model inference, no randomness. Two machines on the same commit produce byte-identical output.

## Lockfile

No lockfile is committed.
To pin your environment, generate one once with your preferred tool, e.g.:

```bash
pip install pip-tools
pip-compile pyproject.toml -o requirements.lock
```

## Read this before citing any number

[`docs/METHODOLOGY.md`](docs/METHODOLOGY.md), especially §9 (limits of text-only simulation) and §11 (valid and invalid readings of the scorecard).

The short version: the incumbent column is historical fact mapped to rubric anchors; the kernel column is a deterministic consequence of published encoding choices. They are not epistemically symmetric, and the scorecard says so on every page.

## Disputing a result

Every score traces to a reviewable line: a dossier's `permitted_moves` coding, a `kernel_mapping`, or a rubric anchor. If you think an encoding is wrong, open a pull request against the file. Changes to dossiers and the rubric flow through the same amendment pipeline as the constitution itself.
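
Before disputing a score, it may help to confirm that your local replay reproduces the output you are comparing against, since the pipeline is described as byte-identical across machines on the same commit. The following is a minimal sketch of such a check, assuming you have two result directories from separate runs. The helper names `tree_digest` and `same_output` are illustrative and are not part of `incumbent_benchmark`.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Return one SHA-256 hex digest covering every file under root.

    Relative paths are sorted as POSIX strings so the digest does not
    depend on filesystem enumeration order or the host platform.
    """
    outer = hashlib.sha256()
    rel_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    for rel in rel_paths:
        # Mix in the path, a separator, then the digest of the file bytes.
        outer.update(rel.encode("utf-8"))
        outer.update(b"\0")
        outer.update(hashlib.sha256((root / rel).read_bytes()).digest())
    return outer.hexdigest()


def same_output(run_a: Path, run_b: Path) -> bool:
    """True if both directories hold the same files with identical bytes."""
    return tree_digest(run_a) == tree_digest(run_b)
```

For example, after running the replay twice into hypothetical directories `results_a/` and `results_b/`, `same_output(Path("results_a"), Path("results_b"))` should return `True` if the determinism guarantee holds on your machine. A mismatch suggests an environment difference worth resolving before attributing a score change to an encoding.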