# The Incumbent Benchmark — Methodology Paper

**Version 1.0 · Milestone #4 deliverable · FablePool**

> We benchmark models. This paper describes how — and how honestly — we can
> benchmark governance.

---

## Table of contents

1. [Abstract](#1-abstract)
2. [Motivation and framing](#2-motivation-and-framing)
3. [What the benchmark is and is not](#3-what-the-benchmark-is-and-is-not)
4. [Event selection](#4-event-selection)
5. [The dossier: from history to structured data](#5-the-dossier-from-history-to-structured-data)
6. [The simulation model](#6-the-simulation-model)
7. [Scoring methodology](#7-scoring-methodology)
8. [The counterfactual replay under kernel v0.1](#8-the-counterfactual-replay-under-kernel-v01)
9. [Limits of text-only simulation](#9-limits-of-text-only-simulation)
10. [Threats to validity (summary table)](#10-threats-to-validity-summary-table)
11. [How to read (and how not to read) the scorecard](#11-how-to-read-and-how-not-to-read-the-scorecard)
12. [Reproducibility](#12-reproducibility)
13. [Governance of the benchmark itself](#13-governance-of-the-benchmark-itself)
14. [Future work](#14-future-work)
15. [References and source notes](#15-references-and-source-notes)

---

## 1. Abstract

The Incumbent Benchmark replays thirty real constitutional stress events —
contested certifications, government shutdowns, emergency-power invocations,
court capture, secession crises, and self-coups, drawn from twenty-three countries
across three centuries — as structured simulations. Each event is encoded as a
**dossier**: the actors, their incentives, the moves the incumbent constitution
permitted at each decision point, and the outcome it actually produced. A
**replay harness** runs the same event under the FablePool constitutional
kernel v0.1 and produces a side-by-side **scorecard** on four dimensions,
graded lexicographically: (1) the outcome for the worst-off participant,
(2) commons integrity, (3) latency to resolution, and (4) trust preservation.

The headline finding of this paper is not a number. It is a method, and a
candid accounting of that method's limits. Text-only simulation can tell you
whether a constitution's *text* contains a move that defeats a given attack.
It cannot tell you whether anyone would have made that move. We state this
limit up front, design the scoring to respect it, and document every place
where judgment — ours — enters the pipeline.

---

## 2. Motivation and framing

### 2.1 The release-cadence problem

The US Constitution's most recent amendment was ratified in 1992 after a
pendency of more than 202 years. The document has shipped one release in fifty years.
Meanwhile the environment it governs — communications, weapons, finance,
information — has iterated thousands of times. When a system's rules iterate
slower than the strategies played against them, exploits accumulate. Every
event in this benchmark is, at root, an accumulated exploit: a strategy the
drafters did not anticipate, executed inside (or at the edge of) the rules.

### 2.2 Why benchmark?

Machine learning made progress legible by agreeing on shared evaluation
suites. Before ImageNet, "my model is better" was an argument; after, it was
a number. Governance has no equivalent. Constitutional scholars compare
documents qualitatively; political scientists compare regimes statistically;
nobody runs the *same stress event* against *two different rule-sets* and
scores the difference. The Incumbent Benchmark is a first attempt at that
loop, with the same epistemic posture a good ML benchmark has: a frozen test
set (the thirty dossiers), a published rubric, a deterministic harness, and a
methodology paper that lists the ways the numbers can lie.

### 2.3 The empathy metric

One scoring rule dominates all others: **every scenario is graded first on
how the worst-off participant fares under stress.** This is a normative
commitment, stated openly rather than smuggled into weights. A constitution
that resolves a crisis quickly by crushing a minority scores worse than one
that resolves it slowly while protecting everyone. The lexicographic ordering
in §7.2 makes this commitment mechanical: no amount of speed or institutional
tidiness can buy back a failure on the worst-off dimension. The bet underneath
the whole project is that humans are good at heart and the tooling is broken;
the scoring encodes that bet by refusing to treat any participant as
acceptable collateral.

---

## 3. What the benchmark is and is not

**It is:**

- A structured, reproducible comparison of two rule-texts against the same
  thirty historical attack patterns.
- A regression suite for the kernel: every event where kernel v0.1 scores
  poorly is a filed defect with a dossier attached.
- A demonstration that governance evaluation can be versioned, reviewed, and
  re-run — the software loop applied to constitutional design.

**It is not:**

- A claim that kernel v0.1 *would have* produced better outcomes in 1933
  Germany or 1975 India. Counterfactual history is not knowable, and §9
  explains why we refuse to claim it.
- A model of human behavior. The harness models *rule affordances* — what the
  text permits, requires, and forbids — not what frightened, ambitious, or
  exhausted people do with those affordances.
- A ranking of countries. The dossiers include several events where the
  incumbent constitution performed *well* (Gambia 2016's regional resolution,
  Czechoslovakia 1992's negotiated dissolution, UK 2019's judicial check on
  prorogation). The benchmark needs incumbent successes in the test set to
  guard against a harness that flatters the kernel by construction (§9.4).

---

## 4. Event selection

### 4.1 The taxonomy

Stress events were drawn from six failure classes, chosen because each maps
to a distinct kernel mechanism that should be exercised:

| Class | Kernel mechanism under test | Dossiers |
|---|---|---|
| **Contested certification / succession** | Vote-gate finality, dispute-resolution quorum, term-limit invariants | US 1876, US 2000, US 2020, Kenya 2007, Gambia 2016, Bolivia 2019, Australia 1975 |
| **Emergency powers** | Sunset clauses, non-derogable invariants, supermajority gates on rights suspension | Weimar 1930–33, France 1961, India 1975, Philippines 1972, Hungary 2020, South Korea 2024 |
| **Shutdown / fiscal hostage-taking** | Continuity-of-operations defaults, commons-drain tests | US 2018–19, US 2011, Belgium 2010–11 |
| **Court capture** | Appointment quorum rules, kernel/userland separation, amendment semver gates | US 1937, Hungary 2011, Poland 2015, Venezuela 2017, Israel 2023 |
| **Secession / exit crises** | The right to fork (kernel Article on exit), referendum legitimacy rules | US 1860–61, Canada 1995, Spain 2017, Czechoslovakia 1992 |
| **Self-coup / executive seizure** | Separation-of-powers invariants, anti-entrenchment tests | Peru 1992, Russia 1993, Honduras 2009, Sri Lanka 2018, UK 2019 |

### 4.2 Inclusion criteria

An event qualified for the test set only if **all five** criteria held:

1. **Constitutional, not merely political.** The crisis turned on what the
   governing text permitted, required, or left ambiguous — not purely on
   force, economics, or external invasion. (We excluded, e.g., the 1973
   Chilean coup as primarily extra-constitutional violence; we included Peru
   1992 because Fujimori's autogolpe was executed *through* a claimed
   constitutional channel and ratified by a constitutional process.)
2. **Documented decision points.** The historical record must identify
   specific moments where an actor chose among legally available moves, with
   sources adequate to encode incentives. Events whose internal deliberations
   remain sealed or contested at the level of basic fact were excluded.
3. **A determinable outcome.** The event reached a resolution observable in
   the record, so the incumbent side of the scorecard is grounded in fact,
   not in our judgment.
4. **An identifiable worst-off participant.** The empathy metric requires
   knowing who bore the worst of it. Events where the distributional impact
   is genuinely unknown were excluded.
5. **Diversity contribution.** Each addition had to extend coverage along at
   least one axis: failure class, region, era, regime type, or outcome
   valence (incumbent success vs. failure).

### 4.3 Balance achieved

- **Geography:** 23 countries; 7 of 30 events from the US (deliberately
  overweighted: the project's framing document is the US Constitution and
  its release cadence is the headline), 23 from the rest of the Americas,
  Europe, Asia, Africa, and Oceania.
- **Era:** 1860 to 2024; 4 events before 1950, 26 from 1950 onward, 15 of
  them from 2010 onward.
- **Outcome valence:** roughly one-third of events are coded as incumbent
  successes or partial successes (Gambia 2016, Canada 1995, Czechoslovakia
  1992, UK 2019, France 1961, US 1937, South Korea 2024, US 2020 in its
  certification outcome). This is the benchmark's control group: if the
  kernel does not also score well on events the incumbent handled well, the
  harness is broken, not the incumbent.

### 4.4 Known selection biases

We document these here and return to them in §10:

- **Survivorship of the record.** Crises in well-documented, mostly
  English-, Spanish-, and French-language polities are overrepresented.
  There is no dossier from a polity whose archives we could not read in
  translation with confidence.
- **Salience bias.** Famous crises are easier to source. Quiet
  constitutional decay (the kind that never produces a named "crisis") is
  underrepresented, and is arguably the more common failure mode.
- **Drafting-era anachronism.** Pre-1950 events (US 1860, US 1876, Weimar)
  are encoded with actor incentives reconstructed at greater interpretive
  distance. Their dossiers carry lower confidence ratings (see §5.4).

---

## 5. The dossier: from history to structured data

### 5.1 Schema

Each dossier is a YAML document validated against the Pydantic schema in
`src/incumbent_benchmark/schema.py`. The load-bearing fields:

- **`actors`** — every party whose choices shaped the event, with declared
  `incentives` (what they wanted), `capabilities` (what they could do), and
  `constraints` (what bound them). Crucially, this includes the
  **worst-off participant** — often not a named individual but a population
  class (detained opposition members, furloughed workers, residents of a
  contested region) — encoded as an actor with stakes but typically no moves.
- **`timeline`** — the ordered decision points. Each decision point lists
  the **moves the incumbent constitution permitted** at that moment
  (`permitted_moves`), the move actually taken (`actual_move`), and the
  textual basis (`legal_basis`) — article, section, statute, or the explicit
  finding that the text was *silent*, which is itself a datum the harness
  consumes.
- **`incumbent_outcome`** — what actually happened, scored against the
  rubric: the worst-off participant's fate, damage to the commons, days from
  trigger to resolution, and measurable trust effects (turnout shifts,
  institutional-confidence polling where available, subsequent emigration or
  violence).
- **`ambiguities`** — the places where the incumbent text underdetermined
  the outcome. These are the benchmark's most valuable extraction: every
  ambiguity is an attack surface, and several have already been converted
  into adversarial regression tests in Milestone #3's suite.
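To make the shape of these fields concrete, here is a minimal sketch using standard-library dataclasses rather than the project's Pydantic models. The class names, the `worst_off` flag, and every value in the example are illustrative assumptions, not the contents of `schema.py` or of any real dossier.

```python
from dataclasses import dataclass, field


@dataclass
class Actor:
    name: str
    incentives: list[str]
    capabilities: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    worst_off: bool = False   # assumed flag; the real schema may mark this differently


@dataclass
class DecisionPoint:
    label: str
    permitted_moves: list[str]
    actual_move: str
    legal_basis: str          # article, section, statute, or "textual_silence"


@dataclass
class Dossier:
    event: str
    actors: list[Actor]
    timeline: list[DecisionPoint]
    incumbent_outcome: dict[str, str]
    ambiguities: list[str] = field(default_factory=list)

    def worst_off_participant(self) -> Actor:
        """Return the single actor flagged as worst-off, or fail loudly."""
        flagged = [a for a in self.actors if a.worst_off]
        if len(flagged) != 1:
            raise ValueError(f"{self.event}: expected exactly one worst-off actor")
        return flagged[0]


# Placeholder data only: not taken from any of the thirty dossiers.
example = Dossier(
    event="PLACEHOLDER-EVENT",
    actors=[
        Actor("executive", incentives=["retain office"], capabilities=["issue decree"]),
        Actor("affected population", incentives=["keep basic rights"], worst_off=True),
    ],
    timeline=[
        DecisionPoint(
            label="dp1",
            permitted_moves=["issue decree", "seek legislative vote"],
            actual_move="issue decree",
            legal_basis="textual_silence",
        )
    ],
    incumbent_outcome={"wop": "placeholder anchor"},
    ambiguities=["text silent on decree duration"],
)

print(example.worst_off_participant().name)
```

Note that the worst-off actor in the placeholder has stakes (`incentives`) but an empty `capabilities` list, matching the description above of a participant with stakes but typically no moves.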

### 5.2 Sourcing standard

Every factual claim in a dossier traces to at least one of: the
constitutional/statutory text itself, official records (court judgments,
parliamentary records, commission reports), or established secondary
scholarship. Each dossier carries a `sources` block. We did not use the
sources to *interpret* motives beyond what the record supports; where
motive is inferred (e.g., why Governor-General Kerr did not warn Whitlam in
1975), the dossier marks the incentive entry as `inferred: true`.

### 5.3 The "permitted moves" coding decision

The hardest coding judgment in every dossier is the `permitted_moves` list:
what did the incumbent text *actually allow* at each decision point? Three
rules governed this coding:

1. **Contemporary legal opinion controls.** A move is "permitted" if a
   serious contemporaneous legal argument supported it — not if a court
   later validated it. (The whole point of several events is that the legal
   question was open at decision time.)
2. **Silence is permission with a flag.** Where the text said nothing, the
   move is coded permitted-by-silence (`basis: textual_silence`), because
   that is empirically how actors treated silence in 27 of 30 events.
3. **Force is out of scope.** Moves that required stepping outside the legal
   order entirely (Yeltsin shelling the parliament, Yoon's troop deployment
   beyond the decree's claimed authority) are recorded in the timeline as
   `extra_constitutional: true` and the harness treats them as the point
   where text-only simulation ends. The scorecard for those events carries
   an explicit "resolution achieved extra-constitutionally" annotation on
   the incumbent side.

### 5.4 Confidence ratings

Each dossier declares a `confidence` field (`high` / `medium` / `low`)
reflecting interpretive distance: quality of the record, era, and whether
the event's basic facts remain politically contested. The aggregate
scorecard (`aggregate.py`) reports results both overall and restricted to
high-confidence dossiers, so a reader can discount the harder codings.
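The two views of the aggregate can be sketched as a filter followed by a per-dimension mean. The record layout and the numbers below are invented for illustration and say nothing about how `aggregate.py` is actually written.

```python
from statistics import mean


def dimension_means(results, confidence=None):
    """Per-dimension means, optionally restricted to one confidence level."""
    rows = [r for r in results if confidence is None or r["confidence"] == confidence]
    if not rows:
        return {}
    return {dim: mean(r["scores"][dim] for r in rows) for dim in rows[0]["scores"]}


# Invented placeholder results on a hypothetical 0-5 scale.
results = [
    {"event": "A", "confidence": "high", "scores": {"wop": 4.0, "trust": 3.0}},
    {"event": "B", "confidence": "low", "scores": {"wop": 1.0, "trust": 2.0}},
    {"event": "C", "confidence": "high", "scores": {"wop": 2.0, "trust": 5.0}},
]

overall = dimension_means(results)
high_only = dimension_means(results, confidence="high")
print(overall, high_only)
```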

---

## 6. The simulation model

### 6.1 What "simulation" means here

The harness does **not** simulate human behavior. It performs **deterministic
rule-trace replay**: at each decision point in the dossier's timeline, it asks
of each constitution-as-text:

1. **Affordance:** Is the historically attempted move *available* under this
   text? (Does an article permit it, forbid it, or stay silent?)
2. **Gate:** If available, what procedural gates does the text impose —
   quorum, supermajority, sunset, review — and would the actor coalition
   recorded in the dossier have cleared them?
3. **Counter:** What counter-moves does the text give the other recorded
   actors, and at what cost and latency?
4. **Terminal state:** Following the dossier's recorded coalition strengths
   through the gates, what terminal state does the text reach — resolution,
   stalemate, or breach (the point where the text has no further move and
   history shows actors going outside it)?

For the incumbent, steps 1–4 are *checked against the record*: the harness
replays the actual timeline and verifies the dossier's coding is internally
consistent (an `actual_move` must appear in `permitted_moves` unless flagged
`extra_constitutional`). The incumbent's scores then come from the
**recorded outcome**, not from simulation. History already ran that
experiment; we just transcribe the result against the rubric.
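The consistency rule in the parenthesis above can be sketched as a short check. The dictionary keys mirror the field names used in this paper; the function name and the sample timeline are hypothetical.

```python
def coding_errors(timeline):
    """Labels of decision points whose actual_move is not among permitted_moves.

    Points flagged extra_constitutional are exempt, since they mark where
    actors stepped outside the legal order.
    """
    errors = []
    for point in timeline:
        if point.get("extra_constitutional", False):
            continue
        if point["actual_move"] not in point["permitted_moves"]:
            errors.append(point["label"])
    return errors


# Placeholder timeline, not drawn from a real dossier.
timeline = [
    {"label": "dp1", "permitted_moves": ["a", "b"], "actual_move": "a"},
    {"label": "dp2", "permitted_moves": ["c"], "actual_move": "d"},
    {"label": "dp3", "permitted_moves": [], "actual_move": "use force",
     "extra_constitutional": True},
]

print(coding_errors(timeline))
```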

For kernel v0.1, steps 1–4 are *computed*: `kernel.py` encodes the kernel's
articles as a machine-readable rule table (gates, quorums, sunsets,
invariants, the fork right), and the harness walks the same timeline asking
what the kernel text affords at each point.

### 6.2 The coalition-strength assumption

The pivotal modeling assumption: **actor coalitions are held constant across
both replays.** If 38% of the legislature backed the executive's emergency
claim historically, the kernel replay assumes the same 38% backs the
analogous move under the kernel. We do not model persuasion, defection, or
the possibility that different rules would have produced different
coalitions. This assumption is conservative in both directions — it denies
the kernel credit for coalition effects its incentives might create, and it
denies the incumbent the same — and it is the assumption most likely to be
wrong (§9.2).
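As a worked illustration of the frozen-coalition rule, the sketch below carries the 38% figure from the example above into a gate check. The 2/3 threshold is the emergency-clause vote-gate quoted in §6.3; the function name and the use of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction


def clears_gate(coalition_share, threshold):
    """True if the recorded coalition meets or exceeds the gate's threshold."""
    return coalition_share >= threshold


recorded_share = Fraction(38, 100)   # held constant across both replays
kernel_gate = Fraction(2, 3)         # supermajority vote-gate

print(clears_gate(recorded_share, kernel_gate))
```

Exact fractions avoid floating-point surprises when a coalition sits exactly on a threshold such as 2/3.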

### 6.3 Mapping incumbent moves to kernel moves

Each dossier decision point carries a `kernel_mapping`: the kernel-vocabulary
equivalent of the historical move (e.g., "invoke Article 48 emergency decree"
maps to "propose rights-derogation under kernel emergency clause, which
requires a 2/3 vote-gate and a 30-day sunset"). Where no kernel equivalent
exists, the mapping is `move_unavailable`: the kernel simply does not contain
the affordance, and the replay records that the attack fails at
step 1. Where the kernel offers a move the incumbent lacked (most commonly the
fork right in secession events), the mapping notes the *additional*
affordance and the harness explores it as a counter-move.

This mapping is hand-authored per dossier and reviewed; it is the second
place (after `permitted_moves` coding) where human judgment enters, and it
is published in full inside each dossier so it can be disputed line by line.
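A mapping entry of this kind might look like the following sketch. The Article 48 entry reuses the example above; the record layout, the field names other than `move_unavailable`, and the second entry are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional

MOVE_UNAVAILABLE = "move_unavailable"


@dataclass(frozen=True)
class KernelMapping:
    incumbent_move: str
    kernel_move: str                       # or MOVE_UNAVAILABLE
    vote_gate: Optional[Fraction] = None   # share required to pass, if any
    sunset_days: Optional[int] = None      # automatic expiry, if any


def step_one(mapping):
    """Affordance check: does the kernel text contain the move at all?"""
    if mapping.kernel_move == MOVE_UNAVAILABLE:
        return "attack fails at step 1"
    return "move available; proceed to gate"


article_48 = KernelMapping(
    incumbent_move="invoke Article 48 emergency decree",
    kernel_move="propose rights-derogation under kernel emergency clause",
    vote_gate=Fraction(2, 3),
    sunset_days=30,
)
no_equivalent = KernelMapping(
    incumbent_move="placeholder move with no kernel equivalent",
    kernel_move=MOVE_UNAVAILABLE,
)

print(step_one(article_48))
print(step_one(no_equivalent))
```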

### 6.4 Determinism and breach states

Given a dossier and a rule table, the replay is fully deterministic: same
inputs, same trace, same scores. There is no randomness and no language
model in the scoring path. When the kernel replay reaches a state where the
recorded coalition can neither complete its move nor be lawfully stopped —
the same kind of dead end that historically preceded extra-constitutional
action — the harness records a **breach state** and the kernel's scores are
capped accordingly. A constitution that merely *relocates* the cliff edge
does not get credit for removing it.
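The breach rule reduces to a two-question classification per decision point, which the sketch below folds over a timeline. The function names, state labels, and sample inputs are illustrative assumptions. Here both answers are supplied as booleans, whereas the paper describes the harness computing them from the rule table.

```python
def classify(can_complete, lawfully_stoppable):
    """Classify one decision point under a given text."""
    if can_complete:
        return "move_completes"
    if lawfully_stoppable:
        return "move_stopped"
    return "breach"   # neither completable nor lawfully stoppable


def replay(points):
    """Deterministic walk: stop at the first breach, where text-only simulation ends."""
    trace = []
    for label, can_complete, lawfully_stoppable in points:
        state = classify(can_complete, lawfully_stoppable)
        trace.append((label, state))
        if state == "breach":
            break
    return trace


# Placeholder decision points: (label, can_complete, lawfully_stoppable).
points = [
    ("dp1", True, False),
    ("dp2", False, False),
    ("dp3", False, True),    # never reached: the walk stops at dp2
]

first = replay(points)
second = replay(points)
print(first == second, first)
```

Because `replay` is a pure function of its inputs, running it twice on the same points yields the same trace, which is the property the paragraph above relies on.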

---

## 7. Scoring methodology

### 7.1 The four dimensions

Normative scale definitions and anchor descriptions live in
[`docs/RUBRIC.md`](RUBRIC.md); this section explains the rationale.

1. **Worst-off-participant outcome (WOP).** What happened to the person or
   class with the least power and the most exposure? Anchors run from
   "death, detention, or permanent rights loss" at the bottom to "made whole,
   with standing to contest" at the top. For the incumbent this is read from
   the record (India 1975: ~110,000 detained without trial scores at the
   floor regardless of how elegantly the Emergency was eventually unwound).
   For the kernel it is computed from the terminal state: which invariants
   protecting that class held, and what remedies the text afforded them.
2. **Commons integrity.** Did the shared resources — treasury, institutional
   independence, electoral machinery, public information — survive intact?
   This dimension is where court-capture events do their damage even when no
   individual is visibly harmed.
3. **Latency to resolution.** Days from trigger to a stable terminal state.
   Scored on a logarithmic scale because the difference between 3 days and
   30 matters more than between 300 and 330. Latency is *third*, not first:
   a fast resolution that sacrifices the worst-off is not a resolution, it
   is a sacrifice with good throughput.
4. **Trust preservation.** Did participants exit the event still believing
   the rules bind everyone? Proxied for the incumbent by recorded indicators
   (subsequent turnout, violence, emigration, polling where it exists, and —
   the strongest signal — whether the same exploit was attempted again).
   For the kernel, proxied structurally: did every actor's recorded core
   stake retain a lawful channel at the terminal state, and were all gates
   that fired publicly legible?
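The arithmetic behind the logarithmic choice is short. The sketch below compares the two gaps named in item 3 on a base-10 log scale; the base and the function name are assumptions, and the rubric's actual scale is defined in `docs/RUBRIC.md`.

```python
import math


def log_gap(days_a, days_b):
    """Distance between two latencies on a base-10 logarithmic scale."""
    return abs(math.log10(days_b) - math.log10(days_a))


short_gap = log_gap(3, 30)      # a tenfold difference: about 1.0
long_gap = log_gap(300, 330)    # a 10% difference: about 0.04

print(round(short_gap, 3), round(long_gap, 3))
```

On a linear scale the second gap (30 days) would count for slightly more than the first (27 days); on the log scale the first is roughly 24 times larger.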

### 7.2 Lexicographic ordering

Scores are compared **lexicographically**: WOP first, and only on a WOP tie
do the other dimensions break it, in order. We deliberately rejected a
weighted sum. Weighted sums invite exactly the trade the empathy metric
forbids — "we lost the minority but resolved it in record time, net score
positive." Under lexicographic ordering that trade is unrepresentable. The
aggregate scorecard does *also* publish per-dimension means across all 30
events, because the lexicographic comparison answers "which text won this
event" while the dimension means answer "where is each text weak."
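Python tuples already compare lexicographically, so the ordering can be sketched in a few lines. The 0-5 integer scale, the higher-is-better convention on every dimension, and both sample scorecards are assumptions made for illustration.

```python
from typing import NamedTuple


class Score(NamedTuple):
    wop: int        # worst-off-participant outcome, compared first
    commons: int
    latency: int    # assumed already converted so that higher means faster
    trust: int


def lexicographic_winner(a, b):
    """Return 'a', 'b', or 'tie' using tuple (lexicographic) comparison."""
    if a > b:
        return "a"
    if b > a:
        return "b"
    return "tie"


fast_but_harsh = Score(wop=1, commons=5, latency=5, trust=5)
slow_but_protective = Score(wop=4, commons=3, latency=1, trust=3)

print(lexicographic_winner(fast_but_harsh, slow_but_protective))
print(sum(fast_but_harsh), sum(slow_but_protective))
```

An equal-weight sum prefers the fast outcome (16 against 11); the lexicographic comparison prefers the protective one, because WOP is decided before any other dimension is consulted.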

### 7.3 Scoring the incumbent: transcription, not judgment

The incumbent's scores are anchored to recorded facts via the rubric's
anchor tables. Two coders independently mapped each dossier's
`incumbent_outcome` to rubric anchors; disagreements (9 of 120 cells, all
within one anchor step) were resolved by taking the score *more favorable to
the incumbent*. The benchmark's thumb, where it must rest somewhere, rests
on the incumbent's side of the scale.
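The tie-break can be sketched as taking the higher of the two coders' scores in each cell. This assumes a numeric scale on which higher is more favorable; the function name and sample scores are hypothetical.

```python
def reconcile(coder_a, coder_b):
    """Merge two coders' scores, resolving disagreements in the incumbent's favor."""
    merged, disagreements = {}, []
    for dim in coder_a:
        if coder_a[dim] != coder_b[dim]:
            disagreements.append(dim)
        merged[dim] = max(coder_a[dim], coder_b[dim])
    return merged, disagreements


# Placeholder codings for one dossier's four cells.
coder_a = {"wop": 2, "commons": 3, "latency": 4, "trust": 2}
coder_b = {"wop": 2, "commons": 4, "latency": 4, "trust": 2}

merged, disagreements = reconcile(coder_a, coder_b)
print(merged, disagreements)
```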

### 7.4 Scoring the kernel: computed, then capped

Kernel scores come out of the deterministic replay, with three caps that
prevent the harness from flattering its own constitution:

- **Breach cap.** Any breach state (§6.4) caps WOP and trust at the rubric's
  "unresolved within the legal order" anchor.
- **Affordance-only cap.** Where the kernel "wins" purely because a move is
  unavailable (the attack fails at step 1), the trust score is capped one
  step below maximum, on the reasoning that a blocked faction with the
  recorded coalition strength is a standing pressure the text has contained
  but not dissolved.
- **Latency floor.** Kernel gate latencies (vote windows, sunset periods,
  review timelines) are charged in full even when the simulated outcome is
  clean. The kernel never resolves an event in zero days.
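Taken together, the three caps act as a post-processing step on the raw replay scores. In the sketch below the numeric anchors (`breach_anchor`, `trust_max`), the dictionary layout, and the treatment of latency as a sum of gate durations are assumptions standing in for values defined in the rubric.

```python
def apply_caps(raw, *, breach, affordance_only, gate_days,
               breach_anchor=1, trust_max=5):
    """Apply the breach cap, affordance-only cap, and latency floor to raw scores."""
    capped = dict(raw)
    if breach:
        # Breach cap: WOP and trust cannot exceed the "unresolved" anchor.
        capped["wop"] = min(capped["wop"], breach_anchor)
        capped["trust"] = min(capped["trust"], breach_anchor)
    if affordance_only:
        # Affordance-only cap: trust held one step below the maximum.
        capped["trust"] = min(capped["trust"], trust_max - 1)
    # Latency floor: gate latencies are charged in full.
    capped["latency_days"] = max(capped["latency_days"], sum(gate_days))
    return capped


raw = {"wop": 5, "trust": 5, "latency_days": 0}
clean_win = apply_caps(raw, breach=False, affordance_only=True, gate_days=[30, 14])
breached = apply_caps(raw, breach=True, affordance_only=False, gate_days=[30])
print(clean_win, breached)
```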

---

## 8. The counterfactual replay under kernel v0.1

### 8.1 What the kernel brings to each failure class

The replays exercise five kernel mechanisms, and the per-class results in
the aggregate scorecard decompose along these lines:

- **Vote-gate finality** (certification events): the kernel's requirement
  that a contested tally route to a pre-committed dispute quorum with a
  hard deadline removes the ambiguity that powered 1876, 2000, and 2020 —
  all three of which were, at bottom, fights over *who counts the counters*.
- **Sunset-by-default emergency powers** (emergency events): every
  derogation under the kernel expires unless re-ratified at supermajority.
  Weimar's Article 48 and India's Article 352 both lacked automatic decay;
  the replays show the difference is not that emergencies are prevented but
  that *permanence requires repeated, visible, supermajority consent*.
- **Continuity defaults** (shutdown events): the kernel's rule that the
  prior budget continues at last-ratified levels during deadlock removes
  the hostage. Belgium 2010–11 is the natural experiment already in the
  record — caretaker continuity rules meant 541 days without a government
  produced no shutdown — and the kernel replay of the US events
  essentially imports Belgium's affordance.
- **Kernel/userland separation with semver gates** (court-capture events):
  changing who reviews the rules is a kernel-level (major-version) change
  requiring supermajority. Hungary 2011 and Poland 2015 proceeded at simple
  or constitutional-but-single-faction majorities; under the kernel the
  same coalitions fail the gate. Venezuela 2017 and Israel 2023 stress this
  harder and the replays record partial breach states — a determined
  supermajority-adjacent coalition still finds pressure points (§9.4).
- **The fork right** (secession events): the kernel's most radical
  affordance. Exit is lawful, procedural, and slow — a supermajority of the
  seceding unit across two votes separated by a cooling period, with
  negotiated division of commons. Czechoslovakia 1992 is the recorded
  proof-of-concept; the Supreme Court of Canada's 1998 *Secession Reference*,
  issued after the 1995 referendum, later articulated almost exactly this
  rule. The US 1860 and Spain 2017 replays show the
  fork right's real function is not to enable exit but to **make the
  legitimacy question decidable**, removing the void in which both sides
  claimed the text supported them.

### 8.2 Where the kernel loses or draws

Honesty requires the list of events where the replay does *not* favor the
kernel:

- **Gambia 2016**: the incumbent outcome (regional diplomatic resolution,
  incumbent leaves, no civil war) scores near the top of the rubric. The
  kernel replay matches but cannot beat it, and the kernel's gate latencies
  make it *slower*. Incumbent wins on latency, ties elsewhere.
- **South Korea 2024**: the incumbent constitution's own machinery reversed
  the martial-law declaration within hours and impeached the president
  within weeks. The kernel replay produces a comparable trace. Effective tie.
- **Venezuela 2017** and **Russia 1993**: both replays reach breach states.
  When a faction that controls the means of force is willing to leave the legal
  order, no text holds it, and the harness says so for both texts.
- **Belgium 2010–11**: the incumbent's caretaker conventions already embody
  the kernel's continuity default. Tie on every dimension except latency,
  where both texts score at the floor — 541 days is 541 days.

These results matter more than the kernel's wins. They show the harness can
return "incumbent wins," "tie," and "nobody wins," which is the minimum bar
for the scorecard meaning anything at all.

---

## 9. Limits of text-only simulation

This is the section the milestone funds us to write honestly. Each
subsection names a limit, states what it breaks, and states what we did about
it — which is sometimes "nothing can be done; discount accordingly."

### 9.1 Parchment barriers: text is not enforcement

Madison's phrase. A rule on paper stops nothing by itself; it stops things
only when enough actors treat it as binding. The harness models affordances
and gates, not the willingness to honor them. Weimar's constitution
*contained* the tools to stop Hitler; the actors holding them declined to
use several. Our replay shows the kernel's sunset clauses making the
Enabling Act's path mechanically harder — it cannot show whether Hindenburg's
circle would have honored a sunset any more than they honored the spirit of
Article 48.

**Mitigation:** the breach-state mechanism (§6.4) and breach cap (§7.4) at
least prevent the harness from scoring textual victories in situations where
the record shows the text had already lost its grip. **Residual:** the
kernel's scores on emergency and self-coup events should be read as upper
bounds on the text's contribution, not predictions.

### 9.2 The frozen-coalition assumption

§6.2's assumption — same coalitions under both texts — is certainly false in
both directions. Rules shape coalitions: a faction that knows a court-capture
move requires a supermajority may never form, or may form larger and angrier.
Game-theoretically, we are evaluating off-equilibrium play: we replay
strategies optimized against the incumbent's rules inside a different
rule-set, where rational actors would have played differently. The kernel is
being tested against *yesterday's exploits*, not against the exploits that
would evolve against the kernel itself.

**Mitigation:** Milestone #5 (adversarial self-play) exists precisely to
generate kernel-native exploits; this benchmark deliberately tests only the
historical attack set and says so. **Residual:** a good score here means
"resists known attacks," never "secure."

### 9.3 Counterfactual unknowability

We do not and cannot know what would have happened. The incumbent side of
every scorecard is fact; the kernel side is a deterministic consequence of
our encoding choices. The two columns are therefore not epistemically
symmetric, and presenting them side by side risks implying they are.

**Mitigation:** every scorecard renders the kernel column with an explicit
`simulated` watermark and the dossier's confidence rating; the aggregate
report never states "the kernel would have prevented X," only "under the
encoded replay, the attack fails at gate Y." **Residual:** readers will
round our careful phrasing up to the strong claim anyway. We can only keep
repeating the weak one.

### 9.4 Home-field advantage and Goodhart risk

The kernel's authors built the harness. Three specific contamination
channels exist: (a) the kernel-mapping tables (§6.3) are authored by people
who know the kernel's strengths; (b) kernel v0.1 itself postdates most of
these events and its drafters knew this history — the kernel is, in part,
*trained on the test set*; (c) once this benchmark gates kernel amendments
in CI, future kernel changes will be optimized against these thirty events
(Goodhart's law applied to constitutions).

**Mitigation for (a):** mappings are published in full inside each dossier
for line-by-line dispute, and the coder-disagreement rule (§7.3) breaks
toward the incumbent. **Mitigation for (b):** none possible; we state it.
Any constitution drafted in 2024+ has read this history. The honest claim is
not "the kernel generalizes" but "the kernel incorporates these thirty
lessons, verifiably." **Mitigation for (c):** the dossier set is versioned
and additions require the same amendment process as the kernel; self-play
(Milestone #5) continuously generates held-out events.

### 9.5 What the actors knew vs. what the dossier knows

Dossiers encode incentives reconstructed with hindsight. Historical actors
operated under uncertainty, misinformation, and time pressure that a clean
YAML timeline erases. Kerr in 1975 did not know Fraser would win the
election; the dossier does. Hindsight makes every historical move look more
deliberate and every gate more decisive than it was.

**Mitigation:** `inferred: true` flags on reconstructed motives, and
confidence ratings that downweight era-distant events in the restricted
aggregate. **Residual:** irreducible. This is a limit of all historical
analysis, inherited in full.

### 9.6 The missing channels: money, violence, information

The harness has no model of economic coercion, organized violence, or
propaganda — three channels that decided or shaped at least a third of the
events (among them Russia 1993, Philippines 1972, Kenya 2007, and Venezuela 2017). A
constitution operating in an information environment where a faction
controls broadcast media faces attacks no procedural gate addresses. The
kernel's text-level legibility commitments gesture at this; the harness
cannot test them.

**Mitigation:** events where these channels were decisive carry timeline
entries flagged `extra_constitutional` or `channel: out_of_model`, and the
scorecard annotates the kernel column "out-of-model factors decisive in
recorded outcome." **Residual:** large, and concentrated exactly in the
worst events — the ones where the worst-off participants suffered most.
The empathy metric is least trustworthy where it matters most. We consider
this the single most important sentence in this paper.

### 9.7 Scale invariance is asserted, not tested

Kernel v0.1 claims to scale from a household to a polity. Every dossier here
is a *state-scale* event. Nothing in this benchmark tests the small end, and
quorum/latency parameters that work at n=300 million may be absurd at n=8.
**Mitigation:** none in this milestone; flagged for the dogfooding ledger,
where the project's own n≈hundreds governance generates the small-scale data.

### 9.8 Thirty events is a small n

No statistical claim survives n=30 with six strata. The aggregate means in
the scorecard are descriptive summaries of this specific test set, not
estimates of any population parameter, and we publish no confidence
intervals because they would imply a sampling model we do not have.

---

## 10. Threats to validity (summary table)

| Threat | Type | Severity | Addressed by |
|---|---|---|---|
| Text ≠ enforcement | Construct | High | Breach caps (§6.4, §7.4); honest framing |
| Frozen coalitions | Internal | High | Stated; offloaded to Milestone #5 |
| Counterfactual asymmetry | Construct | High | `simulated` watermark; weak-claim discipline |
| Kernel trained on test set | External | High | Stated; held-out events via self-play |
| Goodhart on CI gating | External | Medium | Versioned dossier set; amendment-gated additions |
| Hindsight in incentive coding | Internal | Medium | `inferred` flags; confidence ratings |
| Out-of-model channels | Construct | High | `out_of_model` flags; scorecard annotations |
| Selection/salience bias | External | Medium | Stated criteria (§4.2); valence balance (§4.3) |
| Coder bias in mappings | Internal | Medium | Published mappings; tie-break toward incumbent |
| Small n | Statistical | Accepted | No inferential claims made |

## 11. How to read (and how not to read) the scorecard

**Valid readings:**

- "Under the encoded replay, the 2011 Hungarian court-capture sequence fails
  the kernel's supermajority gate at decision point 3." (A statement about
  texts and gates.)
- "The kernel's worst per-class performance is the self-coup class, where 3
  of 5 replays reach breach states." (A statement about where the kernel's
  text runs out.)
- "The incumbent's recorded WOP scores are at the rubric floor in 7 of 30
  events." (A statement of historical fact via the rubric's anchors.)

**Invalid readings:**

- "Kernel v0.1 would have prevented the Indian Emergency." (Counterfactual
  claim; §9.3.)
- "Kernel v0.1 outperforms the US Constitution." (Population claim from
  n=30 curated events; §9.8, §9.4b.)
- "A high kernel score means the kernel is secure." (Tests known attacks
  only; §9.2.)

The scorecard's value is **diagnostic and regression-guarding**: it tells
kernel maintainers which historical attack patterns the current text holds in
check under the encoded replay and which it does not, and it freezes that
knowledge as a CI gate so that no future amendment silently reopens a closed
attack.

## 12. Reproducibility

Everything in the pipeline is deterministic and versioned:

```bash
cd benchmark
pip install -e ".[dev]"
python -m incumbent_benchmark.cli validate dossiers/      # schema-check all 30
python -m incumbent_benchmark.cli run dossiers/ -o results/   # full replay
python -m incumbent_benchmark.cli aggregate results/ -o results/AGGREGATE.md
pytest                                                    # harness self-tests
```

There is no network access, no model inference, and no randomness in the
scoring path. Two machines running the same commit produce byte-identical
scorecards. Disputes about results are therefore always disputes about
*encodings* — a dossier's `permitted_moves`, a `kernel_mapping`, a rubric
anchor — and every encoding is a reviewable line in a versioned file. That
is the whole point.
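
As an illustration of how the byte-identical claim could be checked between
two machines, here is a minimal sketch that hashes every file under a results
directory. It is not part of the harness: `file_digests`, `tree_digest`, and
`differing_files` are hypothetical helper names, and the only assumption taken
from the text is that results are written as files under a directory such as
`results/`.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digests(root: Path) -> Dict[str, str]:
    """Map each file's POSIX-style relative path to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def tree_digest(root: Path) -> str:
    """Fold all per-file digests into one digest, in sorted path order.

    Sorting by relative path keeps the result independent of the order in
    which the filesystem happens to enumerate files.
    """
    outer = hashlib.sha256()
    for rel, digest in sorted(file_digests(root).items()):
        outer.update(rel.encode("utf-8") + b"\0" + digest.encode("ascii") + b"\n")
    return outer.hexdigest()


def differing_files(left: Path, right: Path) -> List[str]:
    """List relative paths that are missing on one side or differ in bytes."""
    a, b = file_digests(left), file_digests(right)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

Each machine would publish `tree_digest(Path("results"))` alongside the commit
hash; if the two digests disagree, `differing_files` on the two result trees
narrows the dispute to specific scorecards.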

## 13. Governance of the benchmark itself

The benchmark is governed by the constitution it tests. Changes to the
rubric, the dossier set, or the kernel rule table are amendments: pull
requests, vote-gated, semver'd. Adding a dossier is a minor version; changing
a rubric anchor or the lexicographic ordering is major and requires
supermajority, because changing how you score is changing what you optimize.
The dossier set at v1.0 (these thirty events) is the frozen reference set;
results are always reported against a named dossier-set version.
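
The versioning rule above can be sketched as a small classifier. This is an
illustration only, not the repository's amendment tooling: the `Change` enum,
`required_bump`, and `bump_version` are hypothetical names, only the three
change kinds named in the text are modeled, and the sketch assumes a
two-component `MAJOR.MINOR` version string as in "v1.0". It also assumes that
minor changes do not need a supermajority, which the text implies but does not
state.

```python
from enum import Enum
from typing import Iterable, Tuple


class Change(Enum):
    """Change kinds named in the governance rule (hypothetical labels)."""
    ADD_DOSSIER = "add_dossier"
    RUBRIC_ANCHOR = "rubric_anchor"
    LEXICOGRAPHIC_ORDER = "lexicographic_order"


# Changing how events are scored is major; adding an event is minor.
MAJOR_CHANGES = frozenset({Change.RUBRIC_ANCHOR, Change.LEXICOGRAPHIC_ORDER})


def required_bump(changes: Iterable[Change]) -> Tuple[str, bool]:
    """Return (bump kind, needs_supermajority) for a set of changes.

    Assumption: needs_supermajority is False for minor changes.
    """
    kinds = set(changes)
    if not kinds:
        raise ValueError("an amendment must contain at least one change")
    is_major = bool(kinds & MAJOR_CHANGES)
    return ("major" if is_major else "minor", is_major)


def bump_version(version: str, bump: str) -> str:
    """Apply a bump to a 'MAJOR.MINOR' version string."""
    major, minor = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0"
    if bump == "minor":
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown bump kind: {bump!r}")
```

Under these assumptions, an amendment that only adds a dossier to set 1.0
yields 1.1, while one that also edits a rubric anchor yields 2.0 and is
flagged as needing a supermajority.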

## 14. Future work

1. **Self-play integration (Milestone #5):** kernel-native exploits as
   held-out events, closing the §9.2 / §9.4b gap.
2. **Multi-baseline replays:** run each event under *other* incumbent
   constitutions (the German Basic Law's eternity clause against the
   court-capture class, for instance), turning the two-column scorecard into
   a leaderboard.
3. **Behavioral layer:** replace frozen coalitions with agent populations
   whose compliance is a variable, not an axiom — the honest version of the
   simulation this milestone could not responsibly claim.
4. **Quiet-decay dossiers:** develop a coding method for slow erosion events
   that never produce a named crisis (§4.4), the failure mode this test set
   underweights.
5. **Small-scale validation:** instrument the project's own dogfooded
   governance as the n=8-to-n=10,000 test bed (§9.7).

## 15. References and source notes

Per-event primary and secondary sources are cited inside each dossier's
`sources` block; the index at [`dossiers/INDEX.md`](../dossiers/INDEX.md)
lists all thirty. Methodological influences, for the record:

- Madison, *Federalist No. 48* (parchment barriers).
- Levitsky & Ziblatt, *How Democracies Die* (2018) — forbearance as the
  unmodeled variable; directly motivates §9.1.
- Ginsburg & Huq, *How to Save a Constitutional Democracy* (2018) — the
  court-capture and erosion taxonomy underlying §4.1.
- Elkins, Ginsburg & Melton, *The Endurance of National Constitutions*
  (2009) and the Comparative Constitutions Project — the empirical base for
  constitutional lifespan and amendment-cadence claims.
- *Reference re Secession of Quebec*, [1998] 2 S.C.R. 217 — the articulated
  fork-right precedent discussed in §8.1.
- Sen, *The Idea of Justice* (2009) — comparative rather than
  transcendental evaluation; the philosophical warrant for benchmarking
  governance pairwise instead of against an ideal.
- The ML benchmarking literature's cautionary tales (test-set contamination
  and Goodhart dynamics), applied here in §9.4.

---

*This paper ships with dossier-set v1.0, rubric v1.0, and harness v1.0, and
is amended through the same pipeline as everything else in this repository.
If you can show a dossier's encoding is wrong, open a pull request; that is
not a failure of the benchmark, it is the benchmark working.*