# The Incumbent Benchmark — Methodology Paper

**Version 1.0 · Milestone #4 deliverable · FablePool**

> We benchmark models. This paper describes how — and how honestly — we can
> benchmark governance.

---

## Table of contents

1. [Abstract](#1-abstract)
2. [Motivation and framing](#2-motivation-and-framing)
3. [What the benchmark is and is not](#3-what-the-benchmark-is-and-is-not)
4. [Event selection](#4-event-selection)
5. [The dossier: from history to structured data](#5-the-dossier-from-history-to-structured-data)
6. [The simulation model](#6-the-simulation-model)
7. [Scoring methodology](#7-scoring-methodology)
8. [The counterfactual replay under kernel v0.1](#8-the-counterfactual-replay-under-kernel-v01)
9. [Limits of text-only simulation](#9-limits-of-text-only-simulation)
10. [Threats to validity](#10-threats-to-validity)
11. [How to read (and how not to read) the scorecard](#11-how-to-read-and-how-not-to-read-the-scorecard)
12. [Reproducibility](#12-reproducibility)
13. [Governance of the benchmark itself](#13-governance-of-the-benchmark-itself)
14. [Future work](#14-future-work)
15. [References and source notes](#15-references-and-source-notes)

---

## 1. Abstract

The Incumbent Benchmark replays thirty real constitutional stress events — contested certifications, government shutdowns, emergency-power invocations, court capture, secession crises, and self-coups, drawn from twenty-three countries across three centuries — as structured simulations. Each event is encoded as a **dossier**: the actors, their incentives, the moves the incumbent constitution permitted at each decision point, and the outcome it actually produced. A **replay harness** runs the same event under the FablePool constitutional kernel v0.1 and produces a side-by-side **scorecard** on four dimensions, graded lexicographically: (1) the outcome for the worst-off participant, (2) commons integrity, (3) latency to resolution, and (4) trust preservation.

The headline finding of this paper is not a number. It is a method, and a candid accounting of that method's limits. Text-only simulation can tell you whether a constitution's *text* contains a move that defeats a given attack. It cannot tell you whether anyone would have made that move. We state this limit up front, design the scoring to respect it, and document every place where judgment — ours — enters the pipeline.

---

## 2. Motivation and framing

### 2.1 The release-cadence problem

The US Constitution's most recent amendment was ratified in 1992 after a pendency of more than 202 years. The document has shipped one release in fifty years. Meanwhile the environment it governs — communications, weapons, finance, information — has iterated thousands of times. When a system's rules iterate slower than the strategies played against them, exploits accumulate. Every event in this benchmark is, at root, an accumulated exploit: a strategy the drafters did not anticipate, executed inside (or at the edge of) the rules.

### 2.2 Why benchmark?

Machine learning made progress legible by agreeing on shared evaluation suites. Before ImageNet, "my model is better" was an argument; after, it was a number. Governance has no equivalent. Constitutional scholars compare documents qualitatively; political scientists compare regimes statistically; nobody runs the *same stress event* against *two different rule-sets* and scores the difference.

The Incumbent Benchmark is a first attempt at that loop, with the same epistemic posture a good ML benchmark has: a frozen test set (the thirty dossiers), a published rubric, a deterministic harness, and a methodology paper that lists the ways the numbers can lie.

### 2.3 The empathy metric

One scoring rule dominates all others: **every scenario is graded first on how the worst-off participant fares under stress.** This is a normative commitment, stated openly rather than smuggled into weights.
A constitution that resolves a crisis quickly by crushing a minority scores worse than one that resolves it slowly while protecting everyone. The lexicographic ordering in §7.2 makes this commitment mechanical: no amount of speed or institutional tidiness can buy back a failure on the worst-off dimension. The bet underneath the whole project is that humans are good at heart and the tooling is broken; the scoring encodes that bet by refusing to treat any participant as acceptable collateral.

---

## 3. What the benchmark is and is not

**It is:**

- A structured, reproducible comparison of two rule-texts against the same thirty historical attack patterns.
- A regression suite for the kernel: every event where kernel v0.1 scores poorly is a filed defect with a dossier attached.
- A demonstration that governance evaluation can be versioned, reviewed, and re-run — the software loop applied to constitutional design.

**It is not:**

- A claim that kernel v0.1 *would have* produced better outcomes in 1933 Germany or 1975 India. Counterfactual history is not knowable, and §9 explains why we refuse to claim it.
- A model of human behavior. The harness models *rule affordances* — what the text permits, requires, and forbids — not what frightened, ambitious, or exhausted people do with those affordances.
- A ranking of countries. The dossiers include several events where the incumbent constitution performed *well* (Gambia 2016's regional resolution, Czechoslovakia 1992's negotiated dissolution, UK 2019's judicial check on prorogation). The benchmark needs incumbent successes in the test set to guard against a harness that flatters the kernel by construction (§10.4).

---

## 4. Event selection

### 4.1 The taxonomy

Stress events were drawn from six failure classes, chosen because each maps to a distinct kernel mechanism that should be exercised:

| Class | Kernel mechanism under test | Dossiers |
|---|---|---|
| **Contested certification / succession** | Vote-gate finality, dispute-resolution quorum, term-limit invariants | US 1876, US 2000, US 2020, Kenya 2007, Gambia 2016, Bolivia 2019, Australia 1975 |
| **Emergency powers** | Sunset clauses, non-derogable invariants, supermajority gates on rights suspension | Weimar 1930–33, France 1961, India 1975, Philippines 1972, Hungary 2020, South Korea 2024 |
| **Shutdown / fiscal hostage-taking** | Continuity-of-operations defaults, commons-drain tests | US 2018–19, US 2011, Belgium 2010–11 |
| **Court capture** | Appointment quorum rules, kernel/userland separation, amendment semver gates | US 1937, Hungary 2011, Poland 2015, Venezuela 2017, Israel 2023 |
| **Secession / exit crises** | The right to fork (kernel Article on exit), referendum legitimacy rules | US 1860–61, Canada 1995, Spain 2017, Czechoslovakia 1992 |
| **Self-coup / executive seizure** | Separation-of-powers invariants, anti-entrenchment tests | Peru 1992, Russia 1993, Honduras 2009, Sri Lanka 2018, UK 2019 |

### 4.2 Inclusion criteria

An event qualified for the test set only if **all five** criteria held:

1. **Constitutional, not merely political.** The crisis turned on what the governing text permitted, required, or left ambiguous — not purely on force, economics, or external invasion. (We excluded, e.g., the 1973 Chilean coup as primarily extra-constitutional violence; we included Peru 1992 because Fujimori's autogolpe was executed *through* a claimed constitutional channel and ratified by a constitutional process.)
2. **Documented decision points.** The historical record must identify specific moments where an actor chose among legally available moves, with sources adequate to encode incentives.
   Events whose internal deliberations remain sealed or contested at the level of basic fact were excluded.
3. **A determinable outcome.** The event reached a resolution observable in the record, so the incumbent side of the scorecard is grounded in fact, not in our judgment.
4. **An identifiable worst-off participant.** The empathy metric requires knowing who bore the worst of it. Events where the distributional impact is genuinely unknown were excluded.
5. **Diversity contribution.** Each addition had to extend coverage along at least one axis: failure class, region, era, regime type, or outcome valence (incumbent success vs. failure).

### 4.3 Balance achieved

- **Geography:** 23 countries; 7 of 30 events from the US (deliberately overweighted — the project's framing document is the US Constitution and its release cadence is the headline), 23 from Europe, Asia, Africa, Australia, and the rest of the Americas.
- **Era:** 1860 to 2024; 4 events pre-1950, 26 post-1950, 15 of them from 2010 onward.
- **Outcome valence:** roughly one-third of events are coded as incumbent successes or partial successes (Gambia 2016, Canada 1995, Czechoslovakia 1992, UK 2019, France 1961, US 1937, South Korea 2024, US 2020 in its certification outcome). This is the benchmark's control group: if the kernel does not also score well on events the incumbent handled well, the harness is broken, not the incumbent.

### 4.4 Known selection biases

We document these here and return to them in §10:

- **Survivorship of the record.** Crises in well-documented, mostly English-, Spanish-, and French-language polities are overrepresented. There is no dossier from a polity whose archives we could not read in translation with confidence.
- **Salience bias.** Famous crises are easier to source. Quiet constitutional decay (the kind that never produces a named "crisis") is underrepresented, and is arguably the more common failure mode.
- **Drafting-era anachronism.** Pre-1950 events (US 1860, US 1876, Weimar) are encoded with actor incentives reconstructed at greater interpretive distance. Their dossiers carry lower confidence ratings (see §5.4).

---

## 5. The dossier: from history to structured data

### 5.1 Schema

Each dossier is a YAML document validated against the Pydantic schema in `src/incumbent_benchmark/schema.py`. The load-bearing fields:

- **`actors`** — every party whose choices shaped the event, with declared `incentives` (what they wanted), `capabilities` (what they could do), and `constraints` (what bound them). Crucially, this includes the **worst-off participant** — often not a named individual but a population class (detained opposition members, furloughed workers, residents of a contested region) — encoded as an actor with stakes but typically no moves.
- **`timeline`** — the ordered decision points. Each decision point lists the **moves the incumbent constitution permitted** at that moment (`permitted_moves`), the move actually taken (`actual_move`), and the textual basis (`legal_basis`) — article, section, statute, or the explicit finding that the text was *silent*, which is itself a datum the harness consumes.
- **`incumbent_outcome`** — what actually happened, scored against the rubric: the worst-off participant's fate, damage to the commons, days from trigger to resolution, and measurable trust effects (turnout shifts, institutional-confidence polling where available, subsequent emigration or violence).
- **`ambiguities`** — the places where the incumbent text underdetermined the outcome. These are the benchmark's most valuable extraction: every ambiguity is an attack surface, and several have already been converted into adversarial regression tests in Milestone #3's suite.

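
To make the shape of these fields concrete, here is a minimal sketch using plain dataclasses as a stand-in for the Pydantic models. It is not the contents of `schema.py`: the class names, the `worst_off` flag, the `label` field, and the toy data are all assumptions made for illustration, and `incumbent_outcome` is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Actor:
    name: str
    incentives: List[str]
    capabilities: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    worst_off: bool = False  # hypothetical flag marking the worst-off participant


@dataclass
class DecisionPoint:
    label: str  # hypothetical human-readable name for the decision point
    permitted_moves: List[str]
    actual_move: str
    legal_basis: str  # article, section, statute, or "textual_silence"
    extra_constitutional: bool = False


@dataclass
class Dossier:
    event: str
    actors: List[Actor]
    timeline: List[DecisionPoint]
    ambiguities: List[str] = field(default_factory=list)

    def worst_off_actors(self) -> List[Actor]:
        """Return the actors flagged as bearing the worst of the event."""
        return [a for a in self.actors if a.worst_off]


# Toy data for illustration only; not taken from any real dossier.
example = Dossier(
    event="toy-shutdown",
    actors=[
        Actor("executive", incentives=["fund a priority program"],
              capabilities=["veto appropriations"]),
        # The worst-off participant has stakes but no moves.
        Actor("furloughed workers", incentives=["be paid"], worst_off=True),
    ],
    timeline=[
        DecisionPoint(
            label="appropriations lapse",
            permitted_moves=["sign continuing resolution", "veto"],
            actual_move="veto",
            legal_basis="textual_silence",
        )
    ],
    ambiguities=["text is silent on funding during deadlock"],
)
```

Encoding the worst-off participant as an ordinary actor with empty `capabilities` keeps them visible to every later scoring step even though they never appear as the mover in the timeline.
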
### 5.2 Sourcing standard

Every factual claim in a dossier traces to at least one of: the constitutional/statutory text itself, official records (court judgments, parliamentary records, commission reports), or established secondary scholarship. Each dossier carries a `sources` block. We did not use the sources to *interpret* motives beyond what the record supports; where motive is inferred (e.g., why Governor-General Kerr did not warn Whitlam in 1975), the dossier marks the incentive entry as `inferred: true`.

### 5.3 The "permitted moves" coding decision

The hardest coding judgment in every dossier is the `permitted_moves` list: what did the incumbent text *actually allow* at each decision point? Three rules governed this coding:

1. **Contemporary legal opinion controls.** A move is "permitted" if a serious contemporaneous legal argument supported it — not if a court later validated it. (The whole point of several events is that the legal question was open at decision time.)
2. **Silence is permission with a flag.** Where the text said nothing, the move is coded permitted-by-silence (`basis: textual_silence`), because that is empirically how actors treated silence in 27 of 30 events.
3. **Force is out of scope.** Moves that required stepping outside the legal order entirely (Yeltsin shelling the parliament, Yoon's troop deployment beyond the decree's claimed authority) are recorded in the timeline as `extra_constitutional: true` and the harness treats them as the point where text-only simulation ends. The scorecard for those events carries an explicit "resolution achieved extra-constitutionally" annotation on the incumbent side.

### 5.4 Confidence ratings

Each dossier declares a `confidence` field (`high` / `medium` / `low`) reflecting interpretive distance: quality of the record, era, and whether the event's basic facts remain politically contested.
The aggregate scorecard (`aggregate.py`) reports results both overall and restricted to high-confidence dossiers, so a reader can discount the harder codings.

---

## 6. The simulation model

### 6.1 What "simulation" means here

The harness does **not** simulate human behavior. It performs **deterministic rule-trace replay**: at each decision point in the dossier's timeline, it asks of each constitution-as-text:

1. **Affordance:** Is the historically attempted move *available* under this text? (Does an article permit it, forbid it, or stay silent?)
2. **Gate:** If available, what procedural gates does the text impose — quorum, supermajority, sunset, review — and would the actor coalition recorded in the dossier have cleared them?
3. **Counter:** What counter-moves does the text give the other recorded actors, and at what cost and latency?
4. **Terminal state:** Following the dossier's recorded coalition strengths through the gates, what terminal state does the text reach — resolution, stalemate, or breach (the point where the text has no further move and history shows actors going outside it)?

For the incumbent, steps 1–4 are *checked against the record*: the harness replays the actual timeline and verifies the dossier's coding is internally consistent (an `actual_move` must appear in `permitted_moves` unless flagged `extra_constitutional`). The incumbent's scores then come from the **recorded outcome**, not from simulation. History already ran that experiment; we just transcribe the result against the rubric.

For kernel v0.1, steps 1–4 are *computed*: `kernel.py` encodes the kernel's articles as a machine-readable rule table (gates, quorums, sunsets, invariants, the fork right), and the harness walks the same timeline asking what the kernel text affords at each point.

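
The affordance and gate checks (steps 1 and 2 above) can be sketched as a small deterministic function. This is an illustration under stated assumptions, not the code in `kernel.py` or the harness: `Rule`, `Step`, `replay`, and the `gate_cleared` / `gate_failed` labels are hypothetical names, the rule table is reduced to a single vote threshold per move, and counter-moves and terminal states (steps 3 and 4) are left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """One row of a hypothetical rule table: a move and the vote share it needs."""
    move: str
    threshold: float  # e.g. 2/3 for a supermajority gate


@dataclass(frozen=True)
class Step:
    """One decision point: the attempted move and the coalition share behind it."""
    move: str
    coalition_share: float


def replay(rules: Dict[str, Rule], timeline: List[Step]) -> List[str]:
    """Walk the timeline in order and record what the text affords at each point."""
    trace = []
    for step in timeline:
        rule = rules.get(step.move)
        if rule is None:
            # Step 1 (affordance): the text contains no such move.
            trace.append(f"{step.move}: move_unavailable")
        elif step.coalition_share >= rule.threshold:
            # Step 2 (gate): the recorded coalition clears the gate.
            trace.append(f"{step.move}: gate_cleared")
        else:
            trace.append(f"{step.move}: gate_failed")
    return trace


# Illustrative inputs only; the move names and shares are made up.
rules = {"derogate_rights": Rule("derogate_rights", threshold=2 / 3)}
timeline = [Step("derogate_rights", 0.38), Step("dissolve_court", 0.55)]
trace = replay(rules, timeline)
```

Because the function reads only its arguments and uses no randomness, calling it twice on the same inputs yields the same trace, which is the property the rest of the pipeline relies on.
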
### 6.2 The coalition-strength assumption

The pivotal modeling assumption: **actor coalitions are held constant across both replays.** If 38% of the legislature backed the executive's emergency claim historically, the kernel replay assumes the same 38% backs the analogous move under the kernel. We do not model persuasion, defection, or the possibility that different rules would have produced different coalitions. This assumption is conservative in both directions — it denies the kernel credit for coalition effects its incentives might create, and it denies the incumbent the same — and it is the assumption most likely to be wrong (§9.2).

### 6.3 Mapping incumbent moves to kernel moves

Each dossier decision point carries a `kernel_mapping`: the kernel-vocabulary equivalent of the historical move (e.g., "invoke Article 48 emergency decree" maps to "propose rights-derogation under kernel emergency clause, which requires a 2/3 vote-gate and a 30-day sunset"). Where no kernel equivalent exists, the mapping is `move_unavailable` — the kernel simply does not contain the affordance, and the replay records that the attack fails at step 1. Where the kernel offers a move the incumbent lacked (most commonly the fork right in secession events), the mapping notes the *additional* affordance and the harness explores it as a counter-move.

This mapping is hand-authored per dossier and reviewed; it is the second place (after `permitted_moves` coding) where human judgment enters, and it is published in full inside each dossier so it can be disputed line by line.

### 6.4 Determinism and breach states

Given a dossier and a rule table, the replay is fully deterministic: same inputs, same trace, same scores. There is no randomness and no language model in the scoring path.

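
One way to check the "same inputs, same trace, same scores" property in practice is to hash a canonical serialization of each scorecard and compare digests across runs. The sketch below is an assumption about how such a check could look; `scorecard_digest` is a hypothetical helper, not part of the harness, and the scorecard is modeled as a plain dict.

```python
import hashlib
import json


def scorecard_digest(scorecard: dict) -> str:
    """Hash a scorecard so that two runs can be compared by digest."""
    # sort_keys and fixed separators give one canonical byte string per content,
    # independent of the order in which keys were inserted.
    canonical = json.dumps(scorecard, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only.
run_a = {"event": "toy-event", "wop": 3, "trust": 2}
run_b = {"trust": 2, "wop": 3, "event": "toy-event"}  # same content, different key order
same = scorecard_digest(run_a) == scorecard_digest(run_b)
```

A digest mismatch between two runs of the same commit would then point to nondeterminism somewhere in the scoring path.
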
When the kernel replay reaches a state where the recorded coalition can neither complete its move nor be lawfully stopped — the same kind of dead end that historically preceded extra-constitutional action — the harness records a **breach state** and the kernel's scores are capped accordingly. A constitution that merely *relocates* the cliff edge does not get credit for removing it.

---

## 7. Scoring methodology

### 7.1 The four dimensions

Normative scale definitions and anchor descriptions live in [`docs/RUBRIC.md`](RUBRIC.md); this section explains the rationale.

1. **Worst-off-participant outcome (WOP).** What happened to the person or class with the least power and the most exposure? Anchors run from "death, detention, or permanent rights loss" at the bottom to "made whole, with standing to contest" at the top. For the incumbent this is read from the record (India 1975: ~110,000 detained without trial scores at the floor regardless of how elegantly the Emergency was eventually unwound). For the kernel it is computed from the terminal state: which invariants protecting that class held, and what remedies the text afforded them.
2. **Commons integrity.** Did the shared resources — treasury, institutional independence, electoral machinery, public information — survive intact? This dimension is where court-capture events do their damage even when no individual is visibly harmed.
3. **Latency to resolution.** Days from trigger to a stable terminal state. Scored on a logarithmic scale because the difference between 3 days and 30 matters more than between 300 and 330. Latency is *third*, not first: a fast resolution that sacrifices the worst-off is not a resolution, it is a sacrifice with good throughput.
4. **Trust preservation.** Did participants exit the event still believing the rules bind everyone?
   Proxied for the incumbent by recorded indicators (subsequent turnout, violence, emigration, polling where it exists, and — the strongest signal — whether the same exploit was attempted again). For the kernel, proxied structurally: did every actor's recorded core stake retain a lawful channel at the terminal state, and were all gates that fired publicly legible?

### 7.2 Lexicographic ordering

Scores are compared **lexicographically**: WOP first, and only on a WOP tie do the other dimensions break it, in order. We deliberately rejected a weighted sum. Weighted sums invite exactly the trade the empathy metric forbids — "we lost the minority but resolved it in record time, net score positive." Under lexicographic ordering that trade is unrepresentable.

The aggregate scorecard does *also* publish per-dimension means across all 30 events, because the lexicographic comparison answers "which text won this event" while the dimension means answer "where is each text weak."

### 7.3 Scoring the incumbent: transcription, not judgment

The incumbent's scores are anchored to recorded facts via the rubric's anchor tables. Two coders independently mapped each dossier's `incumbent_outcome` to rubric anchors; disagreements (9 of 120 cells, all within one anchor step) were resolved by taking the score *more favorable to the incumbent*. The benchmark's thumb, where it must rest somewhere, rests on the incumbent's side of the scale.

### 7.4 Scoring the kernel: computed, then capped

Kernel scores come out of the deterministic replay, with three caps that prevent the harness from flattering its own constitution:

- **Breach cap.** Any breach state (§6.4) caps WOP and trust at the rubric's "unresolved within the legal order" anchor.
- **Affordance-only cap.** Where the kernel "wins" purely because a move is unavailable (the attack fails at step 1), the trust score is capped one step below maximum, on the reasoning that a blocked faction with the recorded coalition strength is a standing pressure the text has contained but not dissolved.
- **Latency floor.** Kernel gate latencies (vote windows, sunset periods, review timelines) are charged in full even when the simulated outcome is clean. The kernel never resolves an event in zero days.

---

## 8. The counterfactual replay under kernel v0.1

### 8.1 What the kernel brings to each failure class

The replays exercise five kernel mechanisms, and the per-class results in the aggregate scorecard decompose along these lines:

- **Vote-gate finality** (certification events): the kernel's requirement that a contested tally route to a pre-committed dispute quorum with a hard deadline removes the ambiguity that powered 1876, 2000, and 2020 — all three of which were, at bottom, fights over *who counts the counters*.
- **Sunset-by-default emergency powers** (emergency events): every derogation under the kernel expires unless re-ratified at supermajority. Weimar's Article 48 and India's Article 352 both lacked automatic decay; the replays show the difference is not that emergencies are prevented but that *permanence requires repeated, visible, supermajority consent*.
- **Continuity defaults** (shutdown events): the kernel's rule that the prior budget continues at last-ratified levels during deadlock removes the hostage. Belgium 2010–11 is the natural experiment already in the record — caretaker continuity rules meant 541 days without a government produced no shutdown — and the kernel replay of the US events essentially imports Belgium's affordance.
- **Kernel/userland separation with semver gates** (court-capture events): changing who reviews the rules is a kernel-level (major-version) change requiring supermajority.
  Hungary 2011 and Poland 2015 proceeded at simple or constitutional-but-single-faction majorities; under the kernel the same coalitions fail the gate. Venezuela 2017 and Israel 2023 stress this harder and the replays record partial breach states — a determined supermajority-adjacent coalition still finds pressure points (§9.4).
- **The fork right** (secession events): the kernel's most radical affordance. Exit is lawful, procedural, and slow — a supermajority of the seceding unit across two votes separated by a cooling period, with negotiated division of commons. Czechoslovakia 1992 is the recorded proof-of-concept; Canada's 1998 *Secession Reference*, which followed the 1995 referendum, articulated almost exactly this rule. The US 1860 and Spain 2017 replays show the fork right's real function is not to enable exit but to **make the legitimacy question decidable**, removing the void in which both sides claimed the text supported them.

### 8.2 Where the kernel loses or draws

Honesty requires the list of events where the replay does *not* favor the kernel:

- **Gambia 2016**: the incumbent outcome (regional diplomatic resolution, incumbent leaves, no civil war) scores near the top of the rubric. The kernel replay matches but cannot beat it, and the kernel's gate latencies make it *slower*. Incumbent wins on latency, ties elsewhere.
- **South Korea 2024**: the incumbent constitution's own machinery reversed the martial-law declaration within hours and impeached the president within weeks. The kernel replay produces a comparable trace. Effective tie.
- **Venezuela 2017** and **Russia 1993**: both replays reach breach states. When a faction that controls the means of force is willing to leave the legal order, no text holds it, and the harness says so for both texts.
- **Belgium 2010–11**: the incumbent's caretaker conventions already embody the kernel's continuity default. Tie on every dimension, including latency, where both texts score at the floor — 541 days is 541 days.

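
The per-event verdicts in this list follow from the lexicographic comparison described in §7.2. A minimal sketch, assuming each dimension is scored as an integer rubric anchor where higher is better; the `Scores` and `verdict` names and the example values are illustrative and not taken from the scorecard:

```python
from typing import NamedTuple


class Scores(NamedTuple):
    # Field order is the comparison order: WOP first, then the tie-breakers.
    wop: int
    commons: int
    latency: int
    trust: int


def verdict(incumbent: Scores, kernel: Scores) -> str:
    """Compare two score tuples lexicographically and name the winner."""
    if incumbent > kernel:  # tuple comparison in Python is lexicographic
        return "incumbent"
    if kernel > incumbent:
        return "kernel"
    return "tie"


# Tie on WOP and commons, incumbent ahead on latency: the incumbent wins.
print(verdict(Scores(4, 4, 3, 4), Scores(4, 4, 2, 4)))  # prints "incumbent"

# A one-step WOP lead cannot be bought back by any lower-priority dimension.
print(verdict(Scores(1, 5, 5, 5), Scores(2, 1, 1, 1)))  # prints "kernel"
```

The second call shows why a weighted sum was rejected: the first tuple has the larger total but still loses, because the comparison never looks past the first dimension on which the two texts differ.
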
These results matter more than the kernel's wins. They show the harness can return "incumbent wins," "tie," and "nobody wins," which is the minimum bar for the scorecard meaning anything at all.

---

## 9. Limits of text-only simulation

This is the section the milestone funds us to write honestly. Each subsection names a limit, states what it breaks, and states what we did about it — which is sometimes "nothing can be done; discount accordingly."

### 9.1 Parchment barriers: text is not enforcement

Madison's phrase. A rule on paper stops nothing by itself; it stops things only when enough actors treat it as binding. The harness models affordances and gates, not the willingness to honor them. Weimar's constitution *contained* the tools to stop Hitler; the actors holding them declined to use several. Our replay shows the kernel's sunset clauses making the Enabling Act's path mechanically harder — it cannot show whether Hindenburg's circle would have honored a sunset any more than they honored the spirit of Article 48.

**Mitigation:** the breach-state mechanism (§6.4) and breach cap (§7.4) at least prevent the harness from scoring textual victories in situations where the record shows the text had already lost its grip.

**Residual:** the kernel's scores on emergency and self-coup events should be read as upper bounds on the text's contribution, not predictions.

### 9.2 The frozen-coalition assumption

§6.2's assumption — same coalitions under both texts — is certainly false in both directions. Rules shape coalitions: a faction that knows a court-capture move requires a supermajority may never form, or may form larger and angrier. Game-theoretically, we are evaluating off-equilibrium play: we replay strategies optimized against the incumbent's rules inside a different rule-set, where rational actors would have played differently. The kernel is being tested against *yesterday's exploits*, not against the exploits that would evolve against the kernel itself.

**Mitigation:** Milestone #5 (adversarial self-play) exists precisely to generate kernel-native exploits; this benchmark deliberately tests only the historical attack set and says so.

**Residual:** a good score here means "resists known attacks," never "secure."

### 9.3 Counterfactual unknowability

We do not and cannot know what would have happened. The incumbent side of every scorecard is fact; the kernel side is a deterministic consequence of our encoding choices. The two columns are therefore not epistemically symmetric, and presenting them side by side risks implying they are.

**Mitigation:** every scorecard renders the kernel column with an explicit `simulated` watermark and the dossier's confidence rating; the aggregate report never states "the kernel would have prevented X," only "under the encoded replay, the attack fails at gate Y."

**Residual:** readers will round our careful phrasing up to the strong claim anyway. We can only keep repeating the weak one.

### 9.4 Home-field advantage and Goodhart risk

The kernel's authors built the harness. Three specific contamination channels exist: (a) the kernel-mapping tables (§6.3) are authored by people who know the kernel's strengths; (b) kernel v0.1 itself postdates most of these events and its drafters knew this history — the kernel is, in part, *trained on the test set*; (c) once this benchmark gates kernel amendments in CI, future kernel changes will be optimized against these thirty events (Goodhart's law applied to constitutions).

**Mitigation for (a):** mappings are published in full inside each dossier for line-by-line dispute, and the coder-disagreement rule (§7.3) breaks toward the incumbent.

**Mitigation for (b):** none possible; we state it. Any constitution drafted in 2024+ has read this history. The honest claim is not "the kernel generalizes" but "the kernel incorporates these thirty lessons, verifiably."

**Mitigation for (c):** the dossier set is versioned and additions require the same amendment process as the kernel; self-play (Milestone #5) continuously generates held-out events.

### 9.5 What the actors knew vs. what the dossier knows

Dossiers encode incentives reconstructed with hindsight. Historical actors operated under uncertainty, misinformation, and time pressure that a clean YAML timeline erases. Kerr in 1975 did not know Fraser would win the election; the dossier does. Hindsight makes every historical move look more deliberate and every gate more decisive than it was.

**Mitigation:** `inferred: true` flags on reconstructed motives, and confidence ratings that downweight era-distant events in the restricted aggregate.

**Residual:** irreducible. This is a limit of all historical analysis, inherited in full.

### 9.6 The missing channels: money, violence, information

The harness has no model of economic coercion, organized violence, or propaganda — three channels that decided or shaped at least a third of the events (among them Russia 1993, Philippines 1972, Kenya 2007, Venezuela 2017). A constitution operating in an information environment where a faction controls broadcast media faces attacks no procedural gate addresses. The kernel's text-level legibility commitments gesture at this; the harness cannot test them.

**Mitigation:** events where these channels were decisive carry timeline entries flagged `extra_constitutional` or `channel: out_of_model`, and the scorecard annotates the kernel column "out-of-model factors decisive in recorded outcome."

**Residual:** large, and concentrated exactly in the worst events — the ones where the worst-off participants suffered most. The empathy metric is least trustworthy where it matters most. We consider the preceding sentence the single most important in this paper.

### 9.7 Scale invariance is asserted, not tested

Kernel v0.1 claims to scale from a household to a polity. Every dossier here is a *state-scale* event.
Nothing in this benchmark tests the small end, and quorum/latency parameters that work at n=300 million may be absurd at n=8.

**Mitigation:** none in this milestone; flagged for the dogfooding ledger, where the project's own n≈hundreds governance generates the small-scale data.

### 9.8 Thirty events is a small n

No statistical claim survives n=30 with six strata. The aggregate means in the scorecard are descriptive summaries of this specific test set, not estimates of any population parameter, and we publish no confidence intervals because they would imply a sampling model we do not have.

---

## 10. Threats to validity (summary table)

| Threat | Type | Severity | Addressed by |
|---|---|---|---|
| Text ≠ enforcement | Construct | High | Breach caps (§6.4, §7.4); honest framing |
| Frozen coalitions | Internal | High | Stated; offloaded to Milestone #5 |
| Counterfactual asymmetry | Construct | High | `simulated` watermark; weak-claim discipline |
| Kernel trained on test set | External | High | Stated; held-out events via self-play |
| Goodhart on CI gating | External | Medium | Versioned dossier set; amendment-gated additions |
| Hindsight in incentive coding | Internal | Medium | `inferred` flags; confidence ratings |
| Out-of-model channels | Construct | High | `out_of_model` flags; scorecard annotations |
| Selection/salience bias | External | Medium | Stated criteria (§4.2); valence balance (§4.3) |
| Coder bias in mappings | Internal | Medium | Published mappings; tie-break toward incumbent |
| Small n | Statistical | Accepted | No inferential claims made |

## 11. How to read (and how not to read) the scorecard

**Valid readings:**

- "Under the encoded replay, the 2011 Hungarian court-capture sequence fails the kernel's supermajority gate at decision point 3." (A statement about texts and gates.)
- "The kernel's worst per-class performance is the self-coup class, where 3 of 5 replays reach breach states." (A statement about where the kernel's text runs out.)
- "The incumbent's recorded WOP scores are at the rubric floor in 7 of 30 events." (A statement of historical fact via the rubric's anchors.)

**Invalid readings:**

- "Kernel v0.1 would have prevented the Indian Emergency." (Counterfactual claim; §9.3.)
- "Kernel v0.1 outperforms the US Constitution." (Population claim from n=30 curated events; §9.8, §9.4b.)
- "A high kernel score means the kernel is secure." (Tests known attacks only; §9.2.)

The scorecard's value is **diagnostic and regression-guarding**: it tells kernel maintainers which historical attack patterns the current text demonstrably contains, which it demonstrably does not, and it freezes that knowledge as a CI gate so no future amendment silently reopens a closed attack.

## 12. Reproducibility

Everything in the pipeline is deterministic and versioned:

```bash
cd benchmark
pip install -e ".[dev]"
python -m incumbent_benchmark.cli validate dossiers/            # schema-check all 30
python -m incumbent_benchmark.cli run dossiers/ -o results/     # full replay
python -m incumbent_benchmark.cli aggregate results/ -o results/AGGREGATE.md
pytest                                                          # harness self-tests
```

There is no network access, no model inference, and no randomness in the scoring path. Two machines running the same commit produce byte-identical scorecards. Disputes about results are therefore always disputes about *encodings* (a dossier's `permitted_moves`, a `kernel_mapping`, a rubric anchor), and every encoding is a reviewable line in a versioned file. That is the whole point.

## 13. Governance of the benchmark itself

The benchmark is governed by the constitution it tests. Changes to the rubric, the dossier set, or the kernel rule table are amendments: pull requests, vote-gated, semver'd. Adding a dossier is a minor version; changing a rubric anchor or the lexicographic ordering is major and requires supermajority, because changing how you score is changing what you optimize.
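
The versioning rule in this section can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the change-kind labels, the `AMENDMENT_RULES` table, and `next_version` are hypothetical names, the two-part `major.minor` form is assumed from the "v1.0" labels used in this paper, and treating a minor change as needing only an ordinary vote is an assumption (the text states the supermajority requirement only for major changes).

```python
from enum import Enum
from typing import NamedTuple


class Bump(Enum):
    MINOR = "minor"
    MAJOR = "major"


class Rule(NamedTuple):
    bump: Bump
    needs_supermajority: bool


# Hypothetical change-kind labels. Only these three cases are classified in
# the text; any other kind is rejected rather than guessed.
AMENDMENT_RULES = {
    # Assumption: an ordinary vote suffices for a minor change.
    "add_dossier": Rule(Bump.MINOR, False),
    "change_rubric_anchor": Rule(Bump.MAJOR, True),
    "change_lexicographic_ordering": Rule(Bump.MAJOR, True),
}


def next_version(current: str, kind: str) -> str:
    """Return the 'major.minor' version that follows `current` for this change."""
    if kind not in AMENDMENT_RULES:
        raise ValueError(f"unclassified amendment kind: {kind!r}")
    major, minor = (int(part) for part in current.split("."))
    if AMENDMENT_RULES[kind].bump is Bump.MAJOR:
        # A major change resets the minor counter.
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"


print(next_version("1.0", "add_dossier"))           # 1.1
print(next_version("1.0", "change_rubric_anchor"))  # 2.0
```

Rejecting unclassified kinds, rather than defaulting them to minor, keeps an unanticipated change from slipping through at the lower threshold.
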
The dossier set at v1.0 (these thirty events) is the frozen reference set; results are always reported against a named dossier-set version.

## 14. Future work

1. **Self-play integration (Milestone #5):** kernel-native exploits as held-out events, closing the §9.2 / §9.4b gap.
2. **Multi-baseline replays:** run each event under *other* incumbent constitutions (the German Basic Law's eternity clause against the court-capture class, for instance), turning the two-column scorecard into a leaderboard.
3. **Behavioral layer:** replace frozen coalitions with agent populations whose compliance is a variable, not an axiom: the honest version of the simulation this milestone could not responsibly claim.
4. **Quiet-decay dossiers:** develop a coding method for slow erosion events that never produce a named crisis (§4.4), the failure mode this test set underweights.
5. **Small-scale validation:** instrument the project's own dogfooded governance as the n=8-to-n=10,000 test bed (§9.7).

## 15. References and source notes

Per-event primary and secondary sources are cited inside each dossier's `sources` block; the index at [`dossiers/INDEX.md`](../dossiers/INDEX.md) lists all thirty. Methodological influences, for the record:

- Madison, *Federalist No. 48* (parchment barriers).
- Levitsky & Ziblatt, *How Democracies Die* (2018): forbearance as the unmodeled variable; directly motivates §9.1.
- Ginsburg & Huq, *How to Save a Constitutional Democracy* (2018): the court-capture and erosion taxonomy underlying §4.1.
- Elkins, Ginsburg & Melton, *The Endurance of National Constitutions* (2009) and the Comparative Constitutions Project: the empirical base for constitutional lifespan and amendment-cadence claims.
- *Reference re Secession of Quebec*, [1998] 2 S.C.R. 217: the articulated fork-right precedent discussed in §8.1.
- Sen, *The Idea of Justice* (2009): comparative rather than transcendental evaluation; the philosophical warrant for benchmarking governance pairwise instead of against an ideal.
- The ML benchmarking literature's cautionary tale (test-set contamination and Goodhart dynamics), applied here in §9.4.

---

*This paper ships with dossier-set v1.0, rubric v1.0, and harness v1.0, and is amended through the same pipeline as everything else in this repository. If you can show a dossier's encoding is wrong, open a pull request; that is not a failure of the benchmark, it is the benchmark working.*