# Authoring a Stress-Event Dossier

The dossier corpus is the ground truth of the Incumbent Benchmark. Every score the harness produces is only as honest as the dossier underneath it. This guide is the standard a new dossier must meet before it can land: dossier #31 and beyond go through the same gate the first thirty did, plus review and (for additions to the scored corpus) a ratification vote under the amendment pipeline, because **changing the benchmark changes what "better governance" means for every fork downstream**.

The machine-readable contract for a dossier lives in `src/incumbent_benchmark/schema.py` and is enforced at load time by the harness; the corpus-level invariants (completeness, shared shape, anti-stub floors, country/era diversity) are enforced in CI by `tests/test_dossiers.py`. This document covers everything the schema *cannot* enforce: research quality, neutrality, and the judgment calls involved in turning a historical crisis into a structured simulation config.

---

## 1. What qualifies as a stress event

A dossier documents a moment when a real constitution was load-tested. Not every political crisis qualifies. The event must satisfy all four criteria:

1. **A constitutional mechanism was the load-bearing element.** The crisis ran *through* the document: a certification procedure, an emergency clause, a confidence convention, a court's composition rules, a secession question the text did or did not answer. Crises that were purely military, economic, or extra-legal from the first move (a tank column with no constitutional fig leaf) belong in a different corpus: the benchmark measures texts, and a text that was never consulted cannot be scored.

2. **The incumbent outcome is documented and settled enough to score.** We need to know, from the historical record, how the event actually resolved: who held power afterward, what happened to the worst-off participants, how long resolution took, what institutional trust survived.
Events still in motion may be drafted as dossiers but are tagged unscored until the dust settles; provisional scores against moving history are how benchmarks lose credibility.

3. **The causal chain from text to outcome is reconstructible.** For each pivotal moment you must be able to say which constitutional provision permitted, required, or failed to prohibit the move that was made. If the honest answer is "the text was irrelevant; pure power decided," say so in the dossier. That *is* a finding about the incumbent (a text that provides no traction under stress scores poorly on its own), but the reconstruction must be attempted, not skipped.

4. **It adds coverage.** The corpus deliberately spans six failure families (contested certification/transfer, emergency powers, fiscal and formation deadlock, court capture, secession/dissolution, executive self-coup) across many countries and two-plus centuries. A 31st US election dispute is worth less than the first dossier from a region or mechanism we have not stressed. CI enforces a floor on country and era diversity; reviewers enforce the spirit of it.

## 2. Research standard

A dossier is a research artifact first and a config second. The bar:

- **Three independent sources minimum** for the factual skeleton (actors, dates, moves, outcome), at least one of which is either a primary document (the constitutional text in force at the time, court opinions, official proclamations, parliamentary records) or a peer-reviewed / book-length scholarly treatment. News coverage alone is not sufficient for events older than a year.
- **Quote the operative constitutional text** in the dossier, in the language of the relevant provision (translated where necessary, with the translation sourced). The harness replays moves against *rules*; the rules must be in the record, not paraphrased from memory.
- **Date everything.** Latency-to-resolution is a scored dimension.
Every entry in the timeline carries a date; where the historical record is imprecise, record the uncertainty rather than a false precision.
- **Handle contested facts explicitly.** Many of these events are still politically live (was 2019 Bolivia a coup or a fraud-annulled election? both literatures exist). The dossier does not adjudicate. It records the competing factual claims, attributes each to its holders, and (this is the important part) scores only on facts that are common ground across serious accounts, or scores both branches and reports the spread. A dossier that quietly picks a side in a live historiographical fight will be bounced in review.
- **Name the worst-off participants concretely.** The empathy metric is scored first, so the dossier must identify, with sources, who actually bore the worst outcomes: detained opposition figures, furloughed workers who missed rent, residents of regions under emergency rule, ethnic communities targeted in post-election violence. "Citizens generally" is not an acceptable answer; stress events have specific casualties.

## 3. From history to simulation config

The conversion from narrative to structured config is where most of the intellectual honesty lives. Principles:

### 3.1 Actors are modeled by incentive, not by name

Each actor entry captures a *role under incentive*: what the actor wanted, what they believed, what resources and constitutional levers they held, and what they feared. The named historical individual is recorded for citation, but the simulation runs on the incentive structure, because the counterfactual question the benchmark asks is "what does the *rule system* do with these incentives," not "what would this specific person have done." If your actor model only produces the historical outcome when you assume the historical personality, the constitutional finding is weak, and the dossier should say so.
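
As a minimal sketch of the idea in 3.1, an actor record can keep the incentive structure and the citation name in separate places, so that a replay never sees the name. Everything here (`ActorRole`, its field names, `simulation_view`) is hypothetical and for illustration only; the canonical fields are whatever `src/incumbent_benchmark/schema.py` defines, and the example values are placeholders rather than content from any real dossier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorRole:
    """Hypothetical actor record: a role under incentive, not a personality."""

    role: str                       # institutional description of the actor
    wants: tuple[str, ...]          # what the actor was trying to achieve
    believes: tuple[str, ...]       # beliefs that shaped their choices
    levers: tuple[str, ...]         # resources and constitutional levers held
    fears: tuple[str, ...]          # outcomes the actor was trying to avoid
    historical_person: str = ""     # recorded for citation only

    def simulation_view(self) -> dict:
        """Return the incentive structure only; the name is deliberately left out."""
        return {
            "role": self.role,
            "wants": self.wants,
            "believes": self.believes,
            "levers": self.levers,
            "fears": self.fears,
        }


# Placeholder values, not taken from any real dossier.
actor = ActorRole(
    role="head of government",
    wants=("remain in office",),
    believes=("the upper chamber will not back down",),
    levers=("advise dissolution",),
    fears=("dismissal before the budget passes",),
    historical_person="(named individual, with citation)",
)

view = actor.simulation_view()
print(sorted(view))
```

If a model built this way reproduces the historical outcome only once `historical_person` is fed back in, that is the weak-finding case described above, and the dossier notes should say so.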

### 3.2 The permitted-move set is the heart of the dossier

For each actor, enumerate the moves the incumbent constitution permitted, required, or left ambiguous at each decision point, including the moves *not* taken. The drama of most of these events lives in the ambiguous column (could the Vice President reject electors? could the Governor-General dismiss a PM with supply blocked in the Senate?). Ambiguity is recorded as ambiguity, with the competing readings sourced. Under replay, the kernel-side comparison frequently turns on whether kernel v0.1 closes an ambiguity the incumbent left open; that comparison is only honest if the ambiguity was honestly recorded.

### 3.3 Record the actual outcome before running any replay

The incumbent scorecard (worst-off outcome, commons integrity, latency, trust preservation; see `docs/RUBRIC.md`) is scored from the historical record *before* the dossier is ever run under kernel v0.1, and committed in the same change. This ordering is deliberate: it prevents the most tempting form of benchmark corruption, which is tuning the historical baseline after seeing how the kernel performs against it.

### 3.4 Counterfactual discipline

The replay substitutes one variable, the rule system, and holds the actors' incentives fixed. Everything the kernel "wins" must be traceable to a specific kernel mechanism (a vote gate, a forced-default on deadlock, a supermajority requirement, the fork right) acting on the recorded incentives. If a kernel advantage depends on assuming better-behaved actors, it is an artifact, not a result, and the dossier's notes must flag it. The methodology paper (`docs/METHODOLOGY.md`) discusses the limits of this at length; every dossier inherits those limits and should not claim past them.

## 4. Neutrality rules

These events involve living political movements. The corpus survives only if all sides of every dispute can read their dossier and recognize the facts, even where they dispute the framing.

- Use institutional descriptions, not partisan labels, for actors.
- Attribute every characterization ("widely viewed as," "the court held," "opposition parties alleged") to a source.
- Apply the same scrutiny to events that flatter different ideological priors; the corpus pairs them deliberately (court capture dossiers cover Hungary, Poland, Venezuela, Israel, *and* the 1937 US court-packing attempt) and additions should preserve that balance.
- No dossier scores a *country* or a *party*. It scores a rule system's performance in one episode. The dossier prose must maintain that frame.

## 5. Mechanics of landing a dossier

1. **Draft** the YAML in `dossiers/`, kebab-case filename: `country-year-shortname.yaml`. The schema in `src/incumbent_benchmark/schema.py` is canonical; do not invent fields. If the schema can't express something you need, that is a schema PR first, reviewed separately.

2. **Validate locally**: `pip install -e . && pytest tests/` from `benchmark/`. The corpus tests will fail on stubs, structural drift, and parse errors before a human ever reviews you.

3. **Replay locally** via the CLI (see `README.md`) and inspect the side-by-side scorecard. You are checking for absurdities: a replay that resolves a months-long crisis in one round usually means a permitted-move set is underspecified, not that the kernel is brilliant.

4. **Update** `dossiers/INDEX.md` with the new entry under its failure family.

5. **Open the PR** with sources listed in the dossier itself. Review requires at least one reviewer checking sources, not just structure.

6. **Ratification**: additions to the *scored* corpus change the benchmark's meaning and go through the standard amendment vote gate (simple majority; corpus changes are a minor version). Dossiers can merge as `unscored` drafts without a vote.

## 6. Common pitfalls (each of these was caught in the first thirty)

- **Hindsight teleology**: writing actor incentives so the historical outcome is the only possible one.
The 1876 Hayes–Tilden dossier went through three drafts before the Electoral Commission's emergence stopped looking inevitable.
- **Protagonist smuggling**: modeling one side's moves in fine grain and the other's as a monolith. Symmetric resolution of actor detail is a review checklist item.
- **Scoring the regime, not the episode**: Hungary appears twice (2011 court capture, 2020 enabling act) precisely because they are different mechanisms; a dossier must not import the other episode's outcome into its own baseline.
- **Latency laundering**: choosing start/end dates that flatter one system. The dossier must justify its bracketing dates in the timeline notes, and the same brackets apply to both the incumbent baseline and the kernel replay.
- **Quiet maximalism on ambiguity**: recording a contested constitutional reading as settled. If serious lawyers disagreed *at the time*, the move is ambiguous, whatever later courts said.
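
A few of these pitfalls leave mechanical traces that an author could check before asking for review. The sketch below is one possible self-check, not part of the harness: `Move`, `DraftDossier`, `lint`, and the `max_detail_ratio` threshold are all hypothetical names and values invented for illustration, and the draft data is placeholder content. It can only flag the mechanical shadow of a pitfall; whether a reading was genuinely contested at the time remains a research question.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Move:
    actor: str
    description: str
    status: str                                        # "permitted", "required", or "ambiguous"
    readings: list[str] = field(default_factory=list)  # competing readings, each with its source


@dataclass
class DraftDossier:
    start: date
    end: date
    bracket_justification: str
    moves: list[Move]


def lint(draft: DraftDossier, max_detail_ratio: float = 3.0) -> list[str]:
    """Return warnings; an empty list means nothing mechanical was found."""
    problems = []

    # Latency laundering: bracketing dates must be ordered and justified.
    if draft.end < draft.start or not draft.bracket_justification.strip():
        problems.append("latency: bracketing dates are unordered or unjustified")

    # Quiet maximalism: an ambiguous move needs at least two sourced readings.
    for move in draft.moves:
        if move.status == "ambiguous" and len(move.readings) < 2:
            problems.append(f"ambiguity: '{move.description}' records fewer than two readings")

    # Protagonist smuggling: compare how finely each actor's moves are modeled.
    counts = Counter(move.actor for move in draft.moves)
    if counts and max(counts.values()) > max_detail_ratio * min(counts.values()):
        problems.append("symmetry: one actor is modeled in far finer grain than another")

    return problems


# Placeholder draft, not taken from any real dossier.
draft = DraftDossier(
    start=date(2000, 1, 1),
    end=date(2000, 3, 1),
    bracket_justification="",
    moves=[
        Move("executive", "invoke emergency clause", "ambiguous", ["reading A (source)"]),
        Move("executive", "dissolve legislature", "permitted"),
        Move("executive", "rule by decree", "ambiguous", ["reading A (source)", "reading B (source)"]),
        Move("executive", "postpone election", "permitted"),
        Move("legislature", "vote no confidence", "permitted"),
    ],
)

for problem in lint(draft):
    print(problem)
```

The ratio threshold is arbitrary; a count of moves per actor is a crude proxy for symmetric detail, so a clean result here does not replace the review checklist.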