# Authoring a Stress-Event Dossier

The dossier corpus is the ground truth of the Incumbent Benchmark. Every
score the harness produces is only as honest as the dossier underneath it.
This guide is the standard a new dossier must meet before it can land —
dossier #31 and beyond go through the same gate the first thirty did, plus
review and (for additions to the scored corpus) a ratification vote under
the amendment pipeline, because **changing the benchmark changes what
"better governance" means for every fork downstream**.

The machine-readable contract for a dossier lives in
`src/incumbent_benchmark/schema.py` and is enforced at load time by the
harness; the corpus-level invariants (completeness, shared shape, anti-stub
floors, country/era diversity) are enforced in CI by
`tests/test_dossiers.py`. This document covers everything the schema
*cannot* enforce: research quality, neutrality, and the judgment calls
involved in turning a historical crisis into a structured simulation
config.

---

## 1. What qualifies as a stress event

A dossier documents a moment when a real constitution was load-tested. Not
every political crisis qualifies. The event must satisfy all four criteria:

1. **A constitutional mechanism was the load-bearing element.** The crisis
   ran *through* the document: a certification procedure, an emergency
   clause, a confidence convention, a court's composition rules, a
   secession question the text did or did not answer. Crises that were
   purely military, economic, or extra-legal from the first move (a tank
   column with no constitutional fig leaf) belong in a different corpus —
   the benchmark measures texts, and a text that was never consulted
   cannot be scored.

2. **The incumbent outcome is documented and settled enough to score.**
   We need to know, from the historical record, how the event actually
   resolved: who held power afterward, what happened to the worst-off
   participants, how long resolution took, what institutional trust
   survived. Events still in motion may be drafted as dossiers but are
   tagged unscored until the dust settles; provisional scores against
   moving history are how benchmarks lose credibility.

3. **The causal chain from text to outcome is reconstructible.** For each
   pivotal moment you must be able to say which constitutional provision
   permitted, required, or failed to prohibit the move that was made. If
   the honest answer is "the text was irrelevant; pure power decided," say
   so in the dossier — that *is* a finding about the incumbent (a text that
   provides no traction under stress scores poorly on its own) — but the
   reconstruction must be attempted, not skipped.

4. **It adds coverage.** The corpus deliberately spans six failure
   families (contested certification/transfer, emergency powers, fiscal
   and formation deadlock, court capture, secession/dissolution, executive
   self-coup) across many countries and two-plus centuries. A dossier #31
   that is one more US election dispute is worth less than the first
   dossier from a region or mechanism we have not stressed. CI enforces a
   floor on country and era diversity; reviewers enforce the spirit of it.

## 2. Research standard

A dossier is a research artifact first and a config second. The bar:

- **Three independent sources minimum** for the factual skeleton (actors,
  dates, moves, outcome), at least one of which is either a primary
  document (the constitutional text in force at the time, court opinions,
  official proclamations, parliamentary records) or a peer-reviewed /
  book-length scholarly treatment. News coverage alone is not sufficient
  for events older than a year.
- **Quote the operative constitutional text** in the dossier, in the
  language of the relevant provision (translated where necessary, with the
  translation sourced). The harness replays moves against *rules*; the
  rules must be in the record, not paraphrased from memory.
- **Date everything.** Latency-to-resolution is a scored dimension. Every
  entry in the timeline carries a date; where the historical record is
  imprecise, record the uncertainty rather than a false precision.
- **Handle contested facts explicitly.** Many of these events are still
  politically live (was Bolivia 2019 a coup or a fraud-annulled election?
  Both literatures exist.) The dossier does not adjudicate. It records the
  competing factual claims, attributes each to its holders, and — this is
  the important part — scores only on facts that are common ground across
  serious accounts, or scores both branches and reports the spread. A
  dossier that quietly picks a side in a live historiographical fight will
  be bounced in review.
- **Name the worst-off participants concretely.** The empathy metric is
  scored first, so the dossier must identify, with sources, who actually
  bore the worst outcomes: detained opposition figures, furloughed workers
  who missed rent, residents of regions under emergency rule, ethnic
  communities targeted in post-election violence. "Citizens generally" is
  not an acceptable answer; stress events have specific casualties.
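
The source floor in the first bullet can be checked mechanically before review, even though source *quality* cannot. The following is a minimal sketch, not the project's schema: `Source`, `SourceKind`, the `publisher` field (used here as a crude stand-in for independence), and `source_problems` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SourceKind(Enum):
    # Hypothetical categories; the real schema may classify sources differently.
    PRIMARY = "primary"      # constitutional text, court opinion, proclamation, parliamentary record
    SCHOLARLY = "scholarly"  # peer-reviewed or book-length treatment
    NEWS = "news"


@dataclass(frozen=True)
class Source:
    citation: str
    kind: SourceKind
    publisher: str  # assumption: distinct publishers approximate "independent"


def source_problems(sources: list[Source]) -> list[str]:
    """Return human-readable problems; an empty list means the floor is met."""
    problems = []
    independent = {s.publisher for s in sources}
    if len(independent) < 3:
        problems.append(f"need 3 independent sources, found {len(independent)}")
    if not any(s.kind in (SourceKind.PRIMARY, SourceKind.SCHOLARLY) for s in sources):
        problems.append("need at least one primary or scholarly source")
    return problems
```

A check like this only catches a missing floor. Whether three sources are genuinely independent, or merely three outlets repeating one wire report, remains a reviewer's judgment.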

## 3. From history to simulation config

The conversion from narrative to structured config is where most of the
intellectual honesty lives. Principles:

### 3.1 Actors are modeled by incentive, not by name

Each actor entry captures a *role under incentive*: what the actor wanted,
what they believed, what resources and constitutional levers they held,
and what they feared. The named historical individual is recorded for
citation, but the simulation runs on the incentive structure — because the
counterfactual question the benchmark asks is "what does the *rule system*
do with these incentives," not "what would this specific person have done."
If your actor model only produces the historical outcome when you assume
the historical personality, the constitutional finding is weak, and the
dossier should say so.

### 3.2 The permitted-move set is the heart of the dossier

For each actor, enumerate the moves the incumbent constitution permitted,
required, or left ambiguous at each decision point — including the moves
*not* taken. The drama of most of these events lives in the ambiguous
column (could the Vice President reject electors? could the
Governor-General dismiss a PM with supply blocked in the Senate?). Ambiguity is
recorded as ambiguity, with the competing readings sourced. Under replay,
the kernel-side comparison frequently turns on whether kernel v0.1 closes
an ambiguity the incumbent left open — that comparison is only honest if
the ambiguity was honestly recorded.
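
One way to picture the three columns is as a status on each move, with ambiguous moves required to carry their competing readings. This is an illustrative sketch only: `MoveStatus`, `Move`, its fields, and `unsourced_ambiguities` are hypothetical and are not taken from `src/incumbent_benchmark/schema.py`, which remains canonical.

```python
from dataclasses import dataclass, field
from enum import Enum


class MoveStatus(Enum):
    PERMITTED = "permitted"
    REQUIRED = "required"
    AMBIGUOUS = "ambiguous"


@dataclass
class Move:
    actor: str
    description: str
    status: MoveStatus
    taken: bool  # moves *not* taken are recorded too
    # Hypothetical field: one sourced entry per competing reading.
    readings: list[str] = field(default_factory=list)


def unsourced_ambiguities(moves: list[Move]) -> list[Move]:
    """Ambiguous moves that record fewer than two competing readings."""
    return [
        m for m in moves
        if m.status is MoveStatus.AMBIGUOUS and len(m.readings) < 2
    ]
```

An ambiguous move with a single reading is, in effect, a settled move under another name, which is the failure a check of this kind is meant to surface.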

### 3.3 Record the actual outcome before running any replay

The incumbent scorecard (worst-off outcome, commons integrity, latency,
trust preservation; see `docs/RUBRIC.md`) is scored from the historical
record *before* the dossier is ever run under kernel v0.1, and is committed
in the same change as the dossier. This ordering is deliberate: it prevents the most
tempting form of benchmark corruption, which is tuning the historical
baseline after seeing how the kernel performs against it.

### 3.4 Counterfactual discipline

The replay substitutes one variable — the rule system — and holds the
actors' incentives fixed. Everything the kernel "wins" must be traceable
to a specific kernel mechanism (a vote gate, a forced-default on deadlock,
a supermajority requirement, the fork right) acting on the recorded
incentives. If a kernel advantage depends on assuming better-behaved
actors, it is an artifact, not a result, and the dossier's notes must
flag it. The methodology paper (`docs/METHODOLOGY.md`) discusses the
limits of this at length; every dossier inherits those limits and should
not claim past them.

## 4. Neutrality rules

These events involve living political movements. The corpus survives only
if all sides of every dispute can read their dossier and recognize the
facts, even where they dispute the framing.

- Use institutional descriptions, not partisan labels, for actors.
- Attribute every characterization ("widely viewed as," "the court held,"
  "opposition parties alleged") to a source.
- Apply the same scrutiny to events that flatter different ideological
  priors — the corpus pairs them deliberately (court capture dossiers
  cover Hungary, Poland, Venezuela, Israel, *and* the 1937 US court-packing
  attempt) and additions should preserve that balance.
- No dossier scores a *country* or a *party*. It scores a rule system's
  performance in one episode. The dossier prose must maintain that frame.

## 5. Mechanics of landing a dossier

1. **Draft** the YAML in `dossiers/`, kebab-case filename:
   `country-year-shortname.yaml`. The schema in
   `src/incumbent_benchmark/schema.py` is canonical; do not invent fields —
   if the schema can't express something you need, that is a schema PR
   first, reviewed separately.
2. **Validate locally**: `pip install -e . && pytest tests/` from
   `benchmark/`. The corpus tests will fail on stubs, structural drift,
   and parse errors before a human ever reviews your PR.
3. **Replay locally** via the CLI (see `README.md`) and inspect the
   side-by-side scorecard. You are checking for absurdities — a replay
   that resolves a months-long crisis in one round usually means a
   permitted-move set is underspecified, not that the kernel is brilliant.
4. **Update** `dossiers/INDEX.md` with the new entry under its failure
   family.
5. **Open the PR** with sources listed in the dossier itself. Review
   requires at least one reviewer checking sources, not just structure.
6. **Ratification**: additions to the *scored* corpus change the
   benchmark's meaning and go through the standard amendment vote gate
   (simple majority; corpus changes are a minor version). Dossiers can
   merge as `unscored` drafts without a vote.
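
The filename convention in step 1 is easy to get subtly wrong, so a quick local check can help. The sketch below is one possible reading of `country-year-shortname.yaml`. The regex, the four-digit year, and the name `is_valid_dossier_name` are assumptions, and the example filenames are invented; the corpus tests may apply a different rule.

```python
import re

# Assumed shape: lowercase kebab-case country, four-digit year,
# lowercase kebab-case shortname, `.yaml` extension.
DOSSIER_NAME = re.compile(
    r"[a-z]+(?:-[a-z]+)*"          # country, possibly multi-word
    r"-\d{4}"                      # year
    r"-[a-z0-9]+(?:-[a-z0-9]+)*"   # shortname
    r"\.yaml"
)


def is_valid_dossier_name(filename: str) -> bool:
    """True if the bare filename fits the assumed convention."""
    return DOSSIER_NAME.fullmatch(filename) is not None


# Invented examples, not real corpus entries.
print(is_valid_dossier_name("exampleland-1999-budget-standoff.yaml"))
print(is_valid_dossier_name("Exampleland_1999.yaml"))
```

Under these assumptions the first call prints `True` and the second `False`.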

## 6. Common pitfalls (each of these was caught in the first thirty)

- **Hindsight teleology**: writing actor incentives so the historical
  outcome is the only possible one. The 1876 Hayes–Tilden dossier went
  through three drafts before the Electoral Commission's emergence stopped
  looking inevitable.
- **Protagonist smuggling**: modeling one side's moves in fine grain and
  the other's as a monolith. Symmetric resolution of actor detail is a
  review checklist item.
- **Scoring the regime, not the episode**: Hungary appears twice (2011
  court capture, 2020 enabling act) precisely because they are different
  mechanisms; a dossier must not import the other episode's outcome into
  its own baseline.
- **Latency laundering**: choosing start/end dates that flatter one
  system. The dossier must justify its bracketing dates in the timeline
  notes, and the same brackets apply to both the incumbent baseline and
  the kernel replay.
- **Quiet maximalism on ambiguity**: recording a contested constitutional
  reading as settled. If serious lawyers disagreed *at the time*, the
  move is ambiguous, whatever later courts said.
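
The latency-laundering rule can be made concrete by treating the bracket as a single object that carries its own justification and is handed unchanged to both scoring paths. A minimal sketch follows, assuming latency is measured in calendar days; `Bracket`, its fields, and the placeholder dates are hypothetical and do not describe the harness's actual representation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Bracket:
    start: date
    end: date
    justification: str  # why these dates, as argued in the timeline notes

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("bracket ends before it starts")
        if not self.justification.strip():
            raise ValueError("bracketing dates must be justified")

    def latency_days(self) -> int:
        return (self.end - self.start).days


# Placeholder dates, not drawn from any real dossier.
bracket = Bracket(date(2001, 1, 1), date(2001, 3, 2), "first filing to final ruling")
print(bracket.latency_days())  # 60
```

Because the object is frozen, the incumbent baseline and the kernel replay cannot quietly end up measured against different windows.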