# Self-Play Framework Architecture

This document is the engineering reference for the milestone-5 adversarial self-play framework: what each module is responsible for, the invariants the system maintains, and the public API that tests, tournaments, and the exploit-to-test pipeline are written against.

## Design goals

1. **Optimism in the defaults, paranoia in the tests.** Honest agents follow simple cooperative policies. Red-team agents are given explicit capture objectives and are free to do anything *legal*. The framework's single non-negotiable invariant is that no illegal state transition is ever applied: capture must happen *within* the rules, or it doesn't count.
2. **Every run is deterministic and replayable.** A tournament is fully specified by (kernel text, config, seed). An exploit is fully specified by its recorded action transcript. If a transcript cannot be replayed step-for-step, it is not an exploit record; it is an anecdote.
3. **Exploits are one-way doors.** Once an exploit is recorded, it becomes a permanent regression test. A kernel amendment may close it; nothing may delete it.

## Module map

```
src/fable_selfplay/
├── kernel.py           # Load and query kernel YAML (versioned constitution text)
├── state.py            # WorldState: treasury, balances, proposals, emergency flags
├── actions.py          # The closed action vocabulary (dataclasses)
├── legality.py         # check_legality(state, action, kernel): the gate
├── events.py           # Append-only event log entries
├── environment.py      # Turn-based environment; applies only legal actions
├── agents.py           # Honest and red-team agent policies + capture objectives
├── detectors.py        # Online exploit detectors over the event stream
├── metrics.py          # Scoring, worst-off-first ("empathy metric")
├── tournament.py       # Orchestration: episodes, seeds, role rosters, reports
├── replay.py           # Deterministic transcript replay against any kernel
├── exploit_to_test.py  # Exploit record -> generated regression test
└── cli.py              # Command-line entry points
```

## The turn loop

Each round, every citizen (in seeded-shuffled order) submits one action from the closed vocabulary in `actions.py`. The environment routes the action through `legality.check_legality` **before** any state mutation:

- If legal: the action is applied, and one or more `Event` records are appended to the immutable event log.
- If illegal: the state is untouched, and a `rejected` event is logged with the reason and the kernel article that blocked it. Illegal attempts are data (detectors use rejection patterns to spot probing behavior), but they never affect the world.

Proposals carry voting windows measured in rounds. Votes are tallied when the window closes; quorum and threshold rules come from the kernel parameters, never from constants in code. This is what makes a kernel patch testable: the same transcript replayed under different kernel text produces different legality outcomes.

## Public API (stable for this milestone)

These are the surfaces that `tests/`, `tournament.py`, and external scripts rely on. Changes here are breaking changes.

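The gate-before-mutation rule from the turn loop can be sketched in isolation. This is a toy model under stated assumptions, not the framework's code: `ToySpend`, `ToyState`, `toy_check`, `toy_step`, the `spend_cap` parameter, and the article id `"A3"` are all hypothetical names invented for illustration, and the toy applies a spend directly rather than going through a proposal and vote. It shows the two properties the text relies on: a rejected action leaves state untouched but is still logged, and the same action gets a different legality outcome when the kernel parameters change.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToySpend:
    """Hypothetical stand-in for an action; not the real ProposeSpend."""
    actor: str
    amount: float


@dataclass
class ToyState:
    """Hypothetical stand-in for WorldState plus its event log."""
    treasury: float
    events: list = field(default_factory=list)


def toy_check(state, action, params):
    """Pure check: reads state and params, mutates nothing."""
    cap = params["spend_cap"]  # hypothetical parameter name
    if action.amount > cap:
        return False, f"amount {action.amount} exceeds spend_cap {cap}", "A3"
    if action.amount > state.treasury:
        return False, "insufficient treasury", "A3"
    return True, None, None


def toy_step(state, action, params):
    """Run the gate first; mutate state only if the action is legal."""
    legal, reason, article = toy_check(state, action, params)
    if legal:
        state.treasury -= action.amount
        state.events.append(("applied", action))
    else:
        # The attempt is recorded, but the treasury is not touched.
        state.events.append(("rejected", action, reason, article))
    return legal


action = ToySpend(actor="c0", amount=300.0)

# Same action, two different parameter sets (standing in for two kernels).
strict = ToyState(treasury=1000.0)
loose = ToyState(treasury=1000.0)
strict_legal = toy_step(strict, action, {"spend_cap": 100.0})
loose_legal = toy_step(loose, action, {"spend_cap": 500.0})
```

Because the rule lives in `params` rather than in a constant inside `toy_check`, swapping the parameter set is enough to change the outcome, which is the property that makes a kernel patch testable by replay.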
### `kernel`

```python
load_kernel(path: str | Path) -> Kernel
Kernel.version: str              # e.g. "0.1.0", "0.2.0"
Kernel.params: dict[str, Any]    # quorum, thresholds, windows, caps
Kernel.articles                  # parsed article structure
Kernel.param(name, default=None)
```

### `actions`

The closed vocabulary. All actions name their `actor` (a citizen id):

- `ProposeSpend(actor, amount, recipient, memo="")`
- `ProposeAmendment(actor, changes)`: `changes` is a kernel-param patch
- `Vote(actor, proposal_id, support)`
- `DeclareEmergency(actor, reason)` / `EndEmergency(actor)`
- `Exit(actor)`: invoke the right to fork/leave with pro-rata share
- `Pass(actor)`

### `legality`

```python
check_legality(state, action, kernel) -> LegalityResult
LegalityResult.legal: bool
LegalityResult.reason: str | None     # human-readable, cites the rule
LegalityResult.article: str | None    # kernel article id that decided it
```

`check_legality` is pure: it never mutates state and never consults global configuration. This purity is what `replay.py` depends on.

### `environment`

```python
Environment(kernel, num_citizens=7, initial_treasury=1000.0, seed=0)
env.citizens()     # ["c0", ..., "cN-1"]
env.state          # WorldState (treasury, balances, proposals, ...)
env.turn           # current round counter
env.step(action)   # -> StepResult(legal, reason, events)
env.events         # append-only event log for the whole run
```

`step` applies exactly one action through the legality gate. Round advancement and proposal-window resolution are handled internally when all citizens have acted.

## Scoring: worst-off first

`metrics.py` grades every episode on a lexicographic ordering: the welfare of the **worst-off** participant under stress is compared first; only on ties do aggregate measures (treasury integrity, participation, legitimacy) break them. A run where the median citizen prospers while one faction is starved *loses* to a run with lower aggregate welfare and a protected floor.
This is deliberate and is the project's one scoring rule above all others.

## Detectors and exploit records

Detectors in `detectors.py` are pure functions over the event stream. They fire when a capture objective's success condition is met (treasury drained, faction suppressed, emergency overstayed, exit blocked) **and** every action in the causal chain was legal. A firing produces an exploit record (the JSON files in `exploits/`) containing the minimal replayable transcript. The record format is documented in `docs/exploit-pipeline.md` and machine-checked by `exploits/SCHEMA.json` and `scripts/verify_exploit_coverage.py`.

## Determinism guarantees

- All randomness flows from a single seeded `random.Random` owned by the tournament; agents receive child seeds derived from (tournament seed, episode index, citizen id).
- Iteration order over citizens, proposals, and ballots is explicitly sorted or seeded-shuffled, never dependent on dict insertion order across runs.
- Replays bypass agent policies entirely: `replay.py` feeds the recorded transcript directly through the legality gate, so a replay's outcome depends only on (transcript, kernel text).
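
The first two guarantees above can be sketched as follows. This document does not specify how child seeds are derived, so the hashing scheme here is an assumption, and `child_seed` and `turn_order` are hypothetical helper names rather than part of the framework. The sketch uses `hashlib` because Python's built-in `hash()` is salted per process for strings and would not be stable across runs.

```python
import hashlib
import random


def child_seed(tournament_seed: int, episode: int, citizen_id: str) -> int:
    """Derive a stable 64-bit seed from the (seed, episode, citizen) triple.

    Assumed scheme for illustration; the real derivation may differ.
    """
    key = f"{tournament_seed}:{episode}:{citizen_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def turn_order(citizens, rng: random.Random) -> list:
    """Sort first so the shuffle input never depends on container order."""
    order = sorted(citizens)
    rng.shuffle(order)
    return order


# A set has no guaranteed iteration order, so sorting before the
# seeded shuffle is what makes the result reproducible.
citizens = {"c2", "c0", "c1"}
order_a = turn_order(citizens, random.Random(42))
order_b = turn_order(citizens, random.Random(42))

seed_a = child_seed(42, 0, "c0")
seed_b = child_seed(42, 0, "c0")
seed_other = child_seed(42, 1, "c0")

# Each agent would own its own generator, seeded from its child seed.
agent_rng = random.Random(seed_a)
```

Deriving each agent's seed from the triple, rather than drawing it from a shared generator, means an agent's random stream does not shift when another agent is added or makes a different number of draws.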