# Experiment Protocol

## Purpose

This protocol defines how to run reproducible evaluations using the benchmark plan. It is written for later harness implementers and evaluators who compare Python, Rust, TypeScript, and the new LLM-optimized language.

## Pre-run preparation

### 1. Select benchmark suite version

Record:

- benchmark catalog version,
- task fixture version,
- hidden test version,
- scoring rubric version.

The catalog in this milestone is version `0.1`.

### 2. Select languages

Required languages:

- Python,
- Rust,
- TypeScript,
- new language prototype.

For each language, record:

- compiler/interpreter/runtime version,
- package manager version,
- formatter/linter versions if used,
- operating system and architecture,
- dependency policy.

### 3. Select model settings

For each model run, record:

- provider,
- model name,
- model version or date when available,
- context window,
- temperature,
- top-p or equivalent sampling parameter,
- seed if available,
- system prompt,
- tool permissions.

Use the same settings across languages unless the experiment explicitly studies model settings.

### 4. Choose evaluation mode

Use one of:

- `one_shot`
- `compiler_repair`
- `agent_loop`
- `maintenance`

Do not mix modes in aggregate scores without reporting them separately.

## Prompt preparation

Each task prompt should include:

1. Task ID and title.
2. Goal.
3. Required behavior.
4. Input and output formats.
5. Public examples.
6. Edge cases.
7. Error behavior.
8. Performance constraints.
9. Security constraints.
10. Allowed dependencies.
11. Required entrypoint.
12. Language-specific build/test instructions.

The semantic task must remain equivalent across languages. If a language needs extra boilerplate instructions, count those tokens as part of that language’s context requirement.

## Run protocol by mode

### Mode A: One-shot generation

1. Provide the prompt to the model.
2. Capture the generated files exactly.
3. Do not provide diagnostics or test output back to the model.
4. Run formatting if the language’s normal workflow does so automatically; record formatter effects.
5. Run build/typecheck.
6. Run public tests.
7. Run hidden tests.
8. Run security/performance checks where applicable.
9. Score the final generated artifact.

Record all failures, including missing manifests and missing entrypoints.

### Mode B: Compiler-guided repair

1. Provide the initial prompt.
2. Capture generated files.
3. Run build/typecheck/lint according to the language adapter.
4. If the build passes, run public tests.
5. If diagnostics or public tests fail, provide the model the exact diagnostic output permitted by the protocol.
6. Allow the model to edit files.
7. Repeat for up to five repair iterations.
8. After the final iteration or success, run hidden tests and scoring checks.
9. Score the best valid artifact according to predeclared selection rules.

Rules:

- Do not provide hidden test output during repair.
- Do not summarize diagnostics differently per language.
- If diagnostics are too large, truncate using a deterministic policy and record truncation.
- Record whether the model fixed, ignored, or regressed each diagnostic.

### Mode C: Test-guided agent loop

1. Provide the task prompt and public tests.
2. Allow the agent to inspect files, edit, run permitted commands, and iterate.
3. Enforce the wall-clock, token, or tool-call budget.
4. Log every action.
5. Prevent access to hidden tests.
6. At budget end or agent completion, freeze the artifact.
7. Run hidden tests, security checks, performance checks, and scoring.

Rules:

- Network access is disabled unless the task explicitly allows it.
- Package installation must use declared dependencies only.
- The agent may create tests, but generated tests do not replace benchmark tests.
- The agent may not alter benchmark fixtures.

### Mode D: Maintenance/refactoring

1. Provide the existing project and change request.
2. Provide existing public tests.
3. Require a patch rather than a full rewrite unless the task permits rewrite.
4. Run existing tests and new task tests.
5. Score regressions, patch focus, interface preservation, and hidden behavior.
6. Run human review sampling for maintainability-sensitive tasks.

Rules:

- Existing public APIs should remain stable unless the specification says otherwise.
- Removed tests are treated as protocol violations unless explicitly justified by the task.
- Large rewrites are penalized when a focused patch is expected.

## Artifact capture

Each run should archive:

- prompt text,
- generated source files,
- manifests,
- generated lockfiles produced by real tooling,
- build logs,
- diagnostics,
- test logs,
- performance logs,
- security scan outputs,
- model outputs,
- tool-call transcript,
- final score record.

Artifacts should be content-addressed or otherwise immutable after scoring.

## Defect classification

After each run, classify defects using the taxonomy from `docs/evaluation_framework.md`.

Required fields:

- first blocking defect,
- all observed defect labels,
- whether defect was present in initial generation,
- whether defect was repaired,
- whether repair introduced a regression,
- evidence link to log/test/diagnostic.

If multiple defects exist, classify all significant ones. Do not stop defect analysis at the first syntax error when later repair reveals semantic problems.

## Scoring procedure

1. Confirm protocol compliance.
2. Score build/tooling validity.
3. Score public, hidden, property, and edge-case tests.
4. Score performance constraints.
5. Score security checks.
6. Score maintainability rubric.
7. Score reproducibility.
8. Apply repair-efficiency scoring for modes that allow repair.
9. Record total and category scores.

Use `evaluation/scoring_rubric.md` as the default rubric. Task-specific rubrics may override weights only if declared before the run.
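
The scoring procedure above can be sketched as a weighted aggregation over category scores. This is a minimal illustration, not the rubric itself: the category names, the point weights in `DEFAULT_WEIGHTS`, the `score_run` function, and the choice to zero out non-compliant runs are all assumptions made for the example. The real weights belong in `evaluation/scoring_rubric.md`.

```python
# Hypothetical default weights (points out of 100). These values are
# placeholders for illustration; the actual rubric defines the real ones.
DEFAULT_WEIGHTS = {
    "build": 15,
    "tests": 40,
    "performance": 10,
    "security": 10,
    "maintainability": 10,
    "reproducibility": 10,
    "repair_efficiency": 5,
}


def score_run(category_scores, *, protocol_compliant, allows_repair, weights=None):
    """Return (total, contributions) for one run.

    category_scores maps a category name to a score in [0, 1].
    A category with no recorded score counts as 0.0, so a failed
    build or skipped check lowers the total instead of disappearing.
    """
    # Task-specific weights may replace the defaults, but only if the
    # caller declared them before the run (not enforced in this sketch).
    weights = dict(DEFAULT_WEIGHTS if weights is None else weights)

    # Repair efficiency only applies to modes that allow repair.
    if not allows_repair:
        weights.pop("repair_efficiency", None)

    # Assumption: a run that violates the protocol scores zero overall.
    if not protocol_compliant:
        return 0.0, {name: 0.0 for name in weights}

    # Normalize by the weights actually in play so totals stay in [0, 1].
    total_weight = sum(weights.values())
    contributions = {
        name: weight * category_scores.get(name, 0.0) / total_weight
        for name, weight in weights.items()
    }
    return sum(contributions.values()), contributions
```

Returning the per-category contributions alongside the total matches step 9, which asks for both to be recorded. Normalizing by the active weights keeps one-shot and repair-mode totals on the same scale, although the two modes should still be reported separately.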
## Sampling plan

Recommended pilot:

- 5 samples per task/language/mode,
- 10 seed tasks,
- at least one model,
- both one-shot and compiler-repair modes for micro tasks,
- agent-loop mode for meso tasks.

Recommended claim-strength experiment:

- 10 or more samples per task/language/mode,
- all seed tasks,
- at least two model families if feasible,
- bootstrap confidence intervals,
- failure taxonomy review.

## Randomization

To reduce ordering effects:

- randomize task order per model run,
- rotate language order,
- record timestamps,
- avoid changing prompts mid-experiment,
- freeze benchmark fixtures during a run batch.

## Environment controls

Unless a task says otherwise:

- disable external network access during build and scoring after dependencies are resolved,
- use clean workspaces,
- clear language-specific caches where practical,
- set deterministic random seeds,
- set locale and timezone explicitly,
- limit CPU and memory consistently,
- isolate filesystem access to the task workspace.

## Dependency controls

Allowed dependencies must be declared in the task.

Evaluators should distinguish:

- standard library use,
- declared third-party dependencies,
- undeclared dependency attempts,
- wrong-version API use,
- package installation failure.

A solution that imports a package without declaring it should receive a reproducibility and build penalty even if the evaluator environment happens to contain that package.

## Hidden tests

Hidden tests should be:

- unavailable to the model,
- unavailable during repair,
- deterministic,
- versioned,
- designed to test behavior specified in the prompt,
- not dependent on obscure unstated requirements.

Hidden tests should not punish reasonable interpretations absent from the prompt. If they do, revise the task for future versions and mark affected results.
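
Returning to the sampling plan's recommendation of bootstrap confidence intervals: once hidden-test outcomes are collected for a task/language/mode cell, an interval on the pass rate could be computed roughly as follows. This is a sketch of a plain percentile bootstrap. The function name `bootstrap_ci`, the default of 2000 resamples, and the fixed seed are assumptions for the example, not values the protocol prescribes.

```python
import random


def bootstrap_ci(outcomes, *, resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-sample outcomes.

    outcomes is a list of numbers, e.g. 1 for a sample that passed all
    hidden tests and 0 for one that did not. The seed is fixed so the
    reported interval is reproducible.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")

    rng = random.Random(seed)
    n = len(outcomes)

    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(resamples)
    )

    # Read the interval off the sorted resample means.
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * resamples)]
    upper = means[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return lower, upper


# Ten samples for one task/language/mode cell: seven passed, three failed.
low, high = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
```

With only 5 to 10 samples per cell the interval will be wide, which is one reason to pool across tasks or collect more samples before making strong claims.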
## Human review sampling

Human review is recommended for:

- all macro tasks in claim-strength experiments,
- at least 20% of successful meso submissions,
- all security-sensitive submissions with high automated scores,
- a sample of failed submissions to identify misleading automated scores.

Use at least two reviewers for important claims where possible. Resolve disagreements through rubric discussion, not hidden preference.

## Result reporting

For each experiment, publish:

- benchmark version,
- task list,
- language/tool versions,
- model settings,
- prompt templates,
- aggregate results,
- per-task results,
- confidence intervals,
- failure taxonomy,
- repair iteration distributions,
- token and tool-call costs,
- security findings,
- performance findings,
- human review summary,
- known limitations,
- archived artifacts or digests.

## Audit checklist

Before accepting a result set, verify:

- All languages used equivalent task semantics.
- Hidden tests were not exposed during generation or repair.
- Tool versions were recorded.
- Dependency policies were enforced.
- Scoring weights were declared before runs.
- Failed builds were counted, not discarded.
- Unsupported new-language features were reported honestly.
- Multiple samples were aggregated correctly.
- Human review sampling was documented.
- Artifacts are sufficient to reproduce representative runs.

## Protocol completion status

This protocol is complete for milestone 1. It specifies how later executable tooling should run, capture, score, and report evaluations.