# Evaluation Framework

## Purpose

This framework defines how to measure whether the new language is actually better than Python, Rust, and TypeScript for LLM-generated software.

The framework evaluates an end-to-end process:

```text
task prompt → model generation → build/typecheck → repair loop → tests → scoring → report
```

It measures correctness, repairability, safety, maintainability, performance, and reproducibility.

## Baseline languages

The required baselines are:

1. **Python**
   - Representative of dynamic, high-prevalence languages.
   - Should use current CPython and task-appropriate type checking/linting.
   - Strengths: generation familiarity, brevity, rich libraries.
   - Expected weaknesses: runtime failures, weak enforced contracts, packaging drift.

2. **Rust**
   - Representative of strong static guarantees and compiler-assisted repair.
   - Should use current stable Rust, Cargo, formatter, and Clippy where appropriate.
   - Strengths: type and memory safety, explicit errors, deterministic tooling.
   - Expected weaknesses: generation difficulty, ownership/lifetime complexity.

3. **TypeScript**
   - Representative of gradually typed languages in the web ecosystem.
   - Should use strict compiler settings and a current Node.js LTS-compatible runtime.
   - Strengths: structural typing, web/API prevalence, editor tooling.
   - Expected weaknesses: runtime type erasure, ecosystem churn, escape hatches.

The new language must be evaluated with tooling of comparable maturity once such tooling is available. Results from early prototypes should be clearly marked as such so they are not overclaimed.

## Evaluation modes

Different LLM coding workflows should be measured separately.

### Mode A: One-shot generation

The model receives the task specification and writes a solution without seeing build or test feedback.

Measures:

- parse success,
- build/typecheck success,
- visible test pass,
- hidden test pass,
- initial defect classes.

Use this mode to measure how reliably a model can produce valid code in the language from the prompt alone.

### Mode B: Compiler-guided repair

The model receives build/typecheck/lint diagnostics and may revise the solution for a fixed number of repair iterations.

Recommended initial budget:

- maximum 5 repair iterations,
- same prompt context budget across languages,
- diagnostics included exactly as emitted by the toolchain,
- no human hints beyond the protocol.

Measures:

- repair convergence,
- diagnostic usefulness,
- token/tool-call cost,
- regression after repair.

Use this mode to measure the value of structured diagnostics.
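The repair protocol above can be sketched as a bounded loop. This is a minimal sketch, not the harness itself: `generate` and `check` are hypothetical callables standing in for the model call and the toolchain invocation, and `RepairResult` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_REPAIR_ITERATIONS = 5  # recommended initial budget from the protocol


@dataclass
class RepairResult:
    converged: bool         # True if the toolchain reported no diagnostics
    repair_iterations: int  # repair attempts used (0 means the first draft was clean)
    source: str             # last candidate produced


def run_repair_loop(
    task_prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    check: Callable[[str], str],
    max_iterations: int = MAX_REPAIR_ITERATIONS,
) -> RepairResult:
    """Generate once, then repair until diagnostics are empty or the budget is spent.

    generate(prompt, previous_source, diagnostics) is a hypothetical model call.
    check(source) is a hypothetical toolchain call that returns raw diagnostics;
    an empty string means build/typecheck/lint succeeded.
    """
    source = generate(task_prompt, None, None)
    diagnostics = check(source)
    iterations = 0
    while diagnostics and iterations < max_iterations:
        # Diagnostics are passed through verbatim, exactly as emitted.
        source = generate(task_prompt, source, diagnostics)
        diagnostics = check(source)
        iterations += 1
    return RepairResult(
        converged=not diagnostics,
        repair_iterations=iterations,
        source=source,
    )
```

Keeping the model and toolchain behind plain callables lets the same loop run unchanged for every language, which is one way to hold the iteration budget constant across baselines.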

### Mode C: Test-guided agent loop

The model or agent may run tests, inspect failures, edit files, and repeat within a fixed wall-clock/tool-call budget.

Recommended initial budget:

- maximum 15 minutes or equivalent tool-call cap per task,
- public tests available,
- hidden tests withheld until scoring,
- all actions logged.

Measures:

- realistic agent productivity,
- overfitting to public tests,
- end-to-end cost,
- robustness of project tooling.
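One way to enforce the Mode C budget and the "all actions logged" rule is a small tracker that the agent loop consults before every action. This is a sketch under assumptions: the class name `AgentBudget`, its methods, and the injectable `clock` are illustrative, and the tool-call cap is left as a parameter because the framework does not fix its value.

```python
import time
from typing import Callable, Optional


class AgentBudget:
    """Tracks the wall-clock and tool-call budget for one task."""

    def __init__(
        self,
        max_seconds: float = 15 * 60,          # 15-minute budget from the protocol
        max_tool_calls: Optional[int] = None,  # cap is task-defined; None disables it
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._clock = clock
        self._start = clock()
        self.max_seconds = max_seconds
        self.max_tool_calls = max_tool_calls
        self.tool_calls = 0
        self.log: list[tuple[float, str]] = []  # (elapsed seconds, action) pairs

    def elapsed(self) -> float:
        return self._clock() - self._start

    def exhausted(self) -> bool:
        over_time = self.elapsed() >= self.max_seconds
        over_calls = (
            self.max_tool_calls is not None
            and self.tool_calls >= self.max_tool_calls
        )
        return over_time or over_calls

    def record(self, action: str) -> None:
        """Log an action and count it against the tool-call cap."""
        if self.exhausted():
            raise RuntimeError("budget exhausted; no further actions allowed")
        self.tool_calls += 1
        self.log.append((self.elapsed(), action))
```

A monotonic clock is used so that system time adjustments during a run cannot extend or shorten the budget.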

### Mode D: Maintenance/refactoring

The model receives an existing solution and must modify behavior without breaking prior tests.

Measures:

- semantic preservation,
- interface consistency,
- regression rate,
- patch size,
- human reviewability.

This mode is important because real generated software is edited repeatedly.

## Primary metrics

### Functional correctness

Whether the submitted solution passes the benchmark’s tests.

Recommended breakdown:

- `public_test_pass_rate`
- `hidden_test_pass_rate`
- `property_test_pass_rate`
- `edge_case_pass_rate`

The primary score should weight hidden and property tests more heavily than public examples.

### Build validity

Whether the generated project builds before semantic testing.

Submetrics:

- parse success,
- format success,
- typecheck success,
- dependency resolution success,
- runnable entrypoint success.

### Repair cost

How much effort is needed to reach a valid solution.

Submetrics:

- number of repair iterations,
- total generated tokens,
- diagnostic tokens consumed,
- total wall-clock time,
- tool calls,
- number of files rewritten.

### Hallucination and interface defects

Counts defects caused by invented or mismatched code artifacts.

Submetrics:

- unknown functions/types/modules,
- wrong import paths,
- wrong dependency versions,
- calls inconsistent with declared signatures,
- cross-file name or schema mismatch,
- generated tests referring to nonexistent behavior.

### Safety and security

Measures whether generated code introduces known vulnerability classes.

Submetrics:

- command injection,
- SQL/query injection,
- path traversal,
- unsafe deserialization,
- SSRF-like uncontrolled network access,
- secret leakage,
- authorization bypass,
- insecure randomness,
- data race or concurrency safety issue,
- unsafe escape hatch usage.

### Maintainability

Assessed through automated and human review.

Submetrics:

- idiomatic use of language features,
- modularity,
- naming clarity,
- unnecessary complexity,
- contract/test quality,
- minimal use of escape hatches,
- semantic diff clarity.

### Performance

Measured against task-specific resource constraints.

Submetrics:

- runtime,
- memory usage,
- asymptotic complexity classification,
- timeout rate,
- throughput/latency where relevant.

Performance should not dominate tasks that are primarily semantic, but timeouts and extreme inefficiency should be penalized.

### Reproducibility

Whether another evaluator can rebuild and rerun the solution from declared artifacts.

Submetrics:

- manifest completeness,
- no undeclared dependencies,
- deterministic tests,
- recorded tool versions,
- no hidden network dependence unless task requires it,
- stable random seeds where applicable.

## Composite scoring

A default 100-point score is defined in `evaluation/scoring_rubric.md`. The recommended top-level allocation is:

| Category | Points |
| --- | ---: |
| Functional correctness | 40 |
| Build/typecheck/tooling validity | 15 |
| Repair efficiency | 10 |
| Safety/security | 10 |
| Maintainability | 10 |
| Performance | 10 |
| Reproducibility | 5 |

For pure one-shot mode, repair efficiency should be reported separately or scored as zero-effort if no repair is attempted. For security-focused tasks, safety may be weighted higher, but the weighting must be declared before runs.
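The allocation above can be applied as a simple weighted sum. The sketch below assumes each category has already been normalized to a fraction between 0.0 and 1.0; how those fractions are derived belongs to `evaluation/scoring_rubric.md`, which is not reproduced here, and the dictionary keys and the name `composite_score` are illustrative.

```python
# Top-level allocation copied from the table above (points sum to 100).
CATEGORY_POINTS = {
    "functional_correctness": 40,
    "build_validity": 15,
    "repair_efficiency": 10,
    "safety_security": 10,
    "maintainability": 10,
    "performance": 10,
    "reproducibility": 5,
}


def composite_score(
    fractions: dict[str, float],
    points: dict[str, int] = CATEGORY_POINTS,
) -> float:
    """Combine per-category fractions (0.0 to 1.0) into a 100-point score.

    Assumption: each fraction is the share of that category's points earned.
    """
    if set(fractions) != set(points):
        raise ValueError("fractions must cover exactly the declared categories")
    total = 0.0
    for category, max_points in points.items():
        fraction = fractions[category]
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{category} fraction out of range: {fraction}")
        total += max_points * fraction
    return total
```

Passing a different `points` mapping is one way to express a security-weighted variant, provided the mapping is declared before the runs.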

## Defect taxonomy

Every failed run should classify defects using stable labels:

- `syntax_error`
- `type_error`
- `missing_dependency`
- `unknown_symbol`
- `wrong_api_version`
- `cross_file_mismatch`
- `unhandled_error`
- `null_or_optional_misuse`
- `runtime_exception`
- `logic_error`
- `edge_case_failure`
- `performance_timeout`
- `memory_limit`
- `security_vulnerability`
- `concurrency_error`
- `test_overfit`
- `nondeterminism`
- `poor_maintainability`
- `protocol_violation`

A single run may have multiple labels. The first blocking defect should also be recorded.
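To keep labels stable across runs, a harness could reject anything outside the taxonomy at recording time. The following is a minimal sketch; `DefectReport` and `make_defect_report` are hypothetical names, and requiring the first blocking defect to be one of the run's labels is an assumption rather than a stated rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Stable labels copied from the taxonomy above.
DEFECT_LABELS = frozenset({
    "syntax_error", "type_error", "missing_dependency", "unknown_symbol",
    "wrong_api_version", "cross_file_mismatch", "unhandled_error",
    "null_or_optional_misuse", "runtime_exception", "logic_error",
    "edge_case_failure", "performance_timeout", "memory_limit",
    "security_vulnerability", "concurrency_error", "test_overfit",
    "nondeterminism", "poor_maintainability", "protocol_violation",
})


@dataclass(frozen=True)
class DefectReport:
    labels: tuple[str, ...]        # every defect observed in the run
    first_blocking: Optional[str]  # the defect that first stopped progress


def make_defect_report(
    labels: Sequence[str],
    first_blocking: Optional[str] = None,
) -> DefectReport:
    unknown = sorted(set(labels) - DEFECT_LABELS)
    if unknown:
        raise ValueError(f"unknown defect labels: {unknown}")
    if first_blocking is not None and first_blocking not in labels:
        raise ValueError("first_blocking must be one of the run's labels")
    # Deduplicate while keeping the order in which defects were observed.
    return DefectReport(tuple(dict.fromkeys(labels)), first_blocking)
```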

## Instrumentation data model

A later benchmark harness should record each attempt as structured data. Recommended fields:

```json
{
  "run_id": "string",
  "suite_version": "string",
  "task_id": "string",
  "language": "python|rust|typescript|new_language",
  "model": {
    "provider": "string",
    "name": "string",
    "version": "string",
    "temperature": "number",
    "seed": "number_or_null"
  },
  "mode": "one_shot|compiler_repair|agent_loop|maintenance",
  "prompt_tokens": "number",
  "completion_tokens": "number",
  "tool_calls": "number",
  "repair_iteration": "number",
  "build": {
    "parse_ok": "boolean",
    "format_ok": "boolean",
    "typecheck_ok": "boolean",
    "dependency_ok": "boolean",
    "entrypoint_ok": "boolean"
  },
  "tests": {
    "public_passed": "number",
    "public_total": "number",
    "hidden_passed": "number",
    "hidden_total": "number",
    "property_passed": "number",
    "property_total": "number"
  },
  "performance": {
    "runtime_ms": "number_or_null",
    "memory_mb": "number_or_null",
    "timed_out": "boolean"
  },
  "security": {
    "critical": "number",
    "high": "number",
    "medium": "number",
    "low": "number"
  },
  "defects": ["string"],
  "score": "number",
  "artifacts": {
    "source_archive": "path_or_digest",
    "logs": "path_or_digest",
    "diagnostics": "path_or_digest"
  }
}
```

The final schema may evolve, but equivalent fields should remain available for longitudinal comparison.
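As an illustration of how the record feeds the functional-correctness metrics, the sketch below derives pass rates from the `tests` object of one serialized run. The function name `pass_rates` is hypothetical. The schema as given carries no edge-case counters, so `edge_case_pass_rate` is not computed here.

```python
import json
from typing import Optional

TEST_KINDS = ("public", "hidden", "property")


def pass_rates(record_json: str) -> dict[str, Optional[float]]:
    """Derive `<kind>_test_pass_rate` values from one run record.

    Returns None for a kind with zero tests so that an empty suite is not
    mistaken for a 0% or 100% pass rate.
    """
    tests = json.loads(record_json)["tests"]
    rates: dict[str, Optional[float]] = {}
    for kind in TEST_KINDS:
        passed = tests[f"{kind}_passed"]
        total = tests[f"{kind}_total"]
        if not 0 <= passed <= total:
            raise ValueError(f"inconsistent {kind} counts: {passed}/{total}")
        rates[f"{kind}_test_pass_rate"] = passed / total if total else None
    return rates
```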

## Prompt controls

To compare languages fairly:

- Use semantically equivalent prompts for each language.
- State allowed standard libraries and dependencies explicitly.
- Include the same public examples and constraints.
- Avoid giving the new language more detailed hints than baselines.
- Record whether language documentation or API summaries are available in context.
- Keep repair feedback faithful to each toolchain’s actual output.
- Do not include hidden tests in prompts or repair loops.

When a language requires ceremony, include that ceremony in the prompt only if a competent user would normally provide it. Otherwise, leave it out and let it count against the language’s context cost.

## Model controls

For each benchmark release, record:

- model provider and model identifier,
- model version/date when available,
- temperature and sampling settings,
- seed when available,
- context window,
- tool availability,
- system prompt,
- retry policy.

Recommended initial sampling:

- 5 independent samples per task/language/mode for pilot runs,
- 10 or more samples for claims of superiority,
- a fixed, seeded randomization of task order to reduce temporal and provider-side effects.
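The sampling recommendations can be turned into a reproducible run plan. This sketch assumes the language and mode identifiers from the instrumentation data model; `build_run_plan` and the `order_seed` parameter are illustrative names, and shuffling whole runs (rather than only tasks) is one possible reading of "task order".

```python
import itertools
import random

LANGUAGES = ("python", "rust", "typescript", "new_language")
MODES = ("one_shot", "compiler_repair", "agent_loop", "maintenance")


def build_run_plan(
    task_ids: list[str],
    samples_per_cell: int = 5,  # pilot default; use 10 or more for superiority claims
    order_seed: int = 0,        # recorded so the order can be reproduced
) -> list[tuple[str, str, str, int]]:
    """Return every (task, language, mode, sample index) run in a fixed shuffled order."""
    cells = [
        (task, language, mode, sample)
        for task, language, mode in itertools.product(sorted(task_ids), LANGUAGES, MODES)
        for sample in range(samples_per_cell)
    ]
    # A dedicated Random instance keeps the order independent of global RNG state.
    random.Random(order_seed).shuffle(cells)
    return cells
```

Sorting the task IDs before shuffling makes the plan depend only on the task set and the seed, not on the order in which tasks were listed.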

## Statistical reporting

Benchmark reports should include:

- per-task results,
- per-category aggregate results,
- confidence intervals or bootstrap intervals,
- median and distribution of repair iterations,
- failure taxonomy histograms,
- effect sizes against each baseline,
- ablations for context size and diagnostics,
- separate reports for each evaluation mode (one-shot, compiler-guided repair, agent loop, maintenance).

A single aggregate leaderboard is insufficient. The new language may outperform in repairability but underperform in ecosystem tasks; that distinction matters.
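The interval requirement above could be met with a percentile bootstrap. This is a minimal sketch, assuming the input is a list of per-task scores, or of paired per-task score differences against one baseline; the function name and its defaults are illustrative and not part of the framework.

```python
import random
import statistics


def bootstrap_mean_ci(
    values: list[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # seeded so the reported interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper
```

When the input is a list of paired differences, an interval that excludes zero suggests a difference against that baseline, subject to the sample-size guidance under model controls.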

## Human review protocol

Automated tests cannot catch all maintainability and security issues. A subset of benchmark submissions should be reviewed by humans.

Reviewers should be blind to language identity where feasible, though full blinding may be impossible. Reviews should score:

- clarity,
- modularity,
- contract usefulness,
- unnecessary complexity,
- risky operations,
- ease of changing requirements,
- confidence in correctness.

Reviewed submissions should be sampled from both successful and failed programs so that quality differences hidden by pass/fail results can surface.

## Security evaluation

Security tasks should include both direct vulnerability prompts and ordinary application prompts where vulnerabilities can appear. The goal is to measure safe defaults, not only performance on “security quiz” tasks.

Each security-sensitive task should define:

- assets,
- trust boundaries,
- attacker-controlled inputs,
- forbidden sinks,
- required sanitization or authorization,
- expected vulnerability tests.

Security defects should be severity-rated. Critical vulnerabilities should strongly penalize otherwise functional solutions.

## Performance evaluation

Performance constraints should be task-specific and declared before runs.

For algorithmic tasks:

- input size ranges,
- expected complexity class,
- time and memory limits.

For service-like tasks:

- throughput/latency targets,
- concurrency level,
- backpressure behavior,
- resource cleanup.

The benchmark should distinguish between:

- unacceptable timeouts,
- minor constant-factor slowdowns,
- algorithmic complexity failures,
- performance differences irrelevant to task goals.
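One way to separate the first three outcomes is to estimate the empirical growth exponent from timings at two input sizes and compare it with the declared complexity class. The sketch below is illustrative only: the outcome labels, the tolerances, and the `slowdown_vs_reference` ratio (solution runtime divided by a reference runtime) are assumptions, and whether performance matters for a task at all remains a task-level declaration.

```python
import math


def empirical_exponent(
    n_small: float, t_small: float, n_large: float, t_large: float
) -> float:
    """Estimate k in runtime ~ n**k from two (input size, runtime) measurements."""
    if not (0 < n_small < n_large and t_small > 0 and t_large > 0):
        raise ValueError("need two increasing sizes and positive runtimes")
    return math.log(t_large / t_small) / math.log(n_large / n_small)


def classify_performance(
    timed_out: bool,
    exponent: float,
    expected_exponent: float,
    slowdown_vs_reference: float,
    exponent_tolerance: float = 0.5,  # assumed slack for measurement noise
    slowdown_tolerance: float = 1.0,  # ratios above this count as a slowdown
) -> str:
    if timed_out:
        return "unacceptable_timeout"
    if exponent > expected_exponent + exponent_tolerance:
        return "complexity_failure"
    if slowdown_vs_reference > slowdown_tolerance:
        return "constant_factor_slowdown"
    return "acceptable"
```

Two measurements give only a rough estimate; a real harness would likely fit the exponent over several input sizes before labeling a run as a complexity failure.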

## Reproducibility standard

Every submitted solution should include enough information to rerun:

- source files,
- manifest files,
- tool versions,
- generated lockfiles if produced by real tooling during the run,
- test command,
- environment variables,
- seed data,
- logs.

The evaluation harness should generate lockfiles through actual package managers rather than hand-writing them. Network access should be disabled during scoring except where explicitly required by a task.
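The `path_or_digest` artifact fields in the data model suggest content-addressed artifacts. As a sketch, a harness could record a SHA-256 digest for every file in a solution directory so that a rerun can confirm it is using identical inputs. The names `digest_file` and `build_manifest` and the `sha256:` prefix are assumptions, not part of the framework.

```python
import hashlib
from pathlib import Path


def digest_file(path: Path) -> str:
    """Return a SHA-256 digest of the file's bytes, read in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under `root` (as a POSIX-style relative path) to its digest."""
    return {
        path.relative_to(root).as_posix(): digest_file(path)
        for path in sorted(root.rglob("*"))  # sorted for a stable manifest order
        if path.is_file()
    }
```

Comparing two manifests then reduces to a dictionary comparison, which makes undeclared or modified files easy to spot.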

## Validity threats

Likely threats and mitigations:

| Threat | Mitigation |
| --- | --- |
| Training-data contamination | Use newly authored tasks, hidden tests, and task variants. |
| Baseline unfairness | Use idiomatic tooling and expert-reviewed prompts for Python, Rust, and TypeScript. |
| Model drift | Record model versions and rerun key experiments periodically. |
| Overfitting to public tests | Use hidden/property tests and mutation checks. |
| Prototype immaturity | Separate language-design score from implementation maturity during early milestones. |
| Ecosystem imbalance | Include both dependency-light and dependency-using tasks. |
| Human scoring bias | Use rubrics, multiple reviewers, and spot audits. |

## Reporting template

A benchmark report should contain:

1. Suite version and commit.
2. Language/toolchain versions.
3. Model settings.
4. Prompt protocol.
5. Task list and category weights.
6. Aggregate scores.
7. Per-task tables.
8. Failure taxonomy.
9. Repair loop analysis.
10. Security findings.
11. Performance findings.
12. Human review summary.
13. Reproducibility artifacts.
14. Known limitations.
15. Design implications for the next language iteration.

## Acceptance criteria for the framework

This framework is complete for milestone 1 because it defines:

- baseline languages,
- evaluation modes,
- primary metrics,
- scoring structure,
- defect taxonomy,
- instrumentation data model,
- prompt/model controls,
- security/performance/reproducibility standards,
- reporting requirements.

Later milestones should implement these procedures in executable tooling and revise them based on empirical findings.