# Initial Benchmark and Evaluation Plan

## Objective

The benchmark suite will determine whether the new language improves LLM-generated software compared with Python, Rust, and TypeScript. It must test more than algorithm puzzles: generated code should build, handle errors, preserve interfaces, avoid vulnerabilities, and remain maintainable across edits.

This initial plan defines a staged benchmark suite that later milestones can implement as executable tasks with public tests, hidden tests, property checks, security checks, and performance constraints.

## Design goals

The benchmark suite should be:

1. **Comparable**: Python, Rust, TypeScript, and the new language receive equivalent task requirements.
2. **Diverse**: Include algorithms, data transformation, CLI tools, services, concurrency, security, and maintenance.
3. **Reproducible**: Inputs, tools, prompts, and outputs are recorded.
4. **Hard to game**: Hidden tests, variants, property tests, and mutation checks reduce overfitting.
5. **LLM-relevant**: Tasks target known model failure modes.
6. **Incremental**: The suite can run against early prototypes while expanding as tooling matures.
7. **Transparent**: Scoring and defect classifications are public.

## Benchmark tiers

### Tier 1: Micro tasks

Small, single-file or minimal multi-file tasks. They test syntax, basic types, parsing, data modeling, edge cases, and small algorithms.

Typical budget:

- one-shot generation under 10 minutes,
- repair mode under 5 iterations.

Example categories:

- parsing structured text,
- validating records,
- topological sort,
- interval merging,
- small expression evaluator,
- deterministic serialization.

Why this matters:

Micro tasks expose first-pass syntax and semantic correctness without large project noise.

### Tier 2: Meso tasks

Moderate programs with multiple modules, persistent data, error handling, and tests.

Typical scope and budget:

- agent loop under 15 minutes,
- several files,
- realistic input/output constraints.

Example categories:

- CLI task manager,
- markdown link checker,
- config migration tool,
- log aggregation pipeline,
- plugin/event dispatcher,
- safe file synchronizer.

Why this matters:

Meso tasks reveal cross-file consistency, module design, dependency management, and repairability.

### Tier 3: Macro tasks

Larger applications or system components that require architecture, security, concurrency, and maintenance.

Typical scope and budget:

- longer agent loop,
- hidden integration tests,
- human review sample.

Example categories:

- double-entry ledger,
- package resolver,
- HTTP API service,
- redaction pipeline,
- rules engine,
- CRDT-style merge component.

Why this matters:

Macro tasks approximate production-scale generated software more closely than the smaller tiers do.

### Tier 4: Maintenance tasks

Existing code is modified according to a new requirement. The model must avoid regressions.

Example categories:

- add a field to a persisted schema,
- change validation rules,
- refactor sync code to async,
- add authorization to an existing endpoint,
- optimize an algorithm without changing behavior.

Why this matters:

Most of the cost of real software is incurred after the initial version is written, so a benchmark that stops at first generation would miss it.

## Baseline methodology

Each task should be implemented for:

- Python,
- Rust,
- TypeScript,
- the new language.

The baselines should not be intentionally handicapped.

### Python baseline expectations

Use:

- current CPython,
- standard library unless dependencies are explicitly allowed,
- type hints where natural,
- a type checker for tasks where static validation is part of the comparison,
- formatter/linter where task-appropriate.

Python prompts should not require Rust-like structure unless the task needs it.

### Rust baseline expectations

Use:

- current stable Rust,
- Cargo,
- formatter,
- Clippy where task-appropriate,
- idiomatic `Result`/`Option`,
- crates only when task permits dependencies.

Rust prompts should not forbid normal idioms such as enums, traits, and iterators, and should forbid crates only when the task is dependency-free.

### TypeScript baseline expectations

Use:

- current TypeScript compiler in strict mode,
- Node.js LTS-compatible runtime,
- package manifest when dependencies are allowed,
- formatter/linter where task-appropriate,
- runtime validation when task requires external input validation.

TypeScript prompts should state the module format and runtime target so that the task is unambiguous.

### New language expectations

The new language should be evaluated with its real compiler/tooling state. If a prototype lacks features needed by a task, that should be recorded as unsupported rather than silently changing task semantics.
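
One way to keep unsupported tasks from being confused with failures is to give them their own outcome value. The sketch below is a minimal illustration in Python; the `Outcome` enum, the `summarize` helper, and the choice to exclude unsupported tasks from the pass rate are assumptions, not part of this plan.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    FAILED = "failed"            # the task ran, but the build or tests failed
    UNSUPPORTED = "unsupported"  # the toolchain lacks a feature the task needs


def summarize(outcomes: dict[str, Outcome]) -> dict[str, object]:
    """Summarize task outcomes, keeping unsupported tasks out of the pass rate."""
    counts = Counter(outcomes.values())
    attempted = counts[Outcome.PASSED] + counts[Outcome.FAILED]
    return {
        "passed": counts[Outcome.PASSED],
        "failed": counts[Outcome.FAILED],
        "unsupported": sorted(
            task for task, outcome in outcomes.items()
            if outcome is Outcome.UNSUPPORTED
        ),
        # None when nothing was attempted, to avoid reporting a misleading 0%.
        "pass_rate": counts[Outcome.PASSED] / attempted if attempted else None,
    }
```

Listing unsupported task IDs explicitly, rather than only counting them, lets a reader see which task semantics the prototype could not yet express.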

## Task families

### 1. Data parsing and validation

Representative tasks:

- parse mixed-format log lines into typed events,
- validate CSV records with recoverable errors,
- parse and evaluate a small expression language,
- canonicalize and diff JSON-like data.

Failure modes targeted:

- off-by-one parsing,
- weak error handling,
- nullable/optional misuse,
- stringly typed variants,
- hidden edge cases.

Language features tested:

- sum types,
- pattern matching,
- typed errors,
- parser libraries or standard parsing utilities,
- executable examples.
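
To make the first representative task concrete, here is a minimal Python baseline sketch of parsing log lines into typed events with a typed error value instead of exceptions. The two-event line format (`LOGIN <user>` and `ERROR <code> <message>`) and all class names are hypothetical; a real task would define its own format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Login:
    user: str


@dataclass(frozen=True)
class Failure:
    code: int
    message: str


@dataclass(frozen=True)
class ParseError:
    line_no: int
    reason: str


Event = Union[Login, Failure]


def parse_line(line: str, line_no: int) -> Union[Event, ParseError]:
    """Parse one line of the hypothetical format into a typed event or error."""
    parts = line.strip().split(maxsplit=2)
    if not parts:
        return ParseError(line_no, "empty line")
    kind = parts[0]
    if kind == "LOGIN":
        if len(parts) != 2:
            return ParseError(line_no, "LOGIN expects exactly one user")
        return Login(parts[1])
    if kind == "ERROR":
        if len(parts) != 3:
            return ParseError(line_no, "ERROR expects a code and a message")
        code = parts[1]
        # Accept ASCII digits only, so int() below cannot raise.
        if not (code.isascii() and code.isdigit()):
            return ParseError(line_no, "ERROR code must be a non-negative integer")
        return Failure(int(code), parts[2])
    return ParseError(line_no, f"unknown event kind {kind!r}")
```

Returning `ParseError` as a value keeps invalid input recoverable and makes each failure case something a hidden test can assert on directly.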

### 2. Algorithms with edge cases

Representative tasks:

- topological sort with cycle reporting,
- interval merge with boundary semantics,
- least-recently-used cache,
- dependency resolution with constraints,
- stable grouping and sorting.

Failure modes targeted:

- incomplete edge-case handling,
- performance mistakes,
- nondeterminism,
- wrong data-structure choice.

Language features tested:

- collection APIs,
- error/variant modeling,
- performance legibility,
- property tests.
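
As an illustration of how an algorithm task can pair an implementation with property checks, the sketch below merges intervals and compares the result against a simple reference model. The boundary rule (closed integer intervals, with touching intervals merged) and the helper names are assumptions; the actual task would state its own boundary semantics.

```python
import random


def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge closed integer intervals [lo, hi].

    Assumed boundary rule: intervals that share an endpoint, such as
    (1, 2) and (2, 3), are merged.
    """
    merged: list[tuple[int, int]] = []
    for lo, hi in sorted(intervals):
        if lo > hi:
            raise ValueError(f"invalid interval ({lo}, {hi})")
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def covered(intervals: list[tuple[int, int]]) -> set[int]:
    """Reference model: the set of integer points covered by the intervals."""
    return {p for lo, hi in intervals for p in range(lo, hi + 1)}


def check_properties(trials: int = 200, seed: int = 0) -> None:
    """Seeded random check of three properties of merge_intervals."""
    rng = random.Random(seed)
    for _ in range(trials):
        raw = []
        for _ in range(rng.randint(0, 8)):
            lo = rng.randint(0, 20)
            raw.append((lo, lo + rng.randint(0, 5)))
        out = merge_intervals(raw)
        assert covered(out) == covered(raw)  # same points covered
        assert all(a[1] < b[0] for a, b in zip(out, out[1:]))  # sorted, disjoint
        assert merge_intervals(out) == out  # idempotent
```

The fixed seed keeps the check reproducible, which matters if the same properties are reused as hidden tests across languages.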

### 3. Command-line tools

Representative tasks:

- task manager with JSON persistence,
- markdown link checker,
- directory duplicate finder,
- configuration migration command,
- deterministic report generator.

Failure modes targeted:

- undeclared filesystem effects,
- path traversal,
- unclear or misleading error messages,
- partial writes,
- hidden environment assumptions.

Language features tested:

- filesystem capabilities,
- typed CLI arguments,
- structured errors,
- atomic file writes,
- reproducibility.
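
The partial-write failure mode can be illustrated with a common pattern: write to a temporary file in the same directory, then rename it over the target. The sketch below is one possible Python baseline, not a required implementation; the function name is hypothetical, and the atomicity of `os.replace` depends on the platform and filesystem.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, data: object) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    path = Path(path)
    # The temporary file lives next to the target so the final rename
    # stays on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            # sort_keys gives deterministic output for identical data.
            json.dump(data, tmp, sort_keys=True, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Leave no partial temporary file behind, then re-raise.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A hidden test for this behavior could force serialization to fail partway and then assert that the previous file content is still intact.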

### 4. Services and APIs

Representative tasks:

- small HTTP JSON service,
- request validation and routing,
- rate limiter,
- API client with retries using a fake server,
- idempotent operation endpoint.

Failure modes targeted:

- wrong serialization,
- missing validation,
- security bugs,
- concurrency errors,
- framework hallucination.

Language features tested:

- typed schemas,
- effect/capability declarations,
- structured concurrency,
- safe defaults,
- local API summaries.
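
For the rate limiter task, a token bucket is one plausible reference design. The sketch below is an assumption about how such a task might look in Python; the class name and the injected `clock` parameter are hypothetical. Injecting the clock lets tests advance time deterministically instead of sleeping. The class is not thread-safe as written.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(
        self,
        capacity: int,
        rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if capacity <= 0 or rate <= 0:
            raise ValueError("capacity and rate must be positive")
        self._capacity = float(capacity)
        self._rate = rate
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available and report whether the request may run."""
        now = self._clock()
        elapsed = max(0.0, now - self._last)  # guard against a clock that steps back
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A test can pass `clock=lambda: fake_time[0]` and mutate `fake_time` to simulate elapsed seconds, which keeps timing-sensitive hidden tests reproducible.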

### 5. Security-sensitive programming

Representative tasks:

- safe path join and file serving,
- SQL-like query builder over a test database abstraction,
- secret redaction,
- command construction without shell injection,
- authorization policy enforcement.

Failure modes targeted:

- path traversal,
- injection,
- secret leakage,
- unsafe deserialization,
- missing authorization.

Language features tested:

- taint/effect tracking,
- capability model,
- safe standard APIs,
- security diagnostics.
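
The safe path join task targets path traversal directly. A minimal Python sketch follows, assuming the policy is "the resolved path must stay inside a fixed root"; `safe_join` and `UnsafePathError` are hypothetical names. Because the check happens at resolution time, it does not by itself prevent races where the filesystem changes between the check and the later file access.

```python
from pathlib import Path


class UnsafePathError(ValueError):
    """Raised when a user-supplied path would escape the allowed root."""


def safe_join(root: Path, user_path: str) -> Path:
    """Resolve user_path under root, rejecting anything that escapes root."""
    root = Path(root).resolve()
    # Joining with an absolute user_path discards root, and resolve()
    # collapses ".." segments and follows existing symlinks, so the
    # containment check below covers all three cases.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise UnsafePathError(f"path escapes root: {user_path!r}")
    return candidate
```

Comparing resolved paths by their parent chain, rather than by string prefix, avoids treating `/srv/data-old` as if it were inside `/srv/data`.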

### 6. Concurrent and asynchronous systems

Representative tasks:

- bounded worker pool,
- cancellable batch processor,
- actor-style event dispatcher,
- debounce/throttle scheduler,
- concurrent crawler over a fake in-memory web.

Failure modes targeted:

- races,
- deadlocks,
- leaked tasks,
- forgotten awaits,
- cancellation mishandling.

Language features tested:

- structured concurrency,
- typed channels/messages,
- cancellation scopes,
- resource cleanup.
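
The bounded worker pool task can be sketched in Python with `asyncio` and a semaphore. This is one possible baseline shape under stated assumptions: the function name `run_bounded` is hypothetical, results are returned in input order, and the first worker failure cancels the rest.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int,
) -> list[R]:
    """Apply `worker` to every item with at most `limit` running at once.

    Results keep input order. If any worker fails, the remaining tasks are
    cancelled and awaited so that none are leaked.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    tasks = [asyncio.ensure_future(guarded(item)) for item in items]
    try:
        return list(await asyncio.gather(*tasks))
    except BaseException:
        # Cancel whatever is still pending and wait for it to finish,
        # so no task outlives this call.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
```

The explicit cancel-and-await step addresses the leaked-task and cancellation failure modes listed above, which a language with structured concurrency would aim to rule out by construction.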

### 7. Maintenance and refactoring

Representative tasks:

- add schema versioning to an existing CLI app,
- replace a data structure while preserving API,
- split a module without changing exports,
- add logging without leaking secrets,
- optimize slow code with regression tests.

Failure modes targeted:

- cross-file mismatches,
- regressions,
- overbroad rewrites,
- stale tests,
- incorrect migration semantics.

Language features tested:

- project graph summaries,
- interface stability,
- semantic diffs,
- contract preservation.
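
The schema versioning task can be illustrated with a stepwise migration chain. Everything specific in this Python sketch is hypothetical: the `schema_version` key, the two migrations, and the rule that files without a version count as version 0.

```python
import copy
from typing import Callable

CURRENT_VERSION = 2  # hypothetical latest schema version


def _v0_to_v1(doc: dict) -> dict:
    # Hypothetical change: v0 stored tasks as bare strings.
    doc["tasks"] = [{"title": t, "done": False} for t in doc.get("tasks", [])]
    return doc


def _v1_to_v2(doc: dict) -> dict:
    # Hypothetical change: v2 adds a priority field with a default.
    for task in doc["tasks"]:
        task.setdefault("priority", "normal")
    return doc


MIGRATIONS: dict[int, Callable[[dict], dict]] = {0: _v0_to_v1, 1: _v1_to_v2}


def migrate(doc: dict) -> dict:
    """Upgrade a document one version at a time without mutating the input."""
    doc = copy.deepcopy(doc)
    # Assumption: files written before versioning existed count as version 0.
    version = doc.get("schema_version", 0)
    if not isinstance(version, int) or not 0 <= version <= CURRENT_VERSION:
        raise ValueError(f"unsupported schema_version {version!r}")
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version += 1
        doc["schema_version"] = version
    return doc
```

Rejecting versions newer than `CURRENT_VERSION` instead of guessing is the kind of migration semantics a regression test in this family would check.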

## Seed benchmark task list

The machine-readable catalog in `evaluation/benchmark_catalog.yaml` contains the initial seed tasks. The recommended first implemented subset is:

1. `micro_log_event_parser`
2. `micro_topological_sort`
3. `micro_interval_merge`
4. `micro_expr_eval`
5. `meso_cli_task_store`
6. `meso_markdown_link_checker`
7. `meso_config_migrator`
8. `meso_secret_redactor`
9. `macro_double_entry_ledger`
10. `maint_add_schema_versioning`

This subset gives early coverage of parsing, algorithms, persistence, security, cross-file consistency, and maintenance.

## Prompt structure

Each benchmark task should provide a language-neutral specification plus language-specific wrapper details.

Recommended task prompt sections:

1. Goal.
2. Required behavior.
3. Inputs and outputs.
4. Public examples.
5. Edge cases.
6. Error behavior.
7. Performance constraints.
8. Security constraints.
9. Allowed dependencies.
10. Required project shape or entrypoint.
11. Tests to satisfy.
12. Scoring notes.

Language-specific prompts should only adapt syntax/tooling instructions, not change semantics.
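
One way to enforce that rule mechanically is to render every prompt from a single shared specification and append only tooling notes per language. The sketch below assumes a `TaskSpec` record whose fields mirror the twelve sections listed above; the class, the field names, and `render_prompt` are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskSpec:
    """Language-neutral specification; field order follows the section list."""

    goal: str
    required_behavior: str
    inputs_and_outputs: str
    public_examples: str
    edge_cases: str
    error_behavior: str
    performance_constraints: str
    security_constraints: str
    allowed_dependencies: str
    project_shape: str
    tests_to_satisfy: str
    scoring_notes: str


def render_prompt(spec: TaskSpec, language: str, tooling_notes: str) -> str:
    """Render the shared spec, then append language-specific tooling notes."""
    parts = [f"Target language: {language}"]
    for field in fields(spec):
        title = field.name.replace("_", " ").capitalize()
        parts.append(f"## {title}\n{getattr(spec, field.name)}")
    # Only this final section may differ between languages.
    parts.append(f"## Tooling\n{tooling_notes}")
    return "\n\n".join(parts)
```

Because the semantic sections come from one object, a harness can assert that the prompts for two languages are identical apart from the first line and the tooling section.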

## Test structure

Each task should include:

- public example tests,
- hidden unit tests,
- hidden edge-case tests,
- property tests where appropriate,
- negative tests for invalid input,
- performance tests for relevant tasks,
- security tests for sensitive tasks,
- regression tests for maintenance tasks.

Public tests should verify basic behavior but should not exhaustively reveal the cases covered by hidden tests.

## Scoring by task type

### Micro tasks

Emphasize:

- correctness,
- build validity,
- edge cases,
- minimal repair.

Performance matters only when input sizes are specified.

### Meso tasks

Emphasize:

- correctness,
- module consistency,
- error handling,
- reproducibility,
- maintainability.

### Macro tasks

Emphasize:

- architecture,
- integration behavior,
- security,
- performance,
- maintainability,
- human review.

### Maintenance tasks

Emphasize:

- regression avoidance,
- minimal coherent patches,
- interface preservation,
- migration correctness.

## Contamination and leakage controls

To reduce benchmark contamination:

- Author original tasks and hidden tests.
- Generate task variants with different domain names and data.
- Keep hidden tests private until official scoring.
- Avoid using exact prompts from common public benchmarks.
- Separate public examples from scoring assertions.
- Record whether any model had prior access to benchmark materials.
- Rotate hidden cases between benchmark releases.
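
Variant generation can be made reproducible by deriving every choice from a recorded seed. The following Python sketch is only an illustration: the domain vocabularies, the `$placeholder` template syntax, and the shape of the returned record are assumptions.

```python
import random
import string

# Hypothetical vocabularies; a real suite would curate these per task.
DOMAINS = {
    "library": {"item": "book", "owner": "member", "action": "borrow"},
    "garage": {"item": "vehicle", "owner": "customer", "action": "service"},
    "clinic": {"item": "appointment", "owner": "patient", "action": "schedule"},
}


def make_variant(template: str, seed: int) -> dict:
    """Derive a task variant deterministically from a seed."""
    rng = random.Random(seed)
    # Sort the keys so the choice does not depend on dict insertion order.
    domain = rng.choice(sorted(DOMAINS))
    data = [rng.randint(1, 50) for _ in range(5)]
    # substitute() raises KeyError if the template uses an unknown placeholder,
    # so a malformed template fails loudly instead of leaking "$name" text.
    prompt = string.Template(template).substitute(DOMAINS[domain])
    return {"seed": seed, "domain": domain, "prompt": prompt, "data": data}
```

Storing the seed with each result makes it possible to regenerate the exact variant a model saw when a score is later questioned.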

## Fairness controls

To avoid unfair comparisons:

- Use equally detailed prompts across languages.
- Allow idiomatic standard libraries.
- Allow dependencies only when all languages have comparable access.
- Do not compare a mature baseline ecosystem against an intentionally dependency-free new language without labeling the limitation.
- Keep time, repair, and token budgets equal.
- Run multiple model samples.
- Report unsupported tasks separately from failed tasks.

## Benchmark harness requirements for later milestones

The future executable harness should provide:

- task runner abstraction,
- language adapter abstraction,
- deterministic environment setup,
- build/test commands for each language,
- hidden-test isolation,
- structured log capture,
- scoring implementation,
- defect classification support,
- artifact archiving,
- report generation.

Recommended language adapter fields:

- language identifier,
- toolchain version command,
- build command,
- format command,
- lint command,
- test command,
- source file conventions,
- manifest conventions,
- dependency policy,
- diagnostic parser.
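
The adapter fields above could map onto a small record type. The sketch below is a hypothetical Python shape for a later milestone, not a committed interface: the field names, the `planned_steps` helper, the placeholder diagnostic parser, and the `dependency_policy` values are all assumptions. The Rust commands shown are standard Cargo invocations.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Command = Sequence[str]


def error_lines(output: str) -> list[str]:
    """Placeholder diagnostic parser: keep lines that start with 'error'."""
    return [ln for ln in output.splitlines() if ln.lstrip().startswith("error")]


@dataclass(frozen=True)
class LanguageAdapter:
    language_id: str
    version_command: Command
    build_command: Optional[Command]  # None when the language has no build step
    format_command: Optional[Command]
    lint_command: Optional[Command]
    test_command: Command
    source_glob: str
    manifest_file: Optional[str]
    dependency_policy: str  # assumed values: "stdlib-only" or "allow-listed"
    diagnostic_parser: Callable[[str], list[str]]


def planned_steps(adapter: LanguageAdapter) -> list[tuple[str, tuple[str, ...]]]:
    """Return the commands a runner would execute, in order, skipping unset ones."""
    steps = [
        ("version", adapter.version_command),
        ("format", adapter.format_command),
        ("build", adapter.build_command),
        ("lint", adapter.lint_command),
        ("test", adapter.test_command),
    ]
    return [(name, tuple(cmd)) for name, cmd in steps if cmd is not None]


RUST = LanguageAdapter(
    language_id="rust",
    version_command=("rustc", "--version"),
    build_command=("cargo", "build"),
    format_command=("cargo", "fmt", "--check"),
    lint_command=("cargo", "clippy"),
    test_command=("cargo", "test"),
    source_glob="src/**/*.rs",
    manifest_file="Cargo.toml",
    dependency_policy="allow-listed",
    diagnostic_parser=error_lines,
)
```

Keeping commands as argument sequences rather than shell strings means the runner can execute them without a shell, which avoids quoting problems in task paths.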

## Initial result report shape

A published result should include:

| Field | Description |
| --- | --- |
| Suite version | Benchmark catalog/test version. |
| Language version | Toolchain/compiler version. |
| Model settings | Model, temperature, context, system prompt. |
| Mode | One-shot, repair, agent, or maintenance. |
| Tasks attempted | Count and IDs. |
| Success rates | Build, public tests, hidden tests. |
| Repair metrics | Iterations, tokens, time. |
| Defect taxonomy | Failure labels by language. |
| Security findings | Vulnerability counts and severity. |
| Performance findings | Time/memory distributions. |
| Human review | Sample size and scores. |
| Limitations | Known threats to validity. |

## Expected insights

The benchmark should answer questions such as:

- Does the new language reduce hallucinated symbols?
- Do explicit effects reduce missed I/O/security constraints?
- Do structured diagnostics reduce repair loops?
- Does mandatory boundary typing improve hidden-test pass rate?
- Does the syntax reduce parse/typecheck failures?
- Does the language remain maintainable for human reviewers?
- Which task families still favor Python, Rust, or TypeScript?
- Is the new language’s training-data disadvantage overcome by tooling?

## Acceptance criteria for this plan

This plan is complete for milestone 1 because it defines:

- benchmark goals,
- baseline methodology,
- tier structure,
- task families,
- seed task list,
- prompt and test structure,
- scoring approach,
- contamination and fairness controls,
- later harness requirements,
- reporting expectations.

Later milestones should convert the catalog into executable benchmark tasks and use the framework to produce empirical results.