# Scoring Rubric

## Overview

The default benchmark score has a maximum of 100 points. It measures the quality of generated software across correctness, tooling validity, repair efficiency, safety, maintainability, performance, and reproducibility.

Task-specific rubrics may adjust weights, but any adjustment must be declared before evaluation.

## Default 100-point allocation

| Category | Points |
| --- | ---: |
| Functional correctness | 40 |
| Build/typecheck/tooling validity | 15 |
| Repair efficiency | 10 |
| Safety/security | 10 |
| Maintainability/auditability | 10 |
| Performance/resource use | 10 |
| Reproducibility | 5 |
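
A task-specific adjustment can be checked mechanically before evaluation starts. The sketch below is one way to do that, not a prescribed implementation: the category keys and the `validate_weights` helper are hypothetical names, and it assumes that adjusted weights keep the same seven categories and still total 100.

```python
# Default allocation from the table above; the key names are illustrative.
DEFAULT_WEIGHTS = {
    "functional_correctness": 40,
    "tooling_validity": 15,
    "repair_efficiency": 10,
    "safety_security": 10,
    "maintainability": 10,
    "performance": 10,
    "reproducibility": 5,
}


def validate_weights(weights: dict[str, float], total: float = 100) -> None:
    """Raise ValueError unless weights use the default categories and sum to total.

    Assumption: a task-specific rubric redistributes points but keeps the
    same categories and the same 100-point total.
    """
    missing = set(DEFAULT_WEIGHTS) - set(weights)
    unknown = set(weights) - set(DEFAULT_WEIGHTS)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - total) > 1e-9:
        raise ValueError(f"weights sum to {sum(weights.values())}, expected {total}")


# Example: a task where performance is irrelevant moves those 10 points
# to functional correctness before the run.
adjusted = dict(DEFAULT_WEIGHTS, performance=0, functional_correctness=50)
validate_weights(adjusted)
```

Running the check when the task is declared, rather than at scoring time, keeps adjustments from being made after results are known.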

## 1. Functional correctness — 40 points

Functional correctness measures whether the solution implements the specified behavior.

Suggested allocation:

| Subcategory | Points |
| --- | ---: |
| Public examples/tests | 8 |
| Hidden normal-case tests | 12 |
| Hidden edge-case tests | 10 |
| Property/invariant tests | 6 |
| Error behavior tests | 4 |

Guidance:

- Passing only the public examples should earn no more than 8 points.
- Hidden edge cases should reward general reasoning rather than test memorization.
- A solution that cannot run receives 0 functional points.
- A solution that hardcodes visible examples receives at most 10 functional points unless hidden tests show general behavior.
- Incorrect error behavior should be penalized even if successful inputs work.
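
The allocation and guidance above can be combined into a small scoring helper. This is a minimal sketch under stated assumptions: points are awarded in proportion to each subcategory's pass rate (the rubric does not fix a formula), and `functional_score`, its parameters, and the subcategory keys are hypothetical names. The `hardcoded` flag stands for a reviewer judgment made after looking at hidden-test results.

```python
# Suggested functional-correctness allocation; key names are illustrative.
FUNCTIONAL_POINTS = {
    "public": 8,
    "hidden_normal": 12,
    "hidden_edge": 10,
    "property": 6,
    "error_behavior": 4,
}


def functional_score(
    pass_rates: dict[str, float],
    runnable: bool = True,
    hardcoded: bool = False,
) -> float:
    """Score functional correctness out of 40.

    pass_rates maps a subcategory to the fraction of its tests passed (0..1).
    Assumption: points scale linearly with the pass rate.
    """
    if not runnable:
        return 0.0  # a solution that cannot run receives 0 functional points
    for name, rate in pass_rates.items():
        if name not in FUNCTIONAL_POINTS:
            raise ValueError(f"unknown subcategory: {name}")
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"pass rate out of range for {name}: {rate}")
    score = sum(
        points * pass_rates.get(name, 0.0)
        for name, points in FUNCTIONAL_POINTS.items()
    )
    if hardcoded:
        score = min(score, 10.0)  # hardcoded visible examples: at most 10
    return score
```

With this shape, a solution that passes only the public examples scores `functional_score({"public": 1.0})`, which is 8 points, matching the first guidance item.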

## 2. Build/typecheck/tooling validity — 15 points

Measures whether the generated artifact is a valid project in the target language.

Suggested allocation:

| Subcategory | Points |
| --- | ---: |
| Parses and formats | 3 |
| Typechecks or passes equivalent static validation | 4 |
| Resolves declared dependencies | 3 |
| Provides required entrypoint/API | 2 |
| Satisfies the required lint or compiler-warning policy | 2 |
| Uses canonical project structure | 1 |

Guidance:

- Syntax errors normally lose all parse/format points.
- Undeclared dependencies lose the dependency-resolution points in this category and the undeclared-dependency point under reproducibility.
- If a language lacks a typechecker, use the task’s closest static validation criteria.
- Warnings count only when the task or language adapter declares them relevant.

## 3. Repair efficiency — 10 points

Measures how efficiently the solution reaches its final quality under repair modes.

Suggested allocation:

| Subcategory | Points |
| --- | ---: |
| Reaches best solution in few iterations | 4 |
| Uses diagnostic feedback correctly | 2 |
| Avoids regressions during repair | 2 |
| Low token/tool-call cost relative to task peers | 2 |

Default iteration scoring for modes with a 5-iteration repair budget:

| Best valid state reached by | Iteration points |
| --- | ---: |
| Initial generation | 4 |
| Iteration 1 | 3 |
| Iteration 2 | 2 |
| Iteration 3 | 1 |
| Iteration 4 or 5 | 0.5 |
| Not reached | 0 |
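
The table above can be expressed as a lookup. The sketch below is illustrative only; `iteration_points` is a hypothetical name, and it encodes the initial generation as iteration 0 and "not reached" as `None`.

```python
from typing import Optional

# Iteration points for modes with a 5-iteration repair budget.
# Key 0 stands for the initial generation (an encoding chosen for this sketch).
_ITERATION_POINTS = {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0, 4: 0.5, 5: 0.5}


def iteration_points(best_state_iteration: Optional[int]) -> float:
    """Return iteration points for the iteration at which the best valid state was reached.

    Pass None when no valid state was reached within the budget.
    """
    if best_state_iteration is None:
        return 0.0
    if best_state_iteration not in _ITERATION_POINTS:
        raise ValueError("iteration must be between 0 and 5 for the default budget")
    return _ITERATION_POINTS[best_state_iteration]
```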

Guidance:

- One-shot mode should report repair cost separately; if a composite score is required, award iteration points based on the initial generation only.
- A repair that fixes one issue while breaking prior passing tests should lose regression points.
- Excessive full rewrites during repair should also be penalized under maintainability.

## 4. Safety/security — 10 points

Measures vulnerability avoidance and safe use of effects.

Suggested allocation:

| Subcategory | Points |
| --- | ---: |
| No critical/high vulnerabilities | 4 |
| Correct handling of untrusted input | 2 |
| Safe use of filesystem/network/process/secrets | 2 |
| Explicit error and authorization behavior | 1 |
| Minimal unsafe escape-hatch use | 1 |

Severity guidance:

- Any critical vulnerability caps total security points at 2.
- Any high vulnerability caps total security points at 5.
- On a security-focused task, the total score may also be capped at 60 when a critical vulnerability is present (see the severe security failure cap).
- Unsafe escape hatches are not automatically disallowed, but they must be justified by task requirements and isolated.
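
The two category-level caps above could be applied as follows. This is a sketch, assuming findings are tagged with a severity string; `security_points` and the severity names `"critical"` and `"high"` are illustrative choices, not a defined interface.

```python
def security_points(raw_points: float, severities: list[str]) -> float:
    """Apply the severity caps to the 10-point safety/security category.

    severities lists the severity of each finding, e.g. ["high", "low"].
    A critical finding takes precedence because its cap is lower.
    """
    cap = 10.0
    if "critical" in severities:
        cap = 2.0  # any critical vulnerability caps security points at 2
    elif "high" in severities:
        cap = 5.0  # any high vulnerability caps security points at 5
    return min(raw_points, cap)
```

A cap only lowers a score: a solution that earned 4 raw points with one high finding keeps 4, not 5.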

Common vulnerability labels:

- `command_injection`
- `query_injection`
- `path_traversal`
- `unsafe_deserialization`
- `secret_leakage`
- `authorization_bypass`
- `ssrf_like_network_access`
- `insecure_randomness`
- `race_condition`
- `resource_leak`

## 5. Maintainability/auditability — 10 points

Measures whether humans and agents can understand and safely modify the solution.

Suggested allocation:

| Subcategory | Points |
| --- | ---: |
| Clear structure and modularity | 2 |
| Stable and coherent interfaces | 2 |
| Readable naming and formatting | 1 |
| Appropriate contracts/tests/examples | 2 |
| Minimal unnecessary complexity | 1 |
| Explicit assumptions and effects | 1 |
| Focused changes for maintenance tasks | 1 |

Guidance:

- Obfuscated or needlessly clever code should lose maintainability points even if tests pass.
- Broad use of dynamic values, unchecked casts, or reflection-like behavior should be penalized when safer alternatives exist.
- Comments are helpful only when accurate; misleading comments should be penalized.
- Generated-code provenance may improve auditability when available.

## 6. Performance/resource use — 10 points

Measures whether the solution satisfies task-level time and memory requirements.

Suggested allocation:

| Subcategory | Points |
| --- | ---: |
| Meets runtime limit | 4 |
| Meets memory limit | 2 |
| Uses appropriate asymptotic complexity | 2 |
| Avoids avoidable excessive allocation/I/O | 1 |
| Deterministic performance under repeated runs | 1 |

Guidance:

- A timeout receives 0 runtime points and usually loses complexity points.
- Performance should be judged against the task’s declared constraints, not against general assumptions about how fast a language is.
- If performance is irrelevant to a task, either redistribute these points before the run or award them for basic non-pathological behavior.

## 7. Reproducibility — 5 points

Measures whether the solution can be rebuilt and rerun from declared artifacts.

Suggested allocation:

| Subcategory | Points |
| --- | ---: |
| Complete manifest/build metadata | 1 |
| No undeclared dependencies or tools | 1 |
| Deterministic tests and outputs | 1 |
| Environment requirements declared | 1 |
| No hidden network/global-state dependence | 1 |

Guidance:

- Missing manifests or dependency declarations should be penalized even if the evaluator can run the code locally.
- Generated lockfiles should be created by real package tools, not handwritten.
- Tests depending on wall-clock time, locale, timezone, or random order must control those inputs.
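
One common way to control time and randomness in tests is to pass them in as arguments instead of reading global state. The example below is a sketch of that pattern; `make_receipt_id` is a hypothetical function under test, not part of the benchmark.

```python
import random
from datetime import datetime, timezone


def make_receipt_id(now: datetime, rng: random.Random) -> str:
    """Hypothetical function under test.

    It depends on time and randomness only through its arguments, so a test
    can pin both. The timestamp is normalized to UTC to avoid timezone drift.
    """
    suffix = rng.randrange(10_000)
    return f"{now.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}-{suffix:04d}"


def test_receipt_id_is_reproducible() -> None:
    # Fixed, timezone-aware clock value instead of datetime.now().
    fixed_now = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    # Explicitly seeded generators instead of the module-level random state.
    first = make_receipt_id(fixed_now, random.Random(1234))
    second = make_receipt_id(fixed_now, random.Random(1234))
    assert first == second
    assert first.startswith("20240102T030405Z-")
```

Test order and locale can be handled the same way: fix them in the test harness configuration rather than relying on the evaluator's environment.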

## Caps and penalties

### Non-runnable solution

If the solution cannot be built or executed at all, total score is capped at 25 unless the task is documentation-only.

### Protocol violation

If the solution modifies benchmark tests, reads hidden tests, disables scoring checks, or violates sandbox rules, total score is capped at 10 and may be recorded as invalid.

### Unsupported language feature

If the new language prototype cannot express a required task feature, record the task as unsupported. Unsupported tasks should be reported separately from failed generated attempts.

### Severe security failure

If a task’s primary purpose is security and the solution contains the vulnerability it was designed to prevent, total score is capped at 60 even if other behavior works.

### Hardcoded visible examples

If a solution appears to hardcode public examples rather than implement general behavior, total score is capped at 35.

### Missing dependency declaration

If a solution imports or uses a third-party dependency without declaring it, the total score is capped at 80 and the reproducibility point for undeclared dependencies is lost. If that dependency is required for the build, the affected build/tooling points are also lost.
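
The numeric caps in this section can be applied to a raw total in one step. The sketch below makes an assumption the rubric does not state: when several caps apply, the lowest one wins. The cap keys and `apply_caps` are hypothetical names. The documentation-only exception and the unsupported-task case are not modeled, because they change how a task is recorded rather than its number.

```python
# Total-score caps from this section; key names are illustrative.
CAPS = {
    "non_runnable": 25,
    "protocol_violation": 10,
    "severe_security_failure": 60,
    "hardcoded_examples": 35,
    "missing_dependency": 80,
}


def apply_caps(raw_total: float, triggered: set[str]) -> tuple[float, list[str]]:
    """Return the capped total and the sorted list of caps that were triggered.

    Assumption: with multiple triggered caps, the lowest cap applies.
    """
    unknown = triggered - set(CAPS)
    if unknown:
        raise ValueError(f"unknown caps: {sorted(unknown)}")
    if not triggered:
        return raw_total, []
    cap = min(CAPS[name] for name in triggered)
    return min(raw_total, cap), sorted(triggered)
```

Returning the triggered caps alongside the number makes it easy to include cap explanations in the report.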

## Defect labels

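Use these labels consistently in reports. Security findings may also use the more specific vulnerability labels listed under Safety/security.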

- `syntax_error`
- `type_error`
- `missing_dependency`
- `unknown_symbol`
- `wrong_api_version`
- `cross_file_mismatch`
- `unhandled_error`
- `null_or_optional_misuse`
- `runtime_exception`
- `logic_error`
- `edge_case_failure`
- `performance_timeout`
- `memory_limit`
- `security_vulnerability`
- `command_injection`
- `query_injection`
- `path_traversal`
- `unsafe_deserialization`
- `secret_leakage`
- `authorization_bypass`
- `concurrency_error`
- `race_condition`
- `resource_leak`
- `test_overfit`
- `nondeterminism`
- `poor_maintainability`
- `protocol_violation`

## Reviewer calibration

Before scoring a benchmark batch, reviewers should calibrate on a small shared sample:

1. Score three generated solutions independently.
2. Compare category scores.
3. Discuss disagreements using rubric text.
4. Clarify task-specific interpretations.
5. Record any rubric adjustment before scoring the full batch.
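
For step 2, a per-category spread makes disagreements easy to spot. The sketch below is one possible approach; `category_spread`, `needs_discussion`, and the 2-point threshold are hypothetical and should be tuned per batch.

```python
def category_spread(
    scores_by_reviewer: dict[str, dict[str, float]],
) -> dict[str, float]:
    """Return max minus min score per category across reviewers, for one solution."""
    categories: set[str] = set()
    for scores in scores_by_reviewer.values():
        categories.update(scores)
    spread = {}
    for category in sorted(categories):
        values = [
            scores[category]
            for scores in scores_by_reviewer.values()
            if category in scores
        ]
        spread[category] = max(values) - min(values)
    return spread


def needs_discussion(spread: dict[str, float], threshold: float = 2.0) -> list[str]:
    """List categories whose spread meets the (assumed) discussion threshold."""
    return [category for category, value in spread.items() if value >= threshold]


# Example with three reviewers scoring the same solution.
scores = {
    "reviewer_a": {"functional": 30, "safety": 8},
    "reviewer_b": {"functional": 36, "safety": 8},
    "reviewer_c": {"functional": 33, "safety": 7},
}
flagged = needs_discussion(category_spread(scores))
```

Here only `functional` is flagged, since its scores differ by 6 points while `safety` differs by 1.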

## Reporting

Reports should include:

- category scores,
- total score,
- defect labels,
- cap/penalty explanations,
- repair iteration count,
- notable diagnostics,
- reviewer notes for maintainability/security,
- links or digests for artifacts.
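
A structured record keeps these fields consistent across reviewers. The dataclass below is a sketch of one possible shape; the class name, field names, and `task_id` are assumptions, not a defined report schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ScoreReport:
    """Illustrative report record covering the items listed above."""

    task_id: str  # hypothetical identifier for the scored task
    category_scores: dict[str, float]
    total_score: float
    defect_labels: list[str] = field(default_factory=list)
    cap_explanations: list[str] = field(default_factory=list)
    repair_iterations: int = 0
    notable_diagnostics: list[str] = field(default_factory=list)
    reviewer_notes: str = ""
    artifact_digests: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys give stable output, which makes reports easy to diff.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```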

The score should support design learning, not only ranking. Failure explanations are as important as numeric totals.