# Scoring Rubric ## Overview The default benchmark score is 100 points. The score measures generated software quality across correctness, tooling validity, repair efficiency, safety, maintainability, performance, and reproducibility. Task-specific rubrics may adjust weights, but any adjustment must be declared before evaluation. ## Default 100-point allocation | Category | Points | | --- | ---: | | Functional correctness | 40 | | Build/typecheck/tooling validity | 15 | | Repair efficiency | 10 | | Safety/security | 10 | | Maintainability/auditability | 10 | | Performance/resource use | 10 | | Reproducibility | 5 | ## 1. Functional correctness — 40 points Functional correctness measures whether the solution implements the specified behavior. Suggested allocation: | Subcategory | Points | | --- | ---: | | Public examples/tests | 8 | | Hidden normal-case tests | 12 | | Hidden edge-case tests | 10 | | Property/invariant tests | 6 | | Error behavior tests | 4 | Guidance: - Passing public examples alone should not exceed 8 points. - Hidden edge cases should reward general reasoning rather than test memorization. - A solution that cannot run receives 0 functional points. - A solution that hardcodes visible examples receives at most 10 functional points unless hidden tests show general behavior. - Incorrect error behavior should be penalized even if successful inputs work. ## 2. Build/typecheck/tooling validity — 15 points Measures whether the generated artifact is a valid project in the target language. Suggested allocation: | Subcategory | Points | | --- | ---: | | Parses and formats | 3 | | Typechecks or passes equivalent static validation | 4 | | Resolves declared dependencies | 3 | | Provides required entrypoint/API | 2 | | Passes required lint or compiler warnings policy | 2 | | Uses canonical project structure | 1 | Guidance: - Syntax errors normally lose all parse/format points. - Undeclared dependencies lose dependency points and reproducibility points. - If a language lacks a typechecker, use the task’s closest static validation criteria. - Warnings count only when the task or language adapter declares them relevant. ## 3. Repair efficiency — 10 points Measures how efficiently the solution reaches its final quality under repair modes. Suggested allocation: | Subcategory | Points | | --- | ---: | | Reaches best solution in few iterations | 4 | | Uses diagnostic feedback correctly | 2 | | Avoids regressions during repair | 2 | | Low token/tool-call cost relative to task peers | 2 | Default iteration scoring for modes with a 5-iteration repair budget: | Best valid state reached by | Iteration points | | --- | ---: | | Initial generation | 4 | | Iteration 1 | 3 | | Iteration 2 | 2 | | Iteration 3 | 1 | | Iteration 4 or 5 | 0.5 | | Not reached | 0 | Guidance: - One-shot mode should report repair cost separately; if a composite score is required, award iteration points based on initial state only. - A repair that fixes one issue while breaking prior passing tests should lose regression points. - Excessive full rewrites should be penalized in maintainability as well. ## 4. Safety/security — 10 points Measures vulnerability avoidance and safe use of effects. Suggested allocation: | Subcategory | Points | | --- | ---: | | No critical/high vulnerabilities | 4 | | Correct handling of untrusted input | 2 | | Safe use of filesystem/network/process/secrets | 2 | | Explicit error and authorization behavior | 1 | | Minimal unsafe escape-hatch use | 1 | Severity guidance: - Any critical vulnerability caps total security points at 2. - Any high vulnerability caps total security points at 5. - A security-focused task may cap the whole score at 60 if a critical vulnerability is present. - Unsafe escape hatches are not automatically disallowed, but they must be justified by task requirements and isolated. Common vulnerability labels: - `command_injection` - `query_injection` - `path_traversal` - `unsafe_deserialization` - `secret_leakage` - `authorization_bypass` - `ssrf_like_network_access` - `insecure_randomness` - `race_condition` - `resource_leak` ## 5. Maintainability/auditability — 10 points Measures whether humans and agents can understand and safely modify the solution. Suggested allocation: | Subcategory | Points | | --- | ---: | | Clear structure and modularity | 2 | | Stable and coherent interfaces | 2 | | Readable naming and formatting | 1 | | Appropriate contracts/tests/examples | 2 | | Minimal unnecessary complexity | 1 | | Explicit assumptions and effects | 1 | | Focused changes for maintenance tasks | 1 | Guidance: - Obfuscated or needlessly clever code should lose maintainability points even if tests pass. - Broad use of dynamic values, unchecked casts, or reflection-like behavior should be penalized when safer alternatives exist. - Comments are helpful only when accurate; misleading comments should be penalized. - Generated-code provenance may improve auditability when available. ## 6. Performance/resource use — 10 points Measures whether the solution satisfies task-level time and memory requirements. Suggested allocation: | Subcategory | Points | | --- | ---: | | Meets runtime limit | 4 | | Meets memory limit | 2 | | Uses appropriate asymptotic complexity | 2 | | Avoids avoidable excessive allocation/I/O | 1 | | Deterministic performance under repeated runs | 1 | Guidance: - A timeout receives 0 runtime points and usually loses complexity points. - Performance should be judged against the task’s declared constraints, not against absolute language stereotypes. - If performance is irrelevant to a task, redistribute these points before the run or award based on basic non-pathological behavior. ## 7. Reproducibility — 5 points Measures whether the solution can be rebuilt and rerun from declared artifacts. Suggested allocation: | Subcategory | Points | | --- | ---: | | Complete manifest/build metadata | 1 | | No undeclared dependencies or tools | 1 | | Deterministic tests and outputs | 1 | | Environment requirements declared | 1 | | No hidden network/global-state dependence | 1 | Guidance: - Missing manifests or dependency declarations should be penalized even if the evaluator can run the code locally. - Generated lockfiles should be created by real package tools, not handwritten. - Tests depending on wall-clock time, locale, timezone, or random order must control those inputs. ## Caps and penalties ### Non-runnable solution If the solution cannot be built or executed at all, total score is capped at 25 unless the task is documentation-only. ### Protocol violation If the solution modifies benchmark tests, reads hidden tests, disables scoring checks, or violates sandbox rules, total score is capped at 10 and may be recorded as invalid. ### Unsupported language feature If the new language prototype cannot express a required task feature, record the task as unsupported. Unsupported tasks should be reported separately from failed generated attempts. ### Severe security failure If a task’s primary purpose is security and the solution contains the vulnerability it was designed to prevent, total score is capped at 60 even if other behavior works. ### Hardcoded visible examples If a solution appears to hardcode public examples rather than implement general behavior, total score is capped at 35. ### Missing dependency declaration If a solution imports or uses a third-party dependency without declaring it, total score is capped at 80 and reproducibility dependency points are lost. If that dependency is required for build, build points are also lost. ## Defect labels Use these labels consistently in reports: - `syntax_error` - `type_error` - `missing_dependency` - `unknown_symbol` - `wrong_api_version` - `cross_file_mismatch` - `unhandled_error` - `null_or_optional_misuse` - `runtime_exception` - `logic_error` - `edge_case_failure` - `performance_timeout` - `memory_limit` - `security_vulnerability` - `command_injection` - `query_injection` - `path_traversal` - `unsafe_deserialization` - `secret_leakage` - `authorization_bypass` - `concurrency_error` - `race_condition` - `resource_leak` - `test_overfit` - `nondeterminism` - `poor_maintainability` - `protocol_violation` ## Reviewer calibration Before scoring a benchmark batch, reviewers should calibrate on a small shared sample: 1. Score three generated solutions independently. 2. Compare category scores. 3. Discuss disagreements using rubric text. 4. Clarify task-specific interpretations. 5. Record any rubric adjustment before scoring the full batch. ## Reporting Reports should include: - category scores, - total score, - defect labels, - cap/penalty explanations, - repair iteration count, - notable diagnostics, - reviewer notes for maintainability/security, - links or digests for artifacts. The score should support design learning, not only ranking. Failure explanations are as important as numeric totals.