# Benchmark Task Specification Template

Use this template when converting a catalog entry into an executable benchmark task. Every field should be completed before the task is included in an official benchmark run.

## Metadata

- **Task ID**:
- **Title**:
- **Suite version**:
- **Task version**:
- **Tier**: micro / meso / macro / maintenance
- **Category**:
- **Evaluation modes**:
- **Author/reviewer**:
- **Last reviewed**:

## Goal

Describe the user-visible outcome in one or two paragraphs.

## Required behavior

List exact behavioral requirements.

1.
2.
3.

## Inputs

Describe all inputs, including files, command-line arguments, API requests, environment variables, fixtures, and stdin.

## Outputs

Describe all outputs, including return values, stdout/stderr, files, API responses, logs, and error formats.

## Public examples

Provide visible examples that clarify expected behavior without exhausting hidden cases.

### Example 1

Input:

```text
```

Expected output:

```text
```

### Example 2

Input:

```text
```

Expected output:

```text
```

## Edge cases

List edge cases that are in scope. Hidden tests may cover these.

- Empty input:
- Invalid input:
- Boundary values:
- Duplicate values:
- Ordering:
- Platform assumptions:
- Concurrency timing:
- Security boundary:

## Error behavior

Specify how errors must be represented, returned, printed, or propagated.

Include:

- invalid input errors,
- missing resource errors,
- permission errors,
- parse errors,
- validation errors,
- internal errors,
- whether processing should continue after recoverable errors.

## Security constraints

Specify trust boundaries and forbidden behavior.

Include where relevant:

- attacker-controlled inputs,
- filesystem root boundaries,
- network restrictions,
- shell/subprocess restrictions,
- secret-handling rules,
- authorization rules,
- deserialization rules,
- logging rules.

## Performance constraints

Specify input size, runtime limit, memory limit, and expected complexity where applicable.

## Allowed dependencies

State whether third-party dependencies are allowed.

- Python:
- Rust:
- TypeScript:
- New language:

If dependencies are allowed, define acceptable categories and version policy.

## Required project shape

Define expected files, entrypoint, command, or callable API for each language.

- Python:
- Rust:
- TypeScript:
- New language:

## Build and test commands

Define the exact commands that the harness should run for each language once executable benchmark tooling exists.

- Python build/check:
- Python test:
- Rust build/check:
- Rust test:
- TypeScript build/check:
- TypeScript test:
- New language build/check:
- New language test:

## Public tests

Describe public tests included with the prompt or fixture.

## Hidden tests

Describe the behavioral classes covered by hidden tests without revealing exact cases.

## Property tests

Describe properties to check, generators, seeds, and shrinking expectations where applicable.

## Performance tests

Describe performance fixture generation and measurement method.

## Security tests

Describe security checks, exploit attempts, or static analysis expectations.

## Scoring weights

Use the default rubric unless this task requires declared adjustments.

| Category | Points |
| --- | ---: |
| Functional correctness | 40 |
| Build/typecheck/tooling validity | 15 |
| Repair efficiency | 10 |
| Safety/security | 10 |
| Maintainability/auditability | 10 |
| Performance/resource use | 10 |
| Reproducibility | 5 |

## Defect labels to watch

List expected defect labels for this task.

-
-
-

## Fairness notes

Explain any language-specific concerns and how prompts/tooling avoid unfair advantage.

## Contamination notes

Explain why the task is not a direct copy of a common public benchmark and how variants or hidden tests reduce memorization.

## Human review notes

Specify whether human review is required and what reviewers should focus on.

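
### Sketch: checking declared scoring weights

The default rubric in the Scoring weights table adds up to 100 points. When a task declares adjusted weights, a small check along the following lines could catch a table that no longer sums correctly. This is only a sketch: `DEFAULT_WEIGHTS` and `validate_weights` are hypothetical names, and the rule that adjusted weights must keep the same categories and the same 100-point total is an assumption, not something this template mandates.

```python
# Default rubric copied from the Scoring weights table.
DEFAULT_WEIGHTS = {
    "Functional correctness": 40,
    "Build/typecheck/tooling validity": 15,
    "Repair efficiency": 10,
    "Safety/security": 10,
    "Maintainability/auditability": 10,
    "Performance/resource use": 10,
    "Reproducibility": 5,
}


def validate_weights(weights, expected_total=100):
    """Return a list of problems in a declared weight table (hypothetical helper).

    Assumption: adjusted weights keep the default categories and total.
    """
    problems = []
    missing = sorted(set(DEFAULT_WEIGHTS) - set(weights))
    extra = sorted(set(weights) - set(DEFAULT_WEIGHTS))
    if missing:
        problems.append(f"missing categories: {', '.join(missing)}")
    if extra:
        problems.append(f"unknown categories: {', '.join(extra)}")
    negative = sorted(name for name, points in weights.items() if points < 0)
    if negative:
        problems.append(f"negative weights: {', '.join(negative)}")
    total = sum(weights.values())
    if total != expected_total:
        problems.append(f"weights sum to {total}, expected {expected_total}")
    return problems


# Usage: the default rubric passes; raising one category without lowering another does not.
adjusted = {**DEFAULT_WEIGHTS, "Functional correctness": 50}
print(validate_weights(DEFAULT_WEIGHTS))  # []
print(validate_weights(adjusted))         # ['weights sum to 110, expected 100']
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue with a declared table in a single pass.
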
## Acceptance checklist

A task is ready for official use when:

- Semantics are equivalent across languages.
- Public examples are clear.
- Hidden tests check specified behavior only.
- Dependencies are declared.
- Build/test commands are defined.
- Security and performance constraints are explicit where relevant.
- Scoring weights are declared.
- Defect labels are selected.
- Fairness concerns are documented.
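
### Sketch: flagging unfilled fields

Because every field should be completed before a task enters an official run, a pre-review script could flag list items that still end in a bare colon. The following is a minimal sketch, assuming the filled-in specification is stored as Markdown with one field per line; `EMPTY_FIELD` and `find_empty_fields` are hypothetical names, and the check does not cover free-text sections or empty bullets.

```python
import re

# Matches list items such as "- **Task ID**:" or "- Python:" with nothing after the colon.
EMPTY_FIELD = re.compile(r"^\s*-\s+(?:\*\*)?([^*:\n]+?)(?:\*\*)?\s*:\s*$")


def find_empty_fields(markdown_text):
    """Return (line_number, field_name) pairs for fields left blank (hypothetical helper)."""
    empty = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        match = EMPTY_FIELD.match(line)
        if match:
            empty.append((number, match.group(1).strip()))
    return empty


# Usage with a small, made-up excerpt of a partially completed specification.
sample = "\n".join([
    "- **Task ID**: T-001",
    "- **Title**:",
    "- Python:",
    "- Rust: cargo test",
])
print(find_empty_fields(sample))  # [(2, 'Title'), (3, 'Python')]
```
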