# Benchmark Task Specification Template

Use this template when converting a catalog entry into an executable benchmark task. Every field should be completed before the task is included in an official benchmark run.

## Metadata

- **Task ID**:
- **Title**:
- **Suite version**:
- **Task version**:
- **Tier**: micro / meso / macro / maintenance
- **Category**:
- **Evaluation modes**:
- **Author/reviewer**:
- **Last reviewed**:

## Goal

Describe the user-visible outcome in one or two paragraphs.

## Required behavior

List the exact behavioral requirements, one independently testable requirement per item.

1.
2.
3.

## Inputs

Describe all inputs, including files, command-line arguments, API requests, environment variables, fixtures, and stdin.

## Outputs

Describe all outputs, including return values, stdout/stderr, files, API responses, logs, and error formats.

## Public examples

Provide visible examples that clarify expected behavior without exhausting hidden cases.

### Example 1

Input:

```text
```

Expected output:

```text
```

### Example 2

Input:

```text
```

Expected output:

```text
```

## Edge cases

List the edge cases that are in scope. Hidden tests may cover any of them.

- Empty input:
- Invalid input:
- Boundary values:
- Duplicate values:
- Ordering:
- Platform assumptions:
- Concurrency and timing:
- Security boundary:

## Error behavior

Specify how errors must be represented, returned, printed, or propagated.

Include:

- invalid input errors,
- missing resource errors,
- permission errors,
- parse errors,
- validation errors,
- internal errors,
- whether processing should continue after recoverable errors.
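
As an illustration of the last item, the sketch below shows one way a task could represent recoverable parse errors as structured records while continuing to process the remaining input. The function names, the `kind`/`line`/`message` keys, and the integer-per-line input are all hypothetical; a real task defines its own error format.

```python
import json


def process_records(lines):
    """Parse one integer per line, collecting errors instead of stopping."""
    results, errors = [], []
    for number, line in enumerate(lines, start=1):
        try:
            results.append(int(line))
        except ValueError:
            # Recoverable error: record it and continue with the next line.
            errors.append({
                "kind": "parse_error",  # hypothetical error category name
                "line": number,
                "message": f"not an integer: {line!r}",
            })
    return results, errors


def format_errors(errors):
    """Render errors as one JSON object per line, e.g. for stderr."""
    return "\n".join(json.dumps(error, sort_keys=True) for error in errors)
```

Writing the format down this precisely lets hidden tests assert on error kinds and positions rather than on free-form message text.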

## Security constraints

Specify trust boundaries and forbidden behavior.

Include where relevant:

- attacker-controlled inputs,
- filesystem root boundaries,
- network restrictions,
- shell/subprocess restrictions,
- secret-handling rules,
- authorization rules,
- deserialization rules,
- logging rules.
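
A filesystem root boundary is easier to test when it is stated as a concrete check. The following is a minimal sketch, assuming the task confines all file access to a single root directory; `resolve_inside_root` is a hypothetical helper name, not part of any harness.

```python
from pathlib import Path


def resolve_inside_root(root, untrusted):
    """Resolve an attacker-controlled path and reject it if it leaves root."""
    root_path = Path(root).resolve()
    # Joining an absolute path discards root_path, so the check below
    # also rejects inputs such as "/etc/passwd".
    candidate = (root_path / untrusted).resolve()
    try:
        candidate.relative_to(root_path)
    except ValueError:
        raise PermissionError(f"path escapes root: {untrusted!r}") from None
    return candidate
```

Because `resolve()` follows symlinks for path components that exist, a symlink inside the root that points outside it is rejected as well. Whether that is the desired behavior is something the task specification should state.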

## Performance constraints

Specify the maximum input size, runtime limit, memory limit, and expected time or space complexity where applicable, including units.

## Allowed dependencies

State whether third-party dependencies are allowed.

- Python:
- Rust:
- TypeScript:
- New language:

If dependencies are allowed, define acceptable categories and version policy.

## Required project shape

Define the expected files, entry point, command, or callable API for each language.

- Python:
- Rust:
- TypeScript:
- New language:

## Build and test commands

Define the exact commands that the harness should run for each language once executable benchmark tooling exists.

- Python build/check:
- Python test:
- Rust build/check:
- Rust test:
- TypeScript build/check:
- TypeScript test:
- New language build/check:
- New language test:

## Public tests

Describe public tests included with the prompt or fixture.

## Hidden tests

Describe the behavioral classes covered by hidden tests without revealing exact cases.

## Property tests

Describe properties to check, generators, seeds, and shrinking expectations where applicable.
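
The sketch below shows what a seeded property check might look like using only the Python standard library. The function under test, `dedupe_preserving_order`, and the generator bounds are hypothetical stand-ins. The sketch does not shrink failing inputs; it only reports the seed and case index so that a failure can be replayed.

```python
import random


def dedupe_preserving_order(items):
    """Hypothetical function under test: drop repeats, keep first occurrences."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def generate_case(rng, max_len=20, max_value=9):
    """Generator: a short list drawn from a small range so duplicates are common."""
    return [rng.randint(0, max_value) for _ in range(rng.randint(0, max_len))]


def check_properties(seed, cases=200):
    """Run the properties on `cases` inputs derived from a recorded seed."""
    rng = random.Random(seed)
    for index in range(cases):
        data = generate_case(rng)
        out = dedupe_preserving_order(data)
        context = f"seed={seed} case={index} input={data}"
        assert len(out) == len(set(out)), context            # no duplicates
        assert set(out) == set(data), context                # nothing lost or invented
        assert out == sorted(set(data), key=data.index), context  # first-occurrence order
        assert dedupe_preserving_order(out) == out, context  # idempotent
    return cases
```

Recording the seed in the task specification makes a failing run reproducible. If shrinking is expected, a property-testing library such as Hypothesis for Python would be a more suitable base than this hand-rolled loop.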

## Performance tests

Describe how performance fixtures are generated and how measurements are taken.
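
One possible shape for such a description is sketched here: a fixture generated deterministically from a seed, and a measurement that reports the median of several wall-clock samples. The names `make_fixture` and `measure`, the value range, and the repeat count are assumptions for illustration only.

```python
import random
import statistics
import time


def make_fixture(size, seed):
    """Generate the same list of integers for a given (size, seed) pair."""
    rng = random.Random(seed)
    return [rng.randrange(1_000_000) for _ in range(size)]


def measure(func, fixture, repeats=5):
    """Return the median wall-clock time in seconds over several runs."""
    samples = []
    for _ in range(repeats):
        data = list(fixture)  # fresh copy so in-place functions see identical input
        start = time.perf_counter()
        func(data)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

For example, `measure(sorted, make_fixture(100_000, seed=1))` times the built-in sort on a reproducible input. The median is used because it is less sensitive to a single slow run than the mean.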

## Security tests

Describe security checks, exploit attempts, or static analysis expectations.

## Scoring weights

Use the default rubric below unless this task needs different weights. Declare any adjusted weights here.

| Category | Points |
| --- | ---: |
| Functional correctness | 40 |
| Build/typecheck/tooling validity | 15 |
| Repair efficiency | 10 |
| Safety/security | 10 |
| Maintainability/auditability | 10 |
| Performance/resource use | 10 |
| Reproducibility | 5 |
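
The default weights add up to 100 points. A minimal sketch of how they could be combined into a task score follows, assuming each category is graded as a fraction between 0.0 and 1.0; that grading scale and the `total_score` helper are assumptions, not part of the rubric.

```python
# Default rubric weights, copied from the table above.
DEFAULT_WEIGHTS = {
    "Functional correctness": 40,
    "Build/typecheck/tooling validity": 15,
    "Repair efficiency": 10,
    "Safety/security": 10,
    "Maintainability/auditability": 10,
    "Performance/resource use": 10,
    "Reproducibility": 5,
}


def total_score(fractions, weights=DEFAULT_WEIGHTS):
    """Combine per-category fractions (0.0 to 1.0) into a point total."""
    missing = set(weights) - set(fractions)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= fractions[name] <= 1.0:
            raise ValueError(f"fraction out of range for {name!r}")
    return sum(weights[name] * fractions[name] for name in weights)
```

A task that declares adjusted weights can pass its own mapping as `weights`. Checking that the adjusted values still sum to the same total keeps scores comparable across tasks.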

## Defect labels to watch

List expected defect labels for this task.

-
-
-

## Fairness notes

Explain any language-specific concerns and how the prompts and tooling avoid giving any language an unfair advantage.

## Contamination notes

Explain why the task is not a direct copy of a common public benchmark and how variants or hidden tests reduce memorization.

## Human review notes

Specify whether human review is required and what reviewers should focus on.

## Acceptance checklist

A task is ready for official use when:

- Semantics are equivalent across languages.
- Public examples are clear.
- Hidden tests check specified behavior only.
- Dependencies are declared.
- Build/test commands are defined.
- Security and performance constraints are explicit where relevant.
- Scoring weights are declared.
- Defect labels are selected.
- Fairness concerns are documented.
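
Before walking through the checklist, an author may want to scan a filled-in copy of this template for fields that were left blank. The sketch below is one rough way to do that. It assumes that a blank field is either a bullet whose label ends in a colon with nothing after it, or a bare list marker such as `-` or `1.`. The helper name and the `T-001` value in the usage example are hypothetical.

```python
import re

# A blank field: "- Label:" with no value, a bare "-", or a bare "1.".
EMPTY_FIELD = re.compile(r"^\s*(?:-\s+.*:\s*|-\s*|\d+\.\s*)$")


def find_blank_fields(markdown_text):
    """Return (line_number, stripped_line) pairs for fields left unfilled."""
    blanks = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if EMPTY_FIELD.match(line):
            blanks.append((number, line.strip()))
    return blanks


if __name__ == "__main__":
    sample = "- **Task ID**: T-001\n- **Title**:\n\n1.\n2. Returns sorted output\n-\n"
    for number, line in find_blank_fields(sample):
        print(f"line {number}: {line}")
```

The check is purely syntactic. It cannot tell whether a filled-in field is adequate, so it complements the checklist rather than replacing it.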