# PeliBench-on-a-Bicycle — Benchmark Design & Evaluation Methodology

**Version:** 1.0 (Milestone 1 deliverable)
**Status:** Approved for implementation
**Scope:** Design of a multi-model benchmark in which LLMs generate an SVG of "a pelican riding a bicycle," scored via Scorecard.io, with a frontend for visual comparison and user-driven prompt variation.

---

## 1. Goals & Non-Goals

### 1.1 Goals
- Run an identical generative task ("draw a pelican on a bicycle as SVG") across a heterogeneous model matrix and score results with a reproducible rubric.
- Instrument all metrics in Scorecard.io as the system of record for runs, testsets, and scores.
- Render every model's SVG output visually in a frontend gallery, side-by-side, with per-metric scores.
- Let users edit the prompt (or pick from a structured variation taxonomy), re-run the benchmark, and see effects on output and scores.
- Make adding a new model a configuration change rather than a code change whenever its API family already has an adapter (adapter spec, §3).

### 1.2 Non-Goals
- General-purpose model evaluation (reasoning, coding, etc.).
- Fine-tuning or training models.
- Real-time streaming of token-by-token generation to the frontend (we show final artifacts; streaming is a stretch goal).
- Pixel-perfect aesthetic judgment — the rubric (§5) deliberately mixes deterministic checks with LLM-as-judge scoring and acknowledges judge variance.

---

## 2. Model Matrix

| ID (slug) | Display Name | Provider | API Family | Notes |
|---|---|---|---|---|
| `claude-opus` | Claude Opus | Anthropic | Anthropic Messages | Latest available Opus version, pinned per run |
| `claude-fable` | Claude Fable | Anthropic | Anthropic Messages | Pinned snapshot |
| `claude-haiku` | Claude Haiku | Anthropic | Anthropic Messages | Cheap/fast tier baseline |
| `gpt-5.5` | GPT-5.5 | OpenAI | Chat Completions / Responses | Pinned snapshot |
| `qwen` | Qwen (latest large) | Alibaba / hosted | OpenAI-compatible | Via hosted endpoint (e.g., Together/Fireworks) |
| `gpt-oss` | gpt-oss | OpenAI (open weights) | OpenAI-compatible | Self-hosted or hosted inference |
| `gemini-flash-3.5` | Gemini Flash 3.5 | Google | Gemini API | Fast tier |
| `gemini-pro-3.0` | Gemini Pro 3.0 | Google | Gemini API | Quality tier |

**Version pinning policy:** Every run records the *exact* model version string returned by the provider (`model_version_resolved`), not just the alias. Comparisons across runs are only valid when resolved versions match; the frontend flags version drift.

**Sampling defaults (fixed across all models unless a variation axis overrides):**
- `temperature: 1.0`, `top_p: 1.0` (provider defaults where a parameter is unsupported)
- `max_output_tokens: 8192`
- `n_samples_per_model: 3` (median score reported; all 3 rendered in gallery)
- No system prompt beyond the canonical one in §4. Providers accept system prompts in different ways (or not at all, see `supports_system_prompt` in §3.2); the adapter handles the mapping.

---

## 3. Extensible Adapter Spec

Each model is described by a declarative config plus a thin adapter implementing one interface. Adding a model = one YAML entry + (if the API family is new) one adapter class.

### 3.1 Adapter interface (TypeScript)

```ts
export interface ModelAdapter {
  /** Slug, must match registry entry */
  readonly id: string;

  /** Resolve alias -> exact version string for reproducibility */
  resolveVersion(): Promise<string>;

  /** Single completion. Must be stateless and side-effect free. */
  generate(req: GenerationRequest): Promise<GenerationResult>;
}

export interface GenerationRequest {
  systemPrompt: string | null;
  userPrompt: string;
  temperature: number;
  topP: number;
  maxOutputTokens: number;
  seed?: number;            // passed through if provider supports it
  timeoutMs: number;        // harness enforces regardless
}

export interface GenerationResult {
  rawText: string;          // full model output, untouched
  modelVersionResolved: string;
  usage: { inputTokens: number; outputTokens: number };
  latencyMs: number;
  finishReason: "stop" | "length" | "content_filter" | "error";
  providerRequestId?: string;
}
```

### 3.2 Registry entry (YAML)

```yaml
- id: gemini-flash-3.5
  display_name: "Gemini Flash 3.5"
  adapter: gemini            # selects adapter class: anthropic | openai | openai_compatible | gemini
  model_alias: "gemini-flash-3.5"
  endpoint: null             # required for openai_compatible
  auth_env: GOOGLE_API_KEY
  pricing:                   # for cost metric, USD per 1M tokens
    input: 0.10
    output: 0.40
  capabilities:
    supports_system_prompt: true
    supports_seed: false
  rate_limit:
    rpm: 60
    concurrent: 4
  enabled: true
```

**Rule:** the harness never imports provider SDKs outside `adapters/`. SVG extraction, scoring, and Scorecard reporting operate only on `GenerationResult`.
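
To illustrate how a registry entry could select an adapter, here is a minimal sketch. `StubAdapter`, `buildAdapter`, `RegistryEntry`, and the `stub` family name are hypothetical and not part of the spec; the interfaces are repeated from §3.1 only so the block stands alone.

```typescript
// Interfaces repeated from §3.1 so this sketch is self-contained.
export interface GenerationRequest {
  systemPrompt: string | null;
  userPrompt: string;
  temperature: number;
  topP: number;
  maxOutputTokens: number;
  seed?: number;
  timeoutMs: number;
}

export interface GenerationResult {
  rawText: string;
  modelVersionResolved: string;
  usage: { inputTokens: number; outputTokens: number };
  latencyMs: number;
  finishReason: "stop" | "length" | "content_filter" | "error";
  providerRequestId?: string;
}

export interface ModelAdapter {
  readonly id: string;
  resolveVersion(): Promise<string>;
  generate(req: GenerationRequest): Promise<GenerationResult>;
}

/** Hypothetical subset of a registry entry (§3.2) needed to pick an adapter. */
export interface RegistryEntry {
  id: string;
  adapter: string;
  enabled: boolean;
}

/** Hypothetical offline adapter: returns a fixed SVG and calls no provider. */
export class StubAdapter implements ModelAdapter {
  readonly id: string;

  constructor(id: string) {
    this.id = id;
  }

  async resolveVersion(): Promise<string> {
    return `${this.id}@stub`;
  }

  async generate(_req: GenerationRequest): Promise<GenerationResult> {
    const started = Date.now();
    return {
      rawText: '<svg viewBox="0 0 512 512"></svg>',
      modelVersionResolved: await this.resolveVersion(),
      usage: { inputTokens: 0, outputTokens: 0 },
      latencyMs: Date.now() - started,
      finishReason: "stop",
    };
  }
}

// One factory per API family; a new family adds one line here plus its class.
const ADAPTER_FACTORIES: Record<string, (entry: RegistryEntry) => ModelAdapter> = {
  stub: (entry) => new StubAdapter(entry.id),
};

/** Build the adapter named by the entry's `adapter` field, or fail loudly. */
export function buildAdapter(entry: RegistryEntry): ModelAdapter {
  const factory = ADAPTER_FACTORIES[entry.adapter];
  if (factory === undefined) {
    throw new Error(`unknown adapter family: ${entry.adapter}`);
  }
  return factory(entry);
}
```

A stub like this can also serve as a zero-cost model when testing the extraction and scoring stages, since those stages only see `GenerationResult`.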

---

## 4. Canonical Prompt & Variation Taxonomy

### 4.1 Canonical prompt (testcase id: `canonical-v1`)

> Generate an SVG image of a pelican riding a bicycle.
>
> Requirements:
> - Output only a single, complete, valid SVG document. No markdown fences, no explanation.
> - Use a `viewBox` of `0 0 512 512`.
> - The pelican must be recognizably a pelican (large beak with pouch) and must be positioned riding the bicycle.
> - The bicycle must have two wheels, a frame, handlebars, a seat, and pedals.

Canonical system prompt: `"You are an expert SVG illustrator. You respond with raw SVG markup only."`

**Versioning:** The canonical prompt is content-addressed: `prompt_hash = sha256(system + "\x00" + user)`. Every score in Scorecard carries the hash so prompt edits never silently pollute comparisons.
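
The hash formula above could be computed as in the following sketch. It uses Node's built-in `crypto` module; the function name `promptHash`, the hex encoding, and the treatment of a missing system prompt as the empty string are assumptions, not something the spec fixes.

```typescript
import { createHash } from "node:crypto";

/** Content-address a prompt as sha256(system + "\x00" + user), hex-encoded. */
export function promptHash(systemPrompt: string | null, userPrompt: string): string {
  // Assumption: a null system prompt hashes as the empty string.
  const system = systemPrompt ?? "";
  return createHash("sha256")
    .update(system + "\x00" + userPrompt, "utf8")
    .digest("hex");
}
```

The NUL separator matters: without it, the pairs `("ab", "c")` and `("a", "bc")` would concatenate to the same string and share a hash.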

### 4.2 Prompt-variation taxonomy

User-modified prompts are classified along orthogonal axes. Each variation is stored as `{base: canonical-v1, deltas: [...]}` so the frontend can show "what changed" and aggregate effects per axis.

| Axis | Code | Example values | Hypothesis tested |
|---|---|---|---|
| **Detail level** | `DET` | minimal ("draw a pelican on a bike, svg"), canonical, hyper-specified (colors, proportions) | Does specification help weaker models more? |
| **Style constraint** | `STY` | flat design, line art, pixel-art-in-svg, watercolor imitation, Bauhaus | Style adherence vs. structural collapse |
| **Technical constraint** | `TEC` | ≤30 elements, paths only (no primitives), single `<path>`, must use gradients, must animate (SMIL/CSS) | Constraint-following under structure pressure |
| **Compositional twist** | `CMP` | pelican doing a wheelie, two pelicans on a tandem, side view vs. front view, pelican wearing a helmet | Compositional generalization |
| **Format/output framing** | `FMT` | "output only SVG" vs. no instruction, JSON-wrapped SVG, fenced code block requested | Extraction robustness; instruction-following |
| **Adversarial/negation** | `ADV` | "no circles allowed", "the bicycle must NOT have a seat", swap subject ("bicycle riding a pelican") | Negation handling, literal vs. likely interpretation |

Free-form user prompts that don't match a taxonomy delta are tagged `CUSTOM` and still fully scored; they're excluded from per-axis aggregate charts.
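
A minimal sketch of how a stored variation might be mapped to chart axes. It assumes deltas use the same `AXIS:value` tag format as `taxonomy_tags` in §8 (for example `ADV:no-seat`); the names `PromptVariation`, `axisOf`, and `chartAxes` are hypothetical.

```typescript
export const AXES = ["DET", "STY", "TEC", "CMP", "FMT", "ADV"] as const;
export type Axis = (typeof AXES)[number];

/** A stored variation: the base prompt id plus taxonomy deltas. */
export interface PromptVariation {
  base: string;
  deltas: string[];
}

/** Axis code of a delta tag, or null if its prefix is not a known axis. */
export function axisOf(tag: string): Axis | null {
  const prefix = tag.split(":")[0];
  const hit = AXES.find((axis) => axis === prefix);
  return hit ?? null;
}

/** Distinct axes a variation contributes to; ["CUSTOM"] when none match. */
export function chartAxes(variation: PromptVariation): Array<Axis | "CUSTOM"> {
  const axes: Axis[] = [];
  for (const tag of variation.deltas) {
    const axis = axisOf(tag);
    if (axis !== null && !axes.includes(axis)) axes.push(axis);
  }
  return axes.length > 0 ? axes : ["CUSTOM"];
}
```

Per-axis aggregate charts would then skip any variation whose result is `["CUSTOM"]`, while its records are still scored as usual.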

---

## 5. Scoring Rubric

Total **100 points** across six dimensions. Dimensions 1–2 are **deterministic** (code-evaluated). Dimensions 3–6 are **judge-evaluated** (vision-capable LLM judge scoring the *rendered PNG* plus the SVG source), each on a defined 0–N scale with anchored descriptions.

### 5.1 Dimensions

| # | Dimension | Pts | Method |
|---|---|---|---|
| 1 | **SVG Validity** | 15 | Deterministic |
| 2 | **Renderability** | 10 | Deterministic |
| 3 | **Pelican Anatomical Completeness** | 25 | Judge (vision) |
| 4 | **Bicycle Structure** | 25 | Judge (vision) |
| 5 | **Composition** | 15 | Judge (vision) |
| 6 | **Creativity & Craft** | 10 | Judge (vision) |

#### 1. SVG Validity (15 pts, deterministic)
- +5: Output contains exactly one extractable SVG document (extraction pipeline: strip markdown fences → find first `<svg` to matching `</svg>`).
- +5: XML well-formed (passes strict XML parse).
- +3: Declares `viewBox` (any value; canonical 512×512 not required for variations).
- +2: No undefined references (`url(#id)` / `href="#id"` targets exist).
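
The four validity checks could be sketched as below. This is an illustration under stated assumptions, not the harness implementation: the XML well-formedness check is injected as `isWellFormed` (assumed to wrap a strict XML parser), the reference check is regex-based rather than DOM-based, and `extractionRequiredRepair` is assumed to mean "the extracted SVG differs from the trimmed raw output". All function names are hypothetical.

```typescript
export interface ValidityResult {
  svg: string | null; // first extracted document, if any
  score: number; // 0 to 15
  extractionRequiredRepair: boolean;
}

const FENCE = "`".repeat(3); // built at runtime so this block contains no literal fence

/** Return the contents of the first markdown code fence, or the text unchanged. */
function stripFences(raw: string): string {
  const fenced = new RegExp(FENCE + "[a-zA-Z]*[ \\t]*\\r?\\n([\\s\\S]*?)" + FENCE).exec(raw);
  const inner = fenced === null ? undefined : fenced[1];
  return inner === undefined ? raw : inner;
}

/** Top-level <svg>...</svg> spans; depth tracking keeps nested <svg> elements inside. */
function findSvgDocuments(text: string): string[] {
  const token = /<svg[\s>]|<\/svg\s*>/g;
  const docs: string[] = [];
  let depth = 0;
  let start = -1;
  let m: RegExpExecArray | null;
  while ((m = token.exec(text)) !== null) {
    if (text.startsWith("</", m.index)) {
      if (depth === 0) continue; // stray closing tag
      depth -= 1;
      if (depth === 0) docs.push(text.slice(start, token.lastIndex));
    } else {
      if (depth === 0) start = m.index;
      depth += 1;
    }
  }
  return docs;
}

/** Capture group 1 of every match of a global regex. */
function captures(re: RegExp, text: string): string[] {
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const value = m[1];
    if (value !== undefined) out.push(value);
  }
  return out;
}

export function scoreValidity(
  rawText: string,
  isWellFormed: (svg: string) => boolean, // assumption: backed by a strict XML parser
): ValidityResult {
  const docs = findSvgDocuments(stripFences(rawText));
  const svg = docs[0];
  if (svg === undefined) return { svg: null, score: 0, extractionRequiredRepair: false };

  let score = 0;
  if (docs.length === 1) score += 5; // exactly one extractable document
  if (isWellFormed(svg)) score += 5; // strict XML parse

  const rootTag = svg.slice(0, svg.indexOf(">") + 1);
  if (/\sviewBox\s*=/.test(rootTag)) score += 3; // any viewBox value counts

  const ids = new Set(captures(/\sid\s*=\s*["']([^"']+)["']/g, svg));
  const refs = [
    ...captures(/url\(\s*["']?#([^"')\s]+)["']?\s*\)/g, svg),
    ...captures(/href\s*=\s*["']#([^"']+)["']/g, svg),
  ];
  if (refs.every((ref) => ids.has(ref))) score += 2; // every reference resolves

  return { svg, score, extractionRequiredRepair: svg !== rawText.trim() };
}
```

When several top-level documents are present, this sketch scores the first one and simply withholds the "exactly one" points.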

#### 2. Renderability (10 pts, deterministic)
Render via headless `resvg` to 512×512 PNG.
- +5: Renders without error.
- +3: Non-blank output (≥1% of pixels differ from background).
- +2: Drawn content occupies ≥10% of viewBox bounding area (not a tiny blob in a corner).
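
As a sketch of the pixel-level checks, the function below scores an already-rendered RGBA buffer; producing that buffer with `resvg` is out of scope here. Two interpretations are assumptions: the background colour is supplied by the caller, and "occupies ≥10% of viewBox bounding area" is read as the bounding box of non-background pixels covering at least 10% of the canvas. The names `Rgba` and `scoreRenderability` are hypothetical.

```typescript
export type Rgba = readonly [number, number, number, number];

/**
 * Score renderability (0 to 10) from a row-major RGBA buffer.
 * Pass `null` for `pixels` when the renderer reported an error.
 */
export function scoreRenderability(
  pixels: Uint8Array | null,
  width: number,
  height: number,
  background: Rgba, // assumption: caller knows the render background
): number {
  if (pixels === null || pixels.length !== width * height * 4) return 0;
  let score = 5; // rendered without error

  let differing = 0;
  let minX = width;
  let minY = height;
  let maxX = -1;
  let maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      const same =
        pixels[i] === background[0] &&
        pixels[i + 1] === background[1] &&
        pixels[i + 2] === background[2] &&
        pixels[i + 3] === background[3];
      if (same) continue;
      differing += 1;
      if (x < minX) minX = x;
      if (x > maxX) maxX = x;
      if (y < minY) minY = y;
      if (y > maxY) maxY = y;
    }
  }

  const total = width * height;
  if (differing / total >= 0.01) score += 3; // non-blank
  if (differing > 0) {
    const bboxArea = (maxX - minX + 1) * (maxY - minY + 1);
    if (bboxArea / total >= 0.1) score += 2; // not a tiny blob
  }
  return score;
}
```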

#### 3. Pelican Anatomical Completeness (25 pts, judge)
Checklist scoring; judge marks each item present/absent on the rendered image:
- Body (5), Head (3), **Large beak with visible pouch — the pelican signature** (7), Eye (2), Wing(s) (3), Legs/feet positioned plausibly for riding (3), Overall "reads as a pelican, not a generic bird" (2).

#### 4. Bicycle Structure (25 pts, judge)
- Two wheels (6), wheels approximately round and similar size (3), frame connecting wheels (5), handlebars (4), seat/saddle (3), pedals/crank (3), "reads as a bicycle" gestalt (1).

#### 5. Composition (15 pts, judge)
- Pelican is *on* the bicycle (seated/contacting seat & handlebars region) (7), plausible scale relationship (4), scene coherence — no severe overlapping garbage, elements not floating disconnected (4).

#### 6. Creativity & Craft (10 pts, judge)
- Color use beyond default black (3), detail/polish (shading, background, accessories) (4), charm/originality (3). *Anchors:* 0 = monochrome stick figures; 10 = stylistically coherent, delightful illustration.

### 5.2 Judge protocol
- **Judge model:** one pinned vision-capable model (config: `judge.model`), *excluded from leaderboard claims about itself* — its self-scores are flagged `self_judged: true` in Scorecard and visually marked in the UI.
- **Inputs:** rendered PNG (primary) + raw SVG source (secondary, for tie-breaks like "is that shape a pedal").
- **Output contract:** strict JSON matching `JudgeVerdict` schema (see `schemas/judge-verdict.schema.json`); non-conforming judge output is retried up to 2×, then the record is marked `judge_failed` and excluded from aggregates.
- **Variance control:** judge temperature 0; each artifact judged once by default, 3× with median for "certified" leaderboard runs.
- **Variation-aware judging:** for `ADV`/`CMP` variations, the judge prompt includes the user's deltas so constraint-following is scored against the *modified* spec (e.g., under a "no seat" delta, the seat points are awarded when the seat is absent).
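
One way to turn a present/absent checklist into a dimension score, including the inversion needed for negated items, is sketched below. The weights are the Bicycle Structure points from §5.1; the item keys (`seat`, `two_wheels`, and so on) and the function name are hypothetical, since the `JudgeVerdict` schema is not reproduced here.

```typescript
/** Bicycle Structure weights from §5.1 (sum = 25). Keys are illustrative. */
export const BICYCLE_WEIGHTS: Record<string, number> = {
  two_wheels: 6,
  wheels_round_similar: 3,
  frame: 5,
  handlebars: 4,
  seat: 3,
  pedals: 3,
  gestalt: 1,
};

/**
 * Sum the weights of satisfied items. An item listed in `negated`
 * (e.g. "seat" under a "no seat" delta) is satisfied when it is absent.
 */
export function scoreChecklist(
  weights: Record<string, number>,
  marks: Record<string, boolean>,
  negated: ReadonlySet<string> = new Set<string>(),
): number {
  let total = 0;
  for (const [item, points] of Object.entries(weights)) {
    const present = marks[item] === true;
    const satisfied = negated.has(item) ? !present : present;
    if (satisfied) total += points;
  }
  return total;
}
```

Keeping the inversion in code rather than relying on the judge alone makes the "no seat" case auditable from the stored verdict.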

### 5.3 Auxiliary metrics (recorded, not in the 100-pt score)
- `latency_ms`, `output_tokens`, `cost_usd` (computed from token usage and registry pricing), `element_count`, `svg_bytes`, `extraction_required_repair` (bool).
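
For `cost_usd`, a straightforward reading is token usage multiplied by the per-1M-token prices from the registry entry. The sketch below assumes exactly that; the names `Pricing` and `costUsd` are hypothetical.

```typescript
/** USD per 1M tokens, as in the registry `pricing` block (§3.2). */
export interface Pricing {
  input: number;
  output: number;
}

/** Cost of one generation from its token usage. */
export function costUsd(
  usage: { inputTokens: number; outputTokens: number },
  pricing: Pricing,
): number {
  return (usage.inputTokens * pricing.input + usage.outputTokens * pricing.output) / 1_000_000;
}
```

For instance, 200 input tokens and 3,000 output tokens at the example prices of 0.10 and 0.40 come to (20 + 1,200) / 1,000,000 = 0.00122 USD.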

---

## 6. Scorecard.io Mapping

| PeliBench concept | Scorecard object | Notes |
|---|---|---|
| The benchmark | **Project** `pelibench` | Single project |
| Canonical prompt + each saved variation | **Testset**; each prompt = **Testcase** | Testcase `inputs`: `{system_prompt, user_prompt, prompt_hash, taxonomy_tags}` |
| One model × one prompt × one sample | **Record** in a **Run** | Run = one benchmark execution (selected models × one testcase × n samples), matching the single `prompt_id` in `POST /runs` (§9.3) |
| Each rubric dimension | **Metric** (custom, integer-range) | `svg_validity` (0–15), `renderability` (0–10), `pelican_anatomy` (0–25), `bicycle_structure` (0–25), `composition` (0–15), `creativity` (0–10), `total_score` (0–100) |
| Deterministic checks | Metrics scored client-side, pushed via API | Harness computes, attaches to record |
| Judge dimensions | Scorecard **AI-judge metric** where supported; otherwise client-scored with judge transcript attached as record metadata | Judge prompt + verdict JSON stored in `record.metadata.judge` |
| Auxiliary metrics | Numeric metrics: `latency_ms`, `cost_usd`, `svg_bytes`, `element_count` | Not part of `total_score` |
| Model identity | Record metadata: `{model_id, model_version_resolved, sampling_params}` | Enables Scorecard-side filtering/grouping by model |

**Run lifecycle:** harness creates Run → for each (model, testcase, sample): generate → extract → deterministic-score → render → judge → `POST record` with output (raw text + extracted SVG + PNG URL), all metric scores, and metadata → finalize Run. The frontend reads aggregates from our API, which proxies/caches Scorecard run data (we never expose the Scorecard API key to the browser).
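
Before a record is posted, the seven score metrics can be assembled and range-checked against the integer ranges in the table above. The following is a minimal sketch; `METRIC_MAX` and `withTotal` are hypothetical names.

```typescript
/** Upper bound of each dimension metric (lower bound is 0). */
export const METRIC_MAX = {
  svg_validity: 15,
  renderability: 10,
  pelican_anatomy: 25,
  bicycle_structure: 25,
  composition: 15,
  creativity: 10,
} as const;

export type Dimension = keyof typeof METRIC_MAX;
export type DimensionScores = Record<Dimension, number>;

/** Validate each dimension against its integer range and append total_score. */
export function withTotal(scores: DimensionScores): DimensionScores & { total_score: number } {
  let total = 0;
  for (const key of Object.keys(METRIC_MAX) as Dimension[]) {
    const value = scores[key];
    if (!Number.isInteger(value) || value < 0 || value > METRIC_MAX[key]) {
      throw new RangeError(`${key}=${value} is outside 0..${METRIC_MAX[key]}`);
    }
    total += value;
  }
  return { ...scores, total_score: total };
}
```

Rejecting out-of-range values client-side keeps a malformed judge verdict from reaching the system of record.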

---

## 7. System Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                            FRONTEND (Next.js)                      │
│  Gallery grid  │  Leaderboard  │  Prompt editor  │  Run history    │
└───────┬────────────────────────────────────────────────┬───────────┘
        │ REST (read scores, artifacts)                  │ POST /runs
┌───────▼────────────────────────────────────────────────▼───────────┐
│                       API SERVER (FastAPI)                          │
│  /runs  /runs/{id}  /models  /prompts  /artifacts  /aggregates     │
│  AuthN (user sessions) · rate limiting on user-triggered runs      │
└───┬───────────────┬───────────────────────────┬─────────────────────┘
    │ enqueue       │ read/write                │ read
┌───▼─────────┐ ┌───▼─────────┐          ┌──────▼──────────┐
│ JOB QUEUE   │ │ POSTGRES    │          │ OBJECT STORE    │
│ (Redis/RQ)  │ │ runs, jobs, │          │ (S3-compat)     │
└───┬─────────┘ │ prompts,    │          │ svg, png,       │
    │           │ score cache │          │ judge JSON      │
┌───▼──────────────────────────────────────────────────────┐
│                    EVAL HARNESS (workers)                 │
│  ① Generate (ModelAdapter, per-model rate limits)        │
│  ② Extract SVG  ③ Deterministic scorers                  │
│  ④ Render (resvg, sandboxed)  ⑤ Vision judge             │
│  ⑥ Report → Scorecard.io API                             │
└───┬──────────────────────────────┬────────────────────────┘
    │ provider APIs                │ Scorecard API
┌───▼──────────────┐         ┌─────▼──────────┐
│ Anthropic·OpenAI │         │ Scorecard.io   │
│ Google·OSS hosts │         │ (system of     │
└──────────────────┘         │  record)       │
                             └────────────────┘
```

**Pipeline detail (per record):**
```
prompt ──> adapter.generate ──> rawText
rawText ──> extractor ──> svg | EXTRACTION_FAIL (validity=partial, render=0, judge skipped)
svg ──> xml-validate ──> validity score
svg ──> resvg sandbox ──> png | RENDER_FAIL ──> renderability score
(png + svg + variation deltas) ──> vision judge ──> JudgeVerdict ──> dims 3–6
all scores + artifacts ──> Scorecard record + local cache + object store
```
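
The failure branches of this pipeline can be sketched as a small function over injected stages. This is illustrative only: the stages are synchronous for brevity (real ones would be async), the `Stages` and `RecordOutcome` shapes are hypothetical, and skipping the judge after a render failure is an assumption based on the judge needing a PNG.

```typescript
export type RecordStatus = "ok" | "extraction_fail" | "render_fail" | "judge_failed";

export interface JudgeScores {
  pelican_anatomy: number;
  bicycle_structure: number;
  composition: number;
  creativity: number;
}

/** Hypothetical stage functions, injected so the control flow is testable. */
export interface Stages {
  extract(rawText: string): { svg: string | null; validity: number };
  render(svg: string): { png: Uint8Array | null; renderability: number };
  /** Returns null when the judge output stays non-conforming after retries. */
  judge(png: Uint8Array, svg: string): JudgeScores | null;
}

export interface RecordOutcome {
  status: RecordStatus;
  validity: number;
  renderability: number;
  judge: JudgeScores | null; // null when the judge was skipped or failed
}

export function scoreRecord(rawText: string, stages: Stages): RecordOutcome {
  const { svg, validity } = stages.extract(rawText);
  if (svg === null) {
    // EXTRACTION_FAIL: render scores 0 and the judge is skipped.
    return { status: "extraction_fail", validity, renderability: 0, judge: null };
  }

  const { png, renderability } = stages.render(svg);
  if (png === null) {
    // RENDER_FAIL: assumption that the judge is skipped without a PNG.
    return { status: "render_fail", validity, renderability, judge: null };
  }

  const judge = stages.judge(png, svg);
  return { status: judge === null ? "judge_failed" : "ok", validity, renderability, judge };
}
```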

**Security notes:** SVGs are untrusted model output. Rendering happens server-side in resvg (no script execution); the frontend displays the *PNG* by default, with raw-SVG view rendered inside a sandboxed `<iframe sandbox="">` with CSP blocking scripts/external fetches.

---

## 8. Data Models (Postgres)

```sql
CREATE TABLE prompts (
  id            TEXT PRIMARY KEY,        -- e.g. 'canonical-v1' or 'usr_8f2a...'
  prompt_hash   TEXT NOT NULL UNIQUE,
  system_prompt TEXT,
  user_prompt   TEXT NOT NULL,
  base_prompt_id TEXT REFERENCES prompts(id),
  taxonomy_tags TEXT[] NOT NULL DEFAULT '{}',   -- ['DET:minimal','ADV:no-seat']
  created_by    TEXT,                    -- user id or 'system'
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE runs (
  id              UUID PRIMARY KEY,
  scorecard_run_id TEXT,
  prompt_id       TEXT NOT NULL REFERENCES prompts(id),
  model_ids       TEXT[] NOT NULL,
  n_samples       INT NOT NULL DEFAULT 3,
  status          TEXT NOT NULL CHECK (status IN
                    ('queued','running','completed','failed','partial')),
  certified       BOOLEAN NOT NULL DEFAULT false,  -- 3x judging
  created_by      TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at    TIMESTAMPTZ
);

CREATE TABLE records (
  id              UUID PRIMARY KEY,
  run_id          UUID NOT NULL REFERENCES runs(id),
  scorecard_record_id TEXT,
  model_id        TEXT NOT NULL,
  model_version_resolved TEXT NOT NULL,
  sample_index    INT NOT NULL,
  status          TEXT NOT NULL,          -- ok | extraction_fail | render_fail | judge_failed | provider_error
  raw_output_uri  TEXT,                   -- object store
  svg_uri         TEXT,
  png_uri         TEXT,
  judge_uri       TEXT,                   -- full verdict JSON
  scores          JSONB,                  -- {svg_validity:15, ..., total_score:87}
  aux             JSONB,                  -- {latency_ms, cost_usd, svg_bytes, element_count, output_tokens}
  self_judged     BOOLEAN NOT NULL DEFAULT false,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (run_id, model_id, sample_index)
);
```

---

## 9. API Contracts

Base path `/api/v1`. JSON throughout. Errors: `{error: {code, message}}`.

### 9.1 `GET /models`
Returns the model registry (public fields only — no auth env names).
```json
[{ "id":"claude-opus","display_name":"Claude Opus","enabled":true,
   "capabilities":{"supports_seed":false},"pricing":{"input":15,"output":75} }]
```

### 9.2 `POST /prompts`
Create a prompt variation. Body: `{user_prompt, system_prompt?, base_prompt_id?, taxonomy_tags?}`.
Server computes `prompt_hash`; if a prompt with the same hash already exists, that prompt is returned instead of a duplicate being created (dedup). → `201 {id, prompt_hash}`.

### 9.3 `POST /runs`  *(rate-limited per user: default 3/hour)*
```json
{ "prompt_id": "usr_8f2a", "model_ids": ["claude-opus","gpt-5.5"],
  "n_samples": 1, "certified": false }
```
→ `202 {run_id, status:"queued", estimated_cost_usd}`. Server rejects if estimated cost exceeds per-run budget cap (config `budget.max_run_usd`).
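
The spec does not define how `estimated_cost_usd` is derived. One conservative option, sketched below, is a worst-case bound in which every sample uses the full `max_output_tokens`. The function names, the prompt-token estimate as an input, and the omission of judge cost are all assumptions.

```typescript
/** USD per 1M tokens, as in the registry `pricing` block (§3.2). */
export interface ModelPricing {
  input: number;
  output: number;
}

/** Worst-case generation cost: every sample emits maxOutputTokens. Judge cost is not included. */
export function estimateRunCostUsd(
  pricings: ModelPricing[],
  nSamples: number,
  promptTokens: number,
  maxOutputTokens: number,
): number {
  let usd = 0;
  for (const p of pricings) {
    usd += (nSamples * (promptTokens * p.input + maxOutputTokens * p.output)) / 1_000_000;
  }
  return usd;
}

/** Accept the run only if the estimate fits under `budget.max_run_usd`. */
export function admitRun(
  estimatedCostUsd: number,
  maxRunUsd: number,
): { accepted: boolean; estimated_cost_usd: number } {
  return { accepted: estimatedCostUsd <= maxRunUsd, estimated_cost_usd: estimatedCostUsd };
}
```

A worst-case bound overestimates typical runs, which errs on the side of rejecting rather than overspending.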

### 9.4 `GET /runs/{id}`
```json
{ "run_id":"...", "status":"running", "progress":{"done":9,"total":16},
  "records":[{ "model_id":"claude-opus","sample_index":0,"status":"ok",
    "png_url":"https://cdn/.../r1.png","svg_url":"...",
    "scores":{"svg_validity":15,"renderability":10,"pelican_anatomy":21,
              "bicycle_structure":23,"composition":13,"creativity":7,"total_score":89},
    "aux":{"latency_ms":4210,"cost_usd":0.031},
    "self_judged":false }] }
```
Supports `?wait=true` long-poll (30s) for the frontend's live progress view.

### 9.5 `GET /aggregates?prompt_id=&group_by=model|taxonomy_axis`
Leaderboard data: median `total_score` and per-dimension medians per group, with sample counts and version-drift flags.
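
A minimal sketch of the `group_by=model` aggregation, assuming version drift means "more than one resolved version among a group's records". `RecordRow`, `ModelAggregate`, and both function names are hypothetical; per-dimension medians would follow the same pattern.

```typescript
export interface RecordRow {
  modelId: string;
  modelVersionResolved: string;
  totalScore: number;
}

export interface ModelAggregate {
  modelId: string;
  medianTotal: number;
  samples: number;
  versionDrift: boolean;
}

/** Median of a non-empty list; the mean of the two middle values for even length. */
export function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] as number;
  if (sorted.length % 2 === 1) return upper;
  const lower = sorted[mid - 1] as number;
  return (lower + upper) / 2;
}

/** Group records by model and report median, sample count, and drift flag. */
export function aggregateByModel(rows: RecordRow[]): ModelAggregate[] {
  const groups = new Map<string, RecordRow[]>();
  for (const row of rows) {
    const group = groups.get(row.modelId) ?? [];
    group.push(row);
    groups.set(row.modelId, group);
  }
  return Array.from(groups.entries()).map(([modelId, group]) => ({
    modelId,
    medianTotal: median(group.map((r) => r.totalScore)),
    samples: group.length,
    // Drift: records in this group were produced by different resolved versions.
    versionDrift: new Set(group.map((r) => r.modelVersionResolved)).size > 1,
  }));
}
```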

### 9.6 Harness ⇄ Scorecard
Harness uses Scorecard SDK: `create_run(project, testset)` → per record `log_record(inputs, output, metadata)` + `score(record, metric, value)` for each of the 7 score metrics and 4 aux metrics → `finalize_run`. Scorecard IDs are written back to `runs`/`records` for cross-linking; the frontend links each record to its Scorecard record page.

---

## 10. Open Questions / Risks (tracked into Milestone 2)

1. **Judge ceiling effects** on dimension 6 — calibrate anchors with a 20-image golden set before first public leaderboard.
2. **Scorecard AI-judge vs. client-side judge** — decide after confirming Scorecard's vision-input support in the metric builder; design supports both (§6).
3. **Cost control for user-triggered runs** — defaults: 1 sample, user picks ≤4 models; full 8-model certified runs are operator-triggered.
4. **gpt-oss hosting** — self-host vs. hosted inference; affects latency comparability (latency reported but never part of the 100-pt score, mitigating this).