# PeliBench-on-a-Bicycle: Benchmark Design & Evaluation Methodology

**Version:** 1.0 (Milestone 1 deliverable)

**Status:** Approved for implementation

**Scope:** Design of a multi-model benchmark in which LLMs generate an SVG of "a pelican riding a bicycle," scored via Scorecard.io, with a frontend for visual comparison and user-driven prompt variation.

---

## 1. Goals & Non-Goals

### 1.1 Goals

- Run an identical generative task ("draw a pelican on a bicycle as SVG") across a heterogeneous model matrix and score results with a reproducible rubric.
- Instrument all metrics in Scorecard.io as the system of record for runs, testsets, and scores.
- Render every model's SVG output visually in a frontend gallery, side-by-side, with per-metric scores.
- Let users edit the prompt (or pick from a structured variation taxonomy), re-run the benchmark, and see effects on output and scores.
- Make adding a new model a configuration change, not a code change (adapter spec, §3).

### 1.2 Non-Goals

- General-purpose model evaluation (reasoning, coding, etc.).
- Fine-tuning or training models.
- Real-time streaming of token-by-token generation to the frontend (we show final artifacts; streaming is a stretch goal).
- Pixel-perfect aesthetic judgment: the rubric (§5) deliberately mixes deterministic checks with LLM-as-judge scoring and acknowledges judge variance.

---

## 2. Model Matrix

| ID (slug) | Display Name | Provider | API Family | Notes |
|---|---|---|---|---|
| `claude-opus` | Claude Opus | Anthropic | Anthropic Messages | Latest available Opus version, pinned per run |
| `claude-fable` | Claude Fable | Anthropic | Anthropic Messages | Pinned snapshot |
| `claude-haiku` | Claude Haiku | Anthropic | Anthropic Messages | Cheap/fast tier baseline |
| `gpt-5.5` | GPT-5.5 | OpenAI | Chat Completions / Responses | Pinned snapshot |
| `qwen` | Qwen (latest large) | Alibaba / hosted | OpenAI-compatible | Via hosted endpoint (e.g., Together/Fireworks) |
| `gpt-oss` | gpt-oss | OpenAI (open weights) | OpenAI-compatible | Self-hosted or hosted inference |
| `gemini-flash-3.5` | Gemini Flash 3.5 | Google | Gemini API | Fast tier |
| `gemini-pro-3.0` | Gemini Pro 3.0 | Google | Gemini API | Quality tier |

**Version pinning policy:** Every run records the *exact* model version string returned by the provider (`model_version_resolved`), not just the alias. Comparisons across runs are only valid when resolved versions match; the frontend flags version drift.

**Sampling defaults (fixed across all models unless a variation axis overrides):**

- `temperature: 1.0`, `top_p: 1.0` (provider defaults where a parameter is unsupported)
- `max_output_tokens: 8192`
- `n_samples_per_model: 3` (median score reported; all 3 rendered in gallery)
- No system prompt beyond the canonical one in §4 (some providers require one; the adapter maps it).

---

## 3. Extensible Adapter Spec

Each model is described by a declarative config plus a thin adapter implementing one interface. Adding a model = one YAML entry + (if the API family is new) one adapter class.

### 3.1 Adapter interface (TypeScript)

```ts
export interface ModelAdapter {
  /** Slug, must match registry entry */
  readonly id: string;

  /** Resolve alias -> exact version string for reproducibility */
  resolveVersion(): Promise<string>;

  /** Single completion. Must be stateless and side-effect free. */
  generate(req: GenerationRequest): Promise<GenerationResult>;
}

export interface GenerationRequest {
  systemPrompt: string | null;
  userPrompt: string;
  temperature: number;
  topP: number;
  maxOutputTokens: number;
  seed?: number;     // passed through if provider supports it
  timeoutMs: number; // harness enforces regardless
}

export interface GenerationResult {
  rawText: string; // full model output, untouched
  modelVersionResolved: string;
  usage: { inputTokens: number; outputTokens: number };
  latencyMs: number;
  finishReason: "stop" | "length" | "content_filter" | "error";
  providerRequestId?: string;
}
```

### 3.2 Registry entry (YAML)

```yaml
- id: gemini-flash-3.5
  display_name: "Gemini Flash 3.5"
  adapter: gemini          # selects adapter class: anthropic | openai | openai_compatible | gemini
  model_alias: "gemini-flash-3.5"
  endpoint: null           # required for openai_compatible
  auth_env: GOOGLE_API_KEY
  pricing:                 # for cost metric, USD per 1M tokens
    input: 0.10
    output: 0.40
  capabilities:
    supports_system_prompt: true
    supports_seed: false
  rate_limit:
    rpm: 60
    concurrent: 4
  enabled: true
```

**Rule:** the harness never imports provider SDKs outside `adapters/`. SVG extraction, scoring, and Scorecard reporting operate only on `GenerationResult`.

---

## 4. Canonical Prompt & Variation Taxonomy

### 4.1 Canonical prompt (testcase id: `canonical-v1`)

> Generate an SVG image of a pelican riding a bicycle.
>
> Requirements:
> - Output only a single, complete, valid SVG document. No markdown fences, no explanation.
> - Use a `viewBox` of `0 0 512 512`.
> - The pelican must be recognizably a pelican (large beak with pouch) and must be positioned riding the bicycle.
> - The bicycle must have two wheels, a frame, handlebars, a seat, and pedals.

Canonical system prompt: `"You are an expert SVG illustrator. You respond with raw SVG markup only."`

**Versioning:** The canonical prompt is content-addressed: `prompt_hash = sha256(system + "\x00" + user)`.
Every score in Scorecard carries the hash so prompt edits never silently pollute comparisons.

### 4.2 Prompt-variation taxonomy

User-modified prompts are classified along orthogonal axes. Each variation is stored as `{base: canonical-v1, deltas: [...]}` so the frontend can show "what changed" and aggregate effects per axis.

| Axis | Code | Example values | Hypothesis tested |
|---|---|---|---|
| **Detail level** | `DET` | minimal ("draw a pelican on a bike, svg"), canonical, hyper-specified (colors, proportions) | Does specification help weaker models more? |
| **Style constraint** | `STY` | flat design, line art, pixel-art-in-svg, watercolor imitation, Bauhaus | Style adherence vs. structural collapse |
| **Technical constraint** | `TEC` | ≤30 elements, paths only (no primitives), single ``, must use gradients, must animate (SMIL/CSS) | Constraint-following under structure pressure |
| **Compositional twist** | `CMP` | pelican doing a wheelie, two pelicans on a tandem, side view vs. front view, pelican wearing a helmet | Compositional generalization |
| **Format/output framing** | `FMT` | "output only SVG" vs. no instruction, JSON-wrapped SVG, fenced code block requested | Extraction robustness; instruction-following |
| **Adversarial/negation** | `ADV` | "no circles allowed", "the bicycle must NOT have a seat", swap subject ("bicycle riding a pelican") | Negation handling, literal vs. likely interpretation |

Free-form user prompts that don't match a taxonomy delta are tagged `CUSTOM` and still fully scored; they're excluded from per-axis aggregate charts.

---

## 5. Scoring Rubric

Total **100 points** across six dimensions. Dimensions 1–2 are **deterministic** (code-evaluated). Dimensions 3–6 are **judge-evaluated** (vision-capable LLM judge scoring the *rendered PNG* plus the SVG source), each on a defined 0–N scale with anchored descriptions.
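
As a rough illustration of how the 100-point total decomposes, the sketch below encodes the six dimension caps and sums a per-record score map. The metric slugs are borrowed from the Scorecard mapping in §6; `RUBRIC`, `Dimension`, `totalScore`, and `maxTotal` are hypothetical names chosen for this example and are not part of the harness.

```typescript
// Hypothetical sketch: only the slugs and point caps come from this document.
type Method = "deterministic" | "judge";

interface Dimension {
  slug: string;      // metric name as listed in the Scorecard mapping (§6)
  maxPoints: number; // cap for this dimension
  method: Method;    // code-evaluated vs. vision-judge-evaluated
}

const RUBRIC: readonly Dimension[] = [
  { slug: "svg_validity", maxPoints: 15, method: "deterministic" },
  { slug: "renderability", maxPoints: 10, method: "deterministic" },
  { slug: "pelican_anatomy", maxPoints: 25, method: "judge" },
  { slug: "bicycle_structure", maxPoints: 25, method: "judge" },
  { slug: "composition", maxPoints: 15, method: "judge" },
  { slug: "creativity", maxPoints: 10, method: "judge" },
];

// Sum of all caps; the rubric is designed so this equals 100.
const maxTotal: number = RUBRIC.reduce((sum, d) => sum + d.maxPoints, 0);

// Validates each dimension against its cap, then returns the total.
function totalScore(scores: Record<string, number>): number {
  let total = 0;
  for (const dim of RUBRIC) {
    const value = scores[dim.slug];
    if (
      value === undefined ||
      !Number.isInteger(value) ||
      value < 0 ||
      value > dim.maxPoints
    ) {
      throw new RangeError(`${dim.slug} must be an integer in 0..${dim.maxPoints}`);
    }
    total += value;
  }
  return total;
}

const exampleTotal = totalScore({
  svg_validity: 15,
  renderability: 10,
  pelican_anatomy: 20,
  bicycle_structure: 18,
  composition: 11,
  creativity: 6,
}); // 80
```

Rejecting out-of-range values instead of clamping them is one possible design choice: it would surface a scorer or judge-parsing bug early rather than letting it quietly distort `total_score`.
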

### 5.1 Dimensions

| # | Dimension | Pts | Method |
|---|---|---|---|
| 1 | **SVG Validity** | 15 | Deterministic |
| 2 | **Renderability** | 10 | Deterministic |
| 3 | **Pelican Anatomical Completeness** | 25 | Judge (vision) |
| 4 | **Bicycle Structure** | 25 | Judge (vision) |
| 5 | **Composition** | 15 | Judge (vision) |
| 6 | **Creativity & Craft** | 10 | Judge (vision) |

#### 1. SVG Validity (15 pts, deterministic)

- +5: Output contains exactly one extractable SVG document (extraction pipeline: strip markdown fences → find the first `<svg>` element).
- +5: XML well-formed (passes strict XML parse).
- +3: Declares `viewBox` (any value; canonical 512×512 not required for variations).
- +2: No undefined references (`url(#id)` / `href="#id"` targets exist).

#### 2. Renderability (10 pts, deterministic)

Render via headless `resvg` to 512×512 PNG.

- +5: Renders without error.
- +3: Non-blank output (≥1% of pixels differ from background).
- +2: Drawn content occupies ≥10% of viewBox bounding area (not a tiny blob in a corner).

#### 3. Pelican Anatomical Completeness (25 pts, judge)

Checklist scoring; judge marks each item present/absent on the rendered image:

- Body (5), Head (3), **Large beak with visible pouch, the pelican signature** (7), Eye (2), Wing(s) (3), Legs/feet positioned plausibly for riding (3), Overall "reads as a pelican, not a generic bird" (2).

#### 4. Bicycle Structure (25 pts, judge)

- Two wheels (6), wheels approximately round and similar size (3), frame connecting wheels (5), handlebars (4), seat/saddle (3), pedals/crank (3), "reads as a bicycle" gestalt (1).

#### 5. Composition (15 pts, judge)

- Pelican is *on* the bicycle (seated/contacting seat & handlebars region) (7), plausible scale relationship (4), scene coherence: no severe overlapping garbage, elements not floating disconnected (4).

#### 6. Creativity & Craft (10 pts, judge)

- Color use beyond default black (3), detail/polish (shading, background, accessories) (4), charm/originality (3).

*Anchors:* 0 = monochrome stick figures; 10 = stylistically coherent, delightful illustration.

### 5.2 Judge protocol

- **Judge model:** one pinned vision-capable model (config: `judge.model`), *excluded from leaderboard claims about itself*; its self-scores are flagged `self_judged: true` in Scorecard and visually marked in the UI.
- **Inputs:** rendered PNG (primary) + raw SVG source (secondary, for tie-breaks like "is that shape a pedal").
- **Output contract:** strict JSON matching `JudgeVerdict` schema (see `schemas/judge-verdict.schema.json`); non-conforming judge output is retried up to 2×, then the run is marked `judge_failed` and excluded from aggregates.
- **Variance control:** judge temperature 0; each artifact judged once by default, 3× with median for "certified" leaderboard runs.
- **Variation-aware judging:** for `ADV`/`CMP` variations, the judge prompt includes the user's deltas so constraint-following is scored against the *modified* spec (e.g., "no seat" means seat-absence scores the seat points).

### 5.3 Auxiliary metrics (recorded, not in the 100-pt score)

- `latency_ms`, `output_tokens`, `cost_usd` (from registry pricing), `svg_element_count`, `svg_bytes`, `extraction_required_repair` (bool).

---

## 6. Scorecard.io Mapping

| PeliBench concept | Scorecard object | Notes |
|---|---|---|
| The benchmark | **Project** `pelibench` | Single project |
| Canonical prompt + each saved variation | **Testset**; each prompt = **Testcase** | Testcase `inputs`: `{system_prompt, user_prompt, prompt_hash, taxonomy_tags}` |
| One model × one prompt × one sample | **Record** in a **Run** | Run = one full benchmark execution (all enabled models × selected testcases × n samples) |
| Each rubric dimension | **Metric** (custom, integer-range) | `svg_validity` (0–15), `renderability` (0–10), `pelican_anatomy` (0–25), `bicycle_structure` (0–25), `composition` (0–15), `creativity` (0–10), `total_score` (0–100) |
| Deterministic checks | Metrics scored client-side, pushed via API | Harness computes, attaches to record |
| Judge dimensions | Scorecard **AI-judge metric** where supported; otherwise client-scored with judge transcript attached as record metadata | Judge prompt + verdict JSON stored in `record.metadata.judge` |
| Auxiliary metrics | Numeric metrics: `latency_ms`, `cost_usd`, `svg_bytes`, `element_count` | Not part of `total_score` |
| Model identity | Record metadata: `{model_id, model_version_resolved, sampling_params}` | Enables Scorecard-side filtering/grouping by model |

**Run lifecycle:** harness creates Run → for each (model, testcase, sample): generate → extract → deterministic-score → render → judge → `POST record` with output (raw text + extracted SVG + PNG URL), all metric scores, and metadata → finalize Run. The frontend reads aggregates from our API, which proxies/caches Scorecard run data (we never expose the Scorecard API key to the browser).

---

## 7. System Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                         FRONTEND (Next.js)                         │
│  Gallery grid │ Leaderboard │ Prompt editor │ Run history          │
└───────┬────────────────────────────────────────────────┬───────────┘
        │ REST (read scores, artifacts)                  │ POST /runs
┌───────▼────────────────────────────────────────────────▼───────────┐
│                        API SERVER (FastAPI)                        │
│  /runs  /runs/{id}  /models  /prompts  /artifacts  /aggregates     │
│  AuthN (user sessions) · rate limiting on user-triggered runs      │
└───┬───────────────┬───────────────────────────┬────────────────────┘
    │ enqueue       │ read/write                │ read
┌───▼─────────┐ ┌───▼─────────┐          ┌──────▼──────────┐
│ JOB QUEUE   │ │ POSTGRES    │          │ OBJECT STORE    │
│ (Redis/RQ)  │ │ runs, jobs, │          │ (S3-compat)     │
└───┬─────────┘ │ prompts,    │          │ svg, png,       │
    │           │ score cache │          │ judge JSON      │
┌───▼──────────────────────────────────────────────────────┐
│                  EVAL HARNESS (workers)                  │
│  ① Generate (ModelAdapter, per-model rate limits)        │
│  ② Extract SVG                 ③ Deterministic scorers   │
│  ④ Render (resvg, sandboxed)   ⑤ Vision judge            │
│  ⑥ Report → Scorecard.io API                             │
└───┬──────────────────────────────┬───────────────────────┘
    │ provider APIs                │ Scorecard API
┌───▼──────────────┐         ┌─────▼──────────┐
│ Anthropic·OpenAI │         │ Scorecard.io   │
│ Google·OSS hosts │         │ (system of     │
└──────────────────┘         │ record)        │
                             └────────────────┘
```

**Pipeline detail (per record):**

```
prompt ──> adapter.generate ──> rawText
rawText ──> extractor ──> svg | EXTRACTION_FAIL (validity=partial, render=0, judge skipped)
svg ──> xml-validate ──> validity score
svg ──> resvg sandbox ──> png | RENDER_FAIL ──> renderability score
(png + svg + variation deltas) ──> vision judge ──> JudgeVerdict ──> dims 3–6
all scores + artifacts ──> Scorecard record + local cache + object store
```

**Security notes:** SVGs are untrusted model output.
Rendering happens server-side in resvg (no script execution); the frontend displays the *PNG* by default, with raw-SVG view rendered inside a sandboxed `