PeliBench-on-a-bicycle
active · by Yash Thapliyal · 2 upvotes · raised $14.50 · spent $0.71 · pool $13.79
I would like to build a benchmark in which many different models (Opus, Fable, Haiku, GPT 5.5, Qwen, gpt-oss, Gemini Flash 3.5, Gemini Pro 3.0, etc.) each attempt to create an SVG of a pelican on a bicycle. Display each result visually, and run the benchmarks using Scorecard (docs.scorecard.io) as the benchmarking platform for these models. Instrument the metrics inside Scorecard and display them on the frontend. Let users modify the pelican-on-a-bicycle prompt to see how the change affects each model's output.
Milestones (est. total target $185.00)
A comprehensive design document covering: the model matrix (Opus, Fable, Haiku, GPT 5.5, Qwen, gpt-oss, Gemini Flash 3.5, Gemini Pro 3.0, with an extensible adapter spec for adding more), the canonical pelican-on-a-bicycle prompt and a structured prompt-variation taxonomy, a scoring rubric (SVG validity, renderability, anatomical completeness of pelican, bicycle structure, composition, creativity), the Scorecard.io project/testset/metric schema mapping, system architecture diagrams (harness, eval pipeline, API, frontend), data models, and API contracts between components.
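To make the "extensible adapter spec" concrete, here is a minimal sketch of what a unified adapter interface and registry could look like. Every name in it (`Generation`, `ModelAdapter`, `register`, `run_all`, `EchoAdapter`) is hypothetical and not taken from the design document; a real adapter would wrap a provider SDK call where the stub fabricates a placeholder SVG.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class Generation:
    """One model response plus the run metadata a gallery could display."""
    model: str
    text: str
    latency_s: float
    input_tokens: int
    output_tokens: int


class ModelAdapter(Protocol):
    """Hypothetical interface that every provider adapter would satisfy."""
    name: str

    def generate(self, prompt: str) -> Generation: ...


_REGISTRY: Dict[str, ModelAdapter] = {}


def register(adapter: ModelAdapter) -> None:
    """Add an adapter to the model matrix, rejecting duplicate names."""
    if adapter.name in _REGISTRY:
        raise ValueError(f"adapter already registered: {adapter.name}")
    _REGISTRY[adapter.name] = adapter


def run_all(prompt: str) -> Dict[str, Generation]:
    """Send the same prompt to every registered adapter."""
    return {name: a.generate(prompt) for name, a in _REGISTRY.items()}


class EchoAdapter:
    """Stand-in for tests; it makes no network call (assumption: real
    adapters would call a provider SDK here instead)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> Generation:
        svg = '<svg xmlns="http://www.w3.org/2000/svg"></svg>'
        return Generation(self.name, svg, 0.0, len(prompt.split()), 1)
```

Adding a model then amounts to writing one class with a `name` and a `generate` method and calling `register` on an instance, which is one plausible reading of "extensible".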
Production-quality backend harness: provider adapter layer for each model family (Anthropic, OpenAI, Google, Qwen, OSS endpoints) behind a unified interface, configurable prompt templating, concurrent run orchestration with rate-limit handling, retries, and cost tracking, SVG extraction/sanitization/normalization from model outputs, response caching and run persistence (SQLite/Postgres schema), a CLI to execute full benchmark sweeps, plus unit and integration tests with mocked providers and fixture outputs.
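The SVG extraction and sanitization step might look roughly like the sketch below. It assumes models return the SVG inline, possibly wrapped in prose or a fenced block, and it uses only the Python standard library; the function names are hypothetical. The regex-based stripping is a first pass, not a complete sanitizer, so the sandboxed rendering planned for the frontend would still be needed.

```python
import re
from typing import Optional

# First <svg ...> ... </svg> span, across newlines, case-insensitive.
_SVG_RE = re.compile(r"<svg\b.*?</svg\s*>", re.IGNORECASE | re.DOTALL)
# Whole <script> elements.
_SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Quoted inline event handlers such as onload="..." or onclick='...'.
_EVENT_ATTR_RE = re.compile(r"""\s+on\w+\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)


def extract_svg(model_output: str) -> Optional[str]:
    """Return the first SVG document found in the raw output, or None."""
    match = _SVG_RE.search(model_output)
    return match.group(0) if match else None


def strip_active_content(svg: str) -> str:
    """Remove <script> elements and quoted on* event attributes."""
    svg = _SCRIPT_RE.sub("", svg)
    return _EVENT_ATTR_RE.sub("", svg)
```

A harness could treat a `None` result from `extract_svg` as a failed generation and record it, rather than retrying silently, so that models which never emit an SVG still show up in the results.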
Full evaluation layer integrated with the Scorecard SDK/API: creation of testsets and metric definitions in code, automated structural metrics (valid XML, renders without error, element counts, viewBox sanity, file size), an LLM-as-judge rubric evaluator scoring pelican fidelity and bicycle fidelity with calibrated few-shot rubric prompts, run-record upload to Scorecard with full trace instrumentation, a results aggregation module computing per-model leaderboard stats with confidence intervals across repeated trials, and tests validating the pipeline end-to-end against fixture SVGs.
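Several of the automated structural metrics can be computed with the standard library alone. The sketch below covers XML validity, element count, viewBox sanity, and file size; the metric names and the viewBox rule (four numbers with positive width and height) are assumptions, not the project's actual definitions. It does not cover "renders without error", which would need a real rasterizer, and it does not touch the Scorecard SDK.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, Union

Metric = Union[bool, int]


def structural_metrics(svg_text: str) -> Dict[str, Metric]:
    """Compute cheap, deterministic checks on one SVG string."""
    metrics: Dict[str, Metric] = {
        "valid_xml": False,
        "root_is_svg": False,
        "element_count": 0,
        "viewbox_ok": False,
        "size_bytes": len(svg_text.encode("utf-8")),
    }
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return metrics  # malformed XML: leave the other checks at defaults

    metrics["valid_xml"] = True
    # ElementTree writes namespaced tags as "{uri}name"; keep the local name.
    metrics["root_is_svg"] = root.tag.rsplit("}", 1)[-1] == "svg"
    # Counts every element in the tree, including the root.
    metrics["element_count"] = sum(1 for _ in root.iter())

    # viewBox should be "min-x min-y width height", space or comma separated.
    parts = re.split(r"[\s,]+", root.get("viewBox", "").strip())
    try:
        nums = [float(p) for p in parts]
        metrics["viewbox_ok"] = len(nums) == 4 and nums[2] > 0 and nums[3] > 0
    except ValueError:
        pass  # missing or non-numeric viewBox
    return metrics
```

Because these checks are deterministic, they can run against the fixture SVGs in tests without any API keys, leaving the LLM-as-judge scores as the only non-reproducible part of the pipeline.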
A React/TypeScript frontend that renders every model's pelican SVG side-by-side in a sandboxed gallery grid with run metadata (model, latency, tokens, cost), a leaderboard view pulling instrumented metrics from Scorecard via the backend API, per-model detail pages with score breakdowns by rubric dimension, historical run comparison charts, responsive layout, safe SVG rendering (sanitized, sandboxed iframes), loading/error states, and component tests.
User-facing prompt modification feature: an editor to alter the pelican-on-a-bicycle prompt (with template variables and preset variations from the taxonomy), backend endpoints to trigger re-runs across a user-selected subset of models with queueing and progress streaming, side-by-side before/after SVG diff view, automatic re-scoring of new runs through the Scorecard pipeline, prompt-run history with shareable permalinks, basic rate limiting and input validation, plus tests for the new endpoints and UI flows.
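One simple way to support template variables and preset variations is Python's `string.Template`, sketched below. The base prompt wording, the variable names, the presets, and the length limit are all illustrative assumptions; the project's canonical prompt and taxonomy are defined in its design document, not here.

```python
from string import Template
from typing import Dict, Mapping, Optional

# Assumed wording; the canonical prompt lives in the design document.
BASE_PROMPT = Template("Generate an SVG of a $animal riding a $vehicle$style_clause.")

DEFAULTS: Dict[str, str] = {"animal": "pelican", "vehicle": "bicycle", "style_clause": ""}

# Hypothetical presets standing in for the prompt-variation taxonomy.
PRESETS: Dict[str, Dict[str, str]] = {
    "canonical": {},
    "flat-style": {"style_clause": " in a flat, minimalist style"},
    "swap-vehicle": {"vehicle": "unicycle"},
}

MAX_VALUE_LEN = 80  # basic input validation for user-supplied values


def render_prompt(preset: str = "canonical",
                  overrides: Optional[Mapping[str, str]] = None) -> str:
    """Merge defaults, a preset, and user overrides, then fill the template."""
    values = {**DEFAULTS, **PRESETS[preset], **(overrides or {})}
    unknown = set(values) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown template variables: {sorted(unknown)}")
    for key, value in values.items():
        if len(value) > MAX_VALUE_LEN:
            raise ValueError(f"value for {key!r} exceeds {MAX_VALUE_LEN} characters")
    return BASE_PROMPT.substitute(values)
```

For example, `render_prompt("swap-vehicle")` yields a unicycle variant of the base prompt, and `render_prompt(overrides={"animal": "heron"})` changes only the subject. Storing the preset name and overrides alongside each run, rather than only the rendered string, would make prompt-run history easier to compare.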
A complete seed dataset: scripted generation of fixture results across all configured models and 5 prompt variations with committed SVGs and metric records so the app demos without live API keys; full README and operator docs (API key setup, Scorecard project bootstrap, adding new models, running sweeps); Docker Compose deployment configuration with environment templates; a written results report analyzing the seed benchmark with per-model commentary and example renders; and a demo walkthrough script.
Artifacts
| File | Milestone | Size |
|---|---|---|
| docs/benchmark-design.md | 65 | 20762 B |
| schemas/judge-verdict.schema.json | 65 | 3360 B |
| config/models.yaml | 65 | 3224 B |