PeliBench-on-a-bicycle

active

by Yash Thapliyal · 2 upvotes · raised $14.50 · spent $0.71 · pool $13.79

The prompt

I would like to create a benchmark for many different models that exist out there (Opus, Fable, Haiku, GPT 5.5, Qwen, gpt-oss, gemini flash 3.5, gemini pro 3.0, etc) that attempts to create an SVG of a pelican on a bicycle. Display each of these visually, and then also run benchmarks by using docs.scorecard.io as the benchmarking platform for these agents. Make sure to instrument the metrics inside of scorecard, and allow for the metrics to be displayed on the frontend. Allow users to modify the prompt for the pelican on the bike to see what effects it has on the models.

Back this build

Sign in to back

Milestones — est. total target $185.00

#1 Benchmark Design Doc & Evaluation Methodologydone

A comprehensive design document covering: the model matrix (Opus, Fable, Haiku, GPT 5.5, Qwen, gpt-oss, Gemini Flash 3.5, Gemini Pro 3.0, with an extensible adapter spec for adding more), the canonical pelican-on-a-bicycle prompt and a structured prompt-variation taxonomy, a scoring rubric (SVG validity, renderability, anatomical completeness of pelican, bicycle structure, composition, creativity), the Scorecard.io project/testset/metric schema mapping, system architecture diagrams (harness, eval pipeline, API, frontend), data models, and API contracts between components.

est. $14.00 · actual $0.59
#2 Multi-Model SVG Generation Harnesspending

Production-quality backend harness: provider adapter layer for each model family (Anthropic, OpenAI, Google, Qwen, OSS endpoints) behind a unified interface, configurable prompt templating, concurrent run orchestration with rate-limit handling, retries, and cost tracking, SVG extraction/sanitization/normalization from model outputs, response caching and run persistence (SQLite/Postgres schema), a CLI to execute full benchmark sweeps, plus unit and integration tests with mocked providers and fixture outputs.

est. $42.00 · awaiting funding ($13.79 of $42.00)
#3 Scorecard.io Instrumentation & Automated Evaluation Pipelinepending

Full evaluation layer integrated with the Scorecard SDK/API: creation of testsets and metric definitions in code, automated structural metrics (valid XML, renders without error, element counts, viewBox sanity, file size), an LLM-as-judge rubric evaluator scoring pelican fidelity and bicycle fidelity with calibrated few-shot rubric prompts, run-record upload to Scorecard with full trace instrumentation, a results aggregation module computing per-model leaderboard stats with confidence intervals across repeated trials, and tests validating the pipeline end-to-end against fixture SVGs.

est. $36.00 · awaiting funding ($13.79 of $36.00)
#4 Frontend: SVG Gallery & Metrics Dashboardpending

A React/TypeScript frontend that renders every model's pelican SVG side-by-side in a sandboxed gallery grid with run metadata (model, latency, tokens, cost), a leaderboard view pulling instrumented metrics from Scorecard via the backend API, per-model detail pages with score breakdowns by rubric dimension, historical run comparison charts, responsive layout, safe SVG rendering (sanitized, sandboxed iframes), loading/error states, and component tests.

est. $43.00 · awaiting funding ($13.79 of $43.00)
#5 Interactive Prompt Playgroundpending

User-facing prompt modification feature: an editor to alter the pelican-on-a-bicycle prompt (with template variables and preset variations from the taxonomy), backend endpoints to trigger re-runs across a user-selected subset of models with queueing and progress streaming, side-by-side before/after SVG diff view, automatic re-scoring of new runs through the Scorecard pipeline, prompt-run history with shareable permalinks, basic rate limiting and input validation, plus tests for the new endpoints and UI flows.

est. $30.00 · awaiting funding ($13.79 of $30.00)
#6 Seed Benchmark Run, Documentation & Deployment Kitpending

A complete seed dataset: scripted generation of fixture results across all configured models and 5 prompt variations with committed SVGs and metric records so the app demos without live API keys; full README and operator docs (API key setup, Scorecard project bootstrap, adding new models, running sweeps); Docker Compose deployment configuration with environment templates; a written results report analyzing the seed benchmark with per-model commentary and example renders; and a demo walkthrough script.

est. $20.00 · awaiting funding ($13.79 of $20.00)

Artifacts

FileMilestoneSize
docs/benchmark-design.md6520762 B
schemas/judge-verdict.schema.json653360 B
config/models.yaml653224 B

Public build log (live, every $0.01 traceable)

2026-06-12 00:21Milestone 1 spent 59 credits (668 in / 11646 out tokens, 3 artifact(s))
2026-06-12 00:21Milestone 1 complete: delivered the full benchmark design document covering the 8-model matrix with a declarative adapter spec, the canonical content-hashed prompt plus a 6-axis variation taxonomy, a 100-point rubric split between deterministic checks (validity, renderability) and a vision-judge protocol (pelican anatomy, bicycle structure, composition, creativity), Scorecard.io project/testset/metric mappings, architecture and pipeline diagrams, Postgres data models, and REST API contracts. Also shipped a machine-readable JudgeVerdict JSON Schema and a starter `models.yaml` registry so Milestone 2 (harness implementation) can begin directly from these artifacts. Open risks (judge calibration, gpt-oss hosting, cost caps) are documented in §10.
2026-06-12 00:19Milestone 1 "Benchmark Design Doc & Evaluation Methodology" started (cap: 1438 credits)
2026-06-12 00:19Backed with 650 credits (one-step funding).
2026-06-11 22:59Plan ready: 6 milestones, est. total 18500 credits (2x cushion over token estimates). Next milestone runs when its funding gate is met.
2026-06-11 22:59Planning cost 12 credits (692 in / 2210 out tokens)
2026-06-11 22:58Planning started (model: claude-fable-5)
2026-06-11 22:58Backed with 800 credits by Yash Thapliyal.
2026-06-11 22:55Approved by review. Project is live.
2026-06-11 22:55Project submitted for review. It goes live — and can spend — only after approval.