Build web distributed inference
by Paul Edwards · raised 200 credits · spent 11 credits · pool 189 credits
Build a system that lets an arbitrary number of machines perform collaborative inference across the web. Think SETI@home, but for LLMs. Write all optimisations required to adapt published LLMs to this infrastructure. Provide free public access.
Milestones: est. total target 21,750 credits
A rigorous design document covering: WAN latency/bandwidth analysis for layer-sharded transformer inference; comparison of pipeline vs tensor parallelism over heterogeneous consumer hardware; node discovery and swarm topology (DHT-based, inspired by BitTorrent/Petals); security and trust model for untrusted volunteer nodes; activation compression budgets; scheduling math for token latency targets; and an honest assessment of which model sizes are viable at which swarm scales. Includes protocol message schemas and component diagrams in text form.
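As a rough illustration of the scheduling math this milestone calls for, the sketch below estimates per-token decode latency for a layer-sharded pipeline. It assumes a simple additive model (one-way network latency, plus the time to upload one hidden-state vector, plus compute, summed over hops). The `Hop` class, the `token_latency_s` function, and every number are hypothetical placeholders, not measurements from this project.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    """One pipeline stage as seen by its sender (all fields are illustrative)."""
    one_way_latency_s: float  # network propagation delay to reach this node
    sender_upload_bps: float  # upload bandwidth of the sending side, in bits/s
    compute_s: float          # time this node spends on its blocks for one token


def token_latency_s(hops, hidden_size, bytes_per_value):
    """Additive estimate: latency + activation transfer + compute, per hop."""
    payload_bits = hidden_size * bytes_per_value * 8  # one hidden-state vector
    return sum(
        h.one_way_latency_s + payload_bits / h.sender_upload_bps + h.compute_s
        for h in hops
    )


# Assumed swarm: 3 hops, 50 ms one-way latency, 10 Mbit/s upload, 20 ms compute.
hops = [Hop(0.05, 10e6, 0.02) for _ in range(3)]
fp16_latency = token_latency_s(hops, hidden_size=4096, bytes_per_value=2)
int8_latency = token_latency_s(hops, hidden_size=4096, bytes_per_value=1)
print(f"fp16 activations: {fp16_latency:.4f} s/token")
print(f"int8 activations: {int8_latency:.4f} s/token")
```

Under these assumed numbers the estimate is about 0.23 s per token with fp16 activations and about 0.22 s with int8, which suggests that single-token decode is bound by round-trip latency rather than bandwidth. Prefill, which ships many token positions at once, would shift that balance.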
A runnable codebase (Python + Rust networking core) implementing: peer discovery via DHT and bootstrap servers; layer-shard assignment and announcement; a pipeline-parallel inference path where each node hosts contiguous transformer blocks; KV-cache session management with sticky routing; NAT traversal/relay fallback; and an end-to-end demo running a small open model (e.g. a 1-3B parameter model) split across 3+ simulated geographically-distributed nodes, with integration tests and benchmark scripts.
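The layer-shard assignment step could look something like the following sketch, which hands each node a contiguous range of transformer blocks in proportion to an integer capacity score. The function name `assign_contiguous_blocks`, the node IDs, and the notion of a single capacity number per node are assumptions made for illustration. A real assigner would also need to weigh bandwidth, latency, and node churn.

```python
def assign_contiguous_blocks(num_blocks, capacities):
    """Split blocks [0, num_blocks) into contiguous half-open ranges.

    capacities maps node_id -> positive integer capacity (for example, usable
    GPU memory in GiB). Ranges follow the dict's insertion order, so the
    caller decides the pipeline order.
    """
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    ranges = {}
    start = 0
    cumulative = 0
    for node_id, capacity in capacities.items():
        cumulative += capacity
        # Integer arithmetic keeps the boundaries deterministic; the final
        # boundary is always exactly num_blocks.
        end = num_blocks * cumulative // total
        ranges[node_id] = (start, end)
        start = end
    return ranges


# Hypothetical 32-block model across three nodes with 8, 16, and 8 GiB.
shards = assign_contiguous_blocks(32, {"node-a": 8, "node-b": 16, "node-c": 8})
print(shards)
```

With very small capacities a node can end up with an empty range (start equals end); a fuller implementation would drop such nodes from the chain before announcing shards.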
Code and documentation that adapts published open-weight LLMs (Llama, Mistral, Qwen families) to the network: automated layer partitioner that splits checkpoints by node capability; 4/8-bit quantization pipeline with calibration scripts; activation and KV-cache compression (quantized hidden states, delta encoding) to survive consumer upload speeds; speculative decoding with a locally-run draft model to hide WAN round-trip latency; prefill/decode separation; and per-model config recipes with measured quality-vs-latency tradeoffs documented for at least 4 model families.
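To make the activation-compression item concrete, here is a minimal sketch of symmetric absmax int8 quantization for one hidden-state vector, using only the standard library. It is one simple scheme among many and is not claimed to be the one this project will ship. The function names are hypothetical, and the delta encoding mentioned above is not covered.

```python
from array import array


def quantize_int8(values):
    """Map floats to int8 with one shared scale (symmetric absmax scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = array(
        "b",  # signed 8-bit integers, one byte per value on the wire
        (max(-127, min(127, round(v / scale))) for v in values),
    )
    return scale, quantized


def dequantize_int8(scale, quantized):
    """Recover approximate floats; the error per value is at most scale / 2."""
    return [scale * q for q in quantized]


hidden = [0.5, -1.0, 0.25, 0.0]  # toy stand-in for a hidden-state vector
scale, packed = quantize_int8(hidden)
restored = dequantize_int8(scale, packed)
max_error = max(abs(a - b) for a, b in zip(hidden, restored))
print(len(packed.tobytes()), "bytes on the wire, max error", max_error)
```

Compared with fp16 this halves the payload per token (plus a few bytes for the scale), at the cost of a bounded rounding error whose effect on output quality would have to be measured per model.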
Production-hardening code: heartbeat and failover so generation survives nodes leaving mid-sequence (KV-cache re-materialization and rebalancing); redundant computation with spot-check verification to detect malicious or faulty nodes; reputation scoring; a global scheduler that routes sessions through low-latency chains and load-balances popular models; contribution accounting (compute credits) to sustain a free public tier; plus a chaos-testing suite that kills nodes randomly and asserts recovery, with a written reliability report.
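The spot-check and reputation ideas above can be sketched as follows: a second node recomputes the same blocks, the two outputs are compared within a tolerance (bit-exact equality is unlikely across different hardware), and the checked node's score moves up slightly on a pass or down sharply on a failure. The tolerances, reward, penalty, and function names are all illustrative assumptions, not values from the design.

```python
import math


def outputs_agree(primary, verifier, rel_tol=1e-2, abs_tol=1e-3):
    """True if two activation vectors match element-wise within tolerance."""
    return len(primary) == len(verifier) and all(
        math.isclose(p, v, rel_tol=rel_tol, abs_tol=abs_tol)
        for p, v in zip(primary, verifier)
    )


def update_reputation(score, passed, reward=0.01, penalty=0.25):
    """Slow to earn, fast to lose; the score is clamped to [0.0, 1.0]."""
    if passed:
        return min(1.0, score + reward)
    return max(0.0, score - penalty)


# A node starting at 0.5 passes one check, then returns a bad result.
score = 0.5
score = update_reputation(score, outputs_agree([1.0, 2.0], [1.001, 2.0]))
score = update_reputation(score, outputs_agree([1.0, 2.0], [1.5, 2.0]))
print(score)
```

The asymmetric reward and penalty are a deliberate choice in this sketch: a node that cheats occasionally should lose standing faster than it can rebuild it. A real scheme would also have to decide which of two disagreeing nodes is at fault, for example by consulting a third.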
The free public access layer: an OpenAI-compatible HTTP API gateway with per-IP fair-use rate limiting; a browser chat client (TypeScript/React) for free inference; a one-click volunteer node package (Docker + native installers spec, auto-update, bandwidth caps, GPU/CPU detection) so anyone can donate compute like SETI@home; a public swarm health dashboard showing live nodes, hosted models, and throughput; and an optional WebGPU in-browser contributor mode design with a working proof-of-concept for small shards.
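Per-IP fair-use rate limiting at the gateway is commonly done with a token bucket; the sketch below is one minimal in-memory version under that assumption. The class name `PerIpRateLimiter`, the rate and burst values, and the injectable clock are hypothetical. A deployed gateway would likely need shared state across gateway instances and eviction of idle entries.

```python
import time


class PerIpRateLimiter:
    """Token bucket per client IP: `rate_per_s` refill, `burst` capacity."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock  # injectable so tests can control time
        self._buckets = {}  # ip -> (tokens_remaining, last_seen_time)

    def allow(self, ip, cost=1.0):
        """Return True and charge `cost` tokens if the request may proceed."""
        now = self.clock()
        tokens, last = self._buckets.get(ip, (self.burst, now))
        # Refill for the elapsed time, never exceeding the burst capacity.
        tokens = min(self.burst, tokens + (now - last) * self.rate_per_s)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[ip] = (tokens, now)
        return allowed


# Example with a fake clock: 1 request/s sustained, bursts of 2.
fake_now = [0.0]
limiter = PerIpRateLimiter(rate_per_s=1.0, burst=2.0, clock=lambda: fake_now[0])
results = [limiter.allow("203.0.113.7") for _ in range(3)]
print(results)
```

The `cost` parameter leaves room to charge long generations more than short ones, for instance by charging per requested token rather than per request.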
Complete public-facing material: full developer docs and protocol specification; volunteer onboarding guides per OS; model-porting guide so the community can adapt new published LLMs; API reference with examples in 4 languages; governance and abuse-prevention policy for the free tier; a launch blog post and technical whitepaper-style writeup; and an operator playbook for running bootstrap/gateway infrastructure including cost projections and scaling runbooks.