# ADR-0010: Isolated Container-Based Code Judge Service

- **Status:** Accepted
- **Date:** 2024-06-03
- **Deciders:** FablePool core team
- **Related:** ADR-0001 (backend), ADR-0009 (widget sandbox), ADR-0012 (Celery), `docs/architecture/01-system-overview.md`

## Context

Code-challenge problems require executing untrusted learner code. The brief is explicit: *never run untrusted code directly on the main server*. The judge must be safe (kernel-level isolation), fair (deterministic limits), self-hostable (no mandatory cloud dependency), and simple enough for a small team to operate.

Threats: container escape, resource exhaustion (fork bombs, memory bombs, disk fill), network abuse (spam, attacking internal services, exfiltrating secrets), timing attacks against shared infrastructure, and judge-output injection (malicious output that exploits the result parser or the UI).

## Decision

Build a **separate judge service** ("fablepool-judge"), deployed on **dedicated hosts/VMs that run nothing else** and hold **no secrets** beyond a queue credential:

1. **Architecture:** the main app never talks to learner code. A code attempt is enqueued (Redis queue, ADR-0012); judge workers pull jobs, execute, and post results back via a signed callback to a single narrowly-scoped API endpoint. The judge host cannot reach Postgres, the app's internal network, or object storage write paths.
2. **Isolation: one container per execution**, hardened:
   - rootless runtime, `--network=none`, read-only rootfs with a small tmpfs workdir;
   - seccomp default profile + a stricter custom profile per language, `no-new-privileges`, all capabilities dropped;
   - cgroup limits: CPU time (wall + cpu), memory (default 256 MB), pids (default 64), tmpfs size (default 32 MB), output size cap (1 MB);
   - **gVisor (`runsc`) as the runtime where available**, falling back to hardened runc on hosts that can't run it; the deployment docs treat gVisor as strongly recommended, not optional, for public instances.
3. **Determinism & fairness:** language images are version-pinned by digest; problems declare language + limits in their answer spec (within platform caps); the judge reports limits used so attempts are reproducible.
4. **Grading model:** test cases live in the problem version (hidden cases are stored server-side, never sent to clients). The judge runs the harness *inside* the sandbox and emits a structured JSON verdict; the app **re-validates the verdict schema** and renders all program output as inert text (judge-output injection defense).
5. **Languages at MVP:** Python and JavaScript (Node), chosen for audience fit; the image/manifest design is language-pluggable.
6. **Implementation:** a small **FastAPI** service (per ADR-0001's carve-out) wrapping the container runtime, with Prometheus metrics (queue depth, exec latency, OOM/timeout rates) per ADR-0020.

## Alternatives Considered

- **Third-party judge API (Judge0/Sphere Engine cloud).** Fast to ship but adds a paid external dependency every self-hoster must buy, contrary to the community-owned goal. Self-hosted Judge0 was considered; we prefer a thin in-house wrapper to control hardening defaults and the verdict schema, while keeping our design close enough to adopt Judge0 later if maintenance proves heavy. Rejected for MVP default.
- **In-process restricted execution (Python `exec` + audit hooks, vm2…).** Language-level sandboxes have a long CVE graveyard (vm2 escapes, pysandbox's own author abandoning it). Categorically rejected.
- **Firecracker microVMs.** Best-in-class isolation, but operational complexity (jailer, kernel images, device model) is heavy for a volunteer-run project; gVisor hits a strong middle ground. Deferred: the job protocol is runtime-agnostic, so a Firecracker executor can be swapped in later.
- **WASM/WASI execution.** Excellent isolation and cheap startup, but language coverage and ecosystem fidelity (real Python with real libraries) are not where learners need them yet. Revisit; likely the long-term direction for simple exercises.
- **Client-side execution (Pyodide/WASM in browser).** Zero server risk and great for instant feedback, but grading client-side is cheatable. We adopt it *additionally* later for "run" (not "submit") feedback; authoritative grading stays server-side. Out of MVP scope.

## Consequences

- ✅ Untrusted code never touches app infrastructure; blast radius of a full sandbox escape is a secret-less, network-isolated judge host.
- ✅ Self-hostable with plain Docker; no mandatory third-party service.
- ⚠️ Dedicated judge hosts raise minimum deployment footprint; small instances may co-locate judge workers on the app host **only with gVisor enabled**, and the docs say so loudly.
- ⚠️ Queue-based grading is asynchronous; the UX must handle pending verdicts (polling with backoff; target p95 < 5 s for MVP languages).
- ⚠️ Language images are a supply-chain surface: pinned by digest, rebuilt on a schedule, scanned in CI (ADR-0020/0019).
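
To make the isolation settings from the Decision section concrete, here is a minimal sketch of how a judge worker might assemble a hardened `docker run` command line. It is illustrative, not the actual fablepool-judge code: the names `Limits` and `build_run_argv`, the `/work` mount point, the `--cpus` value, and the example command are all assumptions. Only the default numbers (256 MB memory, 64 pids, 32 MB tmpfs) come from this ADR. Time limits and the output cap are not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Limits:
    """Per-execution limits; defaults mirror the ones stated in this ADR."""
    memory_mb: int = 256
    pids: int = 64
    tmpfs_mb: int = 32
    cpus: float = 1.0  # assumption: the ADR does not give a CPU share default


def build_run_argv(
    image_ref: str,
    limits: Limits,
    command: Sequence[str] = ("python", "/work/harness.py"),  # hypothetical
    seccomp_profile: Optional[str] = None,
    use_gvisor: bool = True,
) -> List[str]:
    """Return a `docker run` argv for one sandboxed execution."""
    # Images must be pinned by digest so runs are reproducible.
    if "@sha256:" not in image_ref:
        raise ValueError("image must be pinned by digest (name@sha256:...)")

    argv = [
        "docker", "run", "--rm",
        "--network=none",                    # no network access at all
        "--read-only",                       # read-only root filesystem
        f"--tmpfs=/work:rw,size={limits.tmpfs_mb}m",  # small writable workdir
        "--workdir=/work",
        "--cap-drop=ALL",                    # drop every capability
        "--security-opt=no-new-privileges",
        f"--memory={limits.memory_mb}m",
        f"--memory-swap={limits.memory_mb}m",  # same as --memory: no swap
        f"--pids-limit={limits.pids}",       # contains fork bombs
        f"--cpus={limits.cpus}",
    ]
    if seccomp_profile is not None:
        # Stricter per-language profile on top of the runtime default.
        argv.append(f"--security-opt=seccomp={seccomp_profile}")
    if use_gvisor:
        # Requires runsc to be registered as a runtime on the host.
        argv.append("--runtime=runsc")
    argv.append(image_ref)
    argv.extend(command)
    return argv
```

Building the argument list in one pure function keeps the hardening defaults reviewable and unit-testable without a container runtime. A real worker would still need to enforce wall-clock and CPU time limits, truncate captured output at the 1 MB cap, and clean up the container on timeout.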
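
The signed callback and the app-side verdict re-validation described in the Decision section could look roughly like the sketch below. The ADR does not specify the signature scheme or the verdict schema, so everything here is an assumption for illustration: HMAC-SHA256 over the raw request body, the field names (`attempt_id`, `status`, `passed`, `total`, `output`), the status values, and the function names.

```python
import hashlib
import hmac
import html
import json

# Hypothetical verdict schema: exact field set and allowed status values.
VERDICT_FIELDS = {"attempt_id": str, "status": str, "passed": int,
                  "total": int, "output": str}
ALLOWED_STATUS = {"accepted", "wrong_answer", "timeout", "oom", "runtime_error"}
MAX_OUTPUT_CHARS = 1024 * 1024  # mirrors the 1 MB output cap


def sign_callback(body: bytes, key: bytes) -> str:
    """Judge side: hex HMAC-SHA256 over the exact bytes being posted."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_callback(body: bytes, signature: str, key: bytes) -> bool:
    """App side: constant-time comparison against the expected signature."""
    expected = sign_callback(body, key)
    return hmac.compare_digest(expected.encode(), signature.encode())


def parse_verdict(body: bytes) -> dict:
    """App side: re-validate the verdict instead of trusting the judge."""
    data = json.loads(body)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != set(VERDICT_FIELDS):
        raise ValueError("verdict has missing or unexpected fields")
    for name, expected_type in VERDICT_FIELDS.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} has the wrong type")
    if data["status"] not in ALLOWED_STATUS:
        raise ValueError("unknown status")
    if not 0 <= data["passed"] <= data["total"]:
        raise ValueError("inconsistent test counts")
    if len(data["output"]) > MAX_OUTPUT_CHARS:
        raise ValueError("output exceeds cap")
    return data


def render_output(output: str) -> str:
    """Escape program output so the UI shows it as inert text."""
    return html.escape(output)
```

Verifying the signature over the raw bytes before parsing means malformed or forged payloads are rejected before any JSON handling, and rejecting unknown fields keeps a compromised judge from smuggling extra data into the app.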