# ADR-0010: Isolated Container-Based Code Judge Service

- **Status:** Accepted
- **Date:** 2024-06-03
- **Deciders:** FablePool core team
- **Related:** ADR-0001 (backend), ADR-0009 (widget sandbox), ADR-0012
  (Celery), `docs/architecture/01-system-overview.md`

## Context

Code-challenge problems require executing untrusted learner code. The brief
is explicit: *never run untrusted code directly on the main server*. The
judge must be safe (kernel-level isolation), fair (deterministic limits),
self-hostable (no mandatory cloud dependency), and simple enough for a
small team to operate.

Threats: container escape, resource exhaustion (fork bombs, memory bombs,
disk fill), network abuse (spam, attacking internal services, exfiltrating
secrets), timing attacks against shared infrastructure, and judge-output
injection (malicious output that exploits the result parser or the UI).

## Decision

Build a **separate judge service** ("fablepool-judge"), deployed on
**dedicated hosts/VMs that run nothing else** and hold **no secrets** beyond
a queue credential:

1. **Architecture:** the main app never talks to learner code. A code
   attempt is enqueued (Redis queue, ADR-0012); judge workers pull jobs,
   execute, and post results back via a signed callback to a single
   narrowly-scoped API endpoint. The judge host cannot reach Postgres,
   the app's internal network, or object storage write paths.
2. **Isolation: one container per execution**, hardened:
   - rootless runtime, `--network=none`, read-only rootfs with a small
     tmpfs workdir;
   - seccomp default profile + a stricter custom profile per language,
     `no-new-privileges`, all capabilities dropped;
   - cgroup and related resource limits: time (wall-clock and CPU), memory
     (default 256 MB), pids (default 64), tmpfs size (default 32 MB), output
     size cap (1 MB);
   - **gVisor (`runsc`) as the runtime where available**, falling back to
     hardened runc on hosts that can't run it — the deployment docs treat
     gVisor as strongly recommended, not optional, for public instances.
3. **Determinism & fairness:** language images are version-pinned by
   digest; problems declare language + limits in their answer spec
   (within platform caps); the judge reports the limits it actually
   applied so attempts are reproducible.
4. **Grading model:** test cases live in the problem version (hidden cases
   are stored server-side, never sent to clients). The judge runs the
   harness *inside* the sandbox and emits a structured JSON verdict; the
   app **re-validates the verdict schema** and renders all program output
   as inert text (judge-output injection defense).
5. **Languages at MVP:** Python and JavaScript (Node), chosen for audience
   fit; the image/manifest design is language-pluggable.
6. **Implementation:** a small **FastAPI** service (per ADR-0001's carve-
   out) wrapping the container runtime, with Prometheus metrics
   (queue depth, exec latency, OOM/timeout rates) per ADR-0020.
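
The signed callback in item 1 could be implemented in several ways; this ADR does not fix the mechanism. The sketch below assumes an HMAC-SHA256 signature over a timestamp plus the raw request body, using a key scoped to the judge. The function names, the `timestamp.body` message layout, and the 300-second skew window are illustrative assumptions, not part of the decision.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_callback(secret: bytes, body: bytes, timestamp: int) -> str:
    """Return a hex HMAC-SHA256 over "<timestamp>.<body>" (assumed layout)."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_callback(
    secret: bytes,
    body: bytes,
    timestamp: int,
    signature: str,
    *,
    now: Optional[int] = None,
    max_skew: int = 300,  # hypothetical replay window, in seconds
) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > max_skew:
        return False
    expected = sign_callback(secret, body, timestamp)
    # Compare as bytes so a non-ASCII header value cannot raise TypeError.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Binding the timestamp into the signed message means a captured callback cannot be replayed outside the skew window, and the app endpoint only needs the shared key, not any wider trust in the judge host.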
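
The hardening list in item 2 maps fairly directly onto container runtime flags. As a rough sketch, assuming the Docker CLI with `runsc` registered as a runtime, a worker might assemble its command line as below. The `Limits` class, `build_run_argv`, the seccomp profile path, the `/work` mount point, and the 2-second CPU default are hypothetical names and values chosen for illustration. Rootless mode is a property of the daemon rather than a flag, and the wall-clock timeout and output size cap would be enforced by the caller, so none of those appear here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Limits:
    """Per-execution limits; the first three defaults mirror this ADR."""
    memory_mb: int = 256
    pids: int = 64
    tmpfs_mb: int = 32
    cpu_seconds: int = 2  # hypothetical; the ADR states no CPU-time default


def build_run_argv(
    image_ref: str,
    limits: Limits,
    *,
    seccomp_profile: str = "seccomp/python.json",  # hypothetical path
    use_gvisor: bool = True,
) -> List[str]:
    """Assemble a `docker run` argv for one hardened, single-use container."""
    argv = [
        "docker", "run", "--rm",
        "--network=none",                              # no network at all
        "--read-only",                                 # read-only rootfs
        f"--tmpfs=/work:rw,size={limits.tmpfs_mb}m",   # small writable workdir
        "--workdir=/work",
        "--cap-drop=ALL",
        "--security-opt=no-new-privileges",
        f"--security-opt=seccomp={seccomp_profile}",
        f"--memory={limits.memory_mb}m",
        f"--memory-swap={limits.memory_mb}m",          # equal to memory: no swap
        f"--pids-limit={limits.pids}",
        f"--ulimit=cpu={limits.cpu_seconds}",          # RLIMIT_CPU, in seconds
    ]
    if use_gvisor:
        argv.append("--runtime=runsc")
    argv.append(image_ref)  # expected to be a digest-pinned reference
    return argv
```

Building the argv as a list (rather than a shell string) keeps problem-supplied values out of any shell parsing, and makes the applied limits easy to echo back in the verdict.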
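
Item 4's app-side defense (re-validate the verdict, render output as inert text) can be illustrated with the standard library alone. The verdict fields (`status`, `passed`, `total`, `stdout`) and the status names below are assumptions made for this sketch; the real verdict schema is not defined in this ADR.

```python
import html
import json

# Hypothetical status vocabulary; the actual schema may differ.
ALLOWED_STATUSES = {"accepted", "wrong_answer", "time_limit", "memory_limit", "runtime_error"}
EXPECTED_KEYS = {"status", "passed", "total", "stdout"}
MAX_OUTPUT_BYTES = 1_000_000  # matches the 1 MB output cap


class InvalidVerdict(ValueError):
    """Raised when judge output does not match the expected shape."""


def parse_verdict(raw: bytes) -> dict:
    """Strictly validate a verdict before the app trusts any field in it."""
    if len(raw) > 2 * MAX_OUTPUT_BYTES:
        raise InvalidVerdict("verdict too large")
    try:
        data = json.loads(raw)
    except (ValueError, RecursionError) as exc:  # bad JSON, bad UTF-8, deep nesting
        raise InvalidVerdict("not valid JSON") from exc
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise InvalidVerdict("unexpected top-level shape")
    status, passed, total, stdout = (data[k] for k in ("status", "passed", "total", "stdout"))
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise InvalidVerdict("unknown status")
    # `type(...) is int` rejects booleans, which are an int subclass in Python.
    if type(passed) is not int or type(total) is not int or not 0 <= passed <= total:
        raise InvalidVerdict("bad test counts")
    if not isinstance(stdout, str) or len(stdout.encode("utf-8", "replace")) > MAX_OUTPUT_BYTES:
        raise InvalidVerdict("bad stdout")
    return {"status": status, "passed": passed, "total": total, "stdout": stdout}


def render_output(stdout: str) -> str:
    """Escape program output so it can only ever display as text."""
    return html.escape(stdout)
```

Rejecting unknown keys and out-of-range counts, instead of ignoring them, means a compromised harness cannot smuggle extra fields through to later consumers.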

## Alternatives Considered

- **Third-party judge API (Judge0/Sphere Engine cloud).** Fast to ship but
  adds a paid external dependency every self-hoster must buy — contrary to
  the community-owned goal. Self-hosted Judge0 was considered; we prefer a
  thin in-house wrapper to control hardening defaults and the verdict
  schema, while keeping our design close enough to adopt Judge0 later if
  maintenance proves heavy. Rejected for MVP default.
- **In-process restricted execution (Python `exec` + audit hooks, vm2…).**
  Language-level sandboxes have a long CVE graveyard (vm2 escapes,
  pysandbox's own author abandoning it). Categorically rejected.
- **Firecracker microVMs.** Best-in-class isolation, but operational
  complexity (jailer, kernel images, device model) is heavy for a
  volunteer-run project; gVisor hits a strong middle ground. Deferred —
  the job protocol is runtime-agnostic, so a Firecracker executor can be
  swapped in later.
- **WASM/WASI execution.** Excellent isolation and cheap startup, but
  language coverage and ecosystem fidelity (real Python with real
  libraries) are not where learners need them yet. Revisit; likely the
  long-term direction for simple exercises.
- **Client-side execution (Pyodide/WASM in browser).** Zero server risk and
  great for instant feedback — but grading client-side is cheatable. We
  adopt it *additionally* later for "run" (not "submit") feedback;
  authoritative grading stays server-side. Out of MVP scope.

## Consequences

- ✅ Untrusted code never touches app infrastructure; blast radius of a
  full sandbox escape is a secret-less, network-isolated judge host.
- ✅ Self-hostable with plain Docker; no mandatory third-party service.
- ⚠️ Dedicated judge hosts raise minimum deployment footprint; small
  instances may co-locate judge workers on the app host **only with
  gVisor enabled**, and the docs say so loudly.
- ⚠️ Queue-based grading is asynchronous; the UX must handle pending
  verdicts (polling with backoff; target p95 submit-to-verdict latency
  < 5 s for MVP languages).
- ⚠️ Language images are a supply-chain surface: pinned by digest, rebuilt
  on a schedule, scanned in CI (ADR-0019, ADR-0020).
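
To make "polling with backoff" concrete, here is a minimal sketch of a capped exponential schedule. It is written in Python for illustration even though a real client would likely live in the browser. The function names and all numeric defaults (250 ms initial delay, doubling, 2 s cap, 30 s budget) are assumptions, not values decided by this ADR.

```python
import time
from typing import Callable, Iterator, Optional


def backoff_delays(
    initial: float = 0.25,
    factor: float = 2.0,
    cap: float = 2.0,
    budget: float = 30.0,
) -> Iterator[float]:
    """Yield sleep intervals that grow by `factor`, are capped at `cap`,
    and never sum to more than `budget` seconds."""
    delay, spent = initial, 0.0
    while spent < budget:
        step = min(delay, cap, budget - spent)
        yield step
        spent += step
        delay *= factor


def poll_verdict(
    fetch: Callable[[], Optional[dict]],
    *,
    sleep: Callable[[float], None] = time.sleep,
    initial: float = 0.25,
    factor: float = 2.0,
    cap: float = 2.0,
    budget: float = 30.0,
) -> Optional[dict]:
    """Call `fetch` until it returns a verdict or the time budget runs out."""
    for delay in backoff_delays(initial, factor, cap, budget):
        verdict = fetch()
        if verdict is not None:
            return verdict
        sleep(delay)
    return fetch()  # one last check after the final wait


# With a 5-second budget the schedule is:
# list(backoff_delays(budget=5.0)) == [0.25, 0.5, 1.0, 2.0, 1.25]
```

Short early intervals keep fast verdicts feeling near-instant, while the cap bounds request volume against the app when the queue is backed up.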