# ADR 0019: Docker Compose for Development; Single-Host Docker for Production MVP

- **Status:** Accepted
- **Date:** 2025-01-15
- **Deciders:** Core architecture team
- **Related:** ADR 0010 (sandbox isolation), ADR 0012 (Redis/Celery), ADR 0018 (MinIO), ADR 0020 (observability), docs/architecture/01

## Context

The system comprises: Next.js frontend, Django API, Celery workers (+ beat), PostgreSQL, Redis, Meilisearch, MinIO, the widget-sandbox static origin, and the code-execution sandbox service (ADR 0010). Two distinct audiences must be able to run it:

1. **Contributors:** must get a full working stack in minutes on Linux/macOS/Windows (WSL2), or the project loses contributors at the front door.
2. **Self-hosting operators:** typically one modest VPS; "community-owned platform" rings hollow if deployment requires a Kubernetes cluster.

Plus our own flagship deployment, which starts small.

Options: Docker Compose everywhere; Kubernetes (k3s) from day one; Nomad; bare-metal/systemd installs; managed-PaaS recipes.

## Decision

1. **Development: Docker Compose is the canonical environment.** One `docker compose up` brings up everything, with:
   - Bind-mounted source + hot reload for Django (`runserver`) and Next.js (`next dev`); dependency layers cached in images so rebuilds are rare.
   - A `make bootstrap` (or `just bootstrap`) that runs migrations, seeds demo content (sample problems/courses across types), creates a superuser, and builds search indexes, so that a contributor reaches a *populated, working* app, not a blank database.
   - Profiles: `core` (default), `observability` (Prometheus/Grafana/Jaeger, ADR 0020), `sandbox` (code-execution service, optional locally; code-challenge grading degrades gracefully to "sandbox unavailable" when absent).
   - Pinned image tags for all services (e.g. `postgres:16`, `redis:7`, `getmeili/meilisearch:v1.x`) so contributor environments are reproducible.
   - **Native-run escape hatch documented:** Compose-for-infra-only (Postgres/Redis/Meili/MinIO in containers, Django/Next on the host) for contributors who prefer native debugging.
2. **Production MVP: single-host Docker, Compose-managed, behind Caddy.**
   - The same images (built once in CI, ADR-referenced GitHub Actions, published to GHCR) run in a production Compose file: gunicorn with uvicorn workers for Django, `next start` (standalone output) for the frontend, dedicated Celery worker containers per queue class, and **Caddy** as the reverse proxy (automatic TLS via Let's Encrypt, HTTP/3, sane defaults; chosen over nginx specifically to remove the certificate-management failure mode for self-hosters).
   - **Same artifact, different configuration:** dev and prod run the same image with env-var configuration (12-factor); there is no "prod-only Dockerfile drift."
   - **The code-execution sandbox runs on a separate host (or is disabled)** in production: per ADR 0010, untrusted code never shares a kernel with the main stack. The widget-sandbox origin is just static files on a separate domain, served by the same Caddy.
   - Deployment is `git pull && docker compose pull && docker compose up -d` with a documented migration step; a small `deploy.sh` wraps ordering (migrate → restart API → restart workers) and health-check gating. Rolling restarts (start-new-before-stop-old) give near-zero downtime for the stateless services; we accept brief worker-restart gaps.
   - **Backups are part of the deployment definition, not an afterthought:** a scheduled container runs `pg_dump` and a MinIO bucket sync, writing to the `ops` bucket/offsite target, with a note that Meilisearch indexes are rebuildable. The restore procedure is documented and exercised in CI quarterly (restore-test workflow).
3. **Kubernetes is explicitly deferred, with named triggers:** we adopt k8s (likely k3s first) when any of the following holds: (a) the flagship instance needs >1 app host for capacity, (b) zero-downtime deploys become contractual rather than nice-to-have, or (c) worker autoscaling materially saves money. Until then, the operational complexity (ingress, cert-manager, persistent volumes, upgrade churn) is pure cost. Nothing in the architecture resists the move: stateless 12-factor containers, externalized state, health endpoints, and config-by-env are exactly what k8s wants anyway.

## Alternatives Considered

- **Kubernetes from day one:** would consume volunteer-ops attention that the content platform needs more; raises the self-hosting barrier dramatically; solves scaling problems we don't have. Deferred with explicit triggers, not rejected forever.
- **Bare-metal/systemd install docs:** maximal transparency, but multiplies the support matrix (distro × Python × Node versions); containers collapse that matrix. We won't block community-maintained native packaging (e.g., a future Nix flake), but won't maintain it in-tree for MVP.
- **PaaS recipes (Fly/Render/Railway) as primary:** convenient, but couples the canonical deployment to commercial platforms, which is the wrong default for a community-owned project. May be added as community-maintained alternatives.
- **Nomad:** smaller operational footprint than k8s but a niche skill pool; fails the contributor-familiarity test applied in ADR 0012.

## Consequences

**Positive**

- Minutes-to-running-stack for contributors; one-VPS self-hosting with automatic TLS; dev/prod parity via identical images; clean later path to k8s.
- Treating backup/restore as a core deliverable reduces the most common self-host disaster.

**Negative / Accepted risks**

- Single-host prod is a single point of failure; mitigated by backups + documented restore, and accepted as appropriate for MVP scale. The status page should be hosted off-box.
- Compose's orchestration is primitive (no real rolling-deploy semantics for workers); accepted, scripted around.
- Two Compose files (dev/prod) risk drift; mitigated by sharing a base file with overrides and CI-validating both configurations (`docker compose config`) on every PR.

**Follow-ups**

- Infra milestone: dev Compose with profiles, prod Compose + Caddy, `deploy.sh`, backup container, GHCR build pipeline, restore-test CI workflow.
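
The deploy ordering named in the Decision (migrate → restart API → restart workers, gated on a health check) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the project's `deploy.sh`: the service names `api`, `worker`, and `beat`, the `HEALTH_URL` endpoint, and all function names are hypothetical, and the retry counts are arbitrary.

```python
import subprocess
import time
import urllib.request
from typing import Callable, List, Sequence

# Assumed for illustration only: service names and the health endpoint
# are placeholders; the ADR does not specify them.
HEALTH_URL = "http://localhost:8000/healthz"

PRE_GATE_STEPS: List[List[str]] = [
    ["git", "pull"],
    ["docker", "compose", "pull"],
    # Apply migrations before the new API containers start serving.
    ["docker", "compose", "run", "--rm", "api",
     "python", "manage.py", "migrate", "--noinput"],
    ["docker", "compose", "up", "-d", "--no-deps", "api"],
]
WORKER_STEP: List[str] = [
    "docker", "compose", "up", "-d", "--no-deps", "worker", "beat",
]


def run_checked(cmd: Sequence[str]) -> None:
    """Run one command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def http_ok(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts derive from OSError
        return False


def wait_healthy(
    check: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def deploy(
    run: Callable[[Sequence[str]], None] = run_checked,
    check: Callable[[], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the ordered steps; restart workers only after the API is healthy."""
    for cmd in PRE_GATE_STEPS:
        run(cmd)
    if not wait_healthy(check, sleep=sleep):
        raise RuntimeError("API health check failed; workers were not restarted")
    run(WORKER_STEP)
```

Calling `deploy()` with its defaults shells out to git and Docker, while passing stub `run` and `check` callables gives a dry run of the ordering. The sketch recreates containers in place and does not attempt the start-new-before-stop-old restart described above; that part would still need separate scripting.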