# ADR 0019: Docker Compose for Development; Single-Host Docker for Production MVP

- **Status:** Accepted
- **Date:** 2025-01-15
- **Deciders:** Core architecture team
- **Related:** ADR 0010 (sandbox isolation), ADR 0012 (Redis/Celery), ADR 0018 (MinIO), ADR 0020 (observability), docs/architecture/01

## Context

The system comprises: Next.js frontend, Django API, Celery workers (+ beat), PostgreSQL, Redis, Meilisearch, MinIO, the widget-sandbox static origin, and the code-execution sandbox service (ADR 0010). Two distinct audiences must be able to run it:

1. **Contributors** must get a full working stack in minutes on Linux, macOS, or Windows (WSL2); otherwise the project loses them at the front door.
2. **Self-hosting operators** typically run one modest VPS; "community-owned platform" rings hollow if deployment requires a Kubernetes cluster.

Our own flagship deployment is a third consumer, and it also starts small.

Options: Docker Compose everywhere; Kubernetes (k3s) from day one; Nomad; bare-metal/systemd installs; managed-PaaS recipes.

## Decision

1. **Development: Docker Compose is the canonical environment.** One `docker compose up` brings up everything, with:
   - Bind-mounted source + hot reload for Django (`runserver`) and Next.js (`next dev`); dependency layers cached in images so rebuilds are rare.
   - A `make bootstrap` (or `just bootstrap`) that runs migrations, seeds demo content (sample problems/courses across types), creates a superuser, and builds search indexes — a contributor reaches a *populated, working* app, not a blank database.
   - Profiles: `core` (default), `observability` (Prometheus/Grafana/Jaeger, ADR 0020), `sandbox` (code-execution service — optional locally; code-challenge grading degrades gracefully to "sandbox unavailable" when absent).
   - Pinned image tags for all services (e.g. `postgres:16`, `redis:7`, `getmeili/meilisearch:v1.x`) so contributor environments are reproducible.
   - **Native-run escape hatch documented:** Compose-for-infra-only (Postgres/Redis/Meili/MinIO in containers, Django/Next on the host) for contributors who prefer native debugging.
2. **Production MVP: single-host Docker, Compose-managed, behind Caddy.**
   - The same images (built once in CI by the ADR-referenced GitHub Actions workflow and published to GHCR) run in a production Compose file: gunicorn with uvicorn workers for Django, the Next.js standalone server for the frontend (`node server.js` from the `standalone` build output, rather than `next start`, which does not serve standalone builds), dedicated Celery worker containers per queue class, and **Caddy** as the reverse proxy (automatic TLS via Let's Encrypt, HTTP/3, sane defaults; chosen over nginx specifically to remove the certificate-management failure mode for self-hosters).
   - **Same artifact, different configuration:** dev and prod run the same image with env-var configuration (12-factor); there is no "prod-only Dockerfile drift."
   - **The code-execution sandbox runs on a separate host (or is disabled)** in production — per ADR 0010, untrusted code never shares a kernel with the main stack. The widget-sandbox origin is just static files on a separate domain, served by the same Caddy.
   - Deployment is `git pull && docker compose pull && docker compose up -d` plus a documented migration step; a small `deploy.sh` wraps the ordering (migrate → restart API → restart workers) and gates each step on health checks. Scripted start-new-before-stop-old restarts give near-zero downtime for the stateless services; we accept brief gaps while workers restart.
   - **Backups are part of the deployment definition, not an afterthought:** a scheduled container runs `pg_dump` and syncs the MinIO buckets to the `ops` bucket/offsite target; Meilisearch indexes are documented as rebuildable rather than backed up. The restore procedure is documented and exercised quarterly in CI (restore-test workflow).
3. **Kubernetes is explicitly deferred, with named triggers:** we adopt k8s (likely k3s first) when any of: (a) the flagship instance needs >1 app host for capacity, (b) zero-downtime deploys become contractual rather than nice-to-have, or (c) worker autoscaling materially saves money. Until then, the operational complexity (ingress, cert-manager, persistent volumes, upgrade churn) is pure cost. Nothing in the architecture resists the move: stateless 12-factor containers, externalized state, health endpoints, and config-by-env are exactly what k8s wants anyway.
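
The ordering that `deploy.sh` is meant to enforce (migrate → restart API → restart workers, each gated on the previous step) can be sketched as follows. This is a minimal illustration in Python, not the actual script: the service names `api`, `worker-default`, and `worker-heavy` are hypothetical, and health gating is delegated to `docker compose up --wait`, which assumes health checks are defined in the Compose file. It shows ordering and gating only; it does not implement start-new-before-stop-old.

```python
from typing import Callable, Sequence

# Hypothetical Compose service names; the ADR does not name them.
API_SERVICE = "api"
WORKER_SERVICES = ("worker-default", "worker-heavy")


def deploy_plan(
    api: str = API_SERVICE,
    workers: Sequence[str] = WORKER_SERVICES,
) -> list[list[str]]:
    """Return the deploy steps in the order they must run."""
    compose = ["docker", "compose"]
    return [
        ["git", "pull"],
        compose + ["pull"],
        # Apply migrations in a one-off container before any new code serves traffic.
        compose + ["run", "--rm", api, "python", "manage.py", "migrate", "--noinput"],
        # --wait blocks until the service is running/healthy, so a failed
        # health check stops the deploy before the workers are touched.
        compose + ["up", "-d", "--no-deps", "--wait", api],
        compose + ["up", "-d", "--no-deps", "--wait", *workers],
    ]


def run_plan(
    plan: Sequence[Sequence[str]],
    execute: Callable[[list[str]], int],
) -> int:
    """Run steps in order and stop at the first non-zero exit code.

    Returns the number of steps that completed successfully.
    """
    for done, command in enumerate(plan):
        if execute(list(command)) != 0:
            return done
    return len(plan)
```

A real run would pass something like `lambda cmd: subprocess.run(cmd).returncode` as `execute`; passing a function that only prints the command gives a dry run.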

## Alternatives Considered

- **Kubernetes from day one:** would consume volunteer-ops attention the content platform needs more; raises self-hosting barrier dramatically; solves scaling problems we don't have. Deferred with explicit triggers, not rejected forever.
- **Bare-metal/systemd install docs:** maximal transparency, but multiplies the support matrix (distro × Python × Node versions); containers collapse that matrix. We won't block community-maintained native packaging (e.g., a future Nix flake), but won't maintain it in-tree for MVP.
- **PaaS recipes (Fly/Render/Railway) as primary:** convenient but couples the canonical deployment to commercial platforms — wrong default for a community-owned project. May be added as community-maintained alternatives.
- **Nomad:** smaller operational footprint than k8s, but a niche skill pool; it fails the contributor-familiarity test applied in ADR 0012.

## Consequences

**Positive**
- Minutes-to-running-stack for contributors; one-VPS self-hosting with automatic TLS; dev/prod parity via identical images; clean later path to k8s.
- Treating backup/restore as a core deliverable guards against the most common self-hosting disaster.
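
The scheduled backup job from the Decision section could look roughly like the sketch below. It only builds the commands; the database name `app`, the `/backups` path, the bucket name `media`, and the MinIO client aliases `local` and `offsite` are all assumptions made for illustration. Connection credentials are assumed to come from the standard libpq environment variables (`PGHOST`, `PGUSER`, `PGPASSWORD`). Meilisearch is deliberately absent because its indexes are treated as rebuildable.

```python
from datetime import datetime, timezone
from typing import Sequence


def backup_commands(
    now: datetime,
    db_name: str = "app",                 # hypothetical database name
    buckets: Sequence[str] = ("media",),  # hypothetical MinIO bucket names
    source_alias: str = "local",          # hypothetical `mc` alias for the live MinIO
    target_alias: str = "offsite",        # hypothetical `mc` alias for the backup target
) -> list[list[str]]:
    """Build the backup commands for one scheduled run."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/backups/postgres-{stamp}.dump"
    commands = [
        # Custom-format dump, restorable with pg_restore.
        ["pg_dump", "--format=custom", f"--file={dump_path}", f"--dbname={db_name}"],
        # Ship the dump to the `ops` bucket on the backup target.
        ["mc", "cp", dump_path, f"{target_alias}/ops/postgres/"],
    ]
    for bucket in buckets:
        # Mirror each application bucket under the `ops` bucket.
        commands.append(
            ["mc", "mirror", f"{source_alias}/{bucket}", f"{target_alias}/ops/minio/{bucket}"]
        )
    return commands
```

Keeping the command list a pure function of the timestamp makes it straightforward for the restore-test workflow to locate the dump it should restore.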

**Negative / Accepted risks**
- Single-host prod is a single point of failure; mitigated by backups + documented restore, and accepted as appropriate for MVP scale. The status page should be hosted off-box.
- Compose's orchestration is primitive (no real rolling-deploy semantics for workers); accepted, scripted around.
- Two Compose files (dev/prod) risk drift; mitigated by sharing a base file with overrides and CI-validating both configurations (`docker compose config`) on every PR.
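
The drift check on every PR could be as small as the sketch below, which validates each base-plus-override combination with `docker compose config --quiet`. The file names `compose.yaml`, `compose.dev.yaml`, and `compose.prod.yaml` are assumptions; the ADR does not fix them.

```python
from typing import Callable, Mapping

# Hypothetical file layout: one shared base file plus one override per environment.
BASE_FILE = "compose.yaml"
OVERRIDES = {"dev": "compose.dev.yaml", "prod": "compose.prod.yaml"}


def validation_commands(
    base: str = BASE_FILE,
    overrides: Mapping[str, str] = OVERRIDES,
) -> dict[str, list[str]]:
    """Build one `docker compose config` invocation per environment."""
    return {
        env: ["docker", "compose", "-f", base, "-f", override, "config", "--quiet"]
        for env, override in overrides.items()
    }


def failing_configs(
    commands: Mapping[str, list[str]],
    execute: Callable[[list[str]], int],
) -> list[str]:
    """Return the environments whose merged configuration fails to validate."""
    return [env for env, command in commands.items() if execute(command) != 0]
```

In CI, `execute` would wrap `subprocess.run(cmd).returncode`, and the job would fail if `failing_configs(...)` returns a non-empty list.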

**Follow-ups**
- Infra milestone: dev Compose with profiles, prod Compose + Caddy, `deploy.sh`, backup container, GHCR build pipeline, restore-test CI workflow.