# ADR 0012: Redis for Cache/Broker, Celery for Background Jobs

- **Status:** Accepted
- **Date:** 2025-01-15
- **Deciders:** Core architecture team
- **Related:** ADR 0001 (Django/DRF), ADR 0011 (Meilisearch indexing), ADR 0010 (code execution sandbox), docs/architecture/01-system-overview.md

## Context

Several platform responsibilities must not run inside the HTTP request/response cycle:

- **Search indexing** on publish/unpublish/rollback events (ADR 0011).
- **Code-challenge grading**: dispatching submissions to the sandbox service, polling/receiving results, recording attempts (ADR 0010). Grading can take seconds; requests must not block.
- **Notifications**: fan-out of review decisions, discussion replies, mentions; email digests.
- **Spaced-repetition scheduling**: nightly computation of due review queues per user.
- **Reputation recalculation**, plagiarism-similarity scans on submission, media post-processing (image resizing for low-bandwidth variants per docs/architecture/09), and OER export bundling.
- **Periodic jobs**: streak rollover at user-local midnight boundaries, search `quality_score` recompute, stale-draft reminders, audit-log archival (ADR 0013).

We also need a **cache** for rendered MDX fragments, hot problem documents, session-adjacent data, and API rate-limit counters.

Options evaluated:

1. **Redis + Celery**: the canonical Django pairing; mature, enormous documentation surface.
2. **Redis + Dramatiq**: leaner API, arguably cleaner defaults (acks-late by default), smaller community.
3. **PostgreSQL-backed queue** (e.g. `django-q2`, custom `SKIP LOCKED` tables): one less service.
4. **RabbitMQ + Celery**: stronger broker semantics, heavier operations.

## Decision

We adopt **Redis (7.x)** as both the cache backend and the message broker, and **Celery (5.x)** as the task framework, with **Celery Beat** (via `django-celery-beat`) for periodic scheduling.

### Topology and conventions

1. **Logical separation within one Redis instance** at MVP scale, using databases/key prefixes:
   - DB 0: Django cache (`django-redis` backend).
   - DB 1: Celery broker.
   - DB 2: Celery result backend (results kept only for tasks whose callers need them, e.g. grading status polling; most tasks are fire-and-forget with `ignore_result=True`).
   - Rate-limit counters live in the cache DB with `rl:` prefixes.

   Large deployments can split these into separate Redis instances purely via configuration.

2. **Queues by latency class**, so a slow grading backlog can never starve notifications:
   - `interactive`: grading dispatch, anything a user is actively waiting on.
   - `default`: indexing, notifications, reputation updates.
   - `batch`: nightly/periodic heavy jobs.

   Workers are deployed per queue; the dev Compose file runs one worker consuming all three.

3. **Task hygiene rules (enforced in code review):**
   - Tasks are **idempotent**: every task can be retried safely; tasks key their effects on stable IDs and use upserts or conditional writes.
   - Tasks receive **primitive IDs, never ORM objects**, and re-fetch state at execution time.
   - `acks_late=True` + `task_reject_on_worker_lost=True` for all tasks, with explicit `max_retries` and exponential backoff; tasks that exhaust retries land in a dead-letter handling path that writes a structured error record and emits a metric (ADR 0020).
   - Time limits on every task (`soft_time_limit` set; hard limit slightly above).

4. **Redis durability posture:** Redis is treated as **losable**. AOF (`appendonly yes`, `everysec`) is enabled to reduce loss on restart, but the system is designed so that a flushed Redis costs, at worst, re-enqueueable work: critical state transitions (publication, attempt recording, review decisions) are committed to PostgreSQL *before* the task enqueue, and reconciliation jobs (e.g. "index anything published but missing from search") run periodically to heal dropped tasks.

5. **Enqueue-after-commit:** all task enqueues from request handlers use `transaction.on_commit(...)` so tasks never observe uncommitted state.

## Alternatives Considered

- **Dramatiq:** genuinely attractive (saner defaults, simpler internals). Rejected primarily for ecosystem reasons: `django-celery-beat` for DB-managed periodic schedules, broad contributor familiarity, Flower/inspection tooling, and richer canvas primitives (chains/groups) which the OER export pipeline will use. For an open-source project, optimizing for the *largest contributor familiarity pool* is itself an architectural value.
- **PostgreSQL-backed queue:** tempting for "one less service," but Redis is needed for caching and rate limiting regardless, so the service-count argument evaporates; and a busy polling queue adds load to the database we most need to protect.
- **RabbitMQ:** better broker guarantees (true ack semantics, no result/broker conflation) but a third stateful service with non-trivial operations (Erlang VM, mnesia, federation). Our idempotent-task discipline plus reconciliation jobs make Redis's weaker guarantees acceptable.

## Consequences

**Positive**

- One additional service (Redis) covers cache, broker, results, and rate limiting.
- Celery's maturity de-risks the trickiest flows (grading orchestration with retries and timeouts).
- Queue-per-latency-class prevents head-of-line blocking.

**Negative / Accepted risks**

- Redis-as-broker can drop tasks in crash scenarios; mitigated by write-then-enqueue ordering, idempotency, and reconciliation sweeps. We explicitly do **not** promise exactly-once execution anywhere; everything is at-least-once + idempotent.
- Celery's configuration surface is large and has historical footguns; mitigated by a single, heavily commented `celery.py` with our hygiene defaults applied globally.
- Beat is a single point of scheduling; we run exactly one beat process and document that constraint.

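As an illustration of the globally applied hygiene defaults mentioned under accepted risks, the following is a minimal sketch of the values a shared `celery.py` might load. The Redis URL, the time-limit numbers, the `CELERY_DEFAULTS` name, and the task-name patterns (`grading.tasks.*`, `maintenance.tasks.*`) are assumptions made for the example, not values fixed by this ADR.

```python
# Sketch only: host, limits, and task-name patterns are illustrative assumptions.
REDIS_URL = "redis://localhost:6379"  # assumed single MVP Redis instance

# DB 0: Django cache via the django-redis backend.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": f"{REDIS_URL}/0",
        "OPTIONS": {"CLIENT_CLASS": "django_redis.client.DefaultClient"},
    }
}

# Celery settings (lowercase names as used by app.conf).
CELERY_DEFAULTS = {
    "broker_url": f"{REDIS_URL}/1",      # DB 1: broker
    "result_backend": f"{REDIS_URL}/2",  # DB 2: results, for the few tasks that need them
    "task_ignore_result": True,          # fire-and-forget unless a task opts out
    "task_acks_late": True,              # ack only after the task body has run
    "task_reject_on_worker_lost": True,  # requeue if the worker process dies mid-task
    "task_soft_time_limit": 30,          # seconds; hypothetical value
    "task_time_limit": 40,               # hard limit slightly above the soft limit
    "task_default_queue": "default",
    "task_routes": {
        # Hypothetical module names; glob patterns map tasks to latency classes.
        "grading.tasks.*": {"queue": "interactive"},
        "maintenance.tasks.*": {"queue": "batch"},
    },
    # DB-managed periodic schedules via django-celery-beat.
    "beat_scheduler": "django_celery_beat.schedulers:DatabaseScheduler",
}
```

In `celery.py`, a dictionary like this could be applied with `app.conf.update(CELERY_DEFAULTS)`. Retry policy (`max_retries`, backoff) would still be declared per task, and tasks whose callers poll for results (such as grading) would set `ignore_result=False` individually.
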
**Follow-ups**

- Implement the reconciliation tasks (`heal_search_index`, `heal_pending_gradings`) in the backend milestone.
- Add Celery queue depth and task failure-rate metrics to the observability stack (ADR 0020).
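
The core of a reconciliation sweep such as `heal_search_index` can be sketched as a set difference between what PostgreSQL considers published and what the search index actually contains. This is a minimal sketch under assumptions: the function signatures, the `find_missing_from_index` helper, and the `enqueue_index` callback are hypothetical, and a real task would load the ID sets from the database and Meilisearch rather than take them as arguments.

```python
from typing import Callable, Iterable


def find_missing_from_index(
    published_ids: Iterable[int], indexed_ids: Iterable[int]
) -> list[int]:
    """Return IDs that are published but absent from the search index."""
    return sorted(set(published_ids) - set(indexed_ids))


def heal_search_index(
    published_ids: Iterable[int],
    indexed_ids: Iterable[int],
    enqueue_index: Callable[[int], None],
) -> int:
    """Re-enqueue indexing for every published document missing from search.

    Re-running the sweep is safe as long as the indexing task is idempotent,
    because an already-indexed document simply drops out of the difference.
    """
    missing = find_missing_from_index(published_ids, indexed_ids)
    for doc_id in missing:
        # Pass a primitive ID only; the indexing task re-fetches state itself.
        enqueue_index(doc_id)
    return len(missing)
```

In a Celery deployment, `enqueue_index` would typically be the `.delay` method of the indexing task, and the sweep itself would run on the `batch` queue from a Beat schedule.
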