# ADR 0012: Redis for Cache/Broker, Celery for Background Jobs

- **Status:** Accepted
- **Date:** 2025-01-15
- **Deciders:** Core architecture team
- **Related:** ADR 0001 (Django/DRF), ADR 0011 (Meilisearch indexing), ADR 0010 (code execution sandbox), docs/architecture/01-system-overview.md

## Context

Several platform responsibilities must not run inside the HTTP request/response cycle:

- **Search indexing** on publish/unpublish/rollback events (ADR 0011).
- **Code-challenge grading**: dispatching submissions to the sandbox service, polling/receiving results, recording attempts (ADR 0010). Grading can take seconds; requests must not block.
- **Notifications**: fan-out of review decisions, discussion replies, mentions; email digests.
- **Spaced-repetition scheduling**: nightly computation of due review queues per user.
- **Reputation recalculation**, plagiarism-similarity scans on submission, media post-processing (image resizing for low-bandwidth variants per docs/architecture/09), and OER export bundling.
- **Periodic jobs**: streak rollover at user-local midnight boundaries, search `quality_score` recompute, stale-draft reminders, audit-log archival (ADR 0013).

We also need a **cache** for rendered MDX fragments, hot problem documents, session-adjacent data, and API rate-limit counters.
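
As a rough illustration of the rate-limit counters mentioned above, the sketch below implements a fixed-window counter. The fixed-window strategy, the `InMemoryCounterStore` stand-in for Redis, and the `allow_request` helper are all assumptions made for this example; only the `rl:` key prefix comes from this ADR.

```python
import time
from typing import Optional


class InMemoryCounterStore:
    """Hypothetical stand-in for the Redis cache: an increment with a per-key expiry."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[int, float]] = {}

    def incr_with_ttl(self, key: str, ttl_seconds: int, now: float) -> int:
        count, expires_at = self._data.get(key, (0, now + ttl_seconds))
        if now >= expires_at:
            # The key has expired, so the counter restarts from zero.
            count, expires_at = 0, now + ttl_seconds
        count += 1
        self._data[key] = (count, expires_at)
        return count


def allow_request(
    store: InMemoryCounterStore,
    client_id: str,
    limit: int,
    window_seconds: int,
    now: Optional[float] = None,
) -> bool:
    """Return True while the client is within `limit` requests for the current window."""
    now = time.time() if now is None else now
    window_index = int(now // window_seconds)
    # One counter per client per window, namespaced under the `rl:` prefix.
    key = f"rl:{client_id}:{window_index}"
    return store.incr_with_ttl(key, window_seconds, now) <= limit
```

Against a real Redis the increment and expiry would need to be applied atomically; this sketch only shows the key layout and the counting logic.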

Options evaluated:

1. **Redis + Celery** — the canonical Django pairing; mature, enormous documentation surface.
2. **Redis + Dramatiq** — leaner API, arguably cleaner defaults (acks-late by default), smaller community.
3. **PostgreSQL-backed queue** (e.g. `django-q2`, custom `SKIP LOCKED` tables) — one less service.
4. **RabbitMQ + Celery** — stronger broker semantics, heavier operations.

## Decision

We adopt **Redis (7.x)** as both the cache backend and the message broker, and **Celery (5.x)** as the task framework, with **Celery Beat** (via `django-celery-beat`) for periodic scheduling.

### Topology and conventions

1. **Logical separation within one Redis instance** at MVP scale, using databases/key prefixes:
   - DB 0: Django cache (`django-redis` backend).
   - DB 1: Celery broker.
   - DB 2: Celery result backend (results kept only for tasks whose callers need them, e.g. grading status polling; most tasks are fire-and-forget with `ignore_result=True`).
   - Rate-limit counters live in the cache DB with `rl:` prefixes.
   Large deployments can split these into separate Redis instances purely via configuration.
2. **Queues by latency class**, so a slow grading backlog can never starve notifications:
   - `interactive` — grading dispatch, anything a user is actively waiting on.
   - `default` — indexing, notifications, reputation updates.
   - `batch` — nightly/periodic heavy jobs.
   Workers are deployed per queue; the dev Compose file runs one worker consuming all three.
3. **Task hygiene rules (enforced in code review):**
   - Tasks are **idempotent** — every task can be retried safely; tasks key their effects on stable IDs and use upserts or conditional writes.
   - Tasks receive **primitive IDs, never ORM objects**, and re-fetch state at execution time.
   - `acks_late=True` + `reject_on_worker_lost=True` for all tasks (applied globally through the `task_acks_late` and `task_reject_on_worker_lost` settings), with explicit `max_retries` and exponential backoff; tasks that exhaust retries land in a dead-letter handling path that writes a structured error record and emits a metric (ADR 0020).
   - Time limits on every task: `soft_time_limit` is set, with the hard `time_limit` slightly above it so a task can clean up before the worker process is killed.
4. **Redis durability posture:** Redis is treated as **losable**. AOF (`appendonly yes`, `appendfsync everysec`) is enabled to reduce loss on restart, but the system is designed so that a flushed Redis costs, at worst, re-enqueueable work: critical state transitions (publication, attempt recording, review decisions) are committed to PostgreSQL *before* the task enqueue, and reconciliation jobs (e.g. "index anything published but missing from search") run periodically to heal dropped tasks.
5. **Enqueue-after-commit:** all task enqueues from request handlers use `transaction.on_commit(...)` so tasks never observe uncommitted state.
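
The topology and hygiene defaults above could be expressed as a Django settings fragment along these lines. This is a sketch, not the project's actual configuration: it assumes the Celery app loads settings with `config_from_object("django.conf:settings", namespace="CELERY")`, and the host name, port, numeric time limits, and task module paths (`grading.tasks`, `periodic.tasks`) are placeholders.

```python
# settings.py fragment (illustrative). Host, port, limits, and module paths are assumptions.
REDIS_HOST = "redis"
REDIS_PORT = 6379


def redis_url(db: int) -> str:
    """Build the URL for one logical Redis database on the shared instance."""
    return f"redis://{REDIS_HOST}:{REDIS_PORT}/{db}"


# DB 0: Django cache (also holds the rl: rate-limit counters).
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": redis_url(0),
        "OPTIONS": {"CLIENT_CLASS": "django_redis.client.DefaultClient"},
    }
}

# DB 1: broker. DB 2: result backend.
CELERY_BROKER_URL = redis_url(1)
CELERY_RESULT_BACKEND = redis_url(2)

# Hygiene defaults applied to every task unless a task overrides them.
CELERY_TASK_IGNORE_RESULT = True          # fire-and-forget by default
CELERY_TASK_ACKS_LATE = True              # acknowledge only after the task finishes
CELERY_TASK_REJECT_ON_WORKER_LOST = True  # requeue if the worker process dies
CELERY_TASK_SOFT_TIME_LIMIT = 30          # seconds; placeholder value
CELERY_TASK_TIME_LIMIT = 40               # hard limit slightly above the soft limit

# Queues by latency class; anything unmatched goes to "default".
CELERY_TASK_DEFAULT_QUEUE = "default"
CELERY_TASK_ROUTES = {
    "grading.tasks.*": {"queue": "interactive"},
    "periodic.tasks.*": {"queue": "batch"},
}

# Periodic schedules stored in the database via django-celery-beat.
CELERY_BEAT_SCHEDULER = "django_celery_beat.schedulers:DatabaseScheduler"
```

Splitting Redis into separate instances later would then only mean changing what `redis_url` returns for each role, and a per-queue worker can be bound to its queue with Celery's `-Q` worker option.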

## Alternatives Considered

- **Dramatiq:** genuinely attractive (saner defaults, simpler internals). Rejected primarily for ecosystem reasons: `django-celery-beat` for DB-managed periodic schedules, broad contributor familiarity, Flower/inspection tooling, and richer canvas primitives (chains/groups) which the OER export pipeline will use. For an open-source project, optimizing for the *largest contributor familiarity pool* is itself an architectural value.
- **PostgreSQL-backed queue:** tempting for "one less service," but Redis is needed for caching and rate limiting regardless, so the service-count argument evaporates; and a busy polling queue adds load to the database we most need to protect.
- **RabbitMQ:** better broker guarantees (true ack semantics, no result/broker conflation) but a third stateful service with non-trivial operations (Erlang VM, Mnesia, federation). Our idempotent-task discipline plus reconciliation jobs make Redis's weaker guarantees acceptable.

## Consequences

**Positive**
- One additional service (Redis) covers cache, broker, results, and rate limiting.
- Celery's maturity de-risks the trickiest flows (grading orchestration with retries and timeouts).
- Queue-per-latency-class prevents head-of-line blocking.

**Negative / Accepted risks**
- Redis-as-broker can drop tasks in crash scenarios; mitigated by write-then-enqueue ordering, idempotency, and reconciliation sweeps. We explicitly do **not** promise exactly-once execution anywhere; everything is at-least-once + idempotent.
- Celery's configuration surface is large and has historical footguns; mitigated by a single, heavily commented `celery.py` with our hygiene defaults applied globally.
- Beat is a single point of scheduling; we run exactly one beat process, because a second instance would enqueue every scheduled task twice, and we document that constraint.
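
To make the "at-least-once + idempotent" contract concrete, the sketch below shows a task body that tolerates redelivery by keying its effect on a stable ID. The `AttemptStore` class is an in-memory stand-in for a PostgreSQL table with a unique key on the submission ID, and the function name and the 10-point reputation award are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AttemptStore:
    """In-memory stand-in for PostgreSQL tables (hypothetical schema)."""

    attempts: dict[int, str] = field(default_factory=dict)    # submission_id -> verdict
    reputation: dict[int, int] = field(default_factory=dict)  # user_id -> points


def record_grading_result(
    store: AttemptStore, submission_id: int, user_id: int, verdict: str
) -> bool:
    """Apply a grading result once; return True only if this call changed state."""
    if submission_id in store.attempts:
        # Redelivered message: the effect is already recorded, so do nothing.
        return False
    store.attempts[submission_id] = verdict
    if verdict == "passed":
        # Side effects hang off the same guard, so they cannot be applied twice.
        store.reputation[user_id] = store.reputation.get(user_id, 0) + 10
    return True
```

In a real task the guard and the writes would sit in one database transaction (or be a single conditional insert), so that two workers handling the same redelivered message cannot both pass the check.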

**Follow-ups**
- Implement the reconciliation tasks (`heal_search_index`, `heal_pending_gradings`) in the backend milestone.
- Add Celery queue depth and task failure-rate metrics to the observability stack (ADR 0020).
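
A reconciliation sweep such as `heal_search_index` might reduce to a set difference followed by re-enqueueing, as in the sketch below. The signature is an assumption: the ID collections would come from PostgreSQL and the search index, and `enqueue_index` stands in for whatever call dispatches the real indexing task.

```python
from typing import Callable, Iterable


def heal_search_index(
    published_ids: Iterable[int],
    indexed_ids: Iterable[int],
    enqueue_index: Callable[[int], None],
) -> list[int]:
    """Re-enqueue indexing for every published document missing from search."""
    # Published in the source of truth but absent from the index.
    missing = sorted(set(published_ids) - set(indexed_ids))
    for doc_id in missing:
        # Safe to repeat because indexing tasks are required to be idempotent.
        enqueue_index(doc_id)
    return missing
```

A usage example with plain lists in place of the database and index:

```python
enqueued: list[int] = []
healed = heal_search_index([1, 2, 3, 4], [2, 4, 9], enqueued.append)
print(healed)  # [1, 3]
```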