# ADR 0020: Observability — Structured Logs, Prometheus Metrics, OpenTelemetry Traces, Sentry Errors

- **Status:** Accepted
- **Date:** 2025-01-15
- **Deciders:** Core architecture team
- **Related:** ADR 0012 (Celery), ADR 0013 (audit log — *not* observability), ADR 0019 (deployment), docs/architecture/01

## Context

When the platform misbehaves — grading stuck, search stale, review notifications missing — volunteer operators must be able to answer "what's wrong?" quickly, on instances ranging from a hobbyist VPS to the flagship deployment. We need the three classic signals (logs, metrics, traces) plus error aggregation, under constraints:

- **Self-hosters must get value at near-zero cost and configuration.** An observability stack that requires running five extra services is one nobody will run; logs alone must be genuinely useful.
- **No mandatory SaaS.** Sentry is suggested in the brief; it must be optional and substitutable with its open-source/self-hosted form (or GlitchTip, which speaks the Sentry protocol).
- **Privacy:** this is a learning platform; telemetry must not leak learner answer content, and any error-reporting payloads must be scrubbed.
- Clear separation from the **audit log** (ADR 0013): audit is a governance record with integrity guarantees; observability is operational telemetry, sampled and disposable. Conflating them corrupts both.

## Decision

A **tiered** observability architecture — each tier optional, each adding value independently:

### Tier 0 (always on): structured logs + health endpoints

1. **JSON structured logging everywhere:** Django and the Celery workers log through `structlog`-configured stdlib logging, and the Next.js server through `pino`. All output goes to stdout per 12-factor and is captured by Docker's journald or json-file logging driver. Every log line carries a timestamp, level, logger, `request_id`, `user_id` (an opaque ID only, never an email or handle), and `trace_id` when tracing is active. The `request_id` is generated at the proxy (Caddy) and propagated end-to-end, so a single grep reconstructs a request's story even with zero extra infrastructure.
2. **Scrubbing at the source:** logging helpers never accept attempt answers, free-text submissions, or auth material as loggable fields; a denylist processor redacts known-sensitive keys defensively.
3. **Health endpoints:** `/healthz` (process up), `/readyz` (DB, Redis, and storage reachable; migrations current), and a worker heartbeat (a periodic task scheduled by Celery beat and executed by a worker refreshes a Redis key, which is surfaced in `/readyz`'s detail payload). Caddy and `deploy.sh` (ADR 0019) gate on these.
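
A minimal sketch of how `/readyz` could aggregate its dependency checks, including the worker heartbeat. The names `run_readiness` and `heartbeat_check`, the 60-second staleness threshold, and the payload shape are illustrative assumptions rather than decisions of this ADR; the real checks would wrap the DB, Redis, storage, and migration probes.

```python
from typing import Callable, Dict, Mapping, Optional, Tuple

Check = Callable[[], None]  # a check passes by returning and fails by raising


def heartbeat_check(
    read_last_beat: Callable[[], Optional[float]],
    now: Callable[[], float],
    max_age_seconds: float = 60.0,  # assumed threshold, not specified by the ADR
) -> Check:
    """Build a check that fails when the worker heartbeat is missing or stale.

    `read_last_beat` stands in for reading the Redis key that the periodic
    Celery task refreshes; it returns a Unix timestamp or None.
    """
    def check() -> None:
        last = read_last_beat()
        if last is None:
            raise RuntimeError("no heartbeat recorded")
        if now() - last > max_age_seconds:
            raise RuntimeError("heartbeat stale")
    return check


def run_readiness(checks: Mapping[str, Check]) -> Tuple[int, Dict[str, object]]:
    """Run every check and return (HTTP status, detail payload)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Report only the exception class: messages can embed connection strings.
            results[name] = f"fail:{type(exc).__name__}"
    ready = all(state == "ok" for state in results.values())
    payload = {"status": "ready" if ready else "not_ready", "checks": results}
    return (200 if ready else 503), payload
```

Running every check instead of stopping at the first failure keeps the detail payload complete, so an operator sees all broken dependencies in one response.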

### Tier 1 (recommended): Prometheus metrics + Grafana

4. **Metrics exposition:** `django-prometheus` for HTTP/DB/cache metrics; a small custom collector for **domain metrics**, which are the ones that actually matter operationally:
   - `celery_queue_depth{queue=}` and task failure/retry counters (ADR 0012's dead-letter path increments these),
   - `grading_dispatch_latency_seconds`, `sandbox_unavailable_total`,
   - `search_index_lag_seconds` (age of the oldest published-but-unindexed item) and `audit_chain_last_verified_age_seconds` (ADR 0013),
   - `review_queue_size`, `signup_total`, `attempt_total{problem_type=}`.
   - Exporters for Postgres and Redis are included in the `observability` Compose profile.
5. **Shipped dashboards and alerts as code:** the repo carries Grafana dashboard JSON and a Prometheus alert rules file (API error rate, p95 latency, queue depth growth, index lag, low disk, audit-verify staleness, cert expiry). Self-hosters who enable the profile get curated dashboards, not a blank Grafana.
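
To illustrate what a domain metric computes, here is a dependency-free sketch of `search_index_lag_seconds`, assuming the lag is measured from the oldest item still waiting to be indexed. In the real service the value would be returned from a `prometheus_client` custom collector alongside `django-prometheus`; the hand-rolled `render_gauge` helper is hypothetical and exists only to show the text exposition format that Prometheus scrapes.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Mapping, Optional


def search_index_lag_seconds(
    unindexed_published_at: Iterable[datetime], now: datetime
) -> float:
    """Age in seconds of the oldest published-but-unindexed item (0.0 if none)."""
    oldest = min(unindexed_published_at, default=None)
    if oldest is None:
        return 0.0
    return max(0.0, (now - oldest).total_seconds())


def _escape_label(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(
    name: str,
    help_text: str,
    value: float,
    labels: Optional[Mapping[str, str]] = None,
) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_part = ""
    if labels:
        pairs = ",".join(
            f'{key}="{_escape_label(str(val))}"' for key, val in sorted(labels.items())
        )
        label_part = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_part} {value}\n"
    )


if __name__ == "__main__":
    now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
    pending = [now - timedelta(seconds=90), now - timedelta(seconds=10)]
    lag = search_index_lag_seconds(pending, now)
    print(render_gauge("search_index_lag_seconds", "Age of oldest unindexed item", lag))
```

Reporting the oldest pending item means one stuck document keeps the gauge climbing, which is the condition an index-lag alert needs to catch.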

### Tier 2 (optional): tracing + error aggregation

6. **OpenTelemetry tracing**, instrumented via the OTel Python SDK (Django, psycopg, redis, and celery auto-instrumentation) and OTel JS for the Next.js server, exporting OTLP to a configurable endpoint: Jaeger all-in-one in the dev/observability profile, anything OTLP-compatible in prod. Tracing is off unless an endpoint is configured. **Head sampling defaults to 5%**, and errored requests are always kept; because a head sampler decides before the outcome is known, the always-keep rule has to be applied after the request completes (for example by tail sampling in an OpenTelemetry Collector). Trace context propagates frontend-server → API → Celery (via task headers), so a "publish → index → searchable" flow is one trace.
7. **Error aggregation via the Sentry protocol:** `sentry-sdk` (Python) and `@sentry/nextjs`, pointed at self-hosted Sentry, hosted Sentry, or GlitchTip by DSN configuration; disabled when no DSN is set. `before_send` scrubbers strip request bodies, attempt payloads, and PII; `send_default_pii=False`. Release tagging from the CI-injected git SHA so errors map to deploys.
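
The OTel instrumentation handles context propagation automatically, so nothing like the following needs to be written by hand. As an illustration of what travels in the task headers, this sketch builds and parses a W3C Trace Context `traceparent` value; the helper names (`new_root`, `child_of`, `inject`, `extract`) are hypothetical and are not the OTel API.

```python
import re
import secrets
from typing import Mapping, MutableMapping, NamedTuple, Optional

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool


def new_root(sampled: bool = True) -> SpanContext:
    """Start a new trace, as the first service to see a request would."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """Same trace, new span: what the receiving side creates."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, headers: MutableMapping[str, str]) -> None:
    """Write the context into outgoing headers (HTTP or task headers)."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Mapping[str, str]) -> Optional[SpanContext]:
    """Read the context from incoming headers; None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid in the W3C format
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    api_span = new_root()
    task_headers: dict = {}
    inject(api_span, task_headers)             # API side, before enqueueing
    worker_span = child_of(extract(task_headers))  # worker side
    print(task_headers["traceparent"], worker_span.trace_id == api_span.trace_id)
```

Because the worker's span reuses the trace ID from the headers, the publish request and the indexing task it triggers appear under one trace.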

### Non-goals / boundaries

8. Three things are explicitly out of scope:
   - Product-analytics tooling (that's a separate, consent-governed decision).
   - A log-aggregation service (Loki etc.) at MVP: single-host journald plus `docker logs` suffices, and Loki is a documented add-on.
   - Using the audit log as a telemetry sink, or telemetry as an audit record.

## Alternatives Considered

- **All-in OTel (logs+metrics+traces through one collector):** conceptually elegant; practically, Prometheus pull metrics and plain JSON stdout logs are simpler to run, better understood by the contributor pool, and Grafana consumes both natively. We adopt OTel where it's the clear winner (traces) without forcing it everywhere. Revisit when OTel logging/metrics maturity and operator familiarity improve.
- **ELK/OpenSearch logging stack:** heavy (see ADR 0011's JVM argument); journald + grep covers single-host reality; Loki noted as the lightweight upgrade path.
- **Sentry-only observability:** common shortcut; gives errors but no queue-depth/index-lag/system view, and couples health visibility to one (optional) tool. Rejected as sole answer.
- **StatsD/Telegraf pipelines:** push-model legacy; Prometheus pull + exporters is the contributor-familiar standard.

## Consequences

**Positive**
- A zero-config instance still produces correlatable, scrubbed, structured logs with request IDs — genuinely debuggable.
- Domain metrics (queue depth, index lag, audit-verify age) surface the platform's *real* failure modes, not just CPU graphs; dashboards/alerts ship as code.
- Every component is swappable (OTLP endpoint, Sentry-protocol DSN, Prometheus federation) — no SaaS lock-in anywhere.
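
For the Sentry-protocol side, swappability comes down to configuration. The sketch below assumes environment variables named `SENTRY_DSN` and `GIT_SHA` and a hypothetical header denylist; `before_send`, `send_default_pii`, `release`, and `dsn` are real `sentry_sdk.init` options, while `sentry_options` and `scrub_event` are illustrative names.

```python
from typing import Any, Dict, Mapping, Optional

FILTERED = "[Filtered]"
SENSITIVE_HEADERS = {"authorization", "cookie", "x-csrftoken"}  # assumed list


def scrub_event(event: Dict[str, Any], hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """`before_send` hook: drop request bodies and reduce the user to an opaque ID."""
    request = event.get("request")
    if isinstance(request, dict):
        request.pop("data", None)     # request body; may hold attempt payloads
        request.pop("cookies", None)
        headers = request.get("headers")
        if isinstance(headers, dict):
            for name in list(headers):
                if name.lower() in SENSITIVE_HEADERS:
                    headers[name] = FILTERED
    user = event.get("user")
    if isinstance(user, dict):
        event["user"] = {"id": user["id"]} if "id" in user else {}
    return event


def sentry_options(env: Mapping[str, str]) -> Optional[Dict[str, Any]]:
    """Return keyword arguments for `sentry_sdk.init`, or None when no DSN is set."""
    dsn = env.get("SENTRY_DSN", "").strip()
    if not dsn:
        return None  # error aggregation stays disabled
    return {
        "dsn": dsn,  # self-hosted Sentry, hosted Sentry, or GlitchTip
        "before_send": scrub_event,
        "send_default_pii": False,
        "release": env.get("GIT_SHA") or None,
    }
```

At startup the application would call `sentry_options(os.environ)` and pass the result to `sentry_sdk.init(**options)` only when it is not `None`, so an instance without a DSN never initialises the SDK.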

**Negative / Accepted risks**
- Three optional tiers mean documentation must be very clear about what's on by default (Tier 0 only). Mitigated by a single `OBSERVABILITY.md` decision table.
- OTel SDK and auto-instrumentation releases move quickly; we constrain versions in manifests (`opentelemetry-sdk ^1.25`, with matching instrumentation packages) and accept periodic maintenance.
- Sampled traces will sometimes miss the interesting request; error-biased sampling mitigates the worst of it.
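
The sampling trade-off can be made concrete with a small sketch. A head decision can be derived deterministically from the trace ID (the idea behind ratio-based samplers), whereas an error bias can only be applied once the outcome is known. The function names are hypothetical, and this is not the OTel SDK's sampler API.

```python
LOW_64_BITS = (1 << 64) - 1


def head_sampled(trace_id: int, ratio: float = 0.05) -> bool:
    """Decide at trace start, using only the trace ID.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    bound = round(ratio * (1 << 64))
    return (trace_id & LOW_64_BITS) < bound


def keep_trace(trace_id: int, had_error: bool, ratio: float = 0.05) -> bool:
    """Decide after the request finished: errors are always kept."""
    return had_error or head_sampled(trace_id, ratio)


if __name__ == "__main__":
    unlucky_id = LOW_64_BITS  # highest possible value, never head-sampled at 5%
    print(head_sampled(unlucky_id))                 # dropped by the head decision
    print(keep_trace(unlucky_id, had_error=True))   # kept because it errored
```

A successful but slow request with an unlucky trace ID is still dropped, which is the residual risk accepted above.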

**Follow-ups**
- Infra milestone: `observability` Compose profile (Prometheus, Grafana + provisioned dashboards, Jaeger, exporters), alert rules file, `OBSERVABILITY.md`.
- Backend milestone: domain-metric collectors, health endpoints, structlog config, scrubbing processors.
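
As a starting point for the scrubbing processors, here is a minimal sketch of a denylist redactor written as a structlog-style processor, that is, a callable taking `(logger, method_name, event_dict)`. The key list and the `[REDACTED]` marker are assumptions; the ADR does not enumerate the actual sensitive keys.

```python
import json
from typing import Any, MutableMapping

REDACTED = "[REDACTED]"
# Assumed denylist; keys are matched case-insensitively.
SENSITIVE_KEYS = frozenset(
    {"answer", "answers", "submission", "password", "token",
     "authorization", "cookie", "email"}
)


def _scrub(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else _scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [_scrub(item) for item in value]
    return value


def redact_sensitive(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """Processor to place before the JSON renderer in the processor chain."""
    return _scrub(dict(event_dict))


if __name__ == "__main__":
    line = redact_sensitive(None, "info", {
        "event": "attempt_graded",
        "request_id": "r-1",
        "user_id": "u_9",
        "answer": "x = 3",
    })
    print(json.dumps(line, sort_keys=True))
```

With structlog this function would be listed in `structlog.configure(processors=[...])` ahead of `structlog.processors.JSONRenderer()`. It is the defensive second layer; the first remains logging helpers that never accept sensitive fields at all.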