# 01 — System Overview
## 1. Architectural style
Polymath is a **modular monolith API + separate frontend**, not microservices.
The rationale (see [ADR-0001](../adr/0001-modular-monolith.md)) is that an
open-source project run by volunteers must be cheap to operate, easy for a
single contributor to run locally, and simple to reason about. Only two
workloads are split out of the monolith, because they have hard isolation
requirements rather than scaling requirements:
1. The **widget sandbox origin** — a static host on a *different registrable
domain* serving untrusted widget bundles (see doc 07).
2. The **code runner** — an isolated service executing untrusted learner code
in locked-down containers (see doc 07, §6).
## 2. Component diagram
```mermaid
flowchart TB
subgraph Client["Browser / PWA"]
FE["Next.js App
(React, TypeScript, Tailwind, shadcn/ui)"]
WIFR["Sandboxed widget iframes
(cross-origin)"]
FE -- postMessage protocol --> WIFR
end
subgraph Edge["Edge / CDN"]
CDN["CDN: static assets,
widget bundles, media"]
end
subgraph Core["Core platform (Docker network)"]
API["Django 5.2 LTS + DRF
REST API, auth, RBAC"]
WORKER["Celery workers
(indexing, notifications,
exports, plagiarism scan)"]
BEAT["Celery beat
(streak rollover, SRS queue,
digest emails)"]
PG[("PostgreSQL 16
system of record")]
REDIS[("Redis 7
cache + broker + rate limits")]
MEILI[("Meilisearch
search index")]
S3[("MinIO / S3 / R2
media + widget bundles + exports")]
end
subgraph Isolated["Isolated services"]
SANDBOX["Widget sandbox host
widgets.polymath-sandbox.example
(static, separate origin)"]
RUNNER["Code runner service
(gVisor/nsjail containers,
no network, cgroup limits)"]
end
FE -- HTTPS JSON --> API
FE -- static --> CDN
WIFR -- bundle fetch --> SANDBOX
CDN --> S3
SANDBOX --> S3
API --> PG
API --> REDIS
API --> MEILI
API -- presigned URLs --> S3
API -- "internal HTTP + job queue" --> RUNNER
WORKER --> PG
WORKER --> REDIS
WORKER --> MEILI
WORKER --> S3
BEAT --> REDIS
OBS["Sentry · OpenTelemetry
Prometheus · Grafana"]
API -. traces/metrics .-> OBS
WORKER -. traces/metrics .-> OBS
FE -. errors .-> OBS
```
### Component responsibilities
| Component | Responsibilities | Explicit non-responsibilities |
|---|---|---|
| **Next.js frontend** | Rendering (SSR/ISR for public content, CSR for interactive solving), MDX rendering of *trusted* components, KaTeX math, client-side answer UX, offline PWA shell | Grading authority (server is authoritative), rendering untrusted widget code |
| **Django API** | Auth, RBAC, all business rules, content version graph, review workflow, grading of all answer types, audit log, notifications fan-out, public API | Long-running work (delegated to Celery), executing untrusted code |
| **Celery workers** | Search indexing, email/notification delivery, OER export/import, media processing, plagiarism similarity scan, reputation recomputation | Anything that must be synchronous/transactional with the request |
| **PostgreSQL** | System of record for *everything* including content documents (JSONB), audit log, search outbox | Full-text search ranking (Meilisearch), hot counters (Redis) |
| **Redis** | Cache, Celery broker, rate limiting, session blocklist, streak/SRS due-queues hot path | Durable data — Redis must always be rebuildable from Postgres |
| **Meilisearch** | Search & filtering over published content; faceting by topic/difficulty/tags/author/status (status facets only for staff indexes) | Source of truth — fully rebuildable from Postgres |
| **Widget sandbox host** | Serving immutable, content-addressed widget bundles from a separate registrable domain with a hardened CSP | Any access to user credentials, cookies, or the API |
| **Code runner** | Compiling/executing learner submissions for code challenges inside gVisor/nsjail with CPU/memory/time/pids limits and **no network** | Anything else; it cannot reach Postgres or Redis |
## 3. Key request flows
### 3.1 Learner solves a problem (numeric answer)
```mermaid
sequenceDiagram
autonumber
participant L as Learner (browser)
participant FE as Next.js
participant API as Django API
participant PG as PostgreSQL
participant R as Redis
L->>FE: open /problems/{slug}
FE->>API: GET /api/v1/problems/{slug} (published version)
API->>PG: fetch ProblemVersion (published), strip answers/solutions
Note over API: AnswerSpec correct values, hints,<br/>and solutions are NEVER sent pre-attempt
API-->>FE: sanitized problem document
FE-->>L: render MDX + KaTeX
L->>FE: submit answer "3.14"
FE->>API: POST /api/v1/problems/{slug}/attempts {answer}
API->>R: rate-limit check (token bucket per user)
API->>PG: load full AnswerSpec, grade server-side
API->>PG: INSERT ProblemAttempt, UPDATE Progress, SRS schedule
API->>R: bump streak counter, invalidate progress cache
API-->>FE: {correct: true, explanation_unlocked: true}
FE->>API: GET /api/v1/problems/{slug}/solutions
API-->>FE: solutions (now permitted)
```
Grading is **always server-side**, for every answer type, including widget
tasks (the widget reports a structured answer payload; the server validates it
against the AnswerSpec). The client never receives the correct answer before a
correct attempt or an explicit "give up" action (which is recorded and counts
against streak/SRS scheduling).
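
As a rough illustration of the two server-side steps above (stripping answer data before the first attempt, then grading the submission), here is a minimal Python sketch. The field names in `SECRET_FIELDS`, the `NumericAnswerSpec` shape, and the absolute-tolerance rule are assumptions made for this example, not the project's actual schema.

```python
from dataclasses import dataclass

# Hypothetical field names; the real document schema is defined elsewhere.
SECRET_FIELDS = {"answer_spec", "hints", "solutions"}


@dataclass(frozen=True)
class NumericAnswerSpec:
    value: float
    tolerance: float  # absolute tolerance (an assumption for this sketch)


def sanitize_problem(document: dict) -> dict:
    """Return a copy of the problem document without answer-revealing fields."""
    return {k: v for k, v in document.items() if k not in SECRET_FIELDS}


def grade_numeric(spec: NumericAnswerSpec, raw_answer: str) -> bool:
    """Grade a submitted string on the server; malformed input counts as wrong."""
    try:
        submitted = float(raw_answer)
    except ValueError:
        return False
    # NaN and infinity fail this comparison, so they are graded as wrong too.
    return abs(submitted - spec.value) <= spec.tolerance
```

The client only ever sees the output of `sanitize_problem`; the full spec is loaded inside the attempt endpoint and never serialized before a correct attempt or a "give up".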
### 3.2 Contribution & publication flow
```mermaid
sequenceDiagram
autonumber
participant C as Contributor
participant API as Django API
participant PG as PostgreSQL
participant W as Celery
participant M as Meilisearch
C->>API: POST /api/v1/problems (creates Problem + ProblemVersion v1, state=draft)
C->>API: PUT .../versions/head (autosaved drafts create new immutable versions)
C->>API: POST .../versions/{n}/submit
API->>PG: state draft→submitted, open Review, AuditLog entry
Note over API: Reviewer assignment job queued
participant Rev as Reviewer
Rev->>API: POST /api/v1/reviews/{id}/comments (line-anchored)
Rev->>API: POST /api/v1/reviews/{id}/decision {approve}
API->>PG: state in_review→accepted (2 approvals required)
Rev->>API: POST .../versions/{n}/publish (reviewer/moderator)
API->>PG: accepted→published, demote prior published version
API->>PG: write search outbox row (same transaction)
W->>PG: poll outbox
W->>M: upsert document in published index
```
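
The state changes in this flow can be read as a small state machine. The sketch below is one possible encoding, not the project's implementation: the `submitted` to `in_review` step is an assumption (the diagram only implies it through the reviewer assignment job), and `transition`, `ALLOWED`, and `TransitionError` are hypothetical names.

```python
REQUIRED_APPROVALS = 2  # from the diagram: "2 approvals required"

# Allowed forward transitions for a ProblemVersion.
ALLOWED = {
    "draft": {"submitted"},
    "submitted": {"in_review"},  # assumed step, implied by reviewer assignment
    "in_review": {"accepted"},
    "accepted": {"published"},
}


class TransitionError(Exception):
    """Raised when a requested state change is not permitted."""


def transition(state: str, target: str, approvals: int = 0) -> str:
    """Return the new state, or raise if the change is not allowed."""
    if target not in ALLOWED.get(state, set()):
        raise TransitionError(f"{state} -> {target} is not allowed")
    if target == "accepted" and approvals < REQUIRED_APPROVALS:
        raise TransitionError("not enough approvals")
    return target
```

Keeping the table in one place makes it straightforward to write the matching AuditLog entry at the single point where a transition succeeds.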
### 3.3 Search outbox pattern
Search indexing uses a **transactional outbox**: the API writes an
`outbox_search` row in the same Postgres transaction as the content change; a
Celery worker drains the outbox and upserts into Meilisearch with retries.
Because the outbox row commits atomically with the content change, the index
can lag behind Postgres but never permanently diverge from it, and no
distributed transaction is needed. A nightly full re-index job reconciles drift.
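
A minimal sketch of the pattern, using SQLite in place of Postgres so it runs standalone. The table layout, `publish_version`, `drain_outbox`, and the `upsert` callback (standing in for the Meilisearch client) are assumptions for illustration only.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one accepted version (demo schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE problem_version (id INTEGER PRIMARY KEY, state TEXT)")
    conn.execute(
        "CREATE TABLE outbox_search (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
    )
    conn.execute("INSERT INTO problem_version (id, state) VALUES (1, 'accepted')")
    conn.commit()
    return conn


def publish_version(conn: sqlite3.Connection, version_id: int, title: str) -> None:
    """Write the content change and its outbox row in one transaction."""
    with conn:  # commits on success, rolls back both writes on error
        conn.execute(
            "UPDATE problem_version SET state = 'published' WHERE id = ?", (version_id,)
        )
        conn.execute(
            "INSERT INTO outbox_search (payload) VALUES (?)",
            (json.dumps({"id": version_id, "title": title}),),
        )


def drain_outbox(conn: sqlite3.Connection, upsert) -> int:
    """Deliver pending rows in order; a row is deleted only after a successful upsert."""
    rows = conn.execute("SELECT id, payload FROM outbox_search ORDER BY id").fetchall()
    delivered = 0
    for row_id, payload in rows:
        try:
            upsert(json.loads(payload))
        except Exception:
            break  # keep this row and later ones for the next run (retry)
        with conn:
            conn.execute("DELETE FROM outbox_search WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

If the process crashes between the upsert and the delete, the row is delivered again on the next run, so the upsert must be idempotent (keyed on the document ID).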
## 4. Frontend architecture
- **Rendering strategy:** Public, published content (problem statements,
course pages, topic pages) is served via **ISR** (incremental static
regeneration) keyed on the published version ID — a version is immutable, so
its rendered page is cacheable forever; publishing a new version changes the
key. Authenticated surfaces (dashboard, editor, review queue) are
client-rendered against the API.
- **MDX pipeline:** Authors write MDX; the server stores both the MDX source
and a compiled, sanitized AST (see doc 03 §7). The frontend renders the AST
with a fixed registry of **trusted components only** — there is no arbitrary
JSX evaluation in the main origin. Unknown components render as an inert
fallback block.
- **Math:** KaTeX, rendered server-side where possible (HTML + CSS output) so
low-end devices don't pay JS layout cost ([ADR-0008](../adr/0008-katex.md)).
- **Editor:** TipTap (ProseMirror) configured to round-trip the MDX subset in
doc 03; raw MDX source mode is always available ([ADR-0007](../adr/0007-tiptap-mdx.md)).
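
To make the "trusted components only" rule concrete, here is a language-neutral sketch written in Python (the real renderer lives in the TypeScript frontend). The AST node shape, the `TRUSTED` registry contents, and the choice to drop children of unknown components are all assumptions for this example.

```python
from html import escape

# Hypothetical registry: component name -> function(props, inner_html) -> html.
TRUSTED = {
    "Callout": lambda props, inner: f'<aside class="callout">{inner}</aside>',
    "Hint": lambda props, inner: f"<details><summary>Hint</summary>{inner}</details>",
}


def render(node) -> str:
    """Render a sanitized AST node; text is escaped, unknown components are inert."""
    if isinstance(node, str):
        return escape(node)
    renderer = TRUSTED.get(node["component"])
    if renderer is None:
        # Inert fallback: show only the escaped name, never evaluate anything.
        return f'<div class="unknown-component">{escape(node["component"])}</div>'
    inner = "".join(render(child) for child in node.get("children", []))
    return renderer(node.get("props", {}), inner)
```

The point of the lookup table is that adding a component is an explicit code change in the main origin; nothing in stored content can widen the set.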
## 5. Background jobs inventory (MVP)
| Job | Trigger | Queue |
|---|---|---|
| `search.sync_outbox` | every 5s (beat) + on-demand | `indexing` |
| `search.full_reindex` | nightly | `indexing` |
| `notify.fanout` | on event (review decision, reply, mention, report resolution) | `notify` |
| `notify.email_digest` | daily per user-timezone bucket | `notify` |
| `srs.build_due_queue` | hourly | `default` |
| `streaks.rollover` | daily per timezone bucket | `default` |
| `reputation.recompute_user` | on reputation event (debounced) | `default` |
| `moderation.plagiarism_scan` | on submit-for-review | `moderation` |
| `oer.export_course` | on demand | `exports` |
| `oer.import_bundle` | on demand (staff) | `exports` |
| `media.process_upload` | on upload complete (strip EXIF, generate AVIF/WebP renditions, SVG sanitize) | `media` |
| `audit.verify_chain` | daily (hash-chain verification, see doc 08) | `default` |
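
For the fixed-interval rows in the table, a Celery beat schedule could look roughly like the dictionary below, which would typically be assigned to Celery's `beat_schedule` setting. Task names are assumed to match the job names, and the daily intervals are approximations: a job pinned to a specific hour would more likely use a crontab schedule.

```python
from datetime import timedelta

# Sketch only: keys and task names mirror the inventory table above.
BEAT_SCHEDULE = {
    "search.sync_outbox": {
        "task": "search.sync_outbox",
        "schedule": 5.0,  # seconds
        "options": {"queue": "indexing"},
    },
    "search.full_reindex": {
        "task": "search.full_reindex",
        "schedule": timedelta(days=1),
        "options": {"queue": "indexing"},
    },
    "srs.build_due_queue": {
        "task": "srs.build_due_queue",
        "schedule": timedelta(hours=1),
        "options": {"queue": "default"},
    },
    "audit.verify_chain": {
        "task": "audit.verify_chain",
        "schedule": timedelta(days=1),
        "options": {"queue": "default"},
    },
}
```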
## 6. Deployment topology
### Development (single command)
`docker compose up` brings up: `web` (Next.js dev), `api` (Django + autoreload),
`worker`, `beat`, `postgres`, `redis`, `meilisearch`, `minio`, `sandbox-host`
(static nginx on a second localhost port simulating the foreign origin), and
`runner` (code runner with the *unsafe-local* backend clearly labeled
dev-only). Seed data fixtures include sample problems of every answer type.
### Production (MVP): single VPS, Docker Compose
```mermaid
flowchart LR
U((Users)) --> CF["CDN / proxy
(TLS, caching, DDoS)"]
CF --> CADDY["Caddy reverse proxy"]
subgraph VPS["VPS (8 vCPU / 16 GB)"]
CADDY --> NEXT["next (2 replicas)"]
CADDY --> GUNICORN["api: gunicorn+uvicorn workers"]
GUNICORN --> PGV[(postgres + WAL-G to S3)]
GUNICORN --> RDS[(redis)]
GUNICORN --> MS[(meilisearch)]
WK["worker + beat"] --> PGV
end
CF --> SBX["sandbox origin
(separate domain, static)"]
GUNICORN -. mTLS .-> RUN["code runner
(separate small VPS)"]
```
The code runner runs on a **separate cheap VM** even in the MVP — it is the
one component where co-tenancy with the database is unacceptable. Kubernetes
is deliberately deferred ([ADR-0014](../adr/0014-deployment.md)); the compose
files are written so the same images move to k8s manifests later without code
changes.
## 7. Caching strategy
| Layer | What | Invalidation |
|---|---|---|
| CDN | published pages (ISR output), media renditions, widget bundles (immutable, content-addressed → `Cache-Control: immutable`) | version publish bumps cache key; bundles never invalidate |
| Redis | sanitized published documents, tag/topic trees, user permission snapshots (60s TTL), rate-limit buckets | explicit delete on publish/role-change + TTL backstop |
| Postgres | — | source of truth, no app-level caching of writes |
Rule: **every cache must be safely cold-startable.** Wiping Redis entirely may
slow the site but must never corrupt state (streaks/SRS due-queues are derived
projections of Postgres rows).
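
The cold-start rule amounts to a cache-aside read path where a miss is always recoverable from the source of truth. The sketch below uses a tiny in-memory stand-in for Redis; `TTLCache`, `cached`, and the loader callback are hypothetical names, and the 60-second default echoes the permission-snapshot TTL in the table.

```python
import time

_MISS = object()  # sentinel so that None can be cached as a real value


class TTLCache:
    """In-memory stand-in for a Redis-like cache with expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            return _MISS
        return entry[0]

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._data.pop(key, None)  # explicit invalidation on publish/role-change

    def flush(self):
        self._data.clear()  # simulates wiping the cache entirely


def cached(cache, key, load_from_source, ttl=60.0):
    """Return the cached value, rebuilding it from the source of truth on a miss."""
    value = cache.get(key)
    if value is _MISS:
        value = load_from_source()
        cache.set(key, value, ttl)
    return value
```

After `flush()`, the next read is slower because it hits the loader, but it returns the same value, which is the property the rule asks for.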
## 8. Observability
- **Sentry** on frontend, API, and workers (error tracking + release health).
- **OpenTelemetry** tracing in Django/Celery, exported via OTLP; trace IDs are
returned in `X-Request-Id` so users can attach them to bug reports.
- **Prometheus** metrics: request latency histograms, grading latency per
answer type, queue depths, outbox lag, runner sandbox kill counts, audit
chain verification status.
- **Privacy:** no third-party analytics; an optional self-hosted Plausible can
be enabled by instance operators. IPs are truncated in logs after 7 days.
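
One way the IP truncation could work is sketched below with Python's standard `ipaddress` module. The prefix lengths (/24 for IPv4, /48 for IPv6) and the `truncate_ip` and `scrub` helpers are assumptions for this example; only the 7-day threshold comes from the text above.

```python
import ipaddress
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # full IPs are kept for at most this long


def truncate_ip(raw: str) -> str:
    """Zero the host part of an address (assumed prefixes: /24 IPv4, /48 IPv6)."""
    addr = ipaddress.ip_address(raw)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


def scrub(raw_ip: str, logged_at: datetime, now: datetime) -> str:
    """Return the IP as it should be stored for a log entry of the given age."""
    return truncate_ip(raw_ip) if now - logged_at > RETENTION else raw_ip
```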
## 9. Recommendations & spaced repetition (MVP scope)
Transparent, explainable heuristics only — no opaque models:
- **SRS:** FSRS-style scheduling simplified to SM-2 parameters for MVP
(ease factor per user×problem, interval doubling on success, reset on
failure), stored on `Progress` rows; an hourly job materializes per-user due
queues into Redis sorted sets.
- **Recommendations:** rank candidate published problems by
`(prerequisite-satisfaction × topic-affinity × difficulty-fit × freshness)`
where difficulty-fit targets ~70–85% predicted success based on the user's
rolling accuracy per topic. The formula is documented in the UI ("Why am I
seeing this?") — explainability is a product principle.
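
Both heuristics are small enough to sketch directly. In the Python below, the starting ease of 2.0 (which reproduces "interval doubling"), the one-day reset interval, and the linear falloff in `difficulty_fit` outside the 70–85% band are assumptions chosen for illustration; the text above only fixes the overall shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Progress:
    interval_days: int = 1
    ease: float = 2.0  # assumed default; 2.0 means the interval doubles on success


def review(progress: Progress, correct: bool) -> Progress:
    """Grow the interval by the ease factor on success, reset it on failure."""
    if not correct:
        return Progress(interval_days=1, ease=progress.ease)
    grown = max(1, round(progress.interval_days * progress.ease))
    return Progress(interval_days=grown, ease=progress.ease)


def difficulty_fit(predicted_success: float, low: float = 0.70, high: float = 0.85) -> float:
    """1.0 inside the target band, decreasing linearly outside it (assumed shape)."""
    if low <= predicted_success <= high:
        return 1.0
    if predicted_success < low:
        gap = low - predicted_success
    else:
        gap = predicted_success - high
    return max(0.0, 1.0 - gap / low)


def score(prerequisite: float, affinity: float, fit: float, freshness: float) -> float:
    """Multiplicative ranking score; each factor is expected to lie in [0, 1]."""
    return prerequisite * affinity * fit * freshness
```

Because the score is a plain product, each factor can be shown to the learner as-is in the "Why am I seeing this?" panel, and a zero in any factor (for example an unmet prerequisite) removes the candidate.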
## 10. Failure modes & degradation
| Failure | Behavior |
|---|---|
| Meilisearch down | Search UI degrades to Postgres `ILIKE`+tag filter fallback (clearly slower); browsing by topic unaffected |
| Redis down | Cache reads fall through to Postgres; rate limiting falls back to conservative in-process limits; Celery is paused (outbox accumulates safely) |
| Code runner down | Code challenges show "grading temporarily unavailable", submissions queue with idempotency keys, graded on recovery |
| Sandbox origin down | Widgets render a static fallback (the problem's `widgetFallback` MDX, required by schema — see doc 03 §5.9) |
| Postgres down | Hard outage; status page; PWA shell still serves cached published content read-only |
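
The "conservative in-process limits" used when Redis is down could be as simple as a per-process token bucket. The class below is a sketch under that assumption; `TokenBucket` and its parameters are hypothetical, and since each API worker process would hold its own bucket, the configured capacity should be lower than the shared Redis limit.

```python
import time


class TokenBucket:
    """In-process token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return whether the request may proceed."""
        now = self._clock()
        elapsed = now - self._updated
        self._updated = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A fallback limiter would keep one bucket per user ID in a bounded dictionary; that bookkeeping is omitted here.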