# 01 — System Overview

## 1. Architectural style

Polymath is a **modular monolith API + separate frontend**, not microservices. The rationale (see [ADR-0001](../adr/0001-modular-monolith.md)) is that an open-source project run by volunteers must be cheap to operate, easy for a single contributor to run locally, and simple to reason about.

Only two workloads are split out of the monolith, because they have hard isolation requirements rather than scaling requirements:

1. The **widget sandbox origin** — a static host on a *different registrable domain* serving untrusted widget bundles (see doc 07).
2. The **code runner** — an isolated service executing untrusted learner code in locked-down containers (see doc 07, §6).

## 2. Component diagram

```mermaid
flowchart TB
  subgraph Client["Browser / PWA"]
    FE["Next.js App<br/>(React, TypeScript, Tailwind, shadcn/ui)"]
    WIFR["Sandboxed widget iframes<br/>(cross-origin)"]
    FE -- postMessage protocol --> WIFR
  end

  subgraph Edge["Edge / CDN"]
    CDN["CDN: static assets,<br/>widget bundles, media"]
  end

  subgraph Core["Core platform (Docker network)"]
    API["Django 5.2 LTS + DRF<br/>REST API, auth, RBAC"]
    WORKER["Celery workers<br/>(indexing, notifications,<br/>exports, plagiarism scan)"]
    BEAT["Celery beat<br/>(streak rollover, SRS queue,<br/>digest emails)"]
    PG[("PostgreSQL 16<br/>system of record")]
    REDIS[("Redis 7<br/>cache + broker + rate limits")]
    MEILI[("Meilisearch<br/>search index")]
    S3[("MinIO / S3 / R2<br/>media + widget bundles + exports")]
  end

  subgraph Isolated["Isolated services"]
    SANDBOX["Widget sandbox host<br/>widgets.polymath-sandbox.example<br/>(static, separate origin)"]
    RUNNER["Code runner service<br/>(gVisor/nsjail containers,<br/>no network, cgroup limits)"]
  end

  FE -- HTTPS JSON --> API
  FE -- static --> CDN
  WIFR -- bundle fetch --> SANDBOX
  CDN --> S3
  SANDBOX --> S3
  API --> PG
  API --> REDIS
  API --> MEILI
  API -- presigned URLs --> S3
  API -- "internal HTTP + job queue" --> RUNNER
  WORKER --> PG
  WORKER --> REDIS
  WORKER --> MEILI
  WORKER --> S3
  BEAT --> REDIS

  OBS["Sentry · OpenTelemetry<br/>Prometheus · Grafana"]
  API -. traces/metrics .-> OBS
  WORKER -. traces/metrics .-> OBS
  FE -. errors .-> OBS
```

### Component responsibilities

| Component | Responsibilities | Explicit non-responsibilities |
|---|---|---|
| **Next.js frontend** | Rendering (SSR/ISR for public content, CSR for interactive solving), MDX rendering of *trusted* components, KaTeX math, client-side answer UX, offline PWA shell | Grading authority (server is authoritative), rendering untrusted widget code |
| **Django API** | Auth, RBAC, all business rules, content version graph, review workflow, grading of all answer types, audit log, notifications fan-out, public API | Long-running work (delegated to Celery), executing untrusted code |
| **Celery workers** | Search indexing, email/notification delivery, OER export/import, media processing, plagiarism similarity scan, reputation recomputation | Anything that must be synchronous/transactional with the request |
| **PostgreSQL** | System of record for *everything* including content documents (JSONB), audit log, search outbox | Full-text search ranking (Meilisearch), hot counters (Redis) |
| **Redis** | Cache, Celery broker, rate limiting, session blocklist, streak/SRS due-queues hot path | Durable data — Redis must always be rebuildable from Postgres |
| **Meilisearch** | Search & filtering over published content; faceting by topic/difficulty/tags/author/status (status facets only for staff indexes) | Source of truth — fully rebuildable from Postgres |
| **Widget sandbox host** | Serving immutable, content-addressed widget bundles from a separate registrable domain with a hardened CSP | Any access to user credentials, cookies, or the API |
| **Code runner** | Compiling/executing learner submissions for code challenges inside gVisor/nsjail with CPU/memory/time/pids limits and **no network** | Anything else; it cannot reach Postgres or Redis |

## 3. Key request flows

### 3.1 Learner solves a problem (numeric answer)

```mermaid
sequenceDiagram
  autonumber
  participant L as Learner (browser)
  participant FE as Next.js
  participant API as Django API
  participant PG as PostgreSQL
  participant R as Redis

  L->>FE: open /problems/slug
  FE->>API: GET /api/v1/problems/{slug} (published version)
  API->>PG: fetch ProblemVersion (published), strip answers/solutions
  Note over API: AnswerSpec correct values, hints,<br/>and solutions are NEVER sent pre-attempt
  API-->>FE: sanitized problem document
  FE-->>L: render MDX + KaTeX
  L->>FE: submit answer "3.14"
  FE->>API: POST /api/v1/problems/{slug}/attempts {answer}
  API->>R: rate-limit check (token bucket per user)
  API->>PG: load full AnswerSpec, grade server-side
  API->>PG: INSERT ProblemAttempt, UPDATE Progress, SRS schedule
  API->>R: bump streak counter, invalidate progress cache
  API-->>FE: {correct: true, explanation_unlocked: true}
  FE->>API: GET /api/v1/problems/{slug}/solutions
  API-->>FE: solutions (now permitted)
```

Grading is **always server-side**, for every answer type, including widget tasks (the widget reports a structured answer payload; the server validates it against the AnswerSpec). The client never receives the correct answer before a correct attempt or an explicit "give up" action (which is recorded and counts against streak/SRS scheduling).

### 3.2 Contribution & publication flow

```mermaid
sequenceDiagram
  autonumber
  participant C as Contributor
  participant API as Django API
  participant PG as PostgreSQL
  participant W as Celery
  participant M as Meilisearch

  C->>API: POST /api/v1/problems (creates Problem + ProblemVersion v1, state=draft)
  C->>API: PUT .../versions/head (autosaved drafts create new immutable versions)
  C->>API: POST .../versions/{n}/submit
  API->>PG: state draft→submitted, open Review, AuditLog entry
  Note over API: Reviewer assignment job queued
  participant Rev as Reviewer
  Rev->>API: POST /api/v1/reviews/{id}/comments (line-anchored)
  Rev->>API: POST /api/v1/reviews/{id}/decision {approve}
  API->>PG: state in_review→accepted (2 approvals required)
  Rev->>API: POST .../versions/{n}/publish (reviewer/moderator)
  API->>PG: accepted→published, demote prior published version
  API->>PG: write search outbox row (same transaction)
  W->>PG: poll outbox
  W->>M: upsert document in published index
```

### 3.3 Search outbox pattern

Search indexing uses a **transactional outbox**: the API writes an `outbox_search` row in the same Postgres transaction as the content change; a Celery worker drains the outbox and upserts into Meilisearch with retries. This guarantees the index never permanently diverges from Postgres without distributed transactions. A nightly full re-index job reconciles drift.

## 4. Frontend architecture

- **Rendering strategy:** Public, published content (problem statements, course pages, topic pages) is served via **ISR** (incremental static regeneration) keyed on the published version ID — a version is immutable, so its rendered page is cacheable forever; publishing a new version changes the key. Authenticated surfaces (dashboard, editor, review queue) are client-rendered against the API.
- **MDX pipeline:** Authors write MDX; the server stores both the MDX source and a compiled, sanitized AST (see doc 03 §7). The frontend renders the AST with a fixed registry of **trusted components only** — there is no arbitrary JSX evaluation in the main origin. Unknown components render as an inert fallback block.
- **Math:** KaTeX, rendered server-side where possible (HTML + CSS output) so low-end devices don't pay JS layout cost ([ADR-0008](../adr/0008-katex.md)).
- **Editor:** TipTap (ProseMirror) configured to round-trip the MDX subset in doc 03; raw MDX source mode is always available ([ADR-0007](../adr/0007-tiptap-mdx.md)).

## 5. Background jobs inventory (MVP)

| Job | Trigger | Queue |
|---|---|---|
| `search.sync_outbox` | every 5s (beat) + on-demand | `indexing` |
| `search.full_reindex` | nightly | `indexing` |
| `notify.fanout` | on event (review decision, reply, mention, report resolution) | `notify` |
| `notify.email_digest` | daily per user-timezone bucket | `notify` |
| `srs.build_due_queue` | hourly | `default` |
| `streaks.rollover` | daily per timezone bucket | `default` |
| `reputation.recompute_user` | on reputation event (debounced) | `default` |
| `moderation.plagiarism_scan` | on submit-for-review | `moderation` |
| `oer.export_course` | on demand | `exports` |
| `oer.import_bundle` | on demand (staff) | `exports` |
| `media.process_upload` | on upload complete (strip EXIF, generate AVIF/WebP renditions, SVG sanitize) | `media` |
| `audit.verify_chain` | daily (hash-chain verification, see doc 08) | `default` |

## 6. Deployment topology

### Development (single command)

`docker compose up` brings up: `web` (Next.js dev), `api` (Django + autoreload), `worker`, `beat`, `postgres`, `redis`, `meilisearch`, `minio`, `sandbox-host` (static nginx on a second localhost port simulating the foreign origin), and `runner` (code runner with the *unsafe-local* backend clearly labeled dev-only). Seed data fixtures include sample problems of every answer type.

### Production (MVP): single VPS, Docker Compose

```mermaid
flowchart LR
  U((Users)) --> CF["CDN / proxy<br/>(TLS, caching, DDoS)"]
  CF --> CADDY["Caddy reverse proxy"]
  subgraph VPS["VPS (8 vCPU / 16 GB)"]
    CADDY --> NEXT["next (2 replicas)"]
    CADDY --> GUNICORN["api: gunicorn+uvicorn workers"]
    GUNICORN --> PGV[(postgres + WAL-G to S3)]
    GUNICORN --> RDS[(redis)]
    GUNICORN --> MS[(meilisearch)]
    WK["worker + beat"] --> PGV
  end
  CF --> SBX["sandbox origin<br/>(separate domain, static)"]
  GUNICORN -. mTLS .-> RUN["code runner<br/>(separate small VPS)"]
```

The code runner runs on a **separate cheap VM** even in the MVP — it is the one component where co-tenancy with the database is unacceptable. Kubernetes is deliberately deferred ([ADR-0014](../adr/0014-deployment.md)); the compose files are written so the same images move to k8s manifests later without code changes.

## 7. Caching strategy

| Layer | What | Invalidation |
|---|---|---|
| CDN | published pages (ISR output), media renditions, widget bundles (immutable, content-addressed → `Cache-Control: immutable`) | version publish bumps cache key; bundles never invalidate |
| Redis | sanitized published documents, tag/topic trees, user permission snapshots (60s TTL), rate-limit buckets | explicit delete on publish/role-change + TTL backstop |
| Postgres | — | source of truth, no app-level caching of writes |

Rule: **every cache must be safely cold-startable.** Wiping Redis entirely may slow the site but must never corrupt state (streaks/SRS due-queues are derived projections of Postgres rows).

## 8. Observability

- **Sentry** on frontend, API, and workers (error tracking + release health).
- **OpenTelemetry** tracing in Django/Celery, exported OTLP; trace IDs returned in `X-Request-Id` so users can attach them to bug reports.
- **Prometheus** metrics: request latency histograms, grading latency per answer type, queue depths, outbox lag, runner sandbox kill counts, audit chain verification status.
- **Privacy:** no third-party analytics; an optional self-hosted Plausible can be enabled by instance operators. IPs are truncated in logs after 7 days.

## 9. Recommendations & spaced repetition (MVP scope)

Transparent, explainable heuristics only — no opaque models:

- **SRS:** FSRS-style scheduling simplified to SM-2 parameters for MVP (ease factor per user×problem, interval doubling on success, reset on failure), stored on `Progress` rows; an hourly job materializes per-user due queues into Redis sorted sets.
- **Recommendations:** rank candidate published problems by `(prerequisite-satisfaction × topic-affinity × difficulty-fit × freshness)` where difficulty-fit targets ~70–85% predicted success based on the user's rolling accuracy per topic. The formula is documented in the UI ("Why am I seeing this?") — explainability is a product principle.

## 10. Failure modes & degradation

| Failure | Behavior |
|---|---|
| Meilisearch down | Search UI degrades to Postgres `ILIKE`+tag filter fallback (clearly slower); browsing by topic unaffected |
| Redis down | Caches bypass to Postgres; rate limiting falls back to conservative in-process limits; Celery paused (outbox accumulates safely) |
| Code runner down | Code challenges show "grading temporarily unavailable", submissions queue with idempotency keys, graded on recovery |
| Sandbox origin down | Widgets render a static fallback (the problem's `widgetFallback` MDX, required by schema — see doc 03 §5.9) |
| Postgres down | Hard outage; status page; PWA shell still serves cached published content read-only |
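
The "Redis down" row above says rate limiting falls back to conservative in-process limits, and §3.1 names a per-user token bucket as the normal mechanism. As a rough illustration only, such a fallback might look like the sketch below. The class names `TokenBucket` and `InProcessLimiter`, the default capacity and refill rate, and the injectable `clock` parameter are all assumptions made for this example; none of them come from the Polymath codebase.

```python
import threading
import time
from typing import Callable, Dict


class TokenBucket:
    """A single bucket holding up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity          # start full
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class InProcessLimiter:
    """Hypothetical fallback limiter: one bucket per user, held in local memory.

    State is private to each server process, so it is lost on restart and is
    not shared between workers.
    """

    def __init__(self, capacity: float = 5, refill_per_sec: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}
        self._lock = threading.Lock()   # guards the dict and bucket updates

    def allow(self, user_id: str) -> bool:
        with self._lock:
            bucket = self._buckets.get(user_id)
            if bucket is None:
                bucket = TokenBucket(self.capacity, self.refill_per_sec, self.clock)
                self._buckets[user_id] = bucket
            return bucket.allow()
```

Because each worker process would keep its own buckets, the effective per-user limit scales with the number of workers, which is one reason the fallback limits need to be conservative. A real implementation would also need to evict idle buckets to bound memory; this sketch omits that.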