# ADR-0006: Structured JSON Documents with Embedded MDX, Not Raw HTML - **Status:** Accepted - **Date:** 2024-06-02 - **Deciders:** FablePool core team - **Related:** ADR-0007 (versioning), ADR-0009 (widget sandboxing), `docs/architecture/03-content-format.md`, `docs/schemas/*.json` ## Context Content storage format is the most consequential design decision in the system: it determines what can be validated, diffed, rendered, exported, graded, and sandboxed. Requirements from the project brief and from `03-content-format.md`: - structured, machine-validatable (answer specs must be gradeable); - diff-able for the review workflow and version history; - renderable to accessible HTML server-side, and to plain text/low-bandwidth variants; - safe — contributor content must never become an XSS or arbitrary-JSX execution vector; - portable for OER export/import. Candidate formats: raw HTML, pure Markdown files, pure MDX files, a fully custom block-JSON format (Notion-style), or a hybrid. ## Decision Adopt the **hybrid format specified in `03-content-format.md` and `docs/schemas/`**: each problem/course version is a **JSON document** whose structure (metadata, answer spec, hints, solutions, widget references, prerequisites) is schema-validated, and whose prose fields contain **restricted MDX** (CommonMark + GFM tables + math + a fixed component whitelist). Key rules: 1. **JSON owns structure; MDX owns prose.** Anything the machine must understand (answer type, tolerance, choice correctness, hint ordering, widget bindings) is a JSON field — never parsed out of prose. 2. **Restricted MDX, compiled server-side.** The MDX pipeline accepts only whitelisted components (``, `
`, ``, ``, ``, `` …). Unknown JSX, `import`, `export`, and expression evaluation are **rejected at validation time**, not silently stripped. Contributor MDX is therefore data, not code. 3. **No raw HTML.** The Markdown pipeline runs with raw HTML disabled; sanitization is defense-in-depth, not the primary safety mechanism. 4. **`schema_version` on every document**, with explicitly written migrations between schema versions (stored in the repo, run as Celery backfills). Old versions are migrated lazily on read and persisted on next write. 5. **Canonical serialization** (sorted keys, normalized whitespace in MDX) so version diffs in review are semantic, not cosmetic. ## Alternatives Considered - **Raw HTML (editor output).** Unvalidatable, ungradable, unsafe, un-diffable in any meaningful way. Explicitly prohibited by the brief. Rejected. - **Pure MDX files (content-as-repo).** Attractive (git-native), but answer specs in frontmatter are weakly validated, structured queries ("all numeric problems tagged calculus") require parsing every file, and letting MDX be *actual code* contradicts the security model. Rejected. - **Fully custom block JSON (no Markdown).** Maximum machine control, but hostile to contributors who write Markdown/LaTeX fluently, and it forces us to build an editor for every prose construct. Rejected. - **Per-field Markdown (no MDX components).** Simpler, but loses typed figures, widget slots, and callouts; we'd reinvent them as pseudo-syntax. Rejected. ## Consequences - ✅ Every version is validatable, diffable, exportable, and gradeable; reviews can show structured diffs ("tolerance changed 0.01 → 0.05"). - ✅ Server-side MDX compilation with a closed component set eliminates the contributor-JSX attack surface (widgets get their own sandbox, ADR-0009). - ⚠️ The editor must round-trip restricted MDX faithfully; TipTap ↔ MDX serialization is real engineering and owns its own test suite. - ⚠️ Schema migrations are a permanent maintenance duty; the `schema_version` discipline and migration test harness are non-optional. - ⚠️ Authors used to full MDX will hit the whitelist wall; the validator returns precise, friendly errors listing allowed components.