# ADR-0006: Structured JSON Documents with Embedded MDX, Not Raw HTML
- **Status:** Accepted
- **Date:** 2024-06-02
- **Deciders:** FablePool core team
- **Related:** ADR-0007 (versioning), ADR-0009 (widget sandboxing),
`docs/architecture/03-content-format.md`, `docs/schemas/*.json`
## Context
Content storage format is the most consequential design decision in the
system: it determines what can be validated, diffed, rendered, exported,
graded, and sandboxed. Requirements from the project brief and from
`03-content-format.md`:
- structured, machine-validatable (answer specs must be gradeable);
- diff-able for the review workflow and version history;
- renderable to accessible HTML server-side, and to plain text/low-bandwidth
variants;
- safe — contributor content must never become an XSS or arbitrary-JSX
execution vector;
- portable for OER export/import.
Candidate formats: raw HTML, pure Markdown files, pure MDX files, a fully
custom block-JSON format (Notion-style), or a hybrid.
## Decision
Adopt the **hybrid format specified in `03-content-format.md` and
`docs/schemas/`**: each problem/course version is a **JSON document** whose
structure (metadata, answer spec, hints, solutions, widget references,
prerequisites) is schema-validated, and whose prose fields contain
**restricted MDX** (CommonMark + GFM tables + math + a fixed component
whitelist).
Key rules:
1. **JSON owns structure; MDX owns prose.** Anything the machine must
understand (answer type, tolerance, choice correctness, hint ordering,
widget bindings) is a JSON field — never parsed out of prose.
2. **Restricted MDX, compiled server-side.** The MDX pipeline accepts only
whitelisted components (``, ``, ``,
``, ``, `` …). Unknown JSX, `import`,
`export`, and expression evaluation are **rejected at validation time**,
not silently stripped. Contributor MDX is therefore data, not code.
3. **No raw HTML.** The Markdown pipeline runs with raw HTML disabled;
sanitization is defense-in-depth, not the primary safety mechanism.
4. **`schema_version` on every document**, with explicitly written
migrations between schema versions (stored in the repo, run as Celery
backfills). Old versions are migrated lazily on read and persisted on
next write.
5. **Canonical serialization** (sorted keys, normalized whitespace in MDX)
so version diffs in review are semantic, not cosmetic.
## Alternatives Considered
- **Raw HTML (editor output).** Unvalidatable, ungradable, unsafe,
un-diffable in any meaningful way. Explicitly prohibited by the brief.
Rejected.
- **Pure MDX files (content-as-repo).** Attractive (git-native), but answer
specs in frontmatter are weakly validated, structured queries ("all
numeric problems tagged calculus") require parsing every file, and
letting MDX be *actual code* contradicts the security model. Rejected.
- **Fully custom block JSON (no Markdown).** Maximum machine control, but
hostile to contributors who write Markdown/LaTeX fluently, and it forces
us to build an editor for every prose construct. Rejected.
- **Per-field Markdown (no MDX components).** Simpler, but loses typed
figures, widget slots, and callouts; we'd reinvent them as pseudo-syntax.
Rejected.
## Consequences
- ✅ Every version is validatable, diffable, exportable, and gradeable;
reviews can show structured diffs ("tolerance changed 0.01 → 0.05").
- ✅ Server-side MDX compilation with a closed component set eliminates the
contributor-JSX attack surface (widgets get their own sandbox, ADR-0009).
- ⚠️ The editor must round-trip restricted MDX faithfully; TipTap ↔ MDX
serialization is real engineering and owns its own test suite.
- ⚠️ Schema migrations are a permanent maintenance duty; the
`schema_version` discipline and migration test harness are non-optional.
- ⚠️ Authors used to full MDX will hit the whitelist wall; the validator
returns precise, friendly errors listing allowed components.