# Authoring an Import Adapter

Adapters are the only way raw source material enters a PMP node. This document specifies the adapter contract, the provenance obligations, idempotency rules, and how to test a new adapter. Read it before adding a source type; the contract is what keeps every byte in the log auditable.

## 1. What an adapter is (and is not)

An adapter:

* **reads** one kind of source (a file format, a directory layout, an export dump),
* **parses** it into discrete *evidence items*: small, self-contained facts about what the source literally contained,
* **attaches provenance** describing exactly where each item came from,
* and **yields** those items to the node, which signs and appends them as `evidence` operations.

An adapter does **not**:

* interpret, summarize, or derive ("Avery likes hiking" is a *claim*, produced by the derivation engine in a later milestone, never by an adapter),
* write to the log directly (the node owns signing, sequencing, and dedup),
* mutate or normalize away source content it cannot reproduce; when in doubt, preserve the source value verbatim alongside any normalized form (this is what the ICS adapter does with floating-time events).

The mental model: **adapters are witnesses, not analysts.** They testify to what the source said, with receipts.

## 2. The contract

Adapters live in `src/pmp/adapters/` and subclass the base class in `adapters/base.py`. The contract has three parts:

### 2.1 Identity

Every adapter declares:

* `name`: a stable, lowercase identifier (e.g. `"calendar.ics"`, `"notes.text"`, `"photos.exif_mock"`). This appears in every operation's provenance and must never be reused for a different format.
* `version`: a semantic version string for the *adapter logic*. Bump it whenever parsing behavior changes in a way that could alter emitted evidence; provenance records it so old evidence can be re-interpreted correctly.
* the **source types it accepts** (file vs. directory, expected extensions/shape), validated up front with a clear error rather than a half-import.

### 2.2 Parsing

The core method takes a source path and yields evidence items. Each item carries:

* **`kind`**: what sort of fact this is (e.g. `calendar.event`, `note.document`, `photo.metadata`). Kinds are namespaced by source domain and documented in the adapter's module docstring.
* **`external_id`**: see §3; the dedup key.
* **`body`**: a JSON-serializable dict of the parsed content. Keys must be stable across adapter versions where possible; values must round-trip through the canonical JSON serializer (`src/pmp/canonical.py`): no floats where precision matters (use strings/integers), timestamps as RFC 3339 strings, bytes as lowercase hex or base64. Match the conventions already used by the shipped adapters.
* **`provenance`**: see §4.

Yield items in a **deterministic order** for a given source (the shipped adapters sort by source-internal identifiers). Determinism makes imports reproducible and diffs meaningful.

Parsing failures of the *whole source* should raise the adapter error type from `src/pmp/errors.py` with a message naming the path. Per-item irregularities (a malformed event among hundreds) should be handled the way the ICS adapter does: skip the item, count it, and surface the count in the import summary. Never silently drop a record, and never abort the whole import for one bad record.

### 2.3 Registration

Add the adapter to the registry in `src/pmp/adapters/__init__.py`. That is all the CLI needs: `pmp import ` will find it, and `pmp import --help` will list it.

## 3. Idempotency: the `external_id`

The node deduplicates on `(adapter name, external_id)`. The rule:

> **Identical source content must produce the same `external_id`; changed
> content must produce a different one.**

In practice that means deriving the ID from a hash of the item's canonical source content, *not* from file paths, import times, or iteration order.
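
The rule above can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: `canonical_bytes` and `external_id_for` are hypothetical helper names, and `json.dumps` with sorted keys stands in for the project's canonical serializer in `src/pmp/canonical.py`, whose actual API is not shown in this document.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Hypothetical stand-in for the project's canonical JSON serializer."""
    # Sorted keys and fixed separators: equal dicts serialize to equal bytes.
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def external_id_for(record: dict) -> str:
    """Derive the dedup key from record content only: no paths, clocks, or ordering."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


event = {"uid": "evt-1", "dtstart": "2024-05-01T09:00:00Z", "summary": "Hike"}
# Same content, different key order: must map to the same ID.
same_event = {"summary": "Hike", "uid": "evt-1", "dtstart": "2024-05-01T09:00:00Z"}
# Edited content: must map to a different ID.
edited_event = {**event, "dtstart": "2024-05-02T09:00:00Z"}

assert external_id_for(event) == external_id_for(same_event)
assert external_id_for(event) != external_id_for(edited_event)
```

Because the ID is a pure function of the record's content, re-importing the same file from a different path or at a different time yields the same keys, and the node can treat the second import as duplicates.
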
The shipped adapters do this as follows:

* **ICS**: hash over the event's UID plus its normalized property set, so an unchanged event re-imports as a duplicate, while an edited event (new DTSTART, say) becomes new evidence; both facts are true and both belong in an append-only log.
* **Notes**: hash over the note's relative path and content bytes; touching the mtime alone changes nothing.
* **Photos**: hash over each photo record's metadata fields.

If your source has genuinely stable native IDs **and** versions (e.g. a notes app export with `id` + `revision`), prefer `"{id}@{revision}"`, which is more legible in the log. When in doubt, hash content.

Never make `external_id` depend on anything outside the source bytes; otherwise re-importing a backup creates phantom duplicates and the audit trail rots.

## 4. Provenance obligations

Every evidence item must carry a provenance block with at least:

| Field | Meaning |
|---|---|
| `adapter` | the adapter `name` |
| `adapter_version` | the adapter `version` at import time |
| `source` | a human-meaningful locator: file path (relative where sensible), plus source-internal location (e.g. ICS UID, note filename + heading line, photo record index) |
| `source_content_hash` | SHA-256 of the exact source bytes the item was parsed from (the file, or the item's slice where the format makes that well-defined) |
| `imported_at` | RFC 3339 UTC timestamp of the import |

The point of `source_content_hash`: an auditor holding the original file can prove the evidence in the log corresponds to those exact bytes, and a later re-import can prove the source changed. Hash the **raw bytes**, before any decoding or normalization your adapter performs.

If your adapter normalizes (time zones, encodings, whitespace), record the verbatim source value in the body next to the normalized one whenever the normalization is lossy. The ICS adapter's handling of floating times is the reference example.

## 5. Worked skeleton

A minimal new adapter, following the shipped ones (consult `adapters/photos.py` for the simplest complete real example):

1. Create `src/pmp/adapters/myformat.py`. Subclass the base adapter; set `name = "myformat.v1"` and `version = "0.1.0"`.
2. Validate the path shape up front; raise the adapter error from `pmp.errors` if it is not what you accept.
3. Read the raw bytes, compute the source hash once.
4. Parse; for each record build the body dict (canonical-JSON-safe values only), compute `external_id` from a SHA-256 over the record's canonical content, attach the provenance block, and yield.
5. Sort records before yielding so output order is deterministic.
6. Register it in `adapters/__init__.py`.
7. Write tests (§6) and a small sample dataset under `samples/`.

Keep adapters dependency-free if at all possible. The three shipped adapters use only the standard library; if a format truly requires a third-party parser, it must be added to `pyproject.toml` with a flexible version constraint and justified in the PR. Every dependency is attack surface for software that handles a person's life.

## 6. Testing a new adapter

Follow the structure of `tests/test_adapter_photos.py` (smallest) and `tests/test_adapter_ics.py` (richest). A new adapter's tests must cover:

1. **Golden parse**: run against a checked-in sample under `samples/`, assert the exact set of kinds, external IDs, and key body fields.
2. **Determinism**: parse twice, assert identical item sequences.
3. **Idempotency at the node level**: import via the node twice into a temp node (use the fixtures in `tests/conftest.py`), assert the second import appends zero operations.
4. **Provenance completeness**: every emitted item has all §4 fields; `source_content_hash` matches an independently computed SHA-256 of the sample bytes.
5. **Malformed input**: a corrupt source raises the adapter error (whole-source failure) and a sample with one bad record imports the good records and reports the skip count (per-item failure).
6. **Canonical-JSON safety**: each body round-trips through `pmp.canonical` serialization unchanged.

Sample data must be **synthetic**. Never commit real personal data, even your own; samples are forever. Extend the Avery Reyes fictional dataset (`samples/README.md` describes its conventions) so the demo stays coherent.

## 7. Review checklist

Before opening a PR for a new adapter:

- [ ] `name` is unique, namespaced, and documented in the module docstring
- [ ] `version` set; bump policy understood
- [ ] No interpretation/derivation in emitted bodies (witness, not analyst)
- [ ] `external_id` derived from source content only
- [ ] Full provenance block on every item; raw-byte source hash
- [ ] Deterministic yield order
- [ ] Per-item failures skip-and-count; whole-source failures raise
- [ ] Registered in `adapters/__init__.py`; visible in `pmp import --help`
- [ ] Synthetic sample dataset committed; all six test categories present
- [ ] No new dependencies, or new dependency declared in `pyproject.toml` and justified
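
As a companion to the worked skeleton in §5, here is a rough, self-contained sketch of steps 2 through 5. Several things in it are assumptions made for illustration and are not taken from the PMP codebase: the class does not subclass the real base in `adapters/base.py`, the method name `parse`, the `skipped` counter, the `AdapterError` stand-in, the `.jsonl` input format, the `myformat.record` kind, and the use of `json.dumps` in place of `pmp.canonical`. Consult `adapters/photos.py` for the real interface.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator


class AdapterError(Exception):
    """Stand-in for the adapter error type in pmp.errors (name assumed)."""


class MyFormatAdapter:
    """Hypothetical adapter for a JSON-lines file; not the real base-class API."""

    name = "myformat.v1"
    version = "0.1.0"

    def __init__(self) -> None:
        self.skipped = 0  # per-item failures, to surface in an import summary

    def parse(self, path: Path) -> Iterator[dict]:
        # Step 2: validate the path shape up front.
        if not path.is_file() or path.suffix != ".jsonl":
            raise AdapterError(f"{path}: expected a .jsonl file")
        # Step 3: hash the raw bytes once, before any decoding.
        raw = path.read_bytes()
        source_hash = hashlib.sha256(raw).hexdigest()
        imported_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise AdapterError(f"{path}: not valid UTF-8") from exc

        self.skipped = 0
        items = []
        # Step 4: build body, external_id, and provenance for each record.
        for line_no, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                continue
            try:
                body = json.loads(line)
                if not isinstance(body, dict):
                    raise ValueError("record is not a JSON object")
            except ValueError:  # json.JSONDecodeError is a ValueError subclass
                self.skipped += 1  # skip and count; do not abort the import
                continue
            canonical = json.dumps(
                body, sort_keys=True, separators=(",", ":")
            ).encode("utf-8")
            items.append({
                "kind": "myformat.record",
                "external_id": hashlib.sha256(canonical).hexdigest(),
                "body": body,
                "provenance": {
                    "adapter": self.name,
                    "adapter_version": self.version,
                    "source": f"{path.name}#line={line_no}",
                    "source_content_hash": source_hash,
                    "imported_at": imported_at,
                },
            })
        # Step 5: deterministic order, independent of position in the file.
        items.sort(key=lambda item: item["external_id"])
        yield from items
```

Because `parse` is a generator, nothing runs (including path validation) until the caller starts iterating, for example with `list(MyFormatAdapter().parse(Path("sample.jsonl")))`. An adapter that should fail eagerly could do its validation in a regular method that returns the iterator.
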