# Authoring an Import Adapter

Adapters are the only way raw source material enters a PMP node. This document
specifies the adapter contract, the provenance obligations, idempotency rules,
and how to test a new adapter. Read it before adding a source type; the
contract is what keeps every byte in the log auditable.

## 1. What an adapter is — and is not

An adapter:

* **reads** one kind of source (a file format, a directory layout, an export
  dump),
* **parses** it into discrete *evidence items* — small, self-contained facts
  about what the source literally contained,
* **attaches provenance** describing exactly where each item came from,
* and **yields** those items to the node, which signs and appends them as
  `evidence` operations.

An adapter does **not**:

* interpret, summarize, or derive ("Avery likes hiking" is a *claim*, produced
  by the derivation engine in a later milestone — never by an adapter),
* write to the log directly (the node owns signing, sequencing, and dedup),
* mutate or normalize away source content it cannot reproduce — when in doubt,
  preserve the source value verbatim alongside any normalized form (this is
  what the ICS adapter does with floating-time events).

The mental model: **adapters are witnesses, not analysts.** They testify to
what the source said, with receipts.

## 2. The contract

Adapters live in `src/pmp/adapters/` and subclass the base class in
`adapters/base.py`. The contract has three parts:

### 2.1 Identity

Every adapter declares:

* `name` — a stable, lowercase identifier (e.g. `"calendar.ics"`,
  `"notes.text"`, `"photos.exif_mock"`). This appears in every operation's
  provenance and must never be reused for a different format.
* `version` — a semantic version string for the *adapter logic*. Bump it
  whenever parsing behavior changes in a way that could alter emitted evidence;
  provenance records it so old evidence can be re-interpreted correctly.
* the **source types it accepts** (file vs. directory, expected
  extensions/shape), validated up front with a clear error rather than a
  half-import.

### 2.2 Parsing

The core method takes a source path and yields evidence items. Each item
carries:

* **`kind`** — what sort of fact this is (e.g. `calendar.event`, `note.document`,
  `photo.metadata`). Kinds are namespaced by source domain and documented in
  the adapter's module docstring.
* **`external_id`** — see §3; the dedup key.
* **`body`** — a JSON-serializable dict of the parsed content. Keys must be
  stable across adapter versions where possible; values must round-trip through
  the canonical JSON serializer (`src/pmp/canonical.py`): no floats where
  precision matters (use strings/integers), timestamps as RFC 3339 strings,
  bytes as lowercase hex or base64 — match the conventions already used by the
  shipped adapters.
* **`provenance`** — see §4.
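
The body conventions above can be sketched as follows. This is a minimal illustration, not the project's serializer: `to_canonical_json` is a hypothetical stand-in for whatever `src/pmp/canonical.py` exposes, and the field names in `build_body` are made up for the example.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def to_canonical_json(value: dict) -> str:
    """Hypothetical stand-in for pmp.canonical: sorted keys, compact separators."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def build_body(title: str, start: datetime, price: Decimal, thumbnail: bytes) -> dict:
    """Build a body dict using only canonical-JSON-safe values."""
    return {
        "title": title,
        # Timestamps as RFC 3339 strings, normalized to UTC.
        "start": start.astimezone(timezone.utc).isoformat().replace("+00:00", "Z"),
        # Precision-sensitive numbers as strings, never floats.
        "price": str(price),
        # Bytes as lowercase hex.
        "thumbnail_hex": thumbnail.hex(),
    }


body = build_body(
    "Trail cleanup",
    datetime(2024, 5, 4, 9, 30, tzinfo=timezone.utc),
    Decimal("12.50"),
    b"\x00\xff",
)
# Round-trip check: serialize, parse, serialize again, and compare.
encoded = to_canonical_json(body)
assert to_canonical_json(json.loads(encoded)) == encoded
```

The round-trip assertion at the end is the same property the canonical-JSON test in §6 asks for.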

Yield items in a **deterministic order** for a given source (the shipped
adapters sort by source-internal identifiers). Determinism makes imports
reproducible and diffs meaningful.

Parsing failures of the *whole source* should raise the adapter error type from
`src/pmp/errors.py` with a message naming the path. Per-item irregularities
(a malformed event among hundreds) should be handled the way the ICS adapter
does: skip the item, count it, and surface the count in the import summary —
never silently drop, never abort the whole import for one bad record.
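
A rough sketch of the two failure modes, assuming a toy `key=value` line format. `AdapterError` and `ParseStats` are hypothetical stand-ins; the real error type lives in `src/pmp/errors.py`, and the real import summary may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Iterator


class AdapterError(Exception):
    """Hypothetical stand-in for the adapter error type in src/pmp/errors.py."""


@dataclass
class ParseStats:
    """Hypothetical counter that an import summary could report."""
    skipped: int = 0
    reasons: list[str] = field(default_factory=list)


def parse_records(path: str, text: str, stats: ParseStats) -> Iterator[dict]:
    """Yield one dict per 'key=value' line; skip and count malformed lines."""
    if not text.strip():
        # Whole-source failure: nothing usable, so raise and name the path.
        raise AdapterError(f"{path}: source is empty")
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are not records
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            # Per-item failure: count it, record why, keep going.
            stats.skipped += 1
            stats.reasons.append(f"{path}:{line_no}: missing key or '='")
            continue
        yield {"key": key.strip(), "value": value.strip()}


stats = ParseStats()
items = list(parse_records("sample.txt", "a=1\nbroken line\nb=2\n", stats))
print(len(items), stats.skipped)  # prints "2 1"
```

Keeping the reasons alongside the count makes the skip visible in the summary without aborting the import.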

### 2.3 Registration

Add the adapter to the registry in `src/pmp/adapters/__init__.py`. That is all
the CLI needs: `pmp import <registry-key> <path>` will find it, and `pmp import
--help` will list it.
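
For orientation only, a registry of this kind is often a plain mapping from key to adapter class. The sketch below assumes that shape; `REGISTRY`, `resolve_adapter`, and `NotesTextAdapter` are hypothetical names, so check `src/pmp/adapters/__init__.py` for the actual structure before copying anything.

```python
class NotesTextAdapter:
    """Hypothetical adapter class used only for this illustration."""
    name = "notes.text"
    version = "0.1.0"


# Assumed registry shape: registry key -> adapter class.
REGISTRY: dict[str, type] = {NotesTextAdapter.name: NotesTextAdapter}


def resolve_adapter(key: str) -> type:
    """Look up an adapter by key; on a miss, list the known keys in the error."""
    try:
        return REGISTRY[key]
    except KeyError:
        known = ", ".join(sorted(REGISTRY))
        raise LookupError(f"unknown adapter {key!r}; known adapters: {known}") from None
```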

## 3. Idempotency: the `external_id`

The node deduplicates on `(adapter name, external_id)`. The rule:

> **Identical source content must produce the same `external_id`; changed
> content must produce a different one.**

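In practice that means deriving the ID from a hash of the item's canonical
source content, and *not* from absolute file paths, import times, or iteration
order. A path relative to the source root is part of the source itself, which
is why the notes adapter can include it. The shipped adapters do this as
follows: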

* **ICS**: hash over the event's UID plus its normalized property set, so an
  unchanged event re-imports as a duplicate, while an edited event (new
  DTSTART, say) becomes new evidence — both facts are true and both belong in
  an append-only log.
* **Notes**: hash over the note's relative path and content bytes; touching the
  mtime alone changes nothing.
* **Photos**: hash over each photo record's metadata fields.

If your source has genuinely stable native IDs **and** versions (e.g. a notes
app export with `id` + `revision`), prefer `"{id}@{revision}"` — it is more
legible in the log. When in doubt, hash content.

Never make `external_id` depend on anything outside the source bytes; otherwise
re-importing a backup creates phantom duplicates and the audit trail rots.
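
A minimal sketch of both ID strategies, assuming the record has already been parsed into a dict. The function names are hypothetical, and `json.dumps` with sorted keys stands in for the project's canonical serializer.

```python
import hashlib
import json


def content_external_id(record: dict) -> str:
    """SHA-256 over a canonical serialization of the record's source content."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def native_external_id(native_id: str, revision: int) -> str:
    """Legible alternative when the source has stable IDs and revisions."""
    return f"{native_id}@{revision}"


event = {"uid": "evt-1", "summary": "Trail cleanup", "dtstart": "2024-05-04T09:30:00Z"}
reordered = {"dtstart": "2024-05-04T09:30:00Z", "uid": "evt-1", "summary": "Trail cleanup"}
edited = {**event, "dtstart": "2024-05-05T09:30:00Z"}

# Same content, different key order: same ID, so a re-import deduplicates.
assert content_external_id(event) == content_external_id(reordered)
# Edited content: different ID, so the edit lands as new evidence.
assert content_external_id(event) != content_external_id(edited)
```

Nothing here reads the clock, the absolute path, or the loop index, so the ID depends on the source alone.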

## 4. Provenance obligations

Every evidence item must carry a provenance block with at least:

| Field | Meaning |
|---|---|
| `adapter` | the adapter `name` |
| `adapter_version` | the adapter `version` at import time |
| `source` | a human-meaningful locator: file path (relative where sensible), plus source-internal location (e.g. ICS UID, note filename + heading line, photo record index) |
| `source_content_hash` | SHA-256 of the exact source bytes the item was parsed from (the file, or the item's slice where the format makes that well-defined) |
| `imported_at` | RFC 3339 UTC timestamp of the import |

The point of `source_content_hash`: an auditor holding the original file can
prove the evidence in the log corresponds to those exact bytes, and a later
re-import can prove the source changed. Hash the **raw bytes**, before any
decoding or normalization your adapter performs.
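
The table and the raw-bytes rule can be sketched like this. `build_provenance` is a hypothetical helper, and the bare hex digest is an assumption; the shipped adapters may encode the hash differently.

```python
import hashlib
from datetime import datetime, timezone


def build_provenance(adapter: str, adapter_version: str, source: str,
                     raw: bytes, imported_at: datetime) -> dict:
    """Assemble the minimum provenance block for one evidence item."""
    return {
        "adapter": adapter,
        "adapter_version": adapter_version,
        "source": source,
        # Hash the bytes exactly as read, before decoding or normalizing.
        "source_content_hash": hashlib.sha256(raw).hexdigest(),
        "imported_at": imported_at.astimezone(timezone.utc).isoformat().replace("+00:00", "Z"),
    }


raw = "café\r\n".encode("utf-8")                  # bytes as read from disk
text = raw.decode("utf-8").replace("\r\n", "\n")  # normalized form used for parsing
prov = build_provenance("notes.text", "0.1.0", "notes/cafe.md", raw,
                        datetime(2024, 5, 4, 9, 30, tzinfo=timezone.utc))

# The hash covers the raw bytes, so it differs from a hash of the normalized text.
assert prov["source_content_hash"] != hashlib.sha256(text.encode("utf-8")).hexdigest()
```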

If your adapter normalizes (time zones, encodings, whitespace), record the
verbatim source value in the body next to the normalized one whenever the
normalization is lossy. The ICS adapter's handling of floating times is the
reference example.
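
To illustrate the idea (this is not the ICS adapter's actual code or field names), the sketch below normalizes an iCalendar-style `DTSTART` value to UTC. A floating time carries no zone, so interpreting it requires an assumed offset; that makes the normalization lossy, and the verbatim value is kept next to the normalized one.

```python
from datetime import datetime, timedelta, timezone


def normalize_dtstart(raw_value: str, assumed_utc_offset_hours: int) -> dict:
    """Return body fields for a start time; keep the source value if lossy."""
    if raw_value.endswith("Z"):
        # Explicit UTC: normalization only reformats, nothing is lost.
        moment = datetime.strptime(raw_value, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
        lossy = False
    else:
        # Floating time: the zone is our assumption, not something the source said.
        assumed_zone = timezone(timedelta(hours=assumed_utc_offset_hours))
        local = datetime.strptime(raw_value, "%Y%m%dT%H%M%S").replace(tzinfo=assumed_zone)
        moment = local.astimezone(timezone.utc)
        lossy = True
    fields = {"dtstart": moment.isoformat().replace("+00:00", "Z")}
    if lossy:
        fields["dtstart_source"] = raw_value  # verbatim, so the source stays reproducible
        fields["dtstart_floating"] = True
    return fields


floating = normalize_dtstart("20240504T093000", assumed_utc_offset_hours=-7)
explicit = normalize_dtstart("20240504T093000Z", assumed_utc_offset_hours=-7)
```

With the verbatim value preserved, a later adapter version can reinterpret the floating time under a different assumption without going back to the original file.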

## 5. Worked skeleton

A minimal new adapter, following the shipped ones (consult
`adapters/photos.py` for the simplest complete real example):

1. Create `src/pmp/adapters/myformat.py`. Subclass the base adapter; set
   `name = "myformat.v1"` and `version = "0.1.0"`.
2. Validate the path shape up front; raise the adapter error from
   `pmp.errors` if it is not what you accept.
3. Read the raw bytes, compute the source hash once.
4. Parse; for each record build the body dict (canonical-JSON-safe values
   only), compute `external_id` from a SHA-256 over the record's canonical
   content, attach the provenance block, and yield.
5. Sort records before yielding so output order is deterministic.
6. Register it in `adapters/__init__.py`.
7. Write tests (§6) and a small sample dataset under `samples/`.
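
Steps 1 to 5 might look roughly like the sketch below for a toy `key=value` format. Everything not named in this document is an assumption: `AdapterError`, `EvidenceItem`, the `parse` method name, the `.myf` extension, and the `myformat.entry` kind are hypothetical stand-ins, and a real adapter would subclass the base class in `adapters/base.py` rather than stand alone.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator


class AdapterError(Exception):
    """Stand-in for the adapter error type in pmp.errors."""


@dataclass(frozen=True)
class EvidenceItem:
    """Stand-in for the real evidence item type."""
    kind: str
    external_id: str
    body: dict
    provenance: dict


class MyFormatAdapter:
    name = "myformat.v1"
    version = "0.1.0"

    def parse(self, path: Path) -> Iterator[EvidenceItem]:
        # Step 2: validate eagerly, before any item is produced.
        if not path.is_file() or path.suffix != ".myf":
            raise AdapterError(f"{path}: expected a .myf file")
        return self._items(path)

    def _items(self, path: Path) -> Iterator[EvidenceItem]:
        # Step 3: read the raw bytes and hash them once.
        raw = path.read_bytes()
        source_hash = hashlib.sha256(raw).hexdigest()
        imported_at = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
        # Step 4: parse records (skip-and-count handling omitted for brevity).
        records = []
        for line in raw.decode("utf-8").splitlines():
            key, sep, value = line.partition("=")
            if sep:
                records.append({"key": key.strip(), "value": value.strip()})
        # Step 5: sort so the yield order is deterministic.
        for record in sorted(records, key=lambda r: (r["key"], r["value"])):
            canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
            yield EvidenceItem(
                kind="myformat.entry",
                external_id=hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
                body=record,
                provenance={
                    "adapter": self.name,
                    "adapter_version": self.version,
                    "source": f"{path.name}#{record['key']}",
                    "source_content_hash": source_hash,
                    "imported_at": imported_at,
                },
            )
```

Splitting `parse` from the `_items` generator is a deliberate choice in this sketch: a generator body does not run until it is iterated, so validating inside it would delay the error past the point where the caller expects it.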

Keep adapters dependency-free if at all possible. The three shipped adapters
use only the standard library. If a format truly requires a third-party parser,
declare it in `pyproject.toml` with a flexible version constraint (a compatible
range rather than an exact pin) and justify it in the PR: every dependency is
attack surface for software that handles a person's life.

## 6. Testing a new adapter

Follow the structure of `tests/test_adapter_photos.py` (smallest) and
`tests/test_adapter_ics.py` (richest). A new adapter's tests must cover:

1. **Golden parse** — run against a checked-in sample under `samples/`, assert
   the exact set of kinds, external IDs, and key body fields.
2. **Determinism** — parse twice, assert identical item sequences.
3. **Idempotency at the node level** — import via the node twice into a temp
   node (use the fixtures in `tests/conftest.py`), assert the second import
   appends zero operations.
4. **Provenance completeness** — every emitted item has all §4 fields;
   `source_content_hash` matches an independently computed SHA-256 of the
   sample bytes.
5. **Malformed input** — a corrupt source raises the adapter error (whole-source
   failure) and a sample with one bad record imports the good records and
   reports the skip count (per-item failure).
6. **Canonical-JSON safety** — each body round-trips through
   `pmp.canonical` serialization unchanged.
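
Three of these categories can be sketched as plain pytest-style functions. The parser and the in-memory "node" below are toys written for this example; real tests would exercise the adapter under test and the fixtures in `tests/conftest.py`, whose names are not assumed here.

```python
import hashlib

SAMPLE = b"b=2\na=1\n"  # stands in for a checked-in file under samples/
REQUIRED_PROVENANCE = {"adapter", "adapter_version", "source",
                       "source_content_hash", "imported_at"}


def parse_sample(raw: bytes) -> list[dict]:
    """Toy parser standing in for the adapter under test."""
    source_hash = hashlib.sha256(raw).hexdigest()
    return [
        {
            "kind": "myformat.entry",
            "external_id": hashlib.sha256(line.encode("utf-8")).hexdigest(),
            "body": {"line": line},
            "provenance": {
                "adapter": "myformat.v1",
                "adapter_version": "0.1.0",
                "source": "sample.myf",
                "source_content_hash": source_hash,
                "imported_at": "2024-05-04T09:30:00Z",  # fixed for the example
            },
        }
        for line in sorted(raw.decode("utf-8").splitlines())
    ]


def import_into(log: list, seen: set, items: list) -> int:
    """Toy node: append items whose (adapter, external_id) is new; return the count."""
    appended = 0
    for item in items:
        key = (item["provenance"]["adapter"], item["external_id"])
        if key not in seen:
            seen.add(key)
            log.append(item)
            appended += 1
    return appended


def test_determinism():
    # If the adapter stamps imported_at itself, compare everything except that field.
    assert parse_sample(SAMPLE) == parse_sample(SAMPLE)


def test_second_import_appends_nothing():
    log, seen = [], set()
    assert import_into(log, seen, parse_sample(SAMPLE)) == 2
    assert import_into(log, seen, parse_sample(SAMPLE)) == 0


def test_provenance_completeness():
    expected_hash = hashlib.sha256(SAMPLE).hexdigest()
    for item in parse_sample(SAMPLE):
        assert REQUIRED_PROVENANCE <= set(item["provenance"])
        assert item["provenance"]["source_content_hash"] == expected_hash
```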

Sample data must be **synthetic**. Never commit real personal data, even your
own — samples are forever. Extend the Avery Reyes fictional dataset
(`samples/README.md` describes its conventions) so the demo stays coherent.

## 7. Review checklist

Before opening a PR for a new adapter:

- [ ] `name` is unique, namespaced, and documented in the module docstring
- [ ] `version` set; bump policy understood
- [ ] No interpretation/derivation in emitted bodies — witness, not analyst
- [ ] `external_id` derived from source content only
- [ ] Full provenance block on every item; raw-byte source hash
- [ ] Deterministic yield order
- [ ] Per-item failures skip-and-count; whole-source failures raise
- [ ] Registered in `adapters/__init__.py`; visible in `pmp import --help`
- [ ] Synthetic sample dataset committed; all six test categories present
- [ ] No new dependencies, or new dependency declared in `pyproject.toml` and justified