# Reproduction toolkit — FablePool Home Assistant ecosystem analysis

This package reproduces the raw data pulls behind the milestone-1 report.
The CSVs committed under `data/` at the repository root are the **frozen
snapshot** the report text cites; these tools let any maintainer regenerate
fresh equivalents (written to `data/generated/` so the snapshot is never
silently overwritten) and verify how far the ecosystem has drifted since.

## Install

```bash
cd tools
python -m venv .venv && source .venv/bin/activate
pip install -e .
```

Python ≥ 3.10. Dependencies (all declared in `pyproject.toml`):
`requests` (≥2.31,<3), `urllib3` (≥1.26,<3), `PyYAML` (≥6,<7).
No lockfile is committed, so `pip` resolves these on first install; if you
want reproducible installs, run `pip freeze > requirements-lock.txt` after a
successful install.

Optional but recommended: export a GitHub token (classic or fine-grained,
no scopes needed for public data). It roughly triples issue-triage speed
(the search API allows 30 req/min authenticated versus 10 anonymous) and
gives the tree-listing calls a 5000 req/h budget:

```bash
export GITHUB_TOKEN=ghp_...
```
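
As a minimal sketch of how a script could pick the token up, the helper below builds request headers and adds `Authorization` only when `GITHUB_TOKEN` is set. The name `github_headers` is hypothetical and is not taken from the package; the header names are the standard GitHub REST ones.

```python
import os


def github_headers(env=None):
    """Build GitHub REST headers, adding a token only when one is exported.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if env is None else env
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        # Anonymous requests simply omit this header.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Without the variable the same code path runs anonymously, only against the lower rate limits.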

## Pipeline (run from the repository root)

```bash
# 1. Install base — one HTTP request, cached for a day. ~2 s.
ha-fetch-analytics --top 300 --out data/generated/analytics_integrations.csv

# 2. Quality scale for the top 150 (with per-rule detail). ~1 min first run.
ha-fetch-quality-scale \
    --domains-csv data/generated/analytics_integrations.csv \
    --limit 150 --rules \
    --out data/generated/quality_scale_raw.csv

#    Optional: full-ecosystem tier distribution (~2500 cached raw fetches,
#    10–15 min first run, instant afterwards):
ha-fetch-quality-scale --all \
    --out data/generated/quality_scale_all.csv \
    --distribution-out data/generated/quality_scale_distribution_all.csv

# 3. Open-issue triage for the top 150. ~6 min with a token, ~17 min without
#    (GitHub search API: 30 req/min authenticated, 10 anonymous; the tool
#    self-throttles and checkpoints every 25 domains).
ha-fetch-issues \
    --domains-csv data/generated/analytics_integrations.csv \
    --limit 150 --sample 20 \
    --out data/generated/issue_triage_raw.csv

# 4. Join everything into the report tables + Markdown. Offline, <1 s.
ha-build-report-tables --top 150 --outdir data/generated
```

Outputs of step 4: `top_integrations_generated.csv`,
`quality_scale_distribution_generated.csv`, `issue_triage_generated.csv`,
and `tables.md` (all three tables rendered as Markdown).

Each console script is also runnable as a module, e.g.
`python -m ha_analysis.fetch_analytics --help`.
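
The step 3 timings follow from the search-API limits quoted in its comment. The sketch below computes the minimum request spacing and a wall-clock floor, under the assumption (not confirmed by the tool's source) of one search request per domain; the function names are illustrative only.

```python
# Limits quoted in the step 3 comment (requests per minute).
SEARCH_LIMIT_PER_MIN = {"token": 30, "anonymous": 10}


def min_interval_s(mode):
    """Smallest spacing between search requests that stays under the limit."""
    return 60.0 / SEARCH_LIMIT_PER_MIN[mode]


def lower_bound_minutes(n_requests, mode):
    """Wall-clock floor for n_requests, ignoring latency and retries."""
    return n_requests * min_interval_s(mode) / 60.0


for mode in ("token", "anonymous"):
    print(mode, min_interval_s(mode), lower_bound_minutes(150, mode))
```

For 150 requests this gives floors of 5 and 15 minutes; the ~6 and ~17 minute figures quoted for step 3 sit above them, as expected once latency and any extra requests are included.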

## Caching, resumability, failure behaviour

* GET responses for parameter-free URLs (analytics payload, manifests,
  `quality_scale.yaml`, git-tree lookups) are cached under `tools/.cache`
  (TTL: 1 day for analytics, 7 days for repo content; tune with
  `--cache-ttl`). Delete the directory to force a cold pull.
* GitHub search responses are **not** cached (they carry query parameters
  and go stale fast); `ha-fetch-issues` checkpoints partial CSVs instead,
  and rows with a `fetch_error` value can be retried by re-running.
* All scripts honour `Retry-After` / `X-RateLimit-Reset` headers and abort
  with a clear message (rather than spinning) if a wait would exceed 15 min.
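
The cache and rate-limit behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: the helper names, the SHA-256 file naming, and the `.body` suffix are all hypothetical, and `Retry-After` is handled only in its delta-seconds form.

```python
import hashlib
import time
from pathlib import Path

MAX_WAIT_S = 15 * 60  # abort threshold stated in the list above


def cache_path(cache_dir, url):
    """Map a parameter-free URL to a stable file name under the cache dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.body"


def is_fresh(path, ttl_s, now=None):
    """True if the cached file exists and is younger than the TTL."""
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < ttl_s


def wait_seconds(headers, now=None):
    """Seconds to sleep according to Retry-After or X-RateLimit-Reset.

    `headers` is a plain dict here, so lookups are case-sensitive.
    X-RateLimit-Reset is a Unix epoch timestamp in seconds.
    """
    now = time.time() if now is None else now
    if "Retry-After" in headers:
        wait = float(headers["Retry-After"])
    elif "X-RateLimit-Reset" in headers:
        wait = float(headers["X-RateLimit-Reset"]) - now
    else:
        return 0.0
    wait = max(wait, 0.0)
    if wait > MAX_WAIT_S:
        # Fail with a clear message instead of sleeping for a long time.
        raise RuntimeError(
            f"rate-limit wait of {wait:.0f} s exceeds {MAX_WAIT_S} s; aborting"
        )
    return wait
```

Keeping the clock injectable (`now=`) makes the TTL and wait logic testable without real sleeps or network calls.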

## Known schema fragility (verify on first run)

These are external surfaces with no stability contract; each script fails
loudly with a descriptive error rather than emitting a wrong table:

1. **`https://analytics.home-assistant.io/data.json`** — assumed to be a
   mapping of millisecond-timestamp keys → snapshot objects containing an
   `integrations: {domain: count}` mapping and a reporting-installs total
   (`reports_integrations`, falling back to `active_installations`). If the
   percentage column comes out blank, inspect the cached body and adjust the
   total-field lookup in `fetch_analytics.py`.
2. **`manifest.json` / `quality_scale.yaml`** in `home-assistant/core` —
   assumed fields: `quality_scale`, `integration_type`, `iot_class`,
   `codeowners`; rule entries as bare status strings or `{status, comment}`
   mappings. Tier vocabulary is whatever appears in manifests (bronze /
   silver / gold / platinum / legacy / internal), plus our synthetic
   `unscored` (field absent) and `missing` (no manifest at that path).
3. **GitHub issue labels** — the triage assumes core's `integration: <domain>`
   label convention. Spot-check one domain's count against the GitHub UI
   filter `repo:home-assistant/core is:issue is:open label:"integration: zha"`.
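
To make "fails loudly" concrete, here is a hedged sketch of validators for items 1 and 2. The function names are hypothetical and the checks follow only the assumptions listed above; the real scripts may validate differently.

```python
def latest_snapshot(payload):
    """Return (integrations, total) from the newest snapshot, or raise."""
    if not isinstance(payload, dict) or not payload:
        raise ValueError("analytics payload is not a non-empty mapping")
    try:
        # Keys are assumed to be millisecond timestamps stored as strings.
        newest_key = max(payload, key=int)
    except (TypeError, ValueError) as exc:
        raise ValueError("analytics keys are not integer timestamps") from exc
    snap = payload[newest_key]
    if not isinstance(snap, dict):
        raise ValueError(f"snapshot {newest_key} is not a mapping")
    integrations = snap.get("integrations")
    if not isinstance(integrations, dict) or not integrations:
        raise ValueError(f"snapshot {newest_key} has no 'integrations' mapping")
    total = snap.get("reports_integrations") or snap.get("active_installations")
    if not total:
        raise ValueError(f"snapshot {newest_key} has no reporting-installs total")
    return integrations, total


def normalize_rule(entry):
    """Accept a bare status string or a {status, comment} mapping."""
    if isinstance(entry, str):
        return entry, ""
    if isinstance(entry, dict) and isinstance(entry.get("status"), str):
        return entry["status"], str(entry.get("comment", ""))
    raise ValueError(f"unrecognised rule entry: {entry!r}")


# Illustrative data only; real payloads are far larger.
payload = {
    "1700000000000": {"integrations": {"zha": 10}, "reports_integrations": 100},
    "1700086400000": {"integrations": {"zha": 25}, "reports_integrations": 200},
}
integrations, total = latest_snapshot(payload)
share = {domain: 100 * n / total for domain, n in integrations.items()}
```

Raising on an unexpected shape, rather than defaulting to an empty value, is what keeps a schema change from silently producing a blank percentage column.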

## Build-hygiene self-audit

* **Imports vs. manifest** — third-party imports across the package are
  exactly `requests`, `urllib3.util.retry.Retry`, and `yaml` (provided by
  `PyYAML`); all three are declared in `pyproject.toml` with flexible
  (`>=,<` major-bound) ranges.
  Everything else is the standard library (`argparse`, `csv`, `json`,
  `hashlib`, `pathlib`, `collections`, `time`, `os`, `sys`).
* **Toolchain** — `requires-python = ">=3.10"`; no exact interpreter pin,
  no committed lockfile. The newest features used are `str.removeprefix`
  (3.9+) and `X | None` unions (3.10+), consistent with the floor.
* **API surfaces** — only long-stable endpoints are used: plain GitHub REST
  v3 (`/branches`, `/git/commits`, `/git/trees`, `/search/issues` with the
  `X-GitHub-Api-Version: 2022-11-28` header), `raw.githubusercontent.com`, and the
  public analytics JSON. No SDKs, so no fast-moving client-library calls to
  drift. The unverifiable assumptions are the three schema items above.
* **Not verifiable here** — nothing in this environment could execute the
  scripts against the live endpoints; a maintainer's first run should
  confirm items 1–3 above (each fails loudly if violated).