# Reproduction toolkit: FablePool Home Assistant ecosystem analysis

This package reproduces the raw data pulls behind the milestone-1 report. The CSVs committed under `data/` at the repository root are the **frozen snapshot** the report text cites; these tools let any maintainer regenerate fresh equivalents (written to `data/generated/` so the snapshot is never silently overwritten) and verify how far the ecosystem has drifted since.

## Install

```bash
cd tools
python -m venv .venv && source .venv/bin/activate
pip install -e .
```

Python ≥ 3.10. Dependencies (all declared in `pyproject.toml`): `requests` (≥2.31,<3), `urllib3` (≥1.26,<3), `PyYAML` (≥6,<7). No lockfile is committed; `pip` resolves these on first install. If you want reproducible installs, run `pip freeze > requirements-lock.txt` after a successful build.

Optional but recommended: export a GitHub token (classic or fine-grained, no scopes needed for public data). It triples the issue-triage speed and gives the tree-listing calls a 5000 req/h budget:

```bash
export GITHUB_TOKEN=ghp_...
```

## Pipeline (run from the repository root)

```bash
# 1. Install base: one HTTP request, cached for a day. ~2 s.
ha-fetch-analytics --top 300 --out data/generated/analytics_integrations.csv

# 2. Quality scale for the top 150 (with per-rule detail). ~1 min first run.
ha-fetch-quality-scale \
    --domains-csv data/generated/analytics_integrations.csv \
    --limit 150 --rules \
    --out data/generated/quality_scale_raw.csv

# Optional: full-ecosystem tier distribution (~2500 cached raw fetches,
# 10–15 min first run, instant afterwards):
ha-fetch-quality-scale --all \
    --out data/generated/quality_scale_all.csv \
    --distribution-out data/generated/quality_scale_distribution_all.csv

# 3. Open-issue triage for the top 150. ~6 min with a token, ~17 min without
#    (GitHub search API: 30 req/min authenticated, 10 anonymous; the tool
#    self-throttles and checkpoints every 25 domains).
ha-fetch-issues \
    --domains-csv data/generated/analytics_integrations.csv \
    --limit 150 --sample 20 \
    --out data/generated/issue_triage_raw.csv

# 4. Join everything into the report tables + Markdown. Offline, <1 s.
ha-build-report-tables --top 150 --outdir data/generated
```

Outputs of step 4: `top_integrations_generated.csv`, `quality_scale_distribution_generated.csv`, `issue_triage_generated.csv`, and `tables.md` (all three tables rendered as Markdown).

Each console script is also runnable as a module, e.g. `python -m ha_analysis.fetch_analytics --help`.

## Caching, resumability, failure behaviour

* GET responses for parameter-free URLs (analytics payload, manifests, `quality_scale.yaml`, git-tree lookups) are cached under `tools/.cache` (TTL: 1 day for analytics, 7 days for repo content; tune with `--cache-ttl`). Delete the directory to force a cold pull.
* GitHub search responses are **not** cached (they carry query parameters and go stale fast); `ha-fetch-issues` checkpoints partial CSVs instead, and rows with a `fetch_error` value can be retried by re-running.
* All scripts honour `Retry-After` / `X-RateLimit-Reset` headers and abort with a clear message (rather than spinning) if a wait would exceed 15 min.

## Known schema fragility (verify on first run)

These are external surfaces with no stability contract; each script fails loudly with a descriptive error rather than emitting a wrong table:

1. **`https://analytics.home-assistant.io/data.json`**: assumed to be a mapping of millisecond-timestamp keys → snapshot objects containing an `integrations: {domain: count}` mapping and a reporting-installs total (`reports_integrations`, falling back to `active_installations`). If the percentage column comes out blank, inspect the cached body and adjust the total-field lookup in `fetch_analytics.py`.
2. **`manifest.json` / `quality_scale.yaml`** in `home-assistant/core`: assumed fields are `quality_scale`, `integration_type`, `iot_class`, `codeowners`; rule entries are either bare status strings or `{status, comment}` mappings. Tier vocabulary is whatever appears in manifests (bronze / silver / gold / platinum / legacy / internal), plus our synthetic `unscored` (field absent) and `missing` (no manifest at that path).
3. **GitHub issue labels**: the triage assumes core's `integration: ` label convention. Spot-check one domain's count against the GitHub UI filter `repo:home-assistant/core is:issue is:open label:"integration: zha"`.

## Build-hygiene self-audit

* **Imports vs. manifest**: third-party imports across the package are exactly `requests`, `urllib3.util.retry.Retry`, and `yaml`; all three are declared in `pyproject.toml` with flexible (`>=,<` major-bound) ranges. Everything else is the standard library (`argparse`, `csv`, `json`, `hashlib`, `pathlib`, `collections`, `time`, `os`, `sys`).
* **Toolchain**: `requires-python = ">=3.10"`; no exact interpreter pin, no committed lockfile. The syntax used is consistent with that floor: `str.removeprefix` needs 3.9+ and `X | None` unions need 3.10+.
* **API surfaces**: only long-stable endpoints are used: plain GitHub REST v3 (`/branches`, `/git/commits`, `/git/trees`, `/search/issues` with the `2022-11-28` API version header), `raw.githubusercontent.com`, and the public analytics JSON. No SDKs, so no fast-moving client-library calls to drift. The unverifiable assumptions are the three schema items above.
* **Not verifiable here**: nothing in this environment could execute the scripts against the live endpoints; a maintainer's first run should confirm items 1–3 above (each fails loudly if violated).