# Demo 1 — Semantic Document Search & RAG

This demo ingests a small bundled corpus of 40 short encyclopedic articles,
embeds them with a pluggable embedding provider, and serves:

1. **Semantic search** — vector similarity over article bodies, optionally
   filtered by category or year.
2. **A minimal RAG pipeline** — retrieve top passages for a question, assemble
   a grounded context, and either (a) produce an extractive answer locally with
   no LLM dependency, or (b) optionally call an LLM you configure.
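
As a rough illustration of the first item, filtered vector search can be thought of as "filter on metadata, then rank by cosine similarity". The sketch below runs in memory and is not the code in `search.py`; the `search` function, its parameters, and the `vector` key on each document are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, docs, top_k=3, category=None, year=None):
    """Hypothetical helper: filter by metadata, then rank by similarity."""
    # Apply the optional metadata filters first.
    candidates = [
        d for d in docs
        if (category is None or d["category"] == category)
        and (year is None or d["year"] == year)
    ]
    # Score the remaining documents and keep the best top_k.
    scored = [(cosine(query_vec, d["vector"]), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A real vector store would normally use an index rather than scoring every candidate, but the result has the same shape: a ranked list of `(score, document)` pairs restricted to the requested category or year.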

Everything runs offline out of the box: the default embedding provider is a
deterministic, local feature-hashing embedder that needs no model downloads
and no API keys.
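
To make "feature hashing" concrete, here is a minimal sketch of the general technique using only the standard library. It is not the demo's actual embedder: the function name `hash_embed`, the 256-dimension default, and the tokenizer are all assumptions chosen for illustration.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 256) -> list[float]:
    """Map text to a fixed-size, L2-normalized vector via the hashing trick."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A cryptographic hash gives the same bucket on every run and machine,
        # unlike Python's built-in hash(), which is salted per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        # A hashed sign bit lets colliding tokens partly cancel instead of pile up.
        sign = 1.0 if digest[4] & 1 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # Leave an all-zero vector as-is (empty text, or tokens that cancel out).
    return [x / norm for x in vec] if norm else vec
```

Because the vector depends only on the tokens and a fixed hash function, the same text always produces the same embedding, which is what makes the offline default reproducible. The trade-off is that it captures word overlap rather than meaning.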

## Run it

```bash
# from the repository root, with the stack running (docker compose up -d)
pip install -r demos/requirements.txt

# 1. Ingest the bundled dataset (creates namespace "demo-articles")
python demos/semantic-search/ingest.py

# 2. Search
python demos/semantic-search/search.py "why do coral reefs matter"
python demos/semantic-search/search.py "how does fermentation preserve food" --filter-category food
python demos/semantic-search/search.py --interactive

# 3. Ask a question through the RAG pipeline (extractive, no LLM needed)
python demos/semantic-search/rag.py "What did the printing press change about books?"

# Optional: use an LLM for answer synthesis (requires `pip install openai` + OPENAI_API_KEY)
python demos/semantic-search/rag.py "What did the printing press change?" --llm openai
```

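## Better embeddings (optional)

To use a learned embedding model instead of the default embedder, install
`sentence-transformers` and set `LAGOON_EMBEDDINGS` for both ingest and search:
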
```bash
pip install 'sentence-transformers>=2.6'
LAGOON_EMBEDDINGS=sentence-transformers python demos/semantic-search/ingest.py
LAGOON_EMBEDDINGS=sentence-transformers python demos/semantic-search/search.py "underwater ecosystems"
```

The namespace records which provider produced its vectors; `search.py` refuses
to query with a mismatched provider so you never compare vectors from
different embedding spaces.
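
A guard of this kind might look like the sketch below. The metadata key `embedding_provider`, the exception class, and the function name are hypothetical; how the demo actually stores and checks the provider is not shown here.

```python
class ProviderMismatchError(RuntimeError):
    """Raised when the query-time provider differs from the ingest-time one."""


def check_provider(namespace_meta: dict, active_provider: str) -> None:
    """Hypothetical guard: refuse to query across embedding spaces."""
    # "embedding_provider" is an assumed key name for this illustration.
    recorded = namespace_meta.get("embedding_provider")
    if recorded is not None and recorded != active_provider:
        raise ProviderMismatchError(
            f"namespace was ingested with {recorded!r} "
            f"but the active provider is {active_provider!r}; "
            "re-ingest or switch providers"
        )
```

In a script, the active provider could be read from the environment (for example `os.environ.get("LAGOON_EMBEDDINGS")`) and passed to the guard before any query is issued.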

## Dataset

`data/articles.jsonl` contains 40 original short articles written for this
demo, spanning ten categories (science, history, technology, nature, food,
space, art, music, sports, geography). Each record has `id`, `title`, `body`,
`category`, and `year` fields.
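
For readers unfamiliar with JSONL (one JSON object per line), the following sketch shows one way to load such a file, check that every record carries the five fields listed above, and split the result into batches for upserting. The helper names and the batch size are assumptions, not the contents of `ingest.py`.

```python
import json

# The five fields named in this README.
REQUIRED_FIELDS = ("id", "title", "body", "category", "year")


def load_articles(lines):
    """Parse JSONL lines into dicts, failing fast on incomplete records."""
    articles = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        articles.append(record)
    return articles


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

`load_articles` accepts any iterable of strings, so an open file handle works directly, and `batched(articles, 16)` would then yield groups of up to 16 records.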

## What to look at

- `ingest.py` — batched upserts, schema with full-text + filterable fields,
  waiting for background indexing to catch up.
- `search.py` — vector queries with metadata filters and attribute projection.
- `rag.py` — retrieval, context assembly with source attributions, extractive
  answering, and an optional pluggable LLM hook.
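
The context-assembly and extractive-answering steps can be sketched as follows. This is a simplified illustration and not the code in `rag.py`: the function names, the `[n]` citation format, the character budget, and the word-overlap scoring are all assumptions.

```python
import re


def _terms(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_context(passages, max_chars=1200):
    """Number passages and keep a source list so answers can cite them."""
    parts, sources, used = [], [], 0
    for n, p in enumerate(passages, start=1):
        entry = f"[{n}] {p['title']}: {p['body']}"
        # Always keep the first passage; stop once the budget would be exceeded.
        if parts and used + len(entry) > max_chars:
            break
        parts.append(entry)
        sources.append({"ref": n, "id": p["id"], "title": p["title"]})
        used += len(entry)
    return "\n\n".join(parts), sources


def extractive_answer(question, passages):
    """Return the sentence sharing the most words with the question, with its ref."""
    q_terms = _terms(question)
    best_overlap, best_sentence, best_ref = 0, None, None
    for n, p in enumerate(passages, start=1):
        for sentence in re.split(r"(?<=[.!?])\s+", p["body"]):
            overlap = len(q_terms & _terms(sentence))
            if overlap > best_overlap:
                best_overlap, best_sentence, best_ref = overlap, sentence.strip(), n
    if best_sentence is None:
        return None  # nothing in the passages overlaps the question
    return f"{best_sentence} [{best_ref}]"
```

The same numbered context string could be passed to an LLM as grounding, while the extractive path needs nothing beyond the retrieved passages themselves.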