# Lagoon Roadmap & Non-Goals

This document states honestly where Lagoon is, where it is going, and, just as importantly, what it deliberately is **not**. Roadmap items are intent, not commitments; ordering reflects current priorities and may change with community input.

## Status: v1 (current)

Delivered and tested:

- Object-storage-native storage engine: WAL, immutable segments, manifest CAS commits, tiered compaction, GC, crash recovery (filesystem, MinIO, S3).
- Query engine: exact kNN + IVF-Flat ANN (cosine/dot/euclidean), BM25 full-text with field weighting, sparse vectors, exact pre-filtering with the full filter AST, RRF + weighted hybrid fusion, multi-query, query planner.
- Stateless HTTP API server with API keys and reader/writer/admin roles; memory + disk LRU cache hierarchy with warming and pinning.
- Copy-on-write namespace branching with reference-safe GC.
- CLI, Python SDK, TypeScript SDK, OpenAPI spec, Docker Compose, Kubernetes manifests/Helm chart.
- Demos (semantic/RAG, hybrid, code-search-with-branching), benchmark suite, and this documentation set.

## v1.x: Hardening (next)

- **Cost-aware planner**: feed object-storage fetch counts into plan choice, not just selectivity estimates.
- **Scalar quantization (int8) for IVF posting lists**: ~4× smaller vector segments, with measured recall impact published in the benchmark results.
- **Smarter warming**: access-pattern-driven prefetch of postings blocks instead of whole-segment warming.
- **Conditional-write coverage** for object stores lacking compare-and-swap (lease-based fallback is in; needs more soak testing).
- **Backpressure & admission control** when indexing lag grows.
- Expanded chaos/recovery test matrix (fault injection on every storage op).

## v2: Scale-out

- **Namespace sharding** for namespaces beyond single-node cache capacity (hash-partitioned segments, scatter-gather query execution).
- **Distributed read-your-writes** improvements: manifest change notification (e.g., via object-store event streams) to shrink the staleness window below the poll interval.
- **Multi-tenant scheduling**: per-key resource isolation beyond rate limits.
- **Incremental reindexing** on schema changes (today: full rebuild).
- **Snapshot/restore tooling** built on branching + export.

## v2+: Exploration (no committed order)

- Product quantization / OPQ for very large vector corpora.
- Learned or adaptive `nprobe` selection targeting a recall SLO.
- Streaming change feed per namespace (consume the WAL as CDC).
- Multi-region replication (bucket-level replication + manifest fencing).
- Pluggable rerankers (cross-encoder hook executed server-side).
- GPU-accelerated scoring for batch/offline workloads.

## Non-Goals for v1

We say no to these on purpose. Some may graduate to the roadmap later; many never will.

1. **Not a general-purpose OLTP/OLAP database.** No joins, aggregations, transactions across namespaces, or SQL. Lagoon stores documents and ranks them.
2. **No memory-resident graph ANN (HNSW etc.).** v1 commits to IVF-style indexes because they page well from object storage and rebuild deterministically. If your workload demands graph-index latency at the cost of RAM-residency, other tools serve that better today.
3. **No built-in embedding generation.** Lagoon never calls OpenAI, Cohere, or any model. Demos show pluggable provider patterns; the database stays vendor-neutral.
4. **No strong cross-node consistency.** Multi-node reads are bounded-stale by design (manifests are polled). We document this instead of pretending otherwise; `min_manifest_generation` exists for callers who need read-your-writes across nodes.
5. **No per-document ACLs or row-level security.** Authorization is key + role + namespace scope. Model finer-grained access with namespaces (branching makes per-tenant/per-workspace namespaces cheap).
6. **No client-side encryption.** Encryption at rest is delegated to provider SSE (see the deployment guide); in-transit encryption to your TLS proxy.
7. **No exactly-once cross-system delivery guarantees.** Idempotency keys give safe retries; distributed transactions with your other systems are your orchestration layer's job.
8. **No performance-parity claims against commercial products.** Our benchmark suite measures *Lagoon* honestly on disclosed hardware with a reproducible methodology. We publish our numbers and our methodology; we do not publish comparisons we haven't rigorously run, and we make no parity claims about Turbopuffer or any other product.
9. **No serverless control plane, billing, or hosted offering** in this repository. Lagoon v1 is self-hosted software.
10. **No automatic schema inference migrations.** Vector dimensions and metrics are immutable per namespace; changing them means a new namespace (exports + branching make this cheap).

## How to influence this roadmap

Open a GitHub discussion with your workload shape (corpus size, QPS, latency/recall targets, cold-vs-warm mix). Real workload evidence moves items up this list faster than anything else.
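
If it helps to keep a workload report consistent, the shape described above can be captured in a small structure and rendered as Markdown for pasting into a discussion. This is only a sketch: the `WorkloadShape` class and its field names are hypothetical and are not part of Lagoon or its SDKs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadShape:
    """Hypothetical workload summary; field names are illustrative only."""

    corpus_size: int        # number of documents in the namespace
    qps: float              # sustained queries per second
    p99_latency_ms: float   # latency target at the 99th percentile
    recall_target: float    # target recall, as a fraction in [0, 1]
    cold_fraction: float    # share of queries expected to miss the cache, in [0, 1]

    def __post_init__(self) -> None:
        # Reject fractions outside [0, 1] so the report cannot silently mislead.
        for name in ("recall_target", "cold_fraction"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def to_markdown(self) -> str:
        """Render the summary as a Markdown bullet list."""
        warm_fraction = 1 - self.cold_fraction
        return "\n".join([
            f"- Corpus size: {self.corpus_size:,} documents",
            f"- QPS: {self.qps:g}",
            f"- p99 latency target: {self.p99_latency_ms:g} ms",
            f"- Recall target: {self.recall_target:.0%}",
            f"- Cold/warm mix: {self.cold_fraction:.0%} cold, {warm_fraction:.0%} warm",
        ])


if __name__ == "__main__":
    # Example numbers are made up for illustration.
    print(WorkloadShape(5_000_000, 200.0, 50.0, 0.95, 0.2).to_markdown())
```

Stating each target with its unit (milliseconds, fraction of cold queries) makes it easier to compare reports from different users.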