Build an open-source Turbopuffer-style object-storage-native search database
by Chris Stones · raised 13,337 credits · spent 8,405 credits · refunded 4,931 credits · pool 1 credits
Build a legally distinct, greenfield, open-source clone/alternative inspired by Turbopuffer: an object-storage-native vector + full-text search database for AI/RAG applications.

Important legal and product constraint: Do not copy Turbopuffer’s source code, private implementation, brand, logo, website design, copy, names, trade dress, proprietary assets, or exact documentation. This must be a clean-room implementation with original branding, original UI, original docs, and its own API design where appropriate. Use Turbopuffer only as public product inspiration: object storage as the durable source of truth, stateless compute, SSD/memory caching, namespaces, vector search, full-text search, hybrid search, metadata filters, and namespace branching.

Working name: Use a placeholder original name such as “OpenPuffer” or choose a better legally distinct project name during planning.

Goal: Create a production-quality open-source search engine that developers can run locally or deploy to cloud infrastructure. It should support cheap cold storage on S3-compatible object storage, fast warm queries through local disk and memory caches, and a developer-friendly HTTP API plus Python and TypeScript clients.

Core requirements:

1. Architecture
- Durable state lives in object storage: S3-compatible storage, MinIO for local development, and local filesystem mode for tests.
- Compute nodes should be as stateless as practical.
- Use a per-namespace storage layout with manifests, write-ahead log files, index files, metadata files, and branch references.
- Separate query serving from background indexing/compaction where possible.
- Provide a clear architecture document explaining the storage format, query path, write path, consistency model, and tradeoffs.

2. API server
- Build an HTTP JSON API with authentication via API keys.
- Support organizations/projects/namespaces.
- Namespaces are isolated document spaces.
- Documents have IDs, optional dense vectors, optional sparse vectors, text fields, and arbitrary metadata attributes.
- Include endpoints for:
  - create/update/delete namespace
  - namespace metadata
  - upsert documents
  - patch documents
  - delete by ID
  - delete by filter
  - query
  - export namespace
  - copy namespace
  - branch namespace
  - warm cache / pin namespace
  - health check and metrics

3. Write path
- Implement durable write-ahead logging to object storage.
- Support batched upserts.
- Support row-oriented and column-oriented document ingestion if feasible.
- Support idempotent writes where possible.
- Support basic conditional writes.
- Implement background indexing after writes.
- Implement compaction so many small WAL/index files can be merged into efficient queryable segments.
- Ensure successful writes are recoverable after process restart.

4. Query path
- Support dense vector search:
  - exact kNN baseline
  - approximate nearest neighbor index suitable for object-storage-backed retrieval, preferably IVF/centroid-style rather than a memory-only graph index
  - cosine, dot product, and Euclidean distance
- Support full-text search:
  - tokenization
  - inverted index
  - BM25 ranking
  - field weighting or boosting
- Support hybrid search:
  - combine vector and BM25 results
  - support weighted score fusion and reciprocal rank fusion
  - allow multi-query requests
- Support sparse vector search if feasible.
- Support metadata filters:
  - Eq, NotEq, Gt, Gte, Lt, Lte, In, ContainsAny
  - And, Or, Not
  - string matching where practical
- Support returning selected attributes only.
- Support top_k / limit.
- Include a query planner that chooses between exact search, ANN, full-text index, filter index, and hybrid execution paths.

5. Indexing
- Build persistent index files stored in object storage.
- Build local cacheable index segments.
- Include filter/attribute indexes for fast filtered search.
- Include an index metadata manifest.
- Include rebuild and repair commands.
- Include tests proving indexes survive restarts and can be reconstructed from object storage.

6. Cache hierarchy
- Implement local disk cache for recently queried namespaces or segments.
- Implement memory cache for hot metadata/index pieces.
- Use LRU or configurable eviction.
- Add a warm-cache endpoint that preloads a namespace or selected segments.
- Add optional namespace pinning so selected namespaces stay warm.
- Benchmark cold vs warm query behavior.

7. Namespace branching
- Implement copy-on-write namespace branching.
- Branching should create a new namespace quickly by referencing existing immutable files/segments.
- Writes to the branch should not affect the source.
- Writes to the source should not affect the branch.
- Deleting a branch must not delete shared source data that is still referenced.
- Include tests for branch isolation, multi-level branches, branch deletion, and querying branches.

8. Consistency and recovery
- Document the consistency guarantees clearly.
- At minimum, provide durable writes and read-your-writes behavior on a single-node deployment.
- For distributed/multi-node mode, clearly document any eventual consistency tradeoffs.
- Include crash-recovery tests:
  - restart after writes
  - restart during indexing
  - restart during compaction
  - corrupted/incomplete file handling where possible

9. Developer experience
- Provide a CLI for:
  - starting local dev services
  - creating namespaces
  - loading JSON/JSONL/CSV data
  - running queries
  - exporting data
  - warming cache
  - benchmarking
- Provide Python SDK.
- Provide TypeScript/JavaScript SDK.
- Provide Docker Compose with API server, worker/indexer, MinIO, and example app.
- Provide optional Kubernetes manifests or Helm chart.
- Provide OpenAPI spec.

10. Demo applications
- Build at least three demos:
  - semantic document search/RAG demo
  - hybrid search demo over text + embeddings
  - code-search style demo using namespace branching to represent different branches or workspaces
- Include sample datasets small enough to run locally.
- Include scripts to generate embeddings using a pluggable embedding provider, but the database itself must not depend on any one embedding vendor.

11. Observability and operations
- Add structured logs.
- Add Prometheus metrics.
- Add basic tracing hooks or OpenTelemetry instrumentation if feasible.
- Include metrics for query latency, cold/warm cache hits, indexing lag, object storage reads/writes, WAL size, segment count, and compaction status.
- Add rate limits and quotas as optional configuration.
- Add audit logs for namespace and write operations.

12. Security
- API key authentication.
- Avoid logging secrets.
- Basic role model if feasible: admin, writer, reader.
- Optional encryption-at-rest documentation using object storage provider encryption.
- Document threat model and deployment hardening guidance.

13. Testing and benchmarking
- Unit tests for storage, API, filters, ranking, vector math, BM25, branching, caching, and recovery.
- Integration tests using MinIO.
- End-to-end tests through the Python and TypeScript clients.
- Benchmark suite that measures:
  - ingest throughput
  - cold query latency
  - warm query latency
  - recall@k for ANN vs exact kNN
  - BM25 latency
  - hybrid query latency
  - cache hit rates
- Include realistic but honest benchmark results. Do not claim parity with Turbopuffer unless demonstrated.

14. Documentation
- README with quickstart.
- Architecture guide.
- API reference.
- Python SDK docs.
- TypeScript SDK docs.
- Deployment guide.
- Benchmark guide.
- Contributor guide.
- Roadmap.
- Clear “non-goals for v1” section.

Preferred implementation:
- Use Rust for the core server and storage/query engine if practical.
- Use Python and TypeScript for client SDKs.
- Use Docker Compose for local development.
- Use Apac
Milestones — actual cost 9,553 credits
Clean-room design package and runnable skeleton. Deliverables: (1) full architecture document covering object-storage layout (per-namespace manifests, WAL files, immutable segments, branch references), write path, query path, consistency model (durable writes, read-your-writes single-node, documented eventual-consistency tradeoffs for multi-node), and explicit tradeoff analysis; (2) on-disk/object-storage format specification with versioning rules; (3) complete OpenAPI 3.1 spec for all endpoints (namespaces, upsert/patch/delete, delete-by-filter, query, export, copy, branch, warm/pin, health, metrics) with original API design; (4) Rust workspace scaffold (core engine crate, API server crate, CLI crate) compiling with storage abstraction trait implemented for local filesystem and S3-compatible backends (MinIO config included); (5) legally distinct naming/branding rationale and non-goals-for-v1 doc. Independently valuable as a reviewable design + buildable skeleton.
Working Rust storage engine over object storage. Deliverables: write-ahead logging to object storage with batched upserts, patch and delete-by-id, idempotent write keys, basic conditional writes; row- and column-oriented ingestion paths; background indexer that folds WAL entries into immutable segments; compaction that merges small WAL/segment files into efficient queryable segments with manifest atomically updated; crash-recovery logic (restart after writes, restart mid-indexing, restart mid-compaction, detection of incomplete/corrupt files); rebuild and repair commands. Includes a substantial unit + integration test suite against local filesystem and MinIO proving writes survive process restarts and indexes are reconstructable from object storage alone.
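As a rough illustration of the write-ahead logging, idempotent write keys, and incomplete-file detection named in this milestone, here is a minimal Python sketch over a local directory. The `FileWal` class, the `.wal` file naming, and the JSON record layout are all hypothetical and are not taken from the project's format specification. The engine itself is specified in Rust against object storage, where the rename-based publish step used here would need a different mechanism.

```python
import hashlib
import json
import os
import tempfile


class FileWal:
    """Toy WAL (hypothetical): one immutable JSON file per batch, named by sequence number."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)
        # Never reuse a sequence number, even if the file that holds it is corrupt.
        published = [n for n in os.listdir(root) if n.endswith(".wal")]
        self.next_seq = max((int(n[:-4]) for n in published), default=-1) + 1
        # Rebuild the set of applied idempotency keys after a restart.
        self.seen_keys = {record["key"] for record in self.replay()}

    def append(self, idempotency_key: str, docs: list) -> bool:
        """Durably append a batch; return False if this key was already applied."""
        if idempotency_key in self.seen_keys:
            return False
        payload = json.dumps(docs, sort_keys=True)
        record = {
            "seq": self.next_seq,
            "key": idempotency_key,
            "sha256": hashlib.sha256(payload.encode()).hexdigest(),
            "payload": payload,
        }
        final_path = os.path.join(self.root, f"{self.next_seq:012d}.wal")
        fd, tmp_path = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before publishing
        os.replace(tmp_path, final_path)  # atomic publish on a local filesystem
        self.seen_keys.add(idempotency_key)
        self.next_seq += 1
        return True

    def replay(self):
        """Yield valid records in sequence order, skipping incomplete or corrupt files."""
        for name in sorted(os.listdir(self.root)):
            if not name.endswith(".wal"):
                continue  # leftover .tmp files were never published
            try:
                with open(os.path.join(self.root, name)) as f:
                    record = json.load(f)
                _key, payload = record["key"], record["payload"]
                digest = hashlib.sha256(payload.encode()).hexdigest()
                expected = record["sha256"]
            except (ValueError, KeyError, TypeError, AttributeError):
                continue  # truncated or malformed file
            if digest == expected:
                yield record
```

Constructing a second `FileWal` on the same directory simulates a process restart: the replay skips any truncated file, and a retried batch that reuses an already applied key is ignored.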
Complete query path in Rust. Deliverables: exact kNN baseline with cosine/dot/Euclidean; IVF/centroid-style ANN index designed for object-storage-backed retrieval (cluster centroids in manifest, posting lists as cacheable segments) with recall-vs-nprobe tuning; full-text engine with tokenizer, persistent inverted index, BM25 ranking, and field boosting; optional sparse-vector scoring path; metadata filter engine supporting Eq/NotEq/Gt/Gte/Lt/Lte/In/ContainsAny and And/Or/Not with attribute/filter indexes for fast filtered search; hybrid search with weighted score fusion and reciprocal rank fusion plus multi-query requests; selected-attribute projection and top_k; a query planner choosing between exact, ANN, full-text, filter-index, and hybrid execution. Includes unit tests for vector math, BM25 correctness, filter algebra, and recall@k validation tests of ANN against exact kNN.
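The two hybrid fusion strategies named in this milestone can be sketched briefly in Python. This is an illustration only, not the project's Rust code. The function names, the min-max normalization used for weighted fusion, and the default constant `k = 60` for reciprocal rank fusion are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def _min_max(scores: dict) -> dict:
    """Rescale raw scores to [0, 1] so vector and BM25 scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: ((s - lo) / span if span else 1.0) for d, s in scores.items()}


def weighted_score_fusion(vector_scores: dict, bm25_scores: dict,
                          vector_weight: float = 0.5) -> list:
    """Combine normalized scores; a document missing from one side scores 0 there."""
    v, t = _min_max(vector_scores), _min_max(bm25_scores)
    fused = {
        d: vector_weight * v.get(d, 0.0) + (1.0 - vector_weight) * t.get(d, 0.0)
        for d in v.keys() | t.keys()
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # best first, from vector search
    text_hits = ["b", "d", "a"]     # best first, from BM25
    print([doc for doc, _ in reciprocal_rank_fusion([vector_hits, text_hits])])
```

Rank fusion needs only the order of each result list, which sidesteps the fact that similarity scores and BM25 scores live on different scales. Weighted score fusion has to normalize those scales first, and the choice of normalization is itself a design decision.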
Production HTTP layer and the differentiating features. Deliverables: full JSON API server implementing the OpenAPI spec with API-key auth, org/project/namespace hierarchy, and admin/writer/reader role model; export and copy-namespace endpoints; copy-on-write namespace branching via manifest references to shared immutable segments, with reference counting so branch deletion never removes still-referenced source data, plus tests for branch isolation in both directions, multi-level branches, and querying branches; two-tier cache hierarchy (local SSD segment cache + in-memory hot metadata/index cache) with configurable LRU eviction, warm-cache endpoint, and namespace pinning; structured logging, Prometheus metrics (query latency, cold/warm hit rates, indexing lag, object-storage I/O, WAL size, segment count, compaction status), OpenTelemetry tracing hooks, audit logs, optional rate limits/quotas; secret-safe logging and threat-model/hardening documentation. Includes integration tests against MinIO covering the full API surface.
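The copy-on-write branching with reference counting described in this milestone can be illustrated with a small in-memory model. The `SegmentStore` class and its method names are hypothetical. In this sketch a manifest is just an ordered list of immutable segment IDs, later segments override earlier ones for the same document ID, and a segment is physically removed only when no manifest references it.

```python
from collections import Counter


class SegmentStore:
    """Toy model (hypothetical): namespaces are manifests over shared immutable segments."""

    def __init__(self) -> None:
        self.manifests = {}         # namespace -> ordered list of segment IDs
        self.segments = {}          # segment ID -> list of documents (never mutated)
        self.refcounts = Counter()  # segment ID -> number of manifests referencing it
        self._next_id = 0

    def create_namespace(self, name: str) -> None:
        self.manifests[name] = []

    def write(self, namespace: str, docs: list) -> None:
        """Append a new segment to one namespace; other manifests are untouched."""
        seg_id = f"seg-{self._next_id}"
        self._next_id += 1
        self.segments[seg_id] = list(docs)
        self.manifests[namespace] = self.manifests[namespace] + [seg_id]
        self.refcounts[seg_id] += 1

    def branch(self, source: str, target: str) -> None:
        """Create a branch by copying the manifest, not the segment data."""
        self.manifests[target] = list(self.manifests[source])
        for seg_id in self.manifests[target]:
            self.refcounts[seg_id] += 1

    def delete_namespace(self, name: str) -> None:
        """Drop a manifest and remove only segments that nothing else references."""
        for seg_id in self.manifests.pop(name):
            self.refcounts[seg_id] -= 1
            if self.refcounts[seg_id] == 0:
                del self.refcounts[seg_id]
                del self.segments[seg_id]

    def read(self, namespace: str) -> dict:
        """Materialize the namespace view; newer segments win on duplicate IDs."""
        docs = {}
        for seg_id in self.manifests[namespace]:
            for doc in self.segments[seg_id]:
                docs[doc["id"]] = doc
        return docs
```

Branching costs time proportional to the manifest length rather than the data size. Deleting a branch decrements counts and frees only the segments that the branch alone referenced, so the source keeps every segment it still needs.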
Everything a developer needs to adopt the project. Deliverables: full-featured CLI (start local dev stack, create namespaces, load JSON/JSONL/CSV, run queries, export, warm cache, benchmark); idiomatic Python SDK with typed models, retries, batch helpers, and docs; TypeScript SDK with equivalent coverage and docs; Docker Compose stack (API server, indexer/worker, MinIO, example app); Kubernetes manifests plus a basic Helm chart; end-to-end test suites driving the server through both SDKs; CI workflow definitions for build, lint, and test.
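To make the "retries" and "batch helpers" deliverables concrete, here is one possible shape for such utilities in plain Python. The names `chunked` and `with_retries`, the choice of `ConnectionError` as the retryable error, and the exponential backoff schedule are illustrative assumptions, not the SDK's actual interface.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[list]:
    """Split any iterable of documents into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def with_retries(call: Callable[[], R], attempts: int = 3,
                 base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> R:
    """Run `call`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A client could combine the two by wrapping each batch upload in `with_retries`. Pairing retries with a per-batch idempotency key on the server side would keep a retried upload from being applied twice.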
Proof-of-value and complete docs. Deliverables: three runnable demos with small bundled datasets and pluggable embedding-provider scripts: (1) semantic document search/RAG demo, (2) hybrid text+embedding search demo, (3) code-search demo using namespace branching to model branches/workspaces; benchmark suite measuring ingest throughput, cold vs warm query latency, recall@k for ANN vs exact, BM25 latency, hybrid latency, and cache hit rates, with a methodology for writing up results honestly and without unverified parity claims; full documentation set: README quickstart, architecture guide, API reference, Python SDK docs, TypeScript SDK docs, deployment guide (including encryption-at-rest via provider SSE), benchmark guide, contributor guide, roadmap, and non-goals section.
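The recall@k measurement mentioned for the benchmark suite compares an approximate result list against the exact top-k. A minimal Python sketch follows, assuming cosine similarity and brute-force search as the ground truth. The function names are hypothetical, and a real benchmark would average the metric over many queries.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_knn(query: list, vectors: dict, k: int) -> list:
    """Brute-force baseline: score every vector and keep the k most similar IDs."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: (-cosine_similarity(query, vectors[doc_id]), doc_id),
    )
    return ranked[:k]


def recall_at_k(approx_ids: list, exact_ids: list, k: int) -> float:
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0
    return len(truth & set(approx_ids[:k])) / len(truth)


if __name__ == "__main__":
    vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
    truth = exact_knn([1.0, 0.0], vectors, k=2)
    # Pretend an ANN index returned "a" and "c" for the same query.
    print(truth, recall_at_k(["a", "c"], truth, k=2))
```

Reporting recall next to latency for each index setting is what lets a reader judge whether a fast ANN configuration is also an accurate one.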