We benchmarked four vector databases on the same workload. Each has a place. Here's how we'd pick today.
Choosing a vector database is the kind of decision teams agonize over and then live with for years. We've used four in different production contexts: Pinecone (managed), Weaviate (self-hosted), Chroma (embedded), and pgvector (extension on Postgres). All four can serve the basic vector search use case. They differ on operational shape, cost, and ecosystem fit. This post is the comparison from someone who's run each.
Same dataset (~280k documents, embedded with text-embedding-3-large at 3072 dimensions), same query workload (~5k queries/day), same evaluation set (200 hand-labeled queries with expected results).
Metrics:
| Database | Recall@10 | p50 latency | p95 latency | Cost/month | Ops time |
|---|---|---|---|---|---|
| Pinecone (s1.x1) | 96% | 35ms | 80ms | $720 | <1 hr |
| Weaviate (self-hosted) | 95% | 22ms | 60ms | $290 (compute) | ~6 hr |
| pgvector | 93% | 45ms | 110ms | $80 (existing PG) | ~2 hr |
| Chroma (embedded) | 94% | 18ms | 40ms | ~$0 (in-app) | ~1 hr |
Recall is within margin of error across all four — the choice isn't really about quality at this scale. It's about ops profile and cost.
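For reference, a minimal sketch of how a Recall@10 figure like the ones in the table can be computed from a hand-labeled evaluation set. It assumes each query maps to a list of expected document IDs and a ranked list of retrieved IDs; the function names and data shapes are illustrative assumptions, not a description of any particular harness.

```python
from typing import Dict, Sequence


def recall_at_k(retrieved: Sequence[str], expected: Sequence[str], k: int = 10) -> float:
    """Fraction of the expected documents that appear in the top-k results."""
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("each labeled query needs at least one expected document")
    top_k = set(retrieved[:k])
    return len(expected_set & top_k) / len(expected_set)


def mean_recall_at_k(
    results: Dict[str, Sequence[str]],
    labels: Dict[str, Sequence[str]],
    k: int = 10,
) -> float:
    """Average recall@k over every labeled query.

    `results` maps query text to ranked document IDs returned by the database;
    `labels` maps the same query text to the hand-labeled expected IDs.
    A query with no results counts as zero recall.
    """
    if not labels:
        raise ValueError("no labeled queries")
    scores = [recall_at_k(results.get(q, []), expected, k) for q, expected in labels.items()]
    return sum(scores) / len(scores)
```

Running the same `labels` against each database's `results` keeps the comparison apples to apples: only the retrieval backend changes.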
What it gets right:
where clauses works without manual sharding.
What it gets wrong (or where it costs):
When we'd pick Pinecone: a small team without ops capacity, rapid prototyping, when "just make it work" matters more than cost. We have it in production for one customer-facing app where reliability mattered more than the bill.
What it gets right:
What it gets wrong:
When we'd pick Weaviate: when hybrid search matters out of the box, when self-hosting is acceptable, when the volume is too high for Pinecone economics. We run it in production for our internal knowledge search, where compliance preferred self-hosting.
What it gets right:
What it gets wrong:
When we'd pick pgvector: most cases, honestly. If you have Postgres, the operational simplicity wins. We use it for our largest deployment (the customer-support knowledge base).
A specific gotcha: HNSW index build time on 280k vectors took ~12 minutes. Not bad, but on a fresh deployment you wait. Plan accordingly.
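A sketch of what planning for the build can look like, generated here as SQL strings from Python. Several things are assumptions: the table name `documents`, the column name `embedding`, cosine distance, and the memory and worker settings. As far as we know, pgvector's HNSW index on the plain `vector` type is limited to 2,000 dimensions, so this sketch indexes a `halfvec` cast of the 3072-dimension column (available in pgvector 0.7.0 and later); check the limits for your installed version.

```python
from typing import List


def hnsw_build_statements(
    table: str = "documents",      # hypothetical table name
    column: str = "embedding",     # hypothetical vector column
    dims: int = 3072,
    m: int = 16,                   # pgvector's default HNSW m
    ef_construction: int = 64,     # pgvector's default ef_construction
    maintenance_work_mem: str = "2GB",  # assumed value; size it to your instance
    parallel_workers: int = 4,          # assumed value
) -> List[str]:
    """SQL to build an HNSW index without blocking writes.

    Identifiers are interpolated directly, so pass trusted constants only.
    CREATE INDEX CONCURRENTLY must run outside a transaction block
    (autocommit mode in most drivers).
    """
    index_name = f"{table}_{column}_hnsw_idx"
    return [
        f"SET maintenance_work_mem = '{maintenance_work_mem}'",
        f"SET max_parallel_maintenance_workers = {parallel_workers}",
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} USING hnsw (({column}::halfvec({dims})) halfvec_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction})",
    ]


def search_statement(table: str = "documents", column: str = "embedding", dims: int = 3072) -> str:
    """Query that uses the same cast expression, so the planner can pick the index."""
    return (
        f"SELECT id FROM {table} "
        f"ORDER BY {column}::halfvec({dims}) <=> %s::halfvec({dims}) LIMIT 10"
    )


# Run from a second session to watch the build instead of guessing.
BUILD_PROGRESS_SQL = (
    "SELECT phase, round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct "
    "FROM pg_stat_progress_create_index"
)
```

Building concurrently trades a somewhat longer build for a table that stays writable, which matters more on a live system than on a fresh deployment.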
What it gets right:
What it gets wrong:
When we'd pick Chroma: prototypes, internal tools, tasks where the vectors live with the code. We use it for some internal experimentation tools, but nothing customer-facing in production.
Across our actual production deployments:
A simple decision tree:
People obsess over these; they barely matter at our scale:
What actually matters for the decision:
Don't pick the trendiest option. Pick the one that fits your stack. A team that already runs Postgres should default to pgvector. A team without Postgres might pick differently.
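The "fits your stack" rule can be written down as a small function. This is a sketch assembled from the "When we'd pick" notes above; the field names and the order of the checks are assumptions, and a real decision would weigh more than four booleans.

```python
from dataclasses import dataclass


@dataclass
class TeamContext:
    """Hypothetical summary of a team's situation; all field names are illustrative."""
    runs_postgres: bool = False
    can_self_host: bool = False
    needs_hybrid_search: bool = False
    prototype_or_internal_tool: bool = False


def pick_vector_db(ctx: TeamContext) -> str:
    # Prototypes and internal tools: keep the vectors next to the code.
    if ctx.prototype_or_internal_tool:
        return "Chroma"
    # Already running Postgres: operational simplicity wins.
    if ctx.runs_postgres:
        return "pgvector"
    # Hybrid search out of the box, and self-hosting is acceptable.
    if ctx.can_self_host and ctx.needs_hybrid_search:
        return "Weaviate"
    # No ops capacity to spare, "just make it work".
    return "Pinecone"
```

For example, `pick_vector_db(TeamContext(runs_postgres=True))` returns `"pgvector"`, matching the default recommended above.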
Benchmark with your actual data. A sample of 10k documents and 100 queries gives you a real comparison in an afternoon. Synthetic benchmarks are often misleading.
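A minimal sketch of such an afternoon benchmark, assuming each database is wrapped in a `search(query)` callable that you supply. The helper names are illustrative; the timing covers the whole call, so it includes client and network overhead as well as the index lookup.

```python
import random
import statistics
import time
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def sample_items(items: Sequence[T], n: int, seed: int = 0) -> List[T]:
    """Reproducible random sample, e.g. 10k documents or 100 queries."""
    rng = random.Random(seed)
    return rng.sample(list(items), min(n, len(items)))


def latency_percentiles(search: Callable[[str], object], queries: Sequence[str]) -> Dict[str, float]:
    """Time each query once and report p50 and p95 latency in milliseconds."""
    if len(queries) < 2:
        raise ValueError("need at least two queries to compute percentiles")
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        timings_ms.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(timings_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Using the same seed for every database means each one sees the same documents and the same queries, which is the point of the exercise.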
Watch the operational cost, not just the bill. $290/month for self-hosted Weaviate is cheaper than $720/month for Pinecone — until you spend 8 hours/month maintaining it. Then they're comparable in total cost.
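The arithmetic behind that comparison, as a sketch. The hourly rate is an assumption you should replace with your own loaded engineering cost, and Pinecone's "<1 hr" of ops time from the table is rounded up to 1 hour here.

```python
def total_monthly_cost(bill_usd: float, ops_hours: float, hourly_rate_usd: float) -> float:
    """Infrastructure bill plus the cost of the time spent operating it."""
    return bill_usd + ops_hours * hourly_rate_usd


def break_even_rate(bill_a: float, hours_a: float, bill_b: float, hours_b: float) -> float:
    """Hourly rate at which options A and B cost the same per month."""
    if hours_a == hours_b:
        raise ValueError("equal ops hours: the cheaper bill always wins")
    return (bill_b - bill_a) / (hours_a - hours_b)


# Self-hosted Weaviate at 8 hours/month vs Pinecone at 1 hour/month (assumed upper bound).
rate = break_even_rate(bill_a=290, hours_a=8, bill_b=720, hours_b=1)
print(f"break-even engineering rate: ${rate:.2f}/hour")
```

With these inputs the break-even lands at roughly $61/hour: below that rate the self-hosted option is cheaper in total, above it the managed bill is the better deal.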
Avoid premature scale anxiety. "But what about when we have 100M vectors?" Most teams never get there. Optimize for the next 6-12 months of scale, and plan to revisit if you cross thresholds.
Have an export plan. Even managed services should expose their data in a portable format. We snapshot embeddings to S3 monthly so a migration is feasible if needed.
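A minimal sketch of a portable snapshot format, assuming gzip-compressed JSON Lines with one record per document. The record fields (`id`, `embedding`, `metadata`) are illustrative assumptions, and uploading the resulting file to S3 is left out; any SDK or CLI upload step works on the file this produces.

```python
import gzip
import json
from typing import Any, Dict, Iterable, Iterator


def export_snapshot(records: Iterable[Dict[str, Any]], path: str) -> int:
    """Write records as gzip-compressed JSON Lines; returns the number written."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_snapshot(path: str) -> Iterator[Dict[str, Any]]:
    """Stream records back, one per line, without loading the whole file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

JSON Lines is bulkier than a binary format, but every database in this comparison can be loaded from it with a short script, which is what matters for a migration.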
The vector DB choice is rarely the bottleneck on AI quality. Retrieval logic, chunking, re-ranking, and prompt design matter more. Pick the DB that fits operationally, then move on to the work that actually moves the metric.