We migrated a production RAG workload (~280k documents, 1536-dim embeddings from text-embedding-3-small, ~50 queries/sec peak) across three vector stores over six months: Pinecone, pgvector, and Qdrant. Each ran for a full quarter under real traffic. Here are the numbers and the calls we'd make differently.

Pinecone is the path of least resistance: set up an index in 10 minutes, ship code, watch it work.
We were already running Postgres. Adding pgvector seemed elegant — one less moving part.
```sql
CREATE EXTENSION vector;

CREATE TABLE doc_embeddings (
    id         BIGSERIAL PRIMARY KEY,
    doc_id     TEXT NOT NULL,
    tenant_id  TEXT NOT NULL,
    embedding  vector(1536) NOT NULL,
    metadata   JSONB,
    created_at TIMESTAMPTZ DEFAULT now()
);

-- HNSW index — production-grade, builds in ~hours for 280k rows
CREATE INDEX ON doc_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX ON doc_embeddings (tenant_id);
```
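To insert rows, pgvector accepts the embedding as a text literal of the form `[x1,x2,...]`. A minimal sketch of the formatting (the helper name is ours, not from the original setup; in practice the `pgvector` Python package can register driver adapters so you pass lists directly):

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format an embedding as a pgvector text literal, e.g. '[0.1,-0.2,0.3]'."""
    return "[" + ",".join(repr(x) for x in embedding) + "]"

# Usable as the %s parameter of an INSERT such as:
#   INSERT INTO doc_embeddings (doc_id, tenant_id, embedding)
#   VALUES (%s, %s, %s::vector)
print(to_vector_literal([0.25, -0.5, 1.0]))  # -> [0.25,-0.5,1.0]
```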
```sql
SELECT doc_id, embedding <=> $1::vector AS distance
FROM doc_embeddings
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 20;
```
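The `<=>` operator returns cosine distance, i.e. 1 minus cosine similarity, so vectors pointing the same way score 0 and orthogonal vectors score 1. A quick pure-Python sanity check of the metric the query orders by:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> computes it: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
```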
`ef_search` needs to be balanced per workload: too low loses recall, too high blows latency. We tuned it at the connection level:

```sql
SET hnsw.ef_search = 80; -- balances recall and latency
```
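We picked the `ef_search` value empirically: for a sample of queries, compare the approximate top-k against an exact scan (e.g. the same query with the index disabled) and measure the overlap. A minimal sketch of the recall computation (function and variable names are ours, for illustration):

```python
def recall_at_k(exact_ids: list[str], approx_ids: list[str], k: int = 20) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# One approximate result missing from the exact top-5 -> 80% recall
print(recall_at_k(["a", "b", "c", "d", "e"],
                  ["a", "b", "c", "d", "x"], k=5))  # -> 0.8
```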
We tested Qdrant self-hosted on EKS to see if a purpose-built engine paid off vs the convenience of pgvector.
| Metric | Pinecone | pgvector | Qdrant |
|---|---|---|---|
| p50 retrieval | 12ms | 28ms | 8ms |
| p95 retrieval | 28ms | 65ms | 18ms |
| p99 retrieval | 45ms | 110ms | 32ms |
| Recall @ 20 | 96% | 94% | 96% |
| Filtered query (3 fields) p95 | 140ms | 70ms | 22ms |
| Monthly cost | $280 | $45 | $120 |
| Backup story | Native | pg_dump | Snapshots |
| Setup time | 1 hour | 4 hours | 2 days |
| Operational incidents (90 days) | 0 | 0 | 2 |
Our call: pgvector for production. It was the cheapest option by a wide margin, logged zero operational incidents in 90 days, and kept us at one less moving part since we already run Postgres. The higher retrieval latencies were acceptable for our workload.

We'd reach for Qdrant if filtered queries dominated: its 22ms filtered p95 beat pgvector and Pinecone by roughly 3x and 6x, though self-hosting cost us two days of setup and two incidents in a quarter.

We'd stay on Pinecone if setup time and a managed backup story outweighed the roughly 6x cost difference over pgvector.
The right vector store depends on your numbers. Run all three under your real traffic for at least a sprint each before committing.