We migrated a production RAG workload (~280k documents, 1536-dim embeddings from text-embedding-3-small, ~50 queries/sec peak) across three vector stores over six months: Pinecone, pgvector, and Qdrant. Each ran for a full quarter under real traffic. Here are the numbers and the calls we'd make differently.

Pinecone is the path of least resistance: set up an index in 10 minutes, ship code, watch it work.
We were already running Postgres. Adding pgvector seemed elegant — one less moving part.
```sql
CREATE EXTENSION vector;

CREATE TABLE doc_embeddings (
    id         BIGSERIAL PRIMARY KEY,
    doc_id     TEXT NOT NULL,
    tenant_id  TEXT NOT NULL,
    embedding  vector(1536) NOT NULL,
    metadata   JSONB,
    created_at TIMESTAMPTZ DEFAULT now()
);

-- HNSW index — production-grade, builds in ~hours for 280k rows
CREATE INDEX ON doc_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX ON doc_embeddings (tenant_id);
```
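To insert rows, pgvector accepts the embedding as a text literal of the form `[x1,x2,...]`. A minimal sketch of the formatting (the helper name is ours, not from the original setup; in practice the `pgvector` Python package can register driver adapters so you pass lists directly):

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format an embedding as a pgvector text literal, e.g. '[0.1,-0.2,0.3]'."""
    return "[" + ",".join(repr(x) for x in embedding) + "]"

# Usable as the %s parameter of an INSERT such as:
#   INSERT INTO doc_embeddings (doc_id, tenant_id, embedding)
#   VALUES (%s, %s, %s::vector)
print(to_vector_literal([0.25, -0.5, 1.0]))  # -> [0.25,-0.5,1.0]
```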
```sql
SELECT doc_id, embedding <=> $1::vector AS distance
FROM doc_embeddings
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 20;
```
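The `<=>` operator returns cosine distance, i.e. 1 minus cosine similarity, so vectors pointing the same way score 0 and orthogonal vectors score 1. A quick pure-Python sanity check of the metric the query orders by:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> computes it: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
```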
`ef_search` needs to be balanced per workload: too low loses recall, too high blows latency. We tuned it at the connection level:

```sql
SET hnsw.ef_search = 80; -- balances recall and latency
```
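We picked the `ef_search` value empirically: for a sample of queries, compare the approximate top-k against an exact scan (e.g. the same query with the index disabled) and measure the overlap. A minimal sketch of the recall computation (function and variable names are ours, for illustration):

```python
def recall_at_k(exact_ids: list[str], approx_ids: list[str], k: int = 20) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# One approximate result missing from the exact top-5 -> 80% recall
print(recall_at_k(["a", "b", "c", "d", "e"],
                  ["a", "b", "c", "d", "x"], k=5))  # -> 0.8
```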
We tested Qdrant self-hosted on EKS to see if a purpose-built engine paid off vs the convenience of pgvector.
| Metric | Pinecone | pgvector | Qdrant |
|---|---|---|---|
| p50 retrieval | 12ms | 28ms | 8ms |
| p95 retrieval | 28ms | 65ms | 18ms |
| p99 retrieval | 45ms | 110ms | 32ms |
| Recall @ 20 | 96% | 94% | 96% |
| Filtered query (3 fields) p95 | 140ms | 70ms | 22ms |
| Monthly cost | $280 | $45 | $120 |
| Backup story | Native | pg_dump | Snapshots |
| Setup time | 1 hour | 4 hours | 2 days |
| Operational incidents (90 days) | 0 | 0 | 2 |
Our call: pgvector for production. It was the cheapest option by a wide margin, logged zero operational incidents in 90 days, and kept us at one less moving part since we already run Postgres. The higher retrieval latencies were acceptable for our workload.

We'd reach for Qdrant if filtered queries dominated: its 22ms filtered p95 beat pgvector and Pinecone by roughly 3x and 6x, though self-hosting cost us two days of setup and two incidents in a quarter.

We'd stay on Pinecone if setup time and a managed backup story outweighed the roughly 6x cost difference over pgvector.
The right vector store depends on your numbers. Run all three under your real traffic for at least a sprint each before committing.