We benchmarked six embedding models on the same retrieval task. Here are the results that surprised us and how we'd pick today.

Embedding Models: A Practical Comparison

Embeddings are the foundation of most retrieval-based AI features (RAG, semantic search, recommendations). The model choice matters more than people often say: we measured a 7 percentage point Recall@5 difference between commonly recommended models on the same task. This post walks through our actual benchmark results and the criteria that ended up mattering.

What we benchmarked #

Six models, all trained for retrieval / semantic similarity:

text-embedding-3-small (OpenAI, 1536d)
text-embedding-3-large (OpenAI, 3072d)
text-embedding-ada-002 (OpenAI, legacy, 1536d)
voyage-2 (Voyage AI, 1024d)
bge-large-en-v1.5 (BAAI, 1024d, self-hostable)
bge-m3 (BAAI, 1024d, multilingual + multi-functionality, self-hostable)

Same dataset (~280k documents from our customer support knowledge base), same eval set (200 hand-labeled queries with expected matching documents), same retrieval pipeline (cosine similarity, top-10 results).

Metrics: Recall@10, Recall@5, p50 query latency.
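For reference, the retrieval and scoring step can be sketched in a few lines. This is a minimal illustration, not our actual harness: the function names and toy vectors are made up, and it assumes each query is scored as the share of its expected documents found in the top k, averaged over queries. Other definitions of Recall@k exist.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(queries, doc_vecs, k):
    """Average, over queries, of the share of expected docs found in the top k.

    `queries` is a list of (query_vector, set_of_expected_doc_ids) pairs.
    """
    scores = []
    for query_vec, expected in queries:
        retrieved = set(top_k(query_vec, doc_vecs, k))
        scores.append(len(retrieved & expected) / len(expected))
    return sum(scores) / len(scores)


# Toy data: three 2-d "embeddings" and two labeled queries.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
labeled = [([0.9, 0.1], {"a"}), ([0.1, 0.9], {"a", "b"})]
print(recall_at_k(labeled, docs, k=1))
```

On the toy data the first query finds its one expected document and the second finds one of two, so Recall@1 averages to 0.75. With a real corpus you would swap the linear scan for your vector index.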

Results #

Model	Recall@10	Recall@5	Latency p50	Cost/1M tok	Notes
ada-002 (OpenAI legacy)	79%	68%	60ms	$0.10	Older, kept for comparison
3-small (OpenAI)	87%	78%	55ms	$0.02	Big jump from ada
3-large (OpenAI)	92%	85%	70ms	$0.13	Best of the OpenAI line
voyage-2 (Voyage AI)	91%	84%	120ms	$0.10	Comparable quality
bge-large (self-hosted)	89%	81%	18ms	~$0.04	Self-hosted infra cost
bge-m3 (self-hosted)	88%	80%	22ms	~$0.04	Multilingual support

A few observations:

ada-002 is no longer the right choice. It's been superseded; teams still using it should switch.
3-large is the highest quality of the lot for English-only documents, by a small margin.
bge-large self-hosted is competitive at roughly 30% of the cost of 3-large and with much lower latency.
voyage-2 is high quality but slower (their hosted API has higher RTT than OpenAI's).
bge-m3 is the right answer if you have multilingual content. Its multilingual training doesn't show in our (English-only) eval, where it scores a point below bge-large, but it matters when content is mixed.

What surprised us #

A few results we didn't expect:

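The gap between 3-small and 3-large was significant. We'd assumed the difference was marginal. 3-large came out 7 percentage points ahead on Recall@5 (85% vs 78%) and 5 points ahead on Recall@10 (92% vs 87%), which translates to noticeably better RAG answers for queries where the right document sits near the top-k cutoff.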

Self-hosted bge models are surprisingly fast. ~18ms for inference on a single GPU instance vs 55-120ms for hosted-API calls (most of which is network round-trip time). For a high-throughput pipeline, this is meaningful.

Voyage-2's quality is real, despite less marketing. It lands within a point of 3-large on both recall metrics at a lower cost per token ($0.10 vs $0.13 per 1M). The RTT is higher (their service is in fewer regions); for batch use, this doesn't matter.

ada-002 is actually quite weak now. 3-small, voyage-2, and 3-large beat it by 8 to 13 points of Recall@10 on our eval. Anyone still on ada-002 is leaving meaningful quality on the table.

How dimensionality plays in #

text-embedding-3-small is 1536d, 3-large is 3072d. Higher dimensionality often means better representation but more storage and slower search.

For 280k vectors:

Dimensions	Index size	Build time	Query latency
1024d (bge-large)	1.2 GB	8 min	12ms
1536d (3-small, ada-002)	1.7 GB	12 min	18ms
3072d (3-large)	3.4 GB	20 min	28ms
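The index sizes above are close to what raw vector storage predicts. A quick back-of-envelope check, assuming float32 values (4 bytes each) and ignoring any index overhead; both are assumptions, and the function name is illustrative:

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for dense vectors: count x dimensions x bytes per value."""
    return num_vectors * dims * bytes_per_value


NUM_VECTORS = 280_000  # corpus size from the benchmark

for dims in (1024, 1536, 3072):
    gb = raw_vector_bytes(NUM_VECTORS, dims) / 1e9  # decimal gigabytes
    print(f"{dims}d: {gb:.2f} GB")
```

This prints about 1.15, 1.72, and 3.44 GB, so storage scales linearly with dimensionality and the index itself adds little on top. Halving the dimensions, or storing a smaller numeric type, cuts storage by the same factor.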

The cost of using 3-large's full dimensionality is real. OpenAI offers a dimensions parameter that lets you truncate to a smaller size:

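```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request 1024-dimensional vectors instead of the full 3072
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="...",
    dimensions=1024,
)
```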

Quality drops slightly (~1-2pp on our eval) but storage and search costs drop proportionally. For most uses, truncating 3-large to 1024d is the right balance.
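If you already store full-size vectors, a similar effect can be approximated client-side by keeping the leading dimensions and re-normalizing. A minimal sketch, assuming a model trained so that leading dimensions carry the most information (OpenAI documents this for the text-embedding-3 family; it does not hold for arbitrary models). The helper name `truncate_embedding` and the toy vector are hypothetical.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all zeros: nothing to rescale
    return [x / norm for x in head]


full = [0.6, 0.0, 0.8, 0.0]  # toy unit-length vector
short = truncate_embedding(full, 2)
print(short)
```

The re-normalization matters: without it, truncated vectors no longer have unit length, and dot-product scores stop matching cosine similarity.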

Use-case fits #

Customer-support RAG over English docs: text-embedding-3-large or voyage-2. The recall difference matters for answer quality.

Internal semantic search over English: text-embedding-3-small or bge-large. Quality is good enough; cost is lower.

Multilingual RAG: bge-m3 is purpose-built for this; it outperforms English-only models on non-English content.

Embedding millions of documents: self-hosted bge-large. Cost-per-token of hosted APIs adds up at scale; self-hosted is cheaper if you have GPU infrastructure.

Real-time embedding (e.g., user-typed queries embedded as they type): self-hosted models for latency, since you want sub-30ms.

Recommendation engine over user behavior: typically a custom-trained model, not these off-the-shelf ones. The off-the-shelf retrieval models are tuned for text similarity, not user-item affinity.

What we ended up using #

Across our actual production deployments:

Customer-support RAG: text-embedding-3-large (full 3072d). Best quality wins for customer-facing.
Internal documentation search: text-embedding-3-small. Cheaper, quality is fine.
Multilingual support content: bge-m3 self-hosted. Required for non-English.
Real-time autocomplete-style search: self-hosted bge-large. Latency wins.

We standardized on these per-use-case rather than picking one model for everything.

Migrating between models #

A common gotcha: switching embedding models requires re-embedding all your content. The vectors from one model can't be compared with vectors from another.

For our 280k-document RAG, switching from ada-002 to 3-large:

Time to re-embed: ~6 hours (rate-limited by OpenAI's API)
Cost: ~$45 in API calls
Storage: at least doubled (old and new indexes must coexist during the transition, and 3072d vectors are twice the size of 1536d ones)

Plan for this. If you're starting and might switch later, design the embedding pipeline so it's not painful to re-run.
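One way to keep re-embedding cheap to repeat is to key stored vectors by model name, so old and new vectors coexist during the transition and an interrupted run can resume where it stopped. A sketch under assumptions: `VectorStore`, `reembed`, and `fake_embed` are hypothetical stand-ins for your vector database and embedding client, not any particular library's API.

```python
from typing import Dict, List, Tuple

Vector = List[float]


class VectorStore:
    """In-memory stand-in for a vector database, keyed by (model, doc_id)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], Vector] = {}

    def has(self, model: str, doc_id: str) -> bool:
        return (model, doc_id) in self._rows

    def put(self, model: str, doc_id: str, vec: Vector) -> None:
        self._rows[(model, doc_id)] = vec

    def count(self, model: str) -> int:
        return sum(1 for m, _ in self._rows if m == model)


def reembed(docs, model, embed_fn, store, batch_size=100):
    """Embed every doc not yet stored for `model`; safe to re-run after a failure.

    Returns the number of documents embedded in this run.
    """
    pending = [(d, t) for d, t in docs.items() if not store.has(model, d)]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed_fn(model, [text for _, text in batch])
        for (doc_id, _), vec in zip(batch, vectors):
            store.put(model, doc_id, vec)
    return len(pending)


def fake_embed(model: str, texts: List[str]) -> List[Vector]:
    """Fake embedder so the sketch runs without network access."""
    return [[float(len(t)), 1.0] for t in texts]


store = VectorStore()
docs = {"kb-1": "reset your password", "kb-2": "update billing details"}
reembed(docs, "old-model", fake_embed, store)
first_run = reembed(docs, "new-model", fake_embed, store)
second_run = reembed(docs, "new-model", fake_embed, store)  # nothing left to do
print(first_run, second_run, store.count("old-model"), store.count("new-model"))
```

Because every row carries its model name, queries can be routed to the old vectors until the new set is complete, and the old rows can then be dropped in one step.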

Self-hosted: when it's worth it #

Self-hosting an embedding model adds operational overhead:

A GPU instance running 24/7 (~$0.30-0.80/hr depending on instance type)
Latency monitoring, scaling, etc.
Model file management

It pays off when:

Your token volume is high (millions per day) — API costs exceed self-hosted infrastructure costs.
Latency matters — self-hosted is much faster than hosted APIs.
You need multilingual support and the open-source models are better.
Compliance requires keeping data on-prem.

For most teams, hosted APIs are fine. For high-volume teams or those with specific requirements, self-hosting a model like bge-large is straightforward and pays back.
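The volume threshold is easy to estimate. A rough sketch using the GPU price range and the per-token API prices quoted in this post. It assumes one always-on instance can keep up with the load and ignores engineering time, so treat the numbers as a lower bound on the real break-even point.

```python
def breakeven_tokens_per_day(gpu_usd_per_hour, api_usd_per_million_tokens):
    """Daily token volume at which a 24/7 GPU costs the same as the API."""
    gpu_usd_per_day = gpu_usd_per_hour * 24
    return gpu_usd_per_day / api_usd_per_million_tokens * 1_000_000


for gpu_rate in (0.30, 0.80):
    for name, api_rate in (("3-small", 0.02), ("3-large", 0.13)):
        tokens = breakeven_tokens_per_day(gpu_rate, api_rate)
        print(f"GPU ${gpu_rate:.2f}/hr vs {name}: {tokens / 1e6:.0f}M tokens/day")
```

Under these assumptions, a $0.30/hr instance breaks even against 3-large pricing at roughly 55M tokens per day and against 3-small at roughly 360M. On cost alone the case for self-hosting starts in the tens of millions of tokens per day; latency or compliance requirements can justify it at lower volume.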

Common mistakes #

Not checking document length against the model's context window. Some models have small context windows (e.g., 512 tokens). Documents exceeding that get truncated, so the tail of the document never reaches the embedding and recall suffers. Check your chunking strategy against your model's context window.
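A check like this is cheap to run before indexing. A minimal sketch: `oversized_chunks` is a hypothetical helper, and the whitespace tokenizer is only a stand-in so the example runs on its own. For a real check, pass in the model's own tokenizer.

```python
def oversized_chunks(chunks, count_tokens, max_tokens):
    """Return (index, token_count) for every chunk the model would truncate."""
    report = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            report.append((i, n))
    return report


def whitespace_tokens(text):
    """Crude stand-in: real tokenizers usually yield more tokens than words."""
    return len(text.split())


chunks = ["short chunk", "word " * 600]
print(oversized_chunks(chunks, whitespace_tokens, max_tokens=512))
```

Here the second chunk is flagged at 600 tokens against a 512-token limit. Any chunk that shows up in the report should be split further before it is embedded.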

Pooling differently than the model intended. Some models expect mean-pooling; others use [CLS] token; others have a built-in pooling layer. Using the wrong method silently degrades quality.
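To see why this matters, here are the two most common strategies applied to the same per-token outputs. A toy sketch with made-up numbers; which strategy a given model expects is an assumption you should confirm on its model card.

```python
def cls_pool(token_vectors):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dims = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Toy per-token outputs for one sequence: [CLS], two real tokens, one pad.
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pool(tokens))         # [1.0, 0.0]
print(mean_pool(tokens, mask))  # [1.0, 2.0]
```

Both calls run without error and return a vector of the right shape, which is why the wrong choice goes unnoticed until recall is measured. Forgetting the attention mask in mean pooling is a related failure: the padding vector would be averaged in.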

Mixing embeddings from different models. Documents indexed with ada-002, queries embedded with 3-small. The vectors are in different spaces; results are nonsense.

Not re-evaluating when the model changes. Hosted models get new versions and open-source checkpoints get re-released. An evaluation run against an older version might no longer reflect current quality.

Using cosine similarity when the model recommends dot product. Some models are normalized at output, making cosine and dot product equivalent. Others aren't. Use what the model card recommends.
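A small example makes the difference concrete. With unnormalized vectors, dot product rewards vector length, so the two metrics can rank documents differently. Once vectors are normalized to unit length, they agree. The vectors and names below are toy values chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def cosine(a, b):
    return dot(normalize(a), normalize(b))


query = [1.0, 0.0]
docs = {"long": [3.0, 3.0], "short": [0.9, 0.1]}

# Raw dot product favors the longer vector; cosine favors the closer direction.
by_dot = max(docs, key=lambda d: dot(query, docs[d]))
by_cosine = max(docs, key=lambda d: cosine(query, docs[d]))

# After normalizing the stored vectors, dot product gives the cosine ranking.
unit_docs = {name: normalize(vec) for name, vec in docs.items()}
by_dot_normalized = max(unit_docs, key=lambda d: dot(query, unit_docs[d]))

print(by_dot, by_cosine, by_dot_normalized)
```

Raw dot product picks "long" while cosine and normalized dot product both pick "short". If the model card calls for dot product on unnormalized outputs, the vector length is carrying signal that cosine would throw away.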

What I'd tell a team starting #

Default to text-embedding-3-small. Cheap, quality is good enough for most uses, well-supported.

Upgrade to 3-large if quality matters. The 7pp Recall@5 jump over 3-small (5pp at Recall@10) translates to noticeably better downstream answers.

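Self-host bge-large if you have the volume to justify it. Recall is slightly above 3-small in our eval, latency is much lower, and at our estimated ~$0.04 per 1M tokens it undercuts 3-large and voyage-2 (though not 3-small).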

Use bge-m3 for multilingual content. Don't try to make English-only models do the multilingual job.

Build an eval set before committing. A simple set of "expected document for query" pairs lets you compare models on YOUR data, not someone else's.

Plan for re-embedding. Storage and time to re-embed are the migration cost between models. Design the pipeline so re-running is feasible.

The embedding model is a foundational choice. Get it close to right and the rest of your retrieval pipeline can compensate for what's left. Get it wrong and you spend the next year fighting the foundation. The benchmarking effort (a day, maybe two) pays off many times over the life of the project.

About Kiril Urbonas