We benchmarked six embedding models on the same retrieval task. The results that surprised us, and how we'd pick today.
Embeddings are the foundation of most retrieval-based AI features (RAG, semantic search, recommendations). The model choice matters more than people often say: on the same task, we measured a 7 percentage point Recall@5 gap (5 points at Recall@10) between text-embedding-3-small and text-embedding-3-large, two commonly recommended models. This post is the comparison from our actual benchmarking, with the criteria that ended up mattering.
Six models, all trained for retrieval / semantic similarity:
- text-embedding-3-small (OpenAI, 1536d)
- text-embedding-3-large (OpenAI, 3072d)
- text-embedding-ada-002 (OpenAI, legacy, 1536d)
- voyage-2 (Voyage AI, 1024d)
- bge-large-en-v1.5 (BAAI, 1024d, self-hostable)
- bge-m3 (BAAI, multi-language + multi-functionality, self-hostable)

Same dataset (~280k documents from our customer support knowledge base), same eval set (200 hand-labeled queries with expected matching documents), same retrieval pipeline (cosine similarity, top-10 results).
Metrics: Recall@10, Recall@5, p50 query latency.
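For readers building their own eval, here is a minimal sketch of how Recall@k can be computed from "expected documents per query" labels. It assumes recall is the fraction of expected documents found in the top k, averaged over queries; the exact definition behind the numbers in this post isn't spelled out, and the function names and document IDs are illustrative.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], expected: Set[str], k: int) -> float:
    """Fraction of the expected documents that appear in the top-k results."""
    if not expected:
        raise ValueError("each eval query needs at least one expected document")
    top_k = set(retrieved[:k])
    return len(top_k & expected) / len(expected)


def mean_recall_at_k(
    results: Dict[str, List[str]],
    labels: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


# Hypothetical two-query eval set; the document IDs are made up.
labels = {"reset password": {"doc-12"}, "refund policy": {"doc-7", "doc-9"}}
results = {
    "reset password": ["doc-3", "doc-12", "doc-40"],
    "refund policy": ["doc-7", "doc-1", "doc-2"],
}
print(mean_recall_at_k(results, labels, k=3))  # 0.75
```

The first query finds its only expected document (1.0) and the second finds one of two (0.5), so the mean is 0.75.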
| Model | Recall@10 | Recall@5 | Latency p50 | Cost/1M tok | Notes |
|---|---|---|---|---|---|
| ada-002 (OpenAI legacy) | 79% | 68% | 60ms | $0.10 | Older, kept for comparison |
| 3-small (OpenAI) | 87% | 78% | 55ms | $0.02 | Big jump from ada |
| 3-large (OpenAI) | 92% | 85% | 70ms | $0.13 | Best of the OpenAI line |
| voyage-2 (Voyage AI) | 91% | 84% | 120ms | $0.10 | Comparable quality |
| bge-large (self-hosted) | 89% | 81% | 18ms | ~$0.04 | Self-hosted infra cost |
| bge-m3 (self-hosted) | 88% | 80% | 22ms | ~$0.04 | Multilingual support |
A few observations:
- ada-002 is no longer the right choice. It's been superseded; teams still using it should switch.
- 3-large is the highest quality of the lot for English-only documents, by a small margin.
- bge-large self-hosted is competitive at ~30% the cost of 3-large and with much lower latency.
- voyage-2 is high quality but slower (their hosted API has higher RTT than OpenAI's).
- bge-m3 is the right answer if you have multilingual content. Its multilingual training buys nothing on our (English-only) eval, where it trails bge-large by a point, but matters when content is mixed.

A few results we didn't expect:
The gap between 3-small and 3-large was significant. We'd assumed the difference was marginal. It was 5 percentage points at Recall@10 (87% vs 92%) and 7 at Recall@5 (78% vs 85%), which translates to noticeably better RAG answers for queries that are close to the boundary.
Self-hosted bge models are surprisingly fast. ~18ms for inference on a single GPU instance vs ~70ms for hosted-API calls (most of which is RTT). For a high-throughput pipeline, this is meaningful.
Voyage-2's quality is real, despite less marketing. Comparable to 3-large at lower cost per token. The RTT is higher (their service is in fewer regions); for batch use, this doesn't matter.
ada-002 is actually quite weak now. The improvements in 3-small/3-large/voyage-2 over ada-002 are large. Anyone still on ada-002 is leaving meaningful quality on the table.
text-embedding-3-small is 1536d, 3-large is 3072d. Higher dimensionality often means better representation but more storage and slower search.
For 280k vectors:
| Dimensions | Index size | Build time | Query latency |
|---|---|---|---|
| 1024d (bge-large) | 1.2 GB | 8 min | 12ms |
| 1536d (3-small, ada-002) | 1.7 GB | 12 min | 18ms |
| 3072d (3-large) | 3.4 GB | 20 min | 28ms |
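The index sizes above are close to what raw float32 storage alone would need. The arithmetic below is a back-of-the-envelope sketch, assuming 4 bytes per value and no index overhead; the actual index format isn't stated here, and `raw_vector_bytes` is an illustrative helper.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Size of the raw vectors alone, assuming float32 (4 bytes per value)."""
    return num_vectors * dims * bytes_per_value


for dims in (1024, 1536, 3072):
    gb = raw_vector_bytes(280_000, dims) / 1e9
    print(f"{dims}d: {gb:.2f} GB")
# 1024d: 1.15 GB
# 1536d: 1.72 GB
# 3072d: 3.44 GB
```

Storage scales linearly with dimensions, so tripling the dimensionality roughly triples the footprint.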
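The cost of using 3-large's full dimensionality is real. OpenAI offers a dimensions parameter that lets you truncate to a smaller size:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.embeddings.create(
        model="text-embedding-3-large",
        input="...",
        dimensions=1024,  # request 1024d output instead of the full 3072d
    )
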
Quality drops slightly (~1-2pp on our eval) but storage and search costs drop proportionally. For most uses, truncating 3-large to 1024d is the right balance.
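If full-size vectors are already stored, a similar effect can be approximated offline by keeping the leading values and rescaling to unit length. This is a sketch under the assumption that the model is trained to tolerate truncation (to our understanding OpenAI documents this for the text-embedding-3 models; most other models are not). Verify on your own eval before relying on it. `truncate_and_normalize` is an illustrative helper, not a library function.

```python
import math
from typing import List


def truncate_and_normalize(vector: List[float], dims: int) -> List[float]:
    """Keep the first `dims` values, then rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


full = [0.6, 0.0, 0.8, 0.0]  # toy 4d unit vector standing in for a 3072d one
short = truncate_and_normalize(full, 2)
print(short)  # [1.0, 0.0]
```

The rescaling step matters because a truncated vector is no longer unit length, which would skew dot-product scores.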
Customer-support RAG over English docs: text-embedding-3-large or voyage-2. The recall difference matters for answer quality.
Internal semantic search over English: text-embedding-3-small or bge-large. Quality is good enough; cost is lower.
Multilingual RAG: bge-m3 is purpose-built for this; it outperforms English-only models on non-English content.
Embedding millions of documents: self-hosted bge-large. Cost-per-token of hosted APIs adds up at scale; self-hosted is cheaper if you have GPU infrastructure.
Real-time embedding (e.g., user-typed queries embedded as they type): self-hosted models for latency, since you want sub-30ms.
Recommendation engine over user behavior: typically a custom-trained model, not these off-the-shelf ones. The off-the-shelf retrieval models are tuned for text similarity, not user-item affinity.
Across our actual production deployments:
- text-embedding-3-large (full 3072d). Best quality wins for customer-facing.
- text-embedding-3-small. Cheaper, quality is fine.
- bge-m3 self-hosted. Required for non-English.
- bge-large. Latency wins.

We standardized on these per-use-case rather than picking one model for everything.
A common gotcha: switching embedding models requires re-embedding all your content. The vectors from one model can't be compared with vectors from another.
For our 280k-document RAG, switching from ada-002 to 3-large:
Plan for this. If you're starting and might switch later, design the embedding pipeline so it's not painful to re-run.
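One way to make re-runs cheap is to key stored vectors by both model name and content hash, so unchanged text is never embedded twice and a model switch naturally triggers a full re-embed. The sketch below is an assumption about how such a pipeline could look, not a description of ours; `embed_corpus`, `fake_embed`, and the in-memory cache are all hypothetical stand-ins.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]
# Cache key is (model name, content hash), so vectors from different
# models never collide and unchanged text is never embedded twice.
Cache = Dict[Tuple[str, str], Vector]


def embed_corpus(
    docs: Dict[str, str],
    model: str,
    embed_fn: Callable[[str, str], Vector],
    cache: Cache,
) -> Dict[str, Vector]:
    """Return doc_id -> vector for `model`, calling embed_fn only on cache misses."""
    index: Dict[str, Vector] = {}
    for doc_id, text in docs.items():
        key = (model, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in cache:
            cache[key] = embed_fn(model, text)
        index[doc_id] = cache[key]
    return index


calls: List[str] = []


def fake_embed(model: str, text: str) -> Vector:
    """Stand-in for a real embedding call; records each invocation."""
    calls.append(model)
    return [float(len(text))]


cache: Cache = {}
docs = {"a": "hello", "b": "world!"}
embed_corpus(docs, "model-v1", fake_embed, cache)
embed_corpus(docs, "model-v1", fake_embed, cache)  # re-run: all cache hits
embed_corpus(docs, "model-v2", fake_embed, cache)  # model switch: re-embeds
print(calls)  # ['model-v1', 'model-v1', 'model-v2', 'model-v2']
```

In a real pipeline the cache would live in a database or object store, and `embed_fn` would batch requests.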
Self-hosting an embedding model adds operational overhead:
It pays off when:
For most teams, hosted APIs are fine. For high-volume teams or those with specific requirements, self-hosting a model like bge-large is straightforward and pays back.
Not validating that the embedding model matches the document length. Some models have small context windows (e.g., 512 tokens). Documents exceeding that get truncated, hurting recall. Check your chunking strategy against your model's context window.
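A quick pre-flight check for the context-window problem might look like the sketch below. The whitespace-based counter is only a rough assumption that usually undercounts; swap in the model's real tokenizer for accurate numbers. All names are illustrative.

```python
from typing import Callable, Dict, List


def oversized_chunks(
    chunks: Dict[str, str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Return IDs of chunks whose token count exceeds the model's window."""
    return [cid for cid, text in chunks.items() if count_tokens(text) > max_tokens]


def rough_token_count(text: str) -> int:
    """Crude whitespace estimate; replace with the model's real tokenizer."""
    return len(text.split())


chunks = {"short": "reset your password", "long": "word " * 600}
print(oversized_chunks(chunks, max_tokens=512, count_tokens=rough_token_count))
# ['long']
```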
Pooling differently than the model intended. Some models expect mean-pooling; others use [CLS] token; others have a built-in pooling layer. Using the wrong method silently degrades quality.
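To make the pooling difference concrete, here is a toy sketch of the two most common strategies applied to the same per-token output. It uses plain lists instead of a tensor library, and the function names are illustrative; which strategy a given model expects is stated on its model card.

```python
from typing import List

Matrix = List[List[float]]  # one row of hidden values per token


def cls_pool(token_vectors: Matrix) -> List[float]:
    """Use the first token's vector ([CLS] in BERT-style models)."""
    return list(token_vectors[0])


def mean_pool(token_vectors: Matrix, attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask excludes every token")
    return [sum(col) / len(kept) for col in zip(*kept)]


# Toy 3-token, 2-dimensional output; the last token is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(cls_pool(tokens))         # [1.0, 0.0]
print(mean_pool(tokens, mask))  # [2.0, 1.0]
```

The two results differ for the same input, and nothing raises an error if the wrong one is used. That is why this mistake goes unnoticed.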
Mixing embeddings from different models. Documents indexed with ada-002, queries embedded with 3-small. The vectors are in different spaces; results are nonsense.
Not re-evaluating when the model changes. Models get updated, whether hosted by a provider or (especially) released as open-source checkpoints. Your last evaluation might no longer reflect current quality.
Using cosine similarity when the model recommends dot product. Some models are normalized at output, making cosine and dot product equivalent. Others aren't. Use what the model card recommends.
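A small worked example shows why the choice matters for unnormalized vectors. The vectors are made up for illustration, and the helper functions are a plain-Python sketch, not any library's API.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
long_doc = [3.0, 4.0]   # norm 5, only partly aligned with the query
short_doc = [0.5, 0.0]  # norm 0.5, exactly aligned with the query

# Unnormalized: dot product rewards vector length, cosine does not.
print(dot(query, long_doc), dot(query, short_doc))        # 3.0 0.5
print(cosine(query, long_doc), cosine(query, short_doc))  # 0.6 1.0
```

The two metrics rank the documents in opposite order here. Once every vector is scaled to unit length, the dot product equals the cosine and the rankings agree.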
Default to text-embedding-3-small. Cheap, quality is good enough for most uses, well-supported.
Upgrade to 3-large if quality matters. The recall jump over 3-small (5pp at Recall@10, 7pp at Recall@5) translates to noticeably better downstream answers.
Self-host bge-large if you have the volume to justify it. Comparable quality to 3-small, much cheaper at scale.
Use bge-m3 for multilingual content. Don't try to make English-only models do the multilingual job.
Build an eval set before committing. A simple set of "expected document for query" pairs lets you compare models on YOUR data, not someone else's.
Plan for re-embedding. Storage and time to re-embed is the migration cost between models. Design the pipeline so re-running is feasible.
The embedding model is a foundational choice. Get it close to right and the rest of your retrieval pipeline can compensate for what's left. Get it wrong and you spend the next year fighting the foundation. The benchmarking effort (a day, maybe two) pays off many times over the life of the project.