Embeddings turn text into numbers a computer can compare. Here's the working mental model, a runnable Python example, and where embeddings fit in real apps.
By the end of this post you'll have a working Python script that turns sentences into vectors, compares them with cosine similarity, and returns the most relevant match for a query. You'll also have a clear mental model of why embeddings exist, what problems they solve, and where they fit in real applications.
No prior ML experience required. You'll need Python 3.9+ and ten minutes.
Computers compare numbers easily. Comparing meaning is harder. The strings "how to fix a 502 error" and "my server returns bad gateway" are nearly identical in meaning but share almost no words. Plain string matching can't see the connection.
Embeddings solve this. An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts with similar meaning end up with vectors that are close together in space. Two texts with different meaning end up far apart.
The numbers themselves aren't human-readable. A typical embedding has 384, 768, or 1536 dimensions — way too many to visualize. But comparing two embeddings gives you a single number (the cosine similarity, between -1 and 1) that captures "how similar are these texts in meaning."
That single number is the magic. It powers semantic search, recommendation, RAG, classification, deduplication — anything that needs "find the closest match by meaning."
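To make "close together" concrete, here's the comparison on made-up 2-D vectors. The numbers are invented purely for intuition (real embeddings have hundreds of dimensions), and the snippet needs only numpy:

import numpy as np

# Invented 2-D vectors, for intuition only; real embeddings have hundreds of dims.
error_post = np.array([0.9, 0.1])    # pretend "server trouble" direction
gateway_post = np.array([0.8, 0.2])  # pointing almost the same way
pasta_post = np.array([0.1, 0.9])    # pointing somewhere else entirely

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, ~0 = unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(error_post, gateway_post))  # ~0.99, near-identical direction
print(cosine(error_post, pasta_post))    # ~0.22, very different direction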
You don't make them yourself. You call a model. The model has been trained on huge amounts of text and has learned to assign similar vectors to texts with similar meaning.
Common ways to get embeddings today:
- Hosted APIs (OpenAI's text-embedding-3-small and text-embedding-3-large): fast, cheap, good quality, paid per token
- Open-source models via sentence-transformers (e.g. bge-small-en-v1.5): run locally, free, smaller but solid for many tasks

For this tutorial we'll use sentence-transformers so there's nothing to sign up for and no API key to manage. The patterns translate directly to the hosted APIs; only the function call changes.
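For reference, here's roughly what the hosted route looks like with OpenAI's Python SDK (a sketch, assuming the openai package is installed and OPENAI_API_KEY is set in your environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I fix a 502 error on nginx?"],
)
vector = resp.data[0].embedding  # a plain list of floats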
Install the dependencies:

pip install sentence-transformers numpy
This pulls in sentence-transformers (the model loader and runner), plus numpy for the vector math. Expect roughly 200MB of downloads, mostly the underlying PyTorch; the small embedding model itself is fetched and cached the first time you run the script.
You should see pip finish successfully. If you hit errors, check that you're on Python 3.9 or newer (python --version).
Save this as embeddings_demo.py:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "How do I fix a 502 error on nginx?",
    "My server keeps returning bad gateway responses.",
    "What is the capital of France?",
    "I love a good plate of pasta with tomato sauce.",
]
vectors = model.encode(sentences)
print(f"Got {len(vectors)} vectors")
print(f"Each vector has {len(vectors[0])} dimensions")
print(f"First vector starts with: {vectors[0][:5]}")
Run it:
python embeddings_demo.py
You should see output like:
Got 4 vectors
Each vector has 384 dimensions
First vector starts with: [-0.0123 0.0451 -0.0089 ...]
The first run downloads the model (~80MB) and caches it. Subsequent runs are instant.
Add this to the bottom of embeddings_demo.py:
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"\n[502 error] vs [bad gateway]: {cosine_similarity(vectors[0], vectors[1]):.3f}")
print(f"[502 error] vs [capital of France]: {cosine_similarity(vectors[0], vectors[2]):.3f}")
print(f"[502 error] vs [pasta]: {cosine_similarity(vectors[0], vectors[3]):.3f}")
Run it again. You should see something like:
[502 error] vs [bad gateway]: 0.731
[502 error] vs [capital of France]: 0.052
[502 error] vs [pasta]: 0.044
The two server-error sentences score much higher than either does against the unrelated sentences — even though they share no actual words. That's the embedding doing its job.
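Side note: you can also score every sentence against every other in one shot, which is the core move behind deduplication. A sketch that reuses the vectors array from the script (numpy is already imported):

V = np.asarray(vectors)
V = V / np.linalg.norm(V, axis=1, keepdims=True)  # make every row unit length
print(np.round(V @ V.T, 3))  # 4x4 similarity matrix; the diagonal is 1.0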
Now the practical bit. Given a small set of documents and a user query, return the closest match by meaning:
documents = [
    "Set up SSH key-based authentication for secure server access",
    "Install Docker and run your first container",
    "Configure nginx as a reverse proxy in front of your app",
    "Tune Postgres for high write throughput",
    "Deploy a Lambda function with the AWS CLI",
]
doc_vectors = model.encode(documents)
def search(query: str, top_k: int = 2):
    q_vec = model.encode(query)
    scores = [cosine_similarity(q_vec, d) for d in doc_vectors]
    ranked = sorted(zip(scores, documents), reverse=True)
    return ranked[:top_k]

for hit in search("how do I run a website behind nginx?"):
    print(f" {hit[0]:.3f} {hit[1]}")
You should see the nginx reverse-proxy doc come back first, followed by something else relevant. The query doesn't share words with the result, but the meaning matches.
That's the same algorithm at the core of semantic search engines, RAG retrievers, and recommendation systems. Larger systems use specialized vector databases (pgvector, Pinecone, Weaviate, Qdrant) to handle millions of vectors fast — but the math is identical.
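For a taste of how that scales, here's the same search vectorized with plain numpy: normalize the document matrix once, then score every document with a single matrix multiply. A sketch building on the model, documents, and doc_vectors defined above; vector databases do the same math with smarter indexing:

doc_matrix = np.asarray(doc_vectors)
doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

def search_fast(query: str, top_k: int = 2):
    q = model.encode(query)
    q = q / np.linalg.norm(q)
    scores = doc_matrix @ q                  # cosine score for every doc at once
    best = np.argsort(scores)[::-1][:top_k]  # indices of the top scores
    return [(float(scores[i]), documents[i]) for i in best]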
Comparing embeddings from different models. A vector from OpenAI's text-embedding-3-small cannot be meaningfully compared to one from bge-large. They live in different spaces. Stick to one model per index.
Embedding huge chunks of text. Most models have an input limit (often 512 tokens, sometimes 8192). Anything longer gets truncated silently: you embed only the start and lose everything after it. Split long documents into smaller chunks before embedding.
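A minimal chunker sketch, splitting on whitespace with a little overlap between pieces (real pipelines usually split on sentences or tokens instead; long_document is a placeholder for your own text):

def chunk(text: str, size: int = 200, overlap: int = 20):
    # Split into roughly size-word pieces; the overlap preserves context across boundaries.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = chunk(long_document)         # long_document: placeholder for your own text
chunk_vectors = model.encode(chunks)  # embed each chunk separately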
Forgetting to re-embed when you change models. Switching from bge-small to text-embedding-3-large means re-embedding every document. Old vectors become useless. Plan for this in your pipeline.
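One defensive pattern, sketched here with made-up field names: store the model id next to every vector and check it before comparing anything.

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

record = {
    "text": documents[0],
    "vector": doc_vectors[0].tolist(),
    "model": EMBEDDING_MODEL,  # refuse to compare vectors when this field doesn't match
}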
Assuming cosine similarity and dot product are interchangeable. For most modern models they are, because the vectors come out normalized to unit length, and then the two metrics give the same answer. A few models don't normalize, and there they disagree. Read the model card.
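You can check with the model loaded earlier: if the norm of an output vector is about 1.0, the vectors are unit length and the two metrics agree. sentence-transformers can also normalize explicitly via encode's normalize_embeddings flag:

v = model.encode("any sentence at all")
print(np.linalg.norm(v))  # ~1.0 means unit-length vectors: cosine equals dot product

v = model.encode("any sentence at all", normalize_embeddings=True)  # force normalization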
That covers the foundations. From here, natural next steps include swapping MiniLM for a larger model and moving your vectors into one of the vector databases mentioned above. Embeddings aren't deep ML wizardry; they're a simple primitive (text → vector) that unlocks an enormous amount of useful behavior. Once the mental model clicks, you'll start seeing places to use them in everything you build.