Embeddings turn text into numbers a computer can compare. Here's the working mental model, a runnable Python example, and where embeddings fit in real apps.
By the end of this post you'll have a working Python script that turns sentences into vectors, compares them with cosine similarity, and returns the most relevant match for a query. You'll also have a clear mental model of why embeddings exist, what problems they solve, and where they fit in real applications.
No prior ML experience required. You'll need Python 3.9+ and ten minutes.
Computers compare numbers easily. Comparing meaning is harder. The strings "how to fix a 502 error" and "my server returns bad gateway" are nearly identical in meaning but share almost no words. Plain string matching can't see the connection.
Embeddings solve this. An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts with similar meaning end up with vectors that are close together in space. Two texts with different meaning end up far apart.
The numbers themselves aren't human-readable. A typical embedding has 384, 768, or 1536 dimensions — way too many to visualize. But comparing two embeddings gives you a single number (the cosine similarity, between -1 and 1) that captures "how similar are these texts in meaning."
That single number is the magic. It powers semantic search, recommendation, RAG, classification, deduplication — anything that needs "find the closest match by meaning."
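To make "close together" concrete, here's the comparison on made-up 2-D vectors. The numbers are invented purely for intuition (real embeddings have hundreds of dimensions), and the snippet needs only numpy:

import numpy as np

# Invented 2-D vectors, for intuition only; real embeddings have hundreds of dims.
error_post = np.array([0.9, 0.1])    # pretend "server trouble" direction
gateway_post = np.array([0.8, 0.2])  # pointing almost the same way
pasta_post = np.array([0.1, 0.9])    # pointing somewhere else entirely

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, ~0 = unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(error_post, gateway_post))  # ~0.99, near-identical direction
print(cosine(error_post, pasta_post))    # ~0.22, very different direction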
You don't make them yourself. You call a model. The model has been trained on huge amounts of text and has learned to assign similar vectors to texts with similar meaning.
Common ways to get embeddings today:
- Hosted APIs (OpenAI's text-embedding-3-small and text-embedding-3-large): fast, cheap, good quality, paid per token
- Open-source models via sentence-transformers (e.g. bge-small-en-v1.5): run locally, free, smaller but solid for many tasks

For this tutorial we'll use sentence-transformers so there's nothing to sign up for and no API key to manage. The patterns translate directly to the hosted APIs; only the function call changes.
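For reference, here's roughly what the hosted route looks like with OpenAI's Python SDK (a sketch, assuming the openai package is installed and OPENAI_API_KEY is set in your environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I fix a 502 error on nginx?"],
)
vector = resp.data[0].embedding  # a plain list of floats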
Install the dependencies:

pip install sentence-transformers numpy
This pulls in sentence-transformers (the model loader and runner), plus numpy for the vector math. Expect roughly 200MB of downloads, mostly the underlying PyTorch; the small embedding model itself is fetched and cached the first time you run the script.
You should see pip finish successfully. If you hit errors, check that you're on Python 3.9 or newer (python --version).
Save this as embeddings_demo.py:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "How do I fix a 502 error on nginx?",
    "My server keeps returning bad gateway responses.",
    "What is the capital of France?",
    "I love a good plate of pasta with tomato sauce.",
]
vectors = model.encode(sentences)
print(f"Got {len(vectors)} vectors")
print(f"Each vector has {len(vectors[0])} dimensions")
print(f"First vector starts with: {vectors[0][:5]}")
Run it:
python embeddings_demo.py
You should see output like:
Got 4 vectors
Each vector has 384 dimensions
First vector starts with: [-0.0123 0.0451 -0.0089 ...]
The first run downloads the model (~80MB) and caches it. Subsequent runs are instant.
Add this to the bottom of embeddings_demo.py:
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"\n[502 error] vs [bad gateway]: {cosine_similarity(vectors[0], vectors[1]):.3f}")
print(f"[502 error] vs [capital of France]: {cosine_similarity(vectors[0], vectors[2]):.3f}")
print(f"[502 error] vs [pasta]: {cosine_similarity(vectors[0], vectors[3]):.3f}")
Run it again. You should see something like:
[502 error] vs [bad gateway]: 0.731
[502 error] vs [capital of France]: 0.052
[502 error] vs [pasta]: 0.044
The two server-error sentences score much higher than either does against the unrelated sentences — even though they share no actual words. That's the embedding doing its job.
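Side note: you can also score every sentence against every other in one shot, which is the core move behind deduplication. A sketch that reuses the vectors array from the script (numpy is already imported):

V = np.asarray(vectors)
V = V / np.linalg.norm(V, axis=1, keepdims=True)  # make every row unit length
print(np.round(V @ V.T, 3))  # 4x4 similarity matrix; the diagonal is 1.0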
Now the practical bit. Given a small set of documents and a user query, return the closest match by meaning:
documents = [
    "Set up SSH key-based authentication for secure server access",
    "Install Docker and run your first container",
    "Configure nginx as a reverse proxy in front of your app",
    "Tune Postgres for high write throughput",
    "Deploy a Lambda function with the AWS CLI",
]
doc_vectors = model.encode(documents)
def search(query: str, top_k: int = 2):
    q_vec = model.encode(query)
    scores = [cosine_similarity(q_vec, d) for d in doc_vectors]
    ranked = sorted(zip(scores, documents), reverse=True)
    return ranked[:top_k]

for hit in search("how do I run a website behind nginx?"):
    print(f" {hit[0]:.3f} {hit[1]}")
You should see the nginx reverse-proxy doc come back first, followed by something else relevant. The query doesn't share words with the result, but the meaning matches.
That's the same algorithm at the core of semantic search engines, RAG retrievers, and recommendation systems. Larger systems use specialized vector databases (pgvector, Pinecone, Weaviate, Qdrant) to handle millions of vectors fast — but the math is identical.
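For a taste of how that scales, here's the same search vectorized with plain numpy: normalize the document matrix once, then score every document with a single matrix multiply. A sketch building on the model, documents, and doc_vectors defined above; vector databases do the same math with smarter indexing:

doc_matrix = np.asarray(doc_vectors)
doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)

def search_fast(query: str, top_k: int = 2):
    q = model.encode(query)
    q = q / np.linalg.norm(q)
    scores = doc_matrix @ q                  # cosine score for every doc at once
    best = np.argsort(scores)[::-1][:top_k]  # indices of the top scores
    return [(float(scores[i]), documents[i]) for i in best]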
Comparing embeddings from different models. A vector from OpenAI's text-embedding-3-small cannot be meaningfully compared to one from bge-large. They live in different spaces. Stick to one model per index.
Embedding huge chunks of text. Most models have an input limit (often 512 tokens, sometimes 8192). Anything longer gets truncated silently: you embed only the start and lose everything after it. Split long documents into smaller chunks before embedding.
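A minimal chunker sketch, splitting on whitespace with a little overlap between pieces (real pipelines usually split on sentences or tokens instead; long_document is a placeholder for your own text):

def chunk(text: str, size: int = 200, overlap: int = 20):
    # Split into roughly size-word pieces; the overlap preserves context across boundaries.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = chunk(long_document)         # long_document: placeholder for your own text
chunk_vectors = model.encode(chunks)  # embed each chunk separately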
Forgetting to re-embed when you change models. Switching from bge-small to text-embedding-3-large means re-embedding every document. Old vectors become useless. Plan for this in your pipeline.
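One defensive pattern, sketched here with made-up field names: store the model id next to every vector and check it before comparing anything.

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

record = {
    "text": documents[0],
    "vector": doc_vectors[0].tolist(),
    "model": EMBEDDING_MODEL,  # refuse to compare vectors when this field doesn't match
}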
Assuming cosine similarity and dot product are interchangeable. For most modern models they are, because the vectors come out normalized to unit length, and then the two metrics give the same answer. A few models don't normalize, and there they disagree. Read the model card.
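You can check with the model loaded earlier: if the norm of an output vector is about 1.0, the vectors are unit length and the two metrics agree. sentence-transformers can also normalize explicitly via encode's normalize_embeddings flag:

v = model.encode("any sentence at all")
print(np.linalg.norm(v))  # ~1.0 means unit-length vectors: cosine equals dot product

v = model.encode("any sentence at all", normalize_embeddings=True)  # force normalization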
That covers the foundations. From here, natural next steps include swapping MiniLM for a larger model and moving your vectors into one of the vector databases mentioned above. Embeddings aren't deep ML wizardry; they're a simple primitive (text → vector) that unlocks an enormous amount of useful behavior. Once the mental model clicks, you'll start seeing places to use them in everything you build.