A working retrieval-augmented generation app you can run today. Markdown ingestion, embeddings, semantic search, and an LLM answer — start to finish in one afternoon.
By the end of this post you'll have a Python script that takes a folder of markdown files, indexes them with embeddings, and answers questions about their contents using an LLM. It's the same architecture that powers customer support assistants, internal knowledge search, and documentation Q&A — just trimmed to the essentials.
You'll need Python 3.9+, an OpenAI API key, and a folder of markdown files to query. Total time: about 30 minutes, including reading.
LLMs hallucinate when asked about things outside their training data. RAG (retrieval-augmented generation) fixes that by retrieving relevant context first, then asking the LLM to answer using only that context — so the model can cite real information instead of guessing.
The architecture has three steps: ingest (load + chunk + embed your documents), retrieve (find relevant chunks for a query), generate (call the LLM with the chunks as context).
That's it. The complexity teams add later (re-rankers, hybrid search, query rewriting, citation post-processing) is all optimization on top of these three steps.
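Concretely, here's the shape the rest of this post fills in. The three function names below (ingest, retrieve, answer) are the ones we'll define; answer calls retrieve internally:

ingest("./my-docs")              # step 1: load + chunk + embed the folder
ctx = retrieve("your question")  # step 2: find the most relevant chunks
reply = answer("your question")  # step 3: generate, using retrieved chunks as context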
pip install openai chromadb
export OPENAI_API_KEY="sk-..."
chromadb is an embedded vector database — runs in-process, persists to a local directory, no separate server. Good for prototypes; for production you'd reach for pgvector or Pinecone.
openai covers both the embedding model and the LLM. You can swap providers later; the code shape stays the same.
Save this as rag.py:
import os
import glob

import chromadb
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./rag_index")  # on-disk vector store
collection = chroma.get_or_create_collection("docs")
def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into chunks with a small overlap so context isn't cut mid-thought."""
    if len(text) <= size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + size])
        start += size - overlap
    return chunks
def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]
def ingest(folder: str):
    files = glob.glob(os.path.join(folder, "**/*.md"), recursive=True)
    print(f"Ingesting {len(files)} files...")
    for path in files:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        chunks = chunk(text)
        ids = [f"{path}::{i}" for i in range(len(chunks))]
        metadatas = [{"source": path, "chunk": i} for i in range(len(chunks))]
        embeddings = embed(chunks)
        collection.upsert(
            ids=ids, documents=chunks, embeddings=embeddings, metadatas=metadatas
        )
    print(f"Indexed {collection.count()} chunks total.")
Test it:
if __name__ == "__main__":
    import sys

    if sys.argv[1] == "ingest":
        ingest(sys.argv[2])
python rag.py ingest ./my-docs
You should see something like "Ingesting 12 files..." followed by "Indexed 87 chunks total." The vectors live in ./rag_index/ on disk and persist between runs.
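You can confirm the persistence from a fresh Python process; chromadb reads the index straight back from disk, no re-ingest needed:

import chromadb

chroma = chromadb.PersistentClient(path="./rag_index")
print(chroma.get_or_create_collection("docs").count())  # same count as the ingest run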
Add to rag.py:
def retrieve(query: str, k: int = 4) -> list[dict]:
    [q_embed] = embed([query])
    results = collection.query(query_embeddings=[q_embed], n_results=k)
    return [
        {"text": doc, "source": meta["source"]}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]
Quick test from a Python REPL:
from rag import retrieve

for chunk in retrieve("how do I configure SSL?"):
    print(f"[{chunk['source']}]")
    print(chunk['text'][:200])
    print("---")
You should see the four most relevant chunks for that query. If they don't look relevant, your docs probably don't cover the topic, or the chunks are too small or too large for the question shape; both are fixable.
Add to rag.py:
SYSTEM_PROMPT = """You answer questions using ONLY the numbered context snippets below.
For every claim, append a citation like [1] referring to the snippet number.
If the answer is not in the context, reply exactly: "I don't have that information in the provided documents."
Do not paraphrase claims that lack a citation."""
def answer(query: str) -> str:
chunks = retrieve(query)
context = "\n\n".join(f"[{i+1}] {c['text']}" for i, c in enumerate(chunks))
sources = "\n".join(f"[{i+1}] {c['source']}" for i, c in enumerate(chunks))
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
],
temperature=0,
)
text = response.choices[0].message.content
return f"{text}\n\nSources:\n{sources}"
Wire it up to the CLI:
if __name__ == "__main__":
    import sys

    if sys.argv[1] == "ingest":
        ingest(sys.argv[2])
    elif sys.argv[1] == "ask":
        print(answer(" ".join(sys.argv[2:])))
Now ask it something:
python rag.py ask "how do I configure SSL on my web server?"
You should see an answer that cites snippets [1], [2], etc., grounded in your actual documents, with the matching sources listed at the bottom. If the docs don't cover SSL, the model says it doesn't have the information instead of making something up — that's the whole point.
That's a working RAG app in well under 100 lines.
Embedding documents and queries with different models. Documents and queries must be embedded with the same model; vectors from different models live in incompatible spaces, so mixing them breaks retrieval silently.
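One cheap guard, sketched here as an addition rather than something rag.py above already does: record the model name in the collection's metadata at creation time, so a mismatched run fails loudly instead of quietly returning garbage.

EMBED_MODEL = "text-embedding-3-small"  # use this constant in embed() too

collection = chroma.get_or_create_collection("docs", metadata={"embed_model": EMBED_MODEL})
if (collection.metadata or {}).get("embed_model") != EMBED_MODEL:
    raise RuntimeError("rag_index was built with a different embedding model; delete it and re-ingest")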
Chunks too big or too small. Too big and irrelevant context drowns the answer. Too small and you cut sentences mid-thought. Start at 600–1000 characters with 100 char overlap; tune from there based on your content shape.
Skipping the "I don't know" instruction. Without it, the LLM will confidently answer questions even when retrieval returned irrelevant chunks. The exact-string fallback ("I don't have that information…") is what lets you detect "out of scope" programmatically.
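That exact string gives you a programmatic hook. Here's a minimal sketch using the answer function from above; what you do in the out-of-scope branch (logging, escalation, widening k) is up to you:

FALLBACK = "I don't have that information in the provided documents."

reply = answer("how do I configure SSL?")
if reply.startswith(FALLBACK):
    print("out of scope: retrieval found no covering docs")  # log it, escalate, or widen k
else:
    print(reply)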
Treating it as solved at this point. This is a working prototype, not a production system. For real users you'll want re-ranking, hybrid search (BM25 + vector), citation verification, eval sets, and confidence thresholding. Each is a small addition; none are necessary to validate the idea.
You now have the bones. The next levels are the production additions listed above: re-ranking, hybrid search, citation verification, eval sets, confidence thresholding. The 100-line version is enough to know whether RAG fits your problem. If your eval queries return relevant chunks, you've validated the idea; everything past that is engineering, not research.