A working retrieval-augmented generation app you can run today. Markdown ingestion, embeddings, semantic search, and an LLM answer — start to finish in one afternoon.
By the end of this post you'll have a Python script that takes a folder of markdown files, indexes them with embeddings, and answers questions about their contents using an LLM. It's the same architecture that powers customer support assistants, internal knowledge search, and documentation Q&A — just trimmed to the essentials.
You'll need Python 3.9+, an OpenAI API key, and a folder of markdown files to query. Total time: about 30 minutes, including reading.
LLMs hallucinate when asked about things outside their training data. RAG (retrieval-augmented generation) fixes that by retrieving relevant context first, then asking the LLM to answer using only that context — so the model can cite real information instead of guessing.
The architecture has three steps: ingest (load + chunk + embed your documents), retrieve (find relevant chunks for a query), generate (call the LLM with the chunks as context).
That's it. The complexity teams add later (re-rankers, hybrid search, query rewriting, citation post-processing) is all optimization on top of these three steps.
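Concretely, here's the shape the rest of this post fills in. The three function names below (ingest, retrieve, answer) are the ones we'll define; answer calls retrieve internally:

ingest("./my-docs")              # step 1: load + chunk + embed the folder
ctx = retrieve("your question")  # step 2: find the most relevant chunks
reply = answer("your question")  # step 3: generate, using retrieved chunks as context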
pip install openai chromadb
export OPENAI_API_KEY="sk-..."
chromadb is an embedded vector database — runs in-process, persists to a local directory, no separate server. Good for prototypes; for production you'd reach for pgvector or Pinecone.
openai covers both the embedding model and the LLM. You can swap providers later; the code shape stays the same.
Save this as rag.py:
import os
import glob

import chromadb
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./rag_index")  # on-disk vector store
collection = chroma.get_or_create_collection("docs")
def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into chunks with a small overlap so context isn't cut mid-thought."""
    if len(text) <= size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + size])
        start += size - overlap
    return chunks
def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]
def ingest(folder: str):
    files = glob.glob(os.path.join(folder, "**/*.md"), recursive=True)
    print(f"Ingesting {len(files)} files...")
    for path in files:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        chunks = chunk(text)
        ids = [f"{path}::{i}" for i in range(len(chunks))]
        metadatas = [{"source": path, "chunk": i} for i in range(len(chunks))]
        embeddings = embed(chunks)
        collection.upsert(
            ids=ids, documents=chunks, embeddings=embeddings, metadatas=metadatas
        )
    print(f"Indexed {collection.count()} chunks total.")
Test it:
if __name__ == "__main__":
    import sys

    if sys.argv[1] == "ingest":
        ingest(sys.argv[2])
python rag.py ingest ./my-docs
You should see something like "Ingesting 12 files..." followed by "Indexed 87 chunks total." The vectors live in ./rag_index/ on disk and persist between runs.
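You can confirm the persistence from a fresh Python process; chromadb reads the index straight back from disk, no re-ingest needed:

import chromadb

chroma = chromadb.PersistentClient(path="./rag_index")
print(chroma.get_or_create_collection("docs").count())  # same count as the ingest run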
Add to rag.py:
def retrieve(query: str, k: int = 4) -> list[dict]:
    [q_embed] = embed([query])
    results = collection.query(query_embeddings=[q_embed], n_results=k)
    return [
        {"text": doc, "source": meta["source"]}
        for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    ]
Quick test from a Python REPL:
from rag import retrieve

for chunk in retrieve("how do I configure SSL?"):
    print(f"[{chunk['source']}]")
    print(chunk['text'][:200])
    print("---")
You should see the four most relevant chunks for that query. If they don't look relevant, your docs probably don't cover the topic, or the chunks are too small or too large for the question shape; both are fixable.
Add to rag.py:
SYSTEM_PROMPT = """You answer questions using ONLY the numbered context snippets below.
For every claim, append a citation like [1] referring to the snippet number.
If the answer is not in the context, reply exactly: "I don't have that information in the provided documents."
Do not paraphrase claims that lack a citation."""
def answer(query: str) -> str:
chunks = retrieve(query)
context = "\n\n".join(f"[{i+1}] {c['text']}" for i, c in enumerate(chunks))
sources = "\n".join(f"[{i+1}] {c['source']}" for i, c in enumerate(chunks))
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
],
temperature=0,
)
text = response.choices[0].message.content
return f"{text}\n\nSources:\n{sources}"
Wire it up to the CLI:
if __name__ == "__main__":
    import sys

    if sys.argv[1] == "ingest":
        ingest(sys.argv[2])
    elif sys.argv[1] == "ask":
        print(answer(" ".join(sys.argv[2:])))
Now ask it something:
python rag.py ask "how do I configure SSL on my web server?"
You should see an answer that cites snippets [1], [2], etc., grounded in your actual documents, with the matching sources listed at the bottom. If the docs don't cover SSL, the model says it doesn't have the information instead of making something up — that's the whole point.
That's a working RAG app in well under 100 lines.
Embedding documents and queries with different models. Documents and queries must be embedded with the same model; vectors from different models live in incompatible spaces, so mixing them breaks retrieval silently.
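One cheap guard, sketched here as an addition rather than something rag.py above already does: record the model name in the collection's metadata at creation time, so a mismatched run fails loudly instead of quietly returning garbage.

EMBED_MODEL = "text-embedding-3-small"  # use this constant in embed() too

collection = chroma.get_or_create_collection("docs", metadata={"embed_model": EMBED_MODEL})
if (collection.metadata or {}).get("embed_model") != EMBED_MODEL:
    raise RuntimeError("rag_index was built with a different embedding model; delete it and re-ingest")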
Chunks too big or too small. Too big and irrelevant context drowns the answer. Too small and you cut sentences mid-thought. Start at 600–1000 characters with 100 char overlap; tune from there based on your content shape.
Skipping the "I don't know" instruction. Without it, the LLM will confidently answer questions even when retrieval returned irrelevant chunks. The exact-string fallback ("I don't have that information…") is what lets you detect "out of scope" programmatically.
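That exact string gives you a programmatic hook. Here's a minimal sketch using the answer function from above; what you do in the out-of-scope branch (logging, escalation, widening k) is up to you:

FALLBACK = "I don't have that information in the provided documents."

reply = answer("how do I configure SSL?")
if reply.startswith(FALLBACK):
    print("out of scope: retrieval found no covering docs")  # log it, escalate, or widen k
else:
    print(reply)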
Treating it as solved at this point. This is a working prototype, not a production system. For real users you'll want re-ranking, hybrid search (BM25 + vector), citation verification, eval sets, and confidence thresholding. Each is a small addition; none are necessary to validate the idea.
You now have the bones. The next levels are the production additions listed above: re-ranking, hybrid search, citation verification, eval sets, confidence thresholding. The 100-line version is enough to know whether RAG fits your problem. If your eval queries return relevant chunks, you've validated the idea; everything past that is engineering, not research.