Tutorial · May 26, 2026 · 28 min read

DSPy 3: Build Self-Optimizing LLM Pipelines in Python

A practical guide to DSPy 3, the Stanford-born framework for programming language models declaratively. Build, evaluate, and automatically optimize a RAG pipeline in Python without hand-tuning a single prompt.

Introduction

Most teams ship LLM features the same way: write a giant prompt, paste a few examples, and hope it generalizes. When the model is swapped or the task drifts, the whole thing has to be rewritten by hand. DSPy flips this around. Instead of prompts, you write signatures and modules — the framework then compiles them into optimized prompts and few-shot demonstrations against a metric you define.

DSPy 3, released by Stanford NLP in 2025, ships with the MIPROv2 optimizer, native async support, streaming, and support for OpenAI, Anthropic, Google, Groq, and any other LiteLLM-compatible provider. It powers production retrieval and reasoning systems at companies such as JetBlue, Databricks, and Replit.

In this tutorial you will build a factual question-answering pipeline that:

  • Retrieves passages from a small knowledge base
  • Reasons step-by-step with Chain-of-Thought
  • Gets automatically optimized against an accuracy metric
  • Improves measurably without you touching a single prompt string

Prerequisites

Before starting, make sure you have:

  • Python 3.10 or newer
  • An OpenAI or Anthropic API key (or any LiteLLM-compatible provider)
  • Familiarity with pip and virtual environments
  • Basic knowledge of LLM concepts (tokens, embeddings, RAG)
  • About 30 minutes of focused time

What You Will Build

A self-optimizing QA system that answers questions about the Tunisian tech ecosystem from a small corpus. By the end you will have:

  • A typed pipeline written in pure Python
  • A metric function for automated evaluation
  • A compiled program that beats the zero-shot baseline by more than 15 points on accuracy
  • Saved artifacts you can load and serve from FastAPI

Step 1: Install DSPy and Configure a Language Model

Create a fresh project and install DSPy 3:

mkdir dspy-qa && cd dspy-qa
python -m venv .venv
source .venv/bin/activate
pip install "dspy-ai>=3.0.0" "litellm" "datasets" "fastapi" "uvicorn"

Create a main.py and configure DSPy with your preferred model. The library uses LiteLLM under the hood, so the same code works with OpenAI, Anthropic, Groq, Ollama, or vLLM.

import os
import dspy
 
lm = dspy.LM(
    "openai/gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=0.0,
    max_tokens=512,
)
dspy.configure(lm=lm)

To swap providers, only the model string changes — everything downstream is provider-agnostic.

# Anthropic
lm = dspy.LM("anthropic/claude-haiku-4-5", api_key=os.environ["ANTHROPIC_API_KEY"])
 
# Groq (fast inference)
lm = dspy.LM("groq/llama-3.3-70b-versatile", api_key=os.environ["GROQ_API_KEY"])
 
# Local Ollama
lm = dspy.LM("ollama/llama3.1", api_base="http://localhost:11434")
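One way to make the provider swap automatic is to choose the model string from whichever API key happens to be set. The sketch below is only an illustration: `pick_model`, `CANDIDATES`, and the priority order are assumptions made for this example, not part of DSPy. The model strings are the ones used above.

```python
import os

# Hypothetical priority list: (environment variable, LiteLLM-style model string).
# The model strings come from this tutorial; the ordering is an arbitrary choice.
CANDIDATES = [
    ("OPENAI_API_KEY", "openai/gpt-4o-mini"),
    ("ANTHROPIC_API_KEY", "anthropic/claude-haiku-4-5"),
    ("GROQ_API_KEY", "groq/llama-3.3-70b-versatile"),
]
LOCAL_FALLBACK = "ollama/llama3.1"


def pick_model(env=None):
    """Return (model_string, api_key) for the first provider whose key is set."""
    env = os.environ if env is None else env
    for key_name, model in CANDIDATES:
        api_key = env.get(key_name)
        if api_key:
            return model, api_key
    # No hosted key found: fall back to the local Ollama model, which needs no key.
    return LOCAL_FALLBACK, None
```

The result could then be passed along as dspy.LM(model, api_key=api_key). For the Ollama fallback you would pass api_base instead, as in the last example above.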

Step 2: Define a Signature

A signature is a typed declaration of the task: inputs, outputs, and a docstring describing intent. DSPy uses it to construct the actual prompt at runtime.

class GenerateAnswer(dspy.Signature):
    """Answer questions about Tunisian tech companies and startups using the provided context."""
 
    context: list[str] = dspy.InputField(desc="Relevant passages from the knowledge base")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A concise factual answer, one to two sentences")

Notice you never write the words "you are a helpful assistant" or "answer concisely" in a prompt — the docstring and field descriptions are enough. DSPy's optimizer will later refine this into the best-performing prompt automatically.
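To build intuition for how a declaration can become a prompt, here is a toy renderer in plain Python. It is a sketch under loud assumptions: `FieldSpec` and `render_prompt` are hypothetical names, and the output layout is invented for illustration. It does not reproduce the format DSPy actually sends to the model.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """Hypothetical stand-in for an input or output field declaration."""
    name: str
    desc: str = ""


def render_prompt(instructions, inputs, outputs, values):
    """Build a plain-text prompt from a task description and field lists."""
    lines = [instructions, "", "Inputs:"]
    for f in inputs:
        suffix = f" ({f.desc})" if f.desc else ""
        lines.append(f"- {f.name}{suffix}: {values[f.name]}")
    lines += ["", "Respond with these fields:"]
    for f in outputs:
        suffix = f" ({f.desc})" if f.desc else ""
        lines.append(f"- {f.name}{suffix}")
    return "\n".join(lines)


prompt = render_prompt(
    "Answer questions about Tunisian tech companies and startups using the provided context.",
    inputs=[FieldSpec("context", "Relevant passages from the knowledge base"), FieldSpec("question")],
    outputs=[FieldSpec("answer", "A concise factual answer, one to two sentences")],
    values={
        "context": ["Datavora, founded in Tunis, was acquired by Semrush in 2020."],
        "question": "In what year was Datavora acquired?",
    },
)
print(prompt)
```

The point of the toy is the separation of concerns: the task description and field metadata live in one place, and the string that reaches the model is derived from them, which is what lets an optimizer rewrite the instructions later without touching your code.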

Step 3: Compose Modules

Modules wrap a signature with a reasoning strategy. The three you will use most often are:

  • dspy.Predict — direct prediction (one LLM call)
  • dspy.ChainOfThought — adds a reasoning field before the output
  • dspy.ReAct — interleaves reasoning with tool calls

Build a Chain-of-Thought QA module:

class TunisianTechQA(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
 
    def forward(self, question: str):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

The forward method composes retrieval and generation. There are no hand-written prompts anywhere.

Step 4: Set Up Retrieval

DSPy ships with retrievers for ChromaDB, Pinecone, Weaviate, Qdrant, and a built-in in-memory vector store called Embeddings. For this tutorial you will use the in-memory version with a tiny corpus.

corpus = [
    "InstaDeep is a Tunisian AI startup founded in 2014, acquired by BioNTech in 2023 for around 562 million euros.",
    "Expensya, founded in Tunis in 2014, was acquired by Medius in 2023.",
    "Tunisia hosts the annual Maghreb Emerging conference focused on AI and deep tech investment.",
    "Noqta is a Tunisian software studio focused on AI-driven web platforms and developer tooling.",
    "Datavora, founded in Tunis, was acquired by Semrush in 2020.",
    "The Smart Tunisia program subsidizes salaries for tech graduates working in export-oriented companies.",
]
 
embedder = dspy.Embedder("openai/text-embedding-3-small")
retriever = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=3)
dspy.configure(rm=retriever)

Now any module using dspy.Retrieve will pull from this corpus. In production you would swap this for a managed vector database.
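If embedding retrieval feels opaque, the following standard-library sketch shows the underlying idea: turn the query and each passage into a vector, score them by cosine similarity, and keep the top k. Word counts stand in for learned embeddings here, so this is an assumption-heavy toy, and `bag_of_words`, `cosine`, `top_k`, and `toy_corpus` are hypothetical names rather than DSPy APIs.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, passages, k=3):
    """Return the k passages most similar to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bag_of_words(p)), reverse=True)
    return ranked[:k]


toy_corpus = [
    "Datavora, founded in Tunis, was acquired by Semrush in 2020.",
    "Noqta is a Tunisian software studio focused on AI-driven web platforms and developer tooling.",
    "The Smart Tunisia program subsidizes salaries for tech graduates working in export-oriented companies.",
]
hits = top_k("Who acquired Datavora?", toy_corpus, k=1)
print(hits)
```

A real embedding model also matches paraphrases ("bought" versus "acquired") that word counts miss, which is why the tutorial uses an embedder instead.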

Step 5: Prepare Training and Evaluation Data

Optimization needs data. You only need 20 to 50 examples to see meaningful gains — DSPy does the heavy lifting with bootstrapping.

from dspy import Example
 
trainset = [
    Example(question="Who acquired InstaDeep and when?",
            answer="BioNTech acquired InstaDeep in 2023.").with_inputs("question"),
    Example(question="What does Noqta build?",
            answer="Noqta builds AI-driven web platforms and developer tooling.").with_inputs("question"),
    Example(question="Which company acquired Expensya?",
            answer="Medius acquired Expensya in 2023.").with_inputs("question"),
    Example(question="What is Smart Tunisia?",
            answer="A program subsidizing salaries for tech graduates in export-oriented companies.").with_inputs("question"),
    # ... add 16 more for a total of 20
]
 
devset = [
    Example(question="In what year was Datavora acquired?",
            answer="2020.").with_inputs("question"),
    Example(question="What sector is InstaDeep in?",
            answer="Artificial intelligence.").with_inputs("question"),
    # ... add 8 more for a total of 10
]

Note .with_inputs("question") — this tells DSPy which fields are inputs versus expected outputs.
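If you collect all question and answer pairs in one list first, a deterministic split keeps the training and dev sets disjoint and reproducible. This is a minimal sketch: `split_examples` is a hypothetical helper, and the one-third dev fraction simply mirrors the 20/10 split used in this tutorial.

```python
import random


def split_examples(pairs, dev_fraction=1 / 3, seed=0):
    """Shuffle (question, answer) pairs deterministically and split into train/dev."""
    if not 0 < dev_fraction < 1:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A seeded Random instance gives the same order on every run.
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]
```

Each pair would then be wrapped as Example(question=..., answer=...).with_inputs("question"), exactly as in the lists above.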

Step 6: Define a Metric

A metric is a Python function that takes a gold example and a prediction and returns a score. For factual QA, exact match is too strict; use a semantic-similarity metric or let an LLM judge.

def answer_correctness(example, prediction, trace=None) -> float:
    """LLM-as-judge metric. Returns 1.0 if the prediction is factually equivalent to the gold answer."""
    judge = dspy.Predict("question, gold_answer, predicted_answer -> is_correct: bool")
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return 1.0 if result.is_correct else 0.0

This metric is itself a DSPy program. Welcome to recursion.
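When judge calls are too slow or costly for quick iterations, a lexical metric with the same (example, prediction, trace) shape can serve as a rough proxy. The sketch below computes token-overlap F1. It measures word overlap, not meaning, so it is looser than exact match but weaker than a semantic or LLM-judged metric. `token_f1` is a hypothetical name, and SimpleNamespace objects stand in for DSPy examples and predictions.

```python
import re
from collections import Counter
from types import SimpleNamespace


def _tokens(text):
    """Lowercase alphanumeric tokens, punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(example, prediction, trace=None) -> float:
    """Token-overlap F1 between the gold answer and the predicted answer."""
    gold, pred = _tokens(example.answer), _tokens(prediction.answer)
    if not gold or not pred:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(gold) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = SimpleNamespace(answer="Medius acquired Expensya in 2023.")
pred = SimpleNamespace(answer="Expensya was acquired by Medius in 2023")
score = token_f1(gold, pred)
print(f"{score:.3f}")  # 0.833
```

Here all five gold tokens appear among the seven predicted tokens, so precision is 5/7, recall is 1, and F1 is 10/12.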

Step 7: Compile with MIPROv2

This is where DSPy earns its name. The optimizer searches over prompt instructions and few-shot demonstrations to maximize your metric on the training set.

from dspy.teleprompt import MIPROv2
 
qa = TunisianTechQA()
 
# Baseline before optimization
baseline_score = sum(answer_correctness(ex, qa(question=ex.question)) for ex in devset) / len(devset)
print(f"Zero-shot baseline: {baseline_score:.2%}")
 
optimizer = MIPROv2(
    metric=answer_correctness,
    auto="medium",  # "light", "medium", or "heavy"
    num_threads=8,
)
 
compiled_qa = optimizer.compile(
    qa,
    trainset=trainset,
    requires_permission_to_run=False,
    minibatch=True,
)
 
# Score after optimization
optimized_score = sum(answer_correctness(ex, compiled_qa(question=ex.question)) for ex in devset) / len(devset)
print(f"Optimized: {optimized_score:.2%}")

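On a 20-example training set, expect baselines around 60 to 70 percent and optimized scores above 85 percent. Exact numbers depend on the model and the data, and with a 10-example dev set each question moves the score by 10 points, so treat small differences as noise. The optimizer typically completes in two to five minutes depending on dataset size and the auto setting.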
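Conceptually, the optimizer proposes candidate instructions and demonstration sets, scores each combination with your metric, and keeps the best one. The sketch below shows that loop as a plain exhaustive search over a tiny grid. It is a simplification for intuition only and does not reproduce how MIPROv2 proposes or searches candidates. `search`, `fake_program`, `exact_match`, and the toy data are all hypothetical.

```python
from itertools import combinations, product


def search(run_program, metric, instructions, demo_pool, trainset, max_demos=2):
    """Try every (instruction, demo subset) pair and return the best-scoring one."""
    demo_sets = [
        subset
        for size in range(max_demos + 1)
        for subset in combinations(demo_pool, size)
    ]
    best = (float("-inf"), None, None)
    for instruction, demos in product(instructions, demo_sets):
        scores = [metric(ex, run_program(instruction, demos, ex)) for ex in trainset]
        avg = sum(scores) / len(scores)
        if avg > best[0]:  # strict comparison: the first candidate wins ties
            best = (avg, instruction, demos)
    return best


# Toy stand-ins so the loop can run without an LLM.
def fake_program(instruction, demos, example):
    # Pretend the "model" is only right with a clear instruction and at least one demo.
    return example["gold"] if "concise" in instruction and demos else "unknown"


def exact_match(example, prediction):
    return 1.0 if prediction == example["gold"] else 0.0


best_score, best_instruction, best_demos = search(
    fake_program,
    exact_match,
    instructions=["Answer the question.", "Give a concise factual answer."],
    demo_pool=["demo A", "demo B"],
    trainset=[{"gold": "2020"}, {"gold": "BioNTech"}],
)
print(best_score, best_instruction, best_demos)
```

Even this tiny grid needs 8 candidates times 2 examples, or 16 program runs, which hints at why real optimizers sample and prioritize candidates rather than enumerating them.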

Step 8: Inspect What Was Learned

DSPy is transparent — you can see the exact prompt the optimizer produced.

# ChainOfThought wraps an inner Predict module; the learned state lives on it.
compiled_qa.generate_answer.predict.demos  # the few-shot examples it selected
compiled_qa.generate_answer.predict.signature.instructions  # the rewritten instruction

This is invaluable for debugging and for porting the compiled prompt to other systems if you ever want to leave DSPy.

Step 9: Save and Load

Compilation is expensive, so persist the result.

compiled_qa.save("./compiled_qa.json")
 
# Later, in production
qa = TunisianTechQA()
qa.load("./compiled_qa.json")
answer = qa(question="Who acquired InstaDeep?").answer

The saved file is a small JSON containing instructions and demonstrations — no model weights, no proprietary state.

Step 10: Serve with FastAPI

Wrap the compiled program in a tiny API.

from fastapi import FastAPI
from pydantic import BaseModel
 
app = FastAPI()
 
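qa = TunisianTechQA()
qa.load("./compiled_qa.json")
# TunisianTechQA defines only a synchronous forward(), so wrap it for async use.
async_qa = dspy.asyncify(qa)

class Query(BaseModel):
    question: str

@app.post("/ask")
async def ask(q: Query):
    result = await async_qa(question=q.question)
    return {"answer": result.answer, "context": result.context}

Run it:

uvicorn main:app --reload

dspy.asyncify runs the synchronous program in a worker thread, so the endpoint does not block the event loop and integrates cleanly with FastAPI, Starlette, or any async stack. DSPy 3 also has first-class async support: a module that defines aforward can be awaited directly with acall.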
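To exercise the endpoint from Python, a small standard-library client is enough. This is a sketch under assumptions: `build_ask_request` and `ask_server` are hypothetical helpers, and the base URL assumes uvicorn's default of 127.0.0.1 on port 8000.

```python
import json
import urllib.request


def build_ask_request(question, base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST request for the /ask endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_server(question, base_url="http://127.0.0.1:8000", timeout=30):
    """Send the question to a running server and return the parsed JSON reply."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, ask_server("Who acquired InstaDeep?") should return a dictionary with the answer and context keys produced by the endpoint.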

Testing Your Implementation

A quick sanity check before deploying:

test_questions = [
    "Who founded InstaDeep?",
    "When was Expensya acquired?",
    "What does the Smart Tunisia program do?",
]
 
for q in test_questions:
    result = qa(question=q)
    print(f"Q: {q}\nA: {result.answer}\n")

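If an answer cites facts that are not in the corpus, the model is falling back on its own knowledge because the retrieved passages did not cover the question. The corpus never says who founded InstaDeep, so the first question is a useful probe for this. To fix it, increase k in dspy.Retrieve or expand the corpus.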
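A crude way to automate that check is to flag answer words that appear in none of the retrieved passages. The heuristic below is a sketch, not a real groundedness metric: it only compares surface tokens, the stopword list is a small hand-picked assumption, and `content_tokens` and `unsupported_tokens` are hypothetical names.

```python
import re

# Small, hand-picked stopword list (an assumption; extend it for your data).
STOPWORDS = {"a", "an", "the", "in", "on", "of", "by", "is", "was", "and", "to", "for"}


def content_tokens(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def unsupported_tokens(answer, passages):
    """Return answer tokens that appear in none of the retrieved passages."""
    seen = set()
    for passage in passages:
        seen |= content_tokens(passage)
    return sorted(content_tokens(answer) - seen)


passages = ["Expensya, founded in Tunis in 2014, was acquired by Medius in 2023."]
print(unsupported_tokens("Medius acquired Expensya in 2023.", passages))  # []
print(unsupported_tokens("SAP acquired Expensya in 2021.", passages))     # ['2021', 'sap']
```

In the test loop above you could pass result.answer and result.context to this function and print a warning whenever the returned list is not empty.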

Troubleshooting

Optimizer is too slow. Set auto="light" for a quick run or reduce num_threads if you are rate-limited.

Metric returns flat scores. Your judge model is probably too small. Use a stronger model just for evaluation by passing lm= to the judge Predict call.

Predictions ignore context. Make sure your signature's docstring tells the model to use the context, and increase the number of retrieved passages.

OpenAI rate limits during compilation. MIPROv2 with auto="heavy" can fire thousands of calls. Start with auto="light" or run against a local Ollama model for prototyping.

ModuleNotFoundError: dspy.retrievers. You are on an old version. Run pip install -U "dspy-ai>=3.0.0".
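For the rate-limit case in particular, wrapping your own calls (for example the baseline evaluation loop) in exponential backoff can help. The following is a generic sketch, not a DSPy feature: `with_backoff` and its parameters are hypothetical, and in practice you would narrow retry_on to the rate-limit exception your provider raises.

```python
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Jitter spreads retries out so parallel workers do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

A call such as with_backoff(lambda: qa(question="Who acquired InstaDeep?")) would then retry up to five times, waiting roughly 1, 2, 4, and 8 seconds between attempts.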

Next Steps

Extend this pipeline with:

  • Tool use: swap ChainOfThought for dspy.ReAct and give it a calculator or web-search tool
  • Multi-hop reasoning: chain two dspy.ChainOfThought modules so the model decomposes complex questions
  • Production retrieval: plug in a managed vector store. See our Pinecone vector search tutorial for the indexing side, or the LangFuse observability guide to monitor latency and quality in production
  • Stronger evaluation: use DSPy's built-in dspy.evaluate.SemanticF1 or hook up Promptfoo
  • Continuous optimization: re-run the optimizer weekly on a growing dataset and ship the new artifact via your CI
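For the continuous-optimization idea, a CI job needs some rule for deciding whether a freshly compiled artifact should replace the current one. One possible rule, sketched below, is to require that the candidate answers at least one more dev question correctly. `should_promote` and its threshold are assumptions for illustration; choose a margin that fits your dev set size and noise level.

```python
def should_promote(current_score, candidate_score, dev_size, min_gain_examples=1):
    """Promote only if the candidate gets at least min_gain_examples more dev questions right."""
    if dev_size <= 0:
        raise ValueError("dev_size must be positive")
    # Convert "N more correct answers" into a score difference on a 0-1 scale.
    min_gain = min_gain_examples / dev_size
    # Small tolerance guards against floating-point error in the subtraction.
    return candidate_score - current_score >= min_gain - 1e-9
```

With a 10-question dev set, moving from 0.8 to 0.9 would pass the gate, while moving from 0.8 to 0.85 would not.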

Conclusion

You built a self-optimizing question-answering pipeline without writing a single hand-crafted prompt. DSPy turned a typed signature, a metric, and 20 examples into a system that measurably outperforms the zero-shot baseline.

The pattern generalizes far beyond QA. Classification, summarization, structured extraction, and multi-agent routing can all be expressed as a signature and compiled, as can most tasks you would normally prompt-engineer. When you swap models from GPT-4o-mini to Claude Haiku or to a local Llama, you re-run the optimizer and it re-tunes the prompts for you instead of forcing a manual rewrite.

That is the DSPy bet: stop tuning prompts, start programming.