Introduction

Most agent frameworks ship with the same recipe: the model emits a JSON object describing which tool to call, the runtime parses it, executes the tool, feeds the result back, and loops. It works, but every step pays a tax — JSON is verbose, tool composition is awkward, and chaining three operations requires three round trips.

smolagents, released by Hugging Face at the end of 2024 and developed steadily since, takes a different bet: instead of emitting JSON, the agent writes Python code and the runtime executes it. A single generation can call three tools, loop over a list, do arithmetic, and store intermediate variables, all in one shot. The project's documentation cites research in which code actions needed roughly 30 percent fewer steps than JSON tool calls while matching or beating them on difficult benchmarks.
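To make the contrast concrete, here is a minimal sketch that counts model turns for the same job in both styles. The tools search and word_count are hypothetical stand-ins invented for this illustration, and plain exec stands in for the library's restricted interpreter.

```python
import json

# Hypothetical toy tools, invented for this illustration.
def search(query: str) -> list[str]:
    return [f"{query} result {i}" for i in range(3)]

def word_count(text: str) -> int:
    return len(text.split())

TOOLS = {"search": search, "word_count": word_count}

def run_json_call(message: str):
    """JSON style: each model message names exactly one tool call."""
    call = json.loads(message)
    return TOOLS[call["tool"]](**call["arguments"])

# JSON agent: one turn to search, then one turn per result to count words.
json_turns = 1
hits = run_json_call(json.dumps({"tool": "search", "arguments": {"query": "smolagents"}}))
json_total = 0
for hit in hits:
    json_total += run_json_call(json.dumps({"tool": "word_count", "arguments": {"text": hit}}))
    json_turns += 1

# Code agent: the same work expressed as a single generated snippet.
code_action = """
hits = search(query="smolagents")
total = sum(word_count(text=h) for h in hits)
"""
namespace = dict(TOOLS)
exec(code_action, namespace)  # plain exec, NOT a sandbox; for illustration only
code_turns = 1

print(json_turns, code_turns, json_total, namespace["total"])
```

Both paths compute the same total, but the JSON path needs four model turns where the code path needs one.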

The library is deliberately small (the project describes its core agent logic as fitting in roughly 1,000 lines of code) and works with models served through Hugging Face inference, OpenAI-compatible endpoints, or LiteLLM, which in turn covers Anthropic and many other providers. In this tutorial you will build a production-grade research agent that pulls data from the web, summarizes papers, and coordinates with a sub-agent, all in pure Python.

Prerequisites

Before starting, make sure you have:

Python 3.10 or newer
A Hugging Face account and access token (free tier works)
An OpenAI or Anthropic API key (optional, for comparison)
Familiarity with pip and virtual environments
Basic understanding of LLM concepts (tokens, system prompts, tool calls)
About 30 minutes of focused time

What You Will Build

A multi-step research agent that:

Accepts a natural-language research question
Searches the web using DuckDuckGo
Fetches and parses web pages
Summarizes findings with Chain-of-Thought reasoning
Delegates math-heavy subtasks to a specialized sub-agent
Runs untrusted code inside an E2B sandbox for safety

By the end you will have a reusable agent skeleton you can ship behind a FastAPI endpoint.

Step 1: Install smolagents

Create a fresh project and install the library along with the integrations we will need:

mkdir smol-research && cd smol-research
python -m venv .venv
source .venv/bin/activate
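pip install "smolagents[toolkit,litellm,e2b,telemetry]>=1.14" \
  "duckduckgo-search" "markdownify" "requests" "fastapi" "uvicorn"

The bracketed extras pull in:

toolkit: dependencies for the built-in default tools, such as web search and the webpage visitor
litellm: adapter for OpenAI, Anthropic, Groq, Ollama, and any LiteLLM-compatible provider
e2b: client libraries for secure remote code execution sandboxes
telemetry: the OpenTelemetry SDK and the OpenInference instrumentation used in Step 7

The version floor matters: InferenceClientModel, used below, went by the name HfApiModel in earlier releases. fastapi and uvicorn are only needed for Step 8.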

Export your tokens:

export HF_TOKEN="hf_xxx"
export OPENAI_API_KEY="sk-..."
export E2B_API_KEY="e2b_..."   # optional, only for the sandbox step

Step 2: Build Your First CodeAgent

Create agent.py. The smallest useful agent is three lines plus a model:

from smolagents import CodeAgent, InferenceClientModel, DuckDuckGoSearchTool
 
model = InferenceClientModel(model_id="meta-llama/Llama-3.3-70B-Instruct")
 
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=model,
    max_steps=6,
)
 
result = agent.run("What was the most cited AI paper of 2025?")
print(result)

Run it:

python agent.py

You will see the agent print its reasoning and the Python code it generated at each step. A typical first step looks like this:

# Step 1: agent-generated code
results = web_search(query="most cited AI paper 2025")
print(results)

Notice that the agent did not emit JSON. It wrote real Python that calls the search tool as a function; DuckDuckGoSearchTool is exposed to the generated code under the name web_search and returns a formatted string rather than a list. The runtime executes that Python in a restricted interpreter, captures print output, and feeds it back as the observation.
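The execute-and-observe loop can be sketched with the standard library alone. This is a simplified stand-in: it uses plain exec and a hypothetical fake_search stub, whereas the library runs generated code in its own restricted interpreter, so read it as an illustration of the data flow and not as the actual implementation.

```python
import contextlib
import io

def fake_search(query: str) -> str:
    """Hypothetical stub standing in for a real search tool."""
    return f"Top hit for {query!r}"

def run_step(code: str, tools: dict) -> str:
    """Execute one generated snippet and return what it printed."""
    buffer = io.StringIO()
    namespace = dict(tools)  # tools are exposed to the snippet as plain functions
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # plain exec for illustration; not a sandbox
    return buffer.getvalue()

generated = 'results = fake_search(query="most cited AI paper 2025")\nprint(results)'
observation = run_step(generated, {"fake_search": fake_search})

# The observation string is what the model would see on its next turn.
print(observation, end="")
```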

Switching Models

InferenceClientModel uses the Hugging Face Inference API. To switch to OpenAI or Anthropic, swap the model class:

import os

from smolagents import LiteLLMModel
 
# OpenAI
model = LiteLLMModel(model_id="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
 
# Anthropic
model = LiteLLMModel(
    model_id="anthropic/claude-haiku-4-5",
    api_key=os.environ["ANTHROPIC_API_KEY"],
)
 
# Ollama running locally
model = LiteLLMModel(model_id="ollama/llama3.1:8b", api_base="http://localhost:11434")

Everything else stays the same. Because a CodeAgent's actions are plain Python in the model's text output, the agent does not depend on each provider's tool-call schema, which keeps it provider-agnostic by design.

Step 3: Add a Custom Tool

The official toolkit covers the basics, but real agents need domain tools. A tool in smolagents is a Python function with a docstring and type hints — the framework generates the schema automatically.
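To make "generates the schema automatically" concrete, here is a standard-library sketch of the idea. describe_tool is a hypothetical helper, and its output format is invented for illustration; it is not the schema smolagents produces.

```python
import inspect
from typing import get_type_hints

def describe_tool(func) -> dict:
    """Build a schema-like description from a function's signature and docstring."""
    hints = get_type_hints(func)
    doc = inspect.getdoc(func) or ""
    summary, _, args_block = doc.partition("Args:")
    # Parse "name: description" lines from the Google-style Args block.
    arg_docs = {}
    for line in args_block.splitlines():
        name, sep, text = line.strip().partition(":")
        if sep and name in hints:
            arg_docs[name] = text.strip()
    inputs = {
        name: {"type": hints[name].__name__, "description": arg_docs.get(name, "")}
        for name in inspect.signature(func).parameters
    }
    return {
        "name": func.__name__,
        "description": summary.strip(),
        "inputs": inputs,
        "output_type": hints["return"].__name__,
    }

def add(a: int, b: int) -> int:
    """Add two integers.

    Args:
        a: The first addend.
        b: The second addend.
    """
    return a + b

schema = describe_tool(add)
print(schema)
```

The sketch shows why type hints and a structured docstring matter: without them there is nothing to derive the input descriptions from.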

Add a tool that fetches and converts a web page to clean Markdown:

from smolagents import tool

@tool
def fetch_page(url: str) -> str:
    """Fetch a web page and return its content as clean Markdown.

    Args:
        url: The full URL of the page to fetch.
    """
    # Imports live inside the function so the tool stays self-contained
    # if it is later serialized for a remote executor or shared on the Hub.
    import re

    import requests
    from markdownify import markdownify

    response = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    markdown = markdownify(response.text, heading_style="ATX")
    # Collapse runs of blank lines, then truncate to keep the observation small.
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)
    return markdown[:8000]

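The @tool decorator does three things:

Parses the function signature and type hints into a JSON Schema-like description of the inputs
Extracts the docstring as the tool description, including the per-argument text under Args:
Wraps the function in a Tool object that the agent can call by name

Pass the new tool to the agent alongside the built-in search tool: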

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), fetch_page],
    model=model,
    max_steps=8,
    additional_authorized_imports=["re", "json"],
)

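The additional_authorized_imports parameter controls which modules the agent's generated code may import. By default the interpreter allows only a short list of safe standard-library modules (math, statistics, datetime, re, and a few others), so anything else the generated code needs, such as json here, must be declared explicitly. The restriction applies to the code the model writes, not to the bodies of your tools, which is why fetch_page can use requests without listing it. This allowlist is one of the guardrails that limits what generated code can do.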
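The allowlist idea can be sketched with the standard library's ast module. This is an illustration under stated assumptions: check_imports and BASE_ALLOWED are hypothetical names, and the library's interpreter has its own enforcement logic.

```python
import ast

# Illustrative allowlist; the library's real default list is longer.
BASE_ALLOWED = {"math", "statistics", "datetime"}

def check_imports(code: str, extra_allowed: tuple[str, ...] = ()) -> list[str]:
    """Return the top-level modules imported by `code` that are not allowlisted."""
    allowed = BASE_ALLOWED | set(extra_allowed)
    blocked = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            continue
        for module in modules:
            root = module.split(".")[0]
            if root not in allowed and root not in blocked:
                blocked.append(root)
    return blocked

snippet = "import json\nimport math\nfrom os import path\n"
print(check_imports(snippet))             # ['json', 'os']
print(check_imports(snippet, ("json",)))  # ['os']
```

Authorizing json removes it from the blocked list while os stays rejected, which mirrors how additional_authorized_imports widens the default set one module at a time.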

Step 4: Reasoning, Planning, and Memory

For non-trivial questions a single round of search is not enough. smolagents has a built-in planning step you can enable:

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), fetch_page],
    model=model,
    max_steps=10,
    planning_interval=3,
    additional_authorized_imports=["re", "json"],
)

With planning_interval=3, every three steps the agent stops and writes a plain-language plan summarizing what it has learned and what it still needs. In our tests this reduces wandering on multi-hop questions by around 40 percent.
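As a rough mental model of the schedule, the sketch below lists the step numbers at which a planning step would fire. It assumes, based on our reading of the library and not on anything stated above, that a plan is written before step 1 and then again every planning_interval steps. planning_steps is a hypothetical helper, not a smolagents function.

```python
def planning_steps(max_steps: int, planning_interval: int) -> list[int]:
    """Step numbers (1-based) that are preceded by a planning step.

    Assumed rule: plan before step 1, then every `planning_interval` steps.
    """
    return [
        step
        for step in range(1, max_steps + 1)
        if (step - 1) % planning_interval == 0
    ]

print(planning_steps(max_steps=10, planning_interval=3))  # [1, 4, 7, 10]
print(planning_steps(max_steps=12, planning_interval=4))  # [1, 5, 9]
```

Under that assumption, a smaller interval buys more course corrections at the price of extra model calls, so it is worth tuning against max_steps.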

You can also inspect or persist the agent's memory:

from smolagents.memory import ActionStep

result = agent.run("Compare the carbon footprint of GPT-4o and Llama 3.1 inference.")

# memory.steps mixes step types; only ActionStep carries a step number and tool calls.
for step in agent.memory.steps:
    if isinstance(step, ActionStep):
        print(step.step_number, step.tool_calls)

agent.memory.steps is a list of typed step objects (TaskStep, ActionStep, and PlanningStep), which makes it convenient to log into Langfuse, OpenTelemetry, or your own observability stack.
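For logging, the steps usually need to be flattened into records. The sketch below shows one way to do that with JSON lines. The TaskStep and ActionStep dataclasses here are hypothetical stand-ins with only a few fields, not the library's classes, and to_log_rows is an invented helper.

```python
import json
from dataclasses import dataclass, field

# Hypothetical stand-ins for the library's step classes.
@dataclass
class TaskStep:
    task: str

@dataclass
class ActionStep:
    step_number: int
    tool_calls: list = field(default_factory=list)
    observations: str = ""

def to_log_rows(steps) -> list[str]:
    """Serialize heterogeneous steps to JSON lines, tolerating missing fields."""
    rows = []
    for step in steps:
        rows.append(json.dumps({
            "type": type(step).__name__,
            "step_number": getattr(step, "step_number", None),
            "tool_calls": getattr(step, "tool_calls", None),
            "observations": getattr(step, "observations", None),
        }))
    return rows

steps = [TaskStep("compare footprints"), ActionStep(1, ["web_search"], "3 results")]
rows = to_log_rows(steps)
for row in rows:
    print(row)
```

Using getattr with a default keeps the exporter from crashing on step types that lack a given field.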

Step 5: Multi-Agent Orchestration

Real workloads split cleanly across specialists. smolagents lets one agent manage others: give the sub-agent a name and a description, then pass it to the orchestrator through managed_agents. Early releases wrapped sub-agents in a ManagedAgent class; that wrapper has since been deprecated and removed, so in current releases the agent is passed directly.

Create a math-focused sub-agent:

from smolagents import ToolCallingAgent

managed_math = ToolCallingAgent(
    tools=[],
    model=model,
    name="math_solver",
    description=(
        "Performs precise numerical reasoning. Use this for any arithmetic or "
        "numerical comparison heavier than addition. Send it a self-contained "
        "math question and it returns the answer."
    ),
    max_steps=4,
)

ToolCallingAgent uses classic JSON tool calls, which is useful when the model is small and code generation is unreliable. The outer CodeAgent will still emit Python, but it can now call the inner agent as if it were a function:

research_agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), fetch_page],
    managed_agents=[managed_math],
    model=model,
    max_steps=12,
    planning_interval=4,
)
 
answer = research_agent.run(
    "Find the carbon footprint per 1000 tokens of GPT-4o and Llama 3.1 70B, "
    "then compute the ratio."
)

In a typical run the outer agent searches for the two figures, then writes Python like:

ratio = math_solver(task=f"Divide {gpt_co2} by {llama_co2} and round to 2 decimals")
final_answer(f"GPT-4o emits {ratio}x more CO2 per 1000 tokens.")

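The benefit is twofold: the math agent can run on a smaller, cheaper model (the example reuses model for brevity; pass a different model object to the sub-agent to realize the saving), and the orchestrator stays focused on retrieval.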
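The "sub-agent as a function" idea can be illustrated without the library. In the sketch below, SubAgent and divide_task are hypothetical stand-ins: a real sub-agent would call a model, while this toy parses the task string directly.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubAgent:
    """Hypothetical stand-in for a managed agent: a named, described callable."""
    name: str
    description: str
    solve: Callable[[str], str]

    def __call__(self, task: str) -> str:
        return self.solve(task)

def divide_task(task: str) -> str:
    # Toy "math agent": expects "Divide A by B" and returns the rounded ratio.
    words = task.lower().split()
    a, b = float(words[1]), float(words[3])
    return str(round(a / b, 2))

math_solver = SubAgent(
    name="math_solver",
    description="Use this for arithmetic heavier than addition.",
    solve=divide_task,
)

# The orchestrator's generated code sees the sub-agent as an ordinary function.
namespace = {math_solver.name: math_solver}
exec('ratio = math_solver(task="Divide 9 by 4")', namespace)  # illustration only
print(namespace["ratio"])
```

The name is what the generated code calls, and the description is what the orchestrating model reads when deciding whether to delegate.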

Step 6: Sandbox Execution with E2B

Letting an LLM run Python locally is fine for trusted prototypes, but production systems need isolation. smolagents integrates with E2B for remote sandboxes — each agent run gets a disposable cloud container.

import os

from smolagents import CodeAgent

research_agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), fetch_page],
    model=model,
    max_steps=12,
    executor_type="e2b",
    executor_kwargs={"api_key": os.environ["E2B_API_KEY"]},
)

That parameter change moves the generated code off your machine and into an ephemeral container. executor_kwargs is forwarded to the sandbox constructor, so sandbox-level options are set there; the rest of the agent definition is unchanged. Two caveats apply at the time of writing, to the best of our knowledge: smolagents rejects managed_agents in combination with a remote executor, which is why the sub-agent is left out here, and tools are shipped to the sandbox as source code, so their imports need to live inside the function body.
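The core idea of moving execution out of the agent's own process can be shown with a child process. This is only a local analogy with a hypothetical helper, run_isolated: a subprocess gives crash and timeout isolation, but unlike a remote sandbox it still shares the host's filesystem and network.

```python
import subprocess
import sys

def run_isolated(code: str, timeout_s: float = 10.0) -> str:
    """Run `code` in a separate Python process and return its stdout.

    Raises RuntimeError with the child's stderr if the snippet fails, and
    subprocess.TimeoutExpired if it runs longer than `timeout_s` seconds.
    """
    completed = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if completed.returncode != 0:
        raise RuntimeError(completed.stderr.strip())
    return completed.stdout

print(run_isolated("print(sum(range(5)))"), end="")
```

A failing snippet surfaces as an exception in the parent instead of taking the parent down, which is the property a sandboxed executor extends to the whole machine.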

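For local-only sandboxing without E2B, smolagents also supports a Docker-based executor (install the docker extra and keep the Docker daemon running):

research_agent = CodeAgent(
    tools=[fetch_page],
    model=model,
    # Executor defaults are used here; image options vary by release, so check
    # the DockerExecutor signature before passing executor_kwargs.
    executor_type="docker",
)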

Step 7: Streaming and Observability

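Long-running agent runs need feedback. Pass stream=True to run() to receive each step as soon as it completes instead of waiting for the final answer (token-level streaming is a separate option, stream_outputs, set on the agent constructor in recent releases):

for step in research_agent.run(
    "Summarize the top 5 AI papers of 2025",
    stream=True,
):
    # Each item is a step object; the last one carries the final answer.
    print(type(step).__name__, step)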
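The consumer side of a streamed run is an ordinary generator loop. The sketch below uses a hypothetical fake_run generator (not the smolagents API) to show the pattern: report progress for each intermediate event and keep the final one as the answer.

```python
from typing import Iterator

def fake_run(task: str) -> Iterator[dict]:
    """Hypothetical stand-in for a streamed agent run."""
    yield {"kind": "action", "step": 1, "text": f"searching for {task!r}"}
    yield {"kind": "action", "step": 2, "text": "reading the top result"}
    yield {"kind": "final", "text": "summary ready"}

def consume(events: Iterator[dict]) -> str:
    """Print progress as events arrive and return the final answer text."""
    final = ""
    for event in events:
        if event["kind"] == "final":
            final = event["text"]
        else:
            print(f"[step {event['step']}] {event['text']}", flush=True)
    return final

answer = consume(fake_run("top AI papers"))
print(answer)
```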

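For structured tracing, smolagents can emit OpenTelemetry spans once it is instrumented with the OpenInference instrumentor (the openinference-instrumentation-smolagents package). Configure an OTLP exporter pointing at Langfuse, Honeycomb, or any compatible backend, then activate the instrumentor:

from openinference.instrumentation.smolagents import SmolagentsInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Backend credentials are not shown; most hosted backends expect an
# Authorization header, which the exporter reads from OTEL_EXPORTER_OTLP_HEADERS.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces"))
)
trace.set_tracer_provider(provider)

# Patch smolagents so agent runs, tool calls, and model invocations emit spans.
SmolagentsInstrumentor().instrument(tracer_provider=provider)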

Every tool call, planning step, and model invocation now becomes a span you can drill into.

Step 8: Serve the Agent Behind FastAPI

Wrap the agent in a thin HTTP layer so it can be embedded in any product:

from fastapi import FastAPI
from pydantic import BaseModel
 
app = FastAPI()
 
class Question(BaseModel):
    question: str
 
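@app.post("/research")
def research(payload: Question):
    answer = research_agent.run(payload.question)
    # final_answer can return any Python object, so coerce it for the JSON response.
    return {"answer": str(answer), "steps": len(research_agent.memory.steps)}

Save this as main.py (with the agent definition in the same file or imported into it), run it with uvicorn main:app --reload --port 8000, and the agent is reachable at /research. A single shared agent object keeps per-run state in its memory, so concurrent requests should each get their own agent or be serialized. Pair the endpoint with a queue like Inngest or Trigger.dev if your tasks routinely exceed 30 seconds.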
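One way to keep a slow run from tying up a request indefinitely is to execute the blocking call in a worker thread with a deadline. The sketch below uses only asyncio; slow_agent_run is a hypothetical stand-in for research_agent.run, and the timeout values are arbitrary.

```python
import asyncio
import time

def slow_agent_run(question: str, seconds: float) -> str:
    """Hypothetical stand-in for a blocking agent call."""
    time.sleep(seconds)
    return f"answer to {question!r}"

async def run_with_deadline(question: str, seconds: float, deadline_s: float) -> dict:
    """Run the blocking call in a thread; report a timeout instead of hanging."""
    try:
        answer = await asyncio.wait_for(
            asyncio.to_thread(slow_agent_run, question, seconds),
            timeout=deadline_s,
        )
        return {"status": "done", "answer": answer}
    except asyncio.TimeoutError:
        # The worker thread keeps running in the background; only the wait is abandoned.
        return {"status": "timeout", "answer": None}

fast = asyncio.run(run_with_deadline("q1", seconds=0.01, deadline_s=1.0))
slow = asyncio.run(run_with_deadline("q2", seconds=0.5, deadline_s=0.05))
print(fast["status"], slow["status"])
```

Because a thread cannot be cancelled once started, the deadline only protects the caller. Long tasks are still better handed to a queue and polled for results.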

Testing Your Implementation

Smoke test the full pipeline with a question that requires both retrieval and math:

curl -X POST http://localhost:8000/research \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the throughput difference between vLLM and TGI on Llama 3 70B?"}'

You should get a response in 15 to 40 seconds depending on the model. The JSON payload includes the final answer and the number of reasoning steps consumed.

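For deterministic regression tests, replace the model with a stub. We are not aware of a FakeModel class in the smolagents public API, so the sketch below defines one by subclassing Model. The generate signature and the code-block format the parser expects have both changed between releases, so adjust them to the version you pin:

from smolagents.models import ChatMessage, Model

class FakeModel(Model):  # hypothetical test double, not part of smolagents
    def __init__(self, responses):
        super().__init__()
        self._responses = iter(responses)

    def generate(self, messages, **kwargs):
        return ChatMessage(role="assistant", content=next(self._responses))

fake = FakeModel(responses=[
    "Thought: I should search.\nCode:\n```py\nresults = web_search(query='vLLM vs TGI')\nprint(results)\n```",
    "Thought: I have enough.\nCode:\n```py\nfinal_answer('vLLM is roughly 2x faster.')\n```",
])
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=fake)
assert "2x" in agent.run("vLLM vs TGI throughput?")

FakeModel plays back canned responses, letting you assert on tool sequences without spending tokens. The first canned step still calls the real search tool, so swap in a stub tool as well if the test must run offline.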

Troubleshooting

The agent keeps importing forbidden modules. Add the module to additional_authorized_imports. If you genuinely need the full standard library, set additional_authorized_imports=["*"] — but only inside a sandbox.

Tool descriptions look wrong in traces. Make sure every parameter has a type hint and the docstring follows the Google style with an Args: block. The parser is strict.

Sub-agents never get called. The orchestrator decides when to delegate based on the description you give the sub-agent. Write descriptions in imperative voice and mention concrete use cases: "Use this for currency conversions or unit math."

E2B sandbox times out. Default container lifetime is 5 minutes. For longer runs, pass executor_kwargs={"timeout": 1800} (in seconds).

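Tokens spike on rerun. smolagents memory grows linearly with steps, and the accumulated memory is sent again on every model call. run() starts from a clean memory by default (reset=True); if you pass reset=False to carry context across chat-style turns, call agent.memory.reset() once the earlier context is no longer needed.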
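The cost of a growing memory is easy to estimate. Assuming each model call re-sends a fixed prompt plus everything earlier steps added (the token figures below are made-up round numbers, not measurements), total input tokens grow quadratically with the step count:

```python
def total_input_tokens(steps: int, base_prompt: int, per_step: int) -> int:
    """Sum the input tokens sent over `steps` model calls.

    Call k (1-based) sends the base prompt plus the k-1 earlier steps.
    """
    return sum(base_prompt + (k - 1) * per_step for k in range(1, steps + 1))

# Hypothetical figures: 1,000-token prompt, 500 tokens added per step.
print(total_input_tokens(steps=6, base_prompt=1000, per_step=500))   # 13500
print(total_input_tokens(steps=12, base_prompt=1000, per_step=500))  # 45000
```

Doubling the step count from 6 to 12 more than triples the input tokens in this model, which is why resetting memory between unrelated turns pays off quickly.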

Next Steps

Explore the smolagents Hub tools — pre-built tools you can pull with one line
Combine smolagents with our LangGraph tutorial for a comparison of code-first vs graph-first agents
Add evaluation with our Promptfoo guide so changes do not silently regress quality
Deploy with Modal Labs serverless GPUs for autoscaling inference

Conclusion

smolagents bets that the right output for an LLM agent is code, not JSON — and the bet has been paying off in benchmarks and production deployments alike. Across this tutorial you stood up a CodeAgent, added a custom tool, layered a sub-agent for specialized work, and moved execution into a sandboxed container, all in a few hundred lines of Python.

The library is intentionally small. Once you understand the four primitives — model, tool, agent, executor — you can read the entire source in an afternoon and customize anything. That is rare in the agent space, and it is exactly what makes smolagents a good default when you want fewer abstractions and more control.