Six months after your AI feature shipped, the same questions keep coming up in stand-up. Why did p95 latency on the assistant double last week? Why did the OpenAI bill jump 38 percent month over month with no traffic change? Why is one customer reporting that the agent keeps suggesting the same broken tool call? Nobody on the team can answer with confidence, because the only observability is application logs and a Datadog dashboard that knows nothing about prompts, tokens, or traces.
This is the gap that LLM observability platforms exist to fill. In 2026 three projects dominate the conversation: Langfuse, Helicone, and LangSmith. They overlap in the marketing copy and differ sharply in the day-to-day. This piece walks through what each one is actually good at, where the seams show, and how to pick one without locking yourself in.
What LLM Observability Actually Means
Application observability — logs, metrics, traces — assumes deterministic code. The same input produces the same output, errors are typed, and a stack trace points at a line number. None of that holds for an LLM call. The same prompt, the same model, the same temperature can produce different outputs across runs. Errors are usually semantic, not exceptional. A "successful" call can return hallucinated JSON that crashes the next step three seconds later.
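To make the "successful call, broken payload" case concrete, here is a minimal sketch of recording a semantic failure alongside a transport-level success. The `CallRecord` shape and the `REQUIRED_KEYS` schema are assumptions made up for illustration, not part of any platform's API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    """What gets logged for one LLM call (hypothetical shape)."""
    http_ok: bool                          # transport-level success
    raw_output: str                        # full completion text
    semantic_error: Optional[str] = None   # set when the payload is unusable


# Assumed schema for a tool-call payload; adjust to your own structured output.
REQUIRED_KEYS = {"tool", "arguments"}


def check_output(raw_output: str) -> CallRecord:
    """Flag a semantic failure even though the call itself succeeded."""
    record = CallRecord(http_ok=True, raw_output=raw_output)
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        record.semantic_error = f"invalid JSON: {exc.msg}"
        return record
    if not isinstance(payload, dict):
        record.semantic_error = "expected a JSON object"
        return record
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        record.semantic_error = f"missing keys: {sorted(missing)}"
    return record
```

Logging the `semantic_error` next to the raw output is what lets you find the call that caused a crash three seconds later, instead of only the step that crashed.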
So LLM observability has to capture a different shape of data. The minimum useful set:
- Traces — the full tree of LLM calls, tool calls, retrievals, and intermediate decisions for a single user request. Without this, debugging multi-step agents is guesswork.
- Inputs and outputs in full — not just metadata. You need the actual prompt that went in and the actual completion that came back, including system prompts, tool definitions, and structured-output schemas.
- Token counts and cost per request — broken down by model, by user, by feature, by tenant. The OpenAI dashboard cannot answer "which feature is driving the bill" because it has no application context.
- Latency at every node — model latency, network latency, queue time, tool execution time. Aggregating end-to-end latency hides the actual bottleneck.
- User feedback signals — thumbs up, thumbs down, regenerate, copied output. The closed loop between user reaction and trace is where evals start.
- Evaluation results — programmatic or LLM-as-judge scores on production traces, run continuously, not just in CI.
- Prompt versions — what prompt was deployed when, who changed it, and which traces ran against which version.
A platform that gives you fewer than five of those is not observability; it is logging with a nicer UI. With that bar set, here is how the three big players line up.
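As a quick self-audit, the checklist can be written down as a small sketch. The `Signal` names and the helper functions are hypothetical; only the seven signals and the five-signal bar come from the text above.

```python
from enum import Enum


class Signal(Enum):
    TRACES = "traces"
    FULL_IO = "full inputs and outputs"
    TOKENS_AND_COST = "token counts and cost per request"
    NODE_LATENCY = "latency at every node"
    USER_FEEDBACK = "user feedback signals"
    EVAL_RESULTS = "evaluation results"
    PROMPT_VERSIONS = "prompt versions"


MIN_SIGNALS = 5  # the bar stated above


def missing_signals(captured: set[Signal]) -> list[str]:
    """Names of the signals a setup does not capture, in checklist order."""
    return [s.value for s in Signal if s not in captured]


def clears_the_bar(captured: set[Signal]) -> bool:
    """True when at least five of the seven signals are captured."""
    return len(captured) >= MIN_SIGNALS
```

Running this against what your current logging actually stores is a fast way to see which gaps a platform would need to close.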
Langfuse: The Open-Source Default
Langfuse is an open-source LLM engineering platform with a hosted cloud tier and a fully self-hostable Docker deployment. It is the project that most teams converge on when they want serious tracing without committing to a closed vendor.
The data model is the strongest in the category. A trace is a tree of nested observations, where each observation is a span, a generation, or an event, and scores can be attached to a trace or to any observation inside it. You can attach arbitrary metadata, user IDs, session IDs, tags, and structured input/output payloads at any level. This sounds boring until you try to filter "all traces for tenant 42 in the last 24 hours where the final tool call was payment_charge and the user rating was negative". Langfuse answers that query in two clicks.
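To show why the tree shape matters for that kind of filter, here is a simplified in-memory sketch. It is not Langfuse's schema or query API; the `Observation` and `Trace` classes, the `tool` field, and `find_traces` are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterator, Optional


@dataclass
class Observation:
    kind: str                       # "span", "generation", or "event"
    name: str
    tool: Optional[str] = None      # set when this observation is a tool call
    children: list["Observation"] = field(default_factory=list)


@dataclass
class Trace:
    tenant_id: int
    started_at: datetime
    root: Observation
    user_rating: Optional[int] = None   # -1 negative, +1 positive, None unrated


def walk(obs: Observation) -> Iterator[Observation]:
    """Pre-order traversal, which matches execution order in this sketch."""
    yield obs
    for child in obs.children:
        yield from walk(child)


def final_tool_call(trace: Trace) -> Optional[str]:
    tools = [o.tool for o in walk(trace.root) if o.tool is not None]
    return tools[-1] if tools else None


def find_traces(traces: list[Trace], tenant_id: int, tool: str, now: datetime) -> list[Trace]:
    """Tenant's traces from the last 24 hours that ended on `tool` and were rated negatively."""
    return [
        t for t in traces
        if t.tenant_id == tenant_id
        and now - t.started_at <= timedelta(hours=24)
        and final_tool_call(t) == tool
        and t.user_rating is not None and t.user_rating < 0
    ]
```

With flat request logs, "the final tool call" is not something you can express at all; with a tree it is one traversal.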
Instrumentation is provider-agnostic. The Python and TypeScript SDKs wrap any LLM client; there are first-class integrations for OpenAI, Anthropic, LangChain, LlamaIndex, and the Vercel AI SDK, plus a generic OpenTelemetry-compatible mode that takes traces from anything that emits OTel. A typical integration:
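from langfuse.decorators import langfuse_context, observe
from langfuse.openai import openai  # drop-in wrapper that records each generation

@observe()  # opens a trace for every call to this function
def answer_question(user_id: str, question: str):
    # Attach user and tag context to the current trace
    langfuse_context.update_current_trace(user_id=user_id, tags=["beta"])
    return openai.chat.completions.create(
        model="gpt-5-turbo",
        messages=[{"role": "user", "content": question}],
    )

The @observe decorator creates a trace and attaches the OpenAI generation as a nested observation automatically. The langfuse_context.update_current_trace(...) call attaches the user ID and tags, and the trace shows up in the UI with full input/output, token counts, and latency broken down per step. These import paths match the decorator-based Python SDK; newer SDK releases may expose observe from the top-level langfuse package, so check the version you have installed.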
The evals story is where Langfuse pulled ahead in the last twelve months. You can define evaluators in the UI (LLM-as-judge, custom Python, regex) and run them automatically on a sample of production traces, on demand against historical traces, or in CI against a fixed dataset. The same evaluator runs in all three contexts, so the score you tune on a CI dataset is the score you watch on production traffic. Few platforms get this loop right.
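The "same evaluator in every context" idea can be sketched without any platform at all. The code below is an assumption-laden illustration, not Langfuse's evaluator API: `cites_source`, `score_ci_dataset`, and `score_production_sample` are hypothetical names, and the regex check stands in for whatever evaluator you actually care about.

```python
import random
import re
from typing import Callable, Optional, Sequence

# An evaluator maps (input, output) to a score in [0, 1].
Evaluator = Callable[[str, str], float]

CITATION = re.compile(r"\[\d+\]")


def cites_source(question: str, answer: str) -> float:
    """Regex evaluator: 1.0 if the answer contains a marker like [1], else 0.0."""
    return 1.0 if CITATION.search(answer) else 0.0


def mean_score(evaluator: Evaluator, pairs: Sequence[tuple[str, str]]) -> Optional[float]:
    """Average score over (input, output) pairs; None when there is nothing to score."""
    if not pairs:
        return None
    return sum(evaluator(q, a) for q, a in pairs) / len(pairs)


def score_ci_dataset(evaluator: Evaluator, dataset: Sequence[tuple[str, str]]) -> Optional[float]:
    """CI context: score every row of a fixed dataset."""
    return mean_score(evaluator, dataset)


def score_production_sample(
    evaluator: Evaluator,
    traces: Sequence[tuple[str, str]],
    sample_rate: float,
    seed: int = 0,
) -> Optional[float]:
    """Production context: score a random sample of traces with the same evaluator."""
    rng = random.Random(seed)
    sampled = [t for t in traces if rng.random() < sample_rate]
    return mean_score(evaluator, sampled)
```

Because both entry points call the same `evaluator`, a threshold tuned in CI means the same thing on the production chart.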
Prompt management is built in. Prompts live as versioned objects with optional A/B labels (production, staging, experimental), and the SDK fetches them by label at runtime. Combined with traces, you can answer "did the click-through rate drop after we shipped prompt v17" without leaving the platform.
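The label mechanism is easy to picture as a small registry. This is a hypothetical in-memory sketch, not the Langfuse SDK: `PromptRegistry`, `publish`, `set_label`, and `get` are invented names that only mirror the idea of versioned prompts fetched by label.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    # name -> list of templates; list index + 1 is the version number
    versions: dict[str, list[str]] = field(default_factory=dict)
    # (name, label) -> version number the label currently points at
    labels: dict[tuple[str, str], int] = field(default_factory=dict)

    def publish(self, name: str, template: str) -> int:
        """Store a new immutable version and return its number."""
        self.versions.setdefault(name, []).append(template)
        return len(self.versions[name])

    def set_label(self, name: str, label: str, version: int) -> None:
        """Point a label such as 'production' at an existing version."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self.labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> tuple[int, str]:
        """Resolve a label at runtime; the version is returned so the trace can record it."""
        version = self.labels[(name, label)]
        return version, self.versions[name][version - 1]
```

Returning the version number alongside the template is the design choice that matters: stamping it on every trace is what makes "which traces ran against v17" answerable later.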
Self-hosting is the differentiator that wins enterprise deals in regulated industries. The full platform runs on Postgres plus ClickHouse for analytics (recent releases also expect Redis and S3-compatible blob storage), deploys with Docker Compose or Helm, and never sends a byte to Langfuse Cloud. For MENA businesses dealing with data residency requirements (Saudi PDPL, UAE PDPL, Tunisian Loi 63), this is often non-negotiable.
The trade-off: Langfuse is a platform, not a product. Onboarding is heavier than the SDKs suggest. The dashboard is dense, the eval pipeline assumes you understand what an LLM-as-judge actually measures, and the self-hosted stack needs someone who can run ClickHouse without flinching. For teams shipping AI features in a hurry, this is friction.
Helicone: The Proxy-First Cost Tracker
Helicone takes a different architectural bet. Instead of asking you to instrument your code, it sits as a proxy in front of the model provider. You change one URL — api.openai.com becomes oai.helicone.ai — add an auth header, and every request flows through Helicone's edge with full logging, caching, rate limiting, and cost tracking.
import os
from openai import OpenAI

HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]
user_id = "user-123"  # placeholder; use the ID of the authenticated user

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # route requests through the Helicone proxy
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_API_KEY}",
        "Helicone-User-Id": user_id,  # enables per-user cost breakdowns
    },
)

Three lines, zero refactor. For teams with messy codebases where instrumentation would mean touching forty files, this is genuinely magic. For teams running serverless functions or edge workers where SDK weight matters, the proxy approach also avoids the cold-start tax of a heavyweight tracing SDK.
The cost-control surface is the strongest reason to pick Helicone. Built-in features that take days to wire up elsewhere:
- Per-user and per-organization cost dashboards out of the box, indexed by the headers you pass
- Response caching with configurable TTLs at the proxy layer, transparent to the application
- Rate limiting per user, per API key, per model, with custom rules ("free tier users get 50 requests per day to gpt-5")
- Retries and fallbacks to a different model when the primary returns an error, configured in the dashboard
- Cost alerts that page you when a single user crosses a threshold (the classic prompt-injection-driven bill explosion)
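For a sense of what the proxy is doing on your behalf, here is a rough sketch of two of those controls: a per-user daily request limit and a one-time cost alert. It is not Helicone's implementation or configuration format; `UsageGuard` and its fields are hypothetical, and the 20 USD threshold is an arbitrary assumption (the 50-requests-per-day figure is the example from the list).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageGuard:
    daily_request_limit: int = 50     # e.g. the free-tier rule quoted above
    cost_alert_usd: float = 20.0      # assumed threshold, pick your own
    requests: dict = field(default_factory=lambda: defaultdict(int))   # (user, day) -> count
    spend: dict = field(default_factory=lambda: defaultdict(float))    # user -> total USD
    alerted: set = field(default_factory=set)                          # users already paged for

    def allow(self, user_id: str, day: str) -> bool:
        """Count the request and return False once the user's daily limit is used up."""
        key = (user_id, day)
        if self.requests[key] >= self.daily_request_limit:
            return False
        self.requests[key] += 1
        return True

    def record_cost(self, user_id: str, cost_usd: float) -> bool:
        """Add spend; return True only the first time the user crosses the threshold."""
        self.spend[user_id] += cost_usd
        if self.spend[user_id] >= self.cost_alert_usd and user_id not in self.alerted:
            self.alerted.add(user_id)
            return True
        return False
```

Writing even this much yourself, then making it correct under concurrency and across processes, is the "days to wire up elsewhere" that the proxy saves.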
The tracing model is flatter than Langfuse's. Helicone groups requests into sessions by a header you pass, but does not natively express the parent-child relationships that multi-step agents need. There is a custom-properties model that gets you most of the way, but if you are building deeply nested agentic flows, you will feel the constraint.
Evals exist but are less mature. Prompt management is improving but still secondary to the cost-tracking story.
The honest positioning: Helicone is the right choice when your top operational pain is cost visibility and rate limiting, when your stack is mostly direct OpenAI/Anthropic API calls, and when you do not want to touch application code. It is a weaker fit when you are building agent frameworks where the trace tree is the primary debugging artifact.
Self-hosting is supported via Docker, but the proxy architecture means you also have to run a high-availability ingress in front of every model call, which raises the operational bar.
LangSmith: The LangChain-Native Power Tool
LangSmith is built and operated by the LangChain team, and that origin shows in every product decision. If your stack is LangChain or LangGraph, LangSmith is the path of least resistance. Set the tracing flag and an API key as environment variables (the project name is optional) and every chain, agent, and graph execution shows up as a trace, with the right nesting and the right metadata, no decorators required:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
export LANGCHAIN_PROJECT=production

For LangGraph users in particular, LangSmith renders the graph topology, the state transitions between nodes, and the inputs/outputs of each node with a fidelity no competitor matches. Debugging a stuck agent loop in LangGraph without LangSmith is painful; with it, you can replay any trace from any node forward, swap a prompt, and re-run.
The evaluation framework is the most opinionated in the market. LangSmith ships pairwise evaluators, reference-based evaluators, custom evaluators, and dataset-driven regression suites with first-class support for human review queues. The "Annotation Queue" feature — where you push production traces into a labeling UI for human reviewers — is the cleanest implementation of the "humans in the eval loop" pattern that you will find anywhere.
Prompt management is mature, with a public prompt hub (forkable templates), version history, and a playground that lets you swap models and re-run any prompt against any dataset.
Three things to know before you commit.
First, LangSmith is opinionated toward the LangChain ecosystem. The generic SDK works with any LLM call, but the magic — the auto-tracing, the graph rendering, the deep introspection of agent state — needs LangChain or LangGraph. If your stack is the Vercel AI SDK, Pydantic AI, Mastra, or hand-rolled OpenAI calls, you get a thinner experience.
Second, LangSmith is a hosted product. A self-hosted enterprise tier exists, but pricing is gated by sales, and the deployment story is heavier than Langfuse's. For teams that need on-premise from day one and have no appetite for buying through the AWS marketplace, this is friction.
Third, the pricing has shifted upward as the platform matured. Per-trace billing past the free tier is reasonable at small scale but compounds quickly when you cross into millions of traces per month, which is where production AI products live. Budget for it before you commit.
Picking One: A Decision Framework
The honest take after deploying all three in different projects:
- Pick Langfuse if you want the most balanced platform, value open source and self-hosting, and have at least one engineer who can own the observability stack. It is the safest long-term bet for a serious AI product team that wants to avoid vendor lock-in.
- Pick Helicone if cost visibility and rate limiting are your top pains, your application is mostly direct API calls without deep agent nesting, and you want to ship the integration in an afternoon without touching application code. It is the fastest path to "we now know what the bill is going to be."
- Pick LangSmith if you are already on LangChain or LangGraph and the agent topology is the primary thing you need to debug. The graph-aware tracing is genuinely a tier above what the other two do for that specific workload.
These are not mutually exclusive. The pragmatic pattern many teams converge on: Helicone as the proxy for cost control and rate limiting (because the integration is free and the cost dashboard pays for itself in a week), plus Langfuse or LangSmith for traces, evals, and prompt management. The two layers do not fight; they capture different signals.
The Cost Question
A note on what these platforms actually cost in production. For an app doing one million LLM calls per month with average trace depth of three observations per call (so roughly three million observations), expect to spend somewhere in the range of 200 to 500 USD per month on the hosted tiers. Self-hosted Langfuse on a single Hetzner box handles that volume for well under 50 USD per month in infrastructure, at the cost of one engineer-day per month of maintenance.
Compare that to the model bill: the same one million calls on gpt-5 with an average 2K context cost north of 8,000 USD per month, and observability is a rounding error. Teams that skip it because of cost are usually optimizing the wrong variable. The right framing: spending roughly 3 to 6 percent of the model bill to know exactly where the rest is going.
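Working the ratio through with the figures quoted in this section gives a range rather than a single number. The sketch below only restates those figures; the constant and function names are made up for illustration.

```python
CALLS_PER_MONTH = 1_000_000
OBSERVATIONS_PER_CALL = 3
MODEL_BILL_USD = 8_000            # "north of 8,000 USD", taken as a floor
HOSTED_RANGE_USD = (200, 500)     # hosted observability tiers
SELF_HOSTED_INFRA_USD = 50        # upper bound quoted for a single box


def share_of_model_bill(observability_usd: float, model_usd: float = MODEL_BILL_USD) -> float:
    """Observability spend as a fraction of the model bill."""
    return observability_usd / model_usd


observations = CALLS_PER_MONTH * OBSERVATIONS_PER_CALL
hosted_low, hosted_high = (share_of_model_bill(x) for x in HOSTED_RANGE_USD)
self_hosted = share_of_model_bill(SELF_HOSTED_INFRA_USD)

print(f"{observations:,} observations per month")
print(f"hosted: {hosted_low:.1%} to {hosted_high:.1%} of the model bill")
print(f"self-hosted infra: under {self_hosted:.2%} (excluding engineer time)")
```

Since the model bill is stated as a floor, the real shares would be at or below these values; the self-hosted figure leaves out the engineer-day of maintenance.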
What To Build First
For a team standing up LLM observability from zero, the highest-leverage order:
- Get traces flowing. Pick one of the three, instrument the highest-traffic path, accept that the first week of traces will look ugly. Visibility beats elegance.
- Attach user IDs. Without user IDs on traces, you cannot answer the customer-support ticket "why did the bot do this to my account on Tuesday."
- Tag traces by feature. Without feature tags, you cannot answer "which feature is driving the bill" — and that question will come within the first month.
- Add user feedback capture. Thumbs up and thumbs down on every assistant message, tied to the trace ID. This is the dataset that all future evals depend on.
- Wire up one eval. Start with a single LLM-as-judge on a single trace type — "did the agent answer the user's actual question, yes or no." Run it on every production trace. Watch the score over time.
- Then talk about prompt management. Until you have traces, user IDs, feature tags, feedback, and at least one eval, prompt versioning is premature.
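The first four steps amount to a handful of fields on every trace. As a rough sketch under that assumption, with `TraceRow` and the helper functions being hypothetical names rather than any platform's schema:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceRow:
    trace_id: str
    user_id: str                      # step 2: who triggered the trace
    feature: str                      # step 3: which product feature it belongs to
    cost_usd: float
    feedback: Optional[int] = None    # step 4: +1 thumbs up, -1 thumbs down, None = no signal


def cost_by_feature(rows: list[TraceRow]) -> dict[str, float]:
    """Answers 'which feature is driving the bill'."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row.feature] += row.cost_usd
    return dict(totals)


def traces_for_user(rows: list[TraceRow], user_id: str) -> list[TraceRow]:
    """Answers the support ticket 'why did the bot do this to my account'."""
    return [row for row in rows if row.user_id == user_id]


def attach_feedback(rows: list[TraceRow], trace_id: str, value: int) -> bool:
    """Tie a thumbs up/down to its trace; False if the trace ID is unknown."""
    for row in rows:
        if row.trace_id == trace_id:
            row.feedback = value
            return True
    return False


def negative_feedback_rate(rows: list[TraceRow]) -> Optional[float]:
    """Share of rated traces with a thumbs down; None if nothing is rated yet."""
    rated = [row for row in rows if row.feedback is not None]
    if not rated:
        return None
    return sum(1 for row in rated if row.feedback < 0) / len(rated)
```

None of these questions can be answered retroactively, which is why the IDs and tags come before evals and prompt versioning in the list.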
Most teams try to start with prompt management and an eval framework, then bolt on tracing later when they hit a customer-facing incident. The order should be reversed. Tracing is the foundation; everything else is a layer on top.
Where This Goes Next
The frontier in 2026 is moving toward agent observability — tools that understand the difference between an LLM call, a tool call, a planning step, and a memory write, and that can replay agent trajectories with state mutations intact. Langfuse and LangSmith are both shipping in this direction; expect significant feature movement over the next two quarters.
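What "replay with state mutations intact" could mean in practice is easiest to see in miniature. This is a speculative sketch, not how either platform models agents: the `Step` type, its `kind` values, and `replay` are assumptions chosen to mirror the four step types named above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Step:
    kind: str                                   # "llm_call", "tool_call", "planning_step", "memory_write"
    name: str
    writes: tuple[tuple[str, Any], ...] = ()    # (key, value) pairs; only memory writes mutate state


def replay(steps: list[Step], upto: int) -> dict[str, Any]:
    """Rebuild agent memory as it stood after the first `upto` steps."""
    state: dict[str, Any] = {}
    for step in steps[:upto]:
        if step.kind == "memory_write":
            state.update(dict(step.writes))
    return state


def step_counts(steps: list[Step]) -> Counter:
    """How many steps of each kind a trajectory contains."""
    return Counter(step.kind for step in steps)
```

Being able to ask "what did the agent believe at step 3" is the capability that flat request logs cannot give you, whichever platform ends up shipping it.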
For deeper context on how to structure agent flows that are actually observable, our AI agent evaluation guide covers the metric framework, and the AI agent memory piece covers the state-mutation surface that observability platforms have to model.
The team that wins the next twelve months of AI engineering will not be the team with the best prompts. It will be the team that knows, at any moment, exactly which prompts ran, what they cost, what they returned, and what the user did next. That is what these three platforms exist to make possible.