Tutorial · May 20, 2026 · 28 min read

LLM Observability with Langfuse: Trace, Debug, and Improve AI Apps in Next.js (2026)

Learn how to add production-grade observability to your Next.js AI application using Langfuse. This tutorial covers LLM tracing, user session tracking, feedback collection, prompt versioning, and automated evaluation — all in TypeScript.

Every team building AI-powered applications eventually hits the same wall: the app works in development, but in production things go wrong in ways you cannot explain. A user reports a bad answer. Costs spike overnight. Latency jumps on specific input types. You have no idea which prompt version is responsible or which users are affected.

Traditional APM tools measure infrastructure — CPU, memory, request duration. They tell you a request took 900ms, but nothing about why the model gave a poor response, which prompt template is underperforming, or how token costs are trending by user segment. You are flying blind.

Langfuse is an open-source LLM engineering platform built specifically for this gap. It gives your AI app the kind of observability that Datadog gives your infrastructure: full trace trees, prompt versioning with A/B testing, user feedback loops, automated evaluation pipelines, and a cost analytics dashboard, all in a self-hostable package.

In this tutorial you will build a Next.js AI chatbot and instrument it end-to-end with Langfuse, covering everything from basic tracing to production evaluation.

Prerequisites

Before starting, make sure you have:

  • Node.js 20+ installed (node -v to confirm)
  • A Next.js 15 project (or create one with npx create-next-app@latest)
  • An OpenAI API key (or any other LLM provider)
  • A free Langfuse account at cloud.langfuse.com — or Docker for self-hosting
  • Solid TypeScript knowledge and familiarity with the Next.js App Router

What You Will Build

By the end of this tutorial your chatbot will have:

  • Full LLM tracing — every AI request captured in Langfuse with inputs, outputs, latency, and token counts
  • Nested spans — sub-operations within a request (retrieval, reranking, generation) tracked separately
  • User and session context — traces grouped by user ID and conversation session
  • User feedback — thumbs up/down UI wired directly to Langfuse scores
  • Prompt management — prompts fetched from the Langfuse dashboard instead of hardcoded strings
  • Automated evaluation — LLM-as-judge scoring running asynchronously after each response

Together these pieces cover the core of an observability stack for a production AI application.

Why LLM Observability Is Non-Negotiable in 2026

The shift from prototype to production AI exposes a class of bugs that standard monitoring misses entirely:

  • Prompt regressions — a wording change that seemed harmless degrades answer quality for a specific query pattern
  • Context window misuse — messages getting silently truncated once conversation history grows past a threshold
  • Hallucination clusters — a subset of inputs that reliably triggers confident wrong answers
  • Token cost anomalies — one edge-case query spending 50x more tokens than expected
  • Latency outliers — p99 latency 10x the median because of a specific model or prompt branch

Without traces you can only react after users complain. With Langfuse you get complete visibility to catch these issues before they affect retention.

Step 1: Install Dependencies

Start in your Next.js project root:

npm install langfuse openai

You will also need environment variables. Create or update your .env.local:

# OpenAI
OPENAI_API_KEY=sk-...
 
# Langfuse — from your project settings at cloud.langfuse.com
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com

For a self-hosted instance, replace LANGFUSE_BASE_URL with your own domain (e.g. https://langfuse.yourcompany.com).
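
A missing or mistyped key is an easy mistake to make here, so it can help to validate these variables once at startup. The following is a minimal sketch of that idea. `readLangfuseEnv` and `LangfuseEnv` are hypothetical names, not part of the Langfuse SDK, and the prefix checks assume the `sk-lf-` / `pk-lf-` key format shown above.

```typescript
// Hypothetical helper, not part of the Langfuse SDK.
interface LangfuseEnv {
  secretKey: string;
  publicKey: string;
  baseUrl: string;
}

function readLangfuseEnv(env: Record<string, string | undefined>): LangfuseEnv {
  const secretKey = env.LANGFUSE_SECRET_KEY ?? "";
  const publicKey = env.LANGFUSE_PUBLIC_KEY ?? "";
  // Fall back to the cloud URL, matching the default used in this tutorial
  const baseUrl = env.LANGFUSE_BASE_URL ?? "https://cloud.langfuse.com";

  if (!secretKey.startsWith("sk-lf-")) {
    throw new Error("LANGFUSE_SECRET_KEY is missing or does not start with sk-lf-");
  }
  if (!publicKey.startsWith("pk-lf-")) {
    throw new Error("LANGFUSE_PUBLIC_KEY is missing or does not start with pk-lf-");
  }
  // A cheap sanity check that catches a URL pasted without its scheme
  if (!/^https?:\/\//.test(baseUrl)) {
    throw new Error("LANGFUSE_BASE_URL must start with http:// or https://");
  }

  return { secretKey, publicKey, baseUrl };
}
```

Calling `readLangfuseEnv(process.env)` when the server boots would turn a silent "no traces" problem into an immediate, readable error.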

Step 2: Initialize the Langfuse Client

Create a module-level singleton so a new client is not constructed on every request. In serverless environments each function instance (AWS Lambda, Vercel) is isolated, so the singleton is initialized once per cold start and then reused by every warm invocation that instance handles. It is never shared across instances.

// lib/langfuse.ts
import { Langfuse } from "langfuse";
 
export const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  baseUrl: process.env.LANGFUSE_BASE_URL ?? "https://cloud.langfuse.com",
  flushAt: 1,      // send immediately in serverless (no long-running process)
  flushInterval: 0, // disable timer-based flushing
});

Serverless flushing is critical. In a traditional server the Langfuse SDK batches and flushes events in the background. In serverless (Vercel, AWS Lambda, Cloudflare Workers) the process is killed after the response is sent — before the batch timer fires. Setting flushAt: 1 and always calling await langfuse.flushAsync() at the end of each handler ensures no events are lost.
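
One way to make the "always flush" rule hard to forget is to wrap handlers so the flush runs in a `finally` block. This is a sketch under assumptions: `withFlush` is a hypothetical helper, and `flush` stands in for `() => langfuse.flushAsync()` so the example does not depend on the SDK.

```typescript
// Hypothetical wrapper; `flush` stands in for () => langfuse.flushAsync().
async function withFlush<T>(
  handler: () => Promise<T>,
  flush: () => Promise<void>
): Promise<T> {
  try {
    return await handler();
  } finally {
    // Runs on success and on error, so queued events are sent either way.
    // A failed flush is swallowed so it cannot mask the handler's own result.
    await flush().catch(() => undefined);
  }
}
```

In a route this might look like `return withFlush(() => handleChat(req), () => langfuse.flushAsync())`, where `handleChat` is whatever function holds your tracing logic.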

Step 3: Trace Your First LLM Call

Create a simple API route that wraps an OpenAI call inside a Langfuse trace. A trace represents one logical operation from the user's perspective (e.g. one chat turn). Inside the trace you create a generation to record the specific LLM call with its model, inputs, outputs, and token usage.

// app/api/chat/route.ts
import { NextRequest, NextResponse } from "next/server";
import OpenAI from "openai";
import { langfuse } from "@/lib/langfuse";
 
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
 
export async function POST(req: NextRequest) {
  const { messages, userId, sessionId } = await req.json();
 
  // 1. Start a trace for this user turn
  const trace = langfuse.trace({
    name: "chat-turn",
    userId: userId ?? "anonymous",
    sessionId: sessionId ?? "default",
    input: { messages },
    tags: ["chat", process.env.NODE_ENV ?? "development"],
  });
 
  // 2. Record the LLM call as a generation inside the trace
  const generation = trace.generation({
    name: "openai-gpt4o",
    model: "gpt-4o",
    modelParameters: {
      temperature: 0.7,
      maxTokens: 1024,
    },
    input: messages,
  });
 
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
      max_tokens: 1024,
      temperature: 0.7,
    });
 
    const content = response.choices[0].message.content ?? "";
 
    // 3. Close the generation with output and token usage
    generation.end({
      output: content,
      usage: {
        promptTokens: response.usage?.prompt_tokens,
        completionTokens: response.usage?.completion_tokens,
        totalTokens: response.usage?.total_tokens,
      },
    });
 
    // 4. Close the trace with the final output
    trace.update({ output: { content } });
 
    // 5. Flush before the serverless function exits
    await langfuse.flushAsync();
 
    return NextResponse.json({
      content,
      traceId: trace.id, // return to frontend for feedback linking
    });
  } catch (error) {
    generation.end({
      level: "ERROR",
      statusMessage: String(error),
    });
    await langfuse.flushAsync();
    throw error;
  }
}

After a request, open your Langfuse dashboard and navigate to Traces. You will see the trace with full input/output, model name, token counts, latency, and estimated cost — all without any manual configuration.
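
The route above trusts whatever JSON the client sends. A small validation step before the trace is created keeps malformed payloads out of both OpenAI and Langfuse. The sketch below is illustrative only: `parseChatRequest` and its types are hypothetical, and it assumes plain `system`, `user`, and `assistant` text messages.

```typescript
// Hypothetical types and helper for illustration only.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  messages: ChatMessage[];
  userId: string;
  sessionId: string;
}

const ROLES: readonly string[] = ["system", "user", "assistant"];

function isChatMessage(value: unknown): value is ChatMessage {
  if (typeof value !== "object" || value === null) return false;
  const { role, content } = value as Record<string, unknown>;
  return typeof role === "string" && ROLES.includes(role) && typeof content === "string";
}

// Returns null when the body is unusable, so the route can answer with a 400
function parseChatRequest(body: unknown): ChatRequest | null {
  if (typeof body !== "object" || body === null) return null;
  const { messages, userId, sessionId } = body as Record<string, unknown>;
  if (!Array.isArray(messages) || messages.length === 0) return null;
  if (!messages.every(isChatMessage)) return null;
  return {
    messages: messages as ChatMessage[],
    // Same defaults the route uses when the client omits these fields
    userId: typeof userId === "string" ? userId : "anonymous",
    sessionId: typeof sessionId === "string" ? sessionId : "default",
  };
}
```

The handler could call `parseChatRequest(await req.json())` and return a 400 response when the result is `null`.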

Step 4: Add Nested Spans for Complex Pipelines

Real AI apps do more than one LLM call per request. A RAG pipeline, for example, retrieves documents, reranks them, constructs a context window, and then generates an answer. Langfuse lets you nest spans inside a trace to make this visible.

// app/api/rag-chat/route.ts
import { NextRequest, NextResponse } from "next/server";
import OpenAI from "openai";
import { langfuse } from "@/lib/langfuse";
 
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
 
// Placeholder — replace with your actual vector search client
async function retrieveDocuments(query: string) {
  return [
    { id: "doc-1", content: "Relevant passage about the topic..." },
    { id: "doc-2", content: "Another relevant passage..." },
  ];
}
 
export async function POST(req: NextRequest) {
  const { query, userId, sessionId } = await req.json();
 
  const trace = langfuse.trace({
    name: "rag-chat",
    userId,
    sessionId,
    input: { query },
  });
 
  // Span 1: document retrieval
  const retrievalSpan = trace.span({
    name: "vector-retrieval",
    input: { query },
  });
  const documents = await retrieveDocuments(query);
  retrievalSpan.end({
    output: { documentCount: documents.length, documentIds: documents.map((d) => d.id) },
  });
 
  // Span 2: context construction
  const contextSpan = trace.span({ name: "context-construction" });
  const context = documents.map((d) => d.content).join("\n\n");
  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "system", content: "Answer using only the provided context." },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` },
  ];
  // Rough heuristic of ~4 characters per token, rounded up to a whole number
  contextSpan.end({ output: { tokenEstimate: Math.ceil(context.length / 4) } });
 
  // Span 3: LLM generation
  const generation = trace.generation({
    name: "answer-generation",
    model: "gpt-4o",
    input: messages,
  });
 
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
  });
 
  const answer = response.choices[0].message.content ?? "";
 
  generation.end({
    output: answer,
    usage: {
      promptTokens: response.usage?.prompt_tokens,
      completionTokens: response.usage?.completion_tokens,
      totalTokens: response.usage?.total_tokens,
    },
  });
 
  trace.update({ output: { answer } });
  await langfuse.flushAsync();
 
  return NextResponse.json({ answer, traceId: trace.id });
}

In the Langfuse dashboard the trace now shows as a tree: the root trace with three children (retrieval span, context span, generation). You can see exactly where time is spent and how each step contributes to the final answer.
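
In the route above a span is only closed on the happy path. If a step throws, its span is never ended. A small helper can close the span in both cases. This is a sketch under assumptions: `withSpan`, `SpanLike`, and `SpanParent` are hypothetical names that mirror only the `span()` and `end()` calls used in this tutorial, and it assumes spans accept the same `level` and `statusMessage` fields that Step 3 passes to a generation.

```typescript
// Minimal structural types mirroring only the span calls used above.
// These are assumptions for the sketch, not the SDK's own type definitions.
interface SpanLike {
  end(args: { output?: unknown; level?: "ERROR"; statusMessage?: string }): void;
}

interface SpanParent {
  span(args: { name: string; input?: unknown }): SpanLike;
}

// Hypothetical helper: runs one pipeline step inside a span and always ends it
async function withSpan<T>(
  parent: SpanParent,
  name: string,
  input: unknown,
  step: () => Promise<T>
): Promise<T> {
  const span = parent.span({ name, input });
  try {
    const output = await step();
    span.end({ output });
    return output;
  } catch (error) {
    // Mark the span as failed, then let the caller handle the error
    span.end({ level: "ERROR", statusMessage: String(error) });
    throw error;
  }
}
```

With a helper like this, the retrieval step could be written as `await withSpan(trace, "vector-retrieval", { query }, () => retrieveDocuments(query))`, provided the real trace object is compatible with the `SpanParent` shape assumed here.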

Step 5: Collect User Feedback

Connecting user thumbs-up/thumbs-down to your traces closes the loop between user experience and model performance. Langfuse calls these scores — they can come from users, automated evals, or human reviewers.

First, expose a feedback endpoint:

// app/api/feedback/route.ts
import { NextRequest, NextResponse } from "next/server";
import { langfuse } from "@/lib/langfuse";
 
export async function POST(req: NextRequest) {
  const { traceId, value, comment } = await req.json();
 
  // value: 1 for thumbs up, 0 for thumbs down
  langfuse.score({
    traceId,
    name: "user-feedback",
    value,
    comment: comment ?? undefined,
    dataType: "BOOLEAN",
  });
 
  await langfuse.flushAsync();
  return NextResponse.json({ ok: true });
}

Then wire up the UI. Here is a minimal React component:

// components/FeedbackButtons.tsx
"use client";
 
import { useState } from "react";
 
interface FeedbackButtonsProps {
  traceId: string;
}
 
export function FeedbackButtons({ traceId }: FeedbackButtonsProps) {
  const [submitted, setSubmitted] = useState<boolean | null>(null);
 
  const sendFeedback = async (value: 0 | 1) => {
    setSubmitted(value === 1);
    await fetch("/api/feedback", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ traceId, value }),
    });
  };
 
  if (submitted !== null) {
    return <p className="text-sm text-gray-500">Thanks for your feedback!</p>;
  }
 
  return (
    <div className="flex gap-2 mt-2">
      <button
        onClick={() => sendFeedback(1)}
        className="px-3 py-1 text-sm border rounded hover:bg-green-50"
      >
        👍 Helpful
      </button>
      <button
        onClick={() => sendFeedback(0)}
        className="px-3 py-1 text-sm border rounded hover:bg-red-50"
      >
        👎 Not helpful
      </button>
    </div>
  );
}

Pass the traceId returned by your API route to this component after each AI response. Scores immediately appear in Langfuse, where you can filter traces by score, see which sessions have the most negative feedback, and correlate quality dips with specific prompt versions or model changes.
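
Because the feedback endpoint is public, it is worth checking the payload before writing a score. The sketch below shows one possible validator. `parseFeedback` and `FeedbackPayload` are hypothetical names, and the 500-character comment cap is an arbitrary choice for illustration.

```typescript
// Hypothetical payload type and validator for the feedback endpoint.
interface FeedbackPayload {
  traceId: string;
  value: 0 | 1;
  comment?: string;
}

function parseFeedback(body: unknown): FeedbackPayload | null {
  if (typeof body !== "object" || body === null) return null;
  const { traceId, value: rawValue, comment } = body as Record<string, unknown>;
  if (typeof traceId !== "string" || traceId.length === 0) return null;

  // Only the two values the buttons can send are accepted
  const value = rawValue === 1 ? 1 : rawValue === 0 ? 0 : null;
  if (value === null) return null;

  const payload: FeedbackPayload = { traceId, value };
  // Cap comment length so a client cannot store arbitrarily large text
  if (typeof comment === "string" && comment.trim().length > 0) {
    payload.comment = comment.trim().slice(0, 500);
  }
  return payload;
}
```

The route would then score only when `parseFeedback(await req.json())` returns a payload, and respond with a 400 otherwise.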

Step 6: Manage Prompts from the Dashboard

Hardcoding prompts in source code means every iteration requires a code deployment. Langfuse's Prompt Management lets you version, label, and A/B test prompts from the dashboard — and fetch the active version at runtime without a deploy.

Create a Prompt in Langfuse

  1. In the Langfuse dashboard go to Prompts → New Prompt
  2. Name it chat-system-prompt
  3. Paste your system prompt, using double-curly-brace syntax for variables: You are a helpful assistant for {{company_name}}.
  4. Publish it with label production

Fetch and Use the Prompt

// lib/prompts.ts
import { langfuse } from "@/lib/langfuse";
 
export async function getSystemPrompt(companyName: string): Promise<string> {
  const prompt = await langfuse.getPrompt("chat-system-prompt", undefined, {
    label: "production",
    cacheTtlSeconds: 60, // cache for 1 minute to reduce API calls
  });
 
  // Compile the prompt, substituting template variables
  return prompt.compile({ company_name: companyName });
}

Now link the prompt to your trace generation so Langfuse knows which prompt version produced each response:

// Inside your chat API route
const promptObj = await langfuse.getPrompt("chat-system-prompt", undefined, {
  label: "production",
  cacheTtlSeconds: 60,
});
 
const systemPrompt = promptObj.compile({ company_name: "Acme Corp" });
 
const generation = trace.generation({
  name: "openai-gpt4o",
  model: "gpt-4o",
  input: [
    { role: "system", content: systemPrompt },
    ...messages,
  ],
  prompt: promptObj, // links this generation to the prompt version
});

Now the Langfuse dashboard can show you exactly which prompt version was active for every trace, and you can compare quality metrics across versions directly.

Prompt version pinning. By default getPrompt fetches the production label. If you want a specific version (e.g. for staging), pass the version number as the second argument: langfuse.getPrompt("chat-system-prompt", 3). Use labels (production, staging, canary) to manage rollout without touching code.
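
To make the double-curly-brace substitution concrete, here is a small local stand-in that can also be handy in unit tests that should not call the Langfuse API. It is an illustration only: `compileTemplate` is a hypothetical function and not Langfuse's implementation of `prompt.compile()`, whose handling of edge cases may differ.

```typescript
// Illustration only: a local stand-in for {{variable}} substitution.
// This is not Langfuse's implementation of prompt.compile().
function compileTemplate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (placeholder: string, name: string) => {
    const value = Object.prototype.hasOwnProperty.call(variables, name)
      ? variables[name]
      : undefined;
    // Leave unknown placeholders untouched so missing variables stay visible
    return value ?? placeholder;
  });
}

const compiled = compileTemplate(
  "You are a helpful assistant for {{company_name}}.",
  { company_name: "Acme Corp" }
);
// compiled === "You are a helpful assistant for Acme Corp."
```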

Step 7: Set Up Automated Evaluation

Human feedback is invaluable but sparse. Automated evaluation lets you score 100% of your traces using rule-based checks or an LLM-as-judge.

Rule-Based Scoring

Add a simple post-processing step after each generation:

// lib/evaluators.ts
import { langfuse } from "@/lib/langfuse";
 
export async function evaluateResponse(
  traceId: string,
  generationId: string,
  output: string,
  expectedKeywords: string[]
): Promise<void> {
  // Score 1: response length check (penalize very short answers)
  // Split on any whitespace run so newlines and double spaces are not miscounted
  const lengthScore = output.trim().split(/\s+/).length > 20 ? 1 : 0;
 
  langfuse.score({
    traceId,
    observationId: generationId,
    name: "response-length-ok",
    value: lengthScore,
    dataType: "BOOLEAN",
  });
 
  // Score 2: keyword coverage
  const covered = expectedKeywords.filter((kw) =>
    output.toLowerCase().includes(kw.toLowerCase())
  ).length;
  const coverageScore = expectedKeywords.length > 0
    ? covered / expectedKeywords.length
    : 1;
 
  langfuse.score({
    traceId,
    observationId: generationId,
    name: "keyword-coverage",
    value: coverageScore,
    dataType: "NUMERIC",
    comment: `${covered}/${expectedKeywords.length} keywords found`,
  });
 
  await langfuse.flushAsync();
}
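
Keeping the scoring rules as pure functions, separate from the Langfuse calls, makes them easy to unit test. The sketch below restates the two checks in that form. `isLongEnough` and `keywordCoverage` are hypothetical helper names, and the 20-word threshold is the same arbitrary cutoff used in the evaluator.

```typescript
// Hypothetical pure helpers mirroring the two rule-based checks.
function isLongEnough(output: string, minWords = 20): boolean {
  // Split on any whitespace run and drop the empty string an empty input produces
  const words = output.trim().split(/\s+/).filter((w) => w.length > 0);
  return words.length > minWords;
}

function keywordCoverage(output: string, expectedKeywords: string[]): number {
  // With nothing to look for, treat coverage as complete
  if (expectedKeywords.length === 0) return 1;
  const haystack = output.toLowerCase();
  const covered = expectedKeywords.filter((kw) => haystack.includes(kw.toLowerCase())).length;
  return covered / expectedKeywords.length;
}

const coverage = keywordCoverage("Langfuse traces LLM calls", ["langfuse", "traces", "cost", "latency"]);
// coverage === 0.5 (2 of 4 keywords found)
```

The evaluator can then pass these results straight to `langfuse.score()`, and the rules themselves can be tested without any network access.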

LLM-as-Judge

For more nuanced quality signals, use a second (cheaper) LLM call to evaluate the response:

// lib/llm-judge.ts
import OpenAI from "openai";
import { langfuse } from "@/lib/langfuse";
 
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
 
export async function llmJudge(
  traceId: string,
  question: string,
  answer: string
): Promise<void> {
  const judgeTrace = langfuse.trace({ name: "llm-judge", tags: ["eval"] });

  // Build the judge prompt once so the traced input is exactly what gets sent
  const judgeMessages: OpenAI.ChatCompletionMessageParam[] = [
    {
      role: "user",
      content: `Rate the following answer on a scale from 0 to 1 for accuracy and helpfulness. Return only a JSON object with keys "score" and "reason".

Question: ${question}
Answer: ${answer}`,
    },
  ];

  const judgeGeneration = judgeTrace.generation({
    name: "judge-call",
    model: "gpt-4o-mini",
    input: judgeMessages,
  });

  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: judgeMessages,
    response_format: { type: "json_object" },
  });
 
  const raw = response.choices[0].message.content ?? '{"score":0,"reason":"parse error"}';
  const parsed = JSON.parse(raw) as { score: number; reason: string };
 
  judgeGeneration.end({ output: parsed });
 
  langfuse.score({
    traceId,
    name: "llm-judge-quality",
    value: parsed.score,
    dataType: "NUMERIC",
    comment: parsed.reason,
  });
 
  await langfuse.flushAsync();
}

Call llmJudge asynchronously (via a background job like Hatchet or Trigger.dev) after each user turn so it does not add latency to the main response path.
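
The judge's reply is model output, so it may be malformed JSON or carry a score outside the requested range. A defensive parser avoids a thrown `JSON.parse` error or an out-of-range score reaching Langfuse. This is a minimal sketch: `parseJudgeOutput` and `JudgeResult` are hypothetical names, and returning `null` for unusable output is one possible policy among several.

```typescript
interface JudgeResult {
  score: number;
  reason: string;
}

// Hypothetical helper: never throws; returns null when the output is unusable
function parseJudgeOutput(raw: string | null | undefined): JudgeResult | null {
  if (!raw) return null;
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { score, reason } = data as Record<string, unknown>;
  if (typeof score !== "number" || Number.isNaN(score)) return null;
  return {
    // Clamp in case the model answers outside the requested 0 to 1 range
    score: Math.min(1, Math.max(0, score)),
    reason: typeof reason === "string" ? reason : "",
  };
}
```

When the result is `null`, skipping the score is usually safer than recording a 0, since a zero would be indistinguishable from a genuinely bad answer in the dashboard.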

Step 8: Production Checklist

Before shipping to production, review these operational considerations:

Sampling High-Volume Traces

If you are handling thousands of requests per minute, tracing 100% may be expensive. Add a simple sampling gate:

// lib/langfuse.ts — extend your existing file
export function shouldTrace(sampleRate = 0.1): boolean {
  return Math.random() < sampleRate;
}

Use it in your route handler:

if (!shouldTrace(0.2)) {
  // skip tracing for this request
  const response = await openai.chat.completions.create({ model: "gpt-4o", messages });
  return NextResponse.json({ content: response.choices[0].message.content });
}
// ... full tracing path

Self-Hosting Langfuse

Langfuse's core is MIT-licensed, and the project ships a Docker Compose stack for self-hosting. This is worth considering if you have data residency requirements (common in EU and MENA markets) or want to avoid cloud costs at scale:

git clone https://github.com/langfuse/langfuse.git
cd langfuse
cp .env.example .env
# Edit .env: set NEXTAUTH_SECRET, DATABASE_URL, SALT, etc.
docker compose up -d

Point your SDK to the self-hosted instance by setting LANGFUSE_BASE_URL to your server's URL.

Environment Isolation

Use Langfuse Projects to separate development, staging, and production data. Each project has its own key pair — set LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY per environment.

Never log sensitive user data. Langfuse traces are stored and potentially visible to your team. Before passing messages as the input of langfuse.trace() or trace.generation(), strip or redact PII (names, emails, ID numbers) that should not appear in your observability tooling.
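
As a starting point, redaction can be a small function applied to a copy of the messages before they are attached to a trace. This is a sketch only: `redactPII` and `redactMessages` are hypothetical helpers, and the two patterns are deliberately simple. They catch common email addresses and long digit runs, not names or every form of identifier.

```typescript
// Hypothetical redaction helpers; the patterns will not catch every form of PII.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const LONG_NUMBER_PATTERN = /\b\d{9,}\b/g; // phone-like or ID-like digit runs

function redactPII(text: string): string {
  return text
    .replace(EMAIL_PATTERN, "[EMAIL]")
    .replace(LONG_NUMBER_PATTERN, "[NUMBER]");
}

function redactMessages<T extends { content: string }>(messages: T[]): T[] {
  // Copy each message so the originals sent to the model stay unchanged
  return messages.map((m) => ({ ...m, content: redactPII(m.content) }));
}

const redactedText = redactPII("Mail jane.doe@example.com or call 5551234567");
// redactedText === "Mail [EMAIL] or call [NUMBER]"
```

The model still receives the original messages. Only the copy passed as trace or generation input is redacted.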

Troubleshooting

Traces not appearing in the dashboard

  • Confirm LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY are correct (they start with sk-lf- and pk-lf-)
  • Ensure await langfuse.flushAsync() is called before the function returns
  • Check that LANGFUSE_BASE_URL matches the cloud or self-hosted URL exactly

Token counts showing as zero

  • Langfuse reads token counts from the usage object you pass to generation.end(). Make sure you forward response.usage.prompt_tokens, response.usage.completion_tokens, and response.usage.total_tokens from the OpenAI response.
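
A small mapping helper keeps the snake_case to camelCase conversion in one place and avoids reporting zeros when the response carries no usage at all. This is a sketch: `toTracedUsage` and both interfaces are hypothetical names, with field names taken from the `usage` object used in Step 3.

```typescript
// Shape of the usage fields read from the OpenAI response in this tutorial
interface OpenAIUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Field names match the usage object passed to generation.end() in Step 3
interface TracedUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

// Hypothetical helper: returns undefined rather than zeros when usage is missing
function toTracedUsage(usage: OpenAIUsage | null | undefined): TracedUsage | undefined {
  if (!usage) return undefined;
  return {
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.total_tokens,
  };
}
```

One case to watch for is streaming. OpenAI streaming responses only include usage when the request sets `stream_options: { include_usage: true }`, so without that option there is nothing to forward.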

getPrompt returns 404

  • The prompt must exist in Langfuse and have at least one version with the requested label (production by default). Check the Prompts page in your dashboard.

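Tracing adds noticeable latency

  • Langfuse SDK calls such as trace(), generation(), and score() are non-blocking: they queue events and send them in the background. The calls that do wait on the network are getPrompt (served from cache after the first call) and await langfuse.flushAsync(). If you see latency increases, set the cacheTtlSeconds option on getPrompt and measure how long the flush takes.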

Next Steps

With your observability pipeline in place, here are natural extensions:

  • Agentic RAG with Next.js — combine Langfuse traces with a multi-step retrieval agent
  • Claude Agent SDK for TypeScript — add Langfuse traces to Claude-powered agentic workflows
  • n8n AI Multi-Agent Workflows — orchestrate multiple AI agents and trace each one
  • Langfuse Datasets — curate traced examples into evaluation datasets for regression testing
  • Online evaluation — configure Langfuse to run your LLM judge automatically on every new trace via webhooks

Conclusion

You now have a production-grade observability layer on your Next.js AI application. Every LLM call is traced with full context, users can signal quality through feedback buttons, prompts are versioned and manageable without deployments, and automated evaluators continuously score response quality.

The ability to see what your AI is doing — not just that it succeeded or failed, but how well and why — is what separates prototypes from products. Langfuse gives you that visibility with open-source flexibility, whether you run it on their cloud or self-host in your own infrastructure.

Ship observable AI, not black boxes.