Langfuse Tutorial 2026: LLM Observability and Prompt Management for Next.js AI Apps

Once your AI app leaves the prototype stage, you quickly discover that console.log is not enough. You need to answer questions like: which user prompt triggered the failing response, how much did that conversation cost, why did latency spike at 3 AM, and which prompt version produced the best eval scores?
Langfuse is the open-source observability platform built specifically for LLM apps. It handles tracing, cost and token tracking, prompt versioning, evaluations, and user feedback. In this tutorial, you will wire Langfuse into a Next.js 15 app that uses the Vercel AI SDK, capture every LLM call with rich metadata, manage prompts as first-class artifacts, and build dashboards that actually help you debug production.
Prerequisites
Before starting, make sure you have:
- Node.js 20 or newer installed
- A Next.js 15 project (App Router) — or you can scaffold one during Step 2
- An API key for OpenAI, Anthropic, or Google (any LLM provider supported by the Vercel AI SDK works)
- Basic familiarity with TypeScript and React Server Components
- Docker Desktop if you plan to self-host Langfuse (optional — Langfuse Cloud works too)
What You Will Build
By the end of this tutorial, you will have:
- A Next.js AI chat route instrumented with Langfuse traces
- Automatic token, cost, and latency capture for every LLM call
- A prompt registry where non-developers can edit prompts without a deploy
- User feedback scoring (thumbs up / down) attached to traces
- An evaluation pipeline that grades outputs with an LLM-as-judge
- Production-ready dashboards and alerts
Let us begin.
Step 1: Choose Cloud or Self-Hosted
Langfuse offers two deployment paths. Pick one:
Option A — Langfuse Cloud (fastest): Sign up at cloud.langfuse.com, create a project, and grab the public and secret API keys. Free tier covers 50k observations per month, which is plenty for development.
Option B — Self-hosted (more control): Run the official Docker Compose stack locally or on your own VPS. Clone the repo and boot it up:
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
Langfuse will be available at http://localhost:3000. Create an account, create a project, and copy the keys from the project settings page.
For this tutorial, we will assume Cloud, but every code sample works identically for self-hosted — only the LANGFUSE_BASEURL changes.
Step 2: Scaffold the Next.js App
Skip this step if you already have a Next.js 15 project. Otherwise:
npx create-next-app@latest langfuse-demo --typescript --app --tailwind
cd langfuse-demo
npm install ai @ai-sdk/openai zod
Add your secrets to .env.local:
OPENAI_API_KEY=sk-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASEURL=https://cloud.langfuse.com
Never commit .env.local. Add it to .gitignore if it is not already there.
Step 3: Install the Langfuse SDK
Langfuse ships several SDKs. For Next.js with the Vercel AI SDK, install the core JS SDK plus the Vercel AI SDK integration:
npm install langfuse langfuse-vercel
The langfuse-vercel package provides a telemetry exporter that hooks into the AI SDK OpenTelemetry spans automatically. No manual span wrapping required.
Step 4: Initialize the Langfuse Client
Create a shared client at lib/langfuse.ts:
import { Langfuse } from "langfuse";

export const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  baseUrl: process.env.LANGFUSE_BASEURL,
  flushAt: 1, // send every event immediately during development
});

// Ensure events flush before Next.js terminates the serverless function
export async function flushLangfuse() {
  await langfuse.flushAsync();
}

The flushAt: 1 setting is great for development but should be raised to 20 or 50 in production to batch network calls. Always call flushLangfuse() at the end of serverless handlers — if you skip it, events can be lost when the Lambda freezes.
Step 5: Register the Telemetry Exporter
In Next.js 15, OpenTelemetry instrumentation lives in instrumentation.ts at the project root. Create it:
import { registerOTel } from "@vercel/otel";
import { LangfuseExporter } from "langfuse-vercel";

export function register() {
  registerOTel({
    serviceName: "langfuse-demo",
    traceExporter: new LangfuseExporter({
      publicKey: process.env.LANGFUSE_PUBLIC_KEY,
      secretKey: process.env.LANGFUSE_SECRET_KEY,
      baseUrl: process.env.LANGFUSE_BASEURL,
    }),
  });
}

Also install the Vercel OpenTelemetry helper:
npm install @vercel/otel @opentelemetry/api
Enable telemetry in next.config.js:
/** @type {import('next').NextConfig} */
const nextConfig = {
  experimental: {
    instrumentationHook: true,
  },
};

module.exports = nextConfig;

On Next.js 15, instrumentation is stable and this flag is unnecessary — it was only required on Next.js 14 and earlier, so check your Next version before adding it.
Step 6: Build a Traced Chat Route
Create app/api/chat/route.ts:
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { after } from "next/server";
import { flushLangfuse } from "@/lib/langfuse";

export async function POST(req: Request) {
  const { messages, userId, sessionId } = await req.json();

  const result = streamText({
    model: openai("gpt-4o-mini"),
    messages,
    experimental_telemetry: {
      isEnabled: true,
      functionId: "chat-completion",
      metadata: {
        langfuseUserId: userId ?? "anonymous",
        langfuseSessionId: sessionId,
        tags: ["production", "chat"],
      },
    },
  });

  // Flush telemetry after the response is streamed
  after(async () => {
    await flushLangfuse();
  });

  return result.toDataStreamResponse();
}

The experimental_telemetry block is the key. Langfuse reads langfuseUserId, langfuseSessionId, and tags from metadata and turns them into first-class fields you can filter on. The after() helper from next/server runs flushing after the response is sent, so you do not delay the user.
Step 7: Test Your First Trace
Start the dev server and send a request:
npm run dev
Then from another terminal:
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{ "role": "user", "content": "Explain retrieval augmented generation in one paragraph." }],
    "userId": "user-42",
    "sessionId": "demo-session"
  }'

Open the Langfuse dashboard. Within a few seconds you should see a new trace with the full prompt, the streamed response, the model name, token counts, and the total cost computed from the live pricing table. Click the trace to expand it and you will see every span, including the OpenAI API call nested inside.
Step 8: Manage Prompts as Versioned Artifacts
Hardcoding prompts in your source makes them impossible to iterate on without a deploy. Langfuse solves this with a prompt registry.
Go to Prompts in the dashboard and create a new prompt named chat-system:
You are a concise technical assistant for developers.
Always answer in markdown. When you show code, use fenced blocks with a language tag.
Current date: {{today}}
Langfuse stores this as version 1. Fetch it from your route:
import { langfuse } from "@/lib/langfuse";

async function buildSystemMessage() {
  const prompt = await langfuse.getPrompt("chat-system", undefined, {
    cacheTtlSeconds: 60,
  });
  const compiled = prompt.compile({
    today: new Date().toISOString().split("T")[0],
  });
  return { role: "system" as const, content: compiled, promptRef: prompt };
}

Then pass the prompt reference to experimental_telemetry so Langfuse links the trace to the exact prompt version used:
const system = await buildSystemMessage();

const result = streamText({
  model: openai("gpt-4o-mini"),
  messages: [{ role: "system", content: system.content }, ...messages],
  experimental_telemetry: {
    isEnabled: true,
    functionId: "chat-completion",
    metadata: {
      langfusePrompt: system.promptRef.toJSON(),
    },
  },
});

Now when a teammate edits the prompt in the Langfuse UI, every new request uses the updated version within 60 seconds — no deploy needed. And the Prompts page shows you which versions are currently in use and how each version performs on latency and quality.
Step 9: Capture User Feedback
A trace is only half the story. You need to know whether users liked the answer. Add a feedback endpoint:
// app/api/feedback/route.ts
import { langfuse } from "@/lib/langfuse";
import { after } from "next/server";

export async function POST(req: Request) {
  const { traceId, value, comment } = await req.json();

  langfuse.score({
    traceId,
    name: "user-feedback",
    value: value === "up" ? 1 : 0,
    dataType: "NUMERIC",
    comment,
  });

  after(async () => {
    await langfuse.flushAsync();
  });

  return Response.json({ ok: true });
}

On the frontend, render thumbs up / down buttons after each assistant message and POST to /api/feedback with the traceId that Langfuse returned in the response headers. Now every trace has a quality signal you can filter and group by.
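To keep the button component and the route in sync, it helps to build the feedback payload in one shared helper. The sketch below is illustrative: buildFeedbackPayload is a name introduced here, not a Langfuse API.

```typescript
// Hypothetical helper shared between the feedback button and the route.
// Maps a thumbs up/down click to the score shape /api/feedback expects.
type FeedbackValue = "up" | "down";

export interface FeedbackPayload {
  traceId: string;
  name: string;
  value: number;
  comment?: string;
}

export function buildFeedbackPayload(
  traceId: string,
  value: FeedbackValue,
  comment?: string
): FeedbackPayload {
  return {
    traceId,
    name: "user-feedback",
    value: value === "up" ? 1 : 0,
    // Omit the key entirely when there is no comment
    ...(comment ? { comment } : {}),
  };
}

// A thumbs-up click on the client then becomes roughly:
// fetch("/api/feedback", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildFeedbackPayload(traceId, "up")),
// });
```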
Step 10: Add LLM-as-Judge Evaluations
Manual feedback is valuable but sparse. To score every conversation automatically, configure an LLM-as-judge evaluator.
In the Langfuse dashboard, go to Evaluations, create a new evaluator, and choose LLM-as-judge. Use a template like:
Rate the assistant response on helpfulness from 0 to 1.
Consider: Does it answer the user question? Is it accurate? Is it concise?
User question: {{input}}
Assistant answer: {{output}}
Return JSON: { "score": number, "reasoning": string }
Select which traces to score — for example, all traces tagged production in the last 24 hours. Langfuse runs the evaluator in the background and attaches the score to each trace. You can then chart helpfulness over time, broken down by prompt version, to validate that your prompt edits actually improve quality.
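If you ever run the judge yourself in code (for example in a nightly batch job) instead of using the dashboard evaluator, parse the model's JSON verdict defensively before attaching it as a score. A sketch under that assumption; parseJudgeVerdict is our own helper name, not part of Langfuse:

```typescript
// Hypothetical helper for a self-run judge loop: extract the verdict JSON
// from the model's reply and clamp the score into [0, 1].
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

export function parseJudgeVerdict(raw: string): JudgeVerdict {
  // Models sometimes wrap the JSON in fences or prose; take the outermost object.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end === -1) {
    throw new Error("No JSON object found in judge reply");
  }
  const parsed = JSON.parse(raw.slice(start, end + 1)) as Partial<JudgeVerdict>;
  if (typeof parsed.score !== "number" || Number.isNaN(parsed.score)) {
    throw new Error("Judge did not return a numeric score");
  }
  return {
    score: Math.min(1, Math.max(0, parsed.score)),
    reasoning: typeof parsed.reasoning === "string" ? parsed.reasoning : "",
  };
}
```

The clamped result can then be attached with the same score API used for user feedback, e.g. langfuse.score({ traceId, name: "helpfulness", value: verdict.score, comment: verdict.reasoning }).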
Step 11: Build a Production Dashboard
The Dashboards tab lets you compose charts without writing SQL. Useful panels to pin:
- Cost per day — spot runaway spend early
- P95 latency by model — catch slow models before users complain
- Token usage by userId — identify heavy users who might need rate limits
- Error rate by prompt version — detect bad prompts before they hurt everyone
- Average helpfulness score by day — track quality drift
Add a threshold alert on cost and error rate. Langfuse can push alerts to Slack or Discord webhooks when a threshold is crossed.
Step 12: Use Sessions to Group Conversations
Passing the same langfuseSessionId across multiple requests groups them into a single conversation view. This is invaluable when debugging multi-turn flows. In your frontend, generate a session ID once per conversation (store it in component state or a cookie) and include it in every /api/chat request.
Langfuse then shows you the full thread timeline, the cumulative cost of the whole conversation, and where latency compounded across turns.
Step 13: Mask PII Before Sending
If your app handles personal data, you must redact sensitive content before it leaves your server. Langfuse supports a masking function:
import { Langfuse } from "langfuse";

export const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  mask: (data) => {
    if (typeof data !== "string") return data;
    return data
      .replace(/\b[\w.-]+@[\w.-]+\.\w+\b/g, "[EMAIL]")
      .replace(/\b(\+?\d{1,3}[ -]?)?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b/g, "[PHONE]");
  },
});

The mask runs on both inputs and outputs before the event is shipped. For stricter compliance, run a more robust PII classifier such as Microsoft Presidio inside the mask function.
Step 14: Deploy to Production
A few production hardening tips:
- Batch events: raise flushAt to 20 and flushInterval to 5000 ms. This reduces network overhead dramatically.
- Handle shutdown: on Vercel, the after() helper covers flushing. On long-running Node servers, register process.on("SIGTERM", flushLangfuse).
- Environment separation: use a different Langfuse project per environment (dev, staging, prod). This keeps dashboards clean.
- Secret rotation: Langfuse keys can be rotated in the project settings without downtime. Rotate quarterly.
- Sampling: if you have very high volume, use sampleRate: 0.1 on the client to capture only 10 percent of traces. You can always sample more on specific user cohorts.
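Taken together, the batching, sampling, and shutdown advice above amounts to a client config roughly like the following. Treat it as a sketch and check the option names against the Langfuse SDK version you have installed:

```typescript
import { Langfuse } from "langfuse";

// Production-oriented client: batch events, sample traces, flush on shutdown.
export const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  secretKey: process.env.LANGFUSE_SECRET_KEY!,
  baseUrl: process.env.LANGFUSE_BASEURL,
  flushAt: 20,         // batch up to 20 events per network call
  flushInterval: 5000, // or flush every 5 seconds, whichever comes first
  sampleRate: 0.1,     // at very high volume, keep 10% of traces
});

// On long-running Node servers, flush buffered events before exit.
process.on("SIGTERM", async () => {
  await langfuse.flushAsync();
  process.exit(0);
});
```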
Testing Your Implementation
Before shipping:
- Send a chat request. Confirm a trace appears with the correct prompt version, user id, and session id.
- Submit thumbs up feedback. Confirm a score attaches to the trace.
- Simulate an error (unset the OpenAI key temporarily). Confirm Langfuse captures the error with a full stack trace.
- Edit the prompt in the Langfuse UI. Confirm the next request uses the new version within 60 seconds.
- Check cost numbers match your OpenAI billing dashboard within a few percent.
Troubleshooting
No traces appear in the dashboard. Check that instrumentation.ts is at the project root and that experimental_telemetry.isEnabled is true. Run with NODE_OPTIONS=--trace-warnings to catch exporter errors.
Events appear only after cold starts. You forgot to call flushLangfuse() inside after(). On Vercel, serverless functions freeze immediately after the response and buffered events never ship.
Costs show as zero. The model you are using is not in the Langfuse pricing catalog yet. Add custom pricing in Settings, Models or open a PR to the Langfuse repo.
Prompt version never updates. Check that cacheTtlSeconds is not set too high in getPrompt(). The default fallback behavior will serve the last known version forever if the API is unreachable — double check network connectivity in your deployment region.
Next Steps
- Combine Langfuse with Vercel AI SDK agent loops to trace tool calls and handoffs
- Add Langfuse to your Claude Agent SDK project for unified observability
- Set up a Langfuse + Next.js observability stack alongside OpenTelemetry for application traces
- Explore Langfuse Datasets to build a regression test suite that runs on every prompt change
- Integrate with Inngest durable functions so background workflows also appear in your trace view
Conclusion
LLM observability is not optional once real users hit your AI app. Langfuse gives you traces, costs, prompt versioning, user feedback, and automated evaluation in a single open-source platform that plugs cleanly into Next.js and the Vercel AI SDK. The setup takes under an hour, and the payoff is the ability to answer every question a PM, engineer, or finance team member can throw at your AI system — with real data instead of guesses.
Wire it in once, and your future self will thank you the first time production acts up.