OpenAI Realtime API Tutorial 2026: Build a Voice AI Agent with Next.js and WebRTC


Voice is the interface that finally makes AI agents feel alive. Typed chatbots have a ceiling — they make users wait, read, and re-read. A voice agent that responds in around 300 milliseconds, lets you interrupt mid-sentence, and calls real tools on your backend feels like a different category of product. The OpenAI Realtime API made that experience accessible without you having to stitch together a speech-to-text pipeline, an LLM, and a text-to-speech engine. It is one model, one connection, full duplex.

In this tutorial you will build a voice agent that runs in the browser, talks to the OpenAI Realtime API over WebRTC, calls server-side tools through function calling, and is safe to deploy to production. We will use Next.js 15 (App Router), TypeScript, and a thin WebRTC client. By the end you will have a working agent that can take a customer call, look up an order in your database, and dictate the answer back in natural speech.

What you will build

A voice-powered support agent for an imaginary e-commerce store. Users press a button, speak a question, and the agent answers out loud. Behind the scenes it can:

  • Stream audio from the microphone to OpenAI in real time
  • Stream audio back through the speakers with sub-second latency
  • Detect when the user is speaking and stop talking mid-reply when interrupted
  • Call a server-side lookup_order tool that hits a database
  • Persist the conversation transcript so you can review or fine-tune later

Prerequisites

Before starting, make sure you have:

  • Node.js 20 or newer installed
  • An OpenAI API key with access to the Realtime API
  • Basic familiarity with React, Next.js App Router, and TypeScript
  • A modern browser (Chrome, Edge, or Safari) with microphone permission
  • A code editor — VS Code is recommended

You should also be comfortable reading WebRTC concepts at a high level. We will not implement signaling from scratch — OpenAI exposes a single HTTP endpoint that handles SDP exchange for you.

Why the Realtime API and not the old pipeline

A traditional voice stack looks like microphone, then Whisper, then GPT, then a TTS service like ElevenLabs. Every hop adds latency and makes interruption hard. The Realtime API replaces all of that with a single multimodal model that consumes audio frames and emits audio frames. It also handles voice activity detection server-side, which means barge-in, turn-taking, and silence detection just work.

Two other reasons matter for production:

  • One bill, one rate limit. No coordinating quotas across three vendors.
  • Function calling is native. The same model that hears the user can call your tools without a second round trip.

Step 1: Project setup

Create a fresh Next.js app with TypeScript and Tailwind:

npx create-next-app@latest voice-agent --typescript --app --tailwind --eslint
cd voice-agent
npm install openai zod

Add your API key to a local environment file:

# .env.local
OPENAI_API_KEY=sk-proj-...

The OPENAI_API_KEY value must never reach the browser. We will use it only on the server, to mint short-lived ephemeral tokens that the browser can use to open a WebRTC connection.

Step 2: Mint ephemeral tokens on the server

The single most common mistake people make with Realtime is shipping their main API key to the client. Do not do that. Instead, expose a Next.js Route Handler that returns a 60-second ephemeral token tied to a specific session.

Create app/api/realtime/session/route.ts:

import { NextResponse } from "next/server";
 
export async function POST() {
  const response = await fetch(
    "https://api.openai.com/v1/realtime/sessions",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-realtime",
        voice: "marin",
        modalities: ["audio", "text"],
        instructions:
          "You are a helpful and concise voice agent for an online store. Always answer in the language the user speaks. Confirm order numbers before calling lookup_order.",
      }),
    },
  );
 
  if (!response.ok) {
    return NextResponse.json(
      { error: "Failed to create session" },
      { status: 500 },
    );
  }
 
  const data = await response.json();
  return NextResponse.json(data);
}

The response contains client_secret.value, which is the ephemeral token. It is valid for around one minute and is scoped to a single session. Even if a user steals it from the network tab, they can only continue the call they were already on.
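
For orientation, the part of that payload the browser will read in Step 3 looks roughly like the type below. The full response also carries the session configuration; treat this as an illustrative subset, not the exact contract:

// Illustrative subset of the session response: only the fields Step 3 reads.
interface EphemeralSession {
  id: string;
  client_secret: {
    value: string;      // short-lived token the browser sends as its Bearer token
    expires_at: number; // Unix timestamp, roughly one minute in the future
  };
}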

Step 3: Connect the browser with WebRTC

WebRTC is the right transport here because it ships audio with built-in jitter buffering, packet loss recovery, and Opus encoding. It also gives the browser low-level control over capture and playback.

Create lib/realtime-client.ts:

export type RealtimeEvent =
  | { type: "session.created"; session: unknown }
  | { type: "input_audio_buffer.speech_started" }
  | { type: "input_audio_buffer.speech_stopped" }
  | { type: "response.audio_transcript.delta"; delta: string }
  | { type: "response.function_call_arguments.done"; name: string; call_id: string; arguments: string }
  | { type: "error"; error: { message: string } };
 
export interface RealtimeClient {
  pc: RTCPeerConnection;
  dc: RTCDataChannel;
  audio: HTMLAudioElement;
  stop: () => void;
}
 
export async function startRealtime(
  onEvent: (event: RealtimeEvent) => void,
): Promise<RealtimeClient> {
  const tokenRes = await fetch("/api/realtime/session", { method: "POST" });
  const { client_secret } = await tokenRes.json();
 
  const pc = new RTCPeerConnection();
 
  // Play whatever audio track OpenAI sends back.
  const audio = new Audio();
  audio.autoplay = true;
  pc.ontrack = (event) => {
    audio.srcObject = event.streams[0];
  };
 
  // Capture the microphone and send it upstream. This call must happen inside
  // a user gesture handler (see Step 4) or the browser will reject it.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getAudioTracks()[0]);
 
  // Events, tool calls, and session updates all flow over this data channel.
  const dc = pc.createDataChannel("oai-events");
  dc.onmessage = (event) => {
    const parsed = JSON.parse(event.data) as RealtimeEvent;
    onEvent(parsed);
  };
 
  // Standard WebRTC offer/answer, except the answer comes back from a plain
  // HTTP POST authorized with the ephemeral token. No signaling server needed.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
 
  const sdpRes = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-realtime",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${client_secret.value}`,
        "Content-Type": "application/sdp",
      },
    },
  );
 
  const answer = { type: "answer" as const, sdp: await sdpRes.text() };
  await pc.setRemoteDescription(answer);
 
  return {
    pc,
    dc,
    audio,
    stop: () => {
      mic.getTracks().forEach((t) => t.stop());
      dc.close();
      pc.close();
    },
  };
}

A few details that are easy to miss:

  • The getUserMedia call must happen inside a user gesture handler, otherwise the browser will reject it.
  • The data channel name is oai-events. That is hard-coded by the API.
  • We never set pc.onicecandidate. The Realtime endpoint accepts a single SDP offer and returns the answer in one shot, so trickle ICE is not needed.

Step 4: Build the UI

Create a simple client component that starts and stops the agent. Save it as app/page.tsx:

"use client";
 
import { useRef, useState } from "react";
import { startRealtime, type RealtimeClient } from "@/lib/realtime-client";
 
export default function Home() {
  const [status, setStatus] = useState<"idle" | "connecting" | "live">("idle");
  const [transcript, setTranscript] = useState("");
  const clientRef = useRef<RealtimeClient | null>(null);
 
  async function start() {
    setStatus("connecting");
    try {
      const client = await startRealtime((event) => {
        if (event.type === "response.audio_transcript.delta") {
          setTranscript((prev) => prev + event.delta);
        }
        if (event.type === "session.created") {
          setStatus("live");
        }
      });
      clientRef.current = client;
    } catch {
      // Microphone permission denied or the session request failed.
      setStatus("idle");
    }
  }
 
  function stop() {
    clientRef.current?.stop();
    clientRef.current = null;
    setStatus("idle");
  }
 
  return (
    <main className="mx-auto max-w-2xl p-8 space-y-6">
      <h1 className="text-3xl font-bold">Voice Agent</h1>
      <button
        onClick={status === "idle" ? start : stop}
        className="rounded-full bg-black text-white px-6 py-3"
      >
        {status === "idle" ? "Start call" : "End call"}
      </button>
      <p className="text-sm text-zinc-500">Status: {status}</p>
      <pre className="whitespace-pre-wrap rounded-lg bg-zinc-100 p-4 text-sm">
        {transcript}
      </pre>
    </main>
  );
}

Run npm run dev, click Start call, allow microphone access, and say hello. The agent should reply within around 500 ms.

Step 5: Add tools with function calling

A voice agent that only chats is a toy. The interesting part is when it can do something. The Realtime API uses the same JSON-schema-based function calling that the Chat Completions API uses. Define tools when you create the session.

Update the body of the /api/realtime/session route handler to include a tools array:

body: JSON.stringify({
  model: "gpt-realtime",
  voice: "marin",
  modalities: ["audio", "text"],
  instructions:
    "You are a helpful and concise voice agent for an online store. Confirm order numbers before calling lookup_order.",
  tools: [
    {
      type: "function",
      name: "lookup_order",
      description: "Look up the status of a customer order by order number.",
      parameters: {
        type: "object",
        properties: {
          order_number: {
            type: "string",
            description: "The order number, for example ORD-12345.",
          },
        },
        required: ["order_number"],
      },
    },
  ],
  tool_choice: "auto",
}),

When the model decides to call the tool, the data channel emits a response.function_call_arguments.done event. The browser is the wrong place to run business logic, so forward the call to your backend, run it, and stream the result back through the same data channel.

In app/page.tsx, extend the event handler:

const client = await startRealtime(async (event) => {
  if (event.type === "response.audio_transcript.delta") {
    setTranscript((prev) => prev + event.delta);
  }
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    const res = await fetch("/api/orders/lookup", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(args),
    });
    const result = await res.json();
 
    clientRef.current?.dc.send(
      JSON.stringify({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: event.call_id,
          output: JSON.stringify(result),
        },
      }),
    );
    clientRef.current?.dc.send(JSON.stringify({ type: "response.create" }));
  }
});

Then add the backend route at app/api/orders/lookup/route.ts:

import { NextResponse } from "next/server";
import { z } from "zod";
// Assumed database client, for example a Prisma client exported from your own module.
import { db } from "@/lib/db";
 
const Schema = z.object({ order_number: z.string() });
 
export async function POST(request: Request) {
  const parsed = Schema.safeParse(await request.json());
  if (!parsed.success) {
    return NextResponse.json({ error: "Invalid order number" }, { status: 400 });
  }
 
  const order = await db.orders.findUnique({
    where: { id: parsed.data.order_number },
  });
 
  if (!order) {
    return NextResponse.json({ found: false });
  }
 
  return NextResponse.json({
    found: true,
    status: order.status,
    expected_delivery: order.expectedDelivery,
  });
}

The agent will now hear the order number, repeat it for confirmation, call your backend, and read the result back as natural speech. All in one conversational turn.

Step 6: Voice activity detection and interruption

By default, the Realtime API uses server VAD to detect turn boundaries. That means the model knows when the user has stopped speaking and starts replying immediately. It also knows when the user starts speaking again, which is how interruption works: the in-progress response is cancelled and the outgoing audio stops, so over WebRTC you do not have to manage playback cut-off yourself.
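
Those turn boundaries also surface as events on the data channel, so you can mirror them in the UI. A minimal sketch, assuming a hypothetical listening state (a useState hook) added to the page component from Step 4:

// Inside the startRealtime event handler in app/page.tsx.
if (event.type === "input_audio_buffer.speech_started") {
  setListening(true);  // the user is talking; any in-flight reply is being cancelled
}
if (event.type === "input_audio_buffer.speech_stopped") {
  setListening(false); // silence detected; the model will start its turn
}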

You can tune VAD by sending a session.update event after the session is created:

clientRef.current?.dc.send(
  JSON.stringify({
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 600,
      },
    },
  }),
);

A silence_duration_ms of 600 is a good default. Lower values make the agent jump in too eagerly, higher values make it feel sluggish.

Step 7: Persist transcripts

For analytics, audit logs, or fine-tuning, you will want to save the full transcript on the server. The Realtime API emits conversation.item.created events for both sides of the conversation; add that event type to the RealtimeEvent union from Step 3, and capture the session id from the session.created event when the call starts. Then pipe each turn to a tiny logging endpoint:

if (event.type === "conversation.item.created") {
  fetch("/api/logs/turn", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      sessionId,                 // captured from the session.created event when the call started
      role: event.item.role,     // "user" or "assistant"
      content: event.item.content,
      ts: Date.now(),
    }),
  });
}

Store these in Postgres or any append-only log. Do not store raw audio unless you have a clear reason and user consent — that is a separate compliance conversation.
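
A minimal sketch of that endpoint at app/api/logs/turn/route.ts, assuming the same hypothetical db client as in Step 5 with a transcriptTurn table (swap in whatever storage you actually use):

import { NextResponse } from "next/server";
import { z } from "zod";
// Assumed database client, as in Step 5.
import { db } from "@/lib/db";
 
const TurnSchema = z.object({
  sessionId: z.string(),
  role: z.string(),
  content: z.unknown(),
  ts: z.number(),
});
 
export async function POST(request: Request) {
  const parsed = TurnSchema.safeParse(await request.json());
  if (!parsed.success) {
    return NextResponse.json({ error: "Invalid turn payload" }, { status: 400 });
  }
 
  // Append-only insert; content is stored as JSON so item shapes can evolve.
  await db.transcriptTurn.create({
    data: {
      sessionId: parsed.data.sessionId,
      role: parsed.data.role,
      content: JSON.stringify(parsed.data.content),
      createdAt: new Date(parsed.data.ts),
    },
  });
 
  return NextResponse.json({ ok: true });
}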

Step 8: Production considerations

A demo that works on your laptop is not a production agent. Before you ship:

  • Rate limit the session endpoint. Each call to /api/realtime/session creates a paid OpenAI session. Wrap it in Upstash or Arcjet so a single user cannot loop on Start call and burn your budget.
  • Set a hard maximum duration. Realtime sessions can run for up to 30 minutes. Add a client-side timer that calls stop() after a sensible cap, for example five minutes for a support call (see the sketch after this list).
  • Mask PII in transcripts. Run the saved text through a redaction step before it lands in your warehouse.
  • Localize the voice. Pass the user's language as part of the system prompt so the model picks the right accent and idioms. The MENA region in particular benefits from a native Arabic system prompt rather than English instructions translated on the fly.
  • Monitor with Langfuse or OpenTelemetry. Track latency, tool-call success rate, and average session cost per user.
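
Here is a minimal version of that duration cap, layered onto the start() function from Step 4 (error handling omitted for brevity). The five-minute figure is an assumption; pick whatever fits your support flow:

// app/page.tsx: cap call length so a forgotten tab cannot keep a paid session open.
const MAX_CALL_MS = 5 * 60 * 1000;
 
async function start() {
  setStatus("connecting");
  const client = await startRealtime(onEvent); // same event handler as in Step 5
  clientRef.current = client;
 
  setTimeout(() => {
    // Only end the call if this session is still the active one.
    if (clientRef.current === client) stop();
  }, MAX_CALL_MS);
}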

Troubleshooting

The microphone permission prompt never appears. You called getUserMedia outside a user gesture. Wrap the call in a click handler on the Start call button.

Audio comes through choppy. Check that autoplay is set on the <audio> element. Some browsers block autoplay until the user interacts with the page once. The first click on Start call counts.

The agent never calls the tool. Confirm tool_choice: "auto" is set, and that the system prompt does not forbid tool use. Also check the response.function_call_arguments.done event in your debugger — sometimes the model invents an order number before asking for one, which means your prompt needs to be tightened.

WebRTC connection fails behind a corporate proxy. Add a TURN server. For most public deployments the default STUN config from the API is enough.
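
If you do need TURN, pass it through iceServers when constructing the peer connection in lib/realtime-client.ts. The URLs and credentials below are placeholders for your own coturn deployment or a hosted TURN provider:

const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    {
      // Placeholder values: point these at your own TURN server.
      urls: "turn:turn.example.com:3478",
      username: "demo-user",
      credential: "demo-pass",
    },
  ],
});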

Next steps

You now have the spine of a real voice product. From here you can layer on:

  • Multi-agent handoff — route a call from a triage agent to a billing specialist using a transfer_to_agent tool. See our LangGraph stateful agents tutorial for the orchestration pattern.
  • Retrieval-augmented answers — give the agent a search_docs tool that hits your vector store. Our Agentic RAG with Next.js guide walks through the indexing side.
  • Phone integration — bridge the agent to inbound and outbound phone calls using a Twilio Voice or LiveKit SIP trunk.
  • Observability — wire every tool call and turn into Langfuse so you can measure quality across releases.

Conclusion

The Realtime API collapses the voice-AI stack into something a small team can actually maintain. Three things mattered the most in this build: ephemeral tokens to keep your main key server-side, WebRTC to keep latency human, and function calling to make the agent useful instead of merely talkative. With those primitives in place, the rest is product work — defining the right tools, writing a tight system prompt, and instrumenting the conversations that ship.

Voice is not a gimmick anymore. The teams that learn to ship it now will define the next generation of customer-facing AI.


Want to read more tutorials? Check out our latest tutorial on Building Type-Safe APIs with ORPC and Next.js 15 in 2026.
