OpenAI gpt-realtime: Build Voice Agents That Answer Calls

Voice AI has crossed a threshold. With the general availability of OpenAI's gpt-realtime model and a wave of production-grade API updates, developers can now build voice agents that answer real phone calls, connect to external tools through MCP servers, and interpret images — all within a single, real-time speech-to-speech session.

This guide covers what's new, why it matters, and how to wire your first voice agent to an actual phone number.

What Is gpt-realtime?

gpt-realtime is OpenAI's speech-to-speech model built for low-latency, bidirectional voice interactions. Unlike text-based pipelines where you transcribe audio, send it to an LLM, then synthesize speech from the response, gpt-realtime handles the entire chain natively — audio in, audio out — with dramatically reduced latency.

The model graduated from preview to General Availability in April 2026, bringing three major upgrades:

SIP phone calling — connect AI agents directly to the public telephone network
Remote MCP server support — extend agents with external tools without manual wiring
Image input — ground conversations in visual context

Compared to its preview predecessor, gpt-realtime shows a 48% improvement in instruction following and a 34% improvement in tool calling accuracy. Two new voices — Cedar and Marin — deliver more natural, expressive speech output.

Key New Features

SIP Phone Integration

Session Initiation Protocol (SIP) is the standard that powers enterprise telephony — PBX systems, call centers, desk phones, and carriers like Twilio and Telnyx. The Realtime API now supports SIP natively, meaning your AI agent can make and receive real phone calls on an actual phone number.

Setup in four steps:

Point your SIP trunk to: sip:YOUR_PROJECT_ID@sip.api.openai.com;transport=tls
Configure a webhook in the OpenAI platform under Project → Webhooks
When a call arrives, OpenAI fires a realtime.call.incoming event to your webhook
Accept the call and connect via WebSocket: wss://api.openai.com/v1/realtime?call_id=CALL_ID

Here is a minimal webhook handler in Python using FastAPI:

from fastapi import FastAPI, Request
import httpx
 
app = FastAPI()
OPENAI_API_KEY = "sk-..."
 
@app.post("/webhook/calls")
async def handle_incoming_call(request: Request):
    event = await request.json()
 
    if event["type"] == "realtime.call.incoming":
        call_id = event["call_id"]
 
        async with httpx.AsyncClient() as client:
            await client.post(
                f"https://api.openai.com/v1/realtime/calls/{call_id}/accept",
                headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
                json={
                    "type": "realtime",
                    "model": "gpt-realtime-2",
                    "instructions": "You are a helpful customer support agent. Be concise and friendly."
                }
            )
 
    return {"status": "ok"}

The webhook event includes a call_id, SIP headers (From, To, Call-ID), and a timestamp for verification. For production workloads requiring call recording, routing logic, or DID pools, pairing gpt-realtime with Twilio or Telnyx gives you full carrier-grade infrastructure alongside OpenAI's intelligence layer.

Remote MCP Server Support

Model Context Protocol (MCP) is the emerging standard for connecting AI models to external tools — databases, CRMs, internal APIs, and more. The Realtime API now accepts MCP server URLs directly in the session configuration:

{
  "type": "realtime",
  "model": "gpt-realtime-2",
  "instructions": "You are a booking agent for a hotel chain.",
  "tools": [
    {
      "type": "mcp",
      "server_url": "https://your-mcp-server.example.com/sse"
    }
  ]
}

Once connected, the API handles tool calls automatically — no manual dispatch loop required. The agent can check availability, create reservations, look up records, and confirm transactions all within a live voice call. This eliminates what used to be hundreds of lines of integration boilerplate.

Image Input in Real-Time Sessions

gpt-realtime now accepts image frames alongside audio. This unlocks scenarios that were previously impossible for voice agents:

A caller sends a photo of a broken component — the agent diagnoses it verbally
A customer shares a screenshot of an error — the agent walks them through the fix step by step
A field technician describes what they see — the agent confirms via the live visual feed

Images are passed as base64 data or URLs within the session event stream, following the same pattern as vision support in the Chat Completions API.

Use Cases

The combination of SIP + MCP + multimodal input makes gpt-realtime practical across a wide range of industries. For MENA and North Africa enterprises, three verticals stand out:

Industry	Use Case	Features Used
Contact Centers	Inbound Arabic-language support, appointment scheduling	SIP + MCP (CRM)
Healthcare	Patient intake, real-time clinical documentation	SIP + MCP (EHR)
Financial Services	Account inquiries, fraud alerts, loan status	SIP + MCP (banking API)
Field Service	Remote diagnostics with visual assistance	SIP + Image Input
Hospitality	Reservation management, multilingual concierge bots	SIP + MCP (booking system)

Arabic voice agents are a particularly compelling opportunity: gpt-realtime supports multilingual speech input and output, which means businesses serving Arabic-speaking customers in Tunisia, Saudi Arabia, and the broader MENA region can deploy a single model across their entire contact center stack.

Pricing

Pricing as of May 2026:

Token Type	Cost per Million Tokens
Audio Input	$32
Cached Audio Input	$0.40
Audio Output	$64

A typical 1-minute voice exchange costs roughly $0.30, making it competitive with dedicated voice AI platforms and significantly cheaper than staffing human agents at scale.

Getting Started: WebSocket-Only Agent

For web-based voice interactions without phone calling, connect directly via WebSocket:

const WebSocket = require("ws");
 
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  {
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1"
    }
  }
);
 
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a friendly assistant.",
      voice: "cedar",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16"
    }
  }));
});

Connecting a Phone Number via Twilio

Create a Twilio Elastic SIP Trunk pointing to sip:YOUR_PROJECT_ID@sip.api.openai.com;transport=tls
Assign a Twilio DID (phone number) to the trunk
Set your webhook URL in the OpenAI platform settings
Deploy your webhook handler and test with an inbound call

OpenAI provides regional IP ranges for SIP allowlisting across North Europe, South Central US, East US 2, and West US — helpful for firewall configuration at enterprise telephony providers.

Production Considerations

Latency: gpt-realtime targets sub-600ms round-trip times. Network distance between your SIP provider and OpenAI's regional endpoints matters — choose the region closest to your users.

Fallback Handling: Implement logic to gracefully handle dropped or rejected calls. The /realtime/calls/{call_id}/reject endpoint accepts standard SIP status codes, so you can return a busy signal or transfer to a human queue when the AI cannot handle a call.

Compliance: For healthcare (HIPAA) and financial services (PCI-DSS) deployments, confirm that your SIP provider and session data handling meet the relevant regulatory requirements before going live.

Conclusion

gpt-realtime closes the gap between AI assistant and production telephony system. By combining low-latency speech-to-speech intelligence with real phone network access via SIP, external tool connectivity via MCP, and visual grounding via image input, OpenAI has assembled a complete stack for the next generation of voice agents.

The most powerful pattern for 2026: gpt-realtime + a carrier-grade SIP provider + your existing MCP servers. That trio can replace significant portions of legacy IVR infrastructure while delivering a far better caller experience.

Start with the webhook handler, connect a test phone number, and you can have an intelligent agent answering calls in under an hour.