Voice AI has crossed a threshold. With the general availability of OpenAI's gpt-realtime model and a wave of production-grade API updates, developers can now build voice agents that answer real phone calls, connect to external tools through MCP servers, and interpret images — all within a single, real-time speech-to-speech session.
This guide covers what's new, why it matters, and how to wire your first voice agent to an actual phone number.
What Is gpt-realtime?
gpt-realtime is OpenAI's speech-to-speech model built for low-latency, bidirectional voice interactions. Unlike text-based pipelines where you transcribe audio, send it to an LLM, then synthesize speech from the response, gpt-realtime handles the entire chain natively — audio in, audio out — with dramatically reduced latency.
The model graduated from preview to General Availability in April 2026, bringing three major upgrades:
- SIP phone calling — connect AI agents directly to the public telephone network
- Remote MCP server support — extend agents with external tools without manual wiring
- Image input — ground conversations in visual context
Compared to its preview predecessor, gpt-realtime shows a 48% improvement in instruction following and a 34% improvement in tool calling accuracy. Two new voices — Cedar and Marin — deliver more natural, expressive speech output.
Key New Features
SIP Phone Integration
Session Initiation Protocol (SIP) is the standard that powers enterprise telephony — PBX systems, call centers, desk phones, and carriers like Twilio and Telnyx. The Realtime API now supports SIP natively, meaning your AI agent can make and receive real phone calls on an actual phone number.
Setup in four steps:
- Point your SIP trunk to:
sip:YOUR_PROJECT_ID@sip.api.openai.com;transport=tls - Configure a webhook in the OpenAI platform under Project → Webhooks
- When a call arrives, OpenAI fires a
realtime.call.incomingevent to your webhook - Accept the call and connect via WebSocket:
wss://api.openai.com/v1/realtime?call_id=CALL_ID
Here is a minimal webhook handler in Python using FastAPI:
from fastapi import FastAPI, Request
import httpx
app = FastAPI()
OPENAI_API_KEY = "sk-..."
@app.post("/webhook/calls")
async def handle_incoming_call(request: Request):
event = await request.json()
if event["type"] == "realtime.call.incoming":
call_id = event["call_id"]
async with httpx.AsyncClient() as client:
await client.post(
f"https://api.openai.com/v1/realtime/calls/{call_id}/accept",
headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
json={
"type": "realtime",
"model": "gpt-realtime-2",
"instructions": "You are a helpful customer support agent. Be concise and friendly."
}
)
return {"status": "ok"}The webhook event includes a call_id, SIP headers (From, To, Call-ID), and a timestamp for verification. For production workloads requiring call recording, routing logic, or DID pools, pairing gpt-realtime with Twilio or Telnyx gives you full carrier-grade infrastructure alongside OpenAI's intelligence layer.
Remote MCP Server Support
Model Context Protocol (MCP) is the emerging standard for connecting AI models to external tools — databases, CRMs, internal APIs, and more. The Realtime API now accepts MCP server URLs directly in the session configuration:
{
"type": "realtime",
"model": "gpt-realtime-2",
"instructions": "You are a booking agent for a hotel chain.",
"tools": [
{
"type": "mcp",
"server_url": "https://your-mcp-server.example.com/sse"
}
]
}Once connected, the API handles tool calls automatically — no manual dispatch loop required. The agent can check availability, create reservations, look up records, and confirm transactions all within a live voice call. This eliminates what used to be hundreds of lines of integration boilerplate.
Image Input in Real-Time Sessions
gpt-realtime now accepts image frames alongside audio. This unlocks scenarios that were previously impossible for voice agents:
- A caller sends a photo of a broken component — the agent diagnoses it verbally
- A customer shares a screenshot of an error — the agent walks them through the fix step by step
- A field technician describes what they see — the agent confirms via the live visual feed
Images are passed as base64 data or URLs within the session event stream, following the same pattern as vision support in the Chat Completions API.
Use Cases
The combination of SIP + MCP + multimodal input makes gpt-realtime practical across a wide range of industries. For MENA and North Africa enterprises, three verticals stand out:
| Industry | Use Case | Features Used |
|---|---|---|
| Contact Centers | Inbound Arabic-language support, appointment scheduling | SIP + MCP (CRM) |
| Healthcare | Patient intake, real-time clinical documentation | SIP + MCP (EHR) |
| Financial Services | Account inquiries, fraud alerts, loan status | SIP + MCP (banking API) |
| Field Service | Remote diagnostics with visual assistance | SIP + Image Input |
| Hospitality | Reservation management, multilingual concierge bots | SIP + MCP (booking system) |
Arabic voice agents are a particularly compelling opportunity: gpt-realtime supports multilingual speech input and output, which means businesses serving Arabic-speaking customers in Tunisia, Saudi Arabia, and the broader MENA region can deploy a single model across their entire contact center stack.
Pricing
Pricing as of May 2026:
| Token Type | Cost per Million Tokens |
|---|---|
| Audio Input | $32 |
| Cached Audio Input | $0.40 |
| Audio Output | $64 |
A typical 1-minute voice exchange costs roughly $0.30, making it competitive with dedicated voice AI platforms and significantly cheaper than staffing human agents at scale.
Getting Started: WebSocket-Only Agent
For web-based voice interactions without phone calling, connect directly via WebSocket:
const WebSocket = require("ws");
const ws = new WebSocket(
"wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
{
headers: {
"Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
"OpenAI-Beta": "realtime=v1"
}
}
);
ws.on("open", () => {
ws.send(JSON.stringify({
type: "session.update",
session: {
instructions: "You are a friendly assistant.",
voice: "cedar",
input_audio_format: "pcm16",
output_audio_format: "pcm16"
}
}));
});Connecting a Phone Number via Twilio
- Create a Twilio Elastic SIP Trunk pointing to
sip:YOUR_PROJECT_ID@sip.api.openai.com;transport=tls - Assign a Twilio DID (phone number) to the trunk
- Set your webhook URL in the OpenAI platform settings
- Deploy your webhook handler and test with an inbound call
OpenAI provides regional IP ranges for SIP allowlisting across North Europe, South Central US, East US 2, and West US — helpful for firewall configuration at enterprise telephony providers.
Production Considerations
Latency: gpt-realtime targets sub-600ms round-trip times. Network distance between your SIP provider and OpenAI's regional endpoints matters — choose the region closest to your users.
Fallback Handling: Implement logic to gracefully handle dropped or rejected calls. The /realtime/calls/{call_id}/reject endpoint accepts standard SIP status codes, so you can return a busy signal or transfer to a human queue when the AI cannot handle a call.
Compliance: For healthcare (HIPAA) and financial services (PCI-DSS) deployments, confirm that your SIP provider and session data handling meet the relevant regulatory requirements before going live.
Conclusion
gpt-realtime closes the gap between AI assistant and production telephony system. By combining low-latency speech-to-speech intelligence with real phone network access via SIP, external tool connectivity via MCP, and visual grounding via image input, OpenAI has assembled a complete stack for the next generation of voice agents.
The most powerful pattern for 2026: gpt-realtime + a carrier-grade SIP provider + your existing MCP servers. That trio can replace significant portions of legacy IVR infrastructure while delivering a far better caller experience.
Start with the webhook handler, connect a test phone number, and you can have an intelligent agent answering calls in under an hour.