OpenAI Adds WebSocket Support to Responses API, Cutting Latency by 40% for AI Agents

OpenAI has launched WebSocket support for its Responses API, a significant infrastructure upgrade designed to slash latency for long-running AI agents that rely heavily on tool calls. The new mode enables persistent, bidirectional connections that eliminate the overhead of repeated HTTP requests, delivering up to 40% faster end-to-end execution for complex workflows.
Key Highlights
- Up to 40% latency reduction for workflows involving 20+ tool calls
- Persistent connections via wss://api.openai.com/v1/responses — no more resending full conversation history each turn
- Incremental input pattern — only new data (tool outputs, user messages) is sent per turn
- Warmup optimization — pre-load tools and instructions before the first generation turn
- Compatible with Zero Data Retention (ZDR) and store=false for privacy-sensitive deployments
How It Works
Instead of the traditional HTTP request-response cycle, WebSocket mode maintains an open connection between the client and OpenAI's servers. After the initial response.create event, subsequent turns chain via previous_response_id and only send incremental inputs — the new tool results or user messages.
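The turn-chaining pattern above can be sketched as payload construction. Only response.create, previous_response_id, and the incremental-input idea come from the announcement; the remaining field names (input, function_call_output, the example model name, and the resp_abc123 id) are illustrative assumptions modeled on the HTTP Responses API, not a confirmed schema:

```python
import json

def first_turn(model: str, user_text: str) -> dict:
    # The opening response.create carries the full input for the first turn.
    return {
        "type": "response.create",
        "response": {
            "model": model,
            "input": [{"role": "user", "content": user_text}],
        },
    }

def follow_up_turn(previous_response_id: str, new_items: list) -> dict:
    # Later turns chain via previous_response_id and send only the new
    # items (tool results or user messages): the incremental input pattern.
    return {
        "type": "response.create",
        "response": {
            "previous_response_id": previous_response_id,
            "input": new_items,
        },
    }

opening = first_turn("gpt-4.1", "Book a flight to Tokyo")
followup = follow_up_turn(
    "resp_abc123",  # hypothetical id returned by the previous turn
    [{"type": "function_call_output", "call_id": "call_1", "output": "{\"price\": 420}"}],
)
print(json.dumps(followup, indent=2))
```

Note how the follow-up payload omits the model and the earlier conversation entirely; the server reconstructs context from its cached previous response.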
The server maintains the previous response state in a connection-local in-memory cache, meaning the full context doesn't need to be retransmitted each time. This architecture is particularly beneficial for agentic workflows where the AI repeatedly calls external tools.
A warmup feature allows developers to send generate: false to pre-stage tools and instructions, so the first actual generation turn starts faster.
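A warmup turn might look like the following sketch. The generate: false flag is from the announcement; where exactly it sits in the payload, and the surrounding field names, are assumptions for illustration:

```python
def warmup_turn(model: str, instructions: str, tools: list) -> dict:
    # Pre-stage tools and instructions without triggering generation,
    # so the first real generation turn starts faster. The placement of
    # "generate": False inside the response object is an assumption.
    return {
        "type": "response.create",
        "response": {
            "model": model,
            "instructions": instructions,
            "tools": tools,
            "generate": False,
        },
    }

warmup = warmup_turn(
    "gpt-4.1",
    "You are a travel-booking agent.",
    [{"type": "function", "name": "search_flights"}],  # hypothetical tool
)
```

After a warmup turn like this, the first generation request only needs to carry the user's message.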
Why It Matters
As AI agents become more sophisticated, they increasingly rely on chains of tool calls — searching databases, calling APIs, running code, and more. Under the standard HTTP model, each turn requires resending the entire conversation history, creating a latency bottleneck that grows with every tool call.
Coding assistants like Cursor have already reported a 30% speed boost using the new WebSocket mode. For developers building background AI workers or multi-step agent pipelines, this is a meaningful infrastructure improvement.
Limitations
The WebSocket mode has a 60-minute connection limit, after which clients must reconnect. Only one response can be in-flight per connection (no multiplexing), and failed turns evict their cached state to prevent stale data reuse.
What's Next
The WebSocket mode signals OpenAI's broader push toward supporting always-on, persistent AI agents. As the industry moves from single-prompt interactions to long-running autonomous workflows, low-latency infrastructure like this becomes essential.
Developers can start using WebSocket mode today by connecting to wss://api.openai.com/v1/responses with Bearer token authentication.
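A minimal connection sketch, using the third-party websockets package as the client (our choice; any WebSocket client works). The URL and Bearer auth are from the announcement; the event names consumed in the loop are assumptions:

```python
import asyncio
import json
import os

URL = "wss://api.openai.com/v1/responses"

def auth_headers(api_key: str) -> dict:
    # Bearer token authentication, same scheme as the HTTP API.
    return {"Authorization": f"Bearer {api_key}"}

async def run_one_turn(api_key: str, payload: dict) -> None:
    # Deferred import so the helper above is usable without the package.
    import websockets  # pip install websockets

    # websockets >= 14 takes additional_headers=; older releases use extra_headers=.
    async with websockets.connect(URL, additional_headers=auth_headers(api_key)) as ws:
        await ws.send(json.dumps(payload))
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))
            if event.get("type") == "response.completed":  # terminal event name assumed
                break

# Usage (requires OPENAI_API_KEY and network access):
# turn = {"type": "response.create",
#         "response": {"model": "gpt-4.1", "input": "Hello"}}
# asyncio.run(run_one_turn(os.environ["OPENAI_API_KEY"], turn))
```

Keeping the socket open between turns is the whole point: open once, then chain incremental turns over the same connection until the 60-minute limit approaches.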