OpenAI Adds WebSocket Support to Responses API, Cutting Latency by 40% for AI Agents

OpenAI has launched WebSocket support for its Responses API, a significant infrastructure upgrade designed to slash latency for long-running AI agents that rely heavily on tool calls. The new mode enables persistent, bidirectional connections that eliminate the overhead of repeated HTTP requests, delivering up to 40% faster end-to-end execution for complex workflows.
Key Highlights
- Up to 40% latency reduction for workflows involving 20+ tool calls
- Persistent connections via wss://api.openai.com/v1/responses — no more resending full conversation history each turn
- Incremental input pattern — only new data (tool outputs, user messages) is sent per turn
- Warmup optimization — pre-load tools and instructions before the first generation turn
- Compatible with Zero Data Retention (ZDR) and store=false for privacy-sensitive deployments
How It Works
Instead of the traditional HTTP request-response cycle, WebSocket mode maintains an open connection between the client and OpenAI's servers. After the initial response.create event, subsequent turns chain via previous_response_id and only send incremental inputs — the new tool results or user messages.
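The turn-chaining pattern above can be sketched as payload construction. Only response.create, previous_response_id, and the incremental-input idea come from the announcement; the remaining field names (input, function_call_output, the example model name, and the resp_abc123 id) are illustrative assumptions modeled on the HTTP Responses API, not a confirmed schema:

```python
import json

def first_turn(model: str, user_text: str) -> dict:
    # The opening response.create carries the full input for the first turn.
    return {
        "type": "response.create",
        "response": {
            "model": model,
            "input": [{"role": "user", "content": user_text}],
        },
    }

def follow_up_turn(previous_response_id: str, new_items: list) -> dict:
    # Later turns chain via previous_response_id and send only the new
    # items (tool results or user messages): the incremental input pattern.
    return {
        "type": "response.create",
        "response": {
            "previous_response_id": previous_response_id,
            "input": new_items,
        },
    }

opening = first_turn("gpt-4.1", "Book a flight to Tokyo")
followup = follow_up_turn(
    "resp_abc123",  # hypothetical id returned by the previous turn
    [{"type": "function_call_output", "call_id": "call_1", "output": "{\"price\": 420}"}],
)
print(json.dumps(followup, indent=2))
```

Note how the follow-up payload omits the model and the earlier conversation entirely; the server reconstructs context from its cached previous response.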
The server maintains the previous response state in a connection-local in-memory cache, meaning the full context doesn't need to be retransmitted each time. This architecture is particularly beneficial for agentic workflows where the AI repeatedly calls external tools.
A warmup feature allows developers to send generate: false to pre-stage tools and instructions, so the first actual generation turn starts faster.
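A warmup turn might look like the following sketch. The generate: false flag is from the announcement; where exactly it sits in the payload, and the surrounding field names, are assumptions for illustration:

```python
def warmup_turn(model: str, instructions: str, tools: list) -> dict:
    # Pre-stage tools and instructions without triggering generation,
    # so the first real generation turn starts faster. The placement of
    # "generate": False inside the response object is an assumption.
    return {
        "type": "response.create",
        "response": {
            "model": model,
            "instructions": instructions,
            "tools": tools,
            "generate": False,
        },
    }

warmup = warmup_turn(
    "gpt-4.1",
    "You are a travel-booking agent.",
    [{"type": "function", "name": "search_flights"}],  # hypothetical tool
)
```

After a warmup turn like this, the first generation request only needs to carry the user's message.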
Why It Matters
As AI agents become more sophisticated, they increasingly rely on chains of tool calls — searching databases, calling APIs, running code, and more. Under the standard HTTP model, each turn requires resending the entire conversation history, creating a latency bottleneck that grows with every tool call.
Coding assistants like Cursor have already reported a 30% speed boost using the new WebSocket mode. For developers building background AI workers or multi-step agent pipelines, this is a meaningful infrastructure improvement.
Limitations
The WebSocket mode has a 60-minute connection limit, after which clients must reconnect. Only one response can be in-flight per connection (no multiplexing), and failed turns evict their cached state to prevent stale data reuse.
What's Next
The WebSocket mode signals OpenAI's broader push toward supporting always-on, persistent AI agents. As the industry moves from single-prompt interactions to long-running autonomous workflows, low-latency infrastructure like this becomes essential.
Developers can start using WebSocket mode today by connecting to wss://api.openai.com/v1/responses with Bearer token authentication.
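A minimal connection sketch, using the third-party websockets package as the client (our choice; any WebSocket client works). The URL and Bearer auth are from the announcement; the event names consumed in the loop are assumptions:

```python
import asyncio
import json
import os

URL = "wss://api.openai.com/v1/responses"

def auth_headers(api_key: str) -> dict:
    # Bearer token authentication, same scheme as the HTTP API.
    return {"Authorization": f"Bearer {api_key}"}

async def run_one_turn(api_key: str, payload: dict) -> None:
    # Deferred import so the helper above is usable without the package.
    import websockets  # pip install websockets

    # websockets >= 14 takes additional_headers=; older releases use extra_headers=.
    async with websockets.connect(URL, additional_headers=auth_headers(api_key)) as ws:
        await ws.send(json.dumps(payload))
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))
            if event.get("type") == "response.completed":  # terminal event name assumed
                break

# Usage (requires OPENAI_API_KEY and network access):
# turn = {"type": "response.create",
#         "response": {"model": "gpt-4.1", "input": "Hello"}}
# asyncio.run(run_one_turn(os.environ["OPENAI_API_KEY"], turn))
```

Keeping the socket open between turns is the whole point: open once, then chain incremental turns over the same connection until the 60-minute limit approaches.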