{"id":63,"date":"2026-04-26T12:51:37","date_gmt":"2026-04-26T12:51:37","guid":{"rendered":"https:\/\/blog.quansys.ai\/?p=63"},"modified":"2026-04-26T12:55:52","modified_gmt":"2026-04-26T12:55:52","slug":"speeding-up-agentic-workflows-with-websockets-in-the-responses-api-2","status":"publish","type":"post","link":"https:\/\/blog.quansys.ai\/?p=63","title":{"rendered":"Speeding up agentic workflows with WebSockets in the Responses API"},"content":{"rendered":"\n<p><\/p>\n\n\n\n<p>When you ask Codex to fix a bug, it scans through your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Under the hood, that means dozens of back-and-forth Responses API requests: determine the model\u2019s next action, run a tool on your computer, send the tool output back to the API, and repeat.<\/p>\n\n\n\n<p>All of these requests can add up to minutes that users spend waiting for Codex to complete complex tasks. From a latency perspective, the Codex agent loop spends most of its time in three main stages:&nbsp;working in the API services (to validate and process requests), model inference, and client-side time (running tools and building model context). Inference is the stage where the model runs on GPUs to generate new tokens. In the past, running LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide. As inference gets faster, the cumulative API overhead from an agentic rollout is much more notable.<\/p>\n\n\n\n<p>In this post, we&#8217;ll explain how we made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 tokens per second. 
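<\/p>\n\n\n\n<p>The agent loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual Codex implementation: <code>create_response<\/code> stands in for a call to the Responses API (such as the OpenAI SDK&#8217;s <code>client.responses.create<\/code>), and <code>run_tool<\/code> is a hypothetical helper that executes a tool call locally.<\/p>

```python
# Illustrative sketch of the agent loop: one API round trip per step.
# create_response stands in for a Responses API call; run_tool is a
# hypothetical helper that executes a tool call on the local machine.
def agent_loop(task, create_response, run_tool):
    response = create_response(input=task, previous_response_id=None)
    while True:
        calls = [item for item in response['output']
                 if item['type'] == 'function_call']
        if not calls:
            # No tool calls left: the model produced its final message.
            return response['output_text']
        tool_outputs = [{'type': 'function_call_output',
                         'call_id': call['call_id'],
                         'output': run_tool(call)} for call in calls]
        # Each follow-up is a brand-new request: previous_response_id threads
        # the conversation, but the API still revalidates request state.
        response = create_response(input=tool_outputs,
                                   previous_response_id=response['id'])
```

<p>Every iteration of this loop pays the full per-request API overhead, which is the cost the rest of this post works to eliminate.<\/p>\n\n\n\n<p>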
We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and\u2014most importantly\u2014building a way to create a persistent connection to the Responses API, instead of having to make a series of synchronous API calls.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1192\" height=\"1118\" src=\"https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/A_Codex_Agent_Loop_In_Practice.webp\" alt=\"Diagram titled \u201cA Codex agent loop in practice\u201d showing an iterative flow between Codex and the Responses API, with tool calls (rg, sed, apply_patch, pytest) and results exchanged until the final message: \u201cThe bug has been fixed.\u201d\" class=\"wp-image-26\" srcset=\"https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/A_Codex_Agent_Loop_In_Practice.webp 1192w, https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/A_Codex_Agent_Loop_In_Practice-300x281.webp 300w, https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/A_Codex_Agent_Loop_In_Practice-1024x960.webp 1024w, https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/A_Codex_Agent_Loop_In_Practice-768x720.webp 768w\" sizes=\"auto, (max-width: 1192px) 100vw, 1192px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">When the API became the bottleneck<\/h2>\n\n\n\n<p>In the Responses API, previous flagship models like GPT\u20115 and GPT\u20115.2 ran at roughly 65 tokens per second (TPS). For the launch of GPT\u20115.3\u2011Codex\u2011Spark, a fast coding model, our goal was an order of magnitude faster: over 1,000 TPS, enabled by specialized Cerebras hardware optimized for LLM inference. 
To make sure users could experience the true speed of this new model, we had to reduce API overhead.&nbsp;<\/p>\n\n\n\n<p>Around November 2025, we launched a performance sprint on the Responses API, landing many optimizations that reduced critical-path latency for a single request:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Caching rendered tokens and model configuration in memory to skip expensive tokenization and network calls for multi-turn responses<\/li>\n\n\n\n<li>Reducing network hop latency by eliminating calls to intermediate services (for example, image processing resolution) and calling the inference service directly<\/li>\n\n\n\n<li>Improving our safety stack so we could run certain classifiers to flag conversations faster<\/li>\n<\/ul>\n\n\n\n<p>With these improvements, we saw close to a 45% improvement in time to first token (TTFT)\u2014which reflects how responsive the API feels\u2014but that was still not fast enough for GPT\u20115.3\u2011Codex\u2011Spark. Responses API overhead remained too large relative to the speed of the model\u2014that is, users had to wait for the CPUs running our API before they could use the GPUs serving the model.<\/p>\n\n\n\n<p>The deeper issue was structural: we treated each Codex request as independent, processing conversation state and other reusable context in every follow-up request. Even when most of the conversation hadn&#8217;t changed, we still paid for work tied to the full history. As conversations got longer, that repeated processing became more expensive.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Building a persistent connection<\/h2>\n\n\n\n<p>To tighten up the design, we rethought the transport protocol: could we keep a persistent connection and cache state, rather than establishing a new connection over HTTP and sending the full conversation history for each follow-up request? 
The idea was to send only the new information that required validation and processing, and to cache reusable state in memory for the lifetime of the connection. This would cut the overhead of redundant work.<\/p>\n\n\n\n<p>We considered a few different approaches, including WebSockets and gRPC bidirectional streaming. We landed on WebSockets because, as a simple message transport protocol, it let users keep their existing Responses API input and output shapes. It was developer-friendly and fit our existing architecture with little disruption.<\/p>\n\n\n\n<p>The first WebSocket prototype changed what we thought was possible for Responses API latency. An engineer on the Codex team with deep expertise across the API stack pulled together a prototype by running a Codex agent overnight.<\/p>\n\n\n\n<p>In that prototype, agentic rollouts were modeled as a single long-running Response. Using&nbsp;<code>asyncio<\/code>&nbsp;features, the Responses API would asynchronously block in the sampling loop after a tool call was sampled and send a&nbsp;<code>response.done<\/code>&nbsp;event back to the client. After executing the tool call, clients would send back a&nbsp;<code>response.append<\/code>&nbsp;event with the tool result, which unblocked the sampling loop and let the model continue.<\/p>\n\n\n\n<p>An analogy here is treating the local tool call as a hosted tool call. When the model calls web search, the inference loop blocks, calls a web search service, and puts the service response in the model context. In our design, we did the same thing, but instead of calling a remote service, we sent the model&#8217;s tool call back to the client over the WebSocket. When the client responded, we put the client&#8217;s tool call response into the context and continued to sample.<\/p>\n\n\n\n<p>This design was extremely effective because it eliminated repeated API work across an agent rollout. 
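<\/p>\n\n\n\n<p>From the client&#8217;s side, that prototype flow looks roughly like the sketch below. The <code>response.done<\/code> and <code>response.append<\/code> event names are the ones described above; the payload fields and the <code>recv<\/code>\/<code>send<\/code> callables standing in for a WebSocket connection are illustrative assumptions.<\/p>

```python
import json

# Hedged sketch of the prototype's client loop over one long-running Response.
# recv() returns the next JSON event from the connection; send() writes one.
def prototype_client_loop(recv, send, run_tool):
    while True:
        event = json.loads(recv())
        if event['type'] != 'response.done':
            continue  # ignore streaming deltas in this sketch
        calls = [item for item in event['response']['output']
                 if item['type'] == 'function_call']
        if not calls:
            return event['response']  # the rollout is finished
        # Run the tool calls locally, then unblock the server-side sampling
        # loop by appending the results to the same long-running Response.
        send(json.dumps({
            'type': 'response.append',
            'input': [{'type': 'function_call_output',
                       'call_id': call['call_id'],
                       'output': run_tool(call)} for call in calls],
        }))
```

<p>The model&#8217;s context stays server-side for the whole rollout; the client only ships tool results over the open connection.<\/p>\n\n\n\n<p>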
We could do preinference work once, pause for tool execution, and do postinference work once at the end.<\/p>\n\n\n\n<p>Unfortunately, this came at the cost of a less familiar and more complicated API shape. We wanted developers to be able to drop in WebSocket support without having to rewrite their API integration around a new interaction mode.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Keeping the API familiar while making the stack incremental<\/h2>\n\n\n\n<p>For the version we launched, we switched back to a familiar shape: keep using&nbsp;<code>response.create<\/code>&nbsp;with the same body, and use&nbsp;<code>previous_response_id<\/code>&nbsp;to continue the conversation context from the previous response\u2019s state.<\/p>\n\n\n\n<p>On a WebSocket connection, the server keeps a connection-scoped, in-memory cache of previous response state.<\/p>\n\n\n\n<p>That cached state includes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The previous&nbsp;<code>response<\/code>&nbsp;object<\/li>\n\n\n\n<li>Prior input and output items<\/li>\n\n\n\n<li>Tool definitions and namespaces<\/li>\n\n\n\n<li>Reusable sampling artifacts, like previously rendered tokens<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1192\" height=\"1892\" src=\"https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/From_sequential_requests_to_overlapped_execution__2_.webp\" alt=\"Diagram titled \u201cFrom sequential requests to overlapped execution\u201d comparing a sequential request pipeline with a WebSocket-based approach where multiple requests overlap across validation, preinference, sampling, and postinference stages.\" class=\"wp-image-27\" srcset=\"https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/From_sequential_requests_to_overlapped_execution__2_.webp 1192w, https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/From_sequential_requests_to_overlapped_execution__2_-189x300.webp 189w, 
https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/From_sequential_requests_to_overlapped_execution__2_-645x1024.webp 645w, https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/From_sequential_requests_to_overlapped_execution__2_-768x1219.webp 768w, https:\/\/blog.quansys.ai\/wp-content\/uploads\/2026\/04\/From_sequential_requests_to_overlapped_execution__2_-968x1536.webp 968w\" sizes=\"auto, (max-width: 1192px) 100vw, 1192px\" \/><\/figure>\n\n\n\n<p>By reusing the in-memory previous response state, we were able to land several major optimizations:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Making some of our safety classifiers and request validators process only new input, not the full history every time<\/li>\n\n\n\n<li>Keeping an in-memory cache of rendered tokens that we append to so we can skip unnecessary tokenization<\/li>\n\n\n\n<li>Reusing our successful model resolution\/routing logic across requests&nbsp;<\/li>\n\n\n\n<li>Overlapping non-blocking postinference work like billing with subsequent requests<\/li>\n<\/ul>\n\n\n\n<p>The goal was to get as close as possible to the minimal-overhead prototype but with an API shape developers already understood and built around.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Setting a new bar for speed<\/h2>\n\n\n\n<p>After a two-month sprint building WebSocket mode, we launched an alpha with key coding agent startups so they could integrate it into their infrastructure and safely ramp up traffic. Alpha users loved it, reporting&nbsp;<a href=\"https:\/\/x.com\/aisdk\/status\/2026031263925039591\" target=\"_blank\" rel=\"noreferrer noopener\">up to 40% improvements<\/a>&nbsp;in their agentic workflows. Given the positive alpha feedback, we were ready to launch.<\/p>\n\n\n\n<p>The launch results were immediate. Codex quickly ramped up the majority of their Responses API traffic onto WebSocket mode, seeing significant latency improvements. 
For GPT\u20115.3\u2011Codex\u2011Spark, we hit our 1,000 TPS target and saw bursts up to 4,000 TPS, showing that the Responses API could keep up with much faster inference in real production traffic. The impact showed up quickly in the developer community too:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Codex users running the latest models such as&nbsp;<a href=\"https:\/\/developers.openai.com\/api\/docs\/models\/gpt-5.3-codex\" target=\"_blank\" rel=\"noreferrer noopener\">GPT\u20115.3\u2011Codex<\/a>,&nbsp;<a href=\"https:\/\/developers.openai.com\/api\/docs\/models\/gpt-5.4\" target=\"_blank\" rel=\"noreferrer noopener\">GPT\u20115.4<\/a>, and beyond all benefit from WebSocket mode\u2019s speed-up.<\/li>\n\n\n\n<li>Vercel integrated WebSocket mode into the AI SDK and saw latency decrease by&nbsp;<a href=\"https:\/\/x.com\/aisdk\/status\/2026031263925039591\" target=\"_blank\" rel=\"noreferrer noopener\">up to 40%<\/a>.<\/li>\n\n\n\n<li>Cline\u2019s multi-file workflows are&nbsp;<a href=\"https:\/\/x.com\/cline\/status\/2026031848791630033\" target=\"_blank\" rel=\"noreferrer noopener\">39% faster<\/a>.<\/li>\n\n\n\n<li>OpenAI models in Cursor became up to&nbsp;<a href=\"https:\/\/x.com\/leerob\/status\/2026030244407468259\" target=\"_blank\" rel=\"noreferrer noopener\">30% faster<\/a>.<\/li>\n<\/ul>\n\n\n\n<p>WebSocket mode is one of the most significant new capabilities in the Responses API since its launch in March 2025. We went from idea to running in production in just a few weeks through close collaboration between OpenAI&#8217;s API and Codex teams. 
It not only dramatically improves agent rollout latency but also supports a growing need for builders: as model inference gets faster, the services and systems that surround inference also need to speed up to transfer these gains to users.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When you ask Codex to fix a bug, it scans through your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Under the hood, that means dozens of back-and-forth Responses API requests: determine the model\u2019s next action, run a tool on your computer, send the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":69,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-63","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-agents"],"_links":{"self":[{"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=\/wp\/v2\/posts\/63","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=63"}],"version-history":[{"count":2,"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=\/wp\/v2\/posts\/63\/revisions"}],"predecessor-version":[{"id":74,"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=\/wp\/v2\/posts\/63\/revisions\/74"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=\/wp\/v2\/media\/69"}],"wp:attachment":[{"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=63"}],"wp:term":[{"taxonomy":"categ
ory","embeddable":true,"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=63"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.quansys.ai\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=63"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}