
GPT-5.5 'Spud' Is OpenAI's Strongest Coding Model Yet — With One Important Asterisk

·979 words·5 mins·

OpenAI shipped GPT-5.5 on April 23, internally codenamed “Spud”, and for once the pre-launch leaks built expectations the model actually justifies. This is not an incremental patch release. GPT-5.5 is the first fully retrained base model since GPT-4.5, and on several agentic benchmarks it is now the best available model. That matters. So does knowing exactly which benchmarks it wins and which ones it doesn’t.

What Changed in the Architecture

GPT-5.5 is notable not just for the numbers but for what OpenAI did to get them. The 5.1 through 5.4 releases were refinements on a shared base — RLHF tuning, instruction-following tweaks, reasoning mode improvements. GPT-5.5 is a ground-up retraining. OpenAI says the new base model was trained with a stronger emphasis on long-horizon task coherence: the model learns to maintain state across multi-step tool use, not just produce high-quality individual responses.

The practical claim is “a faster, sharper thinker for fewer tokens.” In agentic loops, where each model call compounds in cost, this matters. If GPT-5.5 completes a 10-step debugging task in 7 model calls where Opus 4.7 takes 11, its $30-per-million output token price (vs Opus 4.7’s $25) starts to look comparable in production.
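The arithmetic behind that claim can be made concrete. A minimal sketch, using the article's example numbers and an assumed (illustrative, not measured) 2,000 output tokens per call:

```python
# Rough output-cost-per-task comparison for an agentic loop.
# All figures are illustrative assumptions, not measured values.

def task_cost(calls: int, out_tokens_per_call: int, price_per_m_output: float) -> float:
    """Output-token cost in dollars for one agentic task (ignores input tokens)."""
    return calls * out_tokens_per_call * price_per_m_output / 1_000_000

# GPT-5.5: 7 calls at $30/M output; Opus 4.7: 11 calls at $25/M output.
gpt = task_cost(calls=7, out_tokens_per_call=2000, price_per_m_output=30.0)
opus = task_cost(calls=11, out_tokens_per_call=2000, price_per_m_output=25.0)

print(f"GPT-5.5: ${gpt:.2f}/task  Opus 4.7: ${opus:.2f}/task")
# With these assumptions the pricier-per-token model is cheaper per task.
```

The point is not the specific numbers; it is that in agentic workloads, call count and tokens per call dominate the per-token list price.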

The Benchmark Split

Here is the actual scorecard, as of launch:

| Benchmark          | GPT-5.5 | Claude Opus 4.7 | Winner             |
|--------------------|---------|-----------------|--------------------|
| Terminal-Bench 2.0 | 82.7%   | 69.4%           | GPT-5.5            |
| SWE-bench Pro      | 58.6%   | 64.3%           | Opus 4.7           |
| Expert-SWE         | 73.1%   | —               | GPT-5.5            |
| OSWorld-Verified   | 78.7%   | 78.0%           | GPT-5.5 (marginal) |
| MCP-Atlas          | 75.3%   | 79.1%           | Opus 4.7           |
| GDPval             | 84.9%   | —               | GPT-5.5            |

The story the headline writers landed on — “GPT-5.5 masters agentic coding” — is technically accurate for Terminal-Bench 2.0 and Expert-SWE. But SWE-bench Pro is the benchmark that has consistently proven hardest to game because it tests actual GitHub issue resolution on held-out repositories. On that metric, Claude Opus 4.7 leads by 5.7 percentage points — not a rounding error.

The MCP-Atlas split is also worth noting. MCP-Atlas benchmarks model performance on multi-tool coordination across the Model Context Protocol, which is the real-world substrate for production agentic systems. Opus 4.7’s 79.1% vs GPT-5.5’s 75.3% suggests Anthropic’s tighter integration with its own tooling ecosystem still confers an advantage in the workflows that matter most.

What Terminal-Bench 2.0 Actually Tests

Terminal-Bench 2.0 is worth unpacking because GPT-5.5’s lead over Opus 4.7 is substantial: 82.7% to 69.4%, a 13.3-point gap. The benchmark tests complex command-line workflows: tasks that require planning, iteration, and tool coordination in a terminal environment. Think “set up a CI pipeline with environment-specific config, debug the failing test, and commit a working fix.”

This is, notably, Claude Code’s home turf. Which makes the Terminal-Bench 2.0 number interesting in two directions: it shows GPT-5.5 is genuinely capable in terminal-native workflows, and it raises the question of what happens when GPT-5.5 gets a proper agentic harness — not just ChatGPT and Codex, but a terminal-native deployment model analogous to Claude Code.

That harness does not exist yet.

Availability and the Deployment Gap

GPT-5.5 is currently live in ChatGPT and Codex for paid subscribers (Plus, Pro, Business, Enterprise). The API is still in controlled rollout — OpenAI cited additional safety and security work for serving partners at scale.

Claude Opus 4.7, by contrast, is GA on the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Azure AI Foundry. If you are building production agentic systems today and need the API to be reliably available, that multi-cloud GA status is not a minor footnote.

OpenAI’s API rollout will presumably complete within days to weeks. But “the model exists in ChatGPT” and “you can build autonomous agents against it at production scale” are not the same thing.

The Autonomy Architecture Question

Here is the tension that the benchmark comparison does not capture: GPT-5.5 is an excellent model being delivered through a product architecture that was not designed for terminal-native agentic work. ChatGPT is a conversational interface. Codex is a coding assistant. Both are built around a human in the loop.

Claude Code is built around the terminal as the operating environment. That means persistent bash sessions, multi-agent orchestration, worktree isolation, /ultrareview, /ultraplan, Routines with scheduled and GitHub-event triggers, and a CLAUDE.md-driven project-context model. The model is only one part of the stack. The scaffolding matters enormously.

GPT-5.5 is the best model OpenAI has shipped for agentic tasks. The gap between its benchmark performance and its deployment architecture — still IDE-centric, still conversational-first — is what Claude Code users should be watching, not the SWE-bench Pro delta.

Pricing in Context

Opus 4.7: $5 input / $25 output per million tokens.
GPT-5.5: $5 input / $30 output per million tokens.

GPT-5.5 Pro (extended reasoning mode): $30 input / $180 output — a significant jump for workloads requiring deep multi-step planning.

OpenAI’s “fewer tokens per task” claim is unverified by independent benchmarks at launch. If it holds in practice — say, 20-30% fewer output tokens per completed agentic task — the effective cost per task could be roughly comparable to Opus 4.7. Developer benchmarking over the next few weeks will settle this question.
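The break-even point is easy to compute from the list prices above. At equal output-token usage GPT-5.5 costs 30/25 = 1.2x Opus per task, so it needs to use 1/1.2 of the tokens, roughly a 16.7% reduction, just to match. A short worked sketch (illustrative arithmetic only):

```python
# Break-even token reduction for GPT-5.5 ($30/M output) vs Opus 4.7 ($25/M).
opus_out_price = 25.0  # $/M output tokens
gpt_out_price = 30.0   # $/M output tokens

# GPT-5.5 matches Opus per-task output cost when it uses 25/30 of the tokens.
break_even_reduction = 1 - opus_out_price / gpt_out_price
print(f"break-even: {break_even_reduction:.1%} fewer tokens")  # ~16.7%

# The article's hypothetical 20-30% reduction clears that bar:
for reduction in (0.20, 0.30):
    relative_cost = (1 - reduction) * gpt_out_price / opus_out_price
    print(f"{reduction:.0%} fewer tokens -> {relative_cost:.2f}x Opus cost")
```

So if the claimed efficiency holds at even 20%, GPT-5.5 comes out slightly cheaper per completed task despite the higher sticker price; below about 17%, Opus stays cheaper.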

The Competitive Picture in April 2026

The frontier is genuinely competitive in a way it was not 12 months ago. SWE-bench Verified is now effectively at human baseline across multiple models. SWE-bench Pro is the new discriminating benchmark, and Opus 4.7 leads there. Terminal-Bench 2.0 now has GPT-5.5 in front.

Neither model has runaway dominance. The choice increasingly turns on tooling ecosystem, deployment reliability, and whether you are building around a conversational interface or a terminal-native agent loop.

GPT-5.5 is a real step up. The asterisk is that benchmark leadership is not the same as agentic infrastructure leadership. OpenAI is narrowing the model gap; the architecture gap is a different question entirely.


Sources: Introducing GPT-5.5 | OpenAI, VentureBeat: GPT-5.5 beats Claude Mythos Preview on Terminal-Bench 2.0, MarkTechPost: GPT-5.5 scores 82.7% on Terminal-Bench 2.0, TechCrunch: OpenAI releases GPT-5.5, CNBC: OpenAI announces GPT-5.5, The Next Web: GPT-5.5 launch
