Claude Sonnet 5: Anthropic's 'Most Agentic Sonnet Yet' Closes the Gap With Opus

Anthropic didn’t wait for a flagship event to ship its next model. On June 30, Claude Sonnet 5 quietly became the default model in Claude Code and across every Claude plan — no keynote, no Code with Claude conference, just a changelog entry and a model card. That’s a very different launch posture than the Fable 5 rollout three weeks earlier, and it tells you something about where Anthropic thinks the real leverage is: not the frontier ceiling, but the model most developers actually run all day.

What Shipped

Claude Sonnet 5 (claude-sonnet-5 on the API) is now the default for Free and Pro plans, available to Max, Team, and Enterprise, and the default model in Claude Code as of CLI 2.1.197. It carries a native 1M-token context window — not a beta flag, not an opt-in header, just the default and only context size. Sessions auto-compact at roughly 967K tokens, and max output is 128K tokens with adaptive thinking carried over from Sonnet 4.6.

Pricing is the headline for anyone running high-volume agentic workflows: $2 per million input tokens and $10 per million output tokens through August 31, stepping up to standard pricing of $3/$15 after that. Anthropic also shipped a new tokenizer with this release (the same move it made for Opus 4.7), which changes token counts by roughly 1.0–1.35x depending on content — worth checking if you have cost dashboards hardcoded to the old token-per-dollar math.
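
For dashboards that still assume the old tokenizer, the pricing above can be turned into a rough cost band. This is a minimal sketch, not Anthropic's billing logic: the function names are hypothetical, the prices and the 1.0–1.35x range are the figures quoted in this article, and applying the multiplier equally to input and output tokens is a simplifying assumption.

```python
# Prices in USD per million tokens, as quoted in this article.
PROMO_PRICES = {"input": 2.00, "output": 10.00}      # through August 31
STANDARD_PRICES = {"input": 3.00, "output": 15.00}   # after August 31

# Range of the new tokenizer's effect on token counts, as quoted in this article.
TOKENIZER_MULTIPLIER_RANGE = (1.0, 1.35)


def estimate_cost(input_tokens, output_tokens, prices, multiplier=1.0):
    """Estimate USD cost for token counts measured with the old tokenizer.

    Assumption: the multiplier scales input and output counts equally.
    """
    scaled_input = input_tokens * multiplier
    scaled_output = output_tokens * multiplier
    return (scaled_input * prices["input"] + scaled_output * prices["output"]) / 1_000_000


def cost_band(input_tokens, output_tokens, prices):
    """Return (low, high) cost estimates across the quoted multiplier range."""
    low_mult, high_mult = TOKENIZER_MULTIPLIER_RANGE
    return (
        estimate_cost(input_tokens, output_tokens, prices, low_mult),
        estimate_cost(input_tokens, output_tokens, prices, high_mult),
    )


if __name__ == "__main__":
    # Hypothetical monthly volume: 100M input tokens, 10M output tokens.
    for label, prices in (("promo", PROMO_PRICES), ("standard", STANDARD_PRICES)):
        low, high = cost_band(100_000_000, 10_000_000, prices)
        print(f"{label}: ${low:,.2f} to ${high:,.2f}")
```

With that hypothetical volume, the promotional band works out to roughly $300 to $405 per month, and the standard band to roughly $450 to $607.50. The spread inside each band comes entirely from the tokenizer multiplier, which is why it is worth measuring on your own content rather than assuming either end.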

The Benchmarks

Anthropic’s own comparison table, corroborated by TechCrunch’s coverage, puts Sonnet 5’s agentic coding score at 63.2%, up from Sonnet 4.6’s 58.1% — and closing in on Opus 4.8’s 69.2%. That’s the number that matters most for this blog’s readers: a mid-tier model gaining five points on the composite agentic coding eval while costing a third of Opus’s rate.

The rest of the scorecard tells the same story:

Evaluation	Sonnet 4.6	Sonnet 5
Agentic coding	58.1%	63.2%
OSWorld-Verified	78.5%	81.2%
Humanity’s Last Exam (no tools)	34.6%	43.2%
Humanity’s Last Exam (with tools)	46.8%	57.4%

The Humanity’s Last Exam jump (8.6 points without tools, 10.6 points with tools) is the more interesting number for anyone tracking reasoning depth rather than pure coding throughput. It suggests the gains aren’t narrowly tuned to SWE-bench-style tasks; they show up on a benchmark specifically designed to resist that kind of overfitting.
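
To check the per-benchmark gains yourself, the arithmetic is a one-liner per row. The sketch below simply subtracts the Sonnet 4.6 score from the Sonnet 5 score for each evaluation; the scores are copied from the table in this article, and the names `SCORES` and `point_gains` are illustrative.

```python
# (Sonnet 4.6, Sonnet 5) scores in percent, copied from the table in this article.
SCORES = {
    "Agentic coding": (58.1, 63.2),
    "OSWorld-Verified": (78.5, 81.2),
    "Humanity's Last Exam (no tools)": (34.6, 43.2),
    "Humanity's Last Exam (with tools)": (46.8, 57.4),
}


def point_gains(scores):
    """Return the percentage-point gain per evaluation, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    for name, gain in point_gains(SCORES).items():
        print(f"{name}: +{gain} points")
```

The gains come out to 5.1, 2.7, 8.6, and 10.6 points respectively, so the two Humanity's Last Exam rows move the most.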

“It Checks Its Own Work Without Being Asked”

The qualitative feedback from Anthropic’s early-access partners is arguably more telling than the benchmark table. Testers reported that Sonnet 5 finishes multi-part tasks end-to-end where previous Sonnet models would stop short of the finish line, and that it verifies its own output unprompted — running the test suite, checking the diff, catching the edge case — instead of handing back a plausible-looking answer and waiting to be told it’s wrong.

Daniel Shepard at Zapier put it plainly: “That used to stall halfway. For day-to-day automation, it’s a no-brainer.” That’s the sentence that should worry Cursor and GitHub Copilot more than any benchmark row. The pitch for IDE-centric, human-in-the-loop AI tools has always rested partly on the assumption that mid-tier models need a human checking their work at every step. A Sonnet-class model that self-verifies without being asked erodes exactly that assumption — and it erodes it at the price point most teams actually deploy at scale, not the Opus-tier price point reserved for the hardest 10% of tasks.

Where This Leaves Opus 4.8

Anthropic is explicit that this isn’t a replacement play: “Between Sonnet 5 and Opus 4.8, users can adjust the effort level to find the right balance of cost and performance.” That’s the correct way to read the two-model lineup right now. Opus 4.8 still leads on the hardest agentic coding tasks (69.2% vs. 63.2%), and for genuinely difficult multi-day refactors or novel architecture decisions, that six-point gap is real money in outcomes, not just benchmark trivia.
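
One way to put the remaining gap in perspective is to ask how much of the distance from Sonnet 4.6 to Opus 4.8 the new model covers. The sketch below is plain arithmetic on the agentic coding scores quoted in this article; the function name `gap_closed` is illustrative.

```python
# Agentic coding scores in percent, as quoted in this article.
SONNET_4_6 = 58.1
SONNET_5 = 63.2
OPUS_4_8 = 69.2


def gap_closed(old_score, new_score, frontier_score):
    """Fraction of the old-to-frontier gap that the new score covers."""
    total_gap = frontier_score - old_score
    if total_gap <= 0:
        raise ValueError("frontier score must exceed the old score")
    return (new_score - old_score) / total_gap


if __name__ == "__main__":
    fraction = gap_closed(SONNET_4_6, SONNET_5, OPUS_4_8)
    print(f"Gap closed: {fraction:.0%}")
    print(f"Points still separating the models: {OPUS_4_8 - SONNET_5:.1f}")
```

On these numbers, Sonnet 5 covers 5.1 of the 11.1 points that separated Sonnet 4.6 from Opus 4.8, or about 46%, with 6.0 points still to go.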

But the compounding effect of a cheaper, more capable default model is what actually changes daily workflows. Most Claude Code sessions aren’t multi-day refactors; they’re the hundreds of smaller loops (fix this test, extend this endpoint, write this migration) that make up the bulk of agentic coding volume. Moving the default model up five points on agentic coding while cutting the effective cost is a bigger aggregate productivity change than a frontier model gaining the same five points at the top of the price ladder.

Safety Posture, Briefly

Anthropic reports that Sonnet 5 shows lower rates of undesirable and sycophantic behavior than Sonnet 4.6, improved resistance to prompt injection, and better refusal rates on malicious requests. It also carries substantially lower cybersecurity capability than the Opus and Fable model lines; its cyber safeguards are enabled by default but are less restrictive than the ones that got Fable 5 pulled from export markets in June. That distinction matters this week specifically: Fable 5 is being restored globally starting today, and having Sonnet 5’s safety profile clearly differentiated from Fable’s gives Anthropic a cleaner story with regulators about why one model needed export controls and the other doesn’t.

The Practical Takeaway

If you’re running Claude Code day to day, you don’t need to do anything: Sonnet 5 is already your default, and the 1M context window means you can stop manually managing context truncation for most sessions. If you have cost-tracking infrastructure keyed to token counts, check it against the new tokenizer’s 1.0–1.35x multiplier before your July bill surprises you. And if you’ve been paying Opus rates out of habit for tasks that don’t actually need Opus-level reasoning, this is the moment to re-benchmark: a model that closes nearly half of the agentic coding gap between Sonnet 4.6 and Opus 4.8 (5.1 of 11.1 points) at a third of the price is the kind of release that should change your default routing logic, not just your changelog-reading habits.
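
If you do revisit your routing logic, the change can be as small as a threshold. The sketch below is one hypothetical shape for it, not a recommended policy: `claude-sonnet-5` is the API id quoted in this article, while the Opus id is a placeholder (the article does not give it), and the `Task` type, the difficulty score, and the 0.9 threshold are all assumptions you would replace with your own benchmarks.

```python
from dataclasses import dataclass

SONNET_MODEL = "claude-sonnet-5"          # API id quoted in this article
OPUS_MODEL = "OPUS_4_8_MODEL_ID"          # placeholder: the real id is not given here


@dataclass
class Task:
    description: str
    # Hypothetical 0.0-1.0 difficulty estimate supplied by your own tooling.
    estimated_difficulty: float


def choose_model(task, opus_threshold=0.9):
    """Route only the hardest tasks to Opus; default everything else to Sonnet.

    The 0.9 default is an assumption that mirrors the idea of reserving
    Opus for roughly the hardest 10% of tasks.
    """
    if not 0.0 <= task.estimated_difficulty <= 1.0:
        raise ValueError("estimated_difficulty must be between 0.0 and 1.0")
    if task.estimated_difficulty >= opus_threshold:
        return OPUS_MODEL
    return SONNET_MODEL


if __name__ == "__main__":
    for task in (
        Task("fix this test", 0.2),
        Task("write this migration", 0.4),
        Task("novel architecture decision", 0.95),
    ):
        print(f"{task.description}: {choose_model(task)}")
```

Keeping the threshold as a parameter makes it easy to re-run your own evals at a few cutoffs and see where the cheaper model stops being good enough.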
