
Three Bugs, Six Weeks, One Lesson: Anthropic's Claude Code Postmortem


On April 23, Anthropic published an engineering postmortem that the company probably did not want to write. It traced weeks of widely reported quality decline in Claude Code to three separate engineering changes that overlapped in time, affected different parts of the system, and compounded each other in ways that made the root cause unusually difficult to pin down. The post acknowledged the errors clearly, confirmed that all three had been resolved as of April 20 (v2.1.116), and announced that Anthropic was resetting usage limits for all subscribers.

It was, by software industry standards, a reasonably honest postmortem. It was also the third trust-related incident the company had to address in roughly six weeks — following the silent reasoning effort downgrade in March and the Pro plan removal test in April. At some point, the pattern matters as much as the individual incidents.

What Actually Happened

The postmortem identified three distinct changes, each introduced for legitimate reasons, each creating problems that weren’t immediately visible.

Bug 1: Reasoning effort downgrade (March 4)

On March 4, Anthropic changed Claude Code’s default reasoning effort from high to medium. The stated goal was latency reduction — models spend less time generating internal reasoning steps before producing output. The change was not announced in the changelog. Users had no way to know it had happened.

Independent analysis by AMD researchers eventually quantified the impact across 6,852 sessions: a 73% reduction in average thinking depth, with no user-visible signal that anything had changed. Anthropic reversed the change on April 7. The post acknowledged it directly: “This was the wrong tradeoff.” As of the postmortem, all Opus 4.7 users default to xhigh effort; all other models default to high.

Bug 2: Caching regression (March 26)

On March 26, Anthropic shipped a change intended to reduce latency by clearing cached thinking state from sessions that had been idle for more than an hour. The intent was reasonable — stale context from hours ago is usually not useful, and keeping it cached costs resources.

A bug in the implementation caused the clearing to happen not once after idle, but on every subsequent turn for the rest of the session. The practical effect was that Claude appeared forgetful and repetitive within a single working session — re-asking for context it had already been given, losing track of decisions made earlier in the conversation, seemingly unable to maintain a coherent thread over a long task. This was fixed on April 10.
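
To make the failure mode concrete, here is a minimal sketch of how a once-after-idle clear can turn into an every-turn clear. All names and structure are hypothetical; this is not Anthropic's code, just the shape of the bug the postmortem describes:

```typescript
// Hypothetical sketch of the failure mode; names are illustrative,
// not Anthropic's actual implementation.
interface Session {
  lastActiveAt: number;      // epoch ms of the most recent turn
  cachedThinking: string[];  // cached reasoning state for the session
}

const IDLE_LIMIT_MS = 60 * 60 * 1000; // one hour

// Buggy version: the idle timestamp is never refreshed, so once a session
// has crossed the one-hour threshold the condition stays true, and the
// cache is cleared again on every subsequent turn.
function onTurnBuggy(session: Session, now: number): void {
  if (now - session.lastActiveAt > IDLE_LIMIT_MS) {
    session.cachedThinking = [];
  }
}

// Fixed version: refresh the timestamp each turn, so the clear fires at
// most once per idle gap, as intended.
function onTurnFixed(session: Session, now: number): void {
  if (now - session.lastActiveAt > IDLE_LIMIT_MS) {
    session.cachedThinking = [];
  }
  session.lastActiveAt = now;
}
```

If the real bug resembled this at all, the entire regression fits in one missing line: the idle timestamp is never refreshed, so the "idle" condition stays true for the rest of the session.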

This is the least-covered of the three issues, which is somewhat surprising given that it directly affected the continuous multi-step workflows that Claude Code is specifically designed for. A caching bug that made the model appear to lose memory mid-session is a significant regression for agentic use cases.

Bug 3: Verbosity reduction (April 16)

On April 16, Anthropic added a system prompt instruction to reduce response verbosity. The motivation, again, was legitimate: Claude Code’s responses had grown long and users were reporting that the model was padding output with unnecessary explanation. The instruction was meant to tighten that up.

In combination with other prompt changes that were already in place, the verbosity reduction degraded coding quality. The Register summarized the community’s reaction with characteristic economy: “Anthropic admits it dumbed down Claude with ‘upgrades’.” The instruction was reverted on April 20, the same day v2.1.116 shipped with the final fixes.

The Part That Isn’t in the Postmortem

Anthropic’s post is clear about what happened. What’s worth examining is how it happened — specifically, what the postmortem reveals about the limits of internal quality assurance for a hosted AI product.

All three issues were caught by user complaints, not internal evaluations. The VentureBeat coverage framed this as “mystery solved,” but the mystery is only half the story. The more important half is that Anthropic’s own evals missed all three regressions before they shipped to production. A postmortem writeup on Machine Learning at Scale put it plainly: “Anthropic shipped three regressions in a month and their evals didn’t catch one of them.”

The engineering post doesn’t address this directly. It doesn’t describe what the evals looked like, why they failed to detect the regressions, or what changes to the evaluation process are being made. That’s a legitimate gap in an otherwise honest postmortem.

The underlying problem is structural. When you depend on a hosted AI product — where the harness, system prompt, caching behavior, and reasoning configuration are all managed by the vendor — you’re accepting that engineering decisions made without your knowledge will affect your workflows. The three bugs in this postmortem weren’t model failures; they were harness failures. The model didn’t change. What changed was the infrastructure around it, invisible to the user, with no mechanism for users to detect or opt out of the changes.

What Power Users Can Do

The postmortem contains practical information that’s worth extracting.

Reasoning effort is now configurable. As of v2.1.116, the defaults are xhigh for Opus 4.7 and high for other models. You can set this explicitly via --config reasoning_effort:xhigh in your CLI invocation, or in ~/.claude/settings.json. The April 2026 power-user features update documented the xhigh effort level, which was added precisely because users needed explicit control over this parameter.
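
For reference, a minimal sketch of the persistent form. The CLI flag syntax comes from the coverage above; the JSON key below is an assumption inferred from that flag, so verify it against the current documentation:

```jsonc
// ~/.claude/settings.json: key name assumed from the CLI flag; check the docs
{
  "reasoning_effort": "xhigh"
}
```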

CLAUDE.md invariants provide a quality floor. You can’t pin to a specific Claude Code version for hosted behavior, but you can use CLAUDE.md to enforce quality expectations at the project level: reasoning depth requirements, output format constraints, verification steps before marking tasks complete. These don’t prevent harness regressions, but they give the model explicit quality instructions that survive system prompt changes.
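
As an illustration (phrase these however fits your project; nothing here is an official schema), a quality-floor section in CLAUDE.md might look like:

```markdown
<!-- Illustrative invariants only; adapt to your project -->
## Quality invariants

- Reason through multi-step changes before editing; do not skip planning on refactors.
- Keep responses concise, but always summarize what changed and why.
- Before marking a task complete, run the relevant tests and report the results.
```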

Watch the changelog. None of the three changes in this postmortem appeared in the changelog when they shipped. The effort downgrade on March 4 wasn’t disclosed until the AMD study surfaced it. The caching bug shipped and was fixed with minimal public documentation. This is a gap Anthropic should address — material changes to reasoning configuration, caching behavior, or system prompt instructions should appear in release notes.

A Pattern Worth Watching

Anthropic’s public record across the first four months of 2026 contains several incidents worth noting in sequence:

  • March 4: Silent effort downgrade shipped without changelog disclosure
  • March 26: Caching regression shipped, causing forgetfulness in active sessions
  • April 7: Effort default reverted after the AMD study went public
  • April 10: Caching bug fixed
  • April 16: Verbosity reduction prompt shipped
  • April 20: Verbosity instruction reverted
  • April 22: Pro plan A/B test surfaced (Claude Code access removed for ~2% of new users)
  • April 23: Postmortem published, usage limits reset

Each incident has an explanation. The effort downgrade was a latency tradeoff. The caching change was a resource optimization. The verbosity instruction addressed genuine user feedback. The Pro plan test was a pricing A/B experiment.

The pattern that emerges is of a company making product decisions without adequate disclosure, catching problems only after significant user backlash, and then addressing them reactively. That’s a trust problem that individual postmortems don’t fully resolve.

The usage limit reset was a meaningful gesture. Publishing the postmortem was the right call. What would actually change the pattern is a disclosure policy: material changes to reasoning configuration, caching behavior, or system prompt instructions should ship with changelog entries that users can read before the change takes effect.

Claude Code is the best agentic coding tool available. The SWE-bench Pro numbers, the Agent Teams architecture, the terminal-native model — none of that is undermined by this postmortem. What the postmortem does surface is a gap between the product’s technical capabilities and the operational trust that enterprise teams need before treating it as critical infrastructure.

That gap is closeable. The first step is a changelog policy that matches the product’s ambitions.

