
EvoClaw: AI Agents Hit 80% on Isolated Tasks and 38% on Real Codebases — The 54-Point Performance Cliff

Author
Florent Clairambault
CTO & Software engineer

Every major AI coding benchmark measures the same thing: can an agent fix this isolated bug? Can it implement this standalone feature? Can it complete this self-contained task?

Software development is not a collection of isolated tasks. It’s continuous evolution — each change depends on prior changes, each new feature must preserve existing behavior, and errors compound across commits. The standard benchmarks can’t measure that. EvoClaw, published at ICML 2026, does.

The findings are sobering: agents that score above 80% on isolated task benchmarks max out at 38% on continuous development. The best performer — Claude Opus 4.6 in the Claude Code framework — achieved 36.29%. GPT-5.3-Codex reached 28.88%. Gemini 3.1 Pro hit 23.32%. The gap between leaderboard performance and sustained development performance is over 54 percentage points.

What EvoClaw Measures

The benchmark, developed by researchers at USC, Princeton, Yale, and Stanford and accepted at ICML 2026, constructs 98 verified development milestones across seven diverse open-source repositories spanning five programming languages. Each milestone represents a coherent development objective extracted from real commit history. Milestones are interdependent — later milestones build on earlier ones, and failing to complete milestone 3 correctly will corrupt the environment for milestone 4.

The dataset characteristics tell you why this is hard:

  • Average of 27.4 files modified per milestone
  • Average of 17.1 failing-to-passing tests per milestone (the changes an agent needs to implement)
  • Average of 6,218 passing tests that must stay passing (the regressions an agent must not introduce)
  • Gold-patch size ranges from under 100 to over 1,500 lines of code

The scoring reflects this dual constraint. Recall measures whether the agent implemented the required functional changes. Precision measures whether the agent avoided introducing regressions. The final Score is the harmonic mean of the two, so you can’t game it by implementing new behavior while breaking existing behavior.
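As a rough illustration of why a harmonic mean resists gaming, here is a minimal sketch. The function name and the example values are hypothetical, and the two-term formula is the standard harmonic mean, assumed here rather than taken from the paper's exact definition.

```python
def harmonic_score(recall: float, precision: float) -> float:
    """Standard harmonic mean of two rates in [0, 1].

    Assumed form for illustration; not the benchmark's own code.
    """
    if not (0.0 <= recall <= 1.0 and 0.0 <= precision <= 1.0):
        raise ValueError("recall and precision must be in [0, 1]")
    if recall + precision == 0.0:
        return 0.0  # avoid division by zero when both rates are zero
    return 2.0 * recall * precision / (recall + precision)


# Two hypothetical agents with the same arithmetic mean (0.5):
balanced = harmonic_score(0.5, 0.5)  # implements half, preserves half
lopsided = harmonic_score(0.9, 0.1)  # ships features, breaks most of what worked
```

Under these made-up inputs the balanced agent scores 0.5 while the lopsided one scores roughly 0.18: the weaker of the two rates dominates the result.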

The Results

The headline comparison is stark:

Setting                              Best result
Isolated tasks (SWE-bench style)     >80%
Continuous development (EvoClaw)     38% maximum

Full model results in the continuous setting:

Model + Framework                    EvoClaw Score
Claude Opus 4.6 (Claude Code)        36.29%
Claude Sonnet 4.6 (Claude Code)      29.58%
Claude Opus 4.5 (Claude Code)        25.85%
GPT-5.3-Codex (Codex CLI)            28.88%
GPT-5.2 (Codex CLI)                  23.30%
Gemini 3 Flash (Gemini CLI)          24.22%
Gemini 3.1 Pro (Gemini CLI)          23.32%
Gemini 3 Pro (Gemini CLI)            24.25%
Claude Sonnet 4.5 (Claude Code)      15.16%
GPT-5.2-Codex (Codex CLI)            13.46%

The best absolute milestone resolve rate — completing a full milestone end-to-end — was only 13.37% across all agents. Most of the time, agents achieve partial progress: some of the required tests pass, some regressions appear, and the next milestone inherits a broken environment.

One notable finding on cost: Gemini 3 Flash achieved 24.22% at approximately 1/9th the cost of Gemini 3 Pro, which scored 24.25%. For the continuous development use case, the cheapest model in the Google family matched the most expensive one and edged out Gemini 3.1 Pro (23.32%). This is consistent with a pattern seen elsewhere: for sustained multi-step work, the largest model in a family doesn’t have a reliable quality advantage over its faster, cheaper siblings.
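To put the cost comparison in numbers, here is a back-of-the-envelope sketch. It assumes Gemini 3 Pro's cost is normalized to 1.0 and takes the "approximately 1/9th" figure at face value; the function and variable names are hypothetical.

```python
def score_per_relative_cost(score_pct: float, relative_cost: float) -> float:
    """EvoClaw score points obtained per unit of relative cost."""
    if relative_cost <= 0:
        raise ValueError("relative_cost must be positive")
    return score_pct / relative_cost


# Assumption: Gemini 3 Pro cost is the unit; Flash costs about one ninth of it.
pro_value = score_per_relative_cost(24.25, 1.0)
flash_value = score_per_relative_cost(24.22, 1.0 / 9.0)

# How many times more score per unit of cost Flash delivers than Pro.
flash_advantage = flash_value / pro_value
```

With these inputs the ratio comes out just under 9, because the two scores are nearly identical and the cost ratio does almost all the work.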

Why Performance Collapses at Scale

EvoClaw identifies three root causes for failure:

Error propagation. Logic errors account for approximately 57% of failure root causes. Unlike syntax errors (which fail loudly and immediately), logic errors — producing technically valid code that implements the wrong behavior — propagate silently. By the time the agent reaches milestone 5, the corrupted state from milestone 2’s logic error has contaminated two intermediate milestones. The paper measures “inherited propagation” (errors that spread from one milestone to another) and finds logic error chains have the highest propagation rate at 12% and the highest proportion of “missing test execution” at 17% — the agent doesn’t even realize it’s broken.

Recall grows, precision saturates. This is the key dynamic. As agents work through milestones, their recall rate (implementing new required changes) grows approximately linearly. Their precision rate (not breaking existing tests) saturates rapidly and then declines. Agents get better at implementing new things while simultaneously getting worse at not breaking old things. In a codebase that evolves across dozens of milestones, this is a compounding disaster.
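The compounding effect can be illustrated with a deliberately simplified model, which is not the paper's analysis. Assume each milestone independently leaves all previously passing behavior intact with some fixed probability; the chance of reaching the end of a chain with no regression then decays geometrically. The probabilities below are hypothetical, and the chain length of 14 is just the average implied by 98 milestones across seven repositories.

```python
def clean_chain_probability(p_no_regression: float, milestones: int) -> float:
    """Probability that no milestone in the chain introduces a regression.

    Toy model: assumes each milestone is independent with the same
    per-milestone probability, which real development does not guarantee.
    """
    if not 0.0 <= p_no_regression <= 1.0:
        raise ValueError("p_no_regression must be in [0, 1]")
    if milestones < 0:
        raise ValueError("milestones must be non-negative")
    return p_no_regression ** milestones


# Hypothetical per-milestone reliabilities over an average-length chain.
careful = clean_chain_probability(0.99, 14)
decent = clean_chain_probability(0.95, 14)
```

Even the hypothetical agent that avoids regressions 95% of the time per milestone finishes a 14-milestone chain cleanly less than half the time under this model.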

Insufficient codebase exploration. The paper finds a strong positive correlation between codebase exploration effort and EvoClaw performance. Higher-scoring agents spend more tokens reading existing code before making changes. Lower-scoring agents dive into implementation immediately, change files based on the milestone description, and miss the implicit constraints embedded in surrounding code. The benchmark effectively measures whether agents have built a working mental model of the evolving system — not just whether they can implement a spec in isolation.

What This Means for Agentic Workflows

The EvoClaw results make explicit what practitioners have observed anecdotally: the number on your AI vendor’s benchmark page does not predict how useful that agent is across a real development sprint.

The 54-point performance cliff has a structural explanation. Isolated benchmarks like SWE-bench Verified test agents on tasks where the evaluation environment is pristine at the start of each run. The agent has no history to contend with — no prior agent decisions, no accumulated technical debt, no evolving test suite constraints. The gap between that pristine baseline and a real codebase that has been under development for months is what EvoClaw measures.

Several specific implications follow:

Model selection changes. Claude Opus 4.6 leading by 7+ percentage points over GPT-5.3-Codex (36.29% vs 28.88%) in continuous settings is a larger advantage than the same models show on isolated benchmarks. If you’re choosing a model for a multi-week autonomous development task, leaderboard rankings underweight this gap. (The paper predates Opus 4.7 and 4.8; those models would be expected to score higher, but the relative ordering is likely to hold.)

Precision is the binding constraint. The recall/precision divergence means that the critical workflow design question isn’t “can the agent implement this feature?” It’s “can the agent implement this feature without breaking the 6,000 things that currently work?” This is where pre-commit hooks, CI enforcement, and test-first specifications matter most — not as bureaucratic overhead but as the precision floor that prevents error propagation.
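A precision floor of this kind can be sketched as a simple regression gate. The snippet below is a minimal illustration, not a real pre-commit or CI integration: it assumes test results are already available as name-to-passed mappings (how they are collected from a test runner is out of scope), and all names are hypothetical.

```python
from typing import List, Mapping


def find_regressions(before: Mapping[str, bool], after: Mapping[str, bool]) -> List[str]:
    """Return tests that passed before the change but fail or vanish after it."""
    return sorted(
        name
        for name, passed in before.items()
        if passed and not after.get(name, False)  # a missing test counts as a regression
    )


def gate_allows_commit(before: Mapping[str, bool], after: Mapping[str, bool]) -> bool:
    """Allow the change only if nothing that used to pass is now broken."""
    return not find_regressions(before, after)


# Hypothetical run: the agent fixed test_export but broke test_login.
baseline = {"test_login": True, "test_search": True, "test_export": False}
candidate = {"test_login": False, "test_search": True, "test_export": True}

regressed = find_regressions(baseline, candidate)
allowed = gate_allows_commit(baseline, candidate)
```

Here the gate rejects the change even though the target test now passes, which is the point: progress on the new requirement does not buy permission to break existing behavior.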

Exploration is infrastructure. Agents that explore before acting outperform agents that act immediately. This is an argument for CLAUDE.md files that describe codebase architecture, for spec documents that embed the invariants agents need to respect, and for workflows that include an explicit “read and understand” phase before the “implement” phase. The EvoClaw finding that exploration correlates with success isn’t just about the current task; it’s about whether the agent has built the context needed to avoid breaking the 23 things it isn’t currently working on.

The Spec-Driven Development thesis. The precision saturation problem is fundamentally a specification problem. Agents fail not because they can’t code but because they don’t know what they’re not supposed to break. Explicit specifications — both for what to build and for what invariants must be preserved — directly address the information deficit that causes precision to collapse. A CLAUDE.md that lists architectural invariants, a test-first spec that encodes existing behavior as constraints, and a review-in-loop workflow that catches regressions before they propagate: these are the practices that turn a 38%-continuous-development agent into something that can run a real sprint autonomously.

The paper closes with a direct conclusion: progress on isolated benchmarks is not enough. Bridging the continuous development gap requires agents capable of long-horizon planning, accumulated context management, and proactive error prevention. That’s a workflow and tooling problem as much as a model capability problem.

