Cursor Composer 2.5 Matches Claude Opus 4.7 on Benchmarks. Here's Why the Fight Isn't Over.

Author
Florent Clairambault
CTO & Software engineer

Cursor dropped two things this week that deserve serious attention: Composer 2.5, their new foundation model, and a batch of workflow features that make the IDE meaningfully more autonomous. For anyone betting that Claude Code’s lead was safe because Anthropic had the best model, this week is a reality check.

Let me be precise about what changed — and what didn’t.

What Composer 2.5 Actually Achieves

Composer 2.5 shipped on May 18, built on a checkpoint of Moonshot AI’s Kimi K2.5 with significant additional training. Cursor reports spending 85% of the compute budget on RL fine-tuning and synthetic task generation, with 25x more synthetic tasks than it used for Composer 2.

The benchmark results are the headline:

| Benchmark | Composer 2.5 | Composer 2 | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|---|---|
| SWE-bench Multilingual | 79.8% | 73.7% | 79.8% | ~77% |
| CursorBench v3.1 | 63.2% | 61.3% | | |
| Terminal-Bench 2.0 | 69.3% | 61.7% | 69.4% | 82.7% |

Composer 2.5 matches Opus 4.7 on SWE-bench Multilingual and trails it by 0.1 points on Terminal-Bench 2.0, the two benchmarks where direct comparison is possible. That’s parity at a fraction of the price.

On pricing: Composer 2.5 standard tier runs at $0.50/M input and $2.50/M output. Claude Opus 4.7 is $5/$25 per million tokens. The fast tier for Composer 2.5 is $3/$15 — still cheaper than Opus 4.7 standard. For teams running thousands of agentic tasks per month, this is not a minor footnote.
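To make the pricing gap concrete, here is a minimal sketch that prices one workload across the three tiers quoted above. The per-million prices come from this article; the monthly token volumes are hypothetical numbers I picked for illustration, so substitute your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices as quoted in the article.
TIERS = [
    Tier("Composer 2.5 standard", 0.50, 2.50),
    Tier("Composer 2.5 fast", 3.00, 15.00),
    Tier("Claude Opus 4.7", 5.00, 25.00),
]


def workload_cost(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the given tier's list prices."""
    return (
        input_tokens * tier.input_per_m + output_tokens * tier.output_per_m
    ) / 1_000_000


# Hypothetical monthly volume (an assumption, not a figure from the article).
MONTHLY_INPUT_TOKENS = 2_000_000_000
MONTHLY_OUTPUT_TOKENS = 200_000_000

if __name__ == "__main__":
    for tier in TIERS:
        cost = workload_cost(tier, MONTHLY_INPUT_TOKENS, MONTHLY_OUTPUT_TOKENS)
        print(f"{tier.name}: ${cost:,.2f}")
```

This compares list prices for identical token counts only. It ignores caching discounts, subscription bundles, and any difference in how many tokens each model needs to finish the same task.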

The Self-Summarization Technique Compounds

Cursor didn’t just fine-tune a checkpoint. Composer 2.5 inherits the compaction-in-the-loop RL approach introduced in Composer 2: the model was trained with context compression as part of the RL reward signal, so it learns to identify what matters in a long context and compress it without losing task-relevant state.
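For readers who haven’t seen context compaction before, the sketch below shows the general shape of the inference-time step: when the history outgrows a budget, older messages are folded into a summary and the most recent ones are kept verbatim. This is a toy illustration, not Cursor’s implementation. The function names, the word-count “tokenizer”, and the truncating “summarizer” are all placeholders I made up; in the approach described above, the model itself learns what to keep because compaction happens inside the RL loop.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(messages: list[str]) -> str:
    """Placeholder summarizer: keeps the first four words of each message.

    A real system would ask the model to write the summary.
    """
    return " | ".join(" ".join(m.split()[:4]) for m in messages)


def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Fold older messages into one summary when the history exceeds `budget`."""
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # nothing to do

    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return ["[summary] " + summarize(older)] + recent


if __name__ == "__main__":
    # Five hypothetical agent steps of ten "tokens" each.
    history = [f"step {i}: " + "word " * 8 for i in range(5)]
    for line in compact(history, budget=20):
        print(line)
```

The interesting part in a trained system is the summarizer, which this sketch deliberately leaves trivial: dropping the wrong detail from `older` is exactly the failure that training with compaction in the loop is meant to penalize.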

The practical effect is roughly 5x token efficiency in long-running agentic sessions. That compounds with the 10x lower per-token price: a Cursor session that would cost $50 in Opus 4.7 API calls drops to about $1 if both factors hold in full, and to a few dollars if the efficiency gain is smaller in practice. That’s the math that makes enterprise finance teams pay attention.
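The two factors multiply, which is easy to check. The sketch below takes the $50 baseline and the 5x and 10x figures from this article; the 2x “conservative” efficiency is my own assumption, included only to show how sensitive the result is.

```python
def session_cost(baseline_usd: float, token_efficiency: float, price_ratio: float) -> float:
    """Cost of the same session when it uses fewer tokens at a lower price.

    token_efficiency: how many times fewer tokens the session consumes.
    price_ratio: how many times cheaper each token is.
    """
    return baseline_usd / (token_efficiency * price_ratio)


BASELINE_USD = 50.0  # the article's example Opus 4.7 session

# Figures from the article: 5x fewer tokens, 10x cheaper tokens.
full_effect = session_cost(BASELINE_USD, token_efficiency=5, price_ratio=10)

# Assumption for illustration: compaction only delivers 2x in practice.
conservative = session_cost(BASELINE_USD, token_efficiency=2, price_ratio=10)

if __name__ == "__main__":
    print(f"full effect:  ${full_effect:.2f}")
    print(f"conservative: ${conservative:.2f}")
```

Under the article’s figures the session lands at $1.00, and at $2.50 under the weaker efficiency assumption.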

Parallel Agents and the PR Review Tab

Alongside Composer 2.5, Cursor shipped two workflow features on May 20 that matter independently:

Parallel agents: Cursor can now split a plan into independent sub-tasks and execute them simultaneously as async subagents in isolated worktrees. This directly mirrors what Claude Code’s Agent Teams architecture has offered since February. Each parallel agent runs concurrently; results are merged and reviewed. Cursor frames this as “split → run → vote” — pick the best result from multiple simultaneous attempts.
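The “split → run → vote” pattern can be sketched in a few lines of asyncio. Everything here is hypothetical: `run_attempt` stands in for a subagent working in its own worktree, and majority voting is just one possible selection rule. Cursor’s changelog, as summarized above, says the best result is picked but not how. The sketch also covers only the multiple-attempts half of the feature, not the splitting of a plan into distinct sub-tasks.

```python
import asyncio
from collections import Counter


async def run_attempt(agent_id: int, task: str) -> str:
    """Stand-in for one subagent working on `task` in an isolated worktree."""
    await asyncio.sleep(0.01 * agent_id)  # simulate uneven run times
    # Toy behaviour: agent 2 produces a different patch from the others.
    return "patch-B" if agent_id == 2 else "patch-A"


def vote(results: list[str]) -> str:
    """Pick the most common result; ties go to the result seen first."""
    return Counter(results).most_common(1)[0][0]


async def split_run_vote(task: str, attempts: int = 3) -> str:
    """Run several attempts concurrently, then select one result by vote."""
    results = await asyncio.gather(
        *(run_attempt(i, task) for i in range(attempts))
    )
    return vote(list(results))


if __name__ == "__main__":
    print(asyncio.run(split_run_vote("rename config loader")))
```

A real selector would more likely rank candidates by test results or a reviewer model than by exact-match voting, since two correct patches rarely come out byte-identical.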

Full PR review experience: The PR review tab now has inline comment threads, a commits tab, a file-tree navigator, and support for resolving individual review threads without leaving the IDE. This is a direct response to Claude Code Review GA (launched May 6 at Code with Claude SF), which charges $15-25 per PR for multi-agent analysis.

Where the Autonomy Gap Persists

This is the part that gets glossed over in coverage that focuses on benchmark charts.

Composer 2.5 runs inside Cursor. Everything it does passes through the running Cursor application. That architectural choice has consequences that no amount of fine-tuning can fix:

The IDE must be running. Cursor’s parallel agents don’t execute on Anthropic’s infrastructure or in a cloud sandbox that persists when your laptop closes. They run in processes managed by the Cursor application. Long multi-hour autonomous tasks are bounded by your session. Claude Code Routines and Managed Agents run on Anthropic’s Cloud Container Runtime and continue whether or not your machine is on.

Permission is always human-in-the-loop (HITL). Cursor’s “autonomy” is UI-mediated: every significant file write or command gets routed back through the interface for approval or at minimum observation. Claude Code’s Auto Mode and --dangerously-skip-permissions with an allowlist let agents drive without checkpoints. This is a design philosophy difference, not a technical limitation, but it means the definition of “autonomous” differs materially between the two tools.

Benchmark-first training vs. agentic deployment. Composer 2.5 was optimized against CursorBench v3.1 and SWE-bench Multilingual — standard software task benchmarks. Claude Opus 4.7’s training data includes multi-agent orchestration, Routines, tool-error handling, and long-horizon task management. SWE-bench measures whether the model can solve a discrete coding problem; it doesn’t measure whether the model can coordinate a 10-agent team across a 3-hour migration.

The Right Way to Read Benchmark Parity

Benchmark parity does not mean workflow parity. This is a point worth dwelling on.

When Composer 2 launched in March with 73.7% SWE-bench Multilingual (vs. Opus 4.7’s 79.8% at the time), the gap was visible enough that teams could rationalize staying on Claude Code for the model quality. Composer 2.5 closes that gap. The remaining differentiation now lives entirely in:

  1. Workflow architecture — where agents run, for how long, and with what level of human involvement
  2. Ecosystem depth — MCP server compatibility, Routines, Managed Agents, Analytics API, skills libraries
  3. Enterprise features — Mantle ZOA backend on Bedrock, RBAC, OpenTelemetry, audit trails

That’s actually a defensible position for Anthropic. But it requires being honest that “best model for coding tasks” is no longer a clean Claude Code differentiator. The competition is fighting harder than it was three months ago.

What This Means for Teams Choosing Now

If your team is doing mostly contained, IDE-based coding tasks — PR reviews, feature implementations, bug fixes — and you want to minimize token costs, Cursor Composer 2.5 is a serious option. The benchmark results are real. The price advantage is real.

If your team is running autonomous workflows — nightly agents that ship PRs, multi-agent systems coordinating across codebases, long-horizon tasks that run while the team sleeps — Claude Code’s architecture remains qualitatively different. Cheap tokens don’t help if the agent can’t run while your machine is off.

The AI coding market in May 2026 looks like this: one tool has the best autonomous infrastructure; another has now matched its model quality at a fraction of the cost. The right answer depends on what “autonomous” means to your team — and whether you’re willing to be honest about the difference.


Sources: Cursor Composer 2.5 launch — TechTimes · The New Stack analysis · Cursor changelog · Claude Opus 4.7 benchmarks — Anthropic
