
Cursor Composer 2: The Model That Learns to Forget — and Sparked a Controversy

·1199 words·6 mins·

Cursor shipped Composer 2 on March 17, 2026, and buried the lede in two ways. First, they glossed over the genuinely interesting technical contribution — a new training technique that teaches a model to compress its own memory mid-task. Second, they forgot to mention the model is built on Kimi K2.5, a Chinese open-weight model from Moonshot AI. The internet noticed. Elon Musk weighed in. A co-founder apologized.

Let’s start with the part that actually matters.

The Problem With Long-Horizon Tasks

If you’ve used an AI agent to do anything non-trivial — refactor a large codebase, debug a multi-file regression, run a migration — you’ve hit the context wall. Agent trajectories get long. Hundreds of turns, tool call results, file contents, intermediate reasoning: it all stacks up, and eventually you blow past whatever the model’s context window can hold.

The standard workarounds are bad. Prompted summarization tells the model to compress its context, but it’s a bolt-on step with no connection to the task reward. The model has no incentive to preserve what actually matters. Sliding windows just drop old tokens, which is worse. Both approaches cause information loss that compounds over a long trajectory. One forgotten function signature or dropped constraint and the entire task can silently go sideways.
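The sliding-window failure mode is easy to demonstrate. Here is a toy sketch (the session contents and window size are invented for illustration) of how truncation silently drops a constraint stated once, early in a long run:

```python
# Toy illustration: a sliding window keeps only the most recent turns,
# so anything stated once, early on, silently falls out of view.
history = [
    "CONSTRAINT: all public APIs must stay backward compatible",  # stated once
] + [f"turn {i}: tool output ..." for i in range(500)]

WINDOW = 200                      # keep only the most recent 200 turns
visible = history[-WINDOW:]       # what the model actually sees now

constraint_still_visible = any("CONSTRAINT" in turn for turn in visible)
# constraint_still_visible is now False: the model will happily break the API
```

Nothing signals the loss; the model just proceeds without the constraint, which is exactly the silent compounding failure described above.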

This is the unsolved problem Cursor’s research team went after.

Compaction-in-the-Loop RL

Cursor’s solution is conceptually elegant: make the model’s summarization behavior part of the reinforcement learning signal itself.

Here is how it works. During RL training, the model runs on a task. When it hits a fixed token-length trigger — 40k or 80k tokens — it pauses, generates a compressed summary of its own context (targeting around 1,000 tokens), and then continues from that summary. This cycle repeats as many times as necessary. At the end, the RL reward covers the complete chain: task completion and every compaction step along the way. If the model summarizes sloppily and loses something critical, it fails the task. Negative reward. The model learns what to keep.
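Cursor has not published its training code, so the following is only a toy sketch of the loop as described, with every class, helper, and number invented for illustration (real thresholds are 40k–80k tokens and a ~1k-token summary budget; here they are shrunk so the toy runs):

```python
# Toy sketch of compaction-in-the-loop RL as described above. ToyModel,
# ToyTask, and the token counting are illustrative stand-ins, not Cursor's API.

COMPACTION_TRIGGER = 40   # token threshold (article: 40k or 80k; tiny here)
SUMMARY_BUDGET = 5        # target summary size (article: ~1,000 tokens)

def count_tokens(context):
    return sum(len(chunk.split()) for chunk in context)

class ToyTask:
    def __init__(self, steps): self.remaining = steps
    def done(self): return self.remaining == 0

class ToyModel:
    def act(self, context, task):
        task.remaining -= 1
        return "tool-call result " + "x " * 10        # ~12 tokens per turn
    def summarize(self, context, budget):
        # stand-in policy: keep the first `budget` tokens of the context
        return " ".join(" ".join(context).split()[:budget])

def run_episode(model, task, prompt="fix the bug"):
    context, compactions = [prompt], 0
    while not task.done():
        context.append(model.act(context, task))
        if count_tokens(context) > COMPACTION_TRIGGER:
            # the model compresses its own context mid-task; in training,
            # this step sits inside the trajectory the RL reward covers
            context = [prompt, model.summarize(context, SUMMARY_BUDGET)]
            compactions += 1
    # one scalar reward over the whole chain: task outcome AND every
    # compaction -- a summary that drops something critical fails the task
    reward = 1.0 if task.done() else -1.0
    return reward, compactions

reward, n_compactions = run_episode(ToyModel(), ToyTask(steps=12))
```

The point is the last two lines of `run_episode`: the reward is computed over the whole transcript, compactions included, so lossy summaries are penalized through task failure rather than by a separate, disconnected summarization objective.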

The results Cursor published are worth taking seriously. Compared to prompted summarization: 50% fewer compaction errors, 5x more token-efficient (1k tokens versus 5k), and in a live demo, the model solved a “make-doom-for-mips” Terminal-Bench challenge in 170 turns while compressing over 100k tokens of context without losing the thread.

The deeper implication is the one that should interest you if you care about agent capability ceilings. By training with compaction in the loop, you can train on trajectories that are substantially longer than your maximum context window. The model can learn tasks that require hundreds of sequential actions — the kind of tasks that real software projects actually involve. That’s a qualitatively different capability class than what you get from context-window-limited training runs.

This is, as far as I can tell, the first time a commercial coding tool has embedded long-horizon task compression directly into the RL loop rather than treating it as a post-processing afterthought. That matters.

The Benchmarks

Composer 2 is not a marginal improvement on its predecessor.

On CursorBench, Composer 2 scores 61.3, up from Composer 1.5’s 44.2 and ahead of Claude Opus 4.6’s 58.2. On Terminal-Bench 2.0, Composer 2 scores 61.7 against Composer 1.5’s 47.9, with Claude Opus 4.6 at 58.0. On SWE-bench Multilingual, it goes from 65.9 to 73.7.

Beating Claude Opus 4.6 on Cursor’s own benchmarks while also being 86% cheaper than Composer 1.5 is a real result. Yes, CursorBench is Cursor’s benchmark, so apply appropriate skepticism about benchmark overfitting. But Terminal-Bench and SWE-bench Multilingual are external, and the numbers hold there too.

Pricing: Composer 2 Standard is $0.50/M input and $2.50/M output. Composer 2 Fast is $1.50/M input and $7.50/M output. Both come with a 200k context window. Composer 1.5 was considerably more expensive at scale. This is not a “better and costs more” story — the price drop is substantial.
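For a sense of scale, here is a back-of-envelope cost calculation using the per-token prices quoted above (the session token counts are invented for illustration):

```python
# Back-of-envelope cost check using the quoted prices, in $ per million tokens.
# The example session sizes below are invented, not from the article.
PRICES = {
    "composer-2-standard": {"input": 0.50, "output": 2.50},
    "composer-2-fast":     {"input": 1.50, "output": 7.50},
}

def session_cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# e.g. a long agent run: 2M input tokens, 200k output tokens
std_cost = session_cost("composer-2-standard", 2_000_000, 200_000)
fast_cost = session_cost("composer-2-fast", 2_000_000, 200_000)
# std_cost  = $1.00 input + $0.50 output = $1.50
# fast_cost = $3.00 input + $1.50 output = $4.50
```

At these prices, even heavy agent sessions land in single-digit dollars, which is what makes the long-horizon use case economically plausible at all.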

The Part They Forgot to Mention

Now for the controversy.

Composer 2 is built on Kimi K2.5, a 1-trillion-parameter mixture-of-experts model with 32 billion active parameters, released open-weight by Moonshot AI. Cursor’s March 17 research blog described the training methodology in detail. It did not mention Kimi K2.5.

A developer spotted the model identifier kimi-k2p5-rl-0317-s515-fast leaking in API responses. The thread went to 2.6 million views. Elon Musk showed up and wrote, in characteristically helpful fashion: “Yeah, it’s Kimi 2.5.” Cursor co-founder Aman Sanger acknowledged the oversight: “It was a miss to not mention the Kimi base in our blog from the start.” Cursor’s VP of Developer Education Lee Robinson subsequently argued that roughly 75% of Composer 2’s performance characteristics come from Cursor’s additional training rather than the base model.

That framing may be accurate. Compaction-in-the-loop RL is a serious contribution, not a thin wrapper. But the sequence of events — omit the base model, get caught, apologize — is exactly the pattern that erodes trust in a space already prone to benchmark theater and capability overclaiming.

There is also a licensing wrinkle worth noting. Kimi K2.5 uses a modified MIT license that requires companies generating over $20 million per month in revenue to display “Kimi K2.5” in their product UI. Cursor reportedly runs at approximately $160 million per month in revenue and had not done this at the time of the controversy. Whether this gets resolved quietly or becomes a bigger issue remains to be seen.

What to Make of All This

The technical contribution is real. Compaction-in-the-loop RL is a sensible and apparently effective answer to a genuine agent capability problem, and the benchmark results suggest it works in practice, not just in theory. If you are running long autonomous agent tasks — the kind where context overflow is a constant source of frustration — Composer 2 is worth testing seriously.

The transparency failure is also real. In an era where every AI product is one API probe away from revealing what it actually runs on, the base model is not a minor detail. Developers deploying AI tools in production have a reasonable interest in knowing what they’re actually running — for supply chain reasons, for compliance, for geopolitical risk assessment if that’s part of your threat model. “We trained on top of it” is not a sufficient answer to “what model does this run on?”

The question this incident sharpens is one the industry hasn’t cleanly answered: when a company fine-tunes an open-weight model extensively, what provenance disclosure do users have a right to expect? Cursor’s position — that 75% comes from their training — implies they believe substantial fine-tuning changes the disclosure calculus. That’s a debatable position, and the debate is now happening in public whether Cursor wanted it to or not.

Composer 2 is good. The compaction technique is worth watching. And the controversy is a useful reminder that “built on top of” and “built from scratch” mean very different things, regardless of how much training you stack on top.


Sources

  • Cursor research blog, March 17, 2026: cursor.com/blog/self-summarization
  • Moonshot AI, Kimi K2.5 model release and license
  • CursorBench, Terminal-Bench 2.0, SWE-bench Multilingual benchmark results as reported by Cursor
  • Aman Sanger (Cursor co-founder) public statement on base model disclosure
  • Lee Robinson (Cursor VP of Developer Education) public statement on training contribution breakdown
