
MiniMax M2.7: The Open-Source Agent That Rewrote Its Own Training Loop

·1178 words·6 mins·

On April 12, 2026, MiniMax quietly open-sourced M2.7 on Hugging Face. The model had been announced internally on March 18. There was no splashy demo, no product keynote, no benchmark war on X. Just weights, a technical report, and some numbers that are genuinely hard to dismiss.

M2.7 scores 56.22% on SWE-bench Pro and 57.0% on Terminal Bench 2 — matching GPT-5.3-Codex on SWE-Pro and landing with an Elo of 1,495 on the coding arena leaderboard. For an open-source model, that’s extraordinary. For a model that helped build itself, it’s something else entirely.

What “Self-Evolving” Actually Means

MiniMax’s marketing copy says M2.7 is the first model to “actively participate in its own development cycle.” That phrase is doing a lot of work, so it’s worth being precise about what actually happened.

During the M2.7 training runs, MiniMax gave the model access to its own reinforcement learning harness. The model could update its internal memory, propose new skills for the harness, and adjust its own scaffold — the configuration that governs how it structures its reasoning and tool use during RL experiments. Then it ran experiments. Observed results. Updated the scaffold. Ran more experiments.

This cycle repeated for over 100 autonomous rounds.
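MiniMax has not released the harness itself, so the following is only a toy sketch of what a closed loop like this looks like in principle. The scoring function, the scaffold parameters, and the random mutation rule are all invented for illustration; in the real system the model itself proposes the scaffold changes and the evaluation is an actual RL benchmark run.

```python
import random

random.seed(0)  # deterministic toy run

def run_benchmark(scaffold):
    """Stand-in for one RL evaluation run (e.g. a SWE-Pro subset).
    The scoring curve is invented: deeper reasoning and more tool
    retries help, with a hard ceiling and a little noise."""
    base = 0.40 + 0.02 * scaffold["reasoning_depth"] + 0.01 * scaffold["tool_retries"]
    return min(base, 0.60) + random.uniform(-0.01, 0.01)

def propose_update(scaffold):
    """Stand-in for the model proposing a change to its own scaffold."""
    candidate = dict(scaffold)
    key = random.choice(list(candidate))
    candidate[key] = max(1, candidate[key] + random.choice([-1, 1]))
    return candidate

scaffold = {"reasoning_depth": 2, "tool_retries": 1}
initial_score = best_score = run_benchmark(scaffold)

for _ in range(100):                     # ~100 autonomous rounds
    candidate = propose_update(scaffold)
    score = run_benchmark(candidate)
    if score > best_score:               # keep only measurable improvements
        scaffold, best_score = candidate, score
```

The structural point survives the simplification: changes to the scaffold are accepted only when an experiment shows a measurable improvement, so the score can only move up across rounds.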

The outcome: a 30% performance improvement relative to the M2.5 baseline on the tasks the model was optimizing for. That is not a marketing number — it reflects a measurable delta in SWE-Pro performance between the model trained with human-designed scaffolding and the model that iterated on its own.

What MiniMax is describing is a form of closed-loop self-improvement: the model contributes to the decisions that shape its own training process. It is not science fiction. It is not general self-improvement across arbitrary tasks. It is a tightly controlled RL experiment where the model has limited but real write access to its own training scaffolding.

It is also the most concrete published implementation of this technique at this scale. And the results are measurable.

The Benchmark Story

Let’s put M2.7’s numbers in context alongside the current frontier:

| Model           | SWE-bench Pro | Terminal Bench 2.0 | Open? |
|-----------------|---------------|--------------------|-------|
| Claude Opus 4.7 | 64.3%         | —                  | No    |
| GPT-5.5         | 58.6%         | 82.7%              | No    |
| GPT-5.3-Codex   | ~56%          | ~75%               | No    |
| MiniMax M2.7    | 56.22%        | 57.0%              | Yes   |
| MiniMax M2.5    | —*            | —                  | Yes   |

*M2.5 was benchmarked on SWE-bench Verified (80.2%) rather than SWE-bench Pro.

M2.7 does not top any single benchmark. Claude Opus 4.7 leads SWE-Pro by an eight-point margin; GPT-5.5 leads Terminal-Bench 2.0 by a significant gap. But M2.7 is competing in that tier — and it does so with publicly available weights.

That framing matters because it determines how you use the model. You cannot self-host Claude Opus 4.7. You cannot audit its behavior under adversarial inputs or run it in an air-gapped environment. You cannot modify its scaffolding, distill it, or fine-tune it on proprietary workflows. With M2.7 you can do all of those things: it is fully open under a modified MIT license that requires commercial deployers to display the model name in their product UI.

The Skill Adherence Number Nobody Is Talking About

The benchmark scores get the attention. The number worth sitting with is 97%.

M2.7 maintains 97% skill adherence across 40 complex skills, each exceeding 2,000 tokens. Skill adherence measures whether the model follows a defined skill — a structured procedure involving tool use, reasoning steps, and state management — without deviating from the expected execution path. At 2,000+ tokens per skill, this is not a simple instruction-following task. These are multi-step, multi-tool procedures.

97% adherence across 40 such skills is the kind of reliability number that enterprise deployments require. It means you can define a complex agent workflow, trust that M2.7 will execute it consistently, and build production systems on top of it without constant human oversight.
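The technical report does not spell out how adherence is scored. One plausible formulation, assuming a skill is an ordered list of expected steps and a trace is the sequence of actions the agent actually took, is a simple in-order subsequence check (the skill and step names below are invented for illustration):

```python
def follows_skill(expected_steps, trace):
    """True if the trace executes the skill's steps in order.
    Extra intermediate actions are allowed; skipped or reordered
    steps count as a deviation."""
    it = iter(trace)
    return all(step in it for step in expected_steps)

def adherence_rate(executions):
    """Fraction of (skill, trace) pairs that stayed on the expected path."""
    followed = sum(follows_skill(steps, trace) for steps, trace in executions)
    return followed / len(executions)

# Toy example: three executions of a hypothetical bug-triage skill.
triage = ["read_ticket", "reproduce", "patch", "run_tests"]
runs = [
    (triage, ["read_ticket", "search_code", "reproduce", "patch", "run_tests"]),
    (triage, ["read_ticket", "patch", "run_tests"]),  # skipped reproduce
    (triage, ["read_ticket", "reproduce", "patch", "run_tests"]),
]
rate = adherence_rate(runs)  # 2 of 3 executions adhered
```

At 2,000+ tokens per skill the real expected paths are far longer and include state checks, not just tool names, but the metric's shape is the same: count executions that never left the defined path.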

MiniMax backs this up with a production claim: M2.7 handles 30–50% of MiniMax’s internal reinforcement learning team workflows autonomously. That is an extremely specific number for a company to publish. It either reflects real deployment data or is going to be very embarrassing in six months.

Native Agent Teams

M2.7 ships with native Agent Teams support — a multi-agent architecture with stable role boundaries baked into the model’s training, not bolted on as a prompt engineering trick.

In practice, this means you can assign roles (architect, coder, reviewer, test engineer) to separate M2.7 instances and they will maintain their role identity and authority boundaries across a multi-agent session without drifting into each other’s domains or recursively second-guessing each other’s decisions.
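MiniMax has not published the Agent Teams API in detail, so the role names, tool names, and `AgentRole` structure below are assumptions. The sketch shows the orchestrator-side half of the story: a hard authority check that backstops the model's trained role identity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    brief: str
    allowed_tools: frozenset

# Hypothetical role definitions for a three-agent team.
ROLES = {
    "architect": AgentRole("architect", "Plan the change; never edit code.",
                           frozenset({"read_file", "search_code"})),
    "coder":     AgentRole("coder", "Implement the architect's plan.",
                           frozenset({"read_file", "edit_file", "run_tests"})),
    "reviewer":  AgentRole("reviewer", "Critique diffs; never edit code.",
                           frozenset({"read_file", "view_diff"})),
}

def authorize(role_name, tool):
    """Reject any tool call that falls outside the role's authority
    boundary, regardless of what the model instance asks for."""
    return tool in ROLES[role_name].allowed_tools
```

The claim about M2.7 is that the model rarely needs this backstop, because staying in role was trained in rather than enforced from outside.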

Claude Code has a similar architecture in its multi-agent orchestration model. The notable difference is that M2.7 was explicitly trained for this pattern, while Claude Code’s agent teams emerge from Claude’s general instruction-following capacity combined with the Claude Code shell’s task routing logic.

Whether training for agent teams versus prompting for them produces meaningfully different results in production is an open question. M2.7’s 30–50% autonomous workflow handling claim suggests MiniMax believes the training approach matters.

The Open-Source Implications

M2.7 represents the most capable openly available model for software engineering tasks as of April 2026. The gap to Claude Opus 4.7 on SWE-Pro (56.22% vs. 64.3%) is real but not insurmountable. The gap on Terminal Bench 2.0 is wider — 57.0% vs. GPT-5.5’s 82.7% — but Terminal Bench heavily rewards the kind of scaffolding optimizations that closed commercial systems have been applying for months, and iterating on its own scaffold is precisely what M2.7’s self-evolving loop is built to do.

For teams that need:

  • Air-gapped deployment: M2.7 is available on Ollama and HuggingFace with full weights
  • Regulatory compliance: Open weights mean auditable behavior
  • Cost control: $0.30 per million input tokens for M2.5-Lightning (M2.7 pricing is similar), versus $5 per million input tokens for Opus 4.7
  • Fine-tuning flexibility: Modify M2.7 on domain-specific codebases in ways you cannot do with closed models

…M2.7 is the most credible option in the market right now. That was not true six months ago, when open-source models were firmly one tier below the frontier.
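To make the cost-control bullet concrete, here is the arithmetic at an assumed monthly volume (the 500M-token figure is a hypothetical example; the per-million prices are the ones quoted above):

```python
def monthly_cost_usd(million_input_tokens, price_per_million):
    """Input-token cost only; output pricing is ignored in this sketch."""
    return million_input_tokens * price_per_million

volume = 500  # hypothetical: 500M input tokens per month
open_cost = monthly_cost_usd(volume, 0.30)    # M2.5-Lightning rate
closed_cost = monthly_cost_usd(volume, 5.00)  # Opus 4.7 rate
savings_ratio = closed_cost / open_cost       # ~16.7x
```

At that volume the open deployment pays $150/month in input tokens against $2,500 for the closed model, before accounting for self-hosting infrastructure, which is the real cost of the open route.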

Where M2.7 Fits Against Claude Code

Claude Code runs on Claude Opus 4.7, which leads M2.7 on SWE-Pro by eight points. That gap translates to real differences in complex, ambiguous engineering tasks — the kind where judgment calls matter and the model has to reason about architecture, not just write the patch.

What Claude Code offers that no open-source deployment can currently match: the full Claude Code tooling ecosystem (MCP-native architecture, 6,000+ MCP servers, Routines, Ultraplan, computer use), the safety and predictability guarantees that come from Anthropic’s constitutional AI training, and the enterprise features (RBAC, OpenTelemetry, Analytics API, Bedrock GA) that large organizations require.

For individual developers or small teams who don’t need enterprise features, want full control of their stack, and are cost-sensitive, M2.7 running locally through Ollama or deployed on owned infrastructure is now a serious alternative to subscribing to a frontier commercial model.

The open-source frontier closed faster than most people expected. M2.7 is evidence of that closing.

