# SWE-bench

- [Mistral Medium 3.5 Just Entered the Agentic Coding Race — Here's Where It Stands](https://sdd.sh/2026/05/mistral-medium-3.5-just-entered-the-agentic-coding-race-heres-where-it-stands.md): Mistral's 128B Medium 3.5 model and its Vibe remote agent platform went live this week. 77.6% SWE-bench Verified, async cloud execution, and a direct shot at the agentic coding market. The benchmarks are strong. The architecture tells a more complicated story.
- [DeepSeek V4 Ships: Frontier-Class Coding at 1/6th the Cost](https://sdd.sh/2026/04/deepseek-v4-ships-frontier-class-coding-at-1/6th-the-cost.md): DeepSeek V4-Pro hits 80.6% on SWE-bench Verified and 93.5% on LiveCodeBench — matching or exceeding most closed models — while costing one-sixth as much as Claude Opus 4.7 and shipping under the MIT license. Here's what actually matters, and what the benchmarks don't tell you.
- [MiniMax M2.7: The Open-Source Agent That Rewrote Its Own Training Loop](https://sdd.sh/2026/04/minimax-m2.7-the-open-source-agent-that-rewrote-its-own-training-loop.md): MiniMax M2.7 is the first open-source model to participate in its own development cycle — 100 autonomous rounds of scaffold optimization, a 30% performance gain, and 56.22% on SWE-bench Pro. It's not just a strong model. It's a glimpse of what model self-improvement looks like in practice.
- [GPT-5.5 'Spud' Is OpenAI's Strongest Coding Model Yet — With One Important Asterisk](https://sdd.sh/2026/04/gpt-5.5-spud-is-openais-strongest-coding-model-yet-with-one-important-asterisk.md): OpenAI's first fully retrained base model since GPT-4.5 delivers 82.7% on Terminal-Bench 2.0 and leads on most agentic evals. But on SWE-bench Pro — the benchmark that tests real-world GitHub issue resolution — Claude Opus 4.7 still leads by 5.7 points. Here's what that split actually means.
- [The Stanford AI Index 2026 Is Out. The Skeptics Are Out of Arguments.](https://sdd.sh/2026/04/the-stanford-ai-index-2026-is-out.-the-skeptics-are-out-of-arguments..md): Stanford HAI's 423-page 2026 AI Index dropped April 13. The numbers on agentic coding are not subtle: SWE-bench Verified jumped from 60% to near 100% of human baseline in a single year. Here's what the data actually means for working engineers.
- [Claude Opus 4.7: 87.6% SWE-bench, Implicit-Need Tests, Same Price](https://sdd.sh/2026/04/claude-opus-4.7-87.6-swe-bench-implicit-need-tests-same-price.md): Anthropic shipped Claude Opus 4.7 on April 16, 2026. SWE-bench Verified jumps nearly 7 points to 87.6%, SWE-bench Pro leaps from 53.4% to 64.3%, and the model is the first Claude to pass implicit-need tests. Pricing stays flat at $5/$25 per million tokens.
- [81% vs. 46%: The AI Coding Benchmark That's Been Lying to You](https://sdd.sh/2026/04/81-vs.-46-the-ai-coding-benchmark-thats-been-lying-to-you.md): SWE-bench Verified — the benchmark that put every frontier model above 80% — is contaminated. OpenAI stopped reporting it in February. Here's what actually happened, what SWE-bench Pro replaces it with, and why 46% is a more honest number than 81%.
- [GLM-5.1: The Open-Source Model That Just Beat Everyone on SWE-bench Pro](https://sdd.sh/2026/04/glm-5.1-the-open-source-model-that-just-beat-everyone-on-swe-bench-pro.md): Z.AI released GLM-5.1 today — a 754B open-weight model under MIT license that scored 58.4% on SWE-bench Pro, beating GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. Its headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenge.
- [The SWE-bench Plateau: Three Frontier Models Walk In, All Score 80% — Now What?](https://sdd.sh/2026/04/the-swe-bench-plateau-three-frontier-models-walk-in-all-score-80-now-what.md): Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.3-Codex are all within 0.8% of each other on SWE-bench Verified. When every frontier model aces the exam, the exam stops being useful. Here's what actually differentiates them.
