# Benchmarks

- [DeepSeek V4: Near-Frontier Performance, Open Weights, and the First Major Model Built for Huawei Chips](https://sdd.sh/2026/04/deepseek-v4-near-frontier-performance-open-weights-and-the-first-major-model-built-for-huawei-chips.md): DeepSeek V4 arrived April 24 with two variants: a 1.6T-parameter Pro and a 284B-parameter Flash, both MIT-licensed and priced far below Western closed models. The bigger story is what it runs on: Huawei Ascend chips, not Nvidia.
- [DeepSeek V4 Ships: Frontier-Class Coding at 1/6th the Cost](https://sdd.sh/2026/04/deepseek-v4-ships-frontier-class-coding-at-1/6th-the-cost.md): DeepSeek V4-Pro hits 80.6% on SWE-bench Verified and 93.5% on LiveCodeBench — matching or exceeding most closed models — while costing one-sixth as much as Claude Opus 4.7 and releasing under the MIT license. Here's what actually matters, and what the benchmarks don't tell you.
- [MiniMax M2.7: The Open-Source Agent That Rewrote Its Own Training Loop](https://sdd.sh/2026/04/minimax-m2.7-the-open-source-agent-that-rewrote-its-own-training-loop.md): MiniMax M2.7 is the first open-source model to participate in its own development cycle: 100 autonomous rounds of scaffold optimization, a 30% performance gain, and 56.22% on SWE-Pro. It's not just a strong model. It's a glimpse of what model self-improvement looks like in practice.
- [GPT-5.5 'Spud' Is OpenAI's Strongest Coding Model Yet — With One Important Asterisk](https://sdd.sh/2026/04/gpt-5.5-spud-is-openais-strongest-coding-model-yet-with-one-important-asterisk.md): OpenAI's first fully retrained base model since GPT-4.5 delivers 82.7% on Terminal-Bench 2.0 and leads on most agentic evals. But on SWE-bench Pro — the benchmark that tests real-world GitHub issue resolution — Claude Opus 4.7 still leads by 5.7 points. Here's what that split actually means.
- [Claude Opus 4.7: 87.6% SWE-bench, Implicit-Need Tests, Same Price](https://sdd.sh/2026/04/claude-opus-4.7-87.6-swe-bench-implicit-need-tests-same-price.md): Anthropic shipped Claude Opus 4.7 on April 16, 2026. SWE-bench Verified jumps nearly 7 points to 87.6%, SWE-bench Pro leaps from 53.4% to 64.3%, and the model is the first Claude to pass implicit-need tests. Pricing stays flat at $5/$25 per million tokens.
- [81% vs. 46%: The AI Coding Benchmark That's Been Lying to You](https://sdd.sh/2026/04/81-vs.-46-the-ai-coding-benchmark-thats-been-lying-to-you.md): SWE-bench Verified — the benchmark that put every frontier model above 80% — is contaminated. OpenAI stopped reporting it in February. Here's what actually happened, what SWE-bench Pro replaces it with, and why 46% is a more honest number than 81%.
- [GLM-5.1: The Open-Source Model That Just Beat Everyone on SWE-bench Pro](https://sdd.sh/2026/04/glm-5.1-the-open-source-model-that-just-beat-everyone-on-swe-bench-pro.md): Z.AI released GLM-5.1 today — a 754B open-weight model under MIT license that scored 58.4% on SWE-bench Pro, beating GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. Its headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenge.
- [Gemma 4: Google Just Made the Case for Running Your Coding Agent Locally](https://sdd.sh/2026/04/gemma-4-google-just-made-the-case-for-running-your-coding-agent-locally.md): Google's Gemma 4 dropped on April 2 with Apache 2.0 licensing, 80% on LiveCodeBench v6, a Codeforces ELO of 2,150, and agentic tool-use scores that make the previous generation look like a prototype. The 26B MoE model runs on a single consumer GPU with 256K context. Here's what it actually means.
- [The SWE-bench Plateau: Three Frontier Models Walk In, All Score 80% — Now What?](https://sdd.sh/2026/04/the-swe-bench-plateau-three-frontier-models-walk-in-all-score-80-now-what.md): Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.3-Codex are all within 0.8% of each other on SWE-bench Verified. When every frontier model aces the exam, the exam stops being useful. Here's what actually differentiates them.
- [Cursor Composer 2: The Model That Learns to Forget — and Sparked a Controversy](https://sdd.sh/2026/03/cursor-composer-2-the-model-that-learns-to-forget-and-sparked-a-controversy.md): Cursor's new coding model beats Claude Opus 4.6 on key benchmarks — but the real story is a training breakthrough called compaction-in-the-loop RL, and a transparency controversy that revealed Cursor quietly built it on a Chinese open-source model.
- [GPT-5.3-Codex: The First AI Model That Helped Build Itself — and Got a Scary Security Rating](https://sdd.sh/2026/03/gpt-5.3-codex-the-first-ai-model-that-helped-build-itself-and-got-a-scary-security-rating.md): OpenAI's GPT-5.3-Codex was instrumental in creating itself, introduces mid-turn steering for agentic workflows, and is the first OpenAI model rated 'High capability' for cybersecurity — meaning it can reliably exploit real vulnerabilities.
