Benchmarks

DeepSeek V4: Near-Frontier Performance, Open Weights, and the First Major Model Built for Huawei Chips

28 April 2026·1136 words·6 mins

DeepSeek V4 Ships: Frontier-Class Coding at 1/6th the Cost

26 April 2026·1274 words·6 mins

MiniMax M2.7: The Open-Source Agent That Rewrote Its Own Training Loop

25 April 2026·1178 words·6 mins

GPT-5.5 'Spud' Is OpenAI's Strongest Coding Model Yet — With One Important Asterisk

24 April 2026·979 words·5 mins

Claude Opus 4.7: 87.6% SWE-bench, Implicit-Need Tests, Same Price

17 April 2026·1218 words·6 mins

81% vs. 46%: The AI Coding Benchmark That's Been Lying to You

11 April 2026·1432 words·7 mins

GLM-5.1: The Open-Source Model That Just Beat Everyone on SWE-bench Pro

8 April 2026·1238 words·6 mins

Gemma 4: Google Just Made the Case for Running Your Coding Agent Locally

5 April 2026·1431 words·7 mins

The SWE-bench Plateau: Three Frontier Models Walk In, All Score 80% — Now What?

1 April 2026·1660 words·8 mins

GPT-5.3-Codex: The First AI Model That Helped Build Itself — and Got a Scary Security Rating

27 March 2026·1150 words·6 mins