A curated reference for engineers who need to track the AI model landscape without wading through hype. Focused on models relevant to coding, agentic workflows, and software development. Updated every Monday.
Benchmarks used here:
- SWE-bench Verified — resolving real GitHub issues from popular repos
- SWE-bench Pro — harder, multi-language variant designed to be contamination-resistant
- LiveCodeBench — live competitive programming problems, updated continuously
- HumanEval — function synthesis from docstrings (older benchmark, now mostly saturated)
## Anthropic — Claude
The primary recommendation for serious agentic coding. Claude Code is built on this model family.
| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|---|---|---|---|---|---|
| Claude 1 | Mar 2023 | 9K | — | — | First release, Constitutional AI |
| Claude 2 | Jul 2023 | 100K | — | — | 2× longer context, improved reasoning |
| Claude 2.1 | Nov 2023 | 200K | — | — | Reduced hallucinations, 200K context |
| Claude 3 Haiku | Mar 2024 | 200K | — | — | Fast, lightweight, low cost |
| Claude 3 Sonnet | Mar 2024 | 200K | — | — | Balanced speed/capability |
| Claude 3 Opus | Mar 2024 | 200K | ~38% | — | Most capable at launch, topped early benchmarks |
| Claude 3.5 Sonnet (v1) | Jun 2024 | 200K | ~49% | — | Surpassed Opus on coding at lower cost |
| Claude 3.5 Sonnet (v2) | Oct 2024 | 200K | ~57% | — | Computer use (beta), improved agentic behavior |
| Claude 3.5 Haiku | Nov 2024 | 200K | ~41% | — | Fast + capable small model |
| Claude 3.7 Sonnet | Feb 2025 | 200K | ~70% | — | Extended thinking, hybrid reasoning mode |
| Claude Haiku 4.5 | Late 2025 | 200K | — | — | 4th-gen architecture, speed-optimized |
| Claude Sonnet 4.5 | Late 2025 | 200K | — | — | Balanced 4th-gen model |
| Claude Sonnet 4.6 | Early 2026 | 1M¹ | ~75% | — | 1M token context GA (Mar 13, 2026) |
| Claude Opus 4.6 | Feb 5, 2026 | 1M¹ | 80.8% | 53.4% | Flagship at launch, 1M context GA |
| Claude Opus 4.7 | Apr 16, 2026 | 1M | 87.6% | 64.3% | Implicit-need tests, 3× vision resolution, multi-agent coordination |
¹ 1M token context became generally available on Sonnet 4.6 and Opus 4.6 on March 13, 2026, with standard pricing throughout.
On Claude Opus 4.7 — the current performance leader. Key improvements over 4.6: one-third the tool errors in agentic loops, a 14% improvement on complex multi-step workflows while using fewer tokens, and native multi-agent coordination for parallel workstreams. It is the first Claude to pass implicit-need tests, meaning it can infer which tools to reach for without being told explicitly. The `opus` API alias began pointing to it by default on April 23, 2026.
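Because the default alias moved shortly after launch, workflows that need reproducible behavior should pin an exact model version rather than ride the alias. A minimal sketch of the difference; the model ID strings below are hypothetical placeholders, so check your provider's model list for the real identifiers:

```python
# Sketch: pinning a model version vs. relying on a floating alias.
# The ID strings here are hypothetical, not confirmed API values.

def build_messages_request(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Build a Messages-API-style request body (constructed only, never sent)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

# Floating alias: resolves to whatever the provider currently designates
# as the flagship. Convenient, but behavior can shift under you on an
# alias flip like the one described above.
alias_req = build_messages_request("claude-opus-latest", "Refactor this function.")

# Pinned ID: same model every call, immune to alias changes.
pinned_req = build_messages_request("claude-opus-4-7", "Refactor this function.")

print(alias_req["model"], pinned_req["model"])
```

The trade-off is the usual one: aliases get you upgrades for free, pinned IDs get you stable evals and diffs.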
## OpenAI — GPT & o-series
| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|---|---|---|---|---|---|
| GPT-4 | Mar 2023 | 8K/32K | — | — | First multimodal GPT, reasoning jump |
| GPT-4 Turbo | Nov 2023 | 128K | — | — | 128K context, lower cost, JSON mode |
| GPT-4o | May 2024 | 128K | ~33% | — | Omni model, faster, native multimodal |
| GPT-4o mini | Jul 2024 | 128K | — | — | Small, cheap, high throughput |
| o1 | Sep 2024 | 128K | ~49% | — | Chain-of-thought reasoning, “thinking tokens” |
| o1 mini | Sep 2024 | 128K | — | — | Reasoning at lower cost |
| o3 | Jan 2025 | 200K | ~72% | — | Strong reasoning, ARC-AGI breakthrough |
| o4 mini | Apr 2025 | 200K | ~68% | — | Efficient reasoning model |
| GPT-5 | Mid-2025 | 256K | — | — | Multimodal flagship |
| GPT-5.3-Codex | Feb 5, 2026 | 256K | ~78% | — | First to participate in its own training pipeline; mid-turn steering |
| GPT-5.4 | Mar 5, 2026 | 256K | 80.6% | 57.7% | Superseded 5.3-Codex; integrated Codex plugin for Claude Code |
On GPT-5.3-Codex — notable for being “instrumental in creating itself”: the team used early checkpoints to debug training runs and manage deployment within its own production pipeline. It also introduced mid-turn steering (redirecting the model mid-task without context loss) and was the first OpenAI model rated “High capability” for cybersecurity (77.6% on a CTF benchmark). Released the same day as Claude Opus 4.6; the timing was not accidental.
## Google — Gemini & Gemma
| Model | Released | Context | SWE-bench Verified | Key addition |
|---|---|---|---|---|
| Gemini 1.0 (Ultra/Pro/Nano) | Dec 2023 | 32K | — | First Gemini family, multimodal |
| Gemini 1.5 Pro | Feb 2024 | 1M | — | 1M token context, long-doc reasoning |
| Gemini 1.5 Flash | May 2024 | 1M | — | Fast and efficient with long context |
| Gemini 2.0 Flash | Dec 2024 | 1M | — | Agentic capabilities, tool use, real-time |
| Gemini 2.5 Pro | Mar 2025 | 1M | ~63% | Thinking mode, strong coding benchmarks |
| Gemini 3.1 Pro | Early 2026 | 1M | — | SWE-bench Pro: 54.2% |
| Gemma 4 | Apr 2, 2026 | 256K | — | Open-weight (Apache 2.0), 80% LiveCodeBench v6, 2,150 Codeforces ELO, runs on single consumer GPU |
On Gemma 4 — a 26B MoE architecture that runs on a single consumer GPU with 256K context. It is the first open-weight model to make a serious case for local coding agents: 80% on LiveCodeBench v6, a Codeforces ELO of 2,150, and agentic tool-use scores that outclass the previous generation. It serves through any OpenAI-compatible inference server, so it works directly with aider, continue.dev, and similar tools.
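"OpenAI-compatible" concretely means the local server exposes the standard `/v1/chat/completions` endpoint. A minimal stdlib sketch of what a client sends, assuming a server on `localhost:8000` that registered the model under the name `gemma-4` (both are assumptions; the base URL and model name depend on how your inference server was launched):

```python
import json
import urllib.request

# Assumed local endpoint -- vLLM, llama.cpp's server, and similar tools
# all default to an OpenAI-style /v1 route, but the port varies.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "gemma-4",  # whatever name the server registered at startup
    "messages": [
        {"role": "user", "content": "Write a binary search in Python."},
    ],
    "temperature": 0.2,
}

# Build the HTTP request; construction does not touch the network.
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Actually sending is left commented out so the sketch runs offline:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
print(req.full_url)
```

Tools like aider and continue.dev speak exactly this schema, which is why pointing them at a local base URL is all the integration required.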
## Meta — Llama
| Model | Released | Params | Context | Key addition |
|---|---|---|---|---|
| Llama 2 | Jul 2023 | 7B–70B | 4K | First major open-source release for production use |
| Llama 3 | Apr 2024 | 8B–70B | 8K | Strong coding, instruction following |
| Llama 3.1 | Jul 2024 | 8B–405B | 128K | 405B matches frontier, 128K context |
| Llama 3.2 | Sep 2024 | 1B–90B | 128K | Multimodal, small on-device models |
| Llama 4 | Apr 2025 | MoE | 1M | Mixture-of-Experts, near-frontier performance |
## Open-Source & Independent Labs
| Model | Lab | Released | License | Key achievement |
|---|---|---|---|---|
| Mistral Large | Mistral | Feb 2024 | Commercial | Competitive with GPT-4 on reasoning |
| DeepSeek-Coder V2 | DeepSeek | May 2024 | MIT | Strongest open-source coding model at launch |
| DeepSeek V3 | DeepSeek | Dec 2024 | MIT | Near-frontier, fraction of training cost |
| DeepSeek R1 | DeepSeek | Jan 2025 | MIT | Open-source reasoning model, matched o1 |
| Kimi K2.5 | Moonshot | Early 2026 | Proprietary | Compaction-in-the-loop RL; powers Cursor Composer 2 |
| GLM-5.1 | Z.AI | Apr 8, 2026 | MIT | 754B open-weight, 58.4% SWE-bench Pro — beat GPT-5.4 and Opus 4.6 at time of release |
On GLM-5.1 — 754B open-weight model under MIT license. Scored 58.4% on SWE-bench Pro at release, beating GPT-5.4 (57.7%), Claude Opus 4.6 (53.4%), and Gemini 3.1 Pro (54.2%). The headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenger.
## How to read the benchmark numbers
SWE-bench Verified tests whether a model can resolve real GitHub issues. A score of 80% means the model correctly resolves four out of five tasks. Progress on this benchmark tends to translate directly into production value in agentic coding workflows.
SWE-bench Pro is harder and designed to resist data contamination — tasks are drawn from less-popular repos and non-Python languages. It’s a better signal for where models actually stand when they can’t pattern-match training data.
LiveCodeBench uses live competitive programming problems (updated continuously, so training data can’t help), making it a clean signal for reasoning quality rather than memorization.
Treat all numbers as approximate signals, not precise rankings. Model capability is context-dependent. A model that tops SWE-bench might still be wrong for your codebase if your stack is niche, your tasks require very long context, or you need local deployment.
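One quick way to see why small gaps are noise rather than rankings: estimate the sampling error baked into any score. A minimal sketch, assuming SWE-bench Verified's 500 tasks behave like independent pass/fail trials (a simplification, since real task difficulty is correlated):

```python
import math

def margin_95(score_pct: float, n_tasks: int = 500) -> float:
    """Approximate 95% confidence half-width for a pass rate,
    in percentage points, via the normal approximation to a binomial."""
    p = score_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n_tasks)  # standard error of a proportion
    return 1.96 * se * 100.0

# At ~80% on 500 tasks the interval is roughly +/-3.5 points, so a
# reported 80.8% vs 80.6% is statistically indistinguishable on this
# benchmark alone.
print(round(margin_95(80.8), 1))
```

In other words, differences under a few percentage points on a 500-task benchmark are within the noise floor; treat them as a tie and decide on fit instead.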