---
title: "AI Models Reference"
date: 2026-04-23
summary: "A curated timeline of AI model releases relevant to coding and software development. Benchmarks, context windows, and key capabilities — updated weekly."
---


A curated reference for engineers who need to track the AI model landscape without wading through hype. Focused on models relevant to coding, agentic workflows, and software development. Updated every Monday.

**Benchmarks used here:**
- **SWE-bench Verified** — resolving real GitHub issues from popular repos
- **SWE-bench Pro** — harder, multi-language variant designed to be contamination-resistant
- **LiveCodeBench** — live competitive programming problems, updated continuously
- **HumanEval** — function synthesis from docstrings (older benchmark, now mostly saturated)

---

## Anthropic — Claude

The primary recommendation for serious agentic coding. Claude Code is built on this model family.

| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|-------|----------|---------|-------------------|---------------|-------------|
| Claude 1 | Mar 2023 | 9K | — | — | First release, Constitutional AI |
| Claude 2 | Jul 2023 | 100K | — | — | Context expanded from 9K to 100K, improved reasoning |
| Claude 2.1 | Nov 2023 | 200K | — | — | Reduced hallucinations, 200K context |
| Claude 3 Haiku | Mar 2024 | 200K | — | — | Fast, lightweight, low cost |
| Claude 3 Sonnet | Mar 2024 | 200K | — | — | Balanced speed/capability |
| Claude 3 Opus | Mar 2024 | 200K | ~38% | — | Most capable at launch, topped early benchmarks |
| Claude 3.5 Sonnet (v1) | Jun 2024 | 200K | ~49% | — | Surpassed Opus on coding at lower cost |
| Claude 3.5 Sonnet (v2) | Oct 2024 | 200K | ~57% | — | Computer use (beta), improved agentic behavior |
| Claude 3.5 Haiku | Nov 2024 | 200K | ~41% | — | Fast + capable small model |
| Claude 3.7 Sonnet | Feb 2025 | 200K | ~70% | — | Extended thinking, hybrid reasoning mode |
| Claude Sonnet 4.5 | Sep 2025 | 200K | — | — | Balanced 4th-gen model |
| Claude Haiku 4.5 | Oct 2025 | 200K | — | — | 4th-gen architecture, speed-optimized |
| Claude Sonnet 4.6 | Early 2026 | 1M¹ | ~75% | — | 1M token context GA (Mar 13, 2026) |
| Claude Opus 4.6 | Feb 5, 2026 | 1M¹ | 80.8% | 53.4% | Flagship at launch, 1M context GA |
| **Claude Opus 4.7** | **Apr 16, 2026** | **1M** | **87.6%** | **64.3%** | **Implicit-need tests, 3× vision resolution, multi-agent coordination** |

¹ 1M token context became generally available on Sonnet 4.6 and Opus 4.6 on March 13, 2026, at standard pricing on both models.

**On Claude Opus 4.7** — the current performance leader. Key improvements over 4.6: one-third the tool errors in agentic loops, 14% improvement on complex multi-step workflows using *fewer* tokens, and native multi-agent coordination for parallel workstreams. The first Claude to pass implicit-need tests — meaning it can infer which tools to reach for without being explicitly told. Became the default `opus` API alias on April 23, 2026.
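
In practice, the alias change means requests that target `opus` now land on 4.7 with no code change. A minimal sketch against the Messages API using only the Python standard library — the `opus` alias behavior is as described above; pin a dated model ID instead if you need reproducible results:

```python
import json
import os
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"

def build_request(prompt: str, model: str = "opus", max_tokens: int = 1024) -> urllib.request.Request:
    """Assemble a Messages API request without sending it."""
    body = json.dumps({
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )

def send(req: urllib.request.Request) -> str:
    """Fire the request; needs network access and a valid API key."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"][0]["text"]

# Build (but don't send) a request to inspect the payload:
req = build_request("Summarize the failing tests in this log: ...")
print(json.loads(req.data)["model"])  # → opus
```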

---

## OpenAI — GPT & o-series

| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|-------|----------|---------|-------------------|---------------|-------------|
| GPT-4 | Mar 2023 | 8K/32K | — | — | First multimodal GPT, reasoning jump |
| GPT-4 Turbo | Nov 2023 | 128K | — | — | 128K context, lower cost, JSON mode |
| GPT-4o | May 2024 | 128K | ~33% | — | Omni model, faster, native multimodal |
| GPT-4o mini | Jul 2024 | 128K | — | — | Small, cheap, high throughput |
| o1 | Sep 2024 | 128K | ~49% | — | Chain-of-thought reasoning, "thinking tokens" |
| o1-mini | Sep 2024 | 128K | — | — | Reasoning at lower cost |
| o3 | Apr 2025 | 200K | ~72% | — | Strong reasoning, ARC-AGI breakthrough (announced Dec 2024) |
| o4-mini | Apr 2025 | 200K | ~68% | — | Efficient reasoning model |
| GPT-5 | Mid-2025 | 256K | — | — | Multimodal flagship |
| GPT-5.3-Codex | Feb 5, 2026 | 256K | ~78% | — | First to participate in its own training pipeline; mid-turn steering |
| GPT-5.4 | Mar 5, 2026 | 256K | 80.6% | 57.7% | Superseded 5.3-Codex; integrated Codex plugin for Claude Code |

**On GPT-5.3-Codex** — notable for being "instrumental in creating itself": the team used early versions to debug training runs and manage deployment during its own production pipeline. Also introduced mid-turn steering (redirect the model mid-task without context loss) and became the first OpenAI model rated "High capability" for cybersecurity (77.6% CTF benchmark). Released the same day as Claude Opus 4.6 — the timing was not accidental.

---

## Google — Gemini & Gemma

| Model | Released | Context | SWE-bench Verified | Key addition |
|-------|----------|---------|-------------------|-------------|
| Gemini 1.0 (Ultra/Pro/Nano) | Dec 2023 | 32K | — | First Gemini family, multimodal |
| Gemini 1.5 Pro | Feb 2024 | 1M | — | 1M token context, long-doc reasoning |
| Gemini 1.5 Flash | May 2024 | 1M | — | Fast and efficient with long context |
| Gemini 2.0 Flash | Dec 2024 | 1M | — | Agentic capabilities, tool use, real-time |
| Gemini 2.5 Pro | Mar 2025 | 1M | ~63% | Thinking mode, strong coding benchmarks |
| Gemini 3.1 Pro | Early 2026 | 1M | — | SWE-bench Pro: 54.2% |
| **Gemma 4** | **Apr 2, 2026** | **256K** | — | **Open-weight (Apache 2.0), 80% LiveCodeBench v6, 2,150 Codeforces ELO, runs on single consumer GPU** |

**On Gemma 4** — 26B MoE architecture that runs on a single consumer GPU with 256K context. First open-weight model to make a serious case for local coding agents: 80% LiveCodeBench v6, Codeforces ELO of 2,150, and agentic tool-use scores that outclass the previous generation. Compatible with any OpenAI-compatible server — works directly with `aider`, `continue.dev`, and similar tools.
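
Wiring a locally served Gemma 4 into your own tooling is just an OpenAI-compatible `/chat/completions` call. A minimal sketch, assuming a local server (vLLM, llama.cpp's server, or similar) listening on `localhost:8000` and exposing the model as `gemma-4` — both the port and the served model name depend on your deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed local server address

def chat_request(prompt: str, model: str = "gemma-4", temperature: float = 0.2) -> urllib.request.Request:
    """Build a /chat/completions request for any OpenAI-compatible server."""
    body = json.dumps({
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )

def complete(prompt: str) -> str:
    """Send the request; requires the local server to be running."""
    with urllib.request.urlopen(chat_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Build (but don't send) a request to inspect the payload:
req = chat_request("Write a binary search in Go.")
print(json.loads(req.data)["model"])  # → gemma-4
```

`aider` and `continue.dev` speak to this same endpoint, so once the raw request works, those tools typically only need the base URL and model name.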

---

## Meta — Llama

| Model | Released | Params | Context | Key addition |
|-------|----------|--------|---------|-------------|
| Llama 2 | Jul 2023 | 7B–70B | 4K | First open-weight release licensed for production use |
| Llama 3 | Apr 2024 | 8B–70B | 8K | Strong coding, instruction following |
| Llama 3.1 | Jul 2024 | 8B–405B | 128K | 405B matches frontier, 128K context |
| Llama 3.2 | Sep 2024 | 1B–90B | 128K | Multimodal, small on-device models |
| Llama 4 | Apr 2025 | MoE | 1M | Mixture-of-Experts, near-frontier performance |

---

## Open-Source & Independent Labs

| Model | Lab | Released | License | Key achievement |
|-------|-----|----------|---------|----------------|
| Mistral Large | Mistral | Feb 2024 | Commercial | Competitive with GPT-4 on reasoning |
| DeepSeek-Coder V2 | DeepSeek | May 2024 | MIT | Strongest open-source coding model at launch |
| DeepSeek V3 | DeepSeek | Dec 2024 | MIT | Near-frontier, fraction of training cost |
| DeepSeek R1 | DeepSeek | Jan 2025 | MIT | Open-source reasoning model, matched o1 |
| Kimi K2.5 | Moonshot | Early 2026 | Proprietary | Compaction-in-the-loop RL; powers Cursor Composer 2 |
| **GLM-5.1** | **Z.AI** | **Apr 8, 2026** | **MIT** | **754B open-weight, 58.4% SWE-bench Pro — beat GPT-5.4 and Opus 4.6 at time of release** |

**On GLM-5.1** — 754B open-weight model under MIT license. Scored 58.4% on SWE-bench Pro at release, beating GPT-5.4 (57.7%), Claude Opus 4.6 (53.4%), and Gemini 3.1 Pro (54.2%). The headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenger.

---

## How to read the benchmark numbers

**SWE-bench Verified** tests whether a model can resolve real GitHub issues. A score of 80% means the model correctly resolves 4 in 5 tasks. Progress here correlates strongly with production value in agentic coding workflows, though the task set skews toward popular Python repositories.

**SWE-bench Pro** is harder and designed to resist data contamination — tasks are drawn from less-popular repos and non-Python languages. It's a better signal for where models actually stand when they can't pattern-match training data.

**LiveCodeBench** uses live competitive programming problems (updated continuously, so training data can't help), making it a clean signal for reasoning quality rather than memorization.

Treat all numbers as approximate signals, not precise rankings. Model capability is context-dependent. A model that tops SWE-bench might still be wrong for your codebase if your stack is niche, your tasks require very long context, or you need local deployment.
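
One way to calibrate "approximate signals": SWE-bench Verified contains 500 tasks, so any reported score carries binomial sampling error before you even account for scaffold and prompt variance. A quick sketch of the normal-approximation 95% interval — the 87.6% input is Opus 4.7's score from the table above, used here purely as an illustration:

```python
import math

def resolve_rate_interval(score: float, n_tasks: int = 500, z: float = 1.96):
    """Normal-approximation 95% interval for a benchmark resolve rate."""
    se = math.sqrt(score * (1 - score) / n_tasks)
    return score - z * se, score + z * se

low, high = resolve_rate_interval(0.876)
print(f"{low:.1%} to {high:.1%}")  # → 84.7% to 90.5%
```

By this measure, two models a couple of points apart on the same leaderboard are close to a statistical tie; treat small gaps as noise unless they replicate across benchmarks.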

