AI Models Reference

A curated reference for engineers who need to track the AI model landscape without wading through hype. Focused on models relevant to coding, agentic workflows, and software development. Updated every Monday.

Benchmarks used here:

  • SWE-bench Verified — resolving real GitHub issues from popular repos
  • SWE-bench Pro — harder, multi-language variant designed to be contamination-resistant
  • LiveCodeBench — live competitive programming problems, updated continuously
  • HumanEval — function synthesis from docstrings (older benchmark, now mostly saturated)
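HumanEval-style results are typically reported as pass@k: the probability that at least one of k sampled completions passes the tests. A minimal sketch of the standard unbiased pass@k estimator (the formula introduced alongside HumanEval), assuming n generations per problem, c of which are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of them
    correct, passes the tests."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 50 of them correct
print(round(pass_at_k(200, 50, 1), 3))  # pass@1 = c/n = 0.25
```

For k = 1 this reduces to the simple fraction c/n; the combinatorial form only matters when reporting pass@10 or pass@100 from a fixed pool of samples.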

Anthropic — Claude

The primary recommendation for serious agentic coding. Claude Code is built on this model family.

| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|---|---|---|---|---|---|
| Claude 1 | Mar 2023 | 9K | — | — | First release, Constitutional AI |
| Claude 2 | Jul 2023 | 100K | — | — | 2× longer context, improved reasoning |
| Claude 2.1 | Nov 2023 | 200K | — | — | Reduced hallucinations, 200K context |
| Claude 3 Haiku | Mar 2024 | 200K | — | — | Fast, lightweight, low cost |
| Claude 3 Sonnet | Mar 2024 | 200K | — | — | Balanced speed/capability |
| Claude 3 Opus | Mar 2024 | 200K | ~38% | — | Most capable at launch, topped early benchmarks |
| Claude 3.5 Sonnet (v1) | Jun 2024 | 200K | ~49% | — | Surpassed Opus on coding at lower cost |
| Claude 3.5 Sonnet (v2) | Oct 2024 | 200K | ~57% | — | Computer use (beta), improved agentic behavior |
| Claude 3.5 Haiku | Nov 2024 | 200K | ~41% | — | Fast + capable small model |
| Claude 3.7 Sonnet | Feb 2025 | 200K | ~70% | — | Extended thinking, hybrid reasoning mode |
| Claude Haiku 4.5 | Late 2025 | 200K | — | — | 4th-gen architecture, speed-optimized |
| Claude Sonnet 4.5 | Late 2025 | 200K | — | — | Balanced 4th-gen model |
| Claude Sonnet 4.6 | Early 2026 | 1M¹ | ~75% | — | 1M token context GA (Mar 13, 2026) |
| Claude Opus 4.6 | Feb 5, 2026 | 1M¹ | 80.8% | 53.4% | Flagship at launch, 1M context GA |
| Claude Opus 4.7 | Apr 16, 2026 | 1M | 87.6% | 64.3% | Implicit-need tests, 3× vision resolution, multi-agent coordination |

¹ 1M token context became generally available on Sonnet 4.6 and Opus 4.6 on March 13, 2026, with standard pricing throughout.

On Claude Opus 4.7 — the current performance leader. Key improvements over 4.6: one-third the tool errors in agentic loops, 14% improvement on complex multi-step workflows using fewer tokens, and native multi-agent coordination for parallel workstreams. The first Claude to pass implicit-need tests — meaning it can infer which tools to reach for without being explicitly told. Became the default opus API alias on April 23, 2026.
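Claims like "one-third the tool errors in agentic loops" matter because in a long-running agent, a failed tool call either burns a retry or derails the whole task. A minimal sketch of the retry-with-feedback pattern most agent harnesses use around tool calls (all names here are illustrative, not Anthropic's API):

```python
import time

def call_tool_with_retry(tool, args, max_retries=3, backoff=0.05):
    """Run a tool call without crashing the agent loop.

    Returns (ok, result). On repeated failure, the error text is
    returned as an observation so the model can self-correct instead
    of the loop aborting."""
    for attempt in range(max_retries):
        try:
            return True, tool(**args)
        except Exception as e:
            if attempt == max_retries - 1:
                return False, f"tool failed after {max_retries} tries: {e}"
            time.sleep(backoff * 2 ** attempt)  # exponential backoff

# illustrative tool that fails once with a transient error, then succeeds
calls = {"n": 0}
def flaky_grep(pattern):
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("transient")
    return [f"match: {pattern}"]

ok, result = call_tool_with_retry(flaky_grep, {"pattern": "TODO"})
print(ok, result)  # True ['match: TODO']
```

A model that makes fewer malformed calls in the first place needs fewer trips through this loop, which is where the "fewer tokens per workflow" saving comes from.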


OpenAI — GPT & o-series

| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|---|---|---|---|---|---|
| GPT-4 | Mar 2023 | 8K/32K | — | — | First multimodal GPT, reasoning jump |
| GPT-4 Turbo | Nov 2023 | 128K | — | — | 128K context, lower cost, JSON mode |
| GPT-4o | May 2024 | 128K | ~33% | — | Omni model, faster, native multimodal |
| GPT-4o mini | Jul 2024 | 128K | — | — | Small, cheap, high throughput |
| o1 | Sep 2024 | 128K | ~49% | — | Chain-of-thought reasoning, "thinking tokens" |
| o1-mini | Sep 2024 | 128K | — | — | Reasoning at lower cost |
| o3 | Jan 2025 | 200K | ~72% | — | Strong reasoning, ARC-AGI breakthrough |
| o4-mini | Apr 2025 | 200K | ~68% | — | Efficient reasoning model |
| GPT-5 | Mid-2025 | 256K | — | — | Multimodal flagship |
| GPT-5.3-Codex | Feb 5, 2026 | 256K | ~78% | — | First to participate in its own training pipeline; mid-turn steering |
| GPT-5.4 | Mar 5, 2026 | 256K | 80.6% | 57.7% | Superseded 5.3-Codex; integrated Codex plugin for Claude Code |

On GPT-5.3-Codex — notable for being “instrumental in creating itself”: the team used early versions to debug training runs and manage deployment during its own production pipeline. Also introduced mid-turn steering (redirect the model mid-task without context loss) and became the first OpenAI model rated “High capability” for cybersecurity (77.6% CTF benchmark). Released the same day as Claude Opus 4.6 — the timing was not accidental.


Google — Gemini & Gemma

| Model | Released | Context | SWE-bench Verified | Key addition |
|---|---|---|---|---|
| Gemini 1.0 (Ultra/Pro/Nano) | Dec 2023 | 32K | — | First Gemini family, multimodal |
| Gemini 1.5 Pro | Feb 2024 | 1M | — | 1M token context, long-doc reasoning |
| Gemini 1.5 Flash | May 2024 | 1M | — | Fast and efficient with long context |
| Gemini 2.0 Flash | Dec 2024 | 1M | — | Agentic capabilities, tool use, real-time |
| Gemini 2.5 Pro | Mar 2025 | 1M | ~63% | Thinking mode, strong coding benchmarks |
| Gemini 3.1 Pro | Early 2026 | 1M | — | SWE-bench Pro: 54.2% |
| Gemma 4 | Apr 2, 2026 | 256K | — | Open-weight (Apache 2.0), 80% LiveCodeBench v6, 2,150 Codeforces Elo, runs on single consumer GPU |

On Gemma 4 — 26B MoE architecture that runs on a single consumer GPU with 256K context. First open-weight model to make a serious case for local coding agents: 80% on LiveCodeBench v6, a Codeforces Elo of 2,150, and agentic tool-use scores that outclass the previous generation. It speaks the standard OpenAI wire format, so it drops directly into aider, continue.dev, and any other tool that can target an OpenAI-compatible server.
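Pointing a standard OpenAI-style client at a local model is mostly a base-URL change. A minimal sketch of the request shape an OpenAI-compatible `/v1/chat/completions` endpoint expects — the localhost URL and the `gemma-4` model id are assumptions for a local deployment, not official values; check your server's `/v1/models` listing:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed local OpenAI-compatible server
MODEL = "gemma-4"                      # assumed model id for this illustration

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a chat-completions request in the OpenAI wire format.

    Only constructs the request; call urllib.request.urlopen(req)
    against a running server to actually send it."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Write a binary search in Python.")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Because aider, continue.dev, and most agent frontends speak exactly this format, any server that implements it works as a drop-in backend for a local Gemma 4.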


Meta — Llama

| Model | Released | Params | Context | Key addition |
|---|---|---|---|---|
| Llama 2 | Jul 2023 | 7B–70B | 4K | First major open-source release for production use |
| Llama 3 | Apr 2024 | 8B–70B | 8K | Strong coding, instruction following |
| Llama 3.1 | Jul 2024 | 8B–405B | 128K | 405B matches frontier, 128K context |
| Llama 3.2 | Sep 2024 | 1B–90B | 128K | Multimodal, small on-device models |
| Llama 4 | Apr 2025 | MoE | 1M | Mixture-of-Experts, near-frontier performance |

Open-Source & Independent Labs

| Model | Lab | Released | License | Key achievement |
|---|---|---|---|---|
| Mistral Large | Mistral | Feb 2024 | Commercial | Competitive with GPT-4 on reasoning |
| DeepSeek-Coder V2 | DeepSeek | May 2024 | MIT | Strongest open-source coding model at launch |
| DeepSeek V3 | DeepSeek | Dec 2024 | MIT | Near-frontier, fraction of training cost |
| DeepSeek R1 | DeepSeek | Jan 2025 | MIT | Open-source reasoning model, matched o1 |
| Kimi K2.5 | Moonshot | Early 2026 | Proprietary | Compaction-in-the-loop RL; powers Cursor Composer 2 |
| GLM-5.1 | Z.AI | Apr 8, 2026 | MIT | 754B open-weight; 58.4% SWE-bench Pro, beating GPT-5.4 and Opus 4.6 at time of release |

On GLM-5.1 — 754B open-weight model under MIT license. Scored 58.4% on SWE-bench Pro at release, beating GPT-5.4 (57.7%), Claude Opus 4.6 (53.4%), and Gemini 3.1 Pro (54.2%). The headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenger.


How to read the benchmark numbers

SWE-bench Verified tests whether a model can resolve real GitHub issues. A score of 80% means the model correctly resolves 4 in 5 tasks. Progress on this benchmark directly translates to production value in agentic coding workflows.

SWE-bench Pro is harder and designed to resist data contamination — tasks are drawn from less-popular repos and non-Python languages. It’s a better signal for where models actually stand when they can’t pattern-match training data.

LiveCodeBench uses live competitive programming problems (updated continuously, so training data can’t help), making it a clean signal for reasoning quality rather than memorization.

Treat all numbers as approximate signals, not precise rankings. Model capability is context-dependent. A model that tops SWE-bench might still be wrong for your codebase if your stack is niche, your tasks require very long context, or you need local deployment.
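One concrete reason to treat the numbers as approximate: SWE-bench Verified contains only 500 tasks, so even before scaffolding and prompting differences, a reported score carries a few points of binomial sampling error. A quick sketch of the 95% normal-approximation interval for a score measured on n tasks:

```python
from math import sqrt

def score_interval(p: float, n: int = 500) -> tuple[float, float]:
    """95% normal-approximation confidence interval for a benchmark
    score p (as a fraction) measured on n tasks."""
    se = sqrt(p * (1 - p) / n)  # standard error of a binomial proportion
    return p - 1.96 * se, p + 1.96 * se

lo, hi = score_interval(0.80)
print(f"80% on 500 tasks ~ [{lo:.1%}, {hi:.1%}]")  # roughly [76.5%, 83.5%]
```

By this estimate, a one- or two-point gap between models on the same leaderboard can sit entirely within sampling noise; treat such differences as a tie unless they replicate.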