Skip to main content

AI Models Reference

·3393 words·16 mins·
Author
Florent Clairambault
CTO & Software engineer

A curated reference for engineers who need to track the AI model landscape without wading through hype. Focused on models relevant to coding, agentic workflows, and software development. Updated every Monday.

Benchmarks used here:

  • SWE-bench Verified — resolving real GitHub issues from popular repos
  • SWE-bench Pro — harder, multi-language variant designed to be contamination-resistant
  • LiveCodeBench — live competitive programming problems, updated continuously
  • HumanEval — function synthesis from docstrings (older benchmark, now mostly saturated)

Anthropic — Claude
#

The primary recommendation for serious agentic coding. Claude Code is built on this model family.

ModelReleasedContextSWE-bench VerifiedSWE-bench ProKey addition
Claude 1Mar 20239KFirst release, Constitutional AI
Claude 2Jul 2023100K2× longer context, improved reasoning
Claude 2.1Nov 2023200KReduced hallucinations, 200K context
Claude 3 HaikuMar 2024200KFast, lightweight, low cost
Claude 3 SonnetMar 2024200KBalanced speed/capability
Claude 3 OpusMar 2024200K~38%Most capable at launch, topped early benchmarks
Claude 3.5 Sonnet (v1)Jun 2024200K~49%Surpassed Opus on coding at lower cost
Claude 3.5 Sonnet (v2)Oct 2024200K~57%Computer use (beta), improved agentic behavior
Claude 3.5 HaikuNov 2024200K~41%Fast + capable small model
Claude 3.7 SonnetFeb 2025200K~70%Extended thinking, hybrid reasoning mode
Claude Haiku 4.5Late 2025200K4th-gen architecture, speed-optimized
Claude Sonnet 4.5Late 2025200KBalanced 4th-gen model
Claude Sonnet 4.6Early 20261M¹~75%1M token context GA (Mar 13, 2026)
Claude Opus 4.6Feb 5, 20261M¹80.8%53.4%Flagship at launch, 1M context GA
Claude Mythos PreviewApr 7, 20261MAutonomous zero-day discovery across all major OS and browsers; restricted to Project Glasswing defense partners; not commercially available
Claude Opus 4.7Apr 16, 20261M87.6%64.3%Implicit-need tests, 3× vision resolution, multi-agent coordination
Claude Opus 4.8May 28, 20261M69.2%Dynamic Workflows (hundreds of parallel subagents for codebase-scale migrations), 4× less likely to leave code flaws unreported, Fast mode

¹ 1M token context became generally available on Sonnet 4.6 and Opus 4.6 on March 13, 2026, with standard pricing throughout.

Current API aliases: claude-opus-4-8 (Opus 4.8), claude-sonnet-4-6 (Sonnet 4.6), claude-haiku-4-5-20251001 (Haiku 4.5).

On Claude Mythos Preview — the first model Anthropic has publicly declined to release on capability grounds. Announced April 7, 2026 alongside Project Glasswing, a restricted-access program giving Mythos to a select group of infrastructure defenders (AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Microsoft, NVIDIA, Palo Alto Networks, and ~40 additional organizations). Mythos autonomously discovered thousands of high-severity zero-day vulnerabilities across every major operating system and web browser — including a 27-year-old bug in OpenBSD and a 16-year-old flaw in FFmpeg. The UK AI Security Institute gave it expert-level hacking tasks; it succeeded 73% of the time. Anthropic’s own characterization: “unprecedented offensive cybersecurity capability.” SWE-bench and Terminal-Bench scores have not been published; the announcement focused entirely on security capability. Not available via the standard API, Claude.ai, or Claude Code subscriptions. The geopolitical implications are significant: every Project Glasswing launch partner is US-headquartered or deeply embedded in US infrastructure.

On Claude Opus 4.8 — the current performance leader and the fastest flagship iteration Anthropic has shipped: 41 days after Opus 4.7, it lifts SWE-bench Pro from 64.3% to 69.2%. The headline reliability stat is that it is 4× less likely to let code flaws go unreported — a direct attack on the “almost right” silent-bug problem behind the developer-trust gap. The marquee feature is Dynamic Workflows (research preview): an orchestrating agent that fans work across hundreds of parallel subagents for codebase-scale migrations “from kickoff to merge.” Pricing is unchanged from 4.7 ($5/$25 per million input/output tokens); a new Fast mode runs at 2× the rate for 2.5× the speed. In Claude Code it defaults to high-effort mode with a leaner system prompt. Released May 28, 2026 alongside Anthropic’s $65B Series H at a $965B valuation, and a teased “Mythos-class” GA “in the coming weeks.” Covered in depth at Claude Opus 4.8 and the $965B Question.

On Claude Opus 4.7 — the prior performance leader. Key improvements over 4.6: one-third the tool errors in agentic loops, 14% improvement on complex multi-step workflows using fewer tokens, and native multi-agent coordination for parallel workstreams. The first Claude to pass implicit-need tests — meaning it can infer which tools to reach for without being explicitly told. Became the default opus API alias on April 23, 2026.


OpenAI — GPT & o-series
#

ModelReleasedContextSWE-bench VerifiedSWE-bench ProKey addition
GPT-4Mar 20238K/32KFirst multimodal GPT, reasoning jump
GPT-4 TurboNov 2023128K128K context, lower cost, JSON mode
GPT-4oMay 2024128K~33%Omni model, faster, native multimodal
GPT-4o miniJul 2024128KSmall, cheap, high throughput
o1Sep 2024128K~49%Chain-of-thought reasoning, “thinking tokens”
o1 miniSep 2024128KReasoning at lower cost
o3Jan 2025200K~72%Strong reasoning, ARC-AGI breakthrough
o4 miniApr 2025200K~68%Efficient reasoning model
GPT-5Mid-2025256KMultimodal flagship
GPT-5.3-CodexFeb 5, 2026256K~78%First to participate in its own training pipeline; mid-turn steering
GPT-5.4Mar 5, 2026256K80.6%57.7%Superseded 5.3-Codex; integrated Codex plugin for Claude Code
GPT-5.5 “Spud”Apr 23, 2026256K58.6%First fully retrained base since GPT-4.5; 82.7% Terminal-Bench 2.0, leads Expert-SWE (73.1%) and GDPval (84.9%); 159 Epoch Capabilities Index (Apr 28)
GPT-5.5-CyberMay 7, 2026256KCybersecurity-specialized variant of GPT-5.5; more permissive for authorized red teaming, pen testing, and controlled validation; restricted to vetted Trusted Access for Cyber partners

On GPT-5.5-Cyber — a fine-tuned variant of GPT-5.5 with relaxed guardrails for authorized defensive security workflows: writing proofs of concept for discovered vulnerabilities, running authorized red team simulations, and reverse-engineering malware for threat analysis. Access is gated behind OpenAI’s Trusted Access for Cyber program, with account-level controls and verified institutional affiliation required. Advanced Account Security becomes mandatory for all Cyber program members from June 1, 2026. This is not a new base model — it is GPT-5.5 trained to be more permissive on security-related tasks for a small population of vetted defenders. Anthropic made a similar move earlier with Claude Security (Opus 4.7-powered), though Claude Security’s architecture focuses on reasoning-based scanning rather than permissive red team assistance.

On GPT-5.5 “Spud” — the first ground-up retraining since GPT-4.5. OpenAI trained for long-horizon task coherence: the model maintains state across multi-step tool use rather than producing high-quality individual responses. Leads on Terminal-Bench 2.0 (82.7% vs Opus 4.7’s 69.4%), Expert-SWE (73.1%), and GDPval (84.9%). Claude Opus 4.7 still leads on SWE-bench Pro (64.3% vs 58.6%) and MCP-Atlas (79.1% vs 75.3%). Currently live in ChatGPT and Codex for paid tiers; API in controlled rollout at launch. Priced at $5/$30 per million input/output tokens.

On GPT-5.3-Codex — notable for being “instrumental in creating itself”: the team used early versions to debug training runs and manage deployment during its own production pipeline. Also introduced mid-turn steering (redirect the model mid-task without context loss) and became the first OpenAI model rated “High capability” for cybersecurity (77.6% CTF benchmark). Released the same day as Claude Opus 4.6 — the timing was not accidental.


Google — Gemini & Gemma
#

ModelReleasedContextSWE-bench VerifiedKey addition
Gemini 1.0 (Ultra/Pro/Nano)Dec 202332KFirst Gemini family, multimodal
Gemini 1.5 ProFeb 20241M1M token context, long-doc reasoning
Gemini 1.5 FlashMay 20241MFast and efficient with long context
Gemini 2.0 FlashDec 20241MAgentic capabilities, tool use, real-time
Gemini 2.5 ProMar 20251M~63%Thinking mode, strong coding benchmarks
Gemini 3.1 ProEarly 20261MSWE-bench Pro: 54.2%
Gemma 4Apr 2, 2026256KOpen-weight (Apache 2.0), 80% LiveCodeBench v6, 2,150 Codeforces ELO, runs on single consumer GPU
Gemini 3.1 UltraMay 20262M2M token context; native multimodal reasoning (text, image, audio, video); Google’s most capable model to date
Gemini 3.5 FlashMay 19, 20261M78%4× faster output than other frontier models; 76.2% Terminal-Bench 2.1, 83.6% MCP Atlas; “Flash” now matches last year’s Pro on coding; $1.50/$9.00/M tokens
Gemini 3.5 ProAnnounced May 19, 2026; GA June 20262M2M token context; Deep Think reasoning mode; completes the 3.5 family above Flash; $15/$60 per M tokens (estimated); first GA model with 2M context

On Gemini 3.5 Pro — the capstone of Google’s 3.5 family, announced at Google I/O on May 19, 2026, with general availability confirmed for June 2026. Extends context to 2M tokens (double Flash’s ceiling, matching the 3.1 Ultra research preview). Adds a Deep Think reasoning mode — Google’s equivalent of extended thinking — for multi-step problems where Flash’s 4× speed trades against reasoning depth. Flash deliberately improved on 3.1 Pro’s coding benchmarks but regressed on hardest reasoning; 3.5 Pro is designed to close that gap. Expected pricing: ~$15/$60 per million input/output tokens (roughly 10× Flash). Available first to Google AI Pro and Ultra subscribers, then API rollout. SWE-bench Verified and Pro scores not yet published at announcement.

On Gemini 3.5 Flash — the most significant Flash release to date, and the first time a Flash-tier model has meaningfully outperformed the previous generation’s Pro on agentic coding benchmarks. Launched at Google I/O on May 19, 2026, it leads on MCP Atlas (83.6%) and GDPval-AA, runs at roughly 4× the output speed of comparable frontier models, and integrates directly with Google’s Antigravity agent harness (now available via the Gemini Managed Agents API). Trails Claude Opus 4.7 (87.6% SWE-bench Verified, 64.3% SWE-bench Pro) and GPT-5.5 (82.7% Terminal-Bench 2.0) on the harder benchmarks, but at $1.50/$9.00 per million tokens it offers the best cost-to-performance ratio of any frontier-class Google model. Covered in depth at Gemini 3.5 Flash: Google’s Fast Tier Just Became a Frontier Tier.

On Gemma 4 — 26B MoE architecture that runs on a single consumer GPU with 256K context. First open-weight model to make a serious case for local coding agents: 80% LiveCodeBench v6, Codeforces ELO of 2,150, and agentic tool-use scores that outclass the previous generation. Compatible with any OpenAI-compatible server — works directly with aider, continue.dev, and similar tools.

On Gemini 3.1 Ultra — Google’s most capable model release of the year. Extends context to 2M tokens (double the previous Gemini ceiling) with native multimodal reasoning across text, image, audio, and video in a unified architecture. Agentic coding benchmarks not yet independently verified at publication; positioned as Google’s answer to Claude Opus 4.7 and GPT-5.5 for long-horizon multi-step tasks. SWE-bench scores pending.


Meta — Llama
#

ModelReleasedParamsContextKey addition
Llama 2Jul 20237B–70B4KFirst major open-source release for production use
Llama 3Apr 20248B–70B8KStrong coding, instruction following
Llama 3.1Jul 20248B–405B128K405B matches frontier, 128K context
Llama 3.2Sep 20241B–90B128KMultimodal, small on-device models
Llama 4Apr 2025MoE1MMixture-of-Experts, near-frontier performance

xAI — Grok
#

Elon Musk’s AI lab. Grok is xAI’s primary model family — positioned as a frontier model for chat and agentic workflows, with a growing focus on enterprise productivity use cases (legal, finance) and cost-competitive API pricing. Grok Build is xAI’s terminal-native coding agent, comparable in concept to Claude Code but trailing on coding benchmarks.

ModelReleasedContextSWE-bench VerifiedKey addition
Grok 1Mar 20248KFirst public Grok; open-sourced under Apache 2.0 (314B MoE)
Grok 2Aug 2024128KSignificant reasoning improvement; multimodal input
Grok 3Feb 2025128KMajor capability jump; first competitive frontier Grok
Grok 4Late 2025256KExtended context; improved instruction following
Grok 4.3May 4, 20261M~51%¹40% input price cut ($1.25/M), native video input, 16-Agent Heavy orchestration; #1 ArtificialAnalysis agentic tool-calling leaderboard
Grok BuildMay 14, 20261M70.8%Terminal-native coding agent (grok-code-fast-1); CLI-native; $300/month

¹ Grok 4.3 trails Claude Opus 4.7 by approximately 14pp on SWE-bench Pro per ArtificialAnalysis comparative data. Exact SWE-bench Verified score not independently published by xAI.

On Grok 4.3 — xAI’s clearest pivot from benchmark competition to practical cost and productivity. Released May 4, 2026 with a 40% input price reduction ($1.25/M tokens), native video input, 1M token context, and 16-Agent Heavy (an orchestrator that coordinates up to 16 parallel worker agents). Leads on niche enterprise benchmarks: #1 on ArtificialAnalysis’s agentic tool-calling leaderboard, #1 on ValsAI CaseLaw v2 and CorpFin. Trails Claude Opus 4.7 and GPT-5.5 on general coding measures. Followed by Grok Skills (May 18, 2026): persistent cross-session expertise that replaces repetitive system-prompt preambles — document generation, deck creation, spreadsheet editing, and custom workflow automation. Covered in depth at Grok 4.3 and Grok Skills: xAI’s Pivot From Benchmark Hype to Business Reality.

On Grok Build — xAI’s terminal-native coding agent, launched May 14, 2026. Built on grok-code-fast-1, a speed-optimized variant of Grok 4.3. Scores 70.8% on SWE-bench Verified — meaningfully behind Claude Opus 4.7 (87.6%) and GPT-5.5 (from Terminal-Bench 2.0 data) but ahead of earlier-generation IDE-embedded agents. CLI-native and local-first by design. Priced at $300/month (introductory $99). Arena Mode (head-to-head agent comparison in parallel worktrees) was announced but not yet live at launch.


Alibaba — Qwen
#

The most prolific open-weight model lineage outside of Meta’s Llama. Alibaba’s Qwen family covers general-purpose LLMs, coding specialists, and reasoning models — with a release cadence that accelerated from roughly annual in 2023 to near-monthly by 2026. The sub-35B models ship under Apache 2.0; frontier flagships are API-only proprietary.

ModelReleasedContextSWE-bench VerifiedSWE-bench ProKey addition
Qwen 2.5-Coder (7B–72B)Nov 2024128K~70%First dedicated coding model; 5.5T code training tokens; Apache 2.0
QwQ-32BMar 2025128KReasoning-focused; RL chain-of-thought; AIME24: 79.5%; competes with o1-mini; Apache 2.0
Qwen 3 (235B-A22B)Apr 2025128KHybrid thinking/non-thinking mode toggle; 2,056 Codeforces ELO; MCP support
Qwen 3-Coder (480B-A35B)Jul 2025256K~70%Agentic coding specialist; Qwen Code CLI companion; Apache 2.0
Qwen 3-Coder-Next (80B-A3B)Feb 2026256K~71%44.3%Local-friendly: 3B active params; runs on consumer GPU; $0.11/$0.80 per M tokens
Qwen 3.5-27BFeb 20261M72.4%50.9%1M context; Apache 2.0
Qwen 3.6-27BApr 2026128K77.2%53.5%Dense 27B outperforms 397B MoE on SWE-bench Pro; Apache 2.0
Qwen 3.6-PlusApr 20261M78.8%Proprietary; MCPMark leader at launch; $0.50/$3.00 per M tokens
Qwen 3.6-Max-PreviewApr 20261M58.4%Closed frontier; claims #1 across 6 agent/coding benchmarks as of Apr 2026

On Qwen 3 and the MoE efficiency story — the Qwen team’s key differentiator is sparse MoE architecture with very few active parameters: 3B–35B active out of 35B–480B total. Qwen 3-Coder-Next (80B-A3B) delivers 71% SWE-bench Verified locally at $0.11/M input. Qwen 3.6-35B-A3B (released Apr 16, 2026) reaches 73.4% on a consumer GPU under Apache 2.0. For teams that need local, private AI coding with auditable weights, this is the only serious option at this capability level — Claude and GPT-4o have no equivalent.

On Qwen 3.6-Max-Preview — Alibaba’s current closed frontier model, API-only with undisclosed parameter count. Claims the #1 rank across 6 coding and agentic benchmarks as of April 2026, including SWE-bench Pro (58.4%), Terminal-Bench 2.0 (65.4%), and several agentic tool-use suites. SWE-bench Verified scores not published — the emphasis on agentic benchmarks suggests the model is tuned for tool-use pipelines over single-shot completions. Available via Alibaba Cloud’s DashScope API; pricing not publicly listed.

On Qwen Code — Alibaba’s answer to Claude Code. Launched in July 2025 as a terminal-native CLI agent forked from Gemini CLI, supporting multiple API backends (Alibaba Cloud, OpenRouter, Fireworks, local Ollama). Pairs with Qwen 3-Coder and Qwen 3.6-Plus as default backends. Open-source under Apache 2.0. The existence of Qwen Code illustrates the market dynamic clearly: Anthropic defined the terminal-native agentic coding category with Claude Code; within a year, every major lab shipped a clone. None match Claude Code’s depth of integration — hooks, MCP ecosystem maturity, operator SDK — but Qwen Code’s open-source nature and local-model support give it a distinct value proposition for privacy-sensitive or air-gapped teams.


Open-Source & Independent Labs
#

ModelLabReleasedLicenseKey achievement
Mistral LargeMistralFeb 2024CommercialCompetitive with GPT-4 on reasoning
DeepSeek-Coder V2DeepSeekMay 2024MITStrongest open-source coding model at launch
DeepSeek V3DeepSeekDec 2024MITNear-frontier, fraction of training cost
DeepSeek R1DeepSeekJan 2025MITOpen-source reasoning model, matched o1
DeepSeek V4-FlashDeepSeekApr 24, 2026MIT284B MoE, 1M context, $0.14/$0.28 per M tokens — best price-performance at this tier
DeepSeek V4-ProDeepSeekApr 24, 2026MIT1.6T param MoE (49B active), 80.6% SWE-bench Verified, 93.5% LiveCodeBench, 3206 Codeforces — 1/6th cost of Opus 4.7
Kimi K2.5MoonshotEarly 2026ProprietaryCompaction-in-the-loop RL; powers Cursor Composer 2
Kimi K2.6MoonshotMay 12, 2026Modified MIT1T MoE (32B active), 58.6% SWE-bench Pro, 66.7% Terminal-Bench 2.0, Agent Swarm 300 sub-agents — most capable open-weight coding model at release
MiniMax M2.5MiniMaxEarly 2026MIT-style80.2% SWE-bench Verified; $0.30/1M input tokens — strongest open-source price-performance at launch
GLM-5.1Z.AIApr 8, 2026MIT754B open-weight, 58.4% SWE-bench Pro — beat GPT-5.4 and Opus 4.6 at time of release
MiniMax M2.7MiniMaxApr 12, 2026MIT-style56.22% SWE-bench Pro, 57.0% Terminal Bench 2 — first model to participate in its own training cycle via 100 autonomous RL rounds

On GLM-5.1 — 754B open-weight model under MIT license. Scored 58.4% on SWE-bench Pro at release, beating GPT-5.4 (57.7%), Claude Opus 4.6 (53.4%), and Gemini 3.1 Pro (54.2%). The headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenger.

On MiniMax M2.7 — the first open-source model to participate in its own development cycle: given write access to its RL scaffolding, it autonomously ran 100 rounds of self-optimization, producing a 30% performance gain over M2.5. Scores 56.22% on SWE-bench Pro and 57.0% on Terminal Bench 2 — landing within 8 points of Claude Opus 4.7 on the harder benchmark with publicly available weights. Ships with native Agent Teams support and 97% skill adherence across 40 complex multi-tool workflows. Available on Hugging Face and Ollama under a modified MIT license.

On Kimi K2.6 — the most capable open-weight coding model released to date. 1T-parameter MoE architecture with only 32B parameters active per inference pass, making it cost-effective to run. Scores 58.6% on SWE-bench Pro (within 6 points of Claude Opus 4.7’s 64.3%) and 66.7% on Terminal-Bench 2.0 — matching GPT-5.5 on the harder agentic terminal benchmark. Agent Swarm mode coordinates up to 300 sub-agents across 4,000 steps for complex multi-component tasks. Priced at $0.60/$2.50 per million input/output tokens — approximately 1/8th the cost of Opus 4.7. Released under a Modified MIT license; weights available on Hugging Face. Covered in depth at Kimi K2.6: Most Capable Open-Weight Coding Model.

On DeepSeek V4-Pro — 1.6T parameter MoE (49B active per pass) with a hybrid CSA/HCA attention mechanism that cuts inference FLOPs by 73% and KV cache by 90% at 1M tokens compared to V3.2. Scores 80.6% on SWE-bench Verified (statistically tied with Claude Opus 4.7) and leads LiveCodeBench at 93.5%. Priced at $0.145/$3.48 per million input/output tokens — approximately 1/6th of Opus 4.7 — and released under MIT with self-hosting permitted. SWE-bench Pro scores not yet published at launch; agentic harness evaluation pending. V4-Flash offers the same 1M context at $0.14/$0.28 per million tokens for cost-sensitive workloads.


How to read the benchmark numbers
#

SWE-bench Verified tests whether a model can resolve real GitHub issues. A score of 80% means the model correctly resolves 4 in 5 tasks. Progress on this benchmark directly translates to production value in agentic coding workflows.

SWE-bench Pro is harder and designed to resist data contamination — tasks are drawn from less-popular repos and non-Python languages. It’s a better signal for where models actually stand when they can’t pattern-match training data.

LiveCodeBench uses live competitive programming problems (updated continuously, so training data can’t help), making it a clean signal for reasoning quality rather than memorization.

Treat all numbers as approximate signals, not precise rankings. Model capability is context-dependent. A model that tops SWE-bench might still be wrong for your codebase if your stack is niche, your tasks require very long context, or you need local deployment.