AI Models Reference

A curated reference for engineers who need to track the AI model landscape without wading through hype. Focused on models relevant to coding, agentic workflows, and software development. Updated every Monday.

Benchmarks used here:

  • SWE-bench Verified — resolving real GitHub issues from popular repos
  • SWE-bench Pro — harder, multi-language variant designed to be contamination-resistant
  • LiveCodeBench — live competitive programming problems, updated continuously
  • HumanEval — function synthesis from docstrings (older benchmark, now mostly saturated)
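HumanEval-style results are typically reported as pass@k: the probability that at least one of k sampled completions passes the tests. A minimal sketch of the standard unbiased pass@k estimator (the formula introduced alongside HumanEval), assuming n generations per problem, c of which are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of them
    correct, passes the tests."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 50 of them correct
print(round(pass_at_k(200, 50, 1), 3))  # pass@1 = c/n = 0.25
```

For k = 1 this reduces to the simple fraction c/n; the combinatorial form only matters when reporting pass@10 or pass@100 from a fixed pool of samples.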

Anthropic — Claude

The primary recommendation for serious agentic coding. Claude Code is built on this model family.

| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|---|---|---|---|---|---|
| Claude 1 | Mar 2023 | 9K | — | — | First release, Constitutional AI |
| Claude 2 | Jul 2023 | 100K | — | — | 2× longer context, improved reasoning |
| Claude 2.1 | Nov 2023 | 200K | — | — | Reduced hallucinations, 200K context |
| Claude 3 Haiku | Mar 2024 | 200K | — | — | Fast, lightweight, low cost |
| Claude 3 Sonnet | Mar 2024 | 200K | — | — | Balanced speed/capability |
| Claude 3 Opus | Mar 2024 | 200K | ~38% | — | Most capable at launch, topped early benchmarks |
| Claude 3.5 Sonnet (v1) | Jun 2024 | 200K | ~49% | — | Surpassed Opus on coding at lower cost |
| Claude 3.5 Sonnet (v2) | Oct 2024 | 200K | ~57% | — | Computer use (beta), improved agentic behavior |
| Claude 3.5 Haiku | Nov 2024 | 200K | ~41% | — | Fast + capable small model |
| Claude 3.7 Sonnet | Feb 2025 | 200K | ~70% | — | Extended thinking, hybrid reasoning mode |
| Claude Haiku 4.5 | Late 2025 | 200K | — | — | 4th-gen architecture, speed-optimized |
| Claude Sonnet 4.5 | Late 2025 | 200K | — | — | Balanced 4th-gen model |
| Claude Sonnet 4.6 | Early 2026 | 1M¹ | ~75% | — | 1M token context GA (Mar 13, 2026) |
| Claude Opus 4.6 | Feb 5, 2026 | 1M¹ | 80.8% | 53.4% | Flagship at launch, 1M context GA |
| Claude Opus 4.7 | Apr 16, 2026 | 1M | 87.6% | 64.3% | Implicit-need tests, 3× vision resolution, multi-agent coordination |

¹ 1M token context became generally available on Sonnet 4.6 and Opus 4.6 on March 13, 2026, with standard pricing throughout.

On Claude Opus 4.7 — the current performance leader. Key improvements over 4.6: one-third the tool errors in agentic loops, 14% improvement on complex multi-step workflows using fewer tokens, and native multi-agent coordination for parallel workstreams. The first Claude to pass implicit-need tests — meaning it can infer which tools to reach for without being explicitly told. Became the default opus API alias on April 23, 2026.
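Claims like "one-third the tool errors in agentic loops" matter because in a long-running agent, a failed tool call either burns a retry or derails the whole task. A minimal sketch of the retry-with-feedback pattern most agent harnesses use around tool calls (all names here are illustrative, not Anthropic's API):

```python
import time

def call_tool_with_retry(tool, args, max_retries=3, backoff=0.05):
    """Run a tool call without crashing the agent loop.

    Returns (ok, result). On repeated failure, the error text is
    returned as an observation so the model can self-correct instead
    of the loop aborting."""
    for attempt in range(max_retries):
        try:
            return True, tool(**args)
        except Exception as e:
            if attempt == max_retries - 1:
                return False, f"tool failed after {max_retries} tries: {e}"
            time.sleep(backoff * 2 ** attempt)  # exponential backoff

# illustrative tool that fails once with a transient error, then succeeds
calls = {"n": 0}
def flaky_grep(pattern):
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("transient")
    return [f"match: {pattern}"]

ok, result = call_tool_with_retry(flaky_grep, {"pattern": "TODO"})
print(ok, result)  # True ['match: TODO']
```

A model that makes fewer malformed calls in the first place needs fewer trips through this loop, which is where the "fewer tokens per workflow" saving comes from.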


OpenAI — GPT & o-series

| Model | Released | Context | SWE-bench Verified | SWE-bench Pro | Key addition |
|---|---|---|---|---|---|
| GPT-4 | Mar 2023 | 8K/32K | — | — | First multimodal GPT, reasoning jump |
| GPT-4 Turbo | Nov 2023 | 128K | — | — | 128K context, lower cost, JSON mode |
| GPT-4o | May 2024 | 128K | ~33% | — | Omni model, faster, native multimodal |
| GPT-4o mini | Jul 2024 | 128K | — | — | Small, cheap, high throughput |
| o1 | Sep 2024 | 128K | ~49% | — | Chain-of-thought reasoning, "thinking tokens" |
| o1-mini | Sep 2024 | 128K | — | — | Reasoning at lower cost |
| o3 | Jan 2025 | 200K | ~72% | — | Strong reasoning, ARC-AGI breakthrough |
| o4-mini | Apr 2025 | 200K | ~68% | — | Efficient reasoning model |
| GPT-5 | Mid-2025 | 256K | — | — | Multimodal flagship |
| GPT-5.3-Codex | Feb 5, 2026 | 256K | ~78% | — | First to participate in its own training pipeline; mid-turn steering |
| GPT-5.4 | Mar 5, 2026 | 256K | 80.6% | 57.7% | Superseded 5.3-Codex; integrated Codex plugin for Claude Code |

On GPT-5.3-Codex — notable for being “instrumental in creating itself”: the team used early versions to debug training runs and manage deployment during its own production pipeline. Also introduced mid-turn steering (redirect the model mid-task without context loss) and became the first OpenAI model rated “High capability” for cybersecurity (77.6% CTF benchmark). Released the same day as Claude Opus 4.6 — the timing was not accidental.


Google — Gemini & Gemma

| Model | Released | Context | SWE-bench Verified | Key addition |
|---|---|---|---|---|
| Gemini 1.0 (Ultra/Pro/Nano) | Dec 2023 | 32K | — | First Gemini family, multimodal |
| Gemini 1.5 Pro | Feb 2024 | 1M | — | 1M token context, long-doc reasoning |
| Gemini 1.5 Flash | May 2024 | 1M | — | Fast and efficient with long context |
| Gemini 2.0 Flash | Dec 2024 | 1M | — | Agentic capabilities, tool use, real-time |
| Gemini 2.5 Pro | Mar 2025 | 1M | ~63% | Thinking mode, strong coding benchmarks |
| Gemini 3.1 Pro | Early 2026 | 1M | — | SWE-bench Pro: 54.2% |
| Gemma 4 | Apr 2, 2026 | 256K | — | Open-weight (Apache 2.0), 80% LiveCodeBench v6, 2,150 Codeforces Elo, runs on single consumer GPU |

On Gemma 4 — 26B MoE architecture that runs on a single consumer GPU with 256K context. First open-weight model to make a serious case for local coding agents: 80% on LiveCodeBench v6, a Codeforces Elo of 2,150, and agentic tool-use scores that outclass the previous generation. It speaks the standard OpenAI wire format, so it drops directly into aider, continue.dev, and any other tool that can target an OpenAI-compatible server.
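Pointing a standard OpenAI-style client at a local model is mostly a base-URL change. A minimal sketch of the request shape an OpenAI-compatible `/v1/chat/completions` endpoint expects — the localhost URL and the `gemma-4` model id are assumptions for a local deployment, not official values; check your server's `/v1/models` listing:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed local OpenAI-compatible server
MODEL = "gemma-4"                      # assumed model id for this illustration

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a chat-completions request in the OpenAI wire format.

    Only constructs the request; call urllib.request.urlopen(req)
    against a running server to actually send it."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Write a binary search in Python.")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Because aider, continue.dev, and most agent frontends speak exactly this format, any server that implements it works as a drop-in backend for a local Gemma 4.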


Meta — Llama

| Model | Released | Params | Context | Key addition |
|---|---|---|---|---|
| Llama 2 | Jul 2023 | 7B–70B | 4K | First major open-source release for production use |
| Llama 3 | Apr 2024 | 8B–70B | 8K | Strong coding, instruction following |
| Llama 3.1 | Jul 2024 | 8B–405B | 128K | 405B matches frontier, 128K context |
| Llama 3.2 | Sep 2024 | 1B–90B | 128K | Multimodal, small on-device models |
| Llama 4 | Apr 2025 | MoE | 1M | Mixture-of-Experts, near-frontier performance |

Open-Source & Independent Labs

| Model | Lab | Released | License | Key achievement |
|---|---|---|---|---|
| Mistral Large | Mistral | Feb 2024 | Commercial | Competitive with GPT-4 on reasoning |
| DeepSeek-Coder V2 | DeepSeek | May 2024 | MIT | Strongest open-source coding model at launch |
| DeepSeek V3 | DeepSeek | Dec 2024 | MIT | Near-frontier, fraction of training cost |
| DeepSeek R1 | DeepSeek | Jan 2025 | MIT | Open-source reasoning model, matched o1 |
| Kimi K2.5 | Moonshot | Early 2026 | Proprietary | Compaction-in-the-loop RL; powers Cursor Composer 2 |
| GLM-5.1 | Z.AI | Apr 8, 2026 | MIT | 754B open-weight; 58.4% SWE-bench Pro, beating GPT-5.4 and Opus 4.6 at time of release |

On GLM-5.1 — 754B open-weight model under MIT license. Scored 58.4% on SWE-bench Pro at release, beating GPT-5.4 (57.7%), Claude Opus 4.6 (53.4%), and Gemini 3.1 Pro (54.2%). The headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenger.


How to read the benchmark numbers

SWE-bench Verified tests whether a model can resolve real GitHub issues. A score of 80% means the model correctly resolves 4 in 5 tasks. Progress on this benchmark directly translates to production value in agentic coding workflows.

SWE-bench Pro is harder and designed to resist data contamination — tasks are drawn from less-popular repos and non-Python languages. It’s a better signal for where models actually stand when they can’t pattern-match training data.

LiveCodeBench uses live competitive programming problems (updated continuously, so training data can’t help), making it a clean signal for reasoning quality rather than memorization.

Treat all numbers as approximate signals, not precise rankings. Model capability is context-dependent. A model that tops SWE-bench might still be wrong for your codebase if your stack is niche, your tasks require very long context, or you need local deployment.
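One concrete reason to treat the numbers as approximate: SWE-bench Verified contains only 500 tasks, so even before scaffolding and prompting differences, a reported score carries a few points of binomial sampling error. A quick sketch of the 95% normal-approximation interval for a score measured on n tasks:

```python
from math import sqrt

def score_interval(p: float, n: int = 500) -> tuple[float, float]:
    """95% normal-approximation confidence interval for a benchmark
    score p (as a fraction) measured on n tasks."""
    se = sqrt(p * (1 - p) / n)  # standard error of a binomial proportion
    return p - 1.96 * se, p + 1.96 * se

lo, hi = score_interval(0.80)
print(f"80% on 500 tasks ~ [{lo:.1%}, {hi:.1%}]")  # roughly [76.5%, 83.5%]
```

By this estimate, a one- or two-point gap between models on the same leaderboard can sit entirely within sampling noise; treat such differences as a tie unless they replicate.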