---
title: "81% vs. 46%: The AI Coding Benchmark That's Been Lying to You"
date: 2026-04-11
tags: ["SWE-bench","Benchmarks","AI Coding","Evaluation","Claude","GPT"]
categories: ["AI Tools","Industry"]
summary: "SWE-bench Verified — the benchmark that put every frontier model above 80% — is contaminated. OpenAI stopped reporting it in February. Here's what actually happened, what SWE-bench Pro replaces it with, and why 46% is a more honest number than 81%."
---


For the past twelve months, AI coding leaderboards have told a confident story: frontier models now solve more than 80% of real-world software engineering tasks. Claude Opus 4.5 at 80.9%. GPT-5.3 somewhere close behind. Gemini in the same vicinity. The implication was clear — we are rapidly approaching the point where AI can handle essentially any coding task you throw at it.

That story is wrong.

On February 23, 2026, OpenAI announced it would stop reporting SWE-bench Verified scores. Not because OpenAI's models stopped performing well — they still show 81% on Verified. They stopped reporting it because the benchmark has been compromised. The real number, on a contamination-resistant benchmark, is closer to 46%.

The 35-point gap is not a rounding error. It's a warning about what happens when the AI industry measures itself with its own ruler.

---

## What SWE-bench Verified Is

SWE-bench was introduced by researchers at Princeton and Chicago in 2023 as a novel approach to evaluating AI coding ability: instead of toy problems or contrived exercises, it uses **real GitHub issues from real open-source repositories**. Each task is a concrete bug report or feature request from a production codebase. The AI agent must produce a patch that passes the repository's existing test suite.
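The resolution criterion can be sketched in a few lines. SWE-bench's task metadata lists *FAIL_TO_PASS* tests (which the patch must fix) and *PASS_TO_PASS* tests (which it must not break); the sketch below assumes test results have already been collected, whereas the real harness actually executes the suites in containerized environments:

```python
def is_resolved(results: dict[str, str],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """SWE-bench counts a task as resolved only if the candidate patch
    makes every previously-failing test pass (FAIL_TO_PASS) while
    keeping every previously-passing test green (PASS_TO_PASS)."""
    fixed = all(results.get(t) == "PASSED" for t in fail_to_pass)
    unbroken = all(results.get(t) == "PASSED" for t in pass_to_pass)
    return fixed and unbroken

# A patch that fixes the reported bug but breaks an existing test
# does not count as a resolution:
print(is_resolved({"test_bug": "PASSED", "test_api": "FAILED"},
                  ["test_bug"], ["test_api"]))  # → False
```

The all-or-nothing criterion is why partial fixes score zero: a patch that resolves the issue but regresses one unrelated test is indistinguishable, on the leaderboard, from no patch at all.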

The **Verified** variant, introduced by OpenAI in mid-2024, curated a subset of 500 Python-only tasks, each manually validated by human contractors to confirm the problem was solvable and the test suite meaningful. It became the standard leaderboard that everyone cited.

It was also, it turns out, increasingly unreliable.

---

## What Went Wrong

OpenAI's internal audit identified three distinct failure modes in SWE-bench Verified:

**Contamination.** Every frontier model tested — GPT-5.2, Claude Opus 4.5, Gemini 3 Flash — showed evidence of training data contamination. For certain tasks, models could reproduce the **verbatim gold patch** from the benchmark's answer key, or recall problem-specific details that should only be accessible from the repository's commit history. The benchmark's tasks came from public GitHub repositories, which means they were present in training corpora for models trained after 2023.
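A verbatim-leak check of the kind the audit describes is simple to sketch. The whitespace normalization below is my own assumption about what "verbatim" should tolerate, not OpenAI's published methodology:

```python
def normalize(patch: str) -> str:
    # Strip trailing whitespace and drop blank lines so purely
    # cosmetic differences don't hide a verbatim reproduction.
    lines = [ln.rstrip() for ln in patch.strip().splitlines()]
    return "\n".join(ln for ln in lines if ln)

def is_verbatim_leak(model_patch: str, gold_patch: str) -> bool:
    """Flag a model output that reproduces the benchmark's gold patch
    exactly (modulo whitespace) -- strong evidence the answer key was
    in the training data rather than derived from the repository."""
    return normalize(model_patch) == normalize(gold_patch)
```

Exact-match detection is the conservative end of contamination checking; subtler recall (variable names, commit-specific details) requires fuzzier comparisons and is correspondingly harder to prove.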

**Broken tests.** OpenAI found that nearly 60% of the problems its models failed contained fundamentally flawed tests. Some were too narrow: they checked for a specific implementation detail never mentioned in the problem description, so a correct solution could still fail. Others were too broad: they required extra features that were never specified. In many cases the benchmark was penalizing models not for wrong answers but for wrong tests.
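One hedged way to triage such tests, assuming you have the repository's gold patch plus at least one independently written correct solution (the exact methodology behind OpenAI's audit isn't specified at this level of detail):

```python
def classify_test(gold_passes: bool, alt_correct_passes: bool) -> str:
    """Heuristic triage of a benchmark test, given whether the gold
    patch and an independently written correct solution pass it."""
    if not gold_passes:
        return "broken"      # even the reference fix fails the test
    if not alt_correct_passes:
        return "too-narrow"  # the test checks one implementation,
                             # not the behavior the issue describes
    return "ok"

print(classify_test(gold_passes=True, alt_correct_passes=False))
# → too-narrow
```

The "too-narrow" case is the insidious one: the leaderboard records a failure even though the model solved the problem as stated.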

**Saturation.** Even setting aside contamination and test quality, Verified's 500 Python-only tasks weren't representative of real production codebases — which are polyglot, messy, and poorly documented. The benchmark had become easier than the work it was supposed to measure.

---

## SWE-bench Pro: The Replacement

The alternative that OpenAI now recommends, and that the broader research community is migrating to, is **SWE-bench Pro** — a benchmark built from scratch to address every failure mode of Verified.

Key differences:

**Scale and diversity.** Pro has **1,865 tasks** across multiple programming languages — Python, JavaScript, TypeScript, Go, Rust, Java. No more Python monoculture. Real engineering organizations don't work in one language; the benchmark now reflects that.

**Contamination resistance by design.** SWE-bench Pro draws exclusively from repositories under **strong copyleft licenses** (GPL and similar). The legal framework around these licenses creates a practical barrier to their inclusion in proprietary training corpora. OpenAI's audit found contamination cases in Pro to be "significantly rarer and less egregious" than in Verified, and found no model capable of producing a verbatim gold patch.

**Private tasks.** Beyond the public set, SWE-bench Pro includes tasks sourced from **private proprietary codebases** — code that definitively was not in any training set. These tasks function as a gold standard contamination check.

**Standardized scaffolding.** Scale AI maintains a SEAL leaderboard with standardized evaluation scaffolding, so that score differences reflect model capability, not differences in how each lab sets up its evaluation pipeline.
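The copyleft-filtering idea above can be sketched as a simple SPDX-identifier allowlist. The identifiers here are illustrative; the exact license set SWE-bench Pro accepts isn't published in this post:

```python
# Illustrative set of strong-copyleft SPDX identifiers -- an
# assumption, not the benchmark's actual inclusion criteria.
STRONG_COPYLEFT = {
    "GPL-2.0-only", "GPL-2.0-or-later",
    "GPL-3.0-only", "GPL-3.0-or-later",
    "AGPL-3.0-only", "AGPL-3.0-or-later",
}

def eligible_repos(repos: list[dict]) -> list[dict]:
    """Keep only repositories whose license makes inclusion in a
    proprietary training corpus legally risky, which is the
    contamination-resistance mechanism Pro relies on."""
    return [r for r in repos if r.get("license") in STRONG_COPYLEFT]
```

Note the hedge built into the design itself: copyleft licensing is a *practical* barrier to training-set inclusion, not a cryptographic guarantee, which is why Pro pairs it with the private-task check.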

---

## What the Real Numbers Look Like

The performance gap tells the story more clearly than any analysis:

| Model | SWE-bench Verified | SWE-bench Pro (SEAL) |
|---|---|---|
| Claude Opus 4.5 | 80.9% | 45.9% |
| GPT-5.3-Codex | ~81% | 56.8% |
| Best with search subagent (Opus 4.6 + WarpGrep v2) | — | 57.5% |

The same Claude Opus 4.5 that scores 80.9% on Verified scores 45.9% on Pro. The model didn't change. The benchmark did. That 35-point gap is the contamination premium — the extra credit AI systems have been giving themselves for tasks they'd already seen.

Even the Pro leaders aren't close to 80%. The best result on the public leaderboard, GPT-5.3-Codex at 56.8%, represents genuinely impressive performance on uncontaminated, multi-language, real-world engineering tasks. But 57% is a very different claim from 81%.

---

## What This Means for How You Evaluate AI Coding Tools

If you've been using SWE-bench Verified scores to make purchasing or adoption decisions about AI coding tools, you've been looking at an inflated metric. That's not a knock on any particular vendor — everyone was playing the same game with the same benchmark — but it does mean the capability picture was rosier than reality.

A few practical implications:

**Don't cite Verified scores going forward.** OpenAI has retired them. The research community is following. Any vendor still prominently featuring SWE-bench Verified scores on their homepage is either unaware of the contamination issue or hoping you are.

**Pro scores are the new floor, not the ceiling.** A model scoring 45-57% on SWE-bench Pro is genuinely capable — these are hard, real tasks on uncontaminated code. But the industry needs to recalibrate its language around what "capable" means. "Solves 80% of real engineering tasks" was always too strong a claim.

**Benchmark diversity matters.** SWE-bench is not the only coding eval that matters. Terminal-Bench 2.0, LiveCodeBench v6, and proprietary evals from labs like Scale AI each measure different dimensions. No single number tells you everything. Any vendor claiming supremacy based on one leaderboard deserves skepticism.

**The gap between benchmark performance and production utility remains large.** Stackademic's April 2026 survey found that 84% of developers now use AI coding tools daily — but only 29% trust the output in production. That trust gap isn't primarily about benchmark scores. It's about reliability on *your* codebase, *your* test suite, *your* edge cases. Benchmarks can't substitute for that.

---

## Why It Matters That OpenAI Flagged This

It's worth pausing on the institutional dimension here. OpenAI published a detailed post-mortem on *its own benchmark*, acknowledged contamination in *its own models*, and stopped reporting the metric that made those models look best. That's epistemically unusual in an industry not known for rigorous self-criticism.

The OpenAI research team's framing: *"The standard for frontier coding evals is changing with model maturity."* That's a diplomatic way of saying: the benchmark succeeded at what it was designed to do, then the models ate the benchmark, and now we need harder tests.

This is a healthy cycle — benchmarks get saturated, better benchmarks replace them — but it requires the industry to resist the temptation to camp on flattering numbers longer than is honest. The fact that OpenAI made the Pro migration publicly and transparently is the right move, and it sets a reasonable standard for others.

---

## The Practical Upshot

The best AI coding models in 2026 can autonomously resolve roughly **50-57% of hard, uncontaminated, multi-language real-world software engineering tasks** — without human guidance, using only the repository and the issue description.

That's genuinely extraordinary. Three years ago it was 0%. The progress is real.

But it's not 80%. It never was. The gap between 46% and 81% is the space where contaminated training data was doing work that the models couldn't do themselves. Now that we've subtracted that credit, we know where the models actually stand.

The right response isn't cynicism — it's calibration. Use SWE-bench Pro scores to compare models. Use your own production tasks to verify the comparison. And stop letting any vendor tell you the problem of AI coding capability is nearly solved.

It's not. It's just that progress has been real and the benchmarks have been generous.

---

**Sources:**
- [Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities — OpenAI](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
- [Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro — CodeSOTA](https://www.codesota.com/news/swe-bench-contamination-debate)
- [SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81% — MorphLLM](https://www.morphllm.com/swe-bench-pro)
- [SWE-Bench Pro Leaderboard — Scale AI SEAL](https://labs.scale.com/leaderboard/swe_bench_pro_public)
- [⚡️ The End of SWE-Bench Verified — Latent Space](https://www.latent.space/p/swe-bench-dead)
- [OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed — Blockchain.news](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests)
- [84% of Developers Use AI Coding Tools in April 2026 — Only 29% Trust What They Ship — Stackademic](https://blog.stackademic.com/84-of-developers-use-ai-coding-tools-in-april-2026-only-29-trust-what-they-ship-d0cb7ec9320a)
- [OpenAI Developers on X — announcement thread](https://x.com/OpenAIDevs/status/2026002219909427270)

