AI Code Is Reviewed Faster Once Picked Up — But Nobody Picks It Up

LinearB published its 2026 Software Engineering Benchmarks report this month. It is the largest dataset of AI coding productivity ever assembled: 8.1 million pull requests, 4,800 organizations, 42 countries, covering all major AI coding tools including GitHub Copilot, Devin, and Claude Code. The numbers confirm what engineering teams are feeling but struggling to articulate. AI code is not stuck because the models are bad. It is stuck because developers do not trust it enough to open it.

The Core Finding

AI-generated PRs have a 32.7% acceptance rate. Human PRs have an 84.4% acceptance rate. The gap is large, but the acceptance rate is not the most important number in the dataset.

The timing data matters more:

AI PRs wait 4.6x longer before a reviewer picks them up
Once a reviewer is assigned, AI PRs are reviewed 2x faster than manual code
Agentic AI submissions — fully autonomous multi-step outputs from tools like Devin — wait 5.3x longer than unassisted code before anyone touches them

Read those three numbers together. When a developer finally sits down with an AI-generated PR, they move quickly. They are not confused by the code. They are not finding it harder to understand. The review session itself is faster. But getting someone to commit to starting that session takes almost five times as long.
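A rough sketch of the arithmetic shows why the pickup delay swamps the faster review. Only the 4.6x and 2x multipliers come from the report; the baseline of two hours waiting and one hour of review for a human PR is an assumption chosen purely for illustration.

```python
# Multipliers quoted from the LinearB report.
PICKUP_MULTIPLIER = 4.6  # AI PRs wait 4.6x longer before pickup
REVIEW_SPEEDUP = 2.0     # AI PRs are reviewed 2x faster once picked up


def cycle_hours(pickup_hours: float, review_hours: float) -> float:
    """Total time from PR opened to review finished."""
    return pickup_hours + review_hours


def ai_cycle_hours(human_pickup: float, human_review: float) -> float:
    """Apply the report's multipliers to a human baseline."""
    return cycle_hours(
        human_pickup * PICKUP_MULTIPLIER,
        human_review / REVIEW_SPEEDUP,
    )


# Hypothetical baseline, not taken from the report.
human_pickup, human_review = 2.0, 1.0

human_total = cycle_hours(human_pickup, human_review)
ai_total = ai_cycle_hours(human_pickup, human_review)

print(f"human: {human_total:.1f} h, AI: {ai_total:.1f} h")
```

Under these assumed numbers the faster review saves half an hour while the slower pickup costs more than seven, so the AI PR still takes roughly three times as long end to end.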

This is not a model quality problem. It is a social trust problem.

Why Reviewers Avoid the Queue

Software engineers are measured on throughput. Reviewing a PR is counted as a contribution to someone else’s throughput. Reviewing a PR that might be entirely wrong — that might require a full rewrite and a follow-up conversation with an AI tool that cannot participate in Slack — is a different kind of work than reviewing code a teammate wrote and will be accountable for.

The 4.6x pickup delay captures something real: no one wants to be the person who approved the AI slop. The 2x faster review time confirms that once someone does open the PR, the mechanical work of reading it is easier — AI-generated code tends to be consistent in style, well-named, and structurally obvious. But the psychological barrier to picking it up in the first place reflects a rational, if uncomfortable, calculus.

The 5.3x delay for agentic AI submissions makes the dynamic explicit. Devin opens a PR. No human wrote a single line. The PR is the output of a multi-step autonomous process that your team did not observe, did not guide, and cannot easily trace back to a specification. How do you prioritize reviewing that versus the PR your senior engineer wrote last Thursday that you already know is probably right?

Teams have not yet developed the practices to answer that question efficiently. The bottleneck is not the AI. It is the organizational infrastructure surrounding the AI.

The Copilot Slide Is Already Happening

The LinearB report surfaces a detail that deserves attention: Devin’s PR acceptance rate has been rising since April 2026, while GitHub Copilot’s has been slipping since May.

The timing is not coincidental. May 2026 is when the GitHub Copilot billing preview went live, showing developers what their token consumption actually costs under the new AI Credits model. Developers who saw $902 estimates where they expected $39 bills did not just start looking at alternatives. They also started applying additional scrutiny to every Copilot-generated PR in their queue. If a PR cost forty dollars of token compute to produce, the bar for approval rises. The cost makes the quality question concrete.

Copilot’s acceptance rate slippage correlates with billing awareness, not model capability changes. Copilot did not get worse in May. The price tag became visible.

Devin’s rising acceptance rate tells the complementary story. Cognition has been improving the quality of autonomous agent outputs in measurable ways — the SWE-1.6 release (discussed separately) is part of this trajectory — and the organizations that have committed to using Devin at scale have also developed the internal review processes that correspond to higher acceptance rates. Practice improves outcomes.

What Engineering Leaders Should Do With This

The LinearB data points to a specific intervention. The bottleneck is not model quality; it is reviewer readiness. Teams that close the 4.6x pickup gap will extract significantly more value from the same AI coding investment.

Three things that move the number:

Standardize AI PR metadata. An AI PR that clearly surfaces its specification source, the tests it ran, the files it modified, and the scope of what it was asked to do is easier to route to the right reviewer and easier to pick up quickly. This is not a technical problem — it is a template problem. Write a CLAUDE.md rule that requires every agentic output to include a structured summary block at the top of the PR description. It takes one hour to implement and immediately reduces the cognitive cost of picking up the review.
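One way to make the template enforceable is a small check that rejects PR descriptions missing the summary block. The sketch below is a minimal illustration, not a prescribed format: the field names, the `Field: value` layout, and the example file paths are all assumptions, since the only requirement stated here is that the block surface the specification source, tests, files, and scope.

```python
import re

# Hypothetical field names; adapt them to whatever the team's template uses.
REQUIRED_FIELDS = ("Spec", "Tests run", "Files modified", "Scope")


def missing_summary_fields(pr_description: str) -> list[str]:
    """Return required fields that lack a non-empty 'Field: value' line."""
    missing = []
    for field in REQUIRED_FIELDS:
        # [ \t]* rather than \s* so an empty value cannot match the next line.
        pattern = rf"^{re.escape(field)}:[ \t]*\S"
        flags = re.MULTILINE | re.IGNORECASE
        if not re.search(pattern, pr_description, flags=flags):
            missing.append(field)
    return missing


# Hypothetical PR description with a complete summary block.
example = """\
Spec: docs/specs/rate-limiter.md
Tests run: 42 passed, 0 failed
Files modified: limiter.py, test_limiter.py
Scope: add a token-bucket limiter to the API client
"""

print(missing_summary_fields(example))
print(missing_summary_fields("Spec: \nScope: refactor"))
```

The first call prints an empty list. The second reports `Spec`, `Tests run`, and `Files modified` as missing, because an empty value does not count as filled in. A check like this could run in CI so that an incomplete agentic PR never reaches the review queue.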

Route AI PRs to dedicated reviewers during onboarding. The pickup delay for agentic outputs (5.3x across the dataset) is highest in teams that do not have a designated “AI PR reviewer” role. The math is straightforward: if one engineer on a five-person team commits to reviewing all agentic PR outputs as their first daily task, the queue delay drops to near zero and that reviewer builds pattern recognition that makes each subsequent review faster. This is how high-throughput teams operate today at Mercado Libre (targeting 90% autonomous coding) and similar early adopters.
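The routing rule can be sketched in a few lines. This is an illustration under assumptions, not a description of any team's actual tooling: the `PullRequest` type, the `agentic` flag, and the reviewer names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class PullRequest:
    number: int
    agentic: bool  # True when no human authored the change


def assign_reviewers(prs, team, ai_reviewer):
    """Send agentic PRs to the designated reviewer and rotate the rest."""
    # Fall back to the designated reviewer if the team has nobody else.
    others = cycle([m for m in team if m != ai_reviewer] or [ai_reviewer])
    return {
        pr.number: ai_reviewer if pr.agentic else next(others)
        for pr in prs
    }


# Hypothetical five-person team and review queue.
team = ["ana", "ben", "chen", "dee", "eli"]
queue = [
    PullRequest(101, agentic=True),
    PullRequest(102, agentic=False),
    PullRequest(103, agentic=True),
    PullRequest(104, agentic=False),
]

assignments = assign_reviewers(queue, team, ai_reviewer="ana")
print(assignments)
```

Here both agentic PRs (101 and 103) land with `ana`, while the human-authored ones rotate through the rest of the team. Keeping the designated reviewer out of the normal rotation is a design choice in this sketch, made so that the agentic queue stays their first priority.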

Treat acceptance rate as a lagging indicator. The 32.7% AI acceptance rate will not improve materially until the pickup delay decreases and teams develop review cadence. Teams that focus on acceptance rate directly — by raising the bar on what gets submitted — are optimizing the wrong variable. Submit more frequently, review faster, and let the model learn from the rejection signal. The LinearB data shows that teams with higher review velocity also have higher acceptance rates, not the other way around.

The Structural Point

There is a version of the AI coding story that goes: models keep improving, benchmarks keep rising, everything else follows automatically. The LinearB data refutes this cleanly. SWE-bench scores do not close the pickup delay gap. Terminal-Bench improvements do not change the social calculus of who opens the AI PR first.

The organizational layer — review processes, team norms, role definitions, acceptance criteria — is not downstream of model quality. It is a separate engineering problem that requires separate work. Teams that treat it as a separate problem will compound their AI investment. Teams that wait for the models to solve it will be waiting in 2027 with the same 4.6x pickup delay and a more powerful model sitting in the same queue.

Autonomous agentic workflows that route output directly to CI pipelines, that produce structured test results alongside the code, that surface the specification and the diff in a format reviewers can scan in thirty seconds — these are the practices that will close the gap. The tools to build them exist today. The models are ready. The bottleneck is the process.

Sources:

LinearB 2026 Software Engineering Benchmarks Report — LinearB (8.1M PRs, 4,800 organizations, 42 countries)
LinearB named Gartner Magic Quadrant Leader for Developer Productivity Insight Platforms 2026 — LinearB
GitHub Copilot token billing developer backlash — TechCrunch
Cognition: Introducing SWE-1.6 — Cognition
Code with Claude SF 2026: Mercado Libre 90% autonomous coding target — sdd.sh
