Agentic Coding Agent Comparison 2026: Benchmarks, Pricing, and Which One to Use

Last updated 2026-07-15. OpenAI’s GPT-5.6 Sol cleared federal evaluation under the June 2 executive order and went GA July 14 with no added restrictions, replacing GPT-5.5 “Spud” as Codex’s default model (64.1% SWE-bench Pro, up from 58.6%; 85.6% Terminal-Bench 2.0, up from 82.7%). Claude Fable 5’s included-usage window on Pro/Max/Team plans closed on schedule July 7, and it has run on usage-credits only since July 8. GitHub Copilot’s usage-based billing is six weeks in and its Copilot App with Agent Merge is four weeks in, with enterprise cost complaints still climbing. SpaceX’s $60B all-stock Cursor acquisition remains on track for a Q3 close.

The agentic coding agent market now moves fast enough that this comparison needs a refresh at least monthly. Model releases happen in weeks. Pricing models flip overnight. A tool that was a curiosity in Q1 is table-stakes infrastructure in Q2.

This article covers eight tools across the full spectrum — terminal-native agents, IDE-centric tools, async agents, and open-source harnesses — with benchmark data, current pricing, and a frank assessment of where each fits. It is not a promotional piece for any tool. It is, however, written from a point of view: autonomous beats assisted, terminal beats IDE, and the gap between “AI coding tool” and “AI coding agent” is the most important distinction in this market.

Benchmark Snapshot

Different tools measure themselves on different benchmarks. The table below uses the most recent available data for each. SWE-bench Pro is the hardest and most representative: held-out real GitHub issues on repositories the models were not trained on. SWE-bench Verified is the older benchmark (now retired by OpenAI as “no longer measuring frontier capabilities”). Terminal-Bench 2.0 tests complex CLI workflows end-to-end.

Agent	Model	SWE-bench Pro	SWE-bench Verified	Terminal-Bench 2.0	Flat-Rate Pricing
Claude Code	Fable 5 / Opus 4.8 / Sonnet 5	69.2%†	~88%‡	~71%‡	$20 – $200/mo
Antigravity 2.0	Gemini 3.5 Flash	—	76.2%	—	$100 – $200/mo
Grok Build	grok-code-fast-1	—	70.8%	—	$180/mo
OpenAI Codex	GPT-5.6 Sol	64.1%	retired	85.6%	subscription
GitHub Copilot	GPT-4.1 / Opus 4.7	—	—	—	usage-based†
Cursor	configurable	—	—	—	$20 – $40/mo
Windsurf	SWE-1.5 + GPT-5.4	—	—	—	$15 – $35/mo
Jules	Gemini 3.1 Pro	—	—	—	$0 – $124.99/mo

† (SWE-bench Pro column) Fable 5 benchmarks not yet published by Anthropic; 69.2% is the Opus 4.8 score. Sonnet 5 (June 30, now the Claude Code default) scores 63.2% on Anthropic’s own agentic-coding eval, up from 58.1% for Sonnet 4.6. That is a different composite metric than SWE-bench Pro, so it isn’t directly comparable to the 69.2% figure, but it’s the number to watch: a mid-tier model closing five-sixths of the gap to the frontier at roughly a third of Opus’s price.
‡ Estimated from prior-model data; not independently published.
† (Flat-Rate Pricing column) GitHub Copilot switched to usage-based AI Credits on June 1, 2026. The flat-rate era is over.

Reading the table: Claude Opus 4.8’s 69.2% on SWE-bench Pro is the highest published score on the hardest benchmark. GPT-5.6 Sol’s 85.6% on Terminal-Bench 2.0 is a genuine lead, but Terminal-Bench tests CLI workflow mechanics, not real GitHub issue resolution — and Sol’s 64.1% SWE-bench Pro still trails Opus 4.8 by five points. Antigravity’s 76.2% SWE-bench Verified is a legitimate result; Grok’s 70.8% lags Anthropic’s prior-generation model (Opus 4.7 was 87.6% Verified). Worth flagging even though neither is a standalone agent product: Z.AI’s open-weight GLM-5.2 posted 62.1% SWE-bench Pro (beating GPT-5.5’s 58.6%, MIT-licensed, roughly one-sixth GPT-5.5’s cost) and DeepReinforce’s Ornith 1.0 hit 82.4% SWE-bench Verified — both runnable through OpenCode, Aider, or Cline. Open-weight models are now within striking distance of frontier proprietary agents on the harnesses that don’t lock you to one vendor. The tools without benchmark rows (Copilot, Cursor, Windsurf, Jules) are harnesses over third-party models, not independently benchmarked.
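The point gaps quoted above are simple subtraction, and it can help to keep them reproducible as scores change between refreshes. The following is a minimal sketch using only the figures quoted in this article; the dictionary layout and helper names are illustrative assumptions, not part of any benchmark tooling.

```python
# Scores (percent) as quoted in this article. Keys are plain labels,
# not official model identifiers.
SWE_BENCH_PRO = {
    "Opus 4.8": 69.2,
    "GPT-5.6 Sol": 64.1,
    "GLM-5.2": 62.1,
    "GPT-5.5": 58.6,
}
SWE_BENCH_VERIFIED = {
    "Opus 4.7": 87.6,
    "Ornith 1.0": 82.4,
    "Gemini 3.5 Flash": 76.2,
    "grok-code-fast-1": 70.8,
}


def gap(scores: dict, leader: str, other: str) -> float:
    """Percentage-point gap between two entries on the same benchmark."""
    return round(scores[leader] - scores[other], 1)


def fraction_of_gap_closed(old: float, new: float, target: float) -> float:
    """Share of the distance from `old` to `target` covered by `new`."""
    return (new - old) / (target - old)


if __name__ == "__main__":
    print(gap(SWE_BENCH_PRO, "Opus 4.8", "GPT-5.6 Sol"))            # 5.1
    print(gap(SWE_BENCH_VERIFIED, "Opus 4.7", "grok-code-fast-1"))  # 16.8
    print(gap(SWE_BENCH_PRO, "GLM-5.2", "GPT-5.5"))                 # 3.5
    # How much of the GPT-5.5 -> Opus 4.8 distance Sol covers (about 0.52).
    print(fraction_of_gap_closed(58.6, 64.1, 69.2))
```

Only compare scores within one benchmark: a Verified number and a Pro number are not on the same scale.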

Claude Code — The Agentic Terminal

Architecture: Terminal-native. No IDE dependency. Runs on your filesystem, in CI, on Anthropic’s cloud via Routines, or on AWS Bedrock with zero operator access.

Claude Code is built around one premise: the agent owns the problem. You describe a task; it plans, implements, tests, and iterates. The developer’s job shifts from writing code to specifying problems and reviewing results.

What just shipped:

Sonnet 5 (June 30): Quietly replaced Sonnet 4.8 as the default model across Claude Code and every Claude plan — no keynote, just a changelog entry. Native 1M-token context (not a beta flag), 63.2% on Anthropic’s agentic-coding eval (up from 58.1%), and introductory pricing of $2/$10 per million tokens through August 31 (stepping up to $3/$15 after). Early-access testers report it verifies its own work — running tests, checking diffs — without being asked, closing much of the argument for keeping a human in the loop on routine tasks.
Fable 5 restored globally (July 1): The Fable 5 tier went dark June 12 under a US export-control directive tied to a reported jailbreak, then came back July 1 after Anthropic shipped a cybersecurity classifier that blocks the pattern in 99%+ of attempts. The included-usage window on Pro/Max/Team plans (up to 50% of weekly usage) closed on schedule July 7; Fable 5 has run on usage-credits only since July 8, a real cost step-up for teams that leaned on it during the free window. Mythos 5 remains restricted to ~100 US critical-infrastructure organizations.
Opus 4.8 (May 28): 69.2% SWE-bench Pro (up from 64.3% in 41 days), 4x less likely to leave code flaws unreported. Fast mode runs at 2x price for 2.5x speed. Pricing $5/$25 per million tokens input/output. Anthropic frames Opus 4.8 and Sonnet 5 as complementary tiers, not a replacement — Opus for the hardest multi-day refactors, Sonnet 5 for the high-volume daily loop.
Dynamic Workflows (research preview): hundreds of parallel subagents coordinated by an orchestrating agent. Designed specifically for codebase-scale migrations — the tasks that overwhelm every other tool because context windows overflow mid-refactor.
Code Review GA: multi-agent reviewers post inline PR comments. $15–25 per PR, billed separately from subscriptions.
Managed Agents public beta: Dreaming (overnight memory curation), Outcomes (rubric-based task grading with webhook), Multiagent Orchestration (coordinator + up to 20 specialist agents, shared filesystem).
Routines: cloud-native scheduled automation — cron, API webhooks, GitHub event triggers — running on Anthropic’s infrastructure without your machine being on.
Bedrock Mantle (v2.1.94): zero operator access. Neither Anthropic nor AWS can see prompts or completions. The enterprise air-gap story now has cryptographic attestation.
v2.1.186–197 (June 22–30): credential sandboxing (sandbox.credentials) blocks agent-run commands from reading secrets like ~/.aws/credentials or ANTHROPIC_API_KEY; claude mcp login --no-browser brings OAuth-gated MCP servers into headless CI; /rewind --before-clear can now recover a session you thought you’d discarded with /clear.
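To see what the per-token prices in this list mean for a real workload, a rough cost estimate is easy to script. This sketch uses the per-million-token rates quoted above; the rate labels are made up for illustration (they are not API model IDs), and treating fast mode as a flat 2x multiplier on the whole bill is an assumption based on the "2x price" wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per million tokens, as quoted in this article."""
    input_per_m: float
    output_per_m: float


# Hypothetical labels; prices come from the release notes above.
RATES = {
    "sonnet-5-intro": Rate(2.0, 10.0),    # through August 31
    "sonnet-5-standard": Rate(3.0, 15.0),  # after August 31
    "opus-4.8": Rate(5.0, 25.0),
}


def cost_usd(rate: Rate, input_tokens: int, output_tokens: int,
             multiplier: float = 1.0) -> float:
    """Estimated cost for one workload; `multiplier` models surcharges
    such as fast mode (assumed here to be a flat 2x)."""
    base = (input_tokens * rate.input_per_m
            + output_tokens * rate.output_per_m) / 1_000_000
    return multiplier * base


if __name__ == "__main__":
    # Example workload: 2M input tokens, 0.5M output tokens.
    for name, rate in RATES.items():
        print(f"{name}: ${cost_usd(rate, 2_000_000, 500_000):.2f}")
    print(f"opus-4.8 fast: ${cost_usd(RATES['opus-4.8'], 2_000_000, 500_000, 2.0):.2f}")
```

For that example workload the estimate comes to $9.00 at the Sonnet 5 introductory rate, $13.50 at the standard rate, $22.50 on Opus 4.8, and $45.00 on Opus 4.8 with the assumed fast-mode multiplier.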

What’s coming: Fable 5’s restoration is a two-tier story worth watching closely: the classifier now silently reroutes requests that resemble vulnerability analysis to Opus 4.8 mid-session, which means routine security-research and code-review workflows may see more false positives than before the ban — test your production prompts against the new behavior rather than assuming parity. Already shipping in GitHub Copilot Pro+/Max/Business/Enterprise (Anthropic’s mandatory 30-day data retention is a ZDR complication for enterprise teams — Azure Foundry is the ZDR-compliant path). Claude Code CLI is at v2.1.197 as of the Sonnet 5 launch.

Adoption: JetBrains April 2026 survey: 18% adoption at work (6× increase year-over-year), 91% CSAT, NPS 54. Highest satisfaction in the market.

Pricing: Pro $20/month · Max 5x $100/month · Max 20x $200/month (flat rate; roughly 18x cheaper than equivalent API usage at heavy scale)

Verdict: Best-in-class for complex, multi-step autonomous work. The only tool designed from the ground up for agents running without human supervision. The infrastructure advantage compounds with every Routines workflow and Managed Agent feature that doesn’t require you to keep a terminal open.

GitHub Copilot — The Everywhere Tool, Now With a Bill

Architecture: IDE plugin (VS Code, JetBrains, Neovim) plus, as of June 17, a standalone Copilot App for macOS, Windows, and Linux that runs independently of any editor. Coding Agent and Autopilot Mode (April 2026) add nested subagents and an MCP sandbox.

Copilot’s competitive advantage has always been ubiquity: wherever you work, it works. That advantage is unchanged. What changed this quarter is the business model — and, with the Copilot App, the first real attempt to get Copilot out from behind an IDE window.

Copilot App GA (June 17): A genuine architecture shift, not just a new billing model. The app centers on a canvas — a bidirectional surface where agents present plans, diffs, terminal output, and browser sessions for you to edit, reorder, or redirect. Parallel sessions ship with automatic git-worktree isolation, so three agents can work three issues simultaneously without collision. Agent Merge is the standout feature: once an agent opens a PR, it monitors CI, tracks required reviewers, fixes failing checks, waits for the merge queue, and executes the merge — unattended. Cloud Automations schedule and trigger agent runs on GitHub’s infrastructure, which means, for the first time, a Copilot workflow that runs with your laptop closed — the closest thing yet to Claude Code Routines or Cursor’s Automations. BYOM lets enterprise admins route sessions to Anthropic, Azure OpenAI, or Google endpoints while staying in GitHub’s billing and permission model.
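The worktree isolation mentioned above rests on a standard git feature, `git worktree add`, which gives each branch its own checkout directory. As a rough sketch of the idea (not GitHub’s implementation; the branch naming, directory layout, and function name are assumptions made for illustration), one worktree per issue could be prepared like this:

```python
from pathlib import Path


def worktree_commands(repo_root: str, issue_ids: list, base: str = "main") -> list:
    """Build one `git worktree add` command per issue.

    Each agent gets its own branch and its own directory next to the
    main checkout, so parallel edits never touch the same files on disk.
    The commands are returned as argv lists rather than executed.
    """
    root = Path(repo_root)
    commands = []
    for issue in issue_ids:
        branch = f"agent/issue-{issue}"                       # hypothetical naming scheme
        path = root.parent / f"{root.name}-issue-{issue}"     # sibling directory
        commands.append([
            "git", "-C", str(root),
            "worktree", "add", "-b", branch, str(path), base,
        ])
    return commands


if __name__ == "__main__":
    for cmd in worktree_commands("/tmp/repo", [101, 102, 103]):
        print(" ".join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would create the worktrees; `git worktree remove <path>` cleans one up after its PR merges.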

Even so, GitHub is explicit that this isn’t autonomous: every canvas surface has a human approval gate, and interactive sessions still require the app open on your machine — Cloud Automations are the one path that doesn’t. That’s the same ceiling Copilot has always had, just with a better dashboard built up against it.

What changed June 1 (now six weeks in):

Flat-rate subscriptions are gone. Every plan (Free, Pro, Pro+, Business, Enterprise) now runs on AI Credits ($0.01/credit, token-based). Code completions remain free. Chat, agents, and code review consume credits. The Billing Preview tool released in April showed one developer’s $39/month Pro+ plan producing a $902 projected bill. The official announcement thread collected 893 downvotes. GitHub suspended new individual signups before the switch. Six weeks in, enterprise teams running agentic workloads are reporting bill shock: the cost multiplier on real-world agent sessions is tracking toward the high end of pre-launch estimates. Some teams report Opus 4.8 sessions consuming their monthly AI Credits allocation in two days.

The double-billing problem is specific to code review: Copilot’s review feature simultaneously burns AI Credits and GitHub Actions minutes. Teams using auto-triggered reviews on every PR face compounding costs that were not visible in flat-rate pricing. Given the experiences some teams have had with usage-based billing, enable budget alerts before turning Agent Merge loose on a large backlog — GitHub still hasn’t published a per-task cost breakdown for it.

The autonomy ceiling: Autopilot Mode, Coding Agent, and now Agent Merge are real improvements — assigning a GitHub issue to Copilot and getting a mergeable, CI-clean pull request is a realistic workflow today. Cloud Automations narrow the gap further: that one path genuinely runs while your machine is off. But every interactive session — the actual coding work — still requires the app or the IDE open, and GitHub is explicit that the design intent is human-supervised, not autonomous. Close the editor or the app during an interactive session; lose the agent. That’s a materially better position than three months ago, but it’s a narrower ceiling than Claude Code’s terminal-and-cloud model, which never assumed a live process in the first place.

Pricing (as of June 1, 2026):

Pro $10/month (code completions free; chat/agents billed per credit)
Pro+ $39/month (higher included credits, model access)
Business $19/user/month
Claude Opus 4.7 carries a 27× credit multiplier — use it sparingly or the bill compounds fast
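The credit math is worth doing explicitly before enabling agents. The sketch below uses only figures quoted in this section ($0.01 per credit, the $39 plan, the $902 projected bill, the 27× multiplier). How GitHub applies the multiplier internally is not documented here, so multiplying a base credit count is an assumption, and the 100-credit base session is a made-up example value.

```python
CREDIT_PRICE_USD = 0.01  # per AI Credit, as quoted above


def credits_to_usd(credits: float) -> float:
    """Dollar cost of a number of AI Credits."""
    return credits * CREDIT_PRICE_USD


def usd_to_credits(usd: float) -> int:
    """Number of AI Credits a dollar amount corresponds to."""
    return round(usd / CREDIT_PRICE_USD)


def overage_ratio(projected_bill_usd: float, plan_price_usd: float) -> float:
    """How many times the plan price the projected bill comes to."""
    return projected_bill_usd / plan_price_usd


def multiplied_credits(base_credits: float, multiplier: float) -> float:
    """Credits consumed when a model multiplier applies (assumed to be
    a straight multiplication of the base credit count)."""
    return base_credits * multiplier


if __name__ == "__main__":
    print(usd_to_credits(902))                  # credits behind the $902 projection
    print(round(overage_ratio(902, 39), 1))     # bill as a multiple of the plan price
    # Hypothetical session that would cost 100 credits at a 1x model:
    print(credits_to_usd(multiplied_credits(100, 27)))
```

The $902 projection works out to 90,200 credits, or about 23 times the $39 plan price. Under the stated assumption, a session that would cost $1 at a 1× model costs about $27 at the 27× multiplier.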

Verdict: Still the best choice for inline completions and teams that can’t change their editor, and the Copilot App with Agent Merge is the most credible autonomy story GitHub has shipped. It’s still a supervised tool by design — every decision surface keeps a human in the loop — and the usage-based pricing remains a serious risk for any team running agentic workflows at volume. Do the credit math before you run a single agent session this month.

Cursor — The AI-Native IDE

Architecture: VS Code fork. Agent runs inside a live Cursor application. Cursor 3 → 3.5 added the Agents Window (cloud, SSH, git-worktree agents), Automations (event-driven triggers), and multi-repo support.

Cursor is the best argument for an AI-native IDE in 2026. Its project-graph context, multi-file reasoning, and model flexibility are genuinely superior to an IDE plugin. Cursor 3.5’s Automations — event-driven agents triggered by PR creation, branch push, or schedule — narrow the gap with cloud-native automation tools like Claude Code Routines.

The ceiling: Cursor 3.5 describes itself as “agent-first.” That’s accurate for the interface. Every agent still runs through a live Cursor process. Close Cursor; end the agents. The Cursor SDK (now live) allows programmatic invocation, but it’s still an IDE dependency in disguise.

Security note: CVE-2026-26268 (CVSS 9.9, Novee Security, April 28) — a prompt-injection-to-RCE via malicious .git/hooks/ — was patched in Cursor 2.5. Update immediately if you haven’t. This vulnerability is a structural illustration of the IDE-embedded agent problem: a sandboxed AI running inside a privileged desktop process inherits that process’s full system access.

Market: SpaceX signed a definitive agreement on June 16 to acquire Cursor for $60B in an all-stock deal, four days after Cursor’s IPO, which set a Nasdaq record at $75B. The acquisition integrates Cursor with xAI (Grok). Expected close Q3 2026. SpaceX simultaneously holds a separate $10B Colossus compute deal with Anthropic, so the same company is now positioned across both the IDE-first and terminal-native paradigms via infrastructure bets.

Pricing: Hobby $20/month · Pro $40/month · Business $40/user/month

Verdict: The best IDE-centric tool, and Cursor 3.5’s Automations make a credible case for daily agentic workflows. Still architecturally bound to a running IDE. For teams where the IDE is a non-negotiable, Cursor is the right choice. For teams optimizing for genuine autonomy, the ceiling hasn’t moved.

Devin Desktop (formerly Windsurf) — The Parallel Agent IDE

Architecture: IDE with parallel Cascade agent sessions (Wave 13: five simultaneous agents via Git worktrees). Acquired by Cognition AI (December 2025, ~$250M) and officially rebranded to Devin Desktop in June 2026.

Wave 13 is the headline: five Cascade agents running in parallel on separate branches, monitored in side-by-side panes. Windsurf introduced parallel agent breadth before any other IDE, and that architectural decision is still its clearest differentiator. The rename to Devin Desktop signals the direction Cognition has planned since acquisition: Devin’s autonomous session capabilities will be primary, with the IDE as the monitoring layer.

The identity evolution: The Devin Desktop rebrand accelerates the product roadmap narrative — a Devin-powered IDE that hands tasks to a fully autonomous cloud session without leaving the interface. Until that integration ships (H2 2026), Devin Desktop is an excellent parallel-agent IDE with the same architectural ceiling every IDE has: every agent requires a live application.

Arena Mode: Two agents, hidden model identities, vote-driven output — still a differentiator for teams that want to evaluate models on real work rather than synthetic benchmarks.

SWE-1.6: Devin Desktop’s current underlying model — 10%+ SWE-bench Pro improvement over prior generation, 950 tok/s via Cerebras partnership, parallel tool calls.

Pricing: Free · Pro $15/month · Max $200/month · Teams $35/user/month

Verdict: Best parallel-agent IDE. The Devin Desktop rebrand clarifies the roadmap: Cognition is building toward a product where Devin’s autonomous capabilities are primary and the IDE is a monitoring layer. If the H2 2026 Devin integration ships with genuine autonomy transfer, this becomes significantly more interesting. Until then, it is a capable IDE-bound tool with better multi-agent UX than Cursor and a more credible autonomy roadmap than any other IDE player.

Grok Build — xAI’s Terminal-Native Agent

Architecture: CLI agent. Terminal-native, local-first, zero codebase data transmitted to xAI servers. Launched May 14, 2026.

xAI made the right architectural call: CLI, not IDE. Grok Build is the youngest terminal-native entrant and the one making the most aggressive capability bets with eight parallel sub-agents and (when it ships) Arena Mode’s automated output evaluation.

Grok 4.3 + Grok Skills (May 25): The most meaningful update since launch. Price cut from $300 → $180/month (40% reduction), 1M token context window, 16-Agent Heavy mode. The bigger story is Grok Skills — persistent, named expertise domains that accumulate context across sessions. Define your stack once (“gRPC microservices, CQRS domain events, pytest + factory_boy”) and every future session starts with that context loaded. Think team-level CLAUDE.md but cross-session and per-domain.

Benchmarks: grok-code-fast-1 posts 70.8% SWE-bench Verified — ahead of older Claude models, well behind Opus 4.7 (87.6%) and 4.8. SWE-bench Pro data not published.

Where it still falls short: No MCP ecosystem. No CLAUDE.md equivalent for project-level instructions. No cloud execution or scheduling (Routines-equivalent). Arena Mode — the headline differentiator — is not yet live.

Pricing: SuperGrok Heavy $180/month (down from $300)

Verdict: Right architecture, improving economics, Grok Skills is a genuinely novel primitive. The benchmark gap versus Opus 4.8 is 17 points on Verified (the easier benchmark); Pro data doesn’t exist for comparison. For teams with strict data sovereignty requirements and no enterprise Anthropic contract, Grok Build’s local-first model is a real differentiator. For everyone else, the capability and ecosystem gaps leave it as a watch-list item, not a daily driver.

Antigravity 2.0 — Google’s Terminal-Native Platform

Architecture: Go CLI + desktop app + Managed Agents API (isolated Linux environments). Gemini 3.5 Flash for execution, Gemini 3 Pro for planning. Launched at Google I/O, May 19, 2026.

Google’s most credible attempt at terminal-native agent infrastructure. Antigravity 2.0 ships a real Go CLI (not an afterthought), isolated Linux sandboxes for agent execution (the correct security architecture), and a public SDK for hosting custom agents on third-party infrastructure.

Benchmarks: 76.2% SWE-bench Verified. Above Grok Build, below Cursor Composer running Claude Opus 4.7 (87.6%), well below Opus 4.8.

Structural limits:

Gemini-only: no multi-model routing, no fallback when Gemini underperforms on a specific task type — and that constraint got more exposed this quarter. Google confirmed Gemini 3.5 Pro will miss its publicly committed June GA, days after Bloomberg reported four senior DeepMind researchers departed in a six-day span (June 18–24), three of them to Anthropic. Antigravity’s roadmap leans on Gemini’s reasoning tier closing the gap with Opus and Fable 5; a slipped flagship release and a visible talent exodus both work against that timeline.
Google Cloud gravitational pull: Managed Agents API runs on Google infrastructure with BigQuery, Vertex, and Workspace integrations baked in. Valuable if you’re GCP-native; friction if not.
MCP ecosystem gap: 6,400+ MCP servers exist for Claude Code. Antigravity ships its own connector model with thin community depth.

The Gemini CLI situation: Google’s free Gemini CLI — which had 100K+ stars and 6,000+ community pull requests — shut down June 18. Replaced by the proprietary Antigravity. The transition terms (30 days notice, no migration tooling, proprietary replacement with no feature parity at launch) damaged trust with the developer community that contributed the PRs. Developers who built on Gemini CLI free tier (1,000 req/day) found their quota dropped to 20 req/day — a 98% cut — with CI pipelines failing silently. Migration destinations: Antigravity CLI for Google Cloud-native teams; OpenCode, Aider, or Cline for open-source alternatives; Claude Code for the full agentic stack.

Pricing: AI Ultra $100/month · Premium AI Ultra $200/month

Verdict: The right architecture with real engineering behind it. The Gemini-only constraint and MCP ecosystem gap are real ceilings. The Gemini CLI bait-and-switch adds platform trust risk that matters for long-term infrastructure decisions. Best fit: teams already deep in Google Cloud who want a native agentic layer without introducing a new vendor. For multi-cloud teams, the integration friction outweighs the benchmark competitiveness.

Jules — Google’s Async GitHub Agent

Architecture: Fully async. You assign a GitHub issue; Jules runs in an isolated VM on Google’s infrastructure, iterates, and submits a pull request. No local setup, no IDE required.

Jules is the cleanest expression of the delegation model: not AI-assisted development, but AI-handled development. The CI failure loop is the defining feature — when Jules opens a PR and CI fails, Jules reads the error, fixes the code, commits, and resubmits without human intervention. For the 80% of CI failures that are deterministic and readable, the loop closes fully automatically.
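The CI failure loop described above is, at its core, a bounded retry loop: run CI, read the failure, attempt a fix, repeat. The sketch below shows that control flow only. It is not Jules’s implementation; `CIResult`, `ci_fix_loop`, and both callables are hypothetical names, and a real agent would push commits and poll a CI provider where this code calls plain functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CIResult:
    passed: bool
    log: str = ""


def ci_fix_loop(run_ci: Callable[[], CIResult],
                apply_fix: Callable[[str], bool],
                max_fixes: int = 3) -> bool:
    """Run CI, and on failure hand the log to `apply_fix`, up to `max_fixes` times.

    Returns True once CI passes. Returns False if `apply_fix` gives up
    (returns False) or CI still fails after the last allowed fix, which
    is the point where a human should take over.
    """
    for _ in range(max_fixes):
        result = run_ci()
        if result.passed:
            return True
        if not apply_fix(result.log):
            return False  # no fix could be produced; escalate
    # One final CI run to check the last fix.
    return run_ci().passed


if __name__ == "__main__":
    # Simulated CI: fails twice with readable errors, then passes.
    outcomes = iter([CIResult(False, "ImportError"), CIResult(False, "AssertionError"), CIResult(True)])
    seen_logs = []
    ok = ci_fix_loop(lambda: next(outcomes), lambda log: seen_logs.append(log) or True)
    print(ok, seen_logs)
```

The cap on fix attempts matters: non-deterministic or unreadable failures are exactly the cases where an unbounded loop burns compute without converging.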

Model upgrade (March 9): Gemini 3.1 Pro as the default for Pro users. 2×+ reasoning improvement over Gemini 3 Pro, 1M token context, 65K output tokens.

What it’s not: Jules is not an interactive coding environment. It’s an async contributor on your team. For complex, multi-hour tasks requiring strategic direction — the sessions where you want to watch agents work and intervene — Jules isn’t the tool. For delegated tasks you want to walk away from and return to a PR, it’s the most mature async agent available.

Pricing: Free (15 tasks/day) · Pro $19.99/month (~75 tasks/day) · Ultra $124.99/month (~300 tasks/day)

Verdict: Best-in-class for async GitHub-native delegation. The CI loop is a genuine capability milestone. Pair with a terminal-native agent (Claude Code, Antigravity) for complex interactive sessions; delegate routine issue resolution to Jules. The two models are complementary, not competitive.

OpenAI Codex — The Desktop GUI Agent

Architecture: Desktop GUI agent for macOS. Multi-agent desktop control (parallel background agents that see, click, and type across any application). 90+ MCP plugin integrations. GPT-5.6 Sol under the hood.

GPT-5.6 Sol GA (July 14): Sol cleared federal evaluation under the June 2 executive order with no added restrictions and replaced GPT-5.5 “Spud” as the default model in Codex and ChatGPT within a day. Benchmarks moved with it: 64.1% SWE-bench Pro (up from Spud’s 58.6%) and 85.6% Terminal-Bench 2.0 (up from 82.7%). Terra and Luna, the cheaper tiers of the same family, are GA alongside it for lower-stakes tasks. The model capability gain is real. The deployment architecture raises the same structural questions every GUI-first agent does.

The architecture problem: When a Codex agent navigates Jira’s web UI to file a ticket, it’s parsing pixels and clicking buttons. That breaks when the UI changes, when a modal appears unexpectedly, or when the network is slow. The Jira REST API doesn’t break when Jira ships a redesign. More importantly: desktop GUI control anchors the agent to your machine. You have one screen. Agents that control your desktop can’t run in parallel at scale, can’t be isolated to clean git worktrees, and can’t trigger from a GitHub webhook while your laptop is closed.
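For contrast, here is roughly what the API route looks like. This is a minimal sketch that builds (but does not send) a create-issue request against Jira’s REST API v2 endpoint `/rest/api/2/issue`. The base URL, project key, and token are placeholders, and the bearer-token header is an assumption: Jira Cloud typically uses basic auth with an email and API token instead, so check your instance’s authentication scheme.

```python
import json
from urllib.request import Request


def build_create_issue_request(base_url: str, project_key: str, summary: str,
                               token: str, issue_type: str = "Task") -> Request:
    """Build a POST request that creates a Jira issue.

    Nothing here depends on page layout, button positions, or modals;
    the payload is the same regardless of how the web UI looks.
    """
    payload = {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "issuetype": {"name": issue_type},
        }
    }
    return Request(
        url=f"{base_url.rstrip('/')}/rest/api/2/issue",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_issue_request(
        "https://jira.example.com", "OPS", "Nightly build is failing", "PLACEHOLDER_TOKEN"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending it is one call to `urllib.request.urlopen(req)`, which can run from CI, a webhook handler, or a headless server just as easily as from a laptop.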

What Codex gets right: MCP ecosystem adoption (90+ servers, same standard as Claude Code). Persistent memory across sessions — a genuine gap in Claude Code’s current architecture. For teams already in the OpenAI ecosystem who need cross-session context accumulation, these are real arguments.

Pricing: GPT-5.6 Sol API: $5 input / $30 output per million tokens, 1.5M context (Terra $2.50/$15, Luna $1/$6). Available in ChatGPT and Codex for paid subscribers.

What’s next: Sol is the first frontier release to clear the June 2 executive order’s federal evaluation without added restrictions — a contrast with Fable 5’s classifier-gated reinstatement and Mythos 5’s ongoing ~100-org cap. Too early to call the pattern for the rest of the industry, but it’s a data point against “government review = permanent ceiling.”

Verdict: GPT-5.6 Sol is the best model OpenAI has shipped for agentic tasks, closing roughly half of the SWE-bench Pro gap to Claude Opus 4.8 (69.2%) in one release (58.6% to 64.1%). The deployment architecture remains Codex’s actual bottleneck: not model quality, but where and how it runs. For teams building CLI-integrated, cloud-schedulable, parallel agent pipelines, the GUI-first model still doesn’t compose. For teams that want AI to handle cross-application coordination on a single machine, this is the most mature GUI agent available, now with a genuinely frontier model under the hood.

Decision Framework

If you need…	Use…
Autonomous multi-step coding, complex reasoning, codebase-scale tasks	Claude Code
Cloud-native scheduling, fire-and-forget automation without keeping a terminal open	Claude Code Routines / Managed Agents
Inline completions across any editor with minimal switching cost	GitHub Copilot (set a spending cap today)
Large codebases, multi-file edits, project-graph context inside an IDE	Cursor
Parallel agent breadth: multiple agents on separate branches simultaneously	Devin Desktop (formerly Windsurf)
Async GitHub issue → PR delegation, CI loop automation	Jules
Zero data transmission, air-gap compatibility without an enterprise contract	Grok Build
Terminal-native agent on Google Cloud-native infrastructure	Antigravity 2.0
Model flexibility, no subscription floor, budget-constrained or multi-provider workflows	OpenCode (open-source)
Cross-session memory accumulation, GUI-driven cross-app automation	OpenAI Codex

These tools are not mutually exclusive. The pattern that’s emerged among serious teams in 2026: Claude Code for complex autonomous sessions, Jules for delegated routine issues, a preferred IDE (Cursor or Devin Desktop) for flow-state coding, and GitHub Copilot for inline completions, with a hard spending cap now that the meter is running.

Competitive Landscape as of July 15, 2026

Anthropic ($965B, $47B ARR): Claude Sonnet 5 (June 30) is the new default across Claude Code and every plan — 63.2% agentic-coding eval, $2/$10/M tokens through August 31, native 1M context. Fable 5’s included-usage window on Pro/Max/Team plans closed on schedule July 7; it has run on usage-credits only since July 8, with the cybersecurity classifier’s silent-reroute-to-Opus behavior still the operative caveat. Mythos 5 stays capped at ~100 US critical-infrastructure organizations. Claude Code shipped credential sandboxing and headless MCP auth (v2.1.186–197) in the same window. Dynamic Workflows in research preview, Managed Agents public beta, Code Review GA.

Google: Antigravity 2.0 is the strongest hyperscaler attempt at terminal-native agentic infrastructure, but the ground under it shifted this quarter: Gemini 3.5 Pro missed its committed June GA days after four senior DeepMind researchers departed in six days, three to Anthropic. The Gemini CLI shutdown (June 18) is complete, leaving 6,000+ community pull requests orphaned and the free-tier quota cut 98%. Jules is still a mature async agent. The Gemini-only constraint is the ceiling; no multi-model routing.

Microsoft/GitHub: Copilot’s flat-rate era ended June 1; six weeks into usage-based billing, enterprise teams running agentic workloads confirm the 10×–50× cost increase range. The Copilot App went GA June 17 with Agent Merge and Cloud Automations — the most credible autonomy story GitHub has shipped, though still explicitly human-supervised by design. Claude Code Max at $200/month flat rate remains the dominant alternative narrative.

xAI/SpaceX: Grok 4.3’s price cut and Grok Skills are meaningful. The capability gap versus Opus 4.8 is real. The SpaceX/Cursor acquisition remains on track for a Q3 close — xAI will have both the terminal-native brand (Grok Build) and the dominant IDE (Cursor). The combined entity is the most interesting competitive development of the quarter.

Cursor ($75B IPO → SpaceX acquisition at $60B all-stock): Best IDE-centric tool. The deal was signed definitively June 16 and remains on track for a Q3 close. Cursor 3.5 Automations are the closest an IDE has come to cloud-native event-driven agents. The architectural ceiling is unchanged by the ownership change.

OpenAI: GPT-5.6 Sol cleared federal evaluation and went GA July 14 with no added restrictions — the first frontier release to clear the June 2 executive order review cleanly, a contrast to Fable 5’s classifier-gated reinstatement and Mythos 5’s ongoing ~100-org cap. Sol lifts Codex to 64.1% SWE-bench Pro and 85.6% Terminal-Bench 2.0 — still short of Opus 4.8’s 69.2%, but the closest OpenAI has come. The deployment gap — excellent models, IDE-and-GUI-centric harness — remains the strategic question. Confidential S-1 at $852B–$1T valuation; quarterly earnings pressure post-IPO may accelerate model retirement cycles.

Open-weight models: Z.AI’s GLM-5.2 (MIT license, 62.1% SWE-bench Pro, one-sixth GPT-5.5’s cost) and DeepReinforce’s Ornith 1.0 (MIT license, 82.4% SWE-bench Verified, about five points behind Opus 4.7’s 87.6%) both landed this quarter. Neither is a standalone agent product; they’re models you run through OpenCode, Aider, or Cline. But the gap between “best open-weight model” and “frontier proprietary model” is now measured in single digits, not tiers.

This article is refreshed biweekly (1st and 15th of each month, plus Saturday spot updates on major news). Benchmark data reflects the most recent independently verified numbers for each tool. Pricing reflects published rates as of the last update date; vendor pricing changes frequently.
