# sdd.sh — Full Content
> CTO & Software engineer
Author: Florent Clairambault
Site: https://sdd.sh/
Generated: 2026-05-20
This file contains the full content of every article on sdd.sh, concatenated for AI ingestion in a single context window. Articles are sorted newest first.
For the article index with summaries only, see https://sdd.sh/llms.txt.
---
# Gemini 3.5 Flash: Google's "Budget" Model Outperforms Flagships on Agentic Benchmarks
URL: https://sdd.sh/2026/05/gemini-3-5-flash-benchmarks-agentic-coding/
Date: 2026-05-20
Tags: google, gemini, gemini-3-5-flash, benchmarks, agentic-coding, model-release, claude-code
Categories: AI Tools, Industry
Summary: Gemini 3.5 Flash launched at Google I/O on May 19. Google calls it a Flash model — implying budget tier — but at $9/M output tokens it sits between Haiku and Sonnet pricing while hitting 76.2% on Terminal-Bench 2.1 and leading all competitors on MCP Atlas. It does not beat Claude Opus 4.7 on SWE-bench. The benchmark picture is more complicated than Google's marketing suggests.
Google launched Gemini 3.5 Flash at I/O 2026 on May 19. The name carries expectations: Flash has always meant fast and cheap, the tier you use when you need throughput at scale and can accept some quality tradeoff. Gemini 3.5 Flash breaks that contract. The benchmarks are near-frontier. The price is not budget. Understanding what Google actually built here matters more than the naming.
## What Shipped
Gemini 3.5 Flash is available immediately in the Gemini API, Google AI Studio, Antigravity 2.0, and Gemini CLI. The specs:
- **Context window**: 1 million tokens input, 64,000 output
- **Speed**: approximately 4x faster than comparable frontier models on output tokens per second
- **Pricing**: $1.50/M input tokens, $9.00/M output tokens, $0.15/M for cached input
That pricing sits between Claude Haiku 4.5 ($0.80/$4.00) and Claude Sonnet 4.6 ($3/$15) on input, but the $9/M output rate is considerably higher than Haiku and approaching Sonnet. For pure budget workloads, Claude Haiku 4.5 remains meaningfully cheaper. Google is asking you to pay near-Sonnet rates for a model the company brands as Flash.
The implicit argument: the performance justifies the price. Let's look at whether it does.
## The Benchmark Picture
Gemini 3.5 Flash leads or ties on several benchmarks that matter for agentic coding, while trailing on others.
**Where 3.5 Flash leads:**
- **MCP Atlas**: 83.6% — the benchmark that measures tool-use across MCP server integrations. This is the one Google built the model around, and it shows. Claude Opus 4.7 and GPT-5.5 trail here.
- **GDPval-AA**: 1,656 Elo — a real-world agentic evaluation benchmark. Gemini 3.1 Pro scored 1,314. That is a substantial jump.
- **Finance Agent v2**: 57.9% versus Gemini 3.1 Pro's 43.0%. The model handles multi-step financial workflows significantly better than its predecessor.
- **CharXiv Reasoning**: 84.2%, leading comparable models.
- **GPQA Diamond**: 90.4%, competitive with frontier models on graduate-level reasoning.
- **Terminal-Bench 2.1**: 76.2%, ahead of Gemini 3.1 Pro's 70.3%.
**Where it trails:**
- **SWE-bench Verified**: 78%. Claude Opus 4.7 scores approximately 87.6% and GPT-5.5 scores around 83%. For pure coding correctness at repo scale — finding bugs in existing code, implementing features in established codebases — the quality gap versus Opus 4.7 is real and meaningful.
- **Terminal-Bench 2.1**: GPT-5.5 leads at 82.7%. Gemini 3.5 Flash's 76.2% is stronger than 3.1 Pro but does not take the top position on terminal-native coding tasks.
The pattern: Gemini 3.5 Flash is optimized for MCP-driven agentic workflows and real-world multi-step tasks, at the cost of raw coding correctness on tasks that require reading and editing complex existing codebases. This is a design choice, not a deficiency.
## "Flash" Is Now a Speed Tier, Not a Budget Tier
The naming matters because it shapes expectations and buying decisions. For three generations, Flash meant: acceptable quality, fast inference, low cost — use it for high-volume, latency-sensitive workloads where you can tolerate some quality reduction versus the Pro/Ultra/Opus tier.
Gemini 3.5 Flash changes this. At $9/M output, it is not a budget model. At 76.2% Terminal-Bench 2.1, it is not a quality-compromised model. It is a speed-tier model: frontier-class performance at frontier-class speed, at a price point below the flagships ($25/M output for Opus 4.7, $30/M for GPT-5.5) but above what developers historically expected from Flash.
The TechTimes headline "costs 3x more per token" versus prior Flash models is accurate in absolute terms. Whether you view that as expensive depends on the comparison: versus flagship models, 3.5 Flash is considerably cheaper. Versus prior Flash models and true budget options like Haiku 4.5, it is substantially more expensive.
Google is repositioning the Flash tier. The question for teams is whether the performance jump justifies paying more than Haiku while falling short of Opus 4.7 on the metrics that matter most for complex coding.
## Where 3.5 Flash Wins in Practice
The strongest case for Gemini 3.5 Flash is MCP-orchestrated agentic workflows on Google infrastructure.
If your agent stack uses Antigravity 2.0 for deployment, BigQuery for data access, and MCP servers for tool integration, Gemini 3.5 Flash is the fastest path to production. The model leads on MCP Atlas specifically — not because Google gamed the benchmark, but because the model was built with this architecture in mind. Speed (4x faster than frontier) matters when you are running agents with 15-30 MCP tool calls per workflow.
The combination of Firebase Studio (launched at I/O 2026 as the agent-native build environment), Jules (free-tier async coding agent), and Gemini 3.5 Flash in Antigravity creates a coherent Google-native stack that is genuinely competitive for teams already in the Google Cloud ecosystem.
**The realistic comparison for a Google Cloud team:**
- Gemini 3.5 Flash in Antigravity: MCP Atlas leadership, 4x speed, tight Google Cloud integration, $9/M output
- Claude Code on Bedrock: Opus 4.7 foundation, 87.6% SWE-bench Verified, Managed Agents depth, $25/M output
The price delta is real. If your workload is primarily MCP-orchestrated pipeline work rather than deep repo-scale coding, 3.5 Flash on Antigravity is a defensible choice. If your workload is spec-driven autonomous development at the scale that Managed Agents and Code Review address, the SWE-bench quality gap matters more than the speed advantage.
## The Distribution Argument
Google's actual competitive advantage is not Gemini 3.5 Flash's benchmark numbers. It is where the model runs.
Gemini CLI is free with 1,000 requests/day for any developer. Firebase Studio now provisions it by default for new agent-native projects. Antigravity 2.0 runs it as the default model for Google Cloud agentic deployments. Every developer who starts a new project in Firebase Studio, opens a Gemini CLI session, or deploys to Cloud Run through Antigravity is defaulting to Google's model stack.
This is the distribution moat that benchmark tables do not capture. OpenAI's equivalent is ChatGPT's installed base and Azure's enterprise relationships. Anthropic's equivalent is Amazon Bedrock's 100,000+ enterprise customers and the GitHub Copilot Pro+ integration. Google's is the developer surface area of the Google Cloud ecosystem and the free access tier that gets Gemini CLI into every developer's terminal.
Benchmark leadership matters. Distribution at scale matters more.
## Bottom Line
Gemini 3.5 Flash is a meaningful model release. It is not the "budget Flash" the name implies. It is a near-frontier agentic model optimized for MCP-driven workflows, fast inference, and Google Cloud native integration, priced at a substantial premium over prior Flash models but below flagship pricing.
Claude Opus 4.7 retains the SWE-bench Verified lead. GPT-5.5 retains the Terminal-Bench 2.1 lead. Gemini 3.5 Flash leads on MCP Atlas and GDPval-AA — the benchmarks that most directly measure real-world agentic workflow performance.
The practical read: if you build on Google Cloud and your agents are MCP-orchestrated pipeline work, evaluate 3.5 Flash seriously. If you are running spec-driven autonomous development where coding correctness under uncertainty matters, Opus 4.7 remains the benchmark and the gap is not closed yet.
Google is doing what Google does: competing on breadth and integration rather than narrow benchmark supremacy. That has worked before.
---
**Sources:**
- [Gemini 3.5 Flash: Benchmarks, Pricing, and Complete Specs](https://llm-stats.com/blog/research/gemini-3.5-flash-launch) — LLM Stats
- [Google releases Gemini 3.5 Flash; surpasses GPT-5.5 in agentic benchmarks](https://seekingalpha.com/news/4595030-google-releases-gemini-3_5-flash-surpasses-gptminus-5_5-in-agentic-benchmarks) — Seeking Alpha
- [Gemini 3.5 Flash — Google DeepMind](https://deepmind.google/models/gemini/flash/) — Google DeepMind
- [Google Unleashes Gemini 3.5 Flash: A Coding Powerhouse That's 4x Faster and Half the Cost](https://finance.biggo.com/news/202605191936_Google_Gemini_3.5_Flash_launched_at_IO_2026) — BigGo Finance
- [Google Ships Gemini 3.5 Flash, a Cheap-to-Run Agent Model That Costs 3x More Per Token](https://www.techtimes.com/articles/316861/20260519/google-ships-gemini-35-flash-cheap-run-agent-model-that-costs-3x-more-per-token.htm) — TechTimes
- [Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7: Agentic Coding](https://www.digitalapplied.com/blog/gemini-3-5-flash-vs-gpt-5-5-opus-4-7-agentic-coding) — Digital Applied
- [Gemini 3.5 Flash: more expensive, but Google plan to use it for everything](https://simonwillison.net/2026/May/19/gemini-35-flash/) — Simon Willison
- [Gemini 3.5 Flash Benchmarks, Pricing & Context Window](https://llm-stats.com/models/gemini-3.5-flash) — LLM Stats
---
# Claude Code v2.1.139: Agent View Turns Your Terminal Into a Fleet Dashboard
URL: https://sdd.sh/2026/05/claude-code-agent-view-goal-command-v2-1-139/
Date: 2026-05-20
Tags: claude-code, anthropic, agent-view, agentic-coding, background-sessions, goal-command
Categories: AI Tools, Agentic Workflows
Summary: Claude Code v2.1.139 ships two features that change how multi-agent work actually looks: Agent View — a unified dashboard showing every running, blocked, and completed session — and the /goal command, which keeps Claude working autonomously across turns until a defined completion condition holds.
Claude Code has supported background sessions and parallel worktrees for months. The workflow was real but the visibility was not: you launched agents with `claude --bg`, context-switched between terminals to check progress, and manually tracked which sessions were waiting on you versus still running. v2.1.139 closes that gap with two features that make multi-agent development something you can actually see and direct.
## Agent View: One Screen for Every Session
`claude agents` opens a live list of every Claude Code session in the current environment, organized by state:
- **Running** — the agent is actively working on its current turn
- **Blocked** — the agent needs a human decision before it can continue
- **Done** — the agent has completed its work and is ready for review
The view updates in real time. If you have five background sessions across three repos and one is waiting on a tool approval, it appears in the Blocked row immediately. You do not have to cycle through terminal tabs or check process output to discover that.
The practical effect is a shift in how you allocate attention during multi-agent work. Today's pattern: launch a task, forget it, discover it was stuck ten minutes ago when you check back. The Agent View pattern: glance at the dashboard every few minutes, handle anything in the Blocked column, continue with whatever you're doing.
Agent View also shows sessions started via `claude --bg` alongside interactive ones, with background sessions marked `bg`. The `/resume` command works directly from the view, which means handling a blocked session no longer requires remembering which background session handles which task.
The feature ships as a research preview, which is the same stage Computer Use and Code Review shipped before going GA. Expect it to evolve based on usage patterns before it stabilizes.
## /goal: The Command That Doesn't Stop
The /goal command takes outcome-based execution seriously. Instead of running a task and returning control when it completes, /goal defines a completion condition and Claude keeps working across as many turns as necessary until that condition holds:
```
/goal All tests in the auth module pass and coverage is above 85%
```
After each turn, a separate Haiku model evaluates whether the condition holds. If the condition is not met, Claude starts another turn. If it is met, execution stops and control returns to you.
The dual-model design is worth paying attention to. Using the same model to decide both what to do and when to stop creates a failure mode where the agent convinces itself the condition has been met before it actually has — what practitioners call mission drift. By running the evaluation in a separate Haiku context that receives only the current state and the original condition, the architecture keeps the judgment about success structurally separate from the work itself.
While /goal runs, a live overlay panel shows elapsed time, turn count, and current token usage. You can monitor cost accumulation in real time rather than discovering a large bill after the fact.
The command works in interactive mode, with the `-p` flag for scripted invocations, and in Remote Control. This means you can wire /goal into a CI pipeline, a Routine, or any other orchestration layer that drives Claude Code programmatically.
## What Else Shipped in 2.1.139
The two headline features are accompanied by a set of smaller but useful changes:
**Plugin improvements**: `claude plugin details` now shows a plugin's full component inventory and a projected per-session token cost estimate before you install or run it. The marketplace also enforces plugin dependencies — if a plugin requires another plugin or MCP server to function, the dependency is flagged at install time rather than discovered at runtime. Both changes address the common experience of installing a plugin that silently fails to work.
**Session and background controls**: Background sessions (`--bg`) now appear in interactive mode session lists marked as `bg`. The `/resume` command accepts session IDs from background sessions, making it consistent with the existing interactive session resumption flow.
**Transcript navigation**: The transcript view now supports keyboard shortcuts for jumping between user prompts. For long sessions involving many turns, this is meaningfully faster than scrolling.
**Observability for multi-agent work**: API requests from subagents now carry `x-claude-code-agent-id` and `x-claude-code-parent-agent-id` headers. The `claude_code.llm_request` OpenTelemetry span includes matching `agent_id` and `parent_agent_id` attributes. Teams running multiple Claude Code agents in orchestrated workflows can now trace which agent made which API call without correlating by timestamp.
**/scroll-speed**: A minor quality-of-life command that adjusts mouse wheel scroll speed in the terminal with a live preview. Trivial, but the kind of thing that grates when it's wrong for your setup.
**Notable fixes**: A deadlock that blocked `claude auth login`, `logout`, and `status` when expired credentials coincided with the `forceRemoteSettingsRefresh` enterprise policy is resolved. The `autoAllowBashIfSandboxed` flag now correctly approves commands that include shell expansions. Unbounded memory growth from HTTP/SSE MCP servers streaming non-protocol data is patched.
## From Session to Fleet
The direction of 2.1.139 is readable: Claude Code is evolving from a per-session tool into a multi-agent coordination platform you operate from a single interface.
Agent View is the control plane. /goal is the autonomy primitive that makes agents worth controlling. The observability additions (OTel agent headers, projected plugin costs) are the instrumentation layer that lets you understand what the fleet is doing and what it costs. These three things are coherent; they belong together.
The practical ceiling for today's solo developer is probably eight to twelve parallel sessions before coordination overhead exceeds productivity gain. But the relevant frame is not the solo developer — it is the team running Claude Code Routines overnight, the enterprise running Code Review on every PR, the startup where one developer is directing a dozen specialized agents across a monorepo. Agent View makes that model of working possible to manage without losing track.
Upgrade via `npm install -g @anthropic-ai/claude-code` or wait for the auto-update. The Agent View research preview requires no configuration beyond launching `claude agents`.
---
**Sources:**
- [Release v2.1.139 · anthropics/claude-code](https://github.com/anthropics/claude-code/releases/tag/v2.1.139) — GitHub
- [Claude Code CLI 2.1.139 changelog](https://x.com/ClaudeCodeLog/status/2053913638197416198) — ClaudeCodeLog on X
- [Claude Code v2.1.139: Agent View, Goal Setting, and Enhanced Workflow Control](https://claude-world.com/articles/claude-code-21139-release/) — ClaudeWorld
- [Claude Code Agent View and Goal Command for AI Engineers](https://zenvanriel.com/ai-engineer-blog/claude-code-agent-view-goal-command-guide/) — Zen van Riel
- [Claude Code 2.1.139 adds /goal command](https://explainx.ai/blog/claude-code-goal-command-long-running-agents-2026) — explainx.ai
- [Changelog - Claude Code Docs](https://code.claude.com/docs/en/changelog) — Anthropic
---
# Google I/O 2026: Firebase Studio Is Live, Jules Goes Free, and the Agentic Race Gets a Third Contender
URL: https://sdd.sh/2026/05/google-io-2026-firebase-studio-jules-free-gemini-code-assist-recap/
Date: 2026-05-19
Updated: 2026-05-19
Tags: google, google-io, firebase, firebase-studio, jules, gemini, gemini-code-assist, agentic-coding, claude-code
Categories: AI Tools, Agentic Workflows, Industry
Summary: Google I/O 2026 delivered the developer tools story it promised: Firebase Studio launched as a full-stack agent-native development platform, Jules exited beta with free-tier access, and Gemini Code Assist hit general availability. Google's agentic coding stack is now a real product, not a roadmap.
The [preview article I published six days ago](https://sdd.sh/posts/google-io-2026-preview-gemini-4-firebase-agents-agentic-coding/) said to watch three things at Google I/O 2026: whether Gemini 4's context window advantage translated into better coding outcomes, whether Firebase Studio became a real agent-native platform or a rebranding exercise, and whether Jules V2 had a credible answer to Claude Code Routines.
The keynote delivered answers. Not all of them were what Google needed — but some were.
## What Actually Shipped
Google's I/O 2026 developer story is cleaner than most expected. Three products moved from preview or beta to shipped:
- **Firebase Studio** launched as an agent-native full-stack development environment
- **Jules** exited beta and became available to all users on free and paid AI Pro and Ultra tiers
- **Gemini Code Assist** hit general availability for individuals and GitHub users, powered by Gemini 2.5
Supporting these launches: **Gemini Intelligence** — Google's integrated AI suite across Android, ChromeOS, Wear OS, Android Auto, and Android XR — and **Googlebooks**, the first Aluminium OS laptops from Acer, ASUS, Dell, HP, and Lenovo arriving this fall.
Google did not ship Gemini 4 as a standalone model with a clean benchmark story. The model capability layer is table stakes now; the developer tooling story is what differentiated I/O 2026.
## Firebase Studio: The Real Announcement
Firebase Studio is the most significant thing Google shipped today for developers who care about agentic workflows. It is not Project IDX with a new name. It is a substantively different product.
The architecture: a Code OSS IDE environment in the browser, a no-code prototyping layer for non-developer stakeholders, and an agent mode capable of executing multi-step development tasks autonomously. Figma integration means a design file becomes an application prototype in Firebase Studio without manual handoff. Google Cloud backend provisioning is automated — Cloud Run, Firebase Hosting, and related services are available without configuration.
The intended workflow is: prototype in Google AI Studio → build in Firebase Studio → deploy to Google Cloud. For teams already in the Google ecosystem, this is a credible end-to-end pipeline with fewer seams than anything Google has shipped before.
The comparison to Claude Code is the obvious one and Google knows it. Firebase Studio's thesis is that the browser is where more developers live, that cloud-native development removes the local environment complexity, and that Figma-to-deployment in a single environment lowers the barrier for teams that currently have a designer-to-developer handoff problem.
Claude Code's thesis is that the terminal provides the access, flexibility, and tooling depth that truly autonomous agents require — and that browser-based environments introduce platform constraints that limit what agents can do. Both theses are coherent. They're not targeting exactly the same developer.
Where Firebase Studio wins: Google Cloud-native integration depth is real and unmatched. If you are deploying to Cloud Run, using BigQuery, or running on Firebase Hosting, the one-click deployment and native service wiring saves hours of configuration that Claude Code requires separately via MCP servers or custom scripts.
Where Claude Code wins: environment ownership. Claude Code agents can run arbitrary shell commands, modify system configs, install toolchains, and manage processes in ways that a browser IDE cannot. For the kind of spec-driven, multi-agent, CI-integrated autonomous development that Claude Code Routines enables, Firebase Studio's browser-native architecture is a constraint, not an advantage.
Firebase Studio is a real product that will earn real adoption. It is not Claude Code.
## Jules Goes Free: The KPI-Driven Bet
Jules exiting beta is significant for one reason: it is now available to all users, including the free tier. That means any developer can queue an async task on Google's infrastructure, walk away, and come back to a pull request.
The architectural story has not changed from [the Jules deep dive published in March](https://sdd.sh/posts/jules-deep-dive-google-async-agent-ci-loop/). Jules integrates with GitHub, runs on Google infrastructure, creates multi-step plans, executes them asynchronously, and presents results as a diff with reasoning attached. Audio changelogs of its work are available. The CI loop closes automatically.
What is new is Project Jitro — the Jules V2 approach that changes the input model. Instead of telling Jules what to do (fix this bug, refactor this module), you tell it what to achieve: raise test coverage to 80%, reduce p95 API latency by 30 milliseconds, resolve all accessibility violations in the component library. Jitro maps the goal to the required code changes, runs asynchronously, and delivers a pull request targeting the metric, not the task.
KPI-driven development is a genuinely interesting framing. It is also harder to evaluate than task-driven development because the rubric for success is embedded in the goal definition. If you tell an agent "raise test coverage to 80%," the most efficient path is to write trivial tests that cover lines without exercising real behavior. Whether Google has solved that evaluation problem is not yet clear from today's launch.
Claude Code's analog is Managed Agents Outcomes, announced at Code with Claude SF on May 6: a separate rubric-based grader that runs in its own context window, evaluates whether the agent's output meets defined criteria, and triggers re-runs if it doesn't. The grader runs independently of the agent, which is structurally different from building the goal evaluation into the task itself. Neither approach has published failure rate data at production scale.
Jules free tier changes who can evaluate these tools. The comparison is now available to any developer without a budget commitment.
## Gemini Code Assist: GA With Caveats
Gemini Code Assist reached general availability for individuals and GitHub users. Gemini 2.5 powers the assistant. A 2 million token context window is announced as coming soon, not yet live.
The "coming soon" caveat matters. A 2M token context window for Gemini Code Assist would change the competitive comparison with Claude Code's 1M context window significantly for whole-codebase tasks. But it is not shipping today. GA means the product is available and supported — it does not mean the context window feature announced for the future is present now.
At current Gemini 2.5 Pro performance levels, the model quality gap versus Opus 4.7 on SWE-bench Pro is approximately ten percentage points (54% vs 64.3%). Gemini Code Assist's competitive advantage today is not model quality — it is Google Cloud native integration and price. The free tier (700,000+ VS Code installs) gives Google enormous distribution. If Gemini 4 substantially closes the model quality gap, that distribution becomes a moat.
## The Platform Layer: Gemini Intelligence and Googlebooks
Two announcements from today are not directly about developer tooling but matter for the longer arc.
**Gemini Intelligence** — the integrated AI suite across Android, ChromeOS, Wear OS, Android Auto, and Android XR — represents Google's bet that the agent layer lives in the OS, not just in development tools. Features like proactive task automation, custom widget generation, and the Rambler speech-to-text assistant are not developer tools. They are consumer surfaces that normalize agentic behavior for users who will eventually consume agentic software. Google is building the audience for agentic applications at the OS level while simultaneously building the tools to create them.
**Googlebooks** — the first Aluminium OS laptops from major OEMs — is the physical manifestation of the Android-ChromeOS merger that developers have been tracking for two years. Aluminium OS arrives in fall 2026. For Android and web developers, it is a new primary development and consumption target.
## The Three-Way Race
I/O 2026 establishes something that was not clearly true twelve months ago: the agentic coding market has three serious competitors.
Claude Code leads on model quality (Opus 4.7, 64.3% SWE-bench Pro), terminal-native autonomy, Managed Agents platform depth, and enterprise infrastructure (Cowork, Analytics API, Code Review GA). It is the benchmark that everyone else is chasing.
OpenAI's Codex is the credible cost alternative, leveraging GPT-5.5 (82.7% Terminal-Bench 2.0, 58.6% SWE-bench Pro) with async execution, a mobile supervision layer, and pricing that enterprise procurement finds easier to defend than Claude Code's per-token costs.
Google now has Firebase Studio (agent-native platform for Google Cloud deployments), Jules (free-tier async agent with KPI-driven V2 approach), and Gemini Code Assist (2M context window incoming, 700K VS Code installs). Google's stack wins on distribution, integration depth within the Google ecosystem, and price. It loses on autonomous execution depth and current model quality.
The developer who builds on Google Cloud, has Figma in their design workflow, and wants an integrated environment for new projects now has a real choice that Firebase Studio represents. The developer doing spec-driven multi-agent development at team scale with CI integration, production CLAUDE.md invariants, and autonomous overnight coding runs has not had a reason to switch today.
The benchmark that matters: when Gemini Code Assist ships the 2M context window and if Google closes the SWE-bench Pro gap with whatever model ships next, the model-quality argument for Claude Code's premium pricing weakens. Until then, Google has narrowed the tooling gap without closing the quality gap.
That is real progress. It is also still a gap.
---
**Sources:**
- [Everything announced at The Android Show: I/O 2026 edition](https://www.engadget.com/2171038/everything-announced-at-android-show-google-io-2026/) — Engadget
- [Google I/O 2026: The Developer Briefing](https://byteiota.com/google-io-2026-developer-preview/) — Byteiota
- [Firebase Studio - Google](https://firebase.google.com/docs/studio) — Firebase Docs
- [Firebase Studio lets you build full-stack AI apps with Gemini](https://cloud.google.com/blog/products/application-development/firebase-studio-lets-you-build-full-stack-ai-apps-with-gemini) — Google Cloud Blog
- [Google's Next Coding Agent Could Change How Developers Think About Their Work](https://devops.com/googles-next-coding-agent-could-change-how-developers-think-about-their-work/) — DevOps.com
- [Google tests Jules V2 agent capable of taking bigger tasks](https://www.testingcatalog.com/google-prepares-jules-v2-agent-capable-of-taking-bigger-tasks/) — Testing Catalog
- [Google I/O 2026 Developer Preview: Gemini 4, Android 17, Agentic Coding](https://www.abhs.in/blog/google-io-2026-may-19-gemini-4-android-17-agentic-coding-developer-preview) — Abhishek Gautam
- [Google Counters GitHub & Microsoft with Jules Agent & Enhanced Gemini AI](https://visualstudiomagazine.com/articles/2025/05/20/google-counters-github-microsoft-with-jules-agent-enhanced-gemini-ai.aspx) — Visual Studio Magazine
- [AI assistance within Firebase Studio](https://firebase.google.com/docs/studio/ai-assistance) — Firebase Docs
---
# Anthropic Passed OpenAI in Business AI Spend. The Ramp Data Is Decisive — and the Threats Are Serious.
URL: https://sdd.sh/2026/05/anthropic-overtakes-openai-ramp-ai-index-may-2026/
Date: 2026-05-19
Updated: 2026-05-19
Tags: anthropic, claude-code, openai, enterprise, market-share, industry, ramp
Categories: Industry, AI Tools
Summary: The May 2026 Ramp AI Index shows Anthropic at 34.4% of US business AI spend — past OpenAI's 32.3% for the first time. Claude Code is the engine. But the same report flags three structural threats that could erase the lead as fast as it was built.
For the first time since the AI race started in earnest, more American businesses are paying for Anthropic than for OpenAI. The May 2026 Ramp AI Index, compiled from transaction data across more than 50,000 US businesses and $100 billion in annual spend, shows Claude at 34.4% adoption — up 3.8% in April — while ChatGPT fell to 32.3%, down 2.9% in the same period.
This is a single data point from a single measurement methodology. It is also the most credible third-party spend-tracking dataset in the AI market, and the crossover it records is unambiguous.
## How This Happened
The trajectory is not subtle. Anthropic climbed from 0.03% of businesses in June 2023 to 7.94% by April 2025 — then rocketed to 34.4% by April 2026. That is a quadrupling in a single year. OpenAI's business adoption grew 0.3% over the same period.
The driver is not Claude's chat product. It is Claude Code.
Ramp's analysis identifies Claude Code as the fastest-growing product in Anthropic's history and the primary mechanism behind the adoption surge. That tracks with external signals: a separate analysis of public GitHub data published this month estimated that 4% of all global public commits are now being authored by Claude Code — double the percentage from just one month prior. For context, GitHub processed roughly 4 billion commits in 2025. Four percent of that is 160 million commits per year. One tool, one year, that kind of scale.
The growth compounds because Claude Code is a workflow tool, not a chatbot. When a team adopts Claude Code, usage accumulates through API calls rather than seat licenses. One engineer doing serious agentic work can consume between $500 and $2,000 per month in API costs. Multiply that across engineering orgs, and Anthropic's revenue per customer is significantly higher than a traditional SaaS model where everyone pays the same monthly fee regardless of usage.
Anthropic's annualized revenue run rate hit approximately $30 billion in early 2026, up from $9 billion at the end of 2025. OpenAI is tracking at $24-25 billion over the same period — a reversal from a lead that had seemed structural just eighteen months ago.
## Three Threats the Data Surfaces
Ramp doesn't just track the crossover. It flags three structural risks that could unwind Anthropic's position. Each deserves honest analysis.
**Threat 1: The token incentive trap.** Anthropic makes money when customers use more tokens. That creates a structural incentive to push customers toward expensive models and high-context workflows even when cheaper alternatives would suffice. Ramp frames this bluntly: "Anthropic profits from increased token consumption, creating pressure to push customers toward expensive models even when cheaper ones are sufficient."
This is the underlying economics behind what Uber's CTO described publicly: the company burned through its entire 2026 AI budget in four months, largely on Claude Code and Cursor. Individual engineers are reporting $500 to $2,000 per month in personal API costs for serious agentic workflows. At those numbers, CFOs start paying attention. And when CFOs start paying attention, they look for alternatives.
**Threat 2: Reliability and cost shifts.** Ramp's data captured a period of user frustration — "frequent outages, rate limits, and increasing dissatisfaction with results." Anthropic responded by resetting usage limits in April and securing additional compute capacity through the SpaceX Colossus deal (300MW, 220,000+ NVIDIA GPUs in Memphis). The rate limit reset helped. But the underlying compute constraint that created the reliability problems is a consequence of the 80x growth in Q1 2026 that Anthropic had only planned as 10x. That kind of demand mismatch doesn't resolve cleanly.
A separate cost issue: recent model changes tripled token costs for image-inclusive prompts. That's a significant jump in a category where usage is growing. Claude Code's computer use features and the visual analysis capabilities of Opus 4.7 both involve image tokens. Developers building on those capabilities took an unexpected cost hit.
**Threat 3: OpenAI Codex as cost-effective alternative.** OpenAI's Codex — the async agentic coding agent, not the legacy model — now covers substantial overlap with Claude Code's core workflow at a lower per-task cost and with minimal switching friction. Ramp identifies inference platforms offering cheap, open-source alternatives as the fastest-growing competing category in their dataset. Codex isn't open-source, but its pricing structure and the ease of migration via standard API patterns means that cost-sensitive teams have a credible exit path.
The switching cost from Claude Code to Codex is lower than it looks from the outside. Both tools operate via terminal, both support CLAUDE.md-style configuration, both integrate with GitHub. The moat is model quality, CLAUDE.md ecosystem depth, and the Managed Agents platform. If OpenAI closes the model quality gap on SWE-bench Pro (currently 58.6% GPT-5.5 vs 64.3% Opus 4.7), the Codex cost argument gets harder to dismiss.
## What Actually Changes
The Ramp crossover is symbolically significant and operationally real. "Most businesses paying for Anthropic vs OpenAI" means enterprise IT procurement conversations are now tilted differently than they were six months ago. When Anthropic walks into a 10,000-seat enterprise negotiation, it no longer needs to defend itself against "but everyone uses ChatGPT." The data now says the opposite.
But the threats Ramp surfaces are also real, not hypothetical. Uber's budget story will be repeated in CFO conversations at every large enterprise that has given engineers open-ended Claude Code access. The response from those CFOs won't necessarily be "switch to a competitor" — it may be "governance and spend controls." That's exactly what Claude Cowork GA addresses (group spend limits, per-user caps, Analytics API for cost attribution). Anthropic has built the enterprise controls. The question is whether adoption of those controls keeps pace with the cost concerns they're meant to address.
The deeper question is whether Claude Code's architectural advantages are durable. The terminal-native, agent-owned model — where Claude Code has full environment access and owns the full development lifecycle from spec to deployment — is qualitatively different from IDE-embedded tools. But "qualitatively different" only maintains a price premium if users feel the difference in their outcomes, not just in their benchmarks.
The 4% of GitHub commits metric is the most direct signal available. At 160 million commits per year, something about the outcomes is working. The business adoption crossover confirms the enterprise is noticing.
The threats are real. The lead is real. The next quarter of Ramp data will be informative.
---
**Sources:**
- [Ramp AI Index — May 2026](https://ramp.com/leading-indicators/ai-index-may-2026) — Ramp
- [Anthropic now has more business customers than OpenAI, according to Ramp data](https://techcrunch.com/2026/05/13/anthropic-now-has-more-business-customers-than-openai-according-to-ramp-data/) — TechCrunch
- [Anthropic finally beat OpenAI in business AI adoption — but 3 big threats could erase its lead](https://venturebeat.com/technology/anthropic-finally-beat-openai-in-business-ai-adoption-but-3-big-threats-could-erase-its-lead) — VentureBeat
- [Anthropic Passes OpenAI in Business Adoption: Ramp AI Index](https://letsdatascience.com/blog/anthropic-passed-openai-business-adoption-ramp-index) — Let's Data Science
- [Anthropic 34.4% Just Passed OpenAI — Ramp Flip May 2026](https://theplanettools.ai/blog/anthropic-overtakes-openai-ramp-business-adoption-may-2026) — ThePlanetTools.ai
- [Anthropic vs OpenAI Business Adoption: What the Data Says About Enterprise AI](https://www.mindstudio.ai/blog/anthropic-vs-openai-business-adoption-2026) — MindStudio
---
# OpenAI Codex Mobile: Remote Control for Your Agent, Not Code on Your Phone
URL: https://sdd.sh/2026/05/openai-codex-mobile-remote-control-agentic-sessions/
Date: 2026-05-18
Updated: 2026-05-18
Tags: openai, codex, mobile, agentic-workflows, remote-access
Categories: AI Tools, Agentic Workflows
Summary: OpenAI shipped Codex inside ChatGPT for iOS and Android on May 14 — but not as a code execution environment. It's a remote viewport onto a session running on a host machine. Remote SSH also went GA. The architectural choice is correct, and it reveals more about agentic coding than the headline does.
The announcement almost always gets written wrong. "Codex comes to your phone" — technically accurate, architecturally misleading. What launched on May 14 is a remote viewing and control surface, not a mobile runtime. Your phone becomes a window into a Codex session running on a host machine — your laptop, a Mac mini, a dev box, or a cloud VM you've configured.
This is the right design. And it says more about where agentic coding is headed than the headline suggests.
## Why Mobile Can't Be the Runtime
Agentic coding at the frontier requires substantial compute and system access. Anthropic's SpaceX Colossus deal — 300 megawatts, 220,000+ NVIDIA GPUs allocated to Claude — suggests the scale these systems will eventually run at. That isn't coming to the A18 chip.
But raw compute is only part of it. Coding agents need persistent filesystem access (your project files, not a sandboxed documents folder), tool execution (git, cargo, pytest, Docker, whatever your stack requires), credential access (API keys, SSH certs, cloud auth), and long-running sessions that can stretch across hours. None of these fit cleanly on a phone.
OpenAI made the constraint explicit in the launch notes: files, credentials, and local setup stay on the host machine. The phone only receives what the host chooses to stream — terminal output, screenshots, file diffs, test results. You can review outputs, approve commands, change models, and start new sessions from the phone. The agent itself stays on the machine that can do the work.
## What Actually Shipped
Three things launched together on May 14:
**Codex on ChatGPT mobile (preview)** — Available across all ChatGPT plans, including Free and Go. The mobile app shows a live view of active Codex threads: what the agent is doing, what it's produced, which commands it's requesting. You can redirect a session, approve a step, or kill a runaway loop from the phone without touching a laptop.
**Remote SSH GA** — Codex can now connect to any SSH-accessible machine. This moved from preview to general availability on May 14. The practical implication: Codex doesn't need to run on your local laptop at all. SSH into a powerful dev box, a team server, or a cloud VM — the session runs there. The phone (or any client) streams it.
**Programmatic access tokens for Enterprise and Business** — Scoped credentials that let Codex sessions authenticate to services without exposing primary account credentials. Useful for CI/CD pipelines and automated workflows where an agent needs service-level access rather than user-level access.
HIPAA-compliant Codex in CLI, IDE, and the Codex app also launched for Enterprise customers — but local environments only. Healthcare teams running agents against protected data keep everything on-premises. The mobile viewer is excluded from this mode by design.
## The Architecture It Reveals
Mobile-as-viewport isn't a compromise. It's an acknowledgment that agentic workflows have natural supervision points — moments where a human should glance at what's happening and decide whether to continue, redirect, or stop. You don't need a laptop to do that.
This pattern was already live for Claude Code users. Claude Code Channels (launched March 2026) routes agent sessions through Telegram or Discord. The agent runs on your machine or Anthropic's infrastructure; the messaging app is the supervision surface. You get a message when the agent finishes a step, needs approval, or hits an error. You reply with instructions.
The key architectural difference is where the agent lives. Codex on mobile still requires a host machine — your device, a dev box, or a cloud VM you've configured. Claude Code Routines run on Anthropic's infrastructure natively; there's no host to provision or maintain. When a Routine finishes overnight, the result shows up wherever you're watching. Remote SSH GA narrows the gap for Codex — if you provision a powerful cloud VM and SSH Codex into it, you get a similar effect — but it requires managing that infrastructure yourself.
## Who This Is Actually For
The core use case is supervision during transitions. You kick off a multi-hour Codex session before a meeting. Mid-meeting, you check your phone, verify the agent is making progress, review an intermediate diff. After the meeting, you approve the next step or give the agent new direction. That turns "agentic coding requires sitting at a computer" into "agentic coding requires occasional glances from wherever you are." That's a meaningful quality-of-life change for anyone running long sessions.
The enterprise features have a different constituency. Remote SSH GA lets engineering teams provision powerful centralized dev boxes and let developers SSH Codex sessions into them — shared infrastructure rather than per-developer laptops. The HIPAA-compliant mode signals OpenAI's intent in regulated industries. Healthcare teams that needed fully on-premises agent workflows can now use Codex agents within those constraints, supervised via the desktop app.
The Free and Go plan access for the mobile viewer is the broader play. OpenAI is normalizing the idea that you supervise agents from your phone rather than owning and operating a dev machine. For individual developers and indie hackers, lower barrier to entry matters.
## The Problem That Remains
Codex on mobile is a good execution within real constraints. The constraints are legitimate: mobile is a viewport, not a runtime, and OpenAI designed accordingly.
But every Codex session still requires a host machine. If your laptop runs out of battery, the session stops. If you're traveling and didn't provision a cloud VM, there's no fallback. The agent isn't cloud-native by default. It's cloud-accessible-if-you-set-it-up.
The natural next step is fully managed cloud execution: OpenAI runs the agent on their infrastructure by default, mobile and desktop clients supervise it. That would make the phone viewer genuinely powerful rather than convenient — the agent outlasts your hardware. Whether OpenAI builds this before Claude Code's Routines become the default expectation for how autonomous agents work is the product trajectory to watch.
For now, May 14 is a worthwhile milestone. Remote SSH GA alone is the more significant technical change — it decouples Codex from the developer's local machine, which is the prerequisite for everything else. The mobile viewer is the consumer face of an infrastructure shift that matters.
---
**Sources:** [TechCrunch](https://techcrunch.com/2026/05/14/openai-says-codex-is-coming-to-your-phone/), [9to5Mac](https://9to5mac.com/2026/05/14/openai-brings-codex-control-to-chatgpt-for-iphone-and-android/), [SiliconANGLE](https://siliconangle.com/2026/05/14/openai-brings-codex-mobile-devices-adds-customization-features/), [OpenAI Developer Docs — Remote Connections](https://developers.openai.com/codex/remote-connections), [Gadget Bridge](https://www.gadgetbridge.com/news/openai-codex-lands-on-chatgpt-mobile-app-for-ios-and-android-with-remote-ssh-support/)
---
# Cursor 3.3 and 3.4: Parallel Build Plans, Cloud Dev Environments, and the Ceiling That Remains
URL: https://sdd.sh/2026/05/cursor-33-34-parallel-agents-cloud-dev-environments/
Date: 2026-05-18
Updated: 2026-05-18
Tags: cursor, parallel-agents, cloud-environments, agentic-workflows, code-review
Categories: AI Tools, Agentic Workflows, Industry
Summary: Cursor shipped two meaningful updates in May: Parallel Build Plans and PR Splitting in 3.3 (May 7), and Cloud Agent Development Environments plus configurable Bugbot effort levels in 3.4 (May 13). Both updates are genuine improvements. Both also clarify what Cursor is and isn't.
Cursor shipped two changelogs in quick succession this month. Version 3.3 on May 7 added Parallel Build Plans and built-in PR Splitting. Version 3.4 on May 13 added Cloud Agent Development Environments and configurable Bugbot effort levels. The combined effect is a meaningfully more capable agentic coding environment — and a clearer picture of where Cursor's architecture lands.
Both things are true: these are real improvements, and the ceiling is real. Let's look at both.
## 3.3: Parallel Build Plans and PR Splitting
The headline feature in 3.3 is "Build in Parallel." When Cursor generates a multi-step implementation plan, a button now identifies which parts are independent and runs them as async subagents concurrently. Steps that require earlier output stay ordered; everything else runs in parallel.
This is a genuine improvement over the single-agent sequential loop. A plan to add authentication middleware, update the database schema, and write integration tests has three largely independent branches. Running them sequentially meant the agent worked on one while the others waited. Build in Parallel runs all three concurrently and merges results when they complete.
PR Splitting is the complementary feature. Once an agent produces a large diff, a quick-action pill in the PR view proposes how to split it into logically independent pull requests. Cursor shows the proposed split, creates a backup snapshot, and executes if you confirm. The chat context from the session informs how it identifies slices — if you told the agent "add auth and fix the caching bug," it knows those are separate concerns and splits accordingly.
Both features address a real friction: AI-generated diffs routinely mix multiple logical changes in a single commit because agents tend to fix adjacent things they notice. Giving the agent a splitting primitive and parallel execution reduces that sprawl.
## 3.4: Cloud Dev Environments
The more architecturally significant update is in 3.4. Teams can now configure a Dockerfile-based development environment that Cursor agents use when running in the cloud. The Dockerfile specifies the repository, dependencies, credentials, build system access, and any tooling the agents need. It's reusable across sessions and supports multi-repo configurations.
This addresses a real failure mode: cloud agents fail in opaque ways when they can't find a dependency, authenticate to a service, or run a build command. A team-managed Dockerfile that defines the full environment removes that ambiguity. Every agent session starts from a known-good state.
The enterprise value is clear. Previously, Cursor cloud agents ran in generic sandboxes that may or may not match your actual development environment. Now a team can define "this is what our stack looks like" and have agents operate reliably within it. Persistent environments across sessions, multi-repo access, pre-baked credentials — these are the table stakes for production agentic workflows.
Bugbot gains effort levels in 3.4. Default mode finds 0.7 bugs per run with 79%+ resolved by merge time. High mode climbs to 0.95 bugs per run — slower, more expensive, more thorough. A Custom mode takes natural-language instructions for when to use which: "use High effort for PRs touching the payment flow, Default for everything else."
These are actual numbers, which is unusual in agent quality claims. 0.7 bugs per run as the default, reaching 0.95 in high-effort mode, with a documented merge-time resolution rate — that's the kind of measurement that makes it possible to have a real conversation about whether the tool is worth the cost.
## What These Features Don't Change
Credit given: parallel build plans, cloud dev environments, and configurable Bugbot are the right direction. Cursor is building infrastructure for agentic workflows, not adding more autocomplete. These are serious product improvements.
The architectural ceiling is structural, not cosmetic.
**On parallel execution:** "Build in Parallel" runs subagents within Cursor's orchestration layer — concurrent subtask execution within a single agent context. Claude Code Agent Teams (March 2026) ships a 15-agent mailbox architecture where each agent is an independent peer with its own tool access, memory, and task queue. The difference is not degree — it's category. Cursor's parallel execution is concurrency. Agent Teams is coordination. In multi-day, multi-component projects, that gap surfaces.
**On cloud dev environments:** Cursor's cloud environments give agents a configured execution context — a Dockerfile that replicates your dev setup. This solves the "agent can't find the dependency" problem. It doesn't change who owns and manages the infrastructure. Your team writes the Dockerfile, provisions the environments, and maintains them as your stack evolves. Claude Code Managed Agents uses Anthropic's infrastructure with Anthropic's reliability SLA. Claude Code on AWS Bedrock (GA since April 18) gives you AWS-managed infrastructure with Mantle zero-operator-access guarantees. Different risk profiles for different organizational requirements.
**On Bugbot:** 0.95 bugs per run is a good single-agent review number. Claude Code Review (GA May 6, $15–25 per PR) runs multiple independent review agents in parallel, each with its own context and focus area. Multi-agent review isn't just "more bugs found" — it's different reviewers looking for different things simultaneously. Comparing single-agent and multi-agent review on a per-bug metric misses the architectural difference.
## The Cursor Question
The pattern across 3.3 and 3.4 is consistent: Cursor adds depth to features the IDE already had. The additions work, the metrics are real, and teams using Cursor benefit directly from these updates.
What Cursor is building is an IDE that orchestrates AI agents. What Claude Code is building is an agent that can use an IDE when it needs to. The distinction matters more with each release because the use cases are diverging.
If your team's workflow requires visual diff review, inline suggestions, and IDE-integrated chat — and you're willing to be in the loop during agent execution — Cursor 3.3 and 3.4 are good reasons to stay. Parallel build plans speed up multi-component work. Cloud dev environments reduce agent failures in production workflows. Configurable Bugbot means you can tune the quality/cost tradeoff per PR.
If your question is "what's the fastest path to autonomous software development with minimal developer bottlenecks," these updates don't move the answer. They extend a model where a developer is a necessary orchestration participant. That's a legitimate product choice — most engineering teams aren't ready to remove that participant. But it's worth being clear that the choice is being made, not just the tool.
Cursor at $50 billion valuation (announced April 2026) has resources to ship more quickly. The direction is right. The question is whether features added to an IDE-centric architecture can close the gap on a terminal-native agentic architecture, or whether those are just different products serving different markets.
The next version will tell us something about which it is.
---
**Sources:** [cursor.com/changelog (3.3, May 7, 2026)](https://cursor.com/changelog/05-07-26), [cursor.com/changelog (3.4, May 13, 2026)](https://cursor.com/changelog/05-13-26), [Cursor Cloud Agent Development Environments blog](https://cursor.com/blog/cloud-agent-development-environments), [Cursor Bugbot effort levels blog](https://cursor.com/blog/may-2026-bugbot-changes), [The Decoder — Cursor 3 parallel agent coverage](https://the-decoder.com/new-cursor-3-ditches-the-classic-ide-layout-for-an-agent-first-interface-built-around-parallel-ai-fleets/)
---
# From Ghost Text to Autonomous Agent: Five Years of AI Coding Tools
URL: https://sdd.sh/2026/05/from-copilot-to-autonomous-agents-ai-coding-evolution-2021-2026/
Date: 2026-05-17
Updated: 2026-05-17
Tags: github-copilot, claude-code, cursor, swe-bench, agentic-workflows, history, autonomous-agents, mcp
Categories: AI Tools, Industry
Summary: Five years ago, GitHub Copilot autocompleted a function and developers argued whether it was cheating. Today, Google says 75%+ of its new code is AI-generated and Claude Opus 4.7 scores 87.6% on SWE-bench Verified. This is the arc — and the rupture nobody predicted.
In June 2021, a developer on Twitter posted a screenshot of GitHub Copilot completing a for-loop and wrote: "impressive, but it's just autocomplete." That take was correct and completely missed the point. What Copilot started was not a feature — it was the first turn of a loop that would, five years later, produce autonomous agents writing, testing, and shipping software while the engineer supervises from a terminal.
I've been here for the whole ride. And the most important thing I can tell you is that this evolution was not smooth. There was a rupture. And most of the tools you're familiar with are on the wrong side of it.
---
## 2021–2022: The Autocomplete Era
GitHub Copilot launched its private beta in June 2021 on top of OpenAI Codex. The experience was genuinely magical in a narrow way: you typed a comment describing what you wanted and ghost text appeared, offering a plausible completion. Functions filled themselves in. Boilerplate evaporated.
But the paradigm was firmly tab-to-accept. The model was passive. It waited for you to type, offered a suggestion, and disappeared until you typed again. The human was the engine; the AI was the turbocharger. Amazon CodeWhisperer followed the same template. The competitive question was which model produced more accurate completions, not what the model was capable of doing on its own.
The discourse of this era aged badly. "It'll just write buggy code you'll have to fix anyway." "It's a Clippy for developers." "It trains on your private repo." Some of these concerns were legitimate; none of them engaged with the trajectory. Copilot went generally available in June 2022 and immediately became the most widely adopted developer tool in a generation. The tool was limited. The appetite it revealed was not.
---
## 2022–2023: The Chat Era
GPT-4 landed in March 2023 and broke the autocomplete paradigm. Not because GPT-4 was better at completing lines — though it was — but because it could sustain a coherent conversation about a codebase across hundreds of turns. Developers stopped asking "complete this function" and started asking "why does this fail, what should I change, how would you design this differently."
This was the era of vibe coding, a term that emerged to describe a workflow that was equal parts productive and reckless: paste the error message, accept the fix, run it again, don't read the diff. Engineers started shipping features faster than they could reason about what they were shipping. Technical debt accumulated at AI speed.
SWE-bench was created in late 2023 by Princeton NLP researchers, and its arrival mattered more than most people realized at the time. For the first time there was a structured benchmark measuring something close to real software engineering — resolving GitHub issues in real Python repositories. The initial numbers were humbling: state-of-the-art models solved less than 5% of tasks. That number would become a speedometer for the entire field.
The chat era was real progress. But it still kept the human firmly in the loop. The model reasoned; you acted. The model suggested; you typed. The computer did not do anything you didn't explicitly ask for.
---
## 2023–2024: The Agent Experiments
In March 2024, Cognition AI launched Devin with a claimed 13.86% on SWE-bench — more than double anything that had come before — and a press release that called it "the world's first AI software engineer." The backlash was immediate and partly warranted: independent researchers found the methodology questionable and real-world performance disappointing. But the significance of the moment had nothing to do with Devin's actual capabilities. It had to do with the framing.
For the first time, a serious company shipped a product positioned not as a tool for engineers but as a replacement agent. The Overton window shifted. "AI software engineer" stopped being science fiction and started being a product category.
Cursor launched around the same time as an AI-first fork of VS Code, and it was genuinely good. Context-aware edits, inline chat, codebase indexing — it pushed the IDE model further than Copilot had. Developers who lived in VS Code found it transformative. The model had also improved dramatically: Claude 3 Sonnet and Opus raised the quality ceiling on what an AI could reason about code.
But Cursor's architecture made a bet: that the right interface for AI-assisted development was still the IDE. That developers would stay in their editor, and the AI would work within that frame. It was a defensible bet. It was also, I'd argue, a ceiling.
---
## 2024–2025: The Terminal-Native Rupture
Claude Code launched in early 2025 and it was architecturally different from everything that came before. Not marginally different — structurally different. It ran in the terminal. It had no IDE dependency. It could read your entire repository, plan across files, run tests, interpret the output, iterate, and complete a multi-step task without asking for confirmation at every turn.
The IDE-vs-terminal debate that followed was widely misread as a UI preference war. It was not. It was a debate about who holds the steering wheel.
In Copilot, in Cursor, the human is always in the critical path. You accept or reject suggestions. You trigger actions. The model is a very powerful tool you're operating. In Claude Code — especially after MCP shipped in late 2024 — the model can hold the plan across a long-horizon task. You can describe what you want, walk away, and come back to a pull request. The human is a supervisor, not an operator.
MCP (Model Context Protocol) deserves more credit than it gets in this story. Shipping in late 2024, it gave Claude Code — and any conforming agent — a standardized way to plug into external tools: databases, APIs, file systems, CI pipelines. By mid-2025 it had 97 million downloads. MCP turned Claude Code from a capable terminal agent into an extensible platform.
SWE-bench Verified hit roughly 60% in this period, up from under 20% two years earlier. The benchmark was moving fast enough that researchers started debating whether it was still measuring the right thing.
---
## 2025–2026: The Agentic Era Is Not Coming — It's Here
Claude Opus 4.7 scores 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro. The Stanford AI Index 2026 notes that SWE-bench Verified is approaching the human performance baseline. Google announced at Cloud Next 2026 that over 75% of its new code is AI-generated. Claude Code crossed $2.5B ARR. Managed Agents, Code Review GA, and Agent Teams shipped.
Let me sit with those numbers for a moment, because if you'd shown them to the engineer who posted that Copilot screenshot in 2021, they would have assumed you were describing a dystopian film.
The workflow that's emerging — not at some companies, but at most serious engineering organizations — looks like this: the engineer writes a spec or describes a task, an agent implements it, runs the tests, opens a PR, and flags edge cases for human review. The engineer's primary interface is no longer the editor. It is increasingly the spec, the review, the judgment call on ambiguity.
This is not the elimination of software engineers. It is the elimination of a large fraction of the work software engineers have historically done. The implementation layer is being automated. What remains irreducibly human is the part that was always undervalued: understanding why the system should exist, what it should do in cases the spec didn't anticipate, and whether the thing the agent built is actually what the business needed.
---
## What Is the Software Engineer's Irreducible Role?
I don't have a clean answer, and I'm suspicious of people who do.
The honest version is that the industry is mid-restructuring and anyone claiming to know the stable endpoint is extrapolating from incomplete evidence. What I can say is that the engineers thriving right now are the ones who have shifted their leverage point. They are writing fewer lines and making more consequential decisions per day. They are treating AI agents as junior engineers who need clear requirements, good test coverage to catch regressions, and explicit feedback loops — not as autocomplete on steroids.
The engineers struggling are the ones who experienced Copilot as the destination and Cursor as the upgrade, and don't understand why they feel like they're falling behind despite using good tools. The tools are good. But they were optimized for a paradigm that is being superseded.
Five years ago, the question was whether autocomplete was cheating. Today, the question is what judgment, taste, and systems thinking look like when implementation is nearly free. That is a much better question to be asking. The industry took five years to get here. I don't think it'll take five more to find the answer.
---
### Sources
- [GitHub Copilot General Availability](https://github.blog/news-insights/product-news/github-copilot-is-generally-available-to-all-developers/) — GitHub Blog, June 2022
- [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770) — Princeton NLP, October 2023
- [Introducing Devin](https://cognition.ai/blog/introducing-devin) — Cognition AI, March 2024
- [Stanford AI Index Report 2026](https://aiindex.stanford.edu/report/) — Stanford HAI
- Google Cloud Next 2026 — AI-generated code keynote announcement
- Anthropic Claude Code ARR reporting, 2025–2026
---
# 200,000 MCP Servers Have a Command Injection Problem Nobody Told You About
URL: https://sdd.sh/2026/05/mcp-stdio-security-200k-servers-exposed/
Date: 2026-05-17
Updated: 2026-05-17
Tags: mcp, security, stdio, command-injection, agentic-workflows, claude-code
Categories: AI Tools
Summary: An Ox Security audit published in May 2026 found that STDIO transport — used by over 200,000 MCP servers — has no execution boundary and no input sanitization, leaving it wide open to command injection via malicious tool responses. Separately, 7,000+ MCP servers are running on public IPs with zero authentication. This is the third distinct MCP security crisis in 2026, and the most fundamental one yet.
Two hundred thousand. That's the floor estimate for how many MCP servers are running the STDIO transport — the original, simplest, fastest way to wire an AI agent to a local tool. Ox Security published an audit in May 2026 that lands an uncomfortable punch: STDIO has no execution boundary, no input sanitization, and no meaningful defense against a malicious server response injecting arbitrary shell commands through the pipe. The fix exists. Most operators don't know they need it.
This isn't a panic post. The sky isn't falling on MCP as a protocol. But the ecosystem grew from zero to 97 million downloads in 18 months, and security was the tax deferred. That bill is coming due in quarterly installments.
## What STDIO Transport Is and Why It's Vulnerable
STDIO (standard input/output) is how MCP servers were originally designed to work: the host process launches the server as a child process and communicates by writing JSON to stdin and reading responses from stdout. No network stack. No ports. No HTTP overhead. For local tools — file readers, shell executors, database clients — it's elegant and fast.
The problem is that STDIO was designed for a trust model that no longer holds. The original use case assumed the server was a trusted local binary. When an agent connects to a third-party MCP server, retrieves tool definitions, and starts executing those tools, the server's responses flow back through the same pipe — and STDIO has no concept of "this response is trying to do something it shouldn't."
A malicious tool response can embed shell metacharacters, newline injections, or control sequences that, when processed by the host's STDIO handler, execute arbitrary commands in the context of whatever user launched the agent. There's no sandbox. There's no escaping layer. The pipe is just a pipe.
The Ox Security audit identified two compounding failures: the command injection vector itself, and a separate finding that over 7,000 MCP servers are running on public IP addresses with no authentication layer at all. Those aren't local tools. They're internet-exposed services — some of them apparently production deployments — with the attack surface of a 1990s telnet daemon.
## Three MCP Security Crises, One Pattern
2026 has handed the MCP ecosystem three distinct security crises. Laid out in chronological order, the pattern is hard to ignore.
**April 2 — OAuth mix-up attacks on HTTP transport.** The MCP Dev Summit NYC surfaced findings that 43% of MCP servers using HTTP transport had OAuth implementation flaws. Mix-up attacks — where a malicious authorization server tricks a client into sending tokens to the wrong endpoint — were demonstrated as practical exploits against real deployments. The summit framed this as the dominant unsolved problem in the ecosystem.
**April 7 — CVE-2026-21852: CLAUDE.md supply-chain poisoning.** A critical vulnerability in Claude Code's config parser allowed a malicious `CLAUDE.md` file to bypass deny rules and execute arbitrary commands. Attackers could embed a payload just past the parser's invisible 50-subcommand cap. The bug was patched, but [as covered here at the time](/posts/claude-code-cve-2026-claudemd-supply-chain-attack/), the underlying attack surface — agentic tools trusting project configs from untrusted repos — is structural.
**May 2026 — STDIO command injection at 200K-server scale.** This one.
Each crisis hit a different layer of the stack: the client config layer, the HTTP authentication layer, and now the core transport layer. If that feels like an escalating audit of the entire protocol surface, that's because it is. Researchers are working their way down from application-level misconfigurations toward protocol fundamentals. STDIO is as fundamental as it gets.
## What the Ox Security Audit Actually Found
The audit's core finding is a category of vulnerability, not a single CVE. Any MCP server running STDIO transport that processes tool responses without strict output sanitization is potentially injectable. The specific mechanics involve:
- **No execution boundary**: STDIO doesn't distinguish between "this is data" and "this is a command." The host process reads the pipe and acts on it. A response that embeds shell control characters, environment variable expansions, or newline-delimited subcommands can escape the expected data context.
- **No input sanitization in the spec**: The MCP specification does not mandate sanitization of tool responses before they reach the execution layer. Individual implementations may add it. Many don't.
- **Scale through defaults**: STDIO is the default transport in the majority of MCP tutorials, starter templates, and quickstart guides. Developers reaching for their first MCP integration almost always land on STDIO. That's why the number is 200,000 and not 2,000.
The 7,000 internet-facing servers with no authentication are a separate but related failure. These appear to be teams that stood up an MCP server for production use and either didn't implement authentication or actively disabled it because it was friction. An unauthenticated STDIO server exposed on a public IP is exactly as dangerous as it sounds.
## Who's Actually at Risk
Let's be precise about threat models, because "200K servers" sounds maximally alarming and the reality is more segmented.
**Enterprise teams using third-party MCP servers are the primary risk.** If your engineering team connects Claude Code or another agentic tool to an MCP server you didn't write — a vendor integration, an open-source community server, a marketplace-listed tool — that server's responses flow through STDIO. You're trusting the server operator's sanitization. You probably shouldn't.
**Developers running public-facing MCP servers** — especially the 7,000 on public IPs — are directly exploitable without even needing to compromise a tool response. A network-level attacker who can send arbitrary data to those servers doesn't need injection tricks.
**Casual local users** running a small set of personally maintained STDIO servers against their own toolchain are at lower risk. If you wrote the server, you know what it returns. The injection vector is still theoretically present, but the practical attack path requires you to have already compromised yourself.
Claude Code's own MCP implementation is not the vulnerability here. The attack surface is the third-party MCP servers you connect to. Claude Code is the client; if a connected server is malicious or compromised, STDIO gives that server a lever into your execution environment.
## What to Do Right Now
**1. Audit your MCP server inventory.** List every MCP server your agents connect to. For each one: who operates it? Do you trust their sanitization practices? Is it running STDIO or HTTP+SSE?
**2. Prefer HTTP+SSE transport for production.** The HTTP transport with Server-Sent Events moves the response channel out of the STDIO pipe and into a structured HTTP response layer where you can apply standard web security controls. It has its own auth problems — see the April summit findings — but command injection via response data is not one of them.
**3. Add strict input validation at the client layer.** If you're running STDIO servers you can't immediately migrate, validate and escape tool responses before they touch anything that executes code. Treat every server response like untrusted user input, because that's exactly what it is.
**4. Firewall your STDIO servers.** If an MCP server has no business being on the network, block it at the firewall level. STDIO was designed for localhost. Anything listening on a public IP without authentication is a misconfiguration, not a deployment.
**5. Watch the MCP roadmap for the spec-level fix.** The [2026 MCP roadmap](https://blog.modelcontextprotocol.io) includes security hardening as a stated priority. Anthropic and the Agentic AI Foundation are aware of these findings. Spec-level sanitization requirements and transport security guidelines are in progress. Stay close to SDK updates over the next 60 days.
## The Cost of Velocity
MCP reached 97 million downloads in roughly 18 months. That's not just fast — it's "internet infrastructure-level growth while still in RFC status" fast. The HTTP OAuth vulnerability, the supply-chain config poisoning, and now STDIO command injection are three different facets of the same root cause: the ecosystem moved at product velocity while the security model was still being designed.
That's not a reason to stop using MCP. The protocol solves a real problem — standardized tool integration for AI agents — and it solves it well enough that OpenAI adopted it, the Linux Foundation governs it, and enterprises are betting production workloads on it. The architecture is sound. The implementation surface is still being hardened.
But if you're a CTO or a tech lead making decisions right now about which MCP servers your agents connect to, treat third-party STDIO servers with the same scrutiny you'd apply to a third-party binary you're running as root. Because at the moment, the trust model is approximately that loose.
The fix is not complex. The audit trail is public. The time to act is before your agent does something you didn't ask it to.
---
**Sources**
- Ox Security MCP STDIO vulnerability audit, May 2026 — [VentureBeat coverage](https://venturebeat.com)
- [MCP 2026 roadmap — blog.modelcontextprotocol.io](https://blog.modelcontextprotocol.io)
- The New Stack on MCP production readiness
- [CVE-2026-21852: The CLAUDE.md Supply-Chain Attack](/posts/claude-code-cve-2026-claudemd-supply-chain-attack/) — prior coverage on this blog
- [MCP Dev Summit NYC 2026: Authentication Is the Crisis](/posts/mcp-dev-summit-nyc-2026-auth-scale-openai/) — prior coverage on this blog
---
# ServiceNow Build Agent Goes Everywhere: Enterprise MCP Governance for Every AI Coding Tool
URL: https://sdd.sh/2026/05/servicenow-build-agent-ga-mcp-governance-enterprise/
Date: 2026-05-16
Updated: 2026-05-16
Tags: servicenow, mcp, enterprise, claude-code, cursor, windsurf, governance, agentic-workflows
Categories: AI Tools, Industry
Summary: ServiceNow made Build Agent generally available at Knowledge 2026, extending its core skills into Claude Code, Cursor, Windsurf, GitHub Copilot, OpenAI Codex, and Antigravity via MCP — with enterprise governance, OAuth, audit trails, and a real-time AI Gateway baked in by default. It's the model for how enterprise platforms will integrate with the agentic coding ecosystem.
At Knowledge 2026, ServiceNow made a decision that matters more than any single product announcement: instead of building a proprietary coding agent and asking developers to switch, they shipped into the tools developers already use.
Build Agent is now generally available in ServiceNow Studio, Cursor, Windsurf, Claude Code, GitHub Copilot, OpenAI Codex, and Antigravity — the full roster of mainstream AI coding tools — with the same enterprise governance applied regardless of where the code gets written. It's a governance layer on top of the agentic coding ecosystem, not a replacement for it.
---
## What Build Agent Actually Does
Build Agent started as an AI assistant for building ServiceNow applications inside ServiceNow Studio. Its job was to accelerate development of scoped apps that run on the Now Platform — helping developers scaffold data models, workflows, integrations, and UI components without writing every line from scratch.
The General Availability announcement at Knowledge 2026 is not a rebrand. It's an expansion: Build Agent's core skills now work as an MCP server that any compatible coding tool can invoke. When a developer working in Claude Code or Cursor needs ServiceNow context — platform APIs, data schema, security roles, workflow models — the Build Agent MCP server provides it directly, without switching to ServiceNow Studio.
The workflow looks like this: you write code in your preferred tool, Build Agent provides ServiceNow-aware context and validation, and when you're ready to ship, you export to ServiceNow Studio like any other scoped app. Governance, security roles, and data model enforcement happen at export time — applied by the platform, not the developer's discipline.
This is the right design. It accepts that developers will use their preferred tools and puts governance at the platform boundary rather than at the tool boundary.
---
## The MCP Server: Included by Default
The ServiceNow MCP Server is **generally available and included in every Now Assist and AI Native SKU** — no separate license required. For organizations already running ServiceNow in production, this is a meaningful shift: the MCP integration arrives in the next contract renewal, not as a separate line item.
The MCP Server Console provides enterprise controls that matter to the buyers making these decisions:
- **AICT governance**: AI Control Tower integration for centralized agent observability
- **Consumption metering**: per-request tracking of what every agent is consuming from the platform
- **Managed OAuth**: enterprise-grade authorization without each developer managing their own credentials
- **Audit trails**: complete logs of which agent made which platform call, when, and from which tool
- **Session management**: agent session lifecycle controls that match how enterprises think about access
- **Role-based tool packages**: different tool sets for different developer roles, controlled by platform administrators
For comparison: most MCP server deployments in production today are developer-managed, with minimal observability and ad-hoc access control. The ServiceNow MCP Server Console is the first production-grade, enterprise-class control plane for MCP I've seen from a major platform vendor.
---
## Action Fabric and the AI Gateway
Beyond Build Agent, ServiceNow announced **Action Fabric** — a governed access layer that lets AI agents invoke ServiceNow's full system of action directly, without a human opening a browser or running a workflow manually.
The practical meaning: when Claude Code or a Managed Agent needs to create a ticket, update a CMDB record, trigger an approval workflow, or escalate an incident, Action Fabric provides a headless API surface with ServiceNow's full governance stack applied. Agents get the same access a human ServiceNow administrator would have, with the same audit trail and the same role-based constraints.
The **AI Gateway** is the runtime control layer on top of this. It provides real-time controls for agentic workloads — rate limiting, policy enforcement, circuit breakers for runaway agents — along with observability and security for traffic flowing across any third-party AI system. This is how an enterprise IT team monitors what 200 developers' coding agents are doing to the production platform at 2 AM.
Build Agent also connects outward as an MCP Client, pulling context from external tools: design specs from Figma, requirements from Miro, code context from GitHub. The same governance that applies to outbound ServiceNow calls applies to these inbound integrations — everything flows through the AI Gateway.
---
## Why This Architecture Wins Enterprise
The conventional enterprise software playbook for AI is to build a first-party AI assistant and ask developers to use it exclusively. ServiceNow could have done that. They chose not to, and the reasons are instructive.
Developer tool preferences are high-stakes and sticky. Telling a team of engineers who have spent months building Claude Code workflows, CLAUDE.md configs, and MCP integrations that they need to switch to a ServiceNow-specific coding interface is a losing argument. It doesn't matter how good the ServiceNow interface is — the switching cost is real and the resentment is reliable.
The alternative — embed your governance into the tools developers already use — solves the adoption problem by eliminating it. Build Agent doesn't compete with Claude Code. It extends it.
This is also a governance story, not a capability story. The ServiceNow MCP Server doesn't make Claude Code smarter. It makes Claude Code's interactions with the ServiceNow platform auditable, compliant, and centrally observable. That's what enterprise IT buyers actually need to approve a deployment. Capability is table stakes; compliance is the procurement blocker.
The MCP standard is what makes all of this possible. By building against a standardized protocol rather than tool-specific integrations, ServiceNow's governance layer works across Claude Code, Cursor, Windsurf, Copilot, and Codex simultaneously. New tools that implement MCP inherit the integration automatically.
This is the MCP ecosystem flywheel working as intended: tool vendors invest in MCP compliance, platform vendors invest in MCP servers, and developers get governed access to enterprise systems from their preferred environment. Nobody wins by building a private integration ecosystem anymore.
---
## Anthropic Models Inside ServiceNow
One detail from the announcement worth noting: Build Agent on the ServiceNow AI Platform is now powered by Anthropic models. The specific benefit cited is longer context sessions — developers can work through entire application builds without losing continuity.
This is the enterprise distribution story Anthropic has been building toward. Claude doesn't have to be the interface that developers see; it can be the reasoning engine inside platforms they already use. ServiceNow joining that list (alongside Amazon Bedrock, Google Vertex AI, Azure AI Foundry) reinforces the pattern: Anthropic sells capability, partners sell workflow integration.
For Claude Code users, the practical implication is coherence. When you use Build Agent skills from within Claude Code, the underlying model driving ServiceNow's guidance and Claude Code's autonomous execution is coming from the same lab. That's alignment in the literal sense — the models share the same training lineage and capability profile, which reduces the kind of instruction drift that happens when heterogeneous AI systems try to collaborate.
---
## The Governance Gap in Today's MCP Deployments
Most teams using MCP in production today are running it without any of the controls the ServiceNow MCP Server Console provides. MCP servers are typically developer-deployed, with access granted by API key, no consumption metering, no centralized audit trail, and no role-based access control.
This works fine for small teams. It does not work fine for a 50,000-person enterprise where AI agents are making calls to production CRM, ITSM, and financial systems on behalf of 2,000 developers. The audit question — *which agent called this endpoint, when, and with whose authorization?* — is currently unanswerable in most MCP deployments.
ServiceNow has answered it. The AI Gateway and MCP Server Console are the first enterprise-grade answer I've seen to the MCP governance gap. If this architecture gets replicated by Salesforce, SAP, and Workday — which it should — it will become the standard pattern for how enterprise platforms integrate with the agentic coding ecosystem.
---
## What to Watch
The MCP Server is GA and in production. The AI Gateway additional features are planned for H2 2026. Devin integration for Windsurf is also planned for H2 2026, which would mean ServiceNow Build Agent running through a Windsurf + Devin autonomous session — governed by the AI Gateway, audited by the MCP Server Console. That's a plausible production architecture by end of year.
The market question is whether other enterprise platforms move on this timeline or wait for the pattern to mature. Given that ServiceNow is the first major platform vendor to ship enterprise-grade MCP governance, they have a meaningful window to define what enterprise MCP integration looks like before the standard gets set by committee.
---
**Sources:**
- [ServiceNow Build Agent now works inside every major AI coding tool, governed by default — Business Wire](https://www.businesswire.com/news/home/20260506008934/en/ServiceNow-Build-Agent-now-works-inside-every-major-AI-coding-tool-governed-by-default)
- [ServiceNow opens its full system of action to every AI Agent in the enterprise — ServiceNow Newsroom](https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-opens-its-full-system-of-action-to-every-AI-Agent-in-the-enterprise/default.aspx)
- [ServiceNow Knowledge 2026 — AI Control Tower expands, Autonomous Workforce reaches every function — Diginomica](https://diginomica.com/servicenow-knowledge-2026-ai-control-tower-expands-autonomous-workforce-reaches-every-function-and)
- [ServiceNow AI Governance Push: Knowledge 2026 — CX Today](https://www.cxtoday.com/security-privacy-compliance/servicenow-ai-agent-governance-knowledge-2026/)
- [ServiceNow Wants to Be the Control Layer for Every AI Agent in the Enterprise — Reworked](https://www.reworked.co/digital-workplace/servicenow-launches-action-fabric-major-overhaul-of-ai-control-tower/)
- [Building ServiceNow apps via Claude Code and the ServiceNow SDK — ServiceNow Community](https://www.servicenow.com/community/developer-advocate-blog/building-servicenow-apps-via-claude-code-and-the-servicenow-sdk/ba-p/3525677)
- [ServiceNow MCP Integration with Claude Code — Composio](https://composio.dev/toolkits/servicenow/framework/claude-code)
---
# Grok Build: xAI's First Coding Agent Has Eight Parallel Agents, a Privacy-First Architecture, and One Major Problem
URL: https://sdd.sh/2026/05/grok-build-xai-coding-agent-arena-mode/
Date: 2026-05-16
Updated: 2026-05-16
Tags: xai, grok, coding-agents, terminal-native, benchmarks, claude-code
Categories: AI Tools, Industry
Summary: xAI launched Grok Build on May 14 — a terminal-based coding agent with 8 parallel sub-agents, Arena Mode automated evaluation, and a local-first privacy model that sends zero codebase data to xAI servers. It scores 70.8% on SWE-bench Verified at $0.20/M tokens. Here's what it gets right, what's missing, and how it stacks up against Claude Code.
xAI launched Grok Build on May 14 — Elon Musk's first serious move into the AI coding agent market that Anthropic, OpenAI, and Google have been fighting over for the past year. It's an early beta, exclusive to SuperGrok Heavy subscribers, and it has a genuinely interesting architecture. It also has a benchmark score that's 17 points behind the current leader, and its most-hyped feature isn't live yet.
Here's what Grok Build actually is, what it gets right, and why Claude Code users shouldn't be canceling their subscriptions this week.
---
## What Grok Build Is
Grok Build is a CLI-based coding agent — not an IDE plugin, not a VS Code fork. You invoke it from your terminal, describe a task, and it runs. In that sense, xAI is making the same architectural bet Anthropic made: the terminal, not the editor, is the right home for a serious coding agent.
The underlying model is **grok-code-fast-1**, with a 256,000-token context window and API pricing of $0.20 per million input tokens and $1.50 per million output tokens. The model scores **70.8% on SWE-bench Verified** — meaningful, but not frontier-level. Claude Code, running on Opus 4.7, sits at 87.6% on SWE-bench Verified and 64.3% on the harder, contamination-resistant SWE-bench Pro. GPT-5.5 Spud is at 82.7% on Terminal-Bench 2.0. Grok Build enters a competitive benchmark field where it's not yet the leader on any metric.
**Pricing:** SuperGrok Heavy at $300/month, with an introductory deal at $99/month for the first six months. API access at $0.20/$1.50 per million tokens is genuinely competitive — Anthropic's Opus 4.7 runs at $5/$25.
---
## The Architecture: Eight Agents, One Arena
The core selling point of Grok Build is multi-agent parallelism. When you hand it a task, it doesn't run a single chain-of-thought loop. It spawns up to **eight concurrent sub-agents**, each specialized across a three-stage workflow: plan, search, and build.
This is closer to Windsurf's parallel-agent model than to Claude Code's single-agent-goes-deep approach. Complex tasks get subdivided and attacked simultaneously. The theoretical benefit is wall-clock time: eight agents working a problem in parallel can compress multi-step tasks that would otherwise be sequential.
**Arena Mode** is the feature that's generated the most discussion. The concept: every agent response appears side by side, automatically scored and ranked before a developer ever reviews it. You'd get an automated evaluation layer that selects the best output from the parallel runs before it ever reaches your screen. It was confirmed in code traces as far back as February 2026.
It's not live in the early beta.
Arena Mode is a genuinely clever idea — it shifts output selection from human judgment to algorithmic scoring, which is the right direction. But until it ships, Grok Build's parallel architecture is producing eight outputs you still evaluate manually. That's a different workflow than Windsurf's side-by-side visual comparison, and it's more cognitive load than Claude Code's single-agent model, not less.
---
## Local-First Privacy: The Differentiator That Actually Works
The architecture detail that deserves the most attention is privacy. **Grok Build is local-first: no code is transmitted to xAI's servers during a session.** Computation happens on your machine. The tool is air-gap compatible once initial setup is complete.
This is a serious differentiator for regulated industries. Anthropic has answered this problem with Claude Code on Bedrock's Mantle backend (zero operator access, NitroTPM attestation), but that's an enterprise SKU that requires AWS infrastructure and setup. Grok Build's local-first model is a single-machine answer to the same problem — no cloud configuration required.
For individual developers working on proprietary codebases who don't have an enterprise Anthropic contract, local-first matters. xAI has correctly identified that "where does my code go?" is a real blocker for a meaningful portion of the addressable market.
**Plan Mode** reinforces this philosophy. Before Grok Build modifies a single file, it presents the complete execution plan — including which files it intends to change, what it will do to each, and why. You can review it, comment on individual steps, rewrite parts of it, or kill it entirely. The plan is editable; the execution doesn't start until you approve.
Claude Code users in Auto mode will recognize the tradeoff: Plan Mode adds a review gate that slows autonomous execution in exchange for control. It's the right choice for an early beta where trust hasn't been established. The question is whether that gate is still mandatory six months from now.
---
## What's Missing
Honest accounting of what Grok Build doesn't have yet:
**No MCP ecosystem.** Claude Code's 6,400+ MCP servers represent two years of community and enterprise tool integration — ServiceNow, Figma, Jira, Salesforce, GitHub. Grok Build has no equivalent ecosystem. A coding agent without tool integration is a code-writing loop, not a full development workflow.
**No CLAUDE.md equivalent.** Anthropic's project instruction system lets teams encode invariants, style guides, architecture rules, and agent behavior constraints into a file that every Claude Code session reads. It's how organizations scale consistent AI behavior across hundreds of engineers. Grok Build has no documented equivalent mechanism.
**No scheduling or cloud execution.** Claude Code Routines run on Anthropic's infrastructure — cron triggers, API webhooks, GitHub event triggers — without your machine being online. Grok Build requires an active session.
**No enterprise governance layer.** No Analytics API, no per-user spend controls, no SCIM integration, no OpenTelemetry export. For teams buying agentic coding tools at the enterprise level, these aren't nice-to-haves.
**Arena Mode isn't live.** The headline feature is coming. Features in early beta often arrive on timeline; they also sometimes don't.
---
## The Real Competitive Picture
Grok Build is positioned as a direct Claude Code competitor, and xAI has made the right architectural call by going terminal-native rather than IDE-embedded. But the current beta reveals a tool that's compelling on privacy and interesting on parallelism, while trailing significantly on benchmark performance and ecosystem depth.
70.8% SWE-bench Verified is not a bad score. Three months ago it would have been near the top of the leaderboard. Today, the frontier has moved. Claude Opus 4.7 at 87.6%, GPT-5.5 at 82.7% Terminal-Bench 2.0, and Kimi K2.6 at 58.6% SWE-bench Pro are the relevant comparisons. Grok Build's model enters a field where it needs significant improvement to be the benchmark leader, and benchmark leadership is how developers justify switching costs.
The pricing math is also complicated. $99/month introductory rate is reasonable for a SuperGrok Heavy bundle. $300/month steady-state is at the ceiling of what individual developers pay. Claude Code Max 20x costs $200/month with a substantially larger ecosystem, higher benchmark scores, and years of production hardening. Grok Build needs to close the capability gap before the introductory pricing window closes.
---
## What to Watch
xAI is not a research lab dabbling in developer tools. They have real infrastructure, real compute, and a clear commercial incentive to make Grok Build competitive. Arena Mode could be a genuine workflow innovation when it ships. The local-first privacy model is architecturally sound and serves a real market segment.
But early betas are tested by characteristics, not promises. Right now, Grok Build is a compelling idea with benchmark numbers that need to improve and a flagship feature that's coming soon. That's not unusual for a first release. It's a reason to put it on the watchlist, not the primary workflow.
Check back when Arena Mode ships and grok-code-fast-2 benchmarks. Those two data points will tell you whether xAI is serious about catching the frontier.
---
**Sources:**
- [xAI Enters the Coding Agent Race With Grok Build — DevOps.com](https://devops.com/xai-enters-the-coding-agent-race-with-grok-build/)
- [Grok Build Early Beta: 6 Ways xAI's New AI Coding Agent Plans to Take on Claude Code — Techloy](https://www.techloy.com/grok-build-early-beta-6-ways-xais-new-ai-coding-agent-plans-to-take-on-claude-code/)
- [xAI Grok Build: Multi-Agent Arena Mode Redefines AI Coding — AI2Work](https://ai2.work/blog/xai-grok-build-multi-agent-arena-mode-redefines-ai-coding/)
- [xAI Unveils Grok Build: An Agentic AI Coding Tool to Take on OpenAI, Google & Anthropic — AndroidHeadlines](https://www.androidheadlines.com/2026/05/xai-grok-build-agentic-ai-coding-tool-launch-beta.html)
- [Grok Build: xAI's Agentic Coding CLI Takes On Claude Code — Pasquale Pillitteri](https://pasqualepillitteri.it/en/news/2584/grok-build-xai-cli-2026)
- [Grok Build CLI: xAI's Answer to Claude Code — Beginners in AI](https://beginnersinai.org/grok-build-cli/)
---
# AI is Finding 20-Year-Old Bugs Everywhere. Your Stack Is Next.
URL: https://sdd.sh/2026/05/ai-cve-surge-open-source-2026/
Date: 2026-05-16
Updated: 2026-05-16
Tags: security, CVE, open-source, postgresql, AI, vulnerability-discovery, linux-kernel, spring
Categories: Industry, AI Tools
Summary: PostgreSQL fixed 11 CVEs in its May 2026 release — unusually high for a project that typically ships 1–4 per quarter. Spring went from 17 CVEs in all of 2025 to 30 in two months. Chrome is up 563% year-to-date. This isn't a code quality crisis. It's AI-assisted vulnerability discovery, and it's systematically sweeping every major open-source project.
Last week, PostgreSQL shipped versions 18.2, 17.8, 16.12, 15.16, and 14.21. Eleven security vulnerabilities fixed in a single quarterly release. For context: PostgreSQL typically ships one to four CVEs per release. The project has a 30-year track record of quiet, disciplined engineering. Eleven is not normal.
But it's not an anomaly either. It's the new baseline.
## The Numbers Across the Stack
PostgreSQL is one data point in what NIST now confirms is a structural shift. CVE submissions in Q1 2026 were **33% higher** than Q1 2025 — and 2025 was already a record year. NIST enriched nearly 42,000 CVEs in 2025, more than any prior year, and still could not keep pace with submissions.
The per-project numbers are harder to ignore:
| Project | Change (YTD 2026) |
|---|---|
| Chrome | +563% |
| GitHub | +476% |
| Apache | +170% |
| Mozilla | +157% |
| Spring Framework | 17 CVEs in all of 2025 → 30 in 2 months of 2026 |
| Linux kernel | 3 local-root privilege escalation CVEs in the same code area, weeks apart |
Spring Security released emergency patches on April 21, 2026 fixing multiple CVEs, including an infinite recursion OOM in Spring Cloud Function and a filter-expression injection in Spring AI. The Linux kernel disclosed *Copy Fail* (CVE-2026-31431), then *Dirty Frag* (CVE-2026-43284 / CVE-2026-43500), then *Fragnesia* (CVE-2026-46300) — three separate local-privilege-escalation vulnerabilities in related kernel code, each allowing any unprivileged user to reach root via a public proof-of-concept, each disclosed within weeks of the last.
This is not the fingerprint of a sudden regression in code quality. These projects haven't gotten worse. The tooling for finding what was already broken has gotten dramatically better.
## AI Found the Bugs. AI Is Also Looking for Them on the Other Side.
The proximate cause is well-documented at this point. CSO Online reported in early 2026 that AI tooling had uncovered 20-year-old bugs in PostgreSQL and MariaDB — latent vulnerabilities that had been sitting in plain sight through dozens of human security audits. In April 2026, Anthropic disclosed that Claude Mythos Preview had identified thousands of zero-day vulnerabilities across major operating systems and browsers.
The economics have inverted. A skilled security researcher running manual analysis might audit one component of one project in a week. An AI model can sweep an entire codebase in minutes, flag plausible vulnerability patterns across every execution path, and do it again tomorrow after the next commit. Every major open-source project is now subject to continuous, automated re-examination at a scale that would have required a large, dedicated red team a year ago.
The bugs being found are real. These aren't false positives — the PostgreSQL CVEs carry CVSS scores of 8.2 to 8.8. The pgcrypto heap buffer overflow (CVE-2026-2005), the intarray arbitrary code execution (CVE-2026-2004), the pg_trgm heap overflow (CVE-2026-2007) — all high-severity, all in extensions that have been shipped and trusted for years.
The uncomfortable flip side: the same AI capability that finds these bugs can be used to weaponize them. Barracuda Networks' May 2026 threat report documents a measurable collapse in the time between CVE disclosure and functional exploit availability. The exploit window — historically measured in weeks — is now measured in hours for well-documented vulnerabilities. AI doesn't just find the bug; it can write the PoC faster than the patch reaches most production systems.
## The Triage Crisis Nobody Planned For
Here is the operational problem that doesn't make headlines: the humans responsible for validating and fixing these vulnerabilities were not resourced for this volume.
Most major open-source projects are maintained by small teams — often partially or entirely volunteers. PostgreSQL, Spring, and the Linux kernel are better-resourced than most, but even they are absorbing a materially higher triage load with the same team sizes. For the thousands of smaller open-source projects that underpin the modern stack, the math is worse.
A CVE report is not a fix. It's a claim that requires validation: Is this actually exploitable? Under what conditions? Does the proposed patch address root cause or just the reported surface? The cost of generating a vulnerability report with AI has dropped to near-zero. The cost of verifying one has not changed.
Security teams downstream are experiencing this as an advisory flood. Two-thirds of security teams in ProjectDiscovery's 2026 AI Coding Impact Report are already spending more than half their time manually triaging AI-generated findings rather than remediating them. That was before the current CVE surge hit its current rate.
## What This Means If You Run Production Software
The practical implications are not subtle.
**Your patching cadence is now wrong.** If you're on quarterly patch cycles, you are structurally behind. PostgreSQL shipped 11 CVEs with CVSS scores up to 8.8. Linux had a local-root exploit with a public PoC. Both in May 2026. If you patched in March and your next window is June, you have a gap.
**Extensions and embedded dependencies are the attack surface.** The PostgreSQL CVEs weren't in the core engine — they were in pgcrypto, intarray, and pg_trgm. The Spring CVEs included Spring AI and Spring Cloud Function. AI vulnerability discovery is thorough: it doesn't skip the extension ecosystem the way human auditors sometimes do. Your threat surface is larger than your primary dependency list.
**AI-generated code is being scanned by the same tools.** If 51% of GitHub commits in 2026 are AI-assisted, and AI models generate code that contains OWASP top-10 vulnerabilities at a high base rate, then the CVE surge isn't only about old bugs in legacy code. It's also about new bugs in recently shipped AI-generated features. Both populations are being scanned simultaneously.
**The time between disclosure and exploit is now too short for slow response.** When a public PoC for a local-root Linux vulnerability is available within hours of CVE publication, the margin for "we'll patch it in the next maintenance window" is gone. Automated patching infrastructure — KernelCare, live patch pipelines, dependency bots — stops being a nice-to-have and becomes a baseline requirement.
## The Correct Response Is Not Panic
None of this argues for slowing down your stack or auditing it into paralysis. The bugs being found are real, but most of them are also patchable. The CVE surge is, in a meaningful sense, good news: these vulnerabilities existed before AI started finding them. The only thing that changed is that we now know about them.
The practical response is architectural:
**Treat your dependency update pipeline as infrastructure, not maintenance.** Renovate, Dependabot, automated patch PRs — these should be running continuously and merging on green CI. A project with a working automated update pipeline will absorb the CVE surge without additional human load. A project that patches manually on a quarterly schedule will not.
**Scope your exposure by extension and plugin.** The PostgreSQL and Spring CVEs were concentrated in optional extensions that not everyone uses. Before a patch is available, the fastest risk reduction is confirming whether the vulnerable component is actually deployed in your environment. pgcrypto, intarray, pg_trgm — if you don't use them, disable or remove them.
**Build agentic security review into the generation loop.** If AI is generating a meaningful fraction of your code, the same AI capability that finds old vulnerabilities can review new ones. A Claude Code pre-commit hook running security-focused static analysis isn't a future aspiration — it's a deployable pattern today. AI-generated code with an AI security reviewer in the loop produces fewer vulnerabilities than human-reviewed AI code, because the reviewer doesn't fatigue.
**Monitor disclosure feeds, not just release notes.** The time between CVE publication and patch availability can be hours for some projects. If your threat intelligence is "wait for the vendor release email," you're reading about exploits after the fact. NIST NVD, VulnCheck, and OpenCVE all offer real-time feeds that can be piped into automated triage workflows.
## The Broader Shift
The CVE surge is the security industry's version of the broader AI acceleration pattern: AI is increasing the rate at which consequential things happen, in both directions. Code gets written faster. Bugs get found faster. Exploits get developed faster. Patches need to ship faster.
The organizations that will absorb this well are the ones that have already automated the low-value, high-frequency work: dependency updates, basic security scanning, patch deployment. The ones that will struggle are the ones whose security posture still depends on human reviewers moving at human speed against a threat surface that is now being probed at machine speed.
Your stack is being scanned right now. Whether the results show up in a responsible disclosure report or in an attacker's toolbox first depends partly on luck and partly on how fast your patching infrastructure runs.
Probably a good time to find out which one you have.
---
**Sources:**
- [PostgreSQL 18.2, 17.8, 16.12, 15.16, and 14.21 Released — postgresql.org](https://www.postgresql.org/about/news/postgresql-182-178-1612-1516-and-1421-released-3235/)
- [AI finds 20-year-old bugs in PostgreSQL and MariaDB — CSO Online](https://www.csoonline.com/article/4167137/ai-finds-20-year-old-bugs-in-postgresql-and-mariadb.html)
- [AI Vulnerability Discovery and the Open Source CVE Surge — Security Boulevard](https://securityboulevard.com/2026/05/ai-vulnerability-discovery-and-the-open-source-cve-surge/)
- [The First CVE Wave: AI-Assisted Vulnerability Discovery — VulnCheck](https://www.vulncheck.com/blog/ai-assisted-vulnerability-discovery)
- [30 CVEs in Two Months: What the Spring Numbers Tell Us — HeroDevs](https://www.herodevs.com/blog-posts/30-cves-in-two-months-what-the-spring-numbers-tell-us-about-the-future-of-open-source-security)
- [Dirty Frag Linux Kernel CVEs — TuxCare](https://tuxcare.com/blog/dirty-frag-cve-2026-43284-cve-2026-43500-kernelcare-live-patches-released/)
- [Fragnesia CVE-2026-46300 — AlmaLinux](https://almalinux.org/blog/2026-05-13-fragnesia-cve-2026-46300/)
- [AI-Driven Vulnerability Discovery and Exploit Trends — Barracuda Networks](https://blog.barracuda.com/2026/05/15/CVE-surge-patch-diff-exploitation-vendor-targeting-trends)
- [NIST CVE Prioritization as AI Speeds Up Discovery — Penligent](https://www.penligent.ai/hackinglabs/nist-cve-prioritization-as-ai-speeds-up-vulnerability-discovery/)
---
# Why Engineers Are Writing Specs in HTML (And When You Should Too)
URL: https://sdd.sh/2026/05/html-specs-structured-machine-readable/
Date: 2026-05-15
Updated: 2026-05-15
Tags: spec-driven-development, html, ai-agents, specs, claude-code, structured-data
Categories: Spec-Driven Development, Guides
Summary: A growing number of engineering teams are ditching Markdown for HTML when writing specs — not because they enjoy writing more verbose documents, but because HTML's semantic structure gives AI agents significantly richer context when implementing from a spec. Here is where the tradeoff makes sense and how to do it well.
Markdown is the default format for everything in software engineering: README files, wikis, ADRs, specs. It is frictionless, readable in any text editor, and renders beautifully in GitHub. For most purposes it is perfectly fine.
But "perfectly fine" is not the same as "optimal for machine consumption." When you are practicing Spec-Driven Development — writing a spec and handing it to an AI agent to implement — the format of that spec is not a cosmetic detail. It is load-bearing infrastructure.
A growing number of teams are discovering that HTML, specifically semantic HTML, is a better substrate for complex specs. Not because HTML is fun to write, but because the semantic signal it carries meaningfully changes what an AI agent can infer from the document.
## The Semantic Gap Between Markdown and HTML
Consider two ways of marking up the same content.
In Markdown:
```markdown
# Auth
## Token format
...
## Refresh logic
...
```
In semantic HTML:
```html
```
The Markdown version is a flat list of headings. The HTML version is a graph. The agent reading the HTML version knows that "Refresh logic" is a child of "Auth," that a team named `team-identity` owns this section, and that the section has been approved — not just drafted. It can also link directly to `#auth-token-format` from anywhere else in the document without ambiguity.
In a 20-page spec this distinction is academic. In a 100-page spec covering authentication, payments, notifications, compliance, and internal APIs, it becomes the difference between an agent that navigates the document purposefully and one that drifts.
## Where HTML Genuinely Wins
**Large, multi-team specs.** When more than one team owns different sections of a spec, `data-owner` attributes give you machine-readable provenance without cluttering the human-readable content. An agent generating code for the payments flow can filter the spec to only sections where `data-owner="team-payments"` and avoid pulling in noise from adjacent sections.
**Stable internal cross-references.** Markdown's internal link syntax (`[see auth](#auth)`) works, but the anchor targets are derived from heading text, which changes. In HTML, `id` attributes are explicit and stable. A spec that references `refresh behavior ` will not silently break if someone rewords the heading.
**Tabular data and API definitions.** HTML tables explicitly separate `` from `
`. A `` (definition list) is semantically perfect for API field definitions — each `` is a field name, each ` ` is its description and type. AI agents reading these elements know they are processing structured data, not prose.
**Status tracking.** `data-status="draft"` versus `data-status="approved"` versus `data-status="deprecated"` gives an agent immediate signal about which sections to implement against and which to flag for review. This is metadata that Markdown forces you to embed inline as text — where it is invisible to automated parsing.
**Long-lived living documentation.** If a spec will outlive the initial implementation and serve as the canonical reference for a system, HTML's explicit structure makes it easier to maintain, diff, and query over time.
## Practical Patterns Worth Adopting
Use `` with explicit `id` attributes for every major and minor section. Use `` for self-contained components (a single API endpoint, a single data model). Use `` for rationale, edge cases, and notes that should inform the agent but are not implementation requirements.
Use `` and `` to fold verbose reference material — exhaustive field tables, example payloads, error code lists — so the document stays navigable for human readers without hiding anything from the agent, which processes the full DOM.
Embed Mermaid diagrams inline inside a `` with a ``. Claude Code and most modern agents will parse the diagram source, giving them a structured representation of flows and state machines without relying on a screenshot.
For API field definitions:
```html
amount
Integer. Amount in minor currency units. Required.
currency
String. ISO 4217 code. Required.
idempotency_key
String. Client-generated UUID. Optional but strongly recommended.
```
This is unambiguous in a way that a Markdown bullet list is not.
## Where Markdown Still Wins
Do not reach for HTML when you are in the discovery phase. Early specs are supposed to be messy. You are capturing intent, not formalizing a contract. Markdown's low friction is a feature during this stage.
Markdown also wins for anything that lives primarily in a GitHub pull request — review comments, ADRs, brief technical proposals. GitHub renders Markdown natively and inline. HTML in a PR diff is hostile to reviewers.
Single-author specs with a short implementation window do not need the overhead of explicit IDs and data attributes. The structural payoff only compounds over time and across teams.
## The Honest Tradeoff
Writing specs in HTML requires more discipline upfront. You will write more characters per section heading. Your spec files will not render prettily in a default GitHub blob view. You need to either host the spec somewhere or agree that the raw HTML is the authoritative source.
In return, you get a document that an AI agent can navigate like a database, cross-reference without ambiguity, and filter by ownership or status. For a complex system being implemented by autonomous agents over multiple sessions — exactly the SDD workflow — that is a meaningful structural advantage.
The format of your spec is a protocol between you and the agent implementing it. HTML is a richer protocol than Markdown. Whether the overhead is worth it depends entirely on the complexity and longevity of what you are building.
For a greenfield microservice with two engineers and a two-week timeline, write Markdown. For a multi-team platform with a six-month implementation arc and a spec that will outlive the engineers who wrote it, the structural investment in HTML pays for itself quickly.
---
# Mythos Is Not a Cybersecurity Tool. It's a Geopolitical Weapon.
URL: https://sdd.sh/2026/05/mythos-ai-weapon-geopolitics-anthropic/
Date: 2026-05-15
Updated: 2026-05-15
Tags: anthropic, mythos, cybersecurity, geopolitics, ai-policy, claude, national-security
Categories: AI Tools, Industry
Summary: Anthropic's Mythos can autonomously find and exploit thousands of zero-day vulnerabilities across every major OS and browser. Access is tightly controlled — by a US company, for US-aligned entities. The Atlantic Council calls it more consequential than the Iran war. They're right. The US just turned AI into a cyberweapon and nobody voted on it.
The world was watching Iran. The actual inflection point happened in a press release.
In April 2026, Anthropic announced Claude Mythos Preview — a frontier AI model so capable of identifying and exploiting software vulnerabilities that the company decided it was too dangerous to release publicly. It can autonomously find zero-day flaws across every major operating system and web browser. In internal tests, 99% of the vulnerabilities it discovered were unpatched. The UK's AI Security Institute gave it expert-level hacking tasks and it succeeded 73% of the time.
Anthropic's response was to create Project Glasswing: a controlled-access program giving Mythos to a select group of companies for "defensive purposes only." Launch partners include AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Microsoft, NVIDIA, and Palo Alto Networks. Forty additional organizations were added. Anthropic committed $100M in usage credits across the effort.
Every single launch partner is either a US company or a company deeply embedded in US commercial and government infrastructure.
This is not a coincidence. This is a policy.
## The "Defensive Use Only" Fiction
Let's be precise about what Mythos can do. It can autonomously identify previously unknown vulnerabilities, generate working exploits, and carry out complex cyber operations with minimal human input. It found critical flaws in every widely used operating system and browser — flaws that survived decades of human review and millions of automated security tests.
Anthropic says Glasswing is for defense. The NSA is using Mythos anyway.
Despite the Pentagon designating Anthropic as a "supply chain risk" in March 2026, and Trump ordering federal agencies to stop using its products, the NSA is reportedly running Mythos. Several other federal agencies — including Commerce's CAISI — are circumventing the formal ban to test the model. The White House is simultaneously negotiating to give all federal agencies official access.
Let that sentence sit for a moment. An intelligence agency is using a tool that its own government officially banned, while the executive branch is quietly negotiating to un-ban it for national security purposes.
The defensive framing is doing a lot of work here. A tool that can find every zero-day in every major OS is not only a scanner. It is — by definition — also a targeting system. The line between "we found a vulnerability to patch it" and "we found a vulnerability to exploit it" is a policy decision, not a technical constraint. And those policy decisions are now being made unilaterally by a private US company and a US government that cannot agree with each other in public but are apparently aligned in private.
## Who Gets Protected. Who Doesn't.
Here is the geopolitical math that the Atlantic Council, quite correctly, flagged as more significant than the Iran war.
Mythos knows where every critical vulnerability is. Project Glasswing tells you who gets to use that knowledge.
**The UK got evaluation access** — full hands-on review via the AI Security Institute. The special relationship, in AI form.
**The EU has been denied access.** Not delayed. Denied. Anthropic skipped a European Parliament hearing on Mythos's cyber risks. OpenAI, facing the same pressure, moved to give European cybersecurity teams access to its own cyber model. Anthropic held out. The result: European banks, governments, and infrastructure operators are running systems with vulnerabilities that Mythos has already identified — and they're not in the room.
**China is explicitly shut out.** Chinese entities have sought Glasswing access and been refused. This is the least surprising part of the story — and the most consequential. China has significant AI capability and is developing its own equivalent. When that model reaches parity with Mythos, the asymmetry inverts.
**Japan is scrambling.** Prime Minister Sanae Takaichi ordered an emergency cabinet-level cybersecurity review specifically citing Mythos. The message to Japan's government was clear: a private American company has a model that can compromise your infrastructure, you have no access to it, and you need to figure that out on your own.
This is access-as-geopolitics. It is not subtle. The decisions Anthropic is making about who can use Mythos are the functional equivalent of an arms export policy — except there is no Arms Export Control Act governing it, no State Department license required, no congressional oversight, and no international treaty framework in play.
A private company incorporated in Delaware is deciding which nation-states get to defend their digital infrastructure.
## The Weaponization Is Already Done
The question "will the US use AI as a weapon?" has been answered. The next question is "against whom and when?"
Consider the position this creates. The US government has — through a mix of formal and informal channels — effective access to a tool that can identify exploitable vulnerabilities in the digital infrastructure of any country on earth. The offensive applications are not theoretical. They are the same capabilities that the defensive framing describes, run in the other direction.
Stuxnet was a cyberweapon that required years of development and targeted a single facility. Mythos can autonomously identify attack surfaces across global critical infrastructure. The delta between those two capabilities is not incremental. It is categorical.
The Trump administration's behavior tells you what the posture actually is. Publicly: ban Anthropic as a supply chain risk. Privately: have the NSA run Mythos, negotiate White House access, and reconsider the relationship entirely once the national security implications became clear. This is not incoherence. This is a government that understood what it had and immediately moved to control it — not to constrain it, but to own it.
## The Governance Gap Is Structural
The Lawfare Institute's framing of Mythos as exposing a "governance gap" is accurate but understates the problem. The gap is not just regulatory. It is architectural.
The existing international frameworks for controlling dangerous technologies — the Nuclear Non-Proliferation Treaty, the Chemical Weapons Convention, the Wassenaar Arrangement for export controls on dual-use technology — were all built after the technologies existed, after their destructive potential was demonstrated, and often after they were already used. AI is on the same trajectory, moving faster.
What makes Mythos different from, say, a previous-generation hacking tool is scale and autonomy. A team of skilled human hackers can compromise some systems. Mythos can find vulnerabilities in all systems, automatically, continuously, and at a pace that no human security team can match. The offense-defense balance in cyberspace has shifted permanently, and it has shifted toward whoever holds the best model.
Right now, that is the United States. But "right now" is doing enormous work in that sentence.
## What Happens When China Catches Up
CSO Online asked the obvious question: what happens when China's AI catches up to Mythos?
The answer depends entirely on whether some international framework exists by then to constrain its use. Currently, no such framework exists. The trajectory of US behavior — racing to deploy Mythos capabilities across government while engaging in nominal safety theater — does not suggest that the US will be the party pushing for multilateral constraints.
There are reports that Mythos may be restarting US-China AI safety dialogue. That would be good. But AI safety dialogue between great powers typically produces the same outcome as nuclear non-proliferation diplomacy: agreements that constrain declared capabilities while both sides develop undeclared ones. The model weights are not in a silo in Nevada. They are parameters in a neural network that can be replicated by any sufficiently capitalized lab with sufficient compute.
The asymmetry the US currently holds is real, but it has a shelf life. The strategic window is probably measured in months, not years.
## The Anthropic Paradox
There is an uncomfortable irony sitting at the center of this story.
Anthropic was founded explicitly on the premise that advanced AI is dangerous and that building it carefully, with safety as a first principle, is the only responsible path. Constitutional AI, the Responsible Scaling Policy, the decision to restrict Mythos rather than release it publicly — these are genuine expressions of that philosophy.
And yet Anthropic has built the most capable offensive cyberweapon in existence, restricted access to US-aligned entities, quietly allowed the NSA to run it despite the official ban, and is now in negotiations to put it in the hands of the full US federal government.
You can believe that Anthropic made the least-bad choices available to it given the technology's capabilities. You can also observe that the outcome — a US AI company holding a cyberweapon under informal US government control, with no international oversight, no treaty framework, and selective access based on geopolitical alignment — is precisely the outcome that a naive reading of "safety-focused AI development" was supposed to prevent.
Both things can be true simultaneously. They are.
## What This Means for the Rest of the World
If you are a software engineer, CTO, or policy maker outside the US-UK axis, the practical implications are:
**Your infrastructure is exposed.** Mythos has likely already catalogued vulnerabilities in the systems you run. You do not have access to the patch list. Whether those vulnerabilities get exploited depends on political decisions made in Washington, not technical decisions made by your security team.
**"Too dangerous to deploy" means "deployed selectively."** Anthropic's public position is that Mythos is too dangerous for general release. The actual release policy is: available to US companies, US government agencies, and US-aligned intelligence services. The danger is not being constrained. It is being channeled.
**AI access is now diplomatic currency.** OpenAI gave the EU its cyber model. Anthropic withheld Mythos from the EU. Japan is in emergency session over it. The decisions about which countries get access to frontier AI capabilities are being made by private US companies, with the same geopolitical consequences as arms sales — without any of the legal framework that governs arms sales.
**The clock is running.** Every month that passes without an international framework for governing dual-use AI capabilities is a month in which the US exploits its first-mover advantage and other powers race to close the gap. The window for establishing norms before the capabilities proliferate is narrow and closing.
The Atlantic Council is right. Mythos is more consequential than the Iran war. Wars end. This shift in the offense-defense balance in cyberspace is permanent, and the norms — or lack thereof — established in the next twelve months will define the terrain for decades.
The US just turned AI into a weapon. The question is what the rest of the world does about it.
---
*Sources: [Atlantic Council](https://www.atlanticcouncil.org/content-series/inflection-points/mythos-not-the-iran-war-is-the-most-significant-geopolitical-warning-of-our-time/) · [Just Security](https://www.justsecurity.org/138011/too-dangerous-anthropic-mythos/) · [Axios / NSA](https://www.axios.com/2026/04/19/nsa-anthropic-mythos-pentagon) · [Axios / White House](https://www.axios.com/2026/04/16/white-house-anthropic-ai-mythos-government-national-security) · [Bloomberg](https://www.bloomberg.com/news/articles/2026-04-16/white-house-moves-to-give-us-agencies-anthropic-mythos-access) · [CNBC / EU](https://www.cnbc.com/2026/05/11/openai-eu-cyber-model-anthropic-mythos-gpt.html) · [The Register / Japan](https://www.theregister.com/security/2026/05/12/japans-pm-orders-cybersecurity-review-to-defend-against-anthropic-mythos/5238501) · [CSO Online](https://www.csoonline.com/article/4170818/what-happens-when-chinas-ai-catches-up-to-mythos.html) · [Anthropic Glasswing](https://www.anthropic.com/glasswing) · [Lawfare](https://www.lawfaremedia.org/article/mythos-fallout--u.s.-government-weighs-ai-model-regulation) · [Rest of World](https://restofworld.org/2026/ai-cybersecurity-anthropic-mythos/) · [Schneier on Security](https://www.schneier.com/blog/archives/2026/04/on-anthropics-mythos-preview-and-project-glasswing.html)*
---
# Microsoft Cancels Claude Code Licenses. Claude Still Wins.
URL: https://sdd.sh/2026/05/microsoft-cancels-claude-code-licenses-copilot-cli/
Date: 2026-05-15
Updated: 2026-05-15
Tags: claude-code, microsoft, copilot, enterprise, ai-tools, strategy
Categories: AI Tools, Industry
Summary: Microsoft's Experiences + Devices division is canceling thousands of Claude Code licenses by June 30, forcing engineers onto GitHub Copilot CLI. The headline looks bad for Anthropic. The reality is more complicated — and more instructive.
On May 14, The Verge's Tom Warren broke the story: Microsoft is starting to cancel Claude Code licenses. Engineers in the Experiences + Devices team — the division responsible for Windows, Microsoft 365, Outlook, Teams, and Surface — will have to transition to GitHub Copilot CLI by the end of June. The cutoff aligns neatly with the close of Microsoft's fiscal year.
The immediate framing, in plenty of corners of the internet, was that this is a blow to Anthropic. I think that's mostly wrong. Here's why.
## What Actually Happened
Microsoft gave Claude Code to thousands of its internal developers starting around December 2025. The explicit goal was to get project managers, designers, and non-traditional coders to experiment with programming. By all accounts, it worked — Claude Code proved popular inside Microsoft, particularly among engineers who had the autonomy to pick their own tools.
Six months later, Microsoft is reclaiming those licenses. The reasons given, per internal sources:
- **Platform convergence.** Copilot CLI is the first-party CLI agent. Microsoft wants its developers on tools that are natively wired to GitHub, Azure DevOps, and Visual Studio Code.
- **Telemetry and support.** Third-party tools running on developer machines are opaque to IT. Copilot CLI gives Microsoft full usage visibility, security controls, and a support channel it owns.
- **Cost.** Maintaining parallel Anthropic and OpenAI/GitHub subscriptions across thousands of developers is expensive. Fiscal year-end is a natural forcing function for rationalizing the bill.
This is, at its core, an enterprise platform decision. It has very little to do with which tool is better at coding.
## The Part Everyone Is Missing
Here's the detail that got buried in most coverage: **Anthropic's Claude models are staying in Microsoft's stack.** They're just being accessed through a different interface.
Claude is available in Microsoft 365 Copilot. Claude models run through Copilot Studio. The underlying Anthropic API is still powering parts of what Microsoft engineers will use every day — it's just branded as Copilot now.
This is the structural reality of Anthropic's distribution strategy: Claude the model is everywhere. Claude Code the interface is one distribution channel. Microsoft is closing that channel internally, but the model itself isn't going anywhere. Anthropic has Amazon at $25B in compute commitments, Google at 3.5GW of TPUs, and an Akamai deal worth $1.8B. The company isn't dependent on any single enterprise's internal tooling decisions.
## Platform Lock-In: A Familiar Story
What Microsoft is doing here is not new. It's the same playbook enterprise software has always run: give engineers freedom to experiment, measure what wins, then mandate the internal product and reclaim the budget.
Microsoft has done this with browsers (Internet Explorer), productivity suites, cloud platforms, and developer tools. When you work at Microsoft, you use Microsoft products. That's not cynical; it's rational — from Microsoft's perspective. You get better integration, you eat your own cooking, and you don't fund a competitor's growth metrics.
The interesting question is: what does this say about GitHub Copilot CLI?
In early 2026, GitHub Copilot CLI was the less-capable option. Claude Code had a reputation as the tool serious developers used. Microsoft's decision to force its engineers back onto Copilot CLI is simultaneously a vote of confidence in Copilot's recent progress and an acknowledgment that mandate, not quality, is how enterprises win platform battles.
For context: GitHub Copilot CLI went GA in April 2026 with autopilot mode and multi-model support. It's a meaningfully better product than it was twelve months ago. But the developers who had access to both tools and chose Claude Code tells you something. They didn't need to be told which one to use. Until now.
## What This Means for Anthropic
The honest answer is: not much, in the medium term.
Microsoft was never a primary revenue source for Anthropic the way a Fortune 500 customer signing a $1M+ enterprise contract is. Claude Code's $2.5B ARR comes from developers and enterprises choosing the product directly — not from Microsoft repackaging it internally for cost reasons.
What matters more is the competitive signal. Microsoft is investing in Copilot CLI as its flagship agentic CLI product. It is not ceding the terminal-native coding agent space to Claude Code. That means Anthropic has to keep shipping. The [Code with Claude SF 2026](/posts/code-with-claude-sf-2026-recap/) event in early May showed Anthropic is aware of this: doubled rate limits, Code Review GA, Managed Agents Multiagent orchestration. The pace of shipping is the moat.
There's also a talent signal here. The engineers at Microsoft who were using Claude Code and now have to switch to Copilot CLI know which tool they preferred. Some of them will leave and continue using Claude Code. Some will adapt. None of this moves the revenue needle for Anthropic, but it does move the developer sentiment one — and developer sentiment is a leading indicator of where enterprise adoption goes next.
## The Structural Argument That Won't Go Away
Microsoft's move actually illustrates, in vivid color, the core thesis of the terminal-native vs. IDE-embedded debate.
IDE-embedded tools — whether that's Cursor, Copilot in VS Code, or GitHub Copilot CLI in the terminal — exist inside a company's controlled environment. They're procured centrally, monitored by IT, and subject to platform decisions made by executives, not engineers. When a company decides to switch, the developers don't get a vote.
Claude Code, used directly through an Anthropic subscription, exists outside that envelope. It's harder to mandate away from an individual developer paying their own Max subscription. The switch from company-provided Claude Code to personal Claude Code is lower friction than the switch from Claude Code to Copilot CLI.
Microsoft can mandate the tool on its corporate developer machines. It cannot mandate what its best engineers use on their personal setups at 10pm when they're working on something they care about. That's where the next generation of workflows gets discovered.
## The Bottom Line
Microsoft canceling Claude Code licenses internally is a business decision driven by platform strategy, cost rationalization, and fiscal calendar, not by engineering quality. Claude's models aren't leaving Microsoft's stack — they're being rebranded behind a Copilot interface.
The story that matters is simpler: Anthropic built a tool good enough that one of the world's largest technology companies gave it to thousands of internal developers and watched it become popular. The fact that Microsoft is now mandating its own product instead is a lesson about how enterprise platform control works, not a verdict on which AI coding tool is actually better.
Developers know the difference. Many of them are already on Anthropic's Max plan anyway.
---
**Sources:**
- [Tom Warren / The Verge: Microsoft starts canceling Claude Code licenses](https://x.com/tomwarren/status/2055000505923871219)
- [Techmeme aggregation](https://www.techmeme.com/260514/p32)
- [Let's Data Science: Microsoft cancels Claude Code licenses](https://letsdatascience.com/news/microsoft-cancels-claude-code-licenses-shifts-developers-to-4a042e96)
- [MLQ.ai: Microsoft reportedly scraps most internal Claude Code licenses](https://mlq.ai/news/microsoft-reportedly-scraps-most-internal-claude-code-licenses-and-steers-engineers-back-to-copilot/)
- [KuCoin Flash: Microsoft to cut Claude Code licenses, mandate Copilot CLI by June 30](https://www.kucoin.com/news/flash/microsoft-to-cut-claude-code-licenses-mandate-use-of-github-copilot-cli-by-june-30)
---
# Claude Code v2.1.129: Bedrock Tiers, Smarter MCP, and a Gateway Reversal
URL: https://sdd.sh/2026/05/claude-code-v2-1-129-bedrock-service-tier-mcp-auto-retry/
Date: 2026-05-15
Updated: 2026-05-15
Tags: claude-code, changelog, bedrock, mcp, enterprise, aws
Categories: AI Tools, Guides
Summary: Claude Code v2.1.129 shipped quietly on May 6. For most users it's invisible. For Bedrock enterprise shops, MCP server operators, and anyone using third-party model gateways, it changes real behavior — and one change is a deliberate reversal of v2.1.126.
Claude Code's changelog entries rarely make headlines. That's usually a good sign — it means the product is stable enough that weekly releases are refinements, not rescue operations. Version 2.1.129 (shipped May 6, alongside [Code with Claude SF 2026](/posts/code-with-claude-sf-2026-recap/)) is that kind of release: narrow scope, high precision.
But three changes in this release affect production deployments in ways that don't show up as obvious UI differences. If you're running Claude Code at scale on AWS Bedrock, operating MCP servers that hit startup errors, or routing Claude Code through a third-party gateway, you need to know about this release specifically.
## AWS Bedrock Gets Service Tier Control
The headline addition for enterprise users: `ANTHROPIC_BEDROCK_SERVICE_TIER`.
Set this environment variable to `default`, `flex`, or `priority`, and Claude Code will send the corresponding `X-Amzn-Bedrock-Service-Tier` header on every Bedrock API call. This maps directly to AWS Bedrock's three inference tiers:
- **default** — standard throughput, standard pricing
- **flex** — burstable capacity, trades guaranteed throughput for cost flexibility
- **priority** — reserved throughput, higher cost, consistent latency
Before this change, Claude Code on Bedrock always requested default-tier inference. If your organization had negotiated a priority-tier reserved capacity block, you couldn't steer Claude Code onto it — you had to let it compete for standard capacity alongside every other workload in your account.
Now you can. For teams running Claude Code in production pipelines where latency consistency matters — automated PR review, CI/CD integration with Code Review GA, scheduled Routines — this is the knob you've been waiting for.
**How to use it:**
```bash
export ANTHROPIC_BEDROCK_SERVICE_TIER=priority
claude
```
Or pin it in your shell profile / container environment for permanent effect. For Claude Code Routines running on Anthropic's infrastructure against Bedrock, set it in the Routine's environment configuration.
## /resume Now Works by PR URL
Small change, big workflow impact: pasting a pull request URL into the `/resume` search box now finds the session that created that PR.
Previously, `/resume` searched by session timestamp, conversation title, or working directory. If you wanted to continue work on a PR you'd handed off — or pick up a session from a different machine — you had to remember when you started it or scroll through a list.
Now you paste the PR URL — GitHub, GitHub Enterprise, GitLab, or Bitbucket format all work — and `/resume` does the lookup. Claude Code surfaces the matching session, including the full context of what was being built when the PR was opened.
The practical case: you create a PR, hand it off for review, reviewer comes back with comments three days later, and you want to continue in the same Claude Code session rather than starting cold. Paste the PR URL, resume, and Claude has the full context of what it built and why.
## MCP Servers Get Three Startup Retries
MCP servers that fail to connect during Claude Code startup have always stayed disconnected for the session. If an MCP server was slow to initialize — database connection under load, container cold start, network hiccup — Claude Code would mark it as unavailable and move on. You'd discover mid-session that the tool you needed wasn't there.
Version 2.1.129 adds automatic retry logic: up to 3 retry attempts on transient startup errors before giving up. The retries are brief and happen before Claude Code's main prompt appears, so you don't see them unless they're happening on a slow connection.
For MCP operators running servers with any startup latency variance — local containers, remote MCP endpoints, database-backed servers that take a moment to warm up — this is a reliability improvement that requires zero configuration. It just works.
The distinction the changelog makes: transient errors only. Servers that fail due to missing configuration, bad credentials, or incompatible protocol versions still fail fast. The retry logic is targeted at infrastructure-level flakiness, not configuration problems.
## Gateway Model Discovery Is Now Opt-In
This one is a reversal, and it deserves attention.
Version 2.1.126 (shipped ~April 25) made the `/model` picker's gateway discovery automatic. Claude Code would query `ANTHROPIC_BASE_URL/v1/models` and populate the model picker with whatever models the gateway reported. Third-party OpenAI-compatible gateways — LiteLLM, OpenRouter, local Ollama setups, enterprise API proxies — suddenly appeared as selectable models without any user action.
v2.1.129 rolls that back. Gateway discovery is now opt-in via `CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1`.
The default is off.
**Why the reversal?** The auto-discovery behavior had two problems:
1. **Security surface.** A malicious or misconfigured `ANTHROPIC_BASE_URL` could inject unexpected model names into the picker. This is a minor risk in most setups, but it's the kind of ambient attack surface that security-conscious enterprise environments will flag immediately.
2. **Noise in the model picker.** Power users who set `ANTHROPIC_BASE_URL` for unrelated reasons (proxy routing, traffic inspection) found their model picker filled with unexpected entries. The opt-in flag keeps the picker clean by default.
If you were using auto-discovery from v2.1.126 through v2.1.128 and want to keep it, add `CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1` to your environment. If you upgraded to v2.1.129 and your model picker suddenly got shorter, this is why.
## The Rest of v2.1.129
Several smaller additions that are worth knowing about:
**`--plugin-url `** — fetch a plugin `.zip` archive from a URL for the current session. Previously, plugins had to be installed to the local plugin directory. This enables ephemeral plugin loading for CI/CD workflows: pull a plugin from an artifact store, use it for the duration of the build, discard it. No persistent installation required.
**`CLAUDE_CODE_FORCE_SYNC_OUTPUT=1`** — forces synchronized terminal output on terminals where auto-detection fails. The primary target is Emacs's `eat` terminal emulator, which has a rendering model that trips Claude Code's async output path. If you're an Emacs user running Claude Code inside `eat` and seeing garbled output, this is the fix.
**`CLAUDE_CODE_PACKAGE_MANAGER_AUTO_UPDATE`** — when set, enables automatic background updates on Homebrew and WinGet installations with a restart prompt when an update is ready. Useful for developer workstations where staying current matters but you don't want to remember to run `brew upgrade` before each session.
**Plugin manifest changes** — the `experimental` namespace in plugin manifests is now required for `themes` and `monitors` keys. Top-level declarations still work but trigger warnings from `claude plugin validate`. If you maintain a plugin, update your manifest before the warnings become errors in a future release.
**VS Code Windows fix** — a regression in recent releases caused the VS Code extension to fail activating on Windows due to a hardcoded build path in the bundled SDK. Fixed.
**Mantle auth fix** — the Bedrock enterprise Mantle backend was missing the `x-api-key` header in some authentication flows, causing intermittent auth failures for Bedrock GA users. Fixed.
## The Pattern in This Release
v2.1.129 is a configuration-layer release. The features aren't about new AI capabilities — they're about giving operators more control over how Claude Code behaves in specific infrastructure environments.
Bedrock service tiers, MCP retry logic, gateway discovery opt-in, plugin URL loading — all of these are signals that Claude Code is being deployed in environments complex enough to need these knobs. The engineering team is spending cycles on enterprise infrastructure plumbing. That's not a criticism; it's a sign the product has moved from developer-curious to production-serious.
For the vast majority of Claude Code users running it interactively on a developer machine with a direct Anthropic API connection, v2.1.129 is invisible. For the teams running it in pipelines, on Bedrock, behind gateways, or with MCP servers that have any startup variability — this is the release to upgrade for.
---
**Sources:**
- [Claude Code Changelog (official)](https://code.claude.com/docs/en/changelog)
- [Claude Code v2.1.129 Release Notes – Claude World](https://claude-world.com/articles/claude-code-21129-release/)
- [Claude Code Updates May 2026 – Releasebot](https://releasebot.io/updates/anthropic/claude-code)
- [Claude Code Changelog tracking – claudelog.com](https://claudelog.com/claude-code-changelog/)
---
# Anthropic Signs a $1.8B Deal With Akamai. Why a CDN Company?
URL: https://sdd.sh/2026/05/anthropic-akamai-1-8-billion-compute-deal-edge/
Date: 2026-05-14
Updated: 2026-05-14
Tags: anthropic, akamai, compute, infrastructure, edge-inference, claude-code
Categories: AI Tools, Industry
Summary: Anthropic has signed a $1.8B, seven-year computing contract with Akamai Technologies — the largest deal in Akamai's history. Akamai isn't just a CDN anymore: it launched a global AI inference network across 4,400 edge locations built on NVIDIA Blackwell GPUs in March. The deal is the fourth pillar of Anthropic's deliberate strategy to never depend on a single compute supplier.
When Bloomberg reported that Anthropic had signed a $1.8 billion computing contract with Akamai Technologies on May 8, the immediate reaction was confusion. Akamai? The content delivery network? The company that routes your Netflix stream and protects websites from DDoS attacks?
Yes, that Akamai. And once you understand what the company has been quietly building over the past eighteen months, the deal makes perfect sense — both for Anthropic and for developers who think about where Claude actually runs.
## Akamai Is Not Just a CDN Anymore
The mental model most engineers carry of Akamai — edge servers that cache static content close to users — is significantly out of date.
In March 2026, Akamai announced what it described as the industry's first global-scale implementation of NVIDIA's AI Grid reference architecture: a distributed network of thousands of NVIDIA RTX PRO 6000 Blackwell Server Edition GPU clusters, coordinated across more than 4,400 edge locations worldwide. The product is called the Akamai Inference Cloud Platform, and it is specifically designed to run AI workloads — both training and inference — either at the distributed edge or from centralized multi-thousand GPU clusters.
The NVIDIA AI Grid orchestration layer handles intelligent routing: decide dynamically whether a given workload should run on a concentrated cluster or be distributed across edge nodes based on latency requirements, cost, and load. For a company whose entire value proposition is "compute close to where users actually are," this is a natural extension of the core business.
The Anthropic deal sent Akamai's stock up 27% on May 8. At approximately $257 million per year over seven years, it would more than double Akamai's current cloud segment annual run rate. The company called it the largest deal in its history.
## The Four-Pillar Compute Strategy
To understand why Anthropic signed with Akamai, you need to zoom out and look at the full compute picture.
CEO Dario Amodei has been explicit: Anthropic is experiencing demand it cannot currently serve fast enough. At Code with Claude SF, he cited "80x growth" in annualized revenue and usage during Q1 2026. That kind of curve requires not just more compute, but compute from multiple sources — because any single supplier represents a single point of failure.
Anthropic's current compute stack now has four major pillars:
| Partner | Scale | Structure |
|---|---|---|
| **Amazon** | $25B investment + $100B AWS, 10 years | Cloud + Trainium3 chips |
| **Google** | 3.5 GW TPU capacity | Partnership + investment |
| **SpaceX Colossus 1** | 300 MW, 220,000+ NVIDIA GPUs | Compute lease, announced May 6 |
| **Akamai** | $1.8B, 7 years, multi-thousand GPU clusters | AI Grid + edge inference |
No other AI lab has deliberately diversified its compute base this broadly. OpenAI relies heavily on Microsoft's Azure. Google's DeepMind runs primarily on Google's TPU stack. The concentration risk those arrangements create is real: a contract dispute, a capacity crunch, or a strategic realignment can put model availability at risk.
Anthropic has structured each of these deals to be complementary rather than redundant. Amazon handles a large share of API traffic through Bedrock. Google provides training compute via the TPU stack. SpaceX Colossus unlocked the rate limit doubling. And Akamai brings something the others don't: infrastructure designed from the ground up for distributed inference at the edge.
## Why Edge Inference Matters
The traditional model of AI inference is centralized: a request travels to a large data center, runs through a GPU cluster, and the response travels back. For many workloads this is fine. For agentic tasks — Claude Code sessions, Routines, Managed Agents orchestrating multi-step workflows — the cumulative latency of round-trip calls to centralized compute adds up.
Akamai's 4,400 edge locations cover nearly every major metropolitan market on the planet. NVIDIA's AI Grid orchestration layer allows inference to be routed to the closest capable node rather than always going to a central cluster. The practical effect for Claude Code users: faster responses for interactive sessions, and lower per-token costs once the amortized infrastructure math shifts from centralized to distributed.
Anthropic has not announced a specific edge inference product using Akamai's network. The deal covers both training and inference workloads, and revenue recognition doesn't begin until Q4 2026. But the architectural implication is clear: Anthropic is building toward a model where Claude can run not just in AWS data centers but distributed across thousands of edge points. For the use cases that matter most to developers — real-time tool calls, computer-use sessions, sub-second agentic step times — this infrastructure investment is directly relevant.
## The Non-Hyperscaler Bet
There is a second read on this deal that is worth making explicit.
Amazon, Google, and Microsoft are hyperscalers: they sell compute as a core business, they have massive leverage over their customers, and the contractual and commercial relationships can become complicated when those customers are also competitors in AI. Akamai is not a hyperscaler. It has no AI model ambitions. It is a pure infrastructure play, and its relationship with Anthropic is cleanly transactional.
The same logic applies to the SpaceX Colossus deal. SpaceX is not building a competing AI lab. The compute is Anthropic's to run as it chooses, without the implicit tension of being a major customer of a company that is simultaneously your rival.
Deliberately sourcing compute from non-competing infrastructure providers is a strategic hedge that Anthropic has built quietly into its supply chain. The Akamai deal is the clearest expression of that strategy yet.
## What This Means for Claude Code Users Today
The Akamai deal doesn't change anything about how you interact with Claude Code today. The contract ramps in Q4 2026, and practical developer impact will come later. But taken together with the SpaceX Colossus announcement (rate limits doubled May 6) and the broader compute diversification strategy, the direction is clear:
**Rate limits continue to rise.** Each compute deal Anthropic signs expands the ceiling. The 5-hour daily limit, the peak-hour throttling — these are artifacts of supply constraint, not product design. As supply catches up to demand, they become irrelevant.
**Agentic reliability improves.** Long Claude Code sessions, multi-agent Routines, and Managed Agents workflows require sustained compute availability. Distributed edge infrastructure reduces the probability of any single data center bottleneck affecting an active session.
**The compute moat widens.** A developer choosing between Anthropic and a competitor also implicitly choosing their underlying infrastructure. A company with $1.8B locked into Akamai, $25B with Amazon, and 300MW of SpaceX capacity is not running out of tokens anytime soon. The infrastructure bets being made now directly shape which models will be available and at what price in 2027 and beyond.
## The Bigger Picture
Akamai's transformation is itself a story worth watching. A company known for serving CDN traffic has built a global AI inference grid using NVIDIA's latest Blackwell architecture, signed the largest deal in its corporate history, and seen its stock jump 27% in a single day. That's what the demand signal from Anthropic and others is doing to infrastructure companies that position themselves correctly.
For Anthropic, the deal is one more piece of a compute foundation that no other AI lab has assembled quite like this: diversified across hyperscalers and non-hyperscalers, balanced between centralized training clusters and distributed edge inference, locked in across timescales ranging from seven to ten years.
The $1.8B number sounds large. Spread across seven years and the compute it unlocks, it is a bargain.
---
*Sources: [Bloomberg](https://www.bloomberg.com/news/articles/2026-05-08/anthropic-inks-1-8-billion-computing-deal-with-akamai) · [Benzinga](https://www.benzinga.com/markets/tech/26/05/52434312/anthropic-signs-1-8-billion-akamai-cloud-deal-amid-surging-claude-ai-demand-report) · [The Next Web](https://thenextweb.com/news/akamai-anthropic-cloud-deal-ai-infrastructure) · [Akamai Inference Cloud Platform](https://www.akamai.com/products/akamai-inference-cloud-platform) · [Akamai AI Grid press release](https://www.akamai.com/newsroom/press-release/akamai-launches-ai-grid-intelligent-orchestration-for-distributed-inference-across-4400-edge-locations) · [Let's Data Science](https://letsdatascience.com/news/anthropic-signs-18b-computing-deal-with-akamai-7069d163)*
---
# Anthropic Is in Talks to Raise $30B at a $900B Valuation. The Numbers Explain Why.
URL: https://sdd.sh/2026/05/anthropic-900b-valuation-30b-funding-round/
Date: 2026-05-14
Updated: 2026-05-14
Tags: anthropic, funding, valuation, claude-code, ipo, enterprise-ai
Categories: AI Tools, Industry
Summary: Anthropic is in early talks to raise at least $30B at a pre-money valuation exceeding $900B — nearly triple its February figure of $380B. The leap is backed by real revenue: $44B annualized ARR, 70% gross margins, and Claude Code generating $2.5B on its own. If the round closes, Anthropic would surpass OpenAI's $852B March valuation.
Three months ago, Anthropic closed a Series G at a $380 billion valuation and the AI world called it a record. Now Bloomberg is reporting that Anthropic is in early talks to raise at least another $30 billion — this time at a pre-money valuation exceeding $900 billion. No term sheet has been signed, but sources indicate the round could close by the end of May.
The obvious question: how does a private company nearly triple its valuation in a single quarter? The answer is that the revenue growth is real, and it's accelerating faster than anyone predicted.
## The Numbers Behind the Number
When Anthropic closed the Series G in February, annualized revenue stood at roughly $9–10 billion. By May 2026, that figure has surged to approximately $44 billion annualized — a more-than-4x jump in roughly four months. That kind of trajectory doesn't happen on hype alone.
A few metrics deserve attention:
- **Gross margins**: 38% a year ago; now above 70%. This matters enormously for a compute-intensive business. Anthropic has compressed its inference cost per token faster than critics assumed possible.
- **Enterprise mix**: roughly 80% of revenue comes from enterprise customers, not consumer subscriptions. Enterprise revenue is stickier, carries larger contract sizes, and is far less subject to churn.
- **Fortune 10 penetration**: eight of the ten largest companies in the world are active Anthropic customers.
- **$1M+ customers**: over 1,000 enterprise clients spend more than a million dollars annually.
- **Business subscriptions**: quadrupled since the start of 2026.
And then there is Claude Code. The terminal-native coding agent generates approximately $2.5 billion in annualized revenue as a standalone product line — roughly 6% of total ARR, from a tool that launched in earnest less than a year ago. No single AI coding tool has ever grown to that scale this fast.
## Surpassing OpenAI — on Paper
OpenAI closed a funding round in March 2026 at an $852 billion valuation, led by Amazon, Nvidia, and SoftBank. Should Anthropic finalize this round at $900B+, the Claude maker would, on paper at least, surpass its better-known rival.
The valuation comparison is worth holding loosely — both companies are private, and their approaches to enterprise vs. consumer revenue, compute infrastructure, and safety investments differ substantially. But directionally, the market is saying that the race is genuinely competitive in a way it simply was not two years ago, when OpenAI held an overwhelming lead.
Anthropic has consistently differentiated itself on enterprise and developer tools rather than consumer applications. The approach has proved more defensible and more profitable than chasing monthly active users.
## What the Money Is Actually For
This is not a fundraise to build a better chatbot. Anthropic's stated priority is compute: the company is racing to secure GPU and TPU capacity at a scale that can serve its current customers while supporting continued model development.
The compute strategy has been diversified deliberately:
- **SpaceX Colossus 1**: 300MW, 220,000+ NVIDIA GPUs, announced at Code with Claude SF in May. Doubled Claude Code rate limits across all tiers.
- **Amazon**: up to $25B investment, $100B AWS commitment over ten years, 5GW of Trainium3 compute.
- **Google**: 3.5GW TPU capacity via the existing partnership.
- **Akamai**: $1.8B, seven-year contract for multi-thousand GPU clusters supporting training and inference. (More on this separately.)
The goal is not to be dependent on any single hyperscaler. Every one of these deals represents both a capital commitment and an insurance policy against supply concentration. The $30B raise funds the next phase of that diversification — and likely accelerates the training runs for whatever comes after Opus 4.7.
## The IPO Is Now Real
Anthropic has held preliminary IPO discussions that point toward a listing as early as October 2026. A $30B+ round at a $900B valuation, if it closes at that scale, serves two purposes: it provides runway and signals market confidence in the lead-up to a public offering.
The path from $380B private valuation (February) to a potential $60B+ IPO raise (October estimate) in under a year would be unprecedented. For context, the largest tech IPO in history was Alibaba's $25B raise in 2014. Anthropic's October target, if it materializes, could set a new record.
Going public puts Anthropic under a different kind of scrutiny — quarterly earnings calls, disclosure requirements, and the pressure to optimize for near-term margins at the expense of long-term research investments. Dario and Daniela Amodei have been vocal about prioritizing safety research and the long game. How that commitment survives the IPO grinder will be one of the more interesting corporate stories of the next twelve months.
## Why Claude Code Developers Should Care
If you are building on or with Claude Code, this funding round has a few practical implications:
**Rate limits are not going away.** The doubled limits from the SpaceX Colossus deal were a direct consequence of capital deployment into compute. More compute from more sources means the ceiling continues to rise. The 5-hour daily limit that plagued power users through Q1 is already behind us.
**Model releases accelerate.** Anthropic's gross margin improvement (38% → 70%) means each dollar of revenue funds more research. The pace of releases — Opus 4.7 in April, the rumored Sonnet 4.8 still expected this month — reflects a lab that has solved enough of its inference-cost problem to reinvest aggressively.
**Enterprise features compound.** The $1M+ customer base is the pressure that shipped Routines, Claude Cowork, Analytics API, RBAC/SCIM, and the Claude Platform on AWS. A $900B company with 1,000+ enterprise customers at that spend level is under enormous pressure to keep shipping. That pressure benefits every developer on the platform.
**The safety-versus-capability tension intensifies.** At $900B and pre-IPO, Anthropic faces investor pressure that no private company does. The Constitutional AI research, the Responsible Scaling Policy, and the deliberate release decisions around Opus 4.7's cybersecurity capabilities — all of that has been easier to maintain as a founder-controlled private company. Watching how it survives contact with public market incentives is not an abstract concern for the developers and enterprises whose workflows depend on reliable, principled AI.
## The Floor Under the Valuation
Skeptics will note that $900B for a company with $44B ARR implies a roughly 20x revenue multiple — elevated, but not absurd for a business with 70% gross margins growing at this pace. For comparison, Salesforce trades at roughly 6x revenue, but Anthropic is growing orders of magnitude faster.
The real floor under the valuation is the combination of gross margin trajectory, compute moat (the locked-in multi-decade relationships with SpaceX, Amazon, Google, and Akamai), and the developer ecosystem stickiness that Claude Code has built through CLAUDE.md workflows, skills, Routines, and MCP integrations. None of that evaporates if OpenAI ships a competing tool.
The round is not closed. A term sheet has not been signed. But the direction of travel is clear: Anthropic has converted its safety-focused bet into a machine that generates $44B a year, and the market is pricing in what comes next.
---
*Sources: [Bloomberg](https://www.bloomberg.com/news/articles/2026-05-12/anthropic-in-talks-to-raise-30-billion-at-900-billion-valuation) · [TechFundingNews](https://techfundingnews.com/anthropic-30b-fundraise-900b-valuation-mega-round/) · [TradingKey analysis](https://www.tradingkey.com/analysis/stocks/us-stocks/261889029-anthropic-funding-30b-valuation-trillion-claude-code-revenue-growth-ipo-spacex-colossus-tradingkey) · [Yahoo Finance](https://finance.yahoo.com/news/anthropic-talks-raise-30-billion-210804604.html)*
---
# OpenAI Just Built an IT Services Company. That's an Admission.
URL: https://sdd.sh/2026/05/openai-deployment-company-4-billion-enterprise-services/
Date: 2026-05-13
Updated: 2026-05-13
Tags: openai, enterprise, ai-tools, industry, claude-code
Categories: AI Tools, Industry
Summary: OpenAI launched a $4B+ PE-backed deployment company on May 11, acquiring AI consultancy Tomoro and embedding 150 engineers into enterprise clients. The structure tells a story: if models alone were enough to win enterprise, you wouldn't need a 1,000-person professional services arm.
On May 11, 2026, OpenAI did something IBM would recognize immediately. It launched the [OpenAI Deployment Company](https://openai.com/index/openai-launches-the-deployment-company/), a majority-owned subsidiary backed by more than $4 billion in initial capital from 19 investment firms, consulting houses, and systems integrators — including TPG, Goldman Sachs, SoftBank, Bain Capital, Brookfield, and McKinsey & Company.
Alongside the launch, OpenAI acquired Tomoro, an AI consultancy with roughly 150 engineers that counts Mattel, Red Bull, Tesco, and Virgin Atlantic among its clients.
IT services stocks dropped the same day. That reaction tells you everything about what OpenAI just entered.
## The "Forward Deployed Engineer" Model
The OpenAI Deployment Company's pitch is straightforward: it will embed "Forward Deployed Engineers" (FDEs) directly into enterprise customer teams. These specialists identify high-impact AI use cases, redesign workflows around them, and — in OpenAI's framing — "turn those gains into durable systems."
If that sounds like Accenture or McKinsey Digital, it should. This is the professional services model that has been the dominant route to enterprise software adoption for 40 years. Sell the platform, then sell the integration. Sell the model, then sell the deployment.
The Tomoro acquisition seeds the business immediately. Tomoro was formed in 2023 in alliance with OpenAI, which means it was already building on GPT-4/GPT-5 stacks for clients. The ~150 engineers come with relationships, playbooks, and institutional knowledge about where enterprise AI gets stuck — and it isn't usually the model quality.
## Why OpenAI Needed This
OpenAI's revenue has been growing — its [API and enterprise business](https://openai.com) reportedly exceeds $3 billion ARR — but converting large enterprises into durable, high-spend customers has proven harder than converting startups.
The gap isn't model quality. GPT-5.5, Codex, and the Agents SDK are technically excellent. The gap is the last 100 meters: integrating AI into legacy workflows, training employees who didn't choose this, navigating procurement, satisfying legal and security reviews, and maintaining systems that don't break when a new model drops.
That work doesn't happen through an API. It happens through people. The Deployment Company is OpenAI's bet that it can own that layer before the Accentures and Deloittes of the world do.
The 19 investors include Bain & Company, Capgemini, and McKinsey as "consulting and systems-integration partners" — which means they're not just writing checks, they're potential co-delivery partners (or competitors who decided it was better to be inside the tent).
## What This Means for the Market
The IT stock reaction was telling. Shares in traditional systems integrators fell on the news, because the OpenAI Deployment Company is essentially entering their market — except with access to frontier AI capabilities that no legacy SI can match in-house.
For smaller AI tool vendors, this changes the enterprise sales dynamic. If OpenAI's FDEs are already embedded in a customer, that's a natural GPT-5 moat. The Deployment Company's presence at an enterprise account will make it harder for Claude Code, Cursor, or any other tool to get a foothold unless the customer explicitly requests multi-vendor diversity.
That said, the IT services model has a structural weakness: it doesn't scale like a product.
## The Product-Led vs. Services-Led Fork
Here is where the OpenAI Deployment Company diverges sharply from what Anthropic has built.
Claude Code is a product. An individual developer downloads it, authenticates via Claude.ai or an API key, and starts using it in their existing terminal. Adoption spreads laterally through engineering teams through demonstrated productivity gains, not through enterprise procurement cycles and embedded consultants.
The OpenAI Deployment Company is services. You get the FDEs if you can afford the engagement. The technology compounds at the speed of human consulting capacity.
Anthropic's $30B ARR was built largely on this product-led flywheel: Claude Code's $2.5B ARR component grew from $0 to $2.5B in roughly 18 months, driven by developer adoption that preceded formal enterprise procurement. By the time enterprise IT signed the contract, engineering teams had already made the decision for them.
OpenAI's approach can generate large-ticket deals faster — an FDE engagement might be a $2M annual contract from day one. But the ceiling is the headcount of the Deployment Company. A product's ceiling is how good the model is.
## The Deeper Admission
There's something worth sitting with here. OpenAI built the most capable models in the world, captured the popular imagination with ChatGPT, and is still finding that selling AI to enterprises requires a professional services arm backed by Goldman Sachs and McKinsey.
That's not a failure. It's a realistic read of how large organizations adopt new technology. But it's also an admission: models alone are not enough.
This creates a tension at the heart of OpenAI's strategy. If FDEs are what makes the difference, then the value of the AI is partly in the deployment expertise, not just in the model. Which means OpenAI is competing simultaneously in two businesses with very different economics: foundation models (high margin, scales with compute) and professional services (low margin, scales with headcount).
Every IT services company that scaled past $10B learned this tradeoff the hard way. The Deployment Company's investors — who include people from TPG and Brookfield, not just AI enthusiasts — know this too. The $4B bet is that the combination can work; that having the best models *and* the best deployment organization is an unassailable position.
## What to Watch
The Deployment Company's first major test will be whether its FDE engagements produce the kind of measurable, auditable productivity gains that justify the price tag. Tomoro's clients — Mattel, Virgin Atlantic, Tesco — are consumer and retail brands, not deep-tech enterprises. Scaling that playbook to financial services, healthcare, and government is a different challenge.
The Anthropic counter-argument writes itself: Mercado Libre targeted 90% autonomous coding across 23,000 engineers by Q3 2026 using Claude Code. No FDEs required. The engineers themselves became the deployment mechanism, because the tool was good enough to adopt without a consultant explaining how.
Watch whether OpenAI's FDE model generates documented ROI numbers comparable to what Claude Code's analytics API produces — and whether those numbers hold up after the FDEs leave.
---
**Sources:**
- [OpenAI launches the OpenAI Deployment Company](https://openai.com/index/openai-launches-the-deployment-company/) — OpenAI official announcement
- [OpenAI forms $4B PE-backed AI deployment venture](https://www.msn.com/en-us/money/news/openai-forms-4-billion-pe-backed-ai-deployment-venture/ar-AA22VzvM) — Reuters / MSN
- [OpenAI launches enterprise AI service-focused 4Bn 'OpenAI Deployment Company'](https://thetechportal.com/2026/05/11/openai-launches-enterprise-ai-service-focused-4bn-openai-deployment-company/) — The Tech Portal
- [OpenAI launches professional services business with $4B investment](https://siliconangle.com/2026/05/11/openai-launches-professional-services-business-4b-investment/) — SiliconANGLE
- [Capgemini investment announcement](https://www.capgemini.com/news/press-releases/capgemini-strengthens-its-position-in-enterprise-ai-with-investment-in-the-openai-deployment-company/) — Capgemini official
- [OpenAI Deployment Company launches with $4bn and Tomoro buy](https://www.resultsense.com/news/2026-05-12-openai-deployment-company-tomoro/) — ResultSense
---
# Google I/O 2026 Preview: Gemini 4, Firebase Agents, and the Agentic Coding Race
URL: https://sdd.sh/2026/05/google-io-2026-preview-gemini-4-firebase-agents-agentic-coding/
Date: 2026-05-13
Updated: 2026-05-13
Tags: google, gemini, google-io, firebase, agentic-coding, claude-code, gemini-cli
Categories: AI Tools, Industry
Summary: Google I/O 2026 runs May 19–20. Gemini 4 with a 2M+ token context window is the headliner, but the more important story is Firebase Studio becoming an agent-native development platform — Google's direct answer to Claude Code. Here's what to watch and why it matters.
Google I/O 2026 is six days away. The keynote is May 19 at 10am PT at Shoreline Amphitheatre in Mountain View, and for the first time in memory, the most important sessions aren't about Android.
This year the developer keynote carries a specific weight: Google is expected to reveal Gemini 4, announce the agent-native evolution of Firebase Studio, and ship the Gemini CLI upgrade that moves it from a capable free tool to a full multi-agent platform. That's three simultaneous strikes at Claude Code's current position — from the model layer, the platform layer, and the terminal layer.
Here's what to expect, what's confirmed, and what it means if you're already building with Claude Code or evaluating the field.
## Gemini 4: The Context Window Play
The centerpiece announcement will almost certainly be [Gemini 4](https://www.abhs.in/blog/google-io-2026-may-19-gemini-4-android-17-agentic-coding-developer-preview). Sources pointing at internal session copy describe a 2 million token context window as the baseline capability — with some suggesting a 10 million token variant for enterprise workloads.
To calibrate: at 2M tokens, you can fit a moderately large production codebase — say, 200,000 lines of well-commented TypeScript — in a single context window without any retrieval augmentation. At 10M tokens, you're in territory where an entire large monorepo fits as context.
This is the story Google needs to tell. Claude Opus 4.7 has a 1M token context window. Gemini 4 at 2M doubles it. At 10M, the comparison doesn't exist yet.
Whether the extended context translates to better reasoning on long-range code tasks is a different question. Claude Code's 1M context GA in March 2026 produced a 15% reduction in compaction events but didn't make Opus 4.7 dramatically better at tasks that fit in 100K tokens. Context window size is necessary but not sufficient.
Expect Gemini 4 benchmarks focused on multi-file editing, whole-codebase refactors, and tasks where context length is the binding constraint. That's where Google's advantage will be most defensible.
## Firebase Studio: Google's Direct Claude Code Answer
The more strategically significant announcement for developers is Firebase Studio's evolution into what Google's session copy describes as an [agent-native platform](https://www.abhs.in/blog/google-io-2026-may-19-gemini-4-android-17-agentic-coding-developer-preview).
Firebase Studio is the rebranded Project IDX, which Google announced at Cloud Next 2026 as a cloud-hosted development environment. The I/O announcement is expected to add the agentic layer: autonomous multi-file editing, test execution, deployment to Google Cloud, and deep integrations with AI Studio and Antigravity (Google's full-stack AI app builder).
The confirmed path is: prototype in AI Studio → build in Firebase Studio → deploy to Google Cloud. If that pipeline works as described, it's a credible end-to-end story for new application development — especially for developers already in the Google ecosystem.
The comparison to Claude Code is obvious and intended. Claude Code occupies the terminal and spans the entire development lifecycle from spec to deployment. Firebase Studio occupies the browser and targets a web-native workflow. These are not the same tool for the same person — they're different bets about where serious development happens.
Claude Code's thesis is that the terminal is the only environment with the access, flexibility, and tooling depth to support truly autonomous agents. Firebase Studio's thesis is that the browser is where more developers already live, and that moving to the browser unlocks cloud-native benefits (no local setup, easy sharing, instant deployment).
Both can be true. But they can't both be the primary tool for the same developer.
## Gemini CLI: Subagents Shipped, More Coming
While I/O will carry the headline announcements, Google has already shipped a significant Gemini CLI update that deserves attention on its own.
[Subagents are now live in Gemini CLI](https://developers.googleblog.com/subagents-have-arrived-in-gemini-cli/). The model: Gemini CLI acts as an orchestrator, delegating sub-tasks to specialist agents — each with its own context window, custom system instructions, and curated tool set. Subagents are defined as Markdown files with YAML frontmatter, stored in `~/.gemini/agents` for personal workflows or `.gemini/agents` in a repository for team-shared agents.
Parallel execution is supported: you can spin up multiple instances of the same subagent simultaneously, which is how you'd handle tasks like refactoring multiple independent modules or running parallel research on competing approaches.
The architecture is strikingly similar to Claude Code's own skill system. Claude Code skills are also Markdown-defined, live in `.claude/` directories, can be shared via repositories, and are loaded into specific sessions. The convergence isn't coincidental — this is the emerging standard for how terminal AI agents are extended.
The v0.41.0 Gemini CLI release (May 8, 2026) added real-time voice interaction with both cloud and local processing backends — expanding the tool from a coding assistant to something closer to a continuous development collaborator.
Gemini CLI's competitive position: free (1,000 requests/day, 1M context), open-source, Google Search grounding built in, MCP support for tool extensibility. Its weakness relative to Claude Code: Opus 4.7 is a significantly better coding model than Gemini 2.5 Pro on current SWE-bench Pro scores (64.3% vs. ~54%). The Gemini CLI free tier is a compelling entry point, but not a replacement for teams doing serious agentic work.
If Gemini 4 substantially closes that benchmark gap, the free-tier argument gets much stronger.
## Jules and the Async Agent Layer
Jules — Google's async coding agent that runs on Google infrastructure — is expected to receive updates at I/O. Jules GA (April 2026) runs on Gemini 3.1 Pro, handles CI loop closure automatically, and provides audio changelogs of its work.
The strategic parallel to Claude Code Routines is exact: both let you queue a task, walk away, and come back to a completed result. The key difference is integration depth. Routines integrate natively with Claude Code's terminal-native model, CLAUDE.md invariants, git worktrees, and the broader Managed Agents ecosystem. Jules is a standalone async agent without equivalent tooling depth.
A Gemini 4-powered Jules would be a meaningfully different product. Watch for benchmark numbers on multi-step autonomous tasks.
## Android 17: The On-Device Layer
Google is also expected to announce an on-device Gemini Nano API for third-party Android developers in Android 17. This is less relevant to server-side agentic coding but significant for mobile AI applications — it's Google extending the Gemini model ecosystem down to the device layer where Apple Silicon competes.
For developers building mobile applications with Claude Code, Android 17's on-device API means you can target both on-device inference (Gemini Nano) and server-side inference (Claude/Gemini 4) from a unified development workflow.
## The Code with Claude London Coincidence
One scheduling detail worth noting: Code with Claude London is happening the same day as the Google I/O keynote — May 19.
That's not a coincidence. Anthropic scheduled its second developer conference to land on the same day as Google's biggest developer event. The message is deliberate: while Google announces what's coming, Anthropic will be showing what's already running in production.
Code with Claude SF on May 6 announced Managed Agents Dreaming, Outcomes + Multiagent public beta, and Code Review GA — all features that shipped immediately, not as roadmap items. If Code with Claude London follows the same pattern, expect at least one major Claude Code announcement to land simultaneously with Google's I/O keynote.
## How to Think About This
Google I/O 2026 will be impressive. Gemini 4 will post strong benchmarks. Firebase Studio's agent-native pivot will be a credible product announcement. The Gemini CLI subagents update is already showing that Google can move fast on tooling.
What I/O will not change: the architectural difference between a cloud-hosted browser IDE (Firebase Studio) and a terminal-native agent that owns the full development environment (Claude Code). These are different bets about where autonomous coding happens.
If your development workflow is web-first, Google-stack, and you're comfortable in a browser IDE, Firebase Studio's I/O announcement deserves your serious attention. If you're doing multi-agent, spec-driven, CI-integrated autonomous development at team scale, watch the benchmark gap between Gemini 4 and Opus 4.7 on SWE-bench Pro — that number will tell you whether Google has closed the model quality gap that currently makes the architectural comparison moot.
I/O keynote: May 19, 10am PT. Watch that number.
---
**Sources:**
- [Google I/O 2026 Developer Preview: Gemini 4, Android 17, Agentic Coding](https://www.abhs.in/blog/google-io-2026-may-19-gemini-4-android-17-agentic-coding-developer-preview) — Abhishek Gautam
- [What to Expect from Google I/O 2026](https://www.androidauthority.com/what-to-expect-from-google-io-2026-3664979/) — Android Authority
- [Gemini 4, AI Glasses And A New OS — Why Google I/O 2026 Could Be The Most Important Developer Event Of The Year](https://techround.co.uk/tech/gemini-4-ai-glasses-and-a-new-os-why-google-i-o-2026-could-be-the-most-important-developer-event-of-the-year/) — TechRound
- [Subagents have arrived in Gemini CLI](https://developers.googleblog.com/subagents-have-arrived-in-gemini-cli/) — Google Developers Blog
- [Subagents in Gemini CLI Enable Task Delegation and Parallel Agent Workflows](https://www.infoq.com/news/2026/04/subagents-gemini-cli/) — InfoQ
- [From Android 17, Gemini 4 to AI: Everything to expect at Google I/O 2026](https://www.businesstoday.in/technology/story/from-android-17-gemini-4-to-ai-everything-to-expect-at-google-io-2026-530775-2026-05-11) — BusinessToday
- [What to Expect from Google I/O 2026: Dates, Gemini & Android 17](https://android.gadgethacks.com/news/what-to-expect-from-google-io-2026-dates-gemini-android-17/) — Gadget Hacks
---
# Kimi K2.6: The Open-Weight Model That Scales to 300 Sub-Agents
URL: https://sdd.sh/2026/05/kimi-k2-6-open-weight-300-subagents-frontier-level/
Date: 2026-05-12
Updated: 2026-05-12
Tags: open-source, models, benchmarks, agentic-workflows, moonshot-ai, kimi
Categories: AI Tools, Industry
Summary: Moonshot AI's Kimi K2.6 landed on April 20 as the most capable open-weight coding model ever released: 1T-parameter MoE, 58.6% SWE-bench Pro, 66.7% Terminal-Bench 2.0, and an Agent Swarm that scales to 300 sub-agents executing 4,000 coordinated steps — at $0.60 per million input tokens.
On April 20, 2026, Moonshot AI quietly shipped Kimi K2.6 with no press conference and no countdown timer. Eight days earlier, beta testers were running a "Code Preview" build. Then the preview label disappeared, and K2.6 landed across Kimi.com, the Kimi App, the official API, and a dedicated Kimi Code CLI. The model earned very little Western press at launch — the AI news cycle was occupied with Code with Claude SF announcements — but the benchmarks are impossible to ignore.
## What Kimi K2.6 Actually Is
K2.6 is a 1-trillion-parameter Mixture-of-Experts model with 32 billion parameters activated per token. The architecture uses 384 experts, 8 selected plus 1 shared per token, across 61 transformer layers with Multi-head Latent Attention (MLA). The context window is 262,144 tokens. Native INT4 quantization is included, which makes local self-hosting viable on high-end consumer hardware.
The license is Modified MIT — meaningfully open for commercial use, with attribution requirements. You can download the weights from HuggingFace and run them on your own infrastructure. For organizations with data-sovereignty requirements or a need to keep code off third-party cloud APIs, this matters.
## The Benchmark Numbers
On SWE-bench Pro — the harder, less contaminated benchmark that replaced Verified as the credible measure of production-grade code repair — K2.6 scores 58.6%. For context: Claude Opus 4.7 sits at 64.3%, GPT-5.5 at 58.6% (tied), and its predecessor K2.5 at 50.7%. A 7.9-percentage-point jump in a single generation is substantial.
On Terminal-Bench 2.0, K2.6 hits 66.7% — up from 50.8% in K2.5. That is a 15.9-point leap, and it represents the most dramatic single-generation Terminal-Bench improvement any lab has published to date. Terminal-Bench measures real-world shell-level task completion, which is the metric that most closely maps to what autonomous coding agents actually do: navigate filesystems, run tests, parse build output, and iterate.
On SWE-bench Verified (the older, easier benchmark): 80.2%, sitting at the frontier ceiling alongside Opus 4.7 and GPT-5.5.
BrowseComp (Agent Swarm subset) improved from 78.4% to 86.3%. Toolathlon — a new agentic harness that stresses multi-tool chaining — jumped from 27.8% to 50.0%, the latter being a new category high for any open-weight model.
## The Agent Swarm System
The most architecturally significant feature in K2.6 is the Agent Swarm upgrade. K2.5 could coordinate 100 domain-specialized sub-agents executing up to 1,500 steps in a single autonomous run. K2.6 scales both numbers by 3x: 300 sub-agents, 4,000 coordinated steps.
What that means in practice: K2.6 can autonomously decompose a large software task, fan out to 300 specialized workers — each with its own context, tools, and prompt — and coordinate them through a shared execution graph for the duration of a long-horizon job. Sub-agents are domain-specialized (security scanner, test writer, API refactoring agent, etc.) and operate in parallel on a shared filesystem with coordination handled by the lead agent.
No Western frontier model ships this sub-agent scale out-of-the-box. Claude Managed Agents (covered separately) supports up to 20 unique agents per multiagent session. OpenAI Codex's multiagent primitives are still in early beta. Kimi K2.6 is, as of April 2026, the only model with a production-ready 300-agent swarm architecture baked into the base model.
Whether that translates to real-world wins is an honest question. More sub-agents does not automatically mean better outcomes — coordination overhead grows with scale, and 300 agents that half-communicate produce worse results than 20 agents that fully coordinate. Moonshot AI's BrowseComp score of 86.3% is the clearest public evidence that the swarm can actually complete complex tasks, but independent third-party evaluation at this scale is sparse.
## The Cost Picture
Claude Opus 4.7 is priced at $5.00 per million input tokens and $25.00 per million output tokens. GPT-5.5 is $5.00/$30.00. Kimi K2.6 on the official API is $0.60/$2.50 — roughly 8x cheaper on input, 10x cheaper on output.
Self-hosted costs vary by hardware, but Moonshot AI's INT4 quantization means K2.6 can run on H100 clusters at competitive throughput without needing the ultra-high-end infrastructure that 70B+ dense models require.
For teams running high-volume agentic workflows — code generation pipelines that fire hundreds of times per day across a large engineering org — the cost differential is material. An organization spending $50K/month on Claude Opus 4.7 for automated code review and agent tasks could run an equivalent K2.6 workload for roughly $6K/month. That math is not precise (inference overhead, token counts, and task success rates differ), but the order-of-magnitude gap is real.
## The Kimi Code CLI
Alongside K2.6, Moonshot AI shipped the Kimi Code CLI: a terminal-native coding agent in the spirit of Claude Code and OpenCode. It uses K2.6 by default, supports MCP tool extensions, and includes a `/review` command for automated code review. Early benchmarks show it can reduce coding costs by up to 88% compared to equivalent Claude Opus 4.7 runs for the same tasks — a claim that requires the asterisk that task complexity, context length, and quality expectations vary significantly.
The CLI is available through the Kimi API. It does not yet have Claude Code's depth of integrations (Routines, /ultrareview, Agent Teams, the MCP ecosystem of 6,400+ servers, enterprise Cowork features). For solo developers or small teams doing straightforward agentic coding tasks, K2.6 + Kimi Code is a legitimate lower-cost alternative. For engineering organizations that need multi-cloud access, enterprise RBAC, audit trails, OpenTelemetry SIEM integration, and the full Claude ecosystem, it is not a replacement.
## Where Kimi K2.6 Fits in the Landscape
The honest framing: K2.6 is the best evidence yet that the frontier capability ceiling is reachable from outside the Western hyperscaler tier. Moonshot AI is a Chinese lab with fewer resources than Anthropic, Google, or OpenAI. They shipped a model that ties GPT-5.5 on SWE-bench Pro and beats it on Terminal-Bench 2.0, at a fraction of the closed-model cost, as open weights.
That has implications that go beyond which model to use today. It suggests the "frontier as moat" thesis — that capability leadership alone justifies premium closed-model pricing — is under real pressure. If K2.6 can close the gap this fast, the differentiation for Claude Code and the Anthropic stack has to come from the ecosystem, the trust layer, the enterprise integrations, and the agentic infrastructure primitives: Routines, Managed Agents Outcomes, Cowork, the analytics API. Those are not easily replicated by downloading model weights.
For engineers deciding where to point their agentic workflows: K2.6 is worth evaluating for cost-sensitive, high-volume tasks where you can self-host or use the API and don't need the full Anthropic ecosystem. For production engineering workflows where traceability, security, and enterprise integrations matter, the Anthropic stack retains its advantages — but the cost-justification argument just got harder.
## Sources
- [Kimi K2.6 Release Blog — Moonshot AI](https://www.kimi.com/blog/kimi-k2-6)
- [Kimi K2.6 Model Overview — Deep Infra](https://deepinfra.com/blog/kimi-k2-6-model-overview)
- [Kimi K2.6 — MarkTechPost](https://www.marktechpost.com/2026/04/20/moonshot-ai-releases-kimi-k2-6-with-long-horizon-coding-agent-swarm-scaling-to-300-sub-agents-and-4000-coordinated-steps/)
- [Kimi K2.6 HuggingFace model card](https://huggingface.co/moonshotai/Kimi-K2.6)
- [Kimi K2.6 vs Qwen 3.6 vs Opus 4.7 vs GPT-5.5 — BuildFastWithAI](https://www.buildfastwithai.com/blogs/kimi-k2-6-vs-qwen-3-6-vs-claude-opus-4-7-vs-gpt-5-5-2026)
- [Kimi K2.6 API benchmarks — Deep Infra](https://deepinfra.com/blog/kimi-k2-6-api-benchmarks-latency-throughput-cost)
- [Kimi K2.6 Review — Local AI Master](https://localaimaster.com/models/kimi-k2-6)
---
# Claude Managed Agents Outcomes + Multiagent: Moving from Prototype to Production
URL: https://sdd.sh/2026/05/claude-managed-agents-outcomes-multiagent-production/
Date: 2026-05-12
Updated: 2026-05-12
Tags: claude, managed-agents, anthropic, agentic-workflows, multiagent, production
Categories: Agentic Workflows, Guides
Summary: Outcomes and Multiagent orchestration moved to public beta on May 6. This is the practical guide to deploying self-verifying, multi-agent workflows in production — including how to write rubrics that actually work, the 20-agent coordinator limit, and what Netflix built with it.
Anthropic's May 6 Claude Managed Agents update shipped three features: Dreaming (already covered), Outcomes, and Multiagent orchestration. Dreaming is the async self-improvement loop that runs overnight and reshapes memory stores. Outcomes and Multiagent are the synchronous production primitives — the ones you wire into your agent code to make it reliable enough to run without supervision.
This is the hands-on guide. If you want the conceptual overview, the [Anthropic blog post](https://claude.com/blog/new-in-claude-managed-agents) has it. This article covers how to actually use Outcomes and Multiagent in production, what the failure modes look like, and how to compose the two features together.
## What Outcomes Actually Does
The core idea is simple: you write a rubric describing what success looks like, and a separate Claude instance grades the agent's output against that rubric in its own context window. If the output fails, the grader tells the agent exactly what needs to change, and the agent takes another pass. This repeats until the output passes or a retry limit is reached.
"Separate context window" is the key design choice. The grader cannot see the agent's reasoning trace, the tools it called, or the steps it took to produce the output. It evaluates the artifact itself — the document, the diff, the report — against your stated criteria. This eliminates a failure mode common in self-evaluation: models that generated a flawed output tend to rationalize it as correct when asked to review their own work in the same context.
In Anthropic's internal benchmarks, Outcomes improved task success rates by up to 10 percentage points over a standard prompting loop, with the largest gains on the hardest tasks. MindStudio's production data shows a 10.1% improvement in PowerPoint quality when Outcomes was applied to a slide generation agent — a concrete, customer-facing quality improvement with a single rubric change.
## Writing Rubrics That Work
The most common way to waste Outcomes is to write a vague rubric. "Make this excellent, polished, and accurate" is not a rubric. It is a wishful thought. A rubric the grader cannot operationalize produces a false sense of governance: the agent runs through cycles, the grader approves, and you have a worse outcome than a single well-prompted pass would have produced.
A rubric that works has three properties:
**Observable criteria.** Each criterion must be checkable from the document alone, without external knowledge. "The executive summary is three paragraphs or fewer" is observable. "The tone is appropriate" is not.
**Explicit constraints.** State what the output must not do, not just what it should do. Negative criteria are easier for a grader to evaluate. "No passive voice in the executive summary" is a better criterion than "use active voice." "Does not reference internal ticket numbers" is cleaner than "appropriate for external distribution."
**Tiered requirements.** Separate must-haves from nice-to-haves. If the output fails a must-have criterion, it should retry. If it misses a nice-to-have, the grader can flag it but still approve. The Anthropic documentation recommends 5 to 10 criteria per rubric for most tasks — enough to be specific, not so many that every output fails on minor stylistic grounds.
Before running Outcomes in production, test your rubric on known-good and known-bad examples. Generate five outputs manually, grade them yourself, then run the grading agent and compare. If the grader disagrees with your judgment more than twice out of five, rewrite the rubric.
## The Retry Budget
Every Outcomes configuration has a retry limit. Anthropic's documentation recommends three retries as a starting point. If your agent is failing three times before passing, the problem is almost always the prompt or the rubric, not the retry budget. Increasing retries to compensate for an underspecified rubric burns tokens and time without improving outcomes.
Watch for "rubric drift": if you add criteria incrementally as new failures surface, you will eventually have a rubric so strict that no output passes on the first pass even when the output is genuinely good. This inflates costs and reduces throughput. Audit your rubric quarterly. Remove criteria that are never triggered.
## Multiagent Orchestration: The Architecture
Multiagent orchestration addresses a different problem than Outcomes. Outcomes improves quality for a single agent on a well-defined task. Multiagent addresses scope: when the job is too large, too varied, or too parallel for a single agent to do well.
The structure is a coordinator plus specialists. The coordinator receives the top-level task, decomposes it, and delegates to specialized agents — each with its own model, system prompt, tool access, and context window. Specialists work in parallel on a shared filesystem. When they complete, results flow back to the coordinator, which synthesizes the outputs.
The hard limits: a maximum of 20 unique agents can be listed in `multiagent.agents`, but the coordinator can call multiple copies of each agent. So you can have 20 specialist types with multiple parallel instances of each, which in practice means the ceiling on concurrency is higher than it initially appears. All agents share the same filesystem, which is the coordination primitive — agents write intermediate artifacts that other agents read.
All Managed Agents endpoints require the `managed-agents-2026-04-01` beta header. The API is otherwise standard Claude API syntax.
## Netflix's Production Pattern
Netflix's platform team built a log analysis agent using Multiagent orchestration. The problem: their platform generates logs from hundreds of concurrent builds, across multiple infrastructure layers, and the signal-to-noise ratio in raw log output is too low for a single-agent pass to surface the patterns that matter.
Their architecture: a coordinator agent receives a time window and a log scope. It fans out to sub-agents that each process a batch of logs from a specific service or build stage. Sub-agents write structured summaries to the shared filesystem. The coordinator reads those summaries, identifies cross-service patterns, and produces a ranked list of issues worth investigating.
What made Multiagent the right choice here: the work was embarrassingly parallel (each log batch can be processed independently), the total context exceeded a single agent's practical window, and the per-batch tasks were homogeneous enough to use the same specialist agent type in multiple parallel copies.
What would not have worked: a single long-context agent reading all logs sequentially. At the token counts involved, quality degrades in the middle of the context window (the well-documented "lost in the middle" effect), and the sequential processing time would have made the results stale before they could be acted on.
## Composing Outcomes and Multiagent
The two features compose naturally. Apply Outcomes at the coordinator level for end-to-end quality verification. Apply Outcomes at the specialist level for tasks where individual specialists need to meet a quality bar before their output is passed to the coordinator.
For the Netflix pattern: add Outcomes to the coordinator agent with a rubric that checks the final report's coverage, actionability, and format. Do not add Outcomes to each specialist — per-specialist verification at 300+ specialist calls would cost more in retries than the quality gain is worth. Verify at the level where quality actually matters: the final artifact.
A practical composition pattern:
```
coordinator (with Outcomes rubric)
├── specialist_a × N parallel instances
├── specialist_b × M parallel instances
└── specialist_c × P parallel instances
(specialists write to shared filesystem)
coordinator reads results → produces output → Outcomes grader evaluates → retry if needed
```
## When Not to Use These Features
Outcomes adds latency (grader call + potential retry cycles) and cost (grader tokens). For tasks that already produce high-quality output reliably, it adds overhead without benefit. The right candidates are tasks where failure has a real cost: customer-facing documents, security reports, code diffs that will be merged without further human review.
Multiagent adds coordination complexity. For tasks that fit in a single agent's context window and do not have natural decomposition points, a single well-prompted agent is faster, cheaper, and more debuggable. Premature multiagent decomposition is one of the more common mistakes in early Managed Agents deployments.
The decision heuristic: if you are adding Outcomes to a task where outputs are already passing human review 90%+ of the time, stop and tune the prompt instead. If you are decomposing a task into agents that do not produce intermediate artifacts the coordinator actually reads, you do not have a multiagent workflow — you have parallel single-agent calls, which is cheaper and simpler to build.
## Getting Started
The Managed Agents quickstart covers authentication and the beta header. The Define Outcomes documentation has the rubric schema. The Multiagent sessions documentation has the coordinator/specialist agent config.
Start with one document type you generate repeatedly. Write a rubric with five specific, checkable criteria. Run Outcomes for a week. Measure the retry rate and whether the grader's rejections match your own quality judgment. That feedback loop will teach you more about effective rubric design than any documentation can.
## Sources
- [New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration — Anthropic](https://claude.com/blog/new-in-claude-managed-agents)
- [Define outcomes — Claude API Docs](https://platform.claude.com/docs/en/managed-agents/define-outcomes)
- [Multiagent sessions — Claude API Docs](https://platform.claude.com/docs/en/managed-agents/multi-agent)
- [Claude Managed Agents overview — Claude API Docs](https://platform.claude.com/docs/en/managed-agents/overview)
- [Claude Outcomes Feature Improved PowerPoint Quality 10.1% — MindStudio](https://www.mindstudio.ai/blog/claude-outcomes-feature-rubric-grading-agent-powerpoint-quality)
- [Anthropic updates Claude Managed Agents with three new features — 9to5Mac](https://9to5mac.com/2026/05/07/anthropic-updates-claude-managed-agents-with-three-new-features/)
- [Codex /goal and Claude Managed Outcomes: The New Control Loops — Developers Digest](https://www.developersdigest.tech/blog/codex-goal-vs-claude-managed-outcomes-practical-differences)
- [Claude Managed Agents: complete guide to building production AI agents — The AI Corner](https://www.the-ai-corner.com/p/claude-managed-agents-guide-2026)
---
# SpaceX Is Betting $60B on Cursor and $300MW on Anthropic at the Same Time. The AI Coding Market Just Got Weird.
URL: https://sdd.sh/2026/05/spacex-cursor-60b-anthropic-colossus-dual-bet/
Date: 2026-05-11
Updated: 2026-05-11
Tags: cursor, anthropic, spacex, market-analysis, infrastructure, claude-code
Categories: AI Tools, Industry
Summary: On April 21, 2026, SpaceX signed two deals simultaneously: a $60B buyout option on Cursor and a 300MW/220K-GPU compute lease to Anthropic via Colossus 1. The same infrastructure company is now the financial backer of the IDE-first AI coding world and the compute provider for the terminal-native AI coding world. That is not a contradiction — it is a hedge. And it tells you everything about where the AI coding market is headed.
On April 21, 2026, SpaceX made two announcements about AI coding on the same day.
The first: SpaceX struck a deal with Cursor that includes an option to acquire the company for $60 billion later this year. Alternatively, SpaceX can pay a $10 billion fee to maintain the partnership without the acquisition.
The second: Anthropic confirmed it will use all of the compute capacity at xAI's Colossus 1 data center — more than 220,000 NVIDIA GPUs across 300 megawatts in Memphis — via a partnership with SpaceX (which absorbed xAI in February in a deal valued at $1.25 trillion).
Let that sit for a moment.
SpaceX — through its merger with xAI — is simultaneously holding a buyout option on the world's leading IDE-first AI coding tool, and renting its entire GPU cluster to the company behind the world's leading terminal-native AI coding agent. It owns optionality on both sides of the most contested architectural debate in software development.
This is not an accident. It is the most honest map of the AI coding market that anyone has drawn.
## The Two Architectures, Briefly
The debate between IDE-embedded AI and terminal-native AI has been running for two years, but the positions are now clearly defined.
**The IDE-first thesis (Cursor's bet):** Developers spend most of their time in an editor. The best AI assistant is one that integrates deeply with that environment — understanding the full file tree, reading the diff, watching your cursor move. Cursor's parallel agents, Agents Window, Design Mode, and `/best-of-n` run — all build on the premise that the IDE is the right place to coordinate agentic workflows.
**The terminal-native thesis (Claude Code's bet):** The terminal is the real development environment. File systems, CI/CD pipelines, test runners, git, deployment scripts — none of these live in an IDE. An agent that operates at the terminal level can touch everything; an agent that lives inside an editor is structurally constrained to what the editor can see. Claude Code's architecture — subprocess execution, MCP servers, Routines for cloud scheduling, Agent Teams — is built on this premise.
These two architectures produce different tools. They serve overlapping but distinct developer populations. And until April 21, they had clearly distinct financial backers.
## What the Deals Actually Mean
The Cursor deal structure is interesting. SpaceX isn't acquiring Cursor outright — it's buying an option to do so. For the price of a $10 billion "breakup fee," SpaceX gets access to Cursor's technology and team for AI development purposes. The $60 billion buyout option gives SpaceX the right to absorb Cursor entirely if the partnership produces something worth owning.
The strategic logic from SpaceX's perspective: Cursor has the most sophisticated IDE-first agent infrastructure in the market and an estimated 4+ million developers who use it daily. Cursor has been compute-constrained — it could not train frontier models on its own. SpaceX has Colossus 1, a 300MW GPU cluster that can train frontier models. The partnership gives Cursor the compute to train its own model stack; it gives SpaceX a product with developer distribution and the self-reinforcing data flywheel that comes with it.
The Anthropic deal has a different logic. SpaceX absorbed xAI and found itself with more silicon than demand. Renting Colossus 1 capacity to Anthropic generates revenue from idle compute while the xAI/Cursor product stack is being built. SpaceX already said "No one set off my evil detector" when the Anthropic deal was raised internally — Elon Musk is comfortable simultaneously building a competitor and leasing compute to the market leader.
This is not unusual in infrastructure businesses. Cloud providers routinely provide compute to companies whose products compete with their own. What is unusual is having this level of overlap in such a contested strategic domain.
## Why Claude Code Benefits from This Arrangement
Anthropic's access to 220,000+ Nvidia GPUs at Colossus 1 directly affects Claude Code users: Claude's 5-hour rate limits were doubled across Pro, Max, Team, and Enterprise plans immediately after the deal was announced. Peak-hour throttling was removed. Opus 4.7 API rate limits were considerably raised.
That is not a minor quality-of-life improvement. Anyone who has hit a rate limit mid-task knows that rate limit walls are one of the primary friction points in autonomous agentic workflows. An agent that can run for 5 hours without interruption can close a sprint. An agent that runs out of tokens at hour 2 requires a human to restart it — which defeats the point of autonomy.
Colossus 1's scale is why Mercado Libre's 90% autonomous coding target (23,000 engineers by Q3 2026) is plausible rather than aspirational. The compute constraint was the binding constraint. It is now substantially relieved.
## The Asymmetry in the Bet
The two deals have different risk profiles for SpaceX.
The Anthropic deal is infrastructure revenue: SpaceX gets paid for compute it has and would otherwise underutilize. If Anthropic succeeds wildly, SpaceX gets a good customer. If Anthropic struggles, SpaceX has other customers. The downside is limited.
The Cursor deal is equity upside with strategic optionality: SpaceX is betting that IDE-first AI coding remains valuable even as terminal-native workflows mature. If Cursor continues to grow — it had 4M+ developers and a $50B valuation before this deal — the $60B buyout option captures that upside at a fixed strike. If Cursor loses to Claude Code and terminal-native agents, SpaceX walks away for the $10B breakup fee, having gained technology and data in the meantime.
This is a very well-structured hedge. SpaceX is long on the infrastructure that makes terminal-native AI work (via Anthropic compute lease), and long on the product that dominates IDE-first AI (via Cursor option). Whatever architectural model wins — or if both coexist, which is the most likely outcome — SpaceX has a position.
## What This Means for Developers
The short answer: neither architecture is going away.
The longer answer: these two deals probably lock in the bifurcation of the AI coding market for at least the next two years. Cursor will have frontier model compute to train its own specialized models — models that understand code in the context of an IDE, that are trained on the signal of what edits developers accept and reject. That will make Cursor meaningfully better at IDE-embedded workflows.
But Claude Code's compute advantage (via Colossus) translates to better long-horizon autonomy — the 5-hour sessions, the Agent Teams, the Routines that run overnight without a developer present. Those are different product promises aimed at different use cases.
The composable stack is probably the right mental model for most teams: Cursor for the human-in-the-loop, IDE-embedded editing workflow; Claude Code for the autonomous, fire-and-forget execution workflow. The SpaceX deals make both of those experiences better simultaneously.
The irony is that SpaceX — nominally positioned to own the infrastructure layer that either could run on — may end up being the neutral party that allows both architectures to reach their ceiling. In a market with this much capital and this many architectural bets, having a compute landlord who is genuinely indifferent about who wins is not the worst outcome for developers.
It just means the architectural debate continues. For longer, at higher capability levels, with more money at stake.
---
**Sources:**
- [SpaceX strikes deal for right to acquire Cursor for $60B — CNBC](https://www.cnbc.com/2026/04/21/spacex-says-it-can-buy-cursor-later-this-year-for-60-billion-or-pay-10-billion-for-our-work-together.html)
- [SpaceX strikes $60 billion deal for Cursor — Fortune](https://fortune.com/2026/04/22/spacex-strikes-60-billion-deal-cursor/)
- [Anthropic to use all of SpaceX-xAI's Colossus 1 compute — Data Centre Dynamics](https://www.datacenterdynamics.com/en/news/anthropic-to-use-all-of-spacex-xais-colossus-1-data-center-compute/)
- [Musk's SpaceX has rented Colossus to Anthropic — Tom's Hardware](https://www.tomshardware.com/tech-industry/artificial-intelligence/musks-spacex-has-rented-out-access-to-its-supercomputers-220-000-nvidia-gpus-and-300-megawatts-of-ai-compute-power-to-rival-anthropic-musk-says-no-one-set-off-my-evil-detector-antrhropic-also-interested-in-orbital-data-centers)
- [The New Power Triangle Shaping AI Compute — Alphabytes](https://joinalphabytes.substack.com/p/spacex-anthropic-cursor-partnership)
- [How SpaceX preempted Cursor's $2B fundraise — TechCrunch](https://techcrunch.com/2026/04/22/how-spacex-preempted-a-2b-fundraise-with-a-60b-buyout-offer/)
---
# Anthropic Goes to Wall Street: 10 Finance Agents, Microsoft 365, and Claude's Enterprise Vertical Play
URL: https://sdd.sh/2026/05/anthropic-finance-agents-wall-street-enterprise-vertical/
Date: 2026-05-11
Updated: 2026-05-11
Tags: anthropic, enterprise, financial-services, claude-managed-agents, agentic-workflows, claude-opus-4-7
Categories: AI Tools, Agentic Workflows
Summary: Anthropic shipped 10 ready-to-run agent templates for financial services work — pitchbooks, KYC screening, month-end close — plus Microsoft 365 add-ins for Excel, PowerPoint, and Word. Claude Opus 4.7 leads the Vals AI Finance Agent benchmark at 64.37%, and this is the first vertical where Anthropic is shipping domain-packaged agentic workflows out of the box.
On May 5, 2026, Anthropic did something it has never done before: it shipped a vertical.
Not a model. Not a platform primitive. Not an API feature. A vertical — ten ready-to-run AI agent templates built specifically for financial services, packaged with connectors, skills, and subagent architectures for the grunt work that consumes junior analysts and back-office teams. It came bundled with Microsoft 365 add-ins that let Claude operate across Excel, PowerPoint, and Word in a shared context, and a Moody's data partnership that gives agents access to verified financial data without requiring users to feed raw files.
Finance is famously conservative about software. The fact that Anthropic chose it as its first industry vertical says a lot about where enterprise AI adoption is actually happening — and where the next wave of agentic coding revenue is coming from.
## What Shipped
The ten agent templates cover both front office and back office:
**Front office (deal work, client work)**
- Pitch builder — turns earnings filings, comparables, and deal context into draft pitchbooks
- Meeting preparer — generates briefing packages from CRM data and public filings
- Earnings reviewer — flags material changes across quarterly reports
- Model builder — builds financial models from data feeds and audits formula dependencies
- Market researcher — aggregates sector intelligence into structured summaries
- Valuation reviewer — cross-checks model assumptions against live market data
**Back office (operations, compliance)**
- General ledger reconciler — matches entries across linked accounts
- Month-end closer — orchestrates the close sequence, flags anomalies, generates the management pack
- Statement auditor — verifies financial statement consistency and flags discrepancies
- KYC screener — runs name matches against sanctions lists, PEP databases, and adverse media
Each template is a reference architecture that packages three components: *skills* (Claude Code/Cowork plugin instructions plus domain knowledge), *connectors* (governed access to the data sources the task needs — Bloomberg, Moody's, internal data warehouses), and *subagents* (additional Claude instances called for specialist subtasks, such as comparables selection or methodology verification). The templates deploy as plugins in Claude Cowork and Claude Code, or as a cookbook for Claude Managed Agents if you want to wire them into your own infrastructure.
## The Benchmark That Matters Here
Claude Opus 4.7 leads the Vals AI Finance Agent benchmark at 64.37%, ahead of GPT-5.5 at 59.96% and Gemini 3.1 Pro at 59.72%. The Vals benchmark is specifically designed to test finance-domain agentic tasks — not just question answering, but multi-step workflows that involve data retrieval, calculation, and output generation under domain-specific constraints.
That 4.4-point lead over GPT-5.5 on a benchmark this domain-specific is not a number to dismiss. Finance workflows involve high-stakes edge cases where reasoning quality compounds across steps. A model that's broadly competitive on coding benchmarks might still fail the "does this forward schedule reconcile to the balance sheet?" check that a junior analyst catches on first review.
## The Microsoft 365 Integration
The bigger unlock is context continuity across the Office suite. Anthropic's Microsoft 365 add-ins let Claude operate in Excel, PowerPoint, and Word with shared context across applications.
What this means in practice: an analyst builds a DCF model in Excel, flags the bear case, and a deck appears in PowerPoint that reflects the bear case numbers — without copying, re-explaining, or context-switching. Claude carries the financial model's assumptions, data sources, and analytical conclusions across the entire workflow.
In Excel specifically: build models from public filings and data feeds, audit formula linkages across tabs and workbooks, run sensitivity tables. In PowerPoint: draft investor decks that auto-update when the underlying data changes. Outlook integration is coming.
This is the first Microsoft 365 integration that makes Claude a real workflow layer rather than a chat assistant bolted onto a document editor. The difference between those two things is the difference between a tool and a platform.
## Why Finance First
Anthropic did not choose finance by accident.
Finance has three properties that make it a natural first vertical for agentic AI:
**High value per workflow.** A junior analyst spending 40 hours on a pitchbook earns $150,000+ in loaded cost per year. Automating even 30% of that work has a measurable ROI that CFOs can put in a spreadsheet. Unlike developer productivity (where the ROI is real but harder to attribute), finance workflows have line-item costs that map directly to agent runtime.
**Structured outputs.** Financial documents — pitchbooks, credit memos, regulatory filings — have well-defined schemas. The grader in an Outcomes-powered agent can verify that a DCF output reconciles to its inputs. There is a ground truth, and the agent can check its own work against it. This is exactly what Claude Managed Agents' Outcomes feature is built for.
**Regulatory pressure.** KYC, AML, and audit trail requirements in finance are not optional. Anthropic's agent templates ship with documented dismissals, immutable audit logs, and governed data connectors — the compliance infrastructure that financial institutions need before they can put an agent in a production workflow. This is not an afterthought; it is a prerequisite for enterprise adoption in this sector.
The Moody's data partnership matters here too. Agents that need live financial data have historically required custom data pipeline work to provision reliably. Moody's partnership means that connection is pre-built, governed, and available out of the box — removing one of the major friction points for enterprise deployment.
## What This Signals for the Broader Market
Anthropic going vertical is a competitive move, not just a product addition.
Until now, enterprise AI deployments in finance have required significant professional services work — consulting firms, systems integrators, and in-house AI teams assembling custom agent architectures from primitives. Anthropic just commoditized the first layer of that stack. A bank that wants a KYC screener can now deploy a reference architecture in days instead of months.
That accelerates adoption and creates a new competitive pressure for the Cursors and Copilots of the world: if Anthropic is delivering packaged, domain-specific agentic workflows, an IDE-embedded assistant that helps developers write code is competing at a different level of abstraction. Cursor helps engineers build software. Anthropic's finance agents help analysts close books and screen clients. The latter has a clearer ROI conversation at the C-suite level.
Claude Code is the tool that developers use to build these agentic workflows. The finance agents are the workflows that non-developers now deploy on top of Claude. These are two different wedges into the same enterprise budget, and Anthropic is now pushing on both simultaneously.
## The Practical Question
For financial services teams evaluating this: the templates are starting points, not finished products. Each ships with a documented architecture, but production deployment requires mapping connectors to your actual data sources (Bloomberg, FactSet, internal data warehouses) and customizing skills to your firm's terminology and processes.
The Outcomes beta makes this significantly more robust than first-generation enterprise AI deployments. Rather than tuning prompts and hoping outputs are correct, you write a rubric: "The pitchbook must include a DCF with a sensitivity table, a comps table with at least five comparables, and a recommendation section with explicit supporting rationale." The agent runs, the grader evaluates, and if the rubric isn't satisfied, the agent tries again — in a separate context window, without the bias of its own prior reasoning.
That feedback loop is what turns these templates from impressive demos into production workflows.
Finance just became the first industry where Anthropic is shipping the whole stack: model, agent architecture, data connectors, office suite integration, and a self-verification loop. If this pattern holds, it won't be the last.
---
**Sources:**
- [Agents for financial services — Anthropic](https://www.anthropic.com/news/finance-agents)
- [Anthropic deepens push into Wall Street with new AI agents — Fortune](https://fortune.com/2026/05/05/anthropic-wall-street-financial-services-agents-jamie-dimon/)
- [Anthropic rolls out AI agents to target financial services — TechRadar](https://www.techradar.com/pro/anthropic-rolls-out-a-host-of-new-ai-agents-to-target-the-most-time-consuming-work-in-financial-services)
- [Anthropic Launches 10 Claude Agent Templates for Financial Services — how2shout](https://www.how2shout.com/news/anthropic-claude-agent-templates-financial-services-microsoft-365.html)
- [Claude Managed Agents Outcomes — Claude Docs](https://platform.claude.com/docs/en/managed-agents/define-outcomes)
---
# Code with Claude SF 2026: The Day Anthropic Declared Platform Intent
URL: https://sdd.sh/2026/05/code-with-claude-sf-2026-recap/
Date: 2026-05-10
Updated: 2026-05-10
Tags: claude-code, anthropic, agentic-workflows, managed-agents, code-review, spacex
Categories: AI Tools, Agentic Workflows
Summary: On May 6, Anthropic's first developer conference delivered six interconnected launches: a 300MW SpaceX compute deal, doubled Claude Code rate limits, Code Review GA at $15–25/PR, three Managed Agents upgrades, and an 80x Q1 growth figure that outpaced the company's own forecast by 8×. Taken together, they describe a company that is no longer just building a model — it is building the infrastructure layer for autonomous software development.
The number Dario Amodei opened with was not a product feature. It was a statement of position.
Anthropic had projected 10x revenue growth in Q1 2026. They got 80x. Annualized revenue run rate, as of April, sits at $30 billion — up from $87 million in January 2024. API volume is up 17x year-on-year. Claude Code, launched publicly in mid-2025, hit $1 billion in annualized revenue faster than any developer tool in history and now represents more than half of Anthropic's enterprise revenue.
These are not incremental metrics. They are the backdrop for understanding why the six announcements at Code with Claude SF were not a product roadmap — they were a strategy declaration.
---
## The Infrastructure Bet That Unlocks Everything Else
Every launch at the event was downstream of one deal: Anthropic signed an agreement with SpaceX to use all available capacity at the Colossus 1 data center in Memphis — more than 300 megawatts, over 220,000 NVIDIA GPUs, coming online within the month. The facility was originally built by xAI.
This is the third major compute commitment Anthropic has announced in 2026, joining agreements with Amazon (up to 5 GW, 1 GW online by end of year) and Google/Broadcom (5 GW, coming online in 2027). The SpaceX deal fills the near-term gap — it is capacity that is available now, not in 12 to 18 months.
The practical effect landed immediately. Effective May 6, Claude Code's five-hour rate limits were doubled across all paid plans: Pro, Max, Team, and seat-based Enterprise. Peak-hour throttling for Pro and Max accounts was removed entirely. Opus API rate limits were raised "considerably." For Claude Code's heaviest users — teams running multi-agent workflows against large codebases — these changes are not incremental. The constraints that force you to structure work around rate windows are gone.
The orbital compute angle is not a joke. Anthropic confirmed it is exploring developing multiple gigawatts of compute capacity in space with SpaceX. Whether that happens or not, the directional bet is clear: Anthropic is building infrastructure at a scale that makes near-term capacity constraints structurally impossible, not just unlikely.
---
## Code Review GA: The Product Launch of the Event
If the SpaceX deal was the infrastructure headline, **Code Review GA** was the product headline — and the one with the clearest near-term impact for engineering teams.
Code Review dispatches a team of agents to review a pull request in parallel. The architecture has three phases:
1. **Bug-finding agents** work in parallel, scanning the PR for logic errors, security issues, and implementation mistakes
2. **Verification agents** check each candidate finding — filtering false positives before they reach the developer
3. **Ranking** surfaces issues by severity, so the output is prioritized rather than a flat list
The result lands on the PR as two outputs: a single high-signal overview comment summarizing what was found, and in-line comments on specific lines. The intent is one PR comment worth reading, not 47 marginal notes.
### What it costs
Each review averages $15–25, scaling with PR size and codebase complexity. Large PRs (over 1,000 lines changed) get findings 84% of the time, averaging 7.5 issues. Small PRs (under 50 lines) get findings 31% of the time, averaging 0.5 issues. Reviews take approximately 20 minutes.
This is billed separately via Anthropic's "extra usage" mechanism — it does not count against your plan's included usage, and it does not replace /ultrareview (which remains available to Pro and Max users, with three free reviews included). Code Review is available on Team and Enterprise subscriptions.
### The "every Anthropic team uses it" claim
Anthropic said Code Review is used by every internal engineering team. That claim matters not as a credibility signal but as a calibration: the tool was built under real production conditions before GA, against real codebases at real velocity. The false-positive filter in the verification step reflects what happens when you run a review tool against ten thousand PRs and learn what wastes engineers' time.
---
## Managed Agents: Three Features, One Coherent System
The Managed Agents updates (covered in depth [here](/posts/claude-managed-agents-dreaming-self-improving-agents/)) shipped as three coordinated features:
**Dreaming** — a scheduled background process that reviews past sessions, synthesizes patterns, and curates the memory store so agents improve over time without model retraining. Harvey, the legal AI company, saw task completion rates increase 6× after deploying it.
**Outcomes** — a rubric-based evaluation layer. You define what success looks like before the agent starts; a separate grader (in its own context window) evaluates the output and sends the agent back for another pass if criteria are not met. Webhook notification fires on completion.
**Multiagent Orchestration** — a coordinator/specialist model. A lead agent decomposes a task and delegates to up to 20 specialist agents running in parallel on a shared filesystem. Each specialist can have its own model, prompt, and tool set.
These three features close the three main gaps between "capable prototype" and "production autonomous system": agents that learn from history, agents that can self-verify output quality, and agents that can scale horizontally without architectural changes.
---
## The Mercado Libre Number
The enterprise adoption data point Anthropic chose to lead with was Mercado Libre. The Latin American e-commerce and fintech platform — with 23,000 engineers — is targeting 90% autonomous coding by Q3 2026.
Ninety percent autonomous is not a target about replacing engineers. It is a target about what proportion of code shipped per quarter is primarily AI-generated, with humans in the review and direction role rather than the implementation role. For a 23,000-person engineering organization, the scale of the organizational transformation implied by that number is hard to overstate.
Mercado Libre is not a software company in the traditional sense — they are the infrastructure for digital commerce and payments across Latin America. If they hit 90%, the story becomes: agentic coding has crossed into industries where software is critical infrastructure, not just tech companies eating the world from the outside.
Anthropic also mentioned Shopify and other enterprise customers in the same tier, but did not publish specifics.
---
## What Code with Claude SF Actually Was
The mistake in reading this event as a product launch is that the individual announcements — while significant — are not the point individually. The point is the architecture they describe together.
**Compute** (SpaceX deal) → removes the capacity ceiling
**Rate limits** → translates compute into user capacity
**Code Review GA** → production-grade autonomous code review, billed per-use
**Managed Agents** → production-grade autonomous agents that improve over time
**80x growth** → reveals the demand already exists at this scale
**Mercado Libre** → demonstrates what the demand looks like at the enterprise end
These are not six separate features. They are one system: a vertically integrated platform for autonomous software development, from the GPUs at the bottom to the PR review comment at the top.
Cursor's response to this event was essentially silence. OpenAI's Code Review equivalent (Codex Code Review) is still in limited beta with manual trigger only. GitHub Copilot's equivalent is a rules-based heuristic layer, not a multi-agent verification loop.
The window where IDE-first AI tools could argue parity with terminal-native agentic platforms is closing. The Code with Claude SF announcements did not close it — they made it visible.
The next event is Code with Claude London on May 19–20, the same day as Google I/O 2026. The scheduling was not accidental. Anthropic will be in direct air competition with whatever Gemini 4 announcement Google makes that day. It will be worth watching both.
---
**Sources:**
- [Higher usage limits for Claude and a compute deal with SpaceX — Anthropic](https://www.anthropic.com/news/higher-limits-spacex)
- [Code Review for Claude Code — Anthropic Blog](https://claude.com/blog/code-review)
- [What's new in Claude Code — Code with Claude SF session](https://claude.com/code-with-claude/session/sf-whats-new-in-claude-code)
- [Anthropic says it hit a $30 billion revenue run rate after 'crazy' 80x growth — VentureBeat](https://venturebeat.com/technology/anthropic-says-it-hit-a-30-billion-revenue-run-rate-after-crazy-80x-growth)
- [Anthropic CEO says 80-fold growth in first quarter explains 'difficulties with compute' — CNBC](https://www.cnbc.com/2026/05/06/anthropic-ceo-dario-amodei-says-company-crew-80-fold-in-first-quarter.html)
- [Anthropic Doubles Claude Code Limits and Strikes Deal with SpaceX — BigGo Finance](https://finance.biggo.com/news/202605090028_Anthropic_Doubles_Claude_Code_Limits_SpaceX_Compute_Deal)
- [Live blog: Code w/ Claude 2026 — Simon Willison](https://simonwillison.net/2026/May/6/code-w-claude-2026/)
- [Anthropic's Platform Bet: Code with Claude 2026 Was Not a Product Launch — Shashi.co](https://www.shashi.co/2026/05/anthropics-platform-bet-code-with.html)
- [Code Review — Claude Code Docs](https://code.claude.com/docs/en/code-review)
- [New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration — Anthropic](https://claude.com/blog/new-in-claude-managed-agents)
---
# Claude Code Review Goes GA: $15–25 Per PR, Multi-Agent Reviewers, and Who It's Actually For
URL: https://sdd.sh/2026/05/claude-code-review-ga-multi-agent-pr-review/
Date: 2026-05-10
Updated: 2026-05-10
Tags: claude-code, code-review, anthropic, pull-requests, agentic-workflows
Categories: AI Tools, Guides
Summary: Claude Code Review went GA on May 6: a multi-agent system that dispatches parallel bug-finders, runs a verification pass to cut false positives, and lands a single high-signal comment on your pull request. It costs $15–25 per review, is billed separately from your plan, and is available on Team and Enterprise only. Here is how the architecture works, what the performance numbers say, and how it stacks up against /ultrareview, Cursor Bugbot, Greptile, and CodeRabbit.
Most PR review tools have a false-positive problem. They generate long lists of findings, many of which are stylistic, non-actionable, or wrong. Developers learn to skim the output, stop reading it carefully, and eventually stop trusting it. The tool becomes noise.
Anthropic's Claude Code Review — launched at general availability on May 6 at Code with Claude SF — was designed around the premise that the false-positive problem is the only problem worth solving. The multi-agent architecture exists specifically to filter noise before it reaches a developer. Whether it succeeds, and whether the price is justified, depends on your team's size, PR volume, and how much you trust automated review already.
Here is what the tool actually does, what it costs, and how to think about whether it belongs in your workflow.
---
## The Architecture: Three Passes, One Comment
When you open a pull request with Code Review enabled, Anthropic dispatches a team of agents rather than a single model pass. The pipeline has three stages:
**1. Parallel bug-finding.** Multiple agents work through the PR simultaneously, looking for logic errors, security issues, race conditions, missing error handling, and implementation mistakes. The parallelism matters: a single agent working sequentially through a 1,000-line PR has less capacity for deep analysis on any one section than specialized agents focusing on different dimensions.
**2. Verification pass.** Each candidate finding goes through a second agent that verifies whether the issue is real, context-valid, and worth surfacing. This is the false-positive filter. An agent that flags a potential null dereference that is actually guarded by an upstream check gets corrected here, before the finding reaches the PR.
**3. Severity ranking.** Verified findings are ranked. Critical bugs surface at the top; low-severity suggestions are included but positioned below. The output is a prioritized list, not a flat dump.
The result is two outputs on the pull request: a single **overview comment** summarizing findings by severity, and **in-line comments** on specific lines for issues that require per-line context. Anthropic's stated design goal is that the overview comment should be readable in two minutes and actionable without reading the full in-line comments first.
The entire process takes approximately 20 minutes. It is asynchronous — you open the PR, continue working, and the review appears when it is done.
---
## What the Performance Numbers Say
Anthropic published internal benchmarks at launch. The two most relevant figures:
- **Large PRs (1,000+ lines changed):** 84% receive at least one finding, averaging 7.5 issues
- **Small PRs (under 50 lines):** 31% receive findings, averaging 0.5 issues
The large-PR number is the one to focus on. On complex changes — the kind where a human reviewer would spend 45 minutes and still miss things — Code Review finds something meaningful in 5 out of 6 cases and returns an average of 7 to 8 ranked issues. That is a non-trivial signal density for a 20-minute asynchronous process.
The small-PR number is less impressive but expected. Under 50 lines, the verification pass tends to correctly discard most candidates as context-insufficient. You should not use Code Review as a substitute for review on small, targeted changes.
---
## Pricing: What "Billed Separately" Actually Means
Each review costs $15–25, scaling with PR size and codebase complexity. This is billed separately through Anthropic's "extra usage" mechanism and does **not** count against your plan's included usage.
This matters for budgeting. If your team ships 20 PRs per week, you are looking at $300–$500/week, or roughly $1,200–$2,000/month, before any volume negotiation on an Enterprise plan. For a 10-person team where each engineer's time costs the company $50–$100 per hour, that is equivalent to 12–40 engineer-hours of review per month — roughly 1.2 to 4 hours per engineer.
Whether that math works depends on your review patterns. If your team's PRs are large and complex and reviewers frequently miss bugs that reach production, $15–25 per PR is cheap. If your team ships small, well-tested PRs with high review coverage already, the cost-benefit calculus is weaker.
Enterprise teams can negotiate per-seat pricing structures. The $15–25 figure is the published baseline; at scale the effective per-review cost drops.
---
## Availability: Team and Enterprise Only
Code Review is **not** available on Pro or Max plans. It is a Team and Enterprise feature, with a GitHub integration that posts comments directly to PRs via the GitHub App.
This is a deliberate positioning choice. The feature requires codebase context (not just the diff), which means it needs the Claude Code codebase integration already configured — something that is standard in Team and Enterprise deployments but less commonly set up on individual Pro accounts.
Anthropic claimed that every internal engineering team at Anthropic uses Code Review. That claim doubles as a statement about the intended customer profile: teams shipping production code at speed where review bandwidth is a real constraint, not individual developers working on side projects.
---
## How It Compares
| Tool | Architecture | Availability | Price per review | Async? |
|---|---|---|---|---|
| **Claude Code Review** | Multi-agent (find → verify → rank) | Team / Enterprise | $15–25 | Yes (20 min) |
| **/ultrareview** | Single dedicated cloud session | Pro / Max (3 free) / Team / Enterprise | Included (Pro/Max) or extra usage | Yes |
| **Cursor Bugbot** | Single-pass static analysis + LLM | Cursor Pro+ | Included in plan | No (inline) |
| **Greptile** | Codebase-aware semantic search + LLM | All plans | $29–$99/mo flat | Yes |
| **CodeRabbit** | Rules + LLM hybrid, configurable | All plans | $12–$24/user/mo | Yes |
A few things to note in this comparison:
**/ultrareview** is a different product category. It is a dedicated Claude session that produces a long-form review document — closer to a senior engineer's written code review than an automated scan. It is better for catching architectural issues and reasoning about tradeoffs; Code Review is better for catching bugs at speed. They are complementary, not competing.
**Cursor Bugbot** is the closest structural competitor, but it runs as a single pass without a verification step and is embedded in the Cursor IDE context, not a GitHub integration. It is better suited to inline suggestions during active coding than to async PR-level review.
**Greptile and CodeRabbit** are both configurable, rules-aware tools that work well at the process layer — enforcing conventions, catching common mistakes at low cost per PR. They are not multi-agent verification systems. At $12–99/month flat, they cost less at low PR volume and more at high PR volume.
The matrix for choosing:
- **$15–25/PR Code Review** → high-complexity PRs, security-sensitive codebases, teams where production bugs are expensive
- **/ultrareview** → architectural decisions, large refactors, onboarding a new area of the codebase
- **Greptile/CodeRabbit** → lightweight convention enforcement, all PR sizes, cost-sensitive teams
- **Cursor Bugbot** → inline suggestions during development, not async PR review
---
## Is It Worth It?
The honest answer is: it depends on what a production bug costs you.
For a consumer app with a fast rollback cycle and low blast radius, $15–25 per PR is hard to justify against the alternatives. For a fintech platform, a healthcare system, or infrastructure code where a missed race condition or authentication bug creates a serious incident, $25 to catch it before merge is not a discussion worth having — it is obviously worth it.
The more interesting question is what Code Review changes structurally. If developers know that a verification pass is going to run on every PR, the review bandwidth problem changes shape. A team of four engineers cannot review every PR at depth. A multi-agent system running in parallel can. That changes what human reviewers focus on: architectural intent, product logic, and edge cases that require domain knowledge — rather than bug-hunting that an agent does better under time pressure.
Anthropic's stated goal is not to replace human code review but to change what human reviewers spend their time on. Based on the architecture and performance numbers, that is the correct framing.
At $15–25/PR with a 20-minute async cycle, the adoption friction is low enough that the right move for most Team and Enterprise accounts is to run it in parallel with your existing process for a sprint, measure the findings against what your human reviewers caught, and let the data decide.
---
**Sources:**
- [Code Review for Claude Code — Anthropic Blog](https://claude.com/blog/code-review)
- [Code Review — Claude Code Docs](https://code.claude.com/docs/en/code-review)
- [Anthropic Code Review for Claude Code: Multi-Agent PR Reviews, Pricing, Setup, and Limits — DEV Community](https://dev.to/umesh_malik/anthropic-code-review-for-claude-code-multi-agent-pr-reviews-pricing-setup-and-limits-3o35)
- [Claude Code Review vs Bugbot vs Greptile vs CodeRabbit — FindSkill.ai](https://findskill.ai/blog/claude-code-review-vs-cursor-bugbot-greptile-coderabbit/)
- [Anthropic Charges $25 Per PR — Claude Code Review Backlash — Level Up Coding](https://levelup.gitconnected.com/anthropic-wants-25-per-pull-request-devs-are-losing-their-minds-over-claude-code-review-55dbbaca4996)
- [Is Claude Code Review Worth $15–25 Per PR? (2026 Verdict) — BuildFastWithAI](https://www.buildfastwithai.com/blogs/claude-code-review-guide)
- [Live blog: Code w/ Claude 2026 — Simon Willison](https://simonwillison.net/2026/May/6/code-w-claude-2026/)
- [Claude Code PR reviews are here, just $15–25 — LinkedIn / John Crickett](https://www.linkedin.com/posts/johncrickett_claude-code-pr-reviews-are-here-just-15-activity-7437180510180696064-NlR6)
---
# CVE-2026-26268: The Cursor RCE That Proves IDE-Embedded AI Has a Structural Security Problem
URL: https://sdd.sh/2026/05/cve-2026-26268-cursor-rce-ide-security-architecture/
Date: 2026-05-09
Updated: 2026-05-09
Tags: cursor, security, cve, rce, ide, claude-code, agentic-workflows
Categories: AI Tools, Industry
Summary: Novee Security disclosed a CVSS 9.9 remote code execution vulnerability in Cursor on April 28, patched in version 2.5. The attack vector — a malicious git hook triggered automatically by Cursor's own agent — is not a bug that better code can fully solve. It is a consequence of putting an autonomous AI agent inside a process that has broad, native system access.
On April 28, 2026, security researcher firm Novee Security disclosed [CVE-2026-26268](https://novee.security/blog/cursor-ide-cve-2026-26268-git-hook-arbitrary-code-execution/): a remote code execution vulnerability in Cursor, the $50 billion AI-native IDE. The National Vulnerability Database rated it **CVSS 9.9 — critical**. Cursor patched it in version 2.5 and issued its own severity assessment of 8.0, contesting NVD's rating in a move that felt more like reputation management than technical disagreement.
The vulnerability is patched. Update to Cursor 2.5 if you have not. But the technical details of how this attack worked — and why it worked specifically against an IDE-embedded AI agent — deserve more attention than the standard "update your software" coverage has given them.
---
## How the Attack Works
The attack chain has three steps.
**Step 1: Craft the repository.** An attacker creates a legitimate-looking repository that contains an embedded bare repository (a `.git` directory inside the project). That embedded bare repo includes a pre-commit hook containing malicious code.
**Step 2: Exploit Cursor's insufficient sandbox.** Cursor runs its AI agent in a sandbox — but the sandbox failed to protect `.git` directory configurations from write operations. A malicious agent, triggered via prompt injection (a carefully crafted comment, README, or code file the agent reads), could write to `.git/hooks/pre-commit` without triggering a permission prompt or warning.
**Step 3: Wait for git.** The next time Cursor's agent — or the developer — runs a `git commit` or `git checkout` operation inside the embedded repository, the hook fires automatically. The hook runs with the full privileges of the Cursor process. The developer's workstation executes attacker-controlled code.
No confirmation dialog. No permission prompt. No indication anything unusual happened. The hook fires because that is what git hooks do.
Cursor classified the root cause as **CWE-862: Missing Authorization** — the sandbox did not properly restrict write access to `.git` configuration files. The fix in 2.5 implements proper authorization controls to prevent this write path. That is the correct patch for this specific vulnerability.
It does not fix the class of problem.
---
## Why IDE-Embedded AI Is Structurally More Exploitable
Novee's disclosure contains an observation that deserves to be read carefully:
> "Traditional IDEs are passive, doing what developers explicitly tell them to do. Cursor's AI agent interprets intent and autonomously decides which commands to run — which includes git operations."
This is the crux. A traditional IDE could be misconfigured by a malicious repository, but the developer would have to explicitly run a command to trigger a malicious hook. The attack surface requires human action.
An IDE-embedded AI agent changes the threat model in two ways:
**1. The agent decides autonomously which commands to run.** When Cursor's agent is working on a task, it reads files, infers intent, and executes git operations as part of completing that task. A prompt injection payload — embedded in a README, a comment, a variable name, or a docstring — can influence those decisions without any developer interaction. The agent reads the file, the injection fires, the agent executes the operation, the hook runs.
**2. The process inherits native system access.** Cursor runs as a desktop application with your full user-level permissions. It can read your SSH keys, your environment variables, your `.aws/credentials`, your browser cookies, your source code — anything your user account can access. When the sandboxed agent writes to a git hook and that hook fires, the code runs as Cursor, which runs as you.
The sandbox between the AI agent's reasoning layer and the host system is the only barrier. CVE-2026-26268 is a specific instance of a prompt injection attack defeating that barrier via a git hook. The next CVE will find a different path through the same architecture.
---
## The Terminal-Native Contrast
Claude Code's architecture starts from a different premise. There is no IDE wrapper. The agent is a terminal process. It has the permissions you grant it when you run it, and those permissions are explicit in the CLAUDE.md configuration and the per-session approval flow.
When Claude Code needs to run a git operation, it uses the `Bash` tool — which is visible in the session transcript, auditable, and (in default mode) subject to approval before execution. Prompt injection can still occur, but the attack surface for an injected payload is the bash tool approval layer, not a hidden hook that fires silently on the next git operation.
This is not to say Claude Code is immune to prompt injection. No AI agent that reads arbitrary files is immune to prompt injection. But the architecture determines how hard it is to weaponize a successful injection into actual code execution.
| Factor | Cursor (IDE-embedded) | Claude Code (terminal-native) |
|---|---|---|
| Process privileges | Full desktop-app user-level access | Terminal user-level access |
| Tool execution visibility | Determined by sandbox implementation | Tool call visible in session transcript |
| Hook/background trigger surface | Git hooks, IDE extensions, file watchers | Bash tool (auditable) |
| Blast radius if sandbox fails | Full workstation compromise | Same, but the sandbox is shallower |
The sandbox in a desktop AI IDE has to protect against a much wider attack surface than a terminal agent — because the IDE itself runs with broad system integration by design.
---
## What Happened After Disclosure
Novee disclosed the vulnerability privately to Anysphere (Cursor's parent company) before publishing. Cursor patched it in version 2.5. So far, standard responsible disclosure.
The controversy is over the severity rating. NVD assigned CVSS 9.9. Cursor published its own CVSS analysis scoring the vulnerability at 8.0, arguing the attack requires prompt injection as a prerequisite — which it categorizes as a separate vulnerability class — reducing the attack complexity score.
This is technically defensible but practically misleading. Prompt injection is not a separate attack that happens first and then this vulnerability separately requires exploitation. Prompt injection *is* the mechanism by which the malicious agent behavior is triggered. Treating them as independent events understates how easily the full attack chain executes in practice: open a malicious repository in Cursor, let the agent read the README, watch a hook get written.
The 9.9 vs 8.0 debate matters for enterprise security teams tracking CVE severity. For developers, the practical conclusion is the same either way: critical enough to patch immediately.
---
## Cursor Is Not Alone
Fairness requires noting that Cursor is not the only IDE-embedded AI tool with a git hook attack surface. Any AI agent that:
- Runs inside a desktop process with broad system access
- Can write to the filesystem (including `.git` directories)
- Reads arbitrary files for context
...has a version of this risk. Cursor got the CVE because Novee found the specific write path to `.git/hooks`. Other IDE-embedded agents have similar surfaces.
The Hacker News thread on CVE-2026-26268 produced a good question: "Has anyone checked whether Windsurf or Copilot have the same `.git` write path?" No published disclosure as of this writing, but it is the right question to ask.
What makes Cursor specifically notable here is the combination of scale ($50 billion valuation, 1 million users, rapid enterprise adoption) and the aggressive expansion of autonomous agent capabilities — including the [Cursor SDK](/posts/cursor-sdk-programmatic-agents-escape-the-ide/) that allows programmatic agent invocation. More autonomous capability, running in the same process with the same system access, means more surface area for the same class of attack.
---
## What Developers Should Do
**Immediately:**
- Update Cursor to version 2.5 or later. This is the specific patch for CVE-2026-26268.
- Audit which repositories your Cursor agent is allowed to operate in. Treat unknown repositories the same way you would treat unknown npm packages.
**Structurally:**
- Understand that an AI agent that runs inside your desktop application with your user-level permissions is categorically different from a sandboxed cloud agent. The blast radius of a successful prompt injection is your workstation.
- Review your CLAUDE.md (or equivalent agent configuration) to understand what tool access you have explicitly granted. For Claude Code users, the `deny` rules in `settings.json` and the `Bash` tool approval flow are the relevant controls.
- Consider whether the workflows where you need autonomous git operations from your AI agent actually require that agent to run inside a process with full desktop privileges.
---
## The Bigger Question
CVE-2026-26268 will be fixed in most Cursor installations within a few weeks as users update. The next vulnerability will emerge — not because Anysphere is careless, but because the architecture puts an autonomous reasoning system inside a privileged desktop process, and the space of ways that can go wrong is large.
The terminal-native model that Claude Code uses does not eliminate this risk. But it does change the threat model in a meaningful way: a terminal process is expected to execute commands, and those commands are visible and auditable by design. A desktop AI IDE inherits all the attack surface of a desktop application *plus* all the attack surface of an autonomous AI agent — in the same process, with the same permissions.
That is a structural property of the architecture, not a bug any single patch can fix.
---
**Sources:**
- [CVE-2026-26268: How an AI Coding Agent Can Run Exploits in Cursor IDE — Novee Security](https://novee.security/blog/cursor-ide-cve-2026-26268-git-hook-arbitrary-code-execution/)
- [CVE-2026-26268 Detail — NVD](https://nvd.nist.gov/vuln/detail/CVE-2026-26268)
- [CVE-2026-26268: Anysphere Cursor Sandbox Escape RCE Flaw — SentinelOne](https://www.sentinelone.com/vulnerability-database/cve-2026-26268/)
- [Critical Cursor bug could turn routine Git into RCE — CSO Online](https://www.csoonline.com/article/4164250/critical-cursor-bug-could-turn-routine-git-into-rce.html)
- [Cursor AI IDE vulnerability allows code execution via hidden Git hooks — HackRead](https://hackread.com/cursor-ai-ide-vulnerability-code-execution-git-hooks/)
- [Critical Cursor Vulnerability Exposes Developer Workstations To Remote Code Execution — CyberPress](https://cyberpress.org/cursor-rce-threatens-developers/)
---
# Claude Agents Can Now Dream. Harvey Saw 6× More Tasks Completed.
URL: https://sdd.sh/2026/05/claude-managed-agents-dreaming-self-improving-agents/
Date: 2026-05-09
Updated: 2026-05-09
Tags: claude-code, managed-agents, agentic-workflows, anthropic, ai-memory
Categories: AI Tools, Agentic Workflows
Summary: Anthropic's Dreaming feature — launched at Code with Claude SF on May 6 — lets managed agents review their own past sessions overnight, curate what they learned, and arrive at the next run measurably better. Harvey, the legal AI company, saw task completion rates increase 6× after deploying it.
Legal AI company Harvey had a problem that every team running autonomous agents eventually hits: the agent kept making the same mistakes. A workaround that a human had corrected in session 47 was gone by session 48. A tool usage pattern that the team had optimized was invisible to the next run. Institutional knowledge evaporated every time the context window closed.
After deploying Anthropic's new **Dreaming** feature for Claude Managed Agents, Harvey's task completion rate increased roughly 6×. That is not a modest improvement to a success metric — it is a structural shift in what the agent is capable of, compounding across every subsequent run.
Dreaming shipped on May 6 at Anthropic's Code with Claude SF event, alongside two related features: **Outcomes** (rubric-based success evaluation) and **Multiagent Orchestration** (a coordinator/specialist model for decomposing large tasks). Together they move Claude Managed Agents from a capable agent loop to something closer to a continuously improving system.
---
## What Dreaming Actually Does
Memory — [launched earlier this year in public beta](/posts/claude-managed-agents-memory-public-beta/) — stores facts across sessions: preferences, project state, file structures, user-specific context. It is a notepad the agent can write to and read from.
Dreaming operates at a higher level of abstraction. It is a **scheduled background process** that reviews the agent's past sessions and memory stores, extracts patterns across them, and curates memories so the agent improves over time. It surfaces things a single session cannot see on its own: recurring mistakes across dozens of runs, workflows the agent consistently converges on, preferences that show up across a team's sessions.
When a dream runs, it reads the existing memory store alongside past session transcripts, then produces a new, reorganized memory store:
- Duplicate entries merged into single canonical facts
- Stale or contradicted knowledge replaced with the latest validated value
- New patterns surfaced as explicit learnings the next session will act on
The output is written as **plain-text notes and structured playbooks** — not embedded weights, not opaque vectors. Every insight is readable, auditable, and correctable by a human. You can see exactly what the agent learned and why.
For Harvey's legal agents — handling long-form drafting, multi-document review, and complex legal research — the patterns Dreaming surfaced included filetype workarounds, tool-specific usage sequences, and recurring failure modes that no single session had enough context to detect. The agents now arrive at each session pre-loaded with the institutional knowledge the team had built up, rather than starting from scratch.
---
## Control, Not Autopilot
One important design decision: Dreaming does not have to be automatic.
You decide how much control you want. The two modes:
1. **Auto-update**: Dreaming runs on a schedule, updates memory, and the next session benefits immediately. Low friction, suited for established agents with validated memory stores.
2. **Manual review**: Dreaming produces a candidate memory update. A human reviews the proposed changes before they land. Higher friction, appropriate during the early stages when you are still calibrating what the agent should and should not retain.
This is a meaningful distinction. Automatic memory updates in a production agentic system carry real risk — a bad dream could propagate a systematic mistake at scale. The manual review mode gives teams an audit layer, which matters especially in regulated industries (where Harvey operates) where the agent's reasoning chain needs to be defensible.
The schedule configuration is via the Managed Agents API; the [Dreams API docs](https://platform.claude.com/docs/en/managed-agents/dreams) cover the YAML structure and available trigger windows.
---
## Outcomes: Defining Done Before the Agent Starts
The second feature — Outcomes — solves a different problem. Most agentic workflows define a task, run the agent, and then evaluate the result by eye. That works fine for a demo. It does not scale to production agents running thousands of tasks per week.
**Outcomes** lets you write a rubric describing what success looks like before the agent starts. The agent then works toward that rubric. When it finishes, a **separate grader** evaluates the output against your criteria — in its own context window, so it is not influenced by the agent's reasoning or any in-session confirmation bias.
When the grader finds the output does not meet the rubric, it pinpoints what needs to change and sends the agent back for another pass. You define the success criteria once; the agent iterates until it meets them.
You can also add a webhook: `POST /sessions/{id}/outcome` notifies your system when the agent completes (or exhausts its retry budget). No polling. The agent runs, the webhook fires, you process the result.
This is a production pattern, not a convenience feature. Teams that are building SLA-bound agentic workflows — "this task must meet these criteria before it ships" — have had to implement this evaluation loop themselves until now. Outcomes brings it into the managed layer.
---
## Multiagent Orchestration: Coordinator + Specialists
The third feature is the most architecturally interesting. When a task is too large or too heterogeneous for a single agent to handle well, **Multiagent Orchestration** lets a lead agent break the job into pieces and delegate each to a specialist with its own model, prompt, and tools.
The canonical example from Anthropic's documentation is an incident investigation: a lead agent runs the overall investigation while subagents fan out in parallel through deploy history, error logs, metrics dashboards, and support tickets. The specialists work simultaneously on a shared filesystem and contribute findings to the lead agent's overall context.
Key constraints:
- Up to **20 unique agents** per multiagent session (the coordinator plus up to 19 specialists)
- Each specialist can have its own model, system prompt, and tool set — you are not restricted to a homogeneous fleet
- Specialists write to a shared filesystem; the coordinator aggregates
- No separate access request required; available via the Claude Platform API with the `managed-agents-2026-04-01` beta header
The 20-agent limit is not a technical ceiling — it is a guardrail while the feature is in public beta. Anthropic will almost certainly adjust it based on what production deployments actually need.
---
## Why This Matters Beyond the Headline
The Harvey 6× number is striking, but the structural implication is bigger than any single metric.
Until now, Claude Managed Agents was a capable, stateful loop. You got persistence across sessions (memory), sandboxed execution, checkpointing, and solid infrastructure. What you did not get was an agent that could learn from its own history in a systematic, auditable way — or a mechanism to define and enforce success criteria automatically.
Dreaming + Outcomes + Multiagent Orchestration closes three of the main gaps between "capable prototype" and "production-grade autonomous system":
| Gap | Feature that closes it |
|---|---|
| Agent forgets what it learned | Dreaming: scheduled memory curation |
| Success is defined by eyeballing | Outcomes: explicit rubric + grader loop |
| Single agent can't handle complex parallel work | Multiagent: coordinator + specialist fleet |
Together they describe a system that can improve over time (Dreaming), enforce its own quality bar (Outcomes), and scale its capacity horizontally (Multiagent) — without requiring a human in the loop for each iteration.
---
## How to Think About It in Practice
If you are running Claude Managed Agents today, the sequencing that makes sense:
1. **Start with Outcomes** if you have a production workflow with a definable success criterion. This is the lowest risk addition — you are just specifying what "done" means and letting the agent iterate toward it, with webhook notification when it gets there.
2. **Add Dreaming** once you have enough session history for it to be useful — typically after the agent has run 20-50 sessions with meaningful memory writes. Enable manual review mode first. Review a few dream outputs by hand before switching to auto-update.
3. **Add Multiagent Orchestration** when a single agent is hitting token or time limits on a given class of tasks, or when you have genuinely parallel workstreams that benefit from specialization (legal research + document drafting + citation verification can run simultaneously rather than sequentially).
4. **Combine all three** for the full flywheel: multiagent sessions generate more session data for Dreaming to work with; Dreaming curates learnings that improve the specialists; Outcomes catches regressions before they compound.
---
## The Bigger Picture
Dreaming is Anthropic's answer to a question the industry has been asking for a year: how do you make an agentic system that actually gets better over time, without retraining the model?
The answer is not fine-tuning and not RL on production traffic — both are expensive, opaque, and hard to govern. It is a scheduled background process that reads past behavior, synthesizes it into plain-text learnings, and writes those learnings back to the memory store that the next session reads. Observable, auditable, reversible.
That is the design philosophy Claude Code has always taken: terminal-native, filesystem-transparent, inspectable at every step. Dreaming extends the same philosophy up the stack to the agent memory layer.
Harvey's 6× completion rate will not be universal — it reflects a specific agent architecture (long-form legal drafting, complex tool chains, high session volume) where memory curation pays off quickly. Your numbers will depend on your task structure and session volume. But the direction is clear: agents that can learn from their own history are not just more convenient — they are fundamentally more capable in ways that compound.
The question is not whether to use Dreaming. The question is how quickly you can accumulate enough session history to make it valuable.
---
**Sources:**
- [New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration — Anthropic](https://claude.com/blog/new-in-claude-managed-agents)
- [Dreams — Claude API Docs](https://platform.claude.com/docs/en/managed-agents/dreams)
- [Claude Managed Agents overview — Claude API Docs](https://platform.claude.com/docs/en/managed-agents/overview)
- [Anthropic introduces "dreaming," a system that lets AI agents learn from their own mistakes — VentureBeat](https://venturebeat.com/technology/anthropic-introduces-dreaming-a-system-that-lets-ai-agents-learn-from-their-own-mistakes)
- [Anthropic is letting Claude agents 'dream' so they don't sleep on the job — SiliconANGLE](https://siliconangle.com/2026/05/06/anthropic-letting-claude-agents-dream-dont-sleep-job)
- [Anthropic will let its managed agents dream — The New Stack](https://thenewstack.io/anthropic-managed-agents-dreaming-outcomes/)
- [Claude agents can now dream — XDA Developers](https://www.xda-developers.com/claudes-leaked-dreaming-feature-is-now-live-and-it-lets-agents-learn-from-their-own-mistakes/)
---
# The Complete CLAUDE.md Guide
URL: https://sdd.sh/2026/05/the-complete-claude-md-guide/
Date: 2026-05-08
Updated: 2026-05-08
Tags: Claude Code, claude-md, agentic coding, Guides, spec-driven development
Categories: AI Tools, Guides
Summary: CLAUDE.md is the single highest-leverage file in any Claude Code project — and the most-misunderstood. This is the comprehensive guide: what to put in it, what NOT to put in it, the structure that compounds over time, and the security model that prevents CVE-2026-21852-style supply-chain attacks.
If you've used [Claude Code](/claude-code/) for more than a week, you've encountered CLAUDE.md. If you've used it for more than a month, you've probably written one badly. The file looks deceptively simple — a markdown doc the agent reads at session start — but the difference between a CLAUDE.md that compounds value over time and one that pollutes every conversation is large.
This is the comprehensive guide. What to put in. What to leave out. The structure that scales across a team. The security model that prevents you from getting owned.
## What CLAUDE.md actually is
When Claude Code starts a session, it walks the directory tree from your current working directory up to the home directory, reading every `CLAUDE.md` it finds, and concatenates them into the agent's system prompt. Project-level overrides user-level overrides org-level. The agent treats this content as authoritative context — not as soft hints, not as documentation, but as *facts about the project*.
That's the leverage. That's also the trap. Anything in CLAUDE.md is in the prompt on every turn. Two paragraphs cost ~250 tokens; ten paragraphs cost 1,200; a sloppy 800-line CLAUDE.md eats 6,000 tokens before you've typed anything. On large codebases run via `--dangerously-skip-permissions` for hours-long sessions, that adds up to real money — and worse, real *attention drift*: the agent starts treating the preamble as the most important context and the user's actual request as secondary.
## The mental model: "what would I tell a senior engineer joining this project today?"
Don't think of CLAUDE.md as documentation. Think of it as the briefing you'd give a senior engineer who's joining the project, who already knows how to write code, but doesn't know *this codebase*. The questions they would ask in the first thirty minutes are exactly what CLAUDE.md should answer.
What's the build command? Where do tests live? What linter? What conventions? What's our policy on dependencies? What are the load-bearing files? What should they read first? What pitfalls should they avoid?
Notice what they would *not* ask: they wouldn't ask you to explain what TypeScript is. They wouldn't ask for a tutorial on React. They know how to code; they need to know how to code *here*.
## The structure that compounds
A CLAUDE.md that ages well has a shape. Here's the template:
```markdown
#
## Stack
- Language:
- Framework:
- Database:
- Test runner:
- Package manager:
## Commands
- `pnpm dev` — local dev server
- `pnpm test` — full test suite
- `pnpm test ` — single file
- `pnpm build` — production build (must succeed before any commit)
- `pnpm lint` — biome check, run before committing
## Conventions
- File naming: kebab-case for files, PascalCase for React components
- Imports: absolute imports via `@/`, never deep relative paths
- Errors: throw typed errors from `lib/errors`, not generic Error
- Tests: colocated next to source (`foo.ts` → `foo.test.ts`)
- API routes: thin handlers, business logic in `lib/`
## Architecture
## Do not
- Do not run database migrations manually — use the `db migrate` command
- Do not introduce new dependencies without checking existing ones first
- Do not edit `dist/` or `.next/` directly
- Do not commit `.env*` files
## Things that aren't obvious
- The `feature-flags/` directory is consumed by GrowthBook, not internal
- `lib/legacy/` is read-only — being migrated, do not extend
- The `migrations/` folder must be alphabetically ordered for the runner
## Where to look
- New developers: read `docs/onboarding.md` first
- Adding a route: see `docs/routing.md` and existing examples in `app/api/`
- Background jobs: `workers/` and the `Inngest` setup in `lib/jobs.ts`
```
The pattern: factual, not pedagogical. Imperative, not narrative. Short.
## What to leave out
This is the part most people get wrong. CLAUDE.md is *not* the place for:
**Documentation that belongs in the README.** Your project's README is what humans see on GitHub. CLAUDE.md is what the agent sees on every prompt. They overlap, but they shouldn't be identical. README explains the project to potential users; CLAUDE.md briefs an engineer who's already on board.
**Implementation details the agent can read from the code.** Don't paste your `tsconfig.json` into CLAUDE.md. Don't list every file in `src/`. Don't enumerate React components. The agent can read the filesystem; tell it where to look, not what's there.
**Aspirational rules.** "We try to write tests for everything" is noise unless you actually do. The agent will treat aspiration as fact and produce code that assumes the rule is enforced. Rules in CLAUDE.md should be rules you'd fail a PR over.
**Multi-step workflows.** A deploy procedure, a migration playbook, a release process — these are [skills](/2026/05/skills-plugins-mcp-the-three-extension-layers/), not CLAUDE.md content. CLAUDE.md is loaded on every turn; skills load when triggered. Putting a deploy procedure in CLAUDE.md eats context for every conversation, including the ones that have nothing to do with deploys.
**Secrets, paths to secrets, or anything sensitive.** This is non-negotiable. CLAUDE.md is committed to your repo. Anyone running the agent in your project sees it. Anyone who clones your repo sees it. See the security section below.
**Long opinionated essays about software philosophy.** The agent is paid by the token. So is the model behind it. Save the manifesto for the README.
## The CLAUDE.md hierarchy
Three levels, three purposes:
**`~/.claude/CLAUDE.md`** — your personal preferences. Things you want every project to inherit: "I prefer functional patterns over class-based when both work," "always run `gh pr create` with `--draft` first," "never use emoji in code or commit messages." These persist across all your work.
**`/CLAUDE.md`** — the project's source of truth. Everything in the template above. Committed to the repo. Treated as a project artifact: changes go through code review, the file is versioned alongside the code.
**`/CLAUDE.md`** — local overrides. The `frontend/` directory has different conventions than `backend/`? Each gets its own CLAUDE.md with the deltas. The agent reads all of them, with deeper directories overriding shallower ones.
The hierarchy is the lever for big monorepos. A 10,000-line root CLAUDE.md is unmaintainable; ten 200-line subdirectory CLAUDE.mds, each scoped to a service, are.
## The security model: don't get owned
CVE-2026-21852 ([covered in detail](/2026/04/the-claude.md-trap-how-a-new-supply-chain-attack-targets-agentic-developers/)) was a patched vulnerability where a malicious CLAUDE.md could escalate the agent's permissions silently — bypassing user-defined deny rules and exfiltrating credentials. Patched in v2.1.90. The lesson generalizes beyond the specific CVE: *a CLAUDE.md cloned from a stranger's repo is executable trust*.
The rules:
1. **Never put secrets in CLAUDE.md.** Reference environment variable names, never values. `Use $DATABASE_URL` is fine. `Use postgresql://user:hunter2@...` is a credential leak.
2. **Treat CLAUDE.md from untrusted repos like `npm install` from an unverified package.** Read it before running the agent in that repo. Look for instructions that override permissions, exfiltrate data, or invoke unfamiliar tools. The agent will follow the instructions; if you didn't write them, you don't know what you're authorizing.
3. **Use deny rules in `~/.claude/settings.json` for paths the agent should never touch.** A CLAUDE.md cannot grant access to denied paths; deny rules are the floor.
```json
{
"permissions": {
"deny": [
"Read(./.env*)",
"Read(./secrets/**)",
"Read(./credentials.json)",
"Read(~/.aws/**)",
"Read(~/.ssh/**)",
"Read(~/.gnupg/**)"
]
}
}
```
4. **For shared CLAUDE.md across a team or org, treat it like production code.** Pull request review. Git history. Linting (yes, lint your CLAUDE.md — there are early linters that catch ambiguous instructions). Sign-off for changes that grant new tool permissions.
5. **Audit your CLAUDE.md when you add an MCP server.** A CLAUDE.md that says "use the company-db MCP server for all queries" combined with an MCP server that has prod write access is a single line away from "delete the production users table." Cross-check the trust surface every time it grows.
## The check: does this file pay rent?
A working test for whether your CLAUDE.md is healthy:
- Open it.
- For each section, ask: "Would removing this make the agent measurably worse at common tasks here?"
- If the answer is no, delete it.
The discipline matters more than the absolute size. A 100-line CLAUDE.md where every line earns its place outperforms a 600-line one where 80% is filler. The signal-to-noise ratio is what the agent's attention is allocated against; ratio matters more than total length.
## The forward-looking pattern
A few teams are now generating CLAUDE.md programmatically — pulling stack info from package.json, build commands from npm scripts, conventions from a linter config, file structure from `tree`. This is good. It means the file stays in sync with the code, and the parts that humans write (the "do not" list, the architecture summary, the things that aren't obvious) get the entire attention budget instead of being buried in a long auto-stale tutorial.
If you have the cycles, it's worth automating the boilerplate part. The bits that matter — the rules, the architectural map, the gotchas — still need a human author who's been burned by this codebase enough times to know what to write.
## See also
- [What is Spec-Driven Development?](/2026/03/what-is-spec-driven-development/) — the methodology that frames CLAUDE.md
- [Skills, Plugins, and MCP — the three extension layers](/2026/05/skills-plugins-mcp-the-three-extension-layers/)
- [The CLAUDE.md trap (CVE-2026-21852)](/2026/04/the-claude.md-trap-how-a-new-supply-chain-attack-targets-agentic-developers/)
- [Scaling Claude Code Skills across an engineering org](/2026/04/scaling-claude-code-skills-across-an-engineering-org/)
- [Claude Code FAQ](/2026/05/claude-code-faq/) — quick answers including the CLAUDE.md basics
CLAUDE.md isn't documentation. It's the contract between you and the agent. Write it like one.
---
# Skills, Plugins, and MCP: The Three Layers of Claude Code Extensibility
URL: https://sdd.sh/2026/05/skills-plugins-mcp-the-three-extension-layers/
Date: 2026-05-08
Updated: 2026-05-08
Tags: Claude Code, MCP, skills, plugins, extensibility, agentic coding
Categories: AI Tools, Guides
Summary: Skills, plugins, and MCP servers are the three ways to extend Claude Code — and they look similar enough that engineers routinely pick the wrong one. This is the reference: what each one is, when to reach for it, what they cost, and the failure modes nobody warns you about.
Claude Code has three extension layers, and on a casual reading they look interchangeable. Skills are markdown. Plugins are bundles. MCP servers are processes. All three "give the agent more capabilities." The marketing material treats them as a continuum.
They are not a continuum. They live at different abstraction levels, run at different times in the agent loop, and have completely different operational profiles. Picking the wrong one means either ignored capability (the agent never invokes your thing) or a bloated context window (the agent invokes it every turn). This is the reference for picking right.
## The thirty-second model
- **Skills** are *instructions for the agent*. Markdown files that teach Claude Code how to do a specific task — when to do it, what tools to use, what to avoid.
- **Plugins** are *distributable bundles*. A directory containing skills, MCP servers, slash commands, and metadata, installable via a URL.
- **MCP servers** are *external tool providers*. Long-running processes that expose actual functions (read this DB, deploy that service, query Linear) the agent can call.
If you remember nothing else: skills tell the agent *how to think*, MCP servers give the agent *new things to do*, plugins are how you ship either to other people.
## Skills: the instruction layer
A skill is a markdown file. That's it. Place it in `~/.claude/skills//SKILL.md` (user-level) or `.claude/skills//SKILL.md` (project-level), with optional supporting files in the directory, and Claude Code reads it during session bootstrap.
The structure that works:
```markdown
---
name: deploy-staging
description: Deploy the current branch to the staging environment. Triggered by "deploy to staging" or "push to staging".
---
# Deploy to staging
When the user asks to deploy to staging:
1. Verify all tests pass with `npm test`
2. Build the production artifact: `npm run build`
3. Run `bin/deploy --env staging`
4. Tail logs for 60 seconds to confirm healthy startup
If the deploy fails, the rollback is automatic — do not retry without the user's explicit confirmation.
Never deploy `main` directly without a tagged release.
```
What makes a skill effective is what makes any spec effective: it tells the agent *what to do*, *when to do it*, and *what success and failure look like*. The `description` field is the trigger — Claude Code surfaces a skill when the user's request matches its description, so the description has to be specific enough that the agent doesn't activate it on every adjacent topic.
**When to reach for a skill**: any task that needs more than three sentences to explain, any workflow that has a fixed sequence, any operation where "do this exact thing in this exact order" matters more than "figure it out." Deploys, releases, schema migrations, debugging playbooks, code review checklists, "how we structure a new microservice in this org."
**When skills fail silently**: when the description is too generic ("helps with deployments" — gets activated on every deployment-adjacent question, eats context). When the instructions are too prescriptive ("run command X exactly" — but the codebase has moved on). When the skill duplicates information already in CLAUDE.md and the two contradict.
The [scaling skills across an engineering org](/2026/04/scaling-claude-code-skills-across-an-engineering-org/) deep dive covers the team-level pattern: a shared skill marketplace where senior engineers package their playbooks once and the rest of the org consumes them.
## Plugins: the distribution layer
A plugin is a packaged bundle of skills, MCP servers, slash commands, and metadata. Install with `/plugin install `. Plugins solve the "how do I share extensions across a team or the world" problem.
The directory structure is simple:
```
my-plugin/
├── plugin.json # name, version, description, dependencies
├── skills/
│ ├── deploy-staging/SKILL.md
│ └── code-review/SKILL.md
├── mcp/
│ └── linear-server.json # MCP server configuration
└── commands/
└── status.md # custom slash command
```
The plugin.json declares which skills/MCP servers/commands ship together, what version of Claude Code is required, and any external dependencies (npm packages, Python wheels, OS binaries) that need to be present before activation. Anthropic's plugin marketplace publishes a few first-party plugins; private repos work too — `/plugin install https://github.com/your-org/plugin-name`.
**When to reach for a plugin**: when more than one person needs the same extension, when the extension has more than one moving part (a skill *and* an MCP server, for instance), when versioning matters (skill v2 changes the deploy flow, you don't want stragglers stuck on v1).
**When plugins are the wrong tool**: a single skill for one developer's personal workflow. A throwaway experiment. Anything you'd be embarrassed to find still installed in a year.
## MCP servers: the capability layer
This is where the architecture changes shape. A skill is a string; a plugin is a directory; an MCP server is a *process*. It runs alongside Claude Code (locally or remotely), speaks the [Model Context Protocol](/mcp/), and exposes functions the agent can call: `linear.create_issue`, `s3.list_buckets`, `db.run_query`. The MCP server does the actual work; the agent decides when to call it.
The protocol's value proposition has played out: 97 million downloads, [Microsoft Agent Framework](/2026/04/microsoft-agent-framework-1.0-the-enterprise-.net-world-just-adopted-mcp/), Salesforce, Google A2A, OpenAI all building on it. MCP let the AI tool ecosystem stop reinventing per-vendor function-calling for every integration.
Three flavors of MCP server:
1. **Local stdio servers** — a subprocess Claude Code spawns. Fastest, no network, useful for filesystem-bounded tools. Most Anthropic-built servers (filesystem, GitHub-via-token, sqlite) are stdio.
2. **Remote HTTP/SSE servers** — long-running services your team operates. Best for shared resources (databases, deploy systems, internal APIs) where you want central auth, audit, and rate-limiting.
3. **Cloud-hosted servers** — the [Pinterest blueprint](/2026/04/pinterests-mcp-blueprint-66000-invocations-a-month-7000-hours-saved-this-is-what-production-mcp-looks-like/) shows what this looks like at scale: a registry of internal MCP servers, two-layer JWT auth, observability, ROI dashboards.
Configure servers in `~/.claude/settings.json` or per-project `.claude/settings.json`:
```json
{
"mcpServers": {
"linear": {
"command": "npx",
"args": ["-y", "@linear/mcp-server"],
"env": { "LINEAR_API_KEY": "$LINEAR_API_KEY" }
},
"company-db": {
"url": "https://mcp.internal.company.com/db",
"auth": { "type": "bearer", "token_env": "COMPANY_MCP_TOKEN" }
}
}
}
```
**When to reach for an MCP server**: when the agent needs to *do* something that doesn't already exist as a CLI or filesystem operation. Read your Linear issues. Query your prod read-replica. Deploy via your custom CI system. Talk to Slack. Anything that crosses an authentication boundary, anything that requires a non-trivial library, anything that'd be unsafe to let the agent do via shell.
**When MCP is overkill**: anything that already has a CLI. Asking the agent to call `gh`, `kubectl`, `psql`, `aws`, `terraform` directly is faster, simpler, and uses the security model you already trust. Don't build an MCP wrapper around a CLI just to feel professional — Claude Code is good at calling CLIs.
## The decision matrix
| Need | Use |
|---|---|
| Teach the agent a multi-step workflow | Skill |
| Give the agent a new external capability | MCP server |
| Wrap an existing CLI for the agent | Just let it call the CLI |
| Share a workflow with a team | Plugin (containing skills) |
| Share a capability with a team | Plugin (containing MCP server) |
| Quickly trigger a known sequence | Slash command (in plugin or `~/.claude/commands/`) |
| One-off "always remind the agent of X" | CLAUDE.md (not a skill) |
The CLAUDE.md vs skill distinction trips people up. Rule of thumb: CLAUDE.md is for things the agent should *always know* (project conventions, build commands, file layout). Skills are for things the agent should know *when triggered* (a specific workflow, a niche operation). Putting a deploy procedure in CLAUDE.md eats context on every turn; putting "this codebase uses kebab-case" in a skill means the agent forgets it half the time.
## The failure mode nobody warns you about
The seductive failure: stuffing everything into CLAUDE.md and skills until the agent's context window is mostly preamble. CLAUDE.md is loaded on every session. Skills load when their description matches. If you have 40 skills with vague descriptions, half of them activate on any given session, and your effective context shrinks by thousands of tokens before you've said hello.
The check: every skill description should be a sentence the agent could plausibly answer "yes, that matches the user's request" or "no, it doesn't" with confidence. If you read your skill descriptions and they all sound the same, they will all activate the same.
The other failure mode: MCP servers without security boundaries. An MCP server that exposes `db.run_query` with a connection string to your prod read-write database is a foot-gun. Use read-replicas. Use scoped credentials. Use the [Lucidworks pattern](/2026/04/lucidworks-mcp-150k-per-integration-saved-and-what-it-says-about-mcps-real-value/) of one server per data domain with its own auth, not a single mega-server with admin access to everything.
## Going further
- [The MCP topic hub](/mcp/) — protocol roadmap, governance, production case studies
- [The CLAUDE.md guide](/2026/05/the-complete-claude-md-guide/) — what to put in CLAUDE.md, what to leave out, the security pattern
- [Pinterest's production MCP architecture](/2026/04/pinterests-mcp-blueprint-66000-invocations-a-month-7000-hours-saved-this-is-what-production-mcp-looks-like/) — the registry, JWT auth, ROI numbers
- [Scaling Claude Code Skills](/2026/04/scaling-claude-code-skills-across-an-engineering-org/) — the team-level pattern
- [The CLAUDE.md trap](/2026/04/the-claude.md-trap-how-a-new-supply-chain-attack-targets-agentic-developers/) — CVE-2026-21852, why provenance matters
The three layers are a feature, not a confusion. Skills give the agent judgment, MCP gives it capability, plugins give you distribution. Pick the smallest layer that does the job, and stop reaching for a heavier one until the lighter one runs out.
---
# OpenAI Just Bought Python's Toolchain. That's a Problem.
URL: https://sdd.sh/2026/05/openai-just-bought-pythons-toolchain.-thats-a-problem./
Date: 2026-05-07
Updated: 2026-05-07
Tags: OpenAI, Python, Open Source, Codex, Governance, Industry
Categories: Industry, AI Tools
Summary: OpenAI's March 2026 acquisition of Astral — makers of uv, Ruff, and ty — hands one AI lab control over Python's most critical developer infrastructure. The tools stay open source, for now. The governance question is wide open.
On March 19, 2026, OpenAI announced it would acquire Astral, the startup behind uv, Ruff, and ty. The deal is subject to regulatory approval. Founder Charlie Marsh and the entire Astral team will join OpenAI's Codex engineering group.
This is worth sitting with for a moment. Three tools that together handle Python dependency management, linting, formatting, and type checking — all Rust-based, blazingly fast, downloaded hundreds of millions of times per month — now belong to one AI company. Not a foundation. Not a BDFL arrangement. One company with a strong commercial interest in controlling the Python developer experience.
The tools will stay open source. Both sides said so. That's the right thing to say, and it may even be true. But "open source" and "governed by the ecosystem" are not the same thing, and the difference will matter.
## What Astral Built
Understanding why this acquisition is significant requires understanding what Astral actually shipped.
**uv** is a Python package manager and virtual environment tool written in Rust that replaces `pip`, `pip-tools`, and `virtualenv`. It's 10–100x faster than pip for cold installs. Since its February 2024 launch, uv crossed 126 million downloads per month — one of the fastest adoption curves in Python tooling history.
**Ruff** is a Python linter and formatter that replaces flake8, black, and isort in a single binary. Also Rust-based. Also dramatically faster. It's now the default formatter in a significant fraction of serious Python projects.
**ty** is Astral's Python type checker, competing with mypy and pyright. Still maturing, but already gaining traction in codebases that want a faster alternative.
The three tools together cover the full Python developer workflow short of writing the code itself. In an AI-first development context, that last part is no longer a joke: when an AI agent produces Python code, it then runs uv to install dependencies, Ruff to format and lint, and ty to verify types. That pipeline now belongs to OpenAI.
## OpenAI's Strategic Calculus
The acquisition isn't mysterious. OpenAI's Codex platform had reached 2 million weekly active users by March 2026 and was growing 3x year-over-year. Every Codex session that resolves Python dependencies is a session that touches pip — and pip is slow. Each session where uv handles that instead saves roughly 30 seconds on dependency resolution. At 2 million weekly sessions, that's 1 million minutes of compute per week that doesn't have to run on OpenAI's servers. The infrastructure savings alone may justify the acquisition cost within a year.
Beyond compute efficiency, owning the toolchain creates a flywheel. Codex can integrate uv and Ruff natively — automatically formatting agent output, instantly resolving dependencies, running type checks before the user ever sees the code. That's a real product advantage, and it's one no competing agent can replicate without OpenAI's cooperation.
Both OpenAI and Astral's announcement framed this in neutral terms: the tools will remain open source, the community will continue to benefit, and integration with Codex is an additive improvement. That's probably all true in the short term.
## The Pattern Worth Watching
The concern isn't that OpenAI will immediately make uv or Ruff closed-source. That would be commercially stupid — it would shatter developer trust, trigger forks, and destroy the adoption curve that made the acquisition valuable in the first place.
The concern is subtler. JetBrains, whose PyCharm integrates all three tools, named it plainly: if Astral's engineers get reassigned to OpenAI's commercial priorities, the tools could stagnate. The Rust-based internals are maintainable by a motivated community, but the architectural decisions and roadmap have until now lived with a small, focused team. Disperse that team into a 2,000-person company and the independent velocity disappears.
Simon Willison, who has tracked Python tooling governance closely, identified the deeper pattern: AI companies have a strong incentive to own the infrastructure layer their agents run on. Not to close it — closing it is the wrong move — but to optimize it for their own use cases first, and the ecosystem second.
There's also the question of competitive tooling access. Right now, Claude Code and Cursor both shell out to uv and Ruff exactly as Codex does. OpenAI made commitments to continued open development. But what happens in three years when uv ships a "native Codex integration mode" with better performance? What happens when Ruff's default configuration starts to favor code patterns that Codex generates? These aren't paranoid hypotheticals — they're the natural result of commercial incentives over time.
## The Governance Gap
This is the crux. The tools are MIT and Apache 2.0 licensed — permissive licenses that guarantee forkability. But licenses are a floor, not a governance structure.
Compare with how Anthropic handled the Model Context Protocol. When MCP reached production scale, Anthropic donated governance to the Linux Foundation. The protocol's roadmap, security decisions, and compatibility standards are now governed by a multi-stakeholder body. No single AI lab can steer MCP toward its own ecosystem in ways that disadvantage others.
uv, Ruff, and ty have no equivalent structure. They're Apache 2.0 / MIT, which means you can fork them. But the canonical implementation, the authoritative issue tracker, the release pipeline, the ecosystem relationships — all of that now lives inside OpenAI.
The Python Software Foundation has not announced any governance role. The Linux Foundation was not part of the deal. The community retains the right to fork, but not the institutional weight to govern.
## What This Means for Developers
In the short term: nothing changes. uv is still the fastest way to manage Python dependencies. Ruff is still the best linter. You should still use them.
Medium term: watch the roadmap. Specifically, watch for:
- Features that optimize for Codex-generated code patterns over general Python
- Performance improvements that require Codex-specific context or telemetry
- Default configuration changes in Ruff that drift toward OpenAI's preferred code style
- uv integrations that surface Codex capabilities natively
None of these would technically break the open-source commitment. All of them would represent a slow drift from "neutral infrastructure" to "OpenAI infrastructure that happens to work for everyone else."
Long term: the ecosystem needs a governance answer. Either a foundation takes stewardship of these tools, or serious Python infrastructure alternatives emerge that aren't owned by an AI lab with a conflicting commercial interest. The forkable licenses make this possible. The question is whether the community will act before it needs to.
## The Anthropic Comparison
It's worth noting what Anthropic has done differently with its own infrastructure contributions. MCP governance went to the Linux Foundation before it reached the scale of adoption that would have made governance contentious. The A2A protocol (Google's agent communication standard) followed the same path to Linux Foundation governance. Donating governance while tools are still growing is how you build ecosystem trust — it signals that the goal is the standard, not the lock-in.
OpenAI's Astral acquisition runs the opposite direction: acquiring external infrastructure that the community built independently and trusts as neutral. Whether that trust is maintained depends on OpenAI's behavior over years, not on any structural guarantee.
The tools are excellent. Use them. But track who governs them — because the infrastructure your AI agents run on is not a detail.
---
*Sources: [OpenAI acquisition announcement](https://openai.com/index/openai-to-acquire-astral/) · [Astral blog](https://astral.sh/blog/openai) · [Simon Willison analysis](https://simonwillison.net/2026/mar/19/openai-acquiring-astral/) · [JetBrains response](https://blog.jetbrains.com/pycharm/2026/03/openai-acquires-astral-what-it-means-for-pycharm-users/) · [The Register](https://www.theregister.com/2026/03/19/openai_aims_for_the_stars/) · [The New Stack](https://thenewstack.io/openai-astral-acquisition/)*
---
# Claude Code v2.1.126: Gateway Model Discovery, Project Purge, and Smarter Auth
URL: https://sdd.sh/2026/05/claude-code-v2.1.126-gateway-model-discovery-project-purge-and-smarter-auth/
Date: 2026-05-07
Updated: 2026-05-07
Tags: Claude Code, Anthropic, Changelog, MCP, Enterprise, Developer Tools
Categories: AI Tools, Guides
Summary: Version 2.1.126 ships four practical upgrades: a /model picker that reads your gateway's model list, a project purge command for clean state management, OAuth paste mode for headless environments, and an Auto mode that tells you when it's stuck.
Incremental releases don't get headlines. But v2.1.126 — shipping May 1, 2026 — bundles four quality-of-life changes that will matter to anyone running Claude Code in a non-trivial environment: a gateway-aware `/model` picker, a clean-slate `project purge` command, OAuth paste mode for SSH/WSL users, and a more honest Auto mode spinner. None of these are flashy. All of them fix real friction.
## /model Now Reads Your Gateway
The biggest feature is also the most enterprise-relevant. When `ANTHROPIC_BASE_URL` points at an Anthropic-compatible gateway — LiteLLM, a corporate proxy, a private deployment — the `/model` picker now queries `{base_url}/v1/models` at startup and populates the list with whatever models your gateway exposes, labeled **"From gateway"**.
Before this, running Claude Code behind a custom gateway meant manually specifying model names and hoping the string matched what the proxy expected. Now the picker is self-describing: Claude Code asks the gateway what it has, and shows you exactly that.
One important follow-up: in versions v2.1.126 through v2.1.128 this discovery was automatic. In subsequent releases it became opt-in via `CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1`. If you upgraded past 2.1.128 and your gateway models disappeared from the picker, that's why — add the env var.
This matters because the gateway model is increasingly how serious enterprise teams run Claude Code. IT controls the proxy, audit logs flow through a single choke point, and engineers get whatever models have been approved. Until now the UX was manual and error-prone; now it's as smooth as the direct API path.
## `claude project purge`: Delete Project State Without Deleting Code
Claude Code accumulates state per project: transcripts, task lists, file edit history, config entries. Useful in aggregate, but occasionally you want a clean slate — after a major refactor, before handing a project to a colleague, or when a stale task list is confusing the agent.
v2.1.126 adds a dedicated command for this:
```bash
claude project purge [path]
```
Options:
- `--dry-run` — shows exactly what would be deleted without touching anything
- `-y` / `--yes` — skips confirmation prompts
- `-i` / `--interactive` — lets you cherry-pick which state to delete
- `--all` — purges state for every project, not just the current one
Before this command existed, cleaning project state meant hunting through `~/.claude/` and deleting the right directories manually — the kind of thing that worked but wasn't documented anywhere. Having a first-class command with a dry-run option is the right answer.
The use case I'd reach for immediately: before onboarding another engineer to a project, run `claude project purge --dry-run` to audit what's accumulated, then purge the transcript history while leaving CLAUDE.md and skill files intact.
## OAuth Paste Mode for Headless Environments
`claude auth login` now has a fallback for when it can't complete the standard OAuth redirect flow: it prints the authorization URL, lets you open it in a browser wherever you have one, and accepts the resulting OAuth code pasted directly into the terminal.
This is a small change with a large surface area. The people it helps:
- **WSL2 users** — the browser lives on Windows, the terminal lives in Linux; the localhost callback doesn't cross the boundary
- **Remote SSH sessions** — port forwarding is a workaround but it's tedious and often blocked
- **Containers** — no browser, full stop
- **CI pipelines doing initial auth setup** — paste mode turns an interactive GUI flow into something scriptable
The previous workaround was typically device flow, but that required a separate code path and wasn't always documented clearly. Paste mode integrates directly into `claude auth login` — same command, graceful fallback.
## Auto Mode: A Spinner That Tells the Truth
A small but meaningful UX fix in Auto mode: when a permission check stalls — the agent wants to run a tool but is waiting for approval — the spinner now turns **red** instead of continuing to look like active work.
This might sound cosmetic. It isn't. In Auto mode, a frozen spinner previously looked identical whether Claude was thinking hard or waiting for a human to respond to a permission prompt. Users would watch it for 30 seconds before realizing the agent was blocked and needed their input. The red color breaks that ambiguity: active work is normal color, waiting for you is red.
It's the kind of detail that takes five minutes to implement and saves minutes per interrupted session, multiplied across every Auto mode user.
## OpenTelemetry: Skill Invocation Tracking
For teams running Claude Code with enterprise observability, v2.1.126 extends the `claude_code.skill_activated` OpenTelemetry event with a new `invocation_trigger` attribute:
| Value | Meaning |
|---|---|
| `"user-slash"` | User typed a `/skill` command |
| `"claude-proactive"` | Claude invoked the skill autonomously |
| `"nested-skill"` | A skill was called from within another skill |
This closes a gap in attribution: you can now distinguish between skills used on explicit instruction versus skills Claude decided to invoke on its own. In an enterprise context where skills are tied to cost centers or approval workflows, that distinction matters.
## The Pattern
v2.1.126 doesn't change what Claude Code can do — it changes how predictable and portable it is. Gateway model discovery makes enterprise deployments first-class. Project purge makes state management explicit. OAuth paste mode removes a real blocker for headless environments. The Auto mode spinner makes the agent's internal state legible.
None of these features exist in IDE-embedded AI tools, because none of those tools need them. Cursor doesn't have a gateway model picker because it doesn't run in containerized CI. Copilot doesn't have OAuth paste mode because it doesn't run on remote servers. The operational surface area of a terminal-native agent is just different — and the Claude Code team keeps shipping to that surface.
---
*Sources: [Claude Code changelog](https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md) · [Release v2.1.126](https://github.com/anthropics/claude-code/releases/tag/v2.1.126) · [Claude Code docs](https://code.claude.com/docs/en/changelog) · [LiteLLM gateway discovery issue](https://github.com/BerriAI/litellm/issues/27180)*
---
# Claude Code FAQ: Everything You'd Actually Ask
URL: https://sdd.sh/2026/05/claude-code-faq/
Date: 2026-05-07
Updated: 2026-05-07
Tags: Claude Code, FAQ, Anthropic, agentic coding, Guides
Categories: AI Tools, Guides
Summary: Direct answers to the questions developers actually ask about Claude Code in 2026 — plans, models, installation, CLAUDE.md, auto mode, MCP, parallel agents, cost, enterprise, security, and how it stacks up against Cursor, Copilot CLI, and Gemini CLI.
This is the question-and-answer compendium for [Claude Code](/claude-code/) — Anthropic's terminal-native agentic coding tool. Written from a year of using it, watching it, and writing about it. Updated for the May 2026 state of the world: Claude Opus 4.7 as the default model, Claude Code on Max plans only (Pro removal April 22), Routines and Code Review GA, Bedrock + Mantle for enterprise.
If a question you have isn't here, [open an issue on GitHub](https://github.com/fclairamb/sddsh/issues) and we'll add it.
## Basics
### What is Claude Code?
Claude Code is an agentic coding tool built by Anthropic that runs in your terminal. You describe what you want; it plans the work, edits files, runs shell commands, executes tests, fixes failures, and iterates until the task is done — with as much or as little human supervision as you choose.
The architectural distinction is "terminal-native, not editor-embedded." Cursor, Copilot, and Windsurf live inside an IDE and ask for approval at every step. Claude Code lives in your shell and runs autonomously when you let it. That difference is the whole product. See the [complete deep dive](/2026/04/claude-code-in-2026-the-complete-deep-dive/) for the long version.
### How is Claude Code different from Cursor, Copilot, or Windsurf?
The other tools are AI-assisted coding inside an editor. Claude Code is AI-autonomous coding from a terminal. The editor-embedded tools optimize for "the human is in every loop, the AI helps." Claude Code optimizes for "the AI does the loop, the human reviews the result."
Concretely: you can hand Claude Code a multi-hour refactor, walk away, and come back to a finished branch with passing tests and a commit log. You cannot do that with Cursor or Copilot — their architecture assumes you stay in the seat. See [Cursor vs Copilot vs Claude Code vs Windsurf](/2026/03/cursor-vs-copilot-vs-claude-code-vs-windsurf-2026/) for the full comparison and [GitHub Copilot CLI Goes GA](/2026/04/github-copilot-cli-goes-ga-microsoft-just-admitted-claude-code-was-right/) for why Microsoft eventually agreed.
### Do I need a Pro or Max subscription?
As of April 22, 2026, Claude Code requires a **Max plan** for new users. The $20 Pro plan was tested in an A/B that's now permanent for new signups; existing Pro subscribers retained access. Max comes in two tiers — **Max 5x** ($100/month) and **Max 20x** ($200/month) — referring to usage allowance multipliers vs the legacy Pro tier.
Why the change: Claude Code's compute costs were unsustainable at $20. The Max tier is the actual home for serious agentic coding. Background and analysis: [Anthropic Tests Pulling Claude Code from Pro](/2026/04/anthropic-tests-pulling-claude-code-from-pro-and-gets-an-instant-lesson-in-developer-trust/).
API users (pay-as-you-go) and enterprise (Teams, Bedrock, Vertex) have separate billing paths and are unaffected.
### What's the difference between Claude Code, the Claude API, and Claude.ai?
Three distinct products:
- **Claude.ai** is the chat interface at claude.ai. Conversational, web-based, no code execution.
- **Claude API** (developer platform) is the raw model API. You build your own application around it.
- **Claude Code** is a CLI agent that wraps the API with a terminal interface, file/shell tools, project context (CLAUDE.md), MCP integration, agent teams, scheduling (Routines), and a managed loop. It runs locally; the model lives in Anthropic's cloud.
You don't pick between them — Claude Code subscribers also get full Claude.ai access on the same plan.
## Getting started
### How do I install Claude Code?
```bash
npm install -g @anthropic-ai/claude-code
```
Then `claude auth login` opens a browser for OAuth. On a headless machine (WSL2, SSH, container), use `claude auth login` with paste mode (added in v2.1.126). Run `claude` in any project directory to start a session.
### What is CLAUDE.md and why does it matter?
`CLAUDE.md` is a project-root markdown file that Claude Code reads at the start of every session. It's where you tell the agent the things that *aren't* in the code: build commands, naming conventions, where tests live, what frameworks you're using, what NOT to do.
A well-written CLAUDE.md cuts a 30-minute onboarding-by-Q&A session down to thirty seconds. For shared codebases, it's the single highest-leverage artifact you can author. The trade-off: if you commit secrets or sensitive paths there, they're exposed to anyone running the agent. See [The CLAUDE.md Trap](/2026/04/the-claude.md-trap-how-a-new-supply-chain-attack-targets-agentic-developers/) (CVE-2026-21852) for what can go wrong.
### Should I use Sonnet or Opus?
Default to **Opus 4.7** (the `opus` API alias since April 23, 2026). It scores 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro — the highest of any model — and ships with one-third the tool-call errors of Opus 4.6. Over a 25-step agentic loop, one-third the per-step error rate compounds to a dramatic difference in completion rate.
Use **Sonnet** when latency matters more than depth (rapid file edits, lots of small back-and-forth, simpler refactors) or when you're rate-limited on Opus. Both are on the same plan; switch with `/model`. Details: [Claude Opus 4.7 release](/2026/04/claude-opus-4.7-87.6-swe-bench-implicit-need-tests-same-price/).
### What's the right starting workflow for a new project?
1. Write a CLAUDE.md with build commands, conventions, and "do not" rules.
2. Write a spec — what you're building, the contract, what success looks like. See [The Spec File as Source of Truth](/2026/05/the-spec-file-as-source-of-truth-how-to-write-specs-that-ai-can-actually-implement/).
3. Run `claude` and hand over the spec.
4. Use `/ultraplan` (or `/plan`) for non-trivial tasks. Review the plan before letting the agent execute.
5. Let it run with `auto mode` for execution if you trust the plan; otherwise use the default approval mode.
6. Review the diff, run the tests, commit.
Skip steps 1-2 only for one-off scripts. Skip them on real projects and you'll spend the saved time correcting drift.
## Core features
### What is "auto mode" and when should I use it?
Auto mode (formerly `--enable-auto-mode`, now the default for Max plans) lets Claude Code execute shell commands, file edits, and tool calls without prompting for approval at every step. The safety layer still screens for prompt injection and dangerous operations.
Use it when: the spec is clear, the work is contained (a worktree, a branch, a sandbox), and the failure mode of a wrong action is "I'll review and revert." Don't use it when: you're operating on production state, your repo lacks branch protection, or you haven't pinned the agent's working directory. See [Claude Code Auto Mode](/2026/03/claude-code-auto-mode-anthropic-hands-ai-more-control-but-keeps-it-on-a-leash/).
### What are skills, plugins, and MCP servers?
Three layers of extension:
- **Skills** are reusable instructions — markdown files in `~/.claude/skills/` or your project — that teach Claude Code how to do specific tasks (deploy a service, run a particular test framework, query your DB). Sharing skills across an org is how teams scale agentic workflows. See [Scaling Claude Code Skills](/2026/04/scaling-claude-code-skills-across-an-engineering-org/).
- **Plugins** package skills, MCP servers, and slash commands as installable units (`/plugin install `).
- **MCP servers** are external processes that expose tools to the agent over the [Model Context Protocol](/mcp/). Anthropic-built servers cover GitHub, filesystem, Slack, Linear; the [Pinterest blueprint](/2026/04/pinterests-mcp-blueprint-66000-invocations-a-month-7000-hours-saved-this-is-what-production-mcp-looks-like/) shows what production MCP looks like at scale.
### What is `/ultrareview` and how is it different from `/review`?
`/ultrareview` runs a fleet of reviewer agents in a remote sandbox to find bugs in your branch or PR. Five parallel agents look at architecture, logic, security, performance, and maintainability separately and merge findings. Pro and Max subscribers get three free runs per billing cycle; additional runs are billed.
`/review` is a local single-agent review — fast, free, less thorough. Use `/review` for routine work, `/ultrareview` before big merges. Background: [Claude Code April 2026 power-user features](/2026/04/claude-code-april-2026-ultrareview-auto-mode-power-user-features/).
A separate "Code Review" feature (GA at Code with Claude SF, May 6) integrates with GitHub PRs and runs automatically on push — billed per PR ($15-25). Different product, same lineage.
### What are Routines? Can I run Claude Code without my computer on?
Yes. **Routines** (research preview, April 14; expanded since) let you schedule Claude Code agents to run on Anthropic's cloud — on a cron schedule, via API trigger, or in response to GitHub events. Your laptop can be off; the agent runs, edits, commits, and pushes from Anthropic's infrastructure.
Use cases: nightly dependency updates, weekly security scans, "fix CI failures" on PR creation, scheduled documentation regeneration. See [Claude Code Routines](/2026/04/claude-code-routines-the-ai-cron-job-that-actually-understands-your-codebase/).
### Can I run multiple Claude Code agents in parallel?
Yes — three different patterns:
1. **Multiple sessions** in the desktop app's sidebar (since the April 14 redesign), each in its own git worktree.
2. **Agent Teams** — a coordinator + specialized sub-agents within a single session, talking via mailbox architecture.
3. **Routines + Cloud agents** — many independent tasks running on Anthropic's infrastructure simultaneously.
For a guided tour: [Parallel AI Agents](/2026/04/parallel-ai-agents-the-tools-that-let-you-run-ten-claudes-at-once/). For the architecture story: [The Orchestrator Seat](/2026/04/the-orchestrator-seat-claude-codes-desktop-redesign-makes-parallel-agents-native/).
## Cost and billing
### How does Claude Code billing work?
Two paths:
1. **Subscription (Max 5x or Max 20x)**: flat monthly fee, capped usage. Hits the cap, you're throttled until reset. Best for individual developers and small teams.
2. **API / pay-as-you-go**: token-priced (currently $5/$25 per million input/output tokens for Opus 4.7). No cap, you pay what you use. Best for variable workloads or teams.
Enterprise (Teams, Bedrock, Vertex) is invoiced separately and includes admin controls, RBAC, and the [Analytics API](/2026/04/claude-code-analytics-api-the-missing-bridge-between-ai-coding-and-enterprise-roi/).
### What's the difference between Max 5x and Max 20x?
Max 5x ($100/month) gives roughly 5× the legacy Pro plan's usage allowance; Max 20x ($200/month) gives 20×. The May 6 SpaceX-Anthropic Colossus deal doubled the five-hour rate limits across all tiers and removed the peak-hours reduction.
Pick based on actual session counts: a developer running a couple of long-form agentic sessions a day is fine on Max 5x; someone running parallel Routines and `/ultrareview` regularly should consider Max 20x. The [Analytics API](/2026/04/claude-code-analytics-api-the-missing-bridge-between-ai-coding-and-enterprise-roi/) gives precise per-developer numbers if you're choosing for a team.
### How do I track usage and cost?
`/usage` in Claude Code shows your current session's token cost, plan consumption, and remaining cap. For organization-level visibility, the **Analytics API** (Admin API key required) returns per-user, per-day metrics on commits, PRs, lines of code, sessions, tool acceptance rates, and token costs. Pipe it into BI / OpenTelemetry / SIEM. Setup details: [Claude Code Analytics API](/2026/04/claude-code-analytics-api-the-missing-bridge-between-ai-coding-and-enterprise-roi/).
## Enterprise
### Can I run Claude Code on AWS Bedrock or Azure / GCP?
Yes on AWS via **Bedrock** (GA, v2.1.94). The Mantle backend gives zero-operator-access — Anthropic engineers cannot reach the inference layer, which is the enterprise air-gap story compliance teams want. Setup is interactive (`claude --setup-bedrock`). See [Claude Code on Bedrock with Mantle](/2026/04/claude-code-on-bedrock-with-mantle-the-enterprise-air-gap-story/).
Vertex AI (GCP) and Azure Foundry support exist via the broader Claude API; native Claude Code interactive setup for those two is in progress as of May 2026.
### Does Anthropic train on my code?
For paid plans (Pro, Max, Teams, Enterprise, API): **no, by default**. Anthropic doesn't train on customer data unless you opt in via feedback flows. This is the explicit policy, contractually backed for Teams/Enterprise.
Compare with GitHub Copilot's [April 24 default opt-in](/2026/04/github-copilots-april-24-data-grab-what-youre-agreeing-to-and-how-to-opt-out/) for Free/Pro/Pro+ users.
### What's the Analytics API?
A REST API on the Admin API key that exposes per-user, per-day rollups of Claude Code activity: commits authored, PRs opened, lines added/removed, session counts, tool-call acceptance rates, token spend. Designed for the "prove the ROI" conversation. Integrates with OpenTelemetry, SIEM, and standard BI tools. [Full coverage here](/2026/04/claude-code-analytics-api-the-missing-bridge-between-ai-coding-and-enterprise-roi/).
## Trust and security
### Is `--dangerously-skip-permissions` safe?
It's safe in the same sense that `rm -rf` is safe: it does what you asked, very fast, with no second look. The flag bypasses the per-tool approval prompts that auto mode usually injects. Use it inside disposable sandboxes (a worktree, a container, a fresh VM, a dedicated repo) where the cost of the worst-case action is "throw away the sandbox."
Don't use it on production checkouts, in directories with secrets, or on shared infrastructure. Anthropic's safety layer still runs — prompt-injection screening, dangerous-command detection — but the human approval gate is gone.
### What was the CLAUDE.md trap (CVE-2026-21852)?
A patched supply-chain vulnerability where a malicious project config could escalate the agent's permissions silently — bypassing user-defined deny rules and exfiltrating credentials. Fixed in v2.1.90. The lesson: a `CLAUDE.md` cloned from a stranger's repo is executable trust. Treat it like you treat `npm install` from an unverified package. Full breakdown: [The CLAUDE.md Trap](/2026/04/the-claude.md-trap-how-a-new-supply-chain-attack-targets-agentic-developers/).
### How do I handle secrets safely with Claude Code?
Three rules:
1. Never put secrets in `CLAUDE.md`. The file lives in the repo and ends up in the agent's prompt context.
2. Use environment variables and reference them by name in CLAUDE.md (`use $DATABASE_URL`), not by value.
3. Add deny rules for paths the agent shouldn't read: `.env`, `secrets/`, `credentials.json`, `~/.aws/`. Configure in `~/.claude/settings.json` under `permissions.deny`.
If you're on enterprise, Mantle's zero-operator-access architecture handles the upstream half — your secrets never leave your AWS account.
## Comparisons
### Claude Code vs Cursor — which is better?
For autonomous, long-horizon agentic work: Claude Code, by a wide margin. Cursor's editor-embedded architecture has a ceiling: it assumes the human is in the loop, which is a feature for some workflows and a fundamental constraint for others. The [autonomy ceiling analysis](/2026/04/cursor-is-worth-50-billion.-its-biggest-problem-is-that-it-still-needs-you./) and [Cursor SDK launch](/2026/04/cursor-sdk-the-ide-escapes-the-ide-but-does-it-break-the-ceiling/) cover why.
For inline, real-time coding inside an IDE — Cursor is genuinely excellent and the better choice if you stay in the seat.
The honest answer for many teams is "use both, in different layers." See [The Three-Layer AI Coding Stack](/2026/04/the-three-layer-ai-coding-stack-that-nobody-planned-but-everyone-is-building/).
### Claude Code vs GitHub Copilot CLI?
GitHub Copilot CLI [reached GA in February 2026](/2026/04/github-copilot-cli-goes-ga-microsoft-just-admitted-claude-code-was-right/) with autopilot mode and multi-model support. It is, materially, GitHub adopting the architectural model Anthropic pioneered.
Where Copilot CLI wins: tight GitHub-platform integration (PRs, Actions, Issues) and the bundled cost story for teams already paying for Copilot. Where Claude Code wins: more mature agent loop, deeper safety/context-management research, the broader MCP and Routines ecosystem, and Opus 4.7's lead on the harder benchmarks.
If you're heavily on GitHub and using Copilot anyway, Copilot CLI is now a credible second tool. For autonomy depth, Claude Code is still the lead.
### Is the free Gemini CLI from Google enough?
For low-volume, latency-tolerant work — yes, surprisingly often. Gemini 3.1 Pro reaches 80.6% on SWE-bench Verified (within 1 point of Opus 4.6) and Google gives you 1,000 free requests per day. See [Gemini CLI honest assessment](/2026/05/gemini-cli-googles-free-terminal-ai-agent-and-what-it-actually-gets-right/).
Where it falls short: 50% slower task completion in head-to-head tests, no equivalent of CLAUDE.md project memory, no Routines, much smaller MCP ecosystem, no Agent Teams. For serious work, the speed and ecosystem gaps add up.
### Can I use Claude Code with Cursor or OpenAI Codex?
Yes — and a growing number of developers do. The pattern is composition: Cursor for orchestration and inline edits, Claude Code for execution of larger tasks, Codex (or `/ultrareview`) as a review layer. The `codex-plugin-cc` plugin makes this concrete — Codex reviews diffs that Claude Code produced. See [The Three-Layer AI Coding Stack](/2026/04/the-three-layer-ai-coding-stack-that-nobody-planned-but-everyone-is-building/).
The world isn't consolidating into one winner. It's stratifying into composable layers, and the best teams are using whichever tool wins each layer.
---
*Have a question that should be here? Open an issue at [github.com/fclairamb/sddsh](https://github.com/fclairamb/sddsh).*
---
# Gemini CLI: Google's Free Terminal AI Agent, and What It Actually Gets Right
URL: https://sdd.sh/2026/05/gemini-cli-googles-free-terminal-ai-agent-and-what-it-actually-gets-right/
Date: 2026-05-06
Updated: 2026-05-06
Tags: gemini, google, terminal-agent, open-source, claude-code, comparison, gemini-cli
Categories: AI Tools, Industry
Summary: Google shipped Gemini CLI in April 2026 — a free, open-source terminal AI agent with 1,000 requests/day on Gemini 2.5 Pro. It's more capable than the price suggests. Here's an honest assessment of what it nails, where it falls short, and what Google's move tells us about the future of AI coding infrastructure.
Roughly two years after Claude Code proved that the terminal, not the IDE, is the right home for a serious AI coding agent, Google agreed.
In April 2026, Google shipped **Gemini CLI**: a free, open-source terminal AI agent powered by Gemini 2.5 Pro, with a 1M-token context window, MCP integration, and 1,000 requests per day at no charge. No credit card. No API key required to start. Just a Google account and a `gemini` command in your shell.
The first question developers asked was whether this would dethrone Claude Code. The honest answer: no. But the more interesting answer is that Gemini CLI is a serious, well-built tool that validates the terminal-native model and raises the floor for the entire market — and understanding what it gets right (and wrong) matters if you're making infrastructure decisions in 2026.
## What Gemini CLI Actually Is
Gemini CLI is fully open-source (Apache 2.0), available on [GitHub](https://github.com/google-gemini/gemini-cli), and ships Gemini 3.1 Pro as the default model as of v0.38.2. It runs in your terminal, reads and modifies files, executes shell commands, and handles multi-step coding tasks with natural language input.
The free tier is genuinely generous: 60 requests per minute and 1,000 requests per day with a personal Google account. That's enough for a full day of active development work without touching a credit card. For developers who want to experiment with agentic coding but can't justify $20/month for Claude Code's Max plan, this is a real option.
Built-in capabilities include:
- **Google Search grounding** — the agent can search the web mid-task, a feature Claude Code doesn't offer natively
- **File operations** — read, write, create, delete across your project
- **Shell command execution** — run tests, build tools, linters
- **MCP support** — extensible via the same MCP servers the rest of the ecosystem uses
- **1M token context window** — the same window Claude uses since its GA in March 2026
The architecture mimics Claude Code's core loop: parse the task, plan a series of steps, execute each one with tool calls, confirm critical changes, iterate. Google's decision to build this as a terminal-native, file-system-based agent rather than another IDE extension is an explicit acknowledgment that the agentic model requires a different paradigm.
## The Benchmarks: Nearly Identical, With a Telling Gap
On SWE-bench Verified — the standard measure for autonomous coding capability — **Gemini 3.1 Pro scores 80.6%**, essentially tied with Claude Code on Opus 4.6 at 80.8%. If you're choosing a model based on that benchmark, the two are interchangeable.
But benchmarks measure completions, not workflows. When you look at real-task timing, the picture shifts. Independent testing shows Claude Code completing a representative to-do application from scratch in an average of **1 minute 44 seconds**, compared to Gemini CLI's **2 minutes 36 seconds** — a 50% gap on the same task. On complex multi-file refactors, reviewers consistently flag that Claude Code produces cleaner, more idiomatic output with fewer manual corrections needed.
SWE-bench parity obscures quality differences that compound over real project work.
## What Gemini CLI Gets Right
**The free tier is a genuine win for accessibility.** Not every developer is at a company paying for Max subscriptions. Students, indie hackers, developers in markets with purchasing power limitations — Gemini CLI gives them access to frontier-tier agentic coding for free. This matters for ecosystem growth, and Google knows it.
**Google Search grounding is a legitimate differentiator.** When a task requires checking current library documentation, finding an obscure API, or verifying whether a dependency has a known CVE, Gemini CLI can search and reason in the same pass. Claude Code requires a separate MCP server for web access. This is a real friction difference on research-heavy tasks.
**Open-source with Apache 2.0 is the right call.** The code is auditable, forkable, and deployable in air-gapped environments without policy review. For security-conscious enterprises that can't route code through an Anthropic endpoint, this matters.
**MCP compatibility means the ecosystem works.** Gemini CLI connects to the same MCP servers as Claude Code, so teams with existing MCP investments don't have to choose.
## Where It Falls Short
**No equivalent to project-level memory.** Claude Code's CLAUDE.md files give the agent stable, persistent context about your project: conventions, constraints, architectural decisions. Gemini CLI starts fresh on every session. For one-off tasks this is fine; for ongoing project work it means re-briefing the agent repeatedly.
Today Anthropic also announced persistent memory for Claude Managed Agents — filesystem-backed, API-controlled, auditable across sessions. Gemini CLI has nothing comparable in the pipeline.
**No scheduling or automation infrastructure.** Claude Code Routines lets you trigger agents on a schedule, via API, or on GitHub events — running on Anthropic's infrastructure without your machine. Gemini CLI is interactive-only. You can't set it up to run a nightly test analysis or respond to a PR webhook.
**The MCP ecosystem gap is substantial.** Both tools support MCP, but Claude Code's ecosystem numbers over 6,400 registered servers. Gemini CLI inherits a subset of that, but purpose-built integrations, enterprise connectors, and community tooling are overwhelmingly built for Claude Code first.
**No Agent Teams equivalent.** Claude Code supports multi-agent orchestration with up to 15 parallel agents working on isolated worktrees. Gemini CLI operates as a single agent. For large-scale autonomous workflows, there's no comparison.
**Pricing is free until it isn't.** The 1,000 req/day free tier is real today. Google's track record on free developer tiers — deprecating APIs, shifting pricing, pivoting products — introduces long-term reliability risk that Anthropic's enterprise commitment, $30B ARR, and Amazon's $25B infrastructure investment don't.
## The Hybrid Stack Worth Considering
The comparison that actually makes sense in practice isn't "Gemini CLI or Claude Code" — it's knowing when each tool fits.
Gemini CLI is fast, free, and excellent for **exploration and boilerplate**. When you want to scaffold a new project, ask a quick question that requires web search, or prototype something before committing to a direction, the free tier is a rational choice.
Claude Code is the better tool for **serious agentic work**: production implementations, multi-file refactors, long-horizon tasks, anything where you want CLAUDE.md invariants enforced, Routines automation, or Managed Agents infrastructure backing the session.
A sensible 2026 stack for an indie developer: Gemini CLI as a daily-use sandbox, Claude Code for the work that ships.
## What Google's Move Confirms
The most significant thing about Gemini CLI isn't the benchmarks or the free tier. It's that **Google built a terminal-native agent** rather than another IDE extension.
In 2025, the dominant narrative was that developers wanted AI embedded in their editor. Copilot, Cursor, Windsurf — they all built to that thesis. Google's decision to build a Claude Code-style terminal agent instead is a direct acknowledgment that the terminal-native, agentic model is the architecture worth competing on.
That's a concession worth noting. The companies shipping IDE wrappers keep telling developers that human-in-the-loop is the right paradigm. Google just shipped infrastructure that bets against that.
Gemini CLI doesn't displace Claude Code for teams doing serious agentic engineering. But it raises the baseline, validates the model, and gives Anthropic a real competitor to push against — which usually makes both products better.
---
**Sources:**
- [Gemini CLI — Google GitHub](https://github.com/google-gemini/gemini-cli)
- [Gemini CLI vs. Claude Code — DataCamp](https://www.datacamp.com/blog/gemini-cli-vs-claude-code)
- [Gemini CLI vs Claude Code — Emergent.sh](https://emergent.sh/learn/gemini-cli-vs-claude-code)
- [Gemini CLI Quotas and Pricing](https://geminicli.com/docs/resources/quota-and-pricing/)
- [Google announces Gemini CLI](https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/)
- [Claude Code vs Codex CLI vs Gemini CLI — CodeAnt](https://www.codeant.ai/blogs/claude-code-cli-vs-codex-cli-vs-gemini-cli-best-ai-cli-tool-for-developers-in-2025)
---
# Claude Managed Agents Just Got Memory: Persistent, Auditable Cross-Session Learning for Enterprise Agents
URL: https://sdd.sh/2026/05/claude-managed-agents-just-got-memory-persistent-auditable-cross-session-learning-for-enterprise-agents/
Date: 2026-05-06
Updated: 2026-05-06
Tags: claude, managed-agents, memory, enterprise, agentic-workflows, anthropic
Categories: Agentic Workflows, AI Tools
Summary: Anthropic shipped persistent memory for Claude Managed Agents at today's Code with Claude SF conference. Here's how the filesystem-based architecture works, why it matters for long-running enterprise agents, and what it means for teams building serious production systems.
Every serious long-running agent hits the same wall: it can do extraordinary things inside a single session, but the moment you close the conversation, it forgets everything. Your preferences, your project context, the decisions it made last week — gone. You brief it from scratch every time.
Today at the **Code with Claude San Francisco developer conference**, Anthropic shipped the answer: **persistent memory for Claude Managed Agents, now in public beta**.
This isn't a memory *wrapper* bolted on top of the API. The architecture is designed from the ground up around how Claude already works — and the design choices reveal something important about where Anthropic thinks serious agentic infrastructure is going.
## How It Works
The core insight is deceptively simple: memory mounts as a **filesystem directory** inside the agent's container.
When you attach a memory store to a session, Claude gets a directory it can read, write, and navigate using the same shell and file tools it uses for everything else. There's no new memory API to learn, no special memory-specific syntax. The agent can `cat` a memory, `grep` across the store, append to a running log, or rewrite a knowledge file the same way it would edit code.
Anthropic describes this as the "intelligence-optimized memory layer." The phrasing matters: instead of forcing developers to choose between vector databases, key-value stores, or prompt stuffing, the filesystem abstraction lets Claude reason about what to remember, when to update it, and how to structure it — naturally, in its own working style.
A note describing each memory mount is automatically injected into the system prompt, so the agent always knows what's available and where to look. No manual orchestration required.
## The Store Architecture
A **memory store** is a workspace-scoped collection of text documents. You can create stores at the organization level or the user level, and attach them to sessions with different permission modes:
- **Organization-wide read-only**: shared knowledge bases, company policies, architecture decisions that no single agent should be able to corrupt
- **Per-user read/write**: individual preferences, personal project context, accumulated learnings from a developer's ongoing work
- **Multi-agent shared stores**: multiple agents working in parallel against the same memory, without clobbering each other
That last point is more significant than it sounds. Today's production agentic architectures — code review pipelines, multi-step research agents, autonomous issue triage systems — routinely involve several agents operating on the same project simultaneously. Until now, giving those agents shared institutional memory without risking corruption required custom synchronization logic that most teams couldn't afford to build. The store model handles that coordination at the platform level.
## API Control and the Audit Trail
Developer-facing control is first-class. Every memory in a store is addressed by a path and is directly readable and writable via the API or the Anthropic Console. You can:
- **Inspect** what an agent learned during any past session
- **Edit** memories directly, the same way you'd edit a file
- **Import and export** stores between environments — useful for copying production learnings to staging, or migrating between workspaces
- **Version** every change: each write creates an immutable memory version, giving you a full audit trail
That last point is non-negotiable for enterprise deployments. When a regulated industry deploys an agent that handles customer interactions or financial decisions, "what did the agent believe, and when did it believe it?" isn't a nice-to-have. Immutable versioning turns memory from a black box into a transparent, auditable record.
## Why Stateless Agents Are Holding Enterprise Back
The memory gap isn't an inconvenience — it's a structural limitation that caps what AI agents can actually do in production.
Consider a code review agent deployed across an engineering org. Without memory, every PR review starts from scratch. The agent can't remember that this team has a preference for a specific error-handling pattern, that a particular contributor tends to skip input validation, or that a recent architectural shift made certain classes of change riskier. It applies generic heuristics instead of org-specific institutional knowledge.
With persistent memory, the same agent builds up context over hundreds of reviews. It learns team conventions from the first week. It accumulates project-specific risk signals. It applies memory of past decisions when it encounters ambiguous tradeoffs. That's not a small incremental improvement — it's the difference between an agent that's useful as a first-pass filter and one that's trusted to make real judgments.
The same dynamic plays out in customer support agents, autonomous QA systems, developer productivity assistants, and any workflow where context compounds across time.
## Compared to CLAUDE.md: Static vs. Dynamic Memory
If you've been using Claude Code with CLAUDE.md files, you already understand project-level memory: a human-written file that tells the agent about conventions, constraints, and context. It works well for stable, project-wide facts.
Managed Agents Memory is the dynamic complement to that static layer:
| | CLAUDE.md | Managed Agents Memory |
|---|---|---|
| **Who writes it** | Human | Agent (and API) |
| **Granularity** | Project-wide | Org, user, or store-level |
| **Updates** | Manual | Per-session, automatically |
| **Audit trail** | Git history | Immutable version chain |
| **Multi-agent** | Single project context | Shared stores, parallel access |
The right production architecture uses both: CLAUDE.md for the invariants a human wants to control, Managed Agents Memory for the context that accumulates organically through the agent's work.
## The Bigger Picture
Today's Code with Claude conference confirmed something that Anthropic has been building toward since the Managed Agents launch in April: **the platform is converging on a complete infrastructure stack for enterprise agentic systems**, not just a model API with some conveniences wrapped around it.
The checklist now includes sandboxed execution, scheduled triggers via Routines, multi-agent orchestration, analytics via the Analytics API, and now persistent cross-session memory with an audit trail. Each piece addresses a specific gap that previously required custom infrastructure.
The memory announcement is also notable for what it reveals about Anthropic's philosophy. Rather than shipping a specialized memory layer that required agents to learn new patterns, they extended the filesystem abstraction the agent already uses. That's a design choice that scales well: as agents become more capable, the primitives get more powerful automatically, rather than requiring a new API version every time an edge case arises.
## Getting Access
Memory on Claude Managed Agents is available today in public beta with no separate access request required — it's included in the standard Managed Agents beta. Full documentation is available at [platform.claude.com/docs/en/managed-agents/memory](https://platform.claude.com/docs/en/managed-agents/memory).
If you're building long-running agents on the Managed Agents platform, this is the upgrade worth integrating first.
---
**Sources:**
- [Built-in memory for Claude Managed Agents — Anthropic](https://claude.com/blog/claude-managed-agents-memory)
- [Using agent memory — Claude API Docs](https://platform.claude.com/docs/en/managed-agents/memory)
- [Claude on X — Memory public beta announcement](https://x.com/claudeai/status/2047421844311949513)
- [Anthropic adds memory to Claude Managed Agents — SD Times](https://sdtimes.com/anthropic/anthropic-adds-memory-to-claude-managed-agents/)
- [Code with Claude SF 2026 — Anthropic](https://claude.com/code-with-claude/san-francisco)
---
# The Spec File as Source of Truth: How to Write Specs That AI Can Actually Implement
URL: https://sdd.sh/2026/05/the-spec-file-as-source-of-truth-how-to-write-specs-that-ai-can-actually-implement/
Date: 2026-05-05
Updated: 2026-05-05
Tags: spec-driven development, SDD, AI coding, Claude Code, agentic workflows, prompt engineering
Categories: Spec-Driven Development, Guides
Summary: Writing specs instead of code is the core premise of SDD — but a bad spec produces bad code just as reliably as a bad prompt does. Here's what separates specs that AI can execute reliably from the ones that waste hours of compute and your afternoon.
You've heard the pitch: in the age of AI coding agents, your job is to write the spec, not the code. Let the agent handle the implementation. Iterate on the output, not the keystrokes.
The pitch is correct. But it skips the hard part: writing a spec that works.
A vague spec produces vague code. An ambiguous spec produces code that's technically correct and functionally wrong. A spec that doesn't define its own boundaries produces an agent that wanders, hallucinates constraints, and confidently ships the wrong thing. Garbage in, garbage out — except now the garbage comes back 400 lines long and passes linting.
This is the skill that actually matters in 2026: writing specs that AI can reliably execute. Here's how to do it.
## What a Spec File Is (and What It Isn't)
A spec file is not a feature request. It's not a Jira ticket dressed up in markdown. It's a complete, self-contained description of a unit of software work — written from the AI's perspective, not the product manager's.
The distinction matters. A feature request tells you *what* the business wants. A spec file tells an agent *exactly* what to build, what constraints to respect, what "done" looks like, and what *not* to touch. An agent that reads a feature request will hallucinate the rest. An agent that reads a tight spec will execute it.
Think of the spec as an interface contract — like a function signature, but for a task. Everything the agent needs to know should be inside it. Nothing it doesn't need should clutter it.
## The Anatomy of a Well-Written Spec
Every spec that reliably produces correct implementations has five sections:
**1. Context** — what already exists that the agent must understand. Not the full history of the project; just the relevant surface. Which files, which modules, which API contracts. What the agent should read before it starts. In Claude Code, this section supplements `CLAUDE.md` with task-specific context.
**2. Goal** — a single, unambiguous statement of what the task produces. One sentence. If you need more than one sentence, you have more than one task.
**3. Constraints** — what must not change. Which files are off-limits. Which interfaces are frozen. Which behavior must be preserved. This is where most spec writers fail: they define what to build and forget to define what to protect.
**4. Acceptance criteria** — a checklist of verifiable conditions. Not "it should feel fast" but "the `/api/search` endpoint must return a response in under 200ms at p95 on a 10,000-record dataset." Each criterion should be testable without human judgment. If a CI pipeline can't check it, it doesn't belong in acceptance criteria — it belongs in a comment.
**5. Implementation notes** (optional) — hints about approach that you know from domain knowledge the agent doesn't have. Algorithm preferences. Third-party library choices. Known edge cases. This section is for *hints*, not instructions. If you're specifying the implementation, you're writing code, not a spec.
## Bad Spec vs. Good Spec: A Concrete Example
**Bad:**
```
Add rate limiting to the API.
```
An agent presented with this will pick an algorithm (token bucket or leaky bucket?), choose where to apply it (per-IP? per-user? per-endpoint?), decide where to store state (in-memory? Redis?), pick a library, and ship something. That something will be coherent and almost certainly wrong, because none of those decisions were yours.
**Good:**
```
## Context
- API is an Express.js app (src/api/)
- Redis instance available via REDIS_URL env var
- Authentication middleware in src/middleware/auth.ts adds req.userId
## Goal
Add per-user rate limiting to all /api/* routes: 100 requests per 15-minute window.
## Constraints
- Do not modify src/middleware/auth.ts
- Do not change existing API response shapes
- Must use the `express-rate-limit` package (already in package.json)
with `rate-limit-redis` store
## Acceptance criteria
- [ ] Requests beyond the limit return HTTP 429 with body: { error: "rate_limit_exceeded" }
- [ ] The X-RateLimit-Remaining header is set on every response
- [ ] Rate limit is per-user (req.userId), not per-IP
- [ ] Existing tests pass without modification
- [ ] New tests cover: under-limit request, limit-hit request, window reset
## Implementation notes
- Set windowMs and max via env vars RATE_LIMIT_WINDOW_MS and RATE_LIMIT_MAX
(defaults: 900000ms, 100 requests) so QA can override in testing
```
The second spec is longer. It takes five extra minutes to write. It eliminates an entire category of rework — the kind where the implementation is technically correct but misses a decision you made in your head and never wrote down.
## Common Spec-Writing Mistakes
**Omitting the constraints section.** This is the most common failure mode. Agents are optimists — if something isn't forbidden, they'll change it. Every file that's off-limits, every interface that's frozen, every behavior that must be preserved: write it down.
**Writing acceptance criteria that require judgment.** "The UI should be responsive" is not a criterion. "All pages must pass Lighthouse mobile score ≥ 90" is. If you can't write a test for it, the agent can't verify it either.
**Mixing multiple tasks in one spec.** Each spec should map to one commit, one PR, one verifiable outcome. If your goal statement has "and" in it, split the spec.
**Specifying implementation in the constraints.** There's a difference between "use the `express-rate-limit` library" (a constraint that protects a decision already made) and "implement a token bucket algorithm that refills at 6.67 requests per minute" (you're writing the code). The former is a spec. The latter is micromanagement that makes the agent's output worse, not better.
**No test requirement.** If the spec doesn't include acceptance criteria that reference tests, the agent will either skip tests or write tests that verify what it built rather than what you specified. Require specific test coverage. Name the test cases.
## How Claude Code Uses Specs
Claude Code operates on two layers of spec. The first is `CLAUDE.md` — the persistent, project-level spec that defines architecture conventions, off-limits files, invariants that must never break, and the style of interaction you expect. This is your standing contract with the agent.
The second layer is the per-task spec — the document you write for a specific piece of work. In a Spec-Driven Development workflow, this is typically a file you commit to the repo (a `specs/` directory works well) before you invoke the agent. The agent reads it, plans against it, and executes it. You review the output against the acceptance criteria. If a criterion fails, you don't re-prompt with "fix the test" — you examine whether the spec was ambiguous and update it first.
The spec becomes version-controlled documentation of your design decisions. Six months from now, it tells you not just what was built but why the constraints existed.
## A Worked Example: From Idea to Executable Spec
Suppose you want to add a search feature to a blog. The naive path: "Hey Claude, add search to my blog." The SDD path:
First, you write the spec. You decide: search scope (posts only, not pages), search fields (title + summary, not body — body is too large and too expensive to index), result count (max 10), UI (a modal triggered by `Cmd+K`, consistent with the existing design system), where the search index lives (build-time JSON, not a server), and what "done" looks like (a Lighthouse performance score that doesn't regress, a keyboard accessibility test that passes).
Then you commit the spec and invoke the agent. The agent has everything it needs. It doesn't guess. It implements exactly what you decided — because you made those decisions in the spec instead of in your head.
The output is reviewable against the criteria. The criteria are testable. The spec is the diff you show your team when someone asks why search works the way it does.
## The Payoff
The value of a tight spec isn't just better AI output — though it is that. It's that writing the spec forces you to make decisions you'd otherwise defer to implementation time, where they're expensive to change.
Agents are fast. Unclear thinking is the bottleneck. The spec file is where you do your thinking. Do it well, and you're not supervising an AI writing code — you're shipping software.
---
*Sources: [Anthropic Spec-Driven Development Guide](https://docs.anthropic.com/en/docs/claude-code/spec-driven-development); [Anthropic 2026 Agentic Coding Trends Report](https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf); [Claude Code documentation — CLAUDE.md](https://docs.anthropic.com/en/docs/claude-code/memory)*
---
# Mistral Medium 3.5 Just Entered the Agentic Coding Race — Here's Where It Stands
URL: https://sdd.sh/2026/05/mistral-medium-3.5-just-entered-the-agentic-coding-race-heres-where-it-stands/
Date: 2026-05-05
Updated: 2026-05-05
Tags: Mistral, Mistral Medium 3.5, agentic coding, AI tools, SWE-bench, remote agents, cloud agents
Categories: AI Tools, Industry
Summary: Mistral's 128B Medium 3.5 model and its Vibe remote agent platform went live this week. 77.6% SWE-bench Verified, async cloud execution, and a direct shot at the agentic coding market. The benchmarks are strong. The architecture tells a more complicated story.
Mistral AI launched two things this week that matter to anyone building software with AI: **Mistral Medium 3.5**, a 128B unified model scoring 77.6% on SWE-bench Verified, and **Vibe remote agents** — cloud-hosted async coding agents that run in isolated GPU-backed sandboxes without requiring a terminal to be open on your machine.
This is Mistral's first serious entry into the agentic coding market, and it's well-timed. The field is consolidating: Claude Code dominates the terminal-native segment, GitHub Copilot Autopilot covers the IDE-embedded enterprise market, and Jules handles Google's async agent play. Mistral is coming in with a strong model and a credible infrastructure story.
Let's look at what's actually here — and where the gaps are.
## Mistral Medium 3.5: The Model
Medium 3.5 is a 128B parameter dense model with a 256K token context window. Unlike Mistral's previous specialist releases, it's a unified model: the same weights handle code, reasoning, and chat. No separate "Codestral" for code, no separate "Mistral Large" for reasoning — one model, all tasks.
The headline benchmark is **77.6% on SWE-bench Verified**, placing it competitive with GPT-5.4 (which sat around 76-77% on the same benchmark before the transition to SWE-bench Pro). For context: Claude Opus 4.7 scores 87.6% on Verified and 64.3% on the harder SWE-bench Pro, which uses a private, multi-language task set less susceptible to training data contamination.
The 77.6% figure is legitimate. What it doesn't tell you is Pro performance, which Mistral hasn't published. Given the gap between Verified and Pro scores for every frontier model (Claude's 87.6% Verified → 64.3% Pro, GPT-5.5 Spud's ~82% Verified → 58.6% Pro), a Verified score of 77.6% likely maps to a Pro score somewhere in the 50-55% range. Capable — not frontier-tier.
The 256K context window is real and useful for codegen. Fitting an entire codebase's relevant surface in one shot means fewer compaction events, less context management overhead, and better coherence across long agentic tasks. This matches Claude's 1M context GA and is a meaningful differentiator against models still bottlenecked at 128K.
Pricing hasn't been fully disclosed as of this writing, but Mistral's previous pricing has been aggressive. Expect something below the $5/$25 per million tokens that Claude Opus 4.7 charges.
## Vibe Remote Agents: The Platform
The more interesting announcement is the Vibe remote agent platform. Here's how it works:
Agents run in isolated, GPU-backed cloud sandboxes. You invoke them via CLI (`vibe run`) or through the Le Chat interface. The agent executes asynchronously — you don't need to keep a terminal open. Sessions can be "teleported" from a local environment to a cloud sandbox mid-execution, preserving full state: open files, tool invocation history, working directory.
Parallel agents are supported. You can spawn multiple Vibe agents on different branches or different tasks, monitor their status, and merge outputs. The sandboxes are ephemeral but resumable: they checkpoint at task boundaries.
The integrations at launch: Git (GitHub, GitLab), email, calendar, Jira, Slack, and a web browser tool. Essentially, an agent can read an issue in Jira, check out a branch, implement a fix, run tests in the sandbox, open a PR, and notify you in Slack — without any of it touching your local machine.
This is a coherent async agent story. It directly competes with Claude Code Routines (Anthropic's scheduled cloud execution platform) and Jules (Google's async agent on Gemini 3.1 Pro).
## The Honest Comparison
**Model quality**: Claude Opus 4.7 leads on SWE-bench Pro by a significant margin. For a cost-sensitive use case where you're running thousands of agent tasks, Mistral Medium 3.5 at a lower per-token cost may be attractive even if it doesn't match Opus at the frontier. The performance-per-dollar math will matter once pricing is fully disclosed.
**Context window**: Mistral's 256K is solid. Claude's 1M is twice as large — a meaningful gap for full-repo tasks and long-running agentic sessions where you're accumulating context across tool calls.
**Infrastructure maturity**: Claude Code Routines have been in use since April 2026 and are battle-tested against real production codebases. Vibe remote agents are new this week. Early adopter friction is expected.
**Terminal-native vs. platform-native**: Claude Code is built as a terminal agent first. It runs in your shell, it integrates with your existing tooling, and it operates on your local filesystem or a remote machine you control. Vibe agents run on Mistral's infrastructure in Mistral's sandboxes. The latter is powerful for async workflows, but it means your code and your task execution state run on someone else's servers. For enterprises with data residency requirements, that's a ceiling.
**Ecosystem**: The MCP ecosystem now has 6,400+ servers. Claude Code can invoke any of them. Vibe's tool integrations are strong at launch but curated — not a general-purpose protocol. For teams that have already invested in the MCP ecosystem, Vibe's integrations start from scratch.
## Why This Matters Anyway
Mistral matters even if Medium 3.5 isn't the best coding model available.
The arrival of a well-funded, credible European AI lab with a competitive coding benchmark and a real async agent platform is pressure on pricing. Claude Code's commercial dominance — $2.5B ARR, >50% of Anthropic enterprise spend — has been built on genuine technical advantage and a terminal-native architectural bet that turned out to be right. But when capable competition arrives, pricing eventually follows.
It also matters for the open-source ecosystem. Mistral's previous models (Mistral 7B, Mixtral 8x7B, Codestral) have seeded a generation of fine-tunes, local deployments, and derivative tools. If Medium 3.5 gets a commercial or partially open license, the downstream ecosystem benefit is real — even for teams that continue using Claude Code for production agentic work.
And it matters as a signal. The agentic coding market is no longer a two-horse race between Claude Code and GitHub Copilot Autopilot. Google has Jules. OpenAI has the Codex Desktop agent. Mistral now has Vibe. The infrastructure primitives — async execution, sandboxed cloud VMs, parallel agents, durable sessions — have become table stakes fast.
## The Verdict
Mistral Medium 3.5 is a genuinely strong coding model. Vibe remote agents is a credible async agent platform. Neither dislodges Claude Code as the benchmark for terminal-native agentic development, and Medium 3.5's SWE-bench Pro performance will need to be published and independently verified before drawing further conclusions.
For teams running on a tight compute budget, or building in a context where Mistral's European data residency matters, this is a serious option. For teams doing serious agentic engineering — long-horizon tasks, multi-agent orchestration, deep MCP integration — Claude Code and Opus 4.7 remain the reference implementation.
What Mistral has done is raise the floor. That's good for everyone.
---
*Sources: [Mistral AI — Remote agents in Vibe, powered by Mistral Medium 3.5](https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5); [SWE-bench Leaderboard](https://www.swebench.com); [Claude Code Routines documentation](https://docs.anthropic.com/en/docs/claude-code/routines); [Anthropic — Claude Opus 4.7 release](https://www.anthropic.com/news/claude-opus-4-7)*
---
# Meta Avocado Is Closed-Source. The Llama Era Might Be Over.
URL: https://sdd.sh/2026/05/meta-avocado-is-closed-source.-the-llama-era-might-be-over./
Date: 2026-05-04
Updated: 2026-05-04
Tags: Meta, open-source, Llama, AI models, industry, open-weight
Categories: Industry, AI Tools
Summary: Meta's next flagship model has been delayed twice, benchmarks below GPT-5.5 and Claude Opus 4.7, and unlike Llama — it won't be open-sourced. Meta is reportedly considering licensing Google Gemini as a stopgap. The open-source AI story Meta spent two years building is quietly unraveling.
Meta's next flagship model, codenamed Avocado, was supposed to ship in March. Then May. Now it's drifting toward June, internally benchmarking between Gemini 2.5 and Gemini 3.0 — well below Claude Opus 4.7 and GPT-5.5 "Spud" — and carrying a decision that might matter more than any benchmark: **it won't be open-sourced**.
This is a big deal. The Llama ecosystem is one of the most significant things to happen to AI infrastructure in the past three years. Llama 3, 3.1, 3.2, Llama 4 — Meta's open weights powered thousands of fine-tuned models, local deployments, enterprise AI stacks, and the entire open-source supply chain for language models. Avocado's closed-source designation signals that Meta is pulling back from that commitment at exactly the moment when the open-source community has come to depend on it.
## What We Know About Avocado
The details are thin but consistent. Avocado is being developed by **Meta Superintelligence Labs**, the unit under Alexandr Wang — Scale AI co-founder, now Meta's Chief AI Officer. The model was originally targeted for March 2026, has slipped twice, and the current expectation is a May or June release.
Internal benchmark results are landing between Gemini 2.5 and Gemini 3.0. That puts Avocado meaningfully behind Claude Opus 4.7 (87.6% SWE-bench Verified, 64.3% SWE-bench Pro) and GPT-5.5 "Spud" (82.7% Terminal-Bench 2.0, 58.6% SWE-bench Pro). For a model that's months late and positioned as Meta's frontier play, that's a difficult position to launch from.
The closed-source decision appears strategic rather than temporary. Unlike Llama — where open weights were a deliberate policy choice, a way to build ecosystem dominance and put pressure on OpenAI's proprietary model — Avocado is being positioned as a competitive enterprise product. Meta appears to have calculated that releasing frontier weights is now a liability rather than a brand asset.
## The Gemini Option
Perhaps the most striking detail: Meta reportedly discussed temporarily licensing Google Gemini as an interim model while Avocado catches up.
Let that sink in. The company that built Llama — the model that became the backbone of open-source AI — discussed licensing a competitor's proprietary model because its own frontier offering isn't ready. This is not a company that's winning the AI race. This is a company in the middle of a painful recalibration.
No decision has been announced; Meta may have already ruled it out. But the fact that it reached the conversation stage says something about how badly internal momentum has slipped.
## The Pattern: A Second Closed-Source Turn
Avocado isn't the first sign of Meta's retreat from openness in 2026. In April, [Muse Spark — Meta's next-generation creative AI platform](/posts/meta-muse-spark-closed-source-open-source-ai/) launched as a closed-source product with no announced plans for open weights. The stated reason was protecting competitive advantage in a specific product area. Avocado suggests that reasoning has generalized to Meta's model strategy.
The trajectory is clear:
- **Llama 4** (April 2025): Open, under a community license
- **Muse Spark** (April 2026): Closed, no weights, no timeline for release
- **Avocado** (expected May/June 2026): Closed, enterprise-positioned
The progression isn't subtle. Meta is exiting the open-source frontier — at least for its most capable models.
It's worth understanding why. For the first three years of the Llama era, openness was cheap. Llama 2 and 3 were capable but not at the frontier — releasing them cost Meta little competitive ground while earning enormous ecosystem goodwill, research citations, and developer loyalty. As models get better and the gap between "open-source capable" and "frontier capable" narrows, releasing weights becomes a genuine cost. You're handing competitors a foundation they can fine-tune and deploy without paying your API prices. The math changed.
## What This Means for the Ecosystem
The Llama ecosystem is substantial. Hugging Face hosts hundreds of Llama-derived models. AWS Bedrock, Azure AI, and Google Vertex all offer Llama inference. Thousands of enterprises have built fine-tuned applications on Llama weights. The open-source AI coding stack — tools like Aider, Continue.dev, and similar — relies on Llama models as the affordable local option.
Llama 4 and older versions aren't going anywhere. The weights are already out there. But the open-source community has been operating on an implicit assumption: that Meta would continue releasing capable open weights at or near the frontier. If Avocado is closed and whatever follows is also closed, that supply dries up.
The practical consequences:
**Local and fine-tuning pipelines break down.** If Avocado's capabilities exceed Llama 4's — which they should, given the training investment — closed weights mean engineers can't fine-tune on domain-specific data. The model is a rented API, not owned infrastructure.
**Cost changes.** The "free" model you could run on a GPU cluster becomes a metered service. For cost-sensitive applications that currently self-host Llama, Avocado's API is a new line item.
**Lock-in returns.** The entire value proposition of open weights — control, privacy, no vendor lock-in, deploy anywhere — evaporates for Avocado users. They're back in the same position as GPT-4 customers in 2023.
## The Beneficiaries
**Anthropic.** Claude on AWS Bedrock, with Mantle's zero-operator-access architecture, already offers the enterprise air-gap story that Avocado will attempt to compete on — but Anthropic got there first. The [$25B Amazon investment](/posts/amazon-anthropic-25-billion-aws-100-billion-deal/) secured the infrastructure partnerships, compliance certifications, and Claude Code integration that make Bedrock a genuine development platform, not just a model API. If enterprise customers are evaluating closed frontier models, Claude Opus 4.7's benchmark lead (64.3% SWE-bench Pro vs. Avocado's unconfirmed but below-GPT-5.5 numbers) is a hard argument for Meta to counter.
**The open-source labs.** If Meta vacates the open-source frontier, the gap won't stay empty. [DeepSeek V4-Pro](/posts/deepseek-v4-open-weight-frontier-huawei-ascend/) (MIT license, 80.6% SWE-bench Verified, at 1/6th the cost of Opus 4.7) and [GLM-5.1](/posts/glm-5-1-open-source-beats-frontier-models-swe-bench-pro/) (MIT, 58.4% SWE-bench Pro from Z.AI) have already demonstrated that frontier-adjacent capability doesn't require a hyperscaler budget or a proprietary license. Chinese labs in particular have shown willingness to release powerful weights openly — partly for ecosystem reasons, partly because the regulatory environment makes closed models less commercially valuable domestically.
## The Honest Assessment
Meta's AI ambitions are real and its resources are massive. Avocado may ship, may surprise, may even get open-sourced eventually if the competitive situation shifts. Alexandr Wang and the Meta Superintelligence Labs team are serious people building serious systems.
But right now, in May 2026, the signals are discouraging. A model that's months late, benchmarking below its main competitors, under consideration for a third-party licensing bridge, and no longer carrying the open-source commitment that made the Llama brand significant — that's a difficult story to tell as a win.
The Llama era gave developers something genuinely valuable: a credible, capable, free alternative to proprietary models. Whether Avocado marks the end of that era — or just a chapter break before Meta recaptures frontier performance and reverts to openness — is the question Meta hasn't answered yet.
The open-source AI community, watching closely, would like that answer soon.
---
**Sources:**
- [Meta postpones Avocado AI model launch — MLQ.ai](https://mlq.ai/news/meta-postpones-avocado-ai-model-launch-to-may-amid-performance-gaps-with-competitors/)
- [Meta Muse Spark closed-source — sdd.sh](/posts/meta-muse-spark-closed-source-open-source-ai/)
- [Amazon $25B Anthropic investment — sdd.sh](/posts/amazon-anthropic-25-billion-aws-100-billion-deal/)
- [DeepSeek V4 open-weight frontier — sdd.sh](/posts/deepseek-v4-open-weight-frontier-huawei-ascend/)
- [GLM-5.1 open-source beats frontier — sdd.sh](/posts/glm-5-1-open-source-beats-frontier-models-swe-bench-pro/)
- [Meta Superintelligence Labs announcement — Meta](https://about.fb.com/news/2025/09/meta-ai-superintelligence-labs/)
---
# Agentic Coding 101: When Your AI Plans, Builds, Tests, and Ships
URL: https://sdd.sh/2026/05/agentic-coding-101-when-your-ai-plans-builds-tests-and-ships/
Date: 2026-05-04
Updated: 2026-05-04
Tags: agentic-coding, Claude Code, AI Tools, workflow, autonomous-agents, Claude
Categories: Agentic Workflows, Guides
Summary: Most engineers still think of AI coding as an advanced autocomplete. They're missing the paradigm shift. Agentic coding is fundamentally different — the AI plans the work, writes the code, runs the tests, fixes the failures, and iterates until the task is done.
Most engineers still think of AI coding as advanced autocomplete. They're missing the paradigm shift.
"Autocomplete mode" describes roughly 80% of how developers currently use AI coding tools. You're writing a function, Copilot suggests the next line, you tab to accept. You open a chat pane, describe a bug, the model suggests a fix, you apply it. **You stay in the loop at every step.** The AI is a sophisticated suggestion engine — faster and more capable than a code search, but fundamentally reactive. It waits for your next move.
Agentic coding is something else entirely. You give the AI a task and it **runs until the task is done** — or it hits a genuine decision point and asks for guidance. It reads your codebase. It runs your tests. It sees the failures. It makes fixes. It runs your tests again. It may spawn sub-agents to handle parallel workstreams. You're not tabbing to accept suggestions; you're reviewing the completed work.
This isn't a bigger Copilot. It's a different paradigm.
## What Makes Something Actually Agentic
The term gets abused. Cursor adds an "agent mode" and calls itself agentic. GitHub Copilot announces "autopilot" and implies autonomy. But labeling a feature "agentic" doesn't make it so.
True agentic coding requires three things:
**1. Tool use.** The model must be able to take actions beyond generating text. Reading files, writing files, running shell commands, executing tests, making API calls, searching documentation. An AI that can only output text can *describe* what code to write. An AI with tools can *write and run* it.
**2. Long-horizon planning.** A real agentic task spans dozens of steps. The model must maintain a coherent plan across the full task — not just the next token, not just the next line, but the entire arc from current state to goal. This demands genuine working memory (long context) and explicit planning behavior, not just a chain of suggestions.
**3. Autonomous iteration.** When tests fail, the agent doesn't stop and ask "what should I do?" It reads the failure output, identifies the root cause, makes a fix, and runs tests again. The loop continues until the task succeeds or the agent hits a decision it can't resolve without you.
IDE plugins that suggest multi-file edits and call it "agentic" are missing items 2 and 3. They're multi-file suggestion engines. Better than single-file suggestion engines, but not agents.
## The Agentic Loop
The core pattern is straightforward:
```
1. Understand → read relevant files, check tests, understand constraints
2. Plan → decompose the work, identify dependencies, estimate scope
3. Implement → write code following your conventions and patterns
4. Verify → run tests, check types, validate against requirements
5. Fix → address failures and iterate back to step 4
6. Report → summarize what was done and why
```
This loop runs autonomously. You hand off a task, the agent runs the full cycle, and you come back to a summary of completed work. For well-defined tasks with good automated tests, you often don't need to intervene at all.
The quality of each step depends on three things: the capability of the underlying model, the quality of context available to the agent, and the tooling available for execution and verification.
## The Stack You Need
**A capable model.** Not every model can run a reliable agentic loop. The limiting factor is usually instruction-following quality on long, multi-step tasks and tool-call accuracy. A model that hallucinates tool arguments or loses its plan halfway through will fail on anything non-trivial. As of May 2026, **Claude Opus 4.7** is the reference for agentic coding: 87.6% SWE-bench Verified, one-third the tool errors of its predecessor in agentic loops, and native multi-agent coordination for parallel workstreams.
**Context about your codebase.** Generic Claude knows how to write code. It doesn't know *your* conventions, *your* architecture, *your* testing patterns, or *your* service boundaries. This is what `CLAUDE.md` is for. A well-written CLAUDE.md tells the agent what it needs to know to make decisions your team would endorse: which patterns to use, which to avoid, where the key files are, what the testing strategy looks like. An agent without this context will write technically correct code that doesn't fit your codebase.
**Tools for execution.** The agent needs to read and write files, run shell commands, execute tests, and optionally make MCP calls to external systems. The richer the toolset, the more complete the verification loop. An agent that can run your test suite catches its own bugs. An agent that can only edit files cannot.
**[Claude Code](https://claude.ai/code)** is the reference implementation of this stack. Terminal-native (full shell access), built on Opus 4.7, ships with CLAUDE.md support, and has native MCP integration for extending the tool surface. It was designed for the agentic loop from the ground up — not retrofitted with agent features on top of an autocomplete engine.
## When to Use Agentic Mode
Not every coding task benefits from an agentic approach. A useful heuristic:
**Strong candidates for agentic mode:**
- Well-defined tasks with clear acceptance criteria (ideally a passing test suite to aim for)
- Work that spans multiple files or requires understanding existing code structure
- Tasks with mechanical structure that doesn't require creative product judgment
- Anything that benefits from automated verification (tests, type checks, linters)
Real examples: migrating an API endpoint from REST to GraphQL, adding a new data model and wiring up CRUD operations, writing comprehensive tests for a module that has none, refactoring code to match updated conventions, implementing a spec from a requirements document.
**Poor candidates for agentic mode:**
- Ambiguous tasks ("improve the dashboard")
- Tasks requiring significant creative or product judgment ("design the auth flow")
- Work that can't be automatically verified ("write better documentation")
- Very small and specific changes ("fix the typo on line 47")
For poor agentic tasks, regular AI assistance — chat, inline suggestions, a quick prompt — is faster and more appropriate. Your judgment needs to stay in the loop.
## Common Pitfalls
**Too wide a permission scope.** An agent with unconstrained write access will make changes you didn't anticipate. Define what it can and can't touch. Claude Code's permission system — allow/deny lists, cautious mode — exists for this reason. The discipline of scoping permissions is also good practice: it forces you to be explicit about what you're actually asking for.
**No CLAUDE.md.** An agent writing code without codebase context defaults to generic best practices. It will use patterns wrong for your stack, import the wrong libraries, and miss conventions that matter to your team. This is the most common reason agentic coding underdelivers. Investment in a good CLAUDE.md compounds across every task you run.
**Vague task specification.** "Fix the bug" is not a task. "The `UserSync` service fails when the upstream API returns a 429. Fix it to retry with exponential backoff up to 3 times, with unit tests" is a task. Agentic coding amplifies your specification quality — precise spec, good output; vague spec, variable output.
**Skipping verification.** The power of agentic coding comes from automated feedback loops. If you hand the agent a task with no automated tests, it can't self-verify. Either it writes its own tests (good, adds time) or it delivers code that might be subtly wrong. Test coverage pays off most in agentic workflows, because it's the mechanism by which the agent proves its own work.
## Your First Agentic Task
If you haven't run a full agentic task with Claude Code, here's a good first experiment:
1. Write a `CLAUDE.md` for your most active repository — three paragraphs covering the stack, the main conventions, and what to avoid.
2. Find a module with low or no test coverage.
3. Give Claude Code a specific target: *"Add comprehensive unit tests to `src/payments/processor.ts`. Aim for at least 80% line coverage. Run the tests to verify. Don't modify the implementation files."*
Watch it read the module, plan the test cases, write the tests, run them, find the failures, fix them, and iterate. When it finishes, review what it produced.
That's the loop. That's agentic coding.
The engineers who internalize this workflow in 2026 aren't writing less code — they're shipping more of it. They've learned to specify tasks precisely, verify them with automated tooling, and review the output rather than producing it line by line. The bottleneck shifts from implementation speed to specification quality. That's a shift worth making.
---
**Sources:**
- [Claude Code documentation](https://docs.anthropic.com/en/docs/claude-code) — Anthropic
- [SWE-bench Leaderboard](https://www.swebench.com) — Evaluation framework for AI coding agents
- [Anthropic 2026 Agentic Coding Trends Report](https://www.anthropic.com/research/agentic-coding-2026)
- [Terminal-Bench 2.0](https://terminal-bench.com) — Long-horizon agentic task evaluation
- [Claude Opus 4.7 release — sdd.sh](/posts/claude-opus-4-7-agentic-coding-benchmark-release/)
---
# Microsoft Agent 365 Is Live: The Enterprise Control Plane That Governs Agents You're Already Running
URL: https://sdd.sh/2026/05/microsoft-agent-365-is-live-the-enterprise-control-plane-that-governs-agents-youre-already-running/
Date: 2026-05-03
Updated: 2026-05-03
Tags: microsoft, enterprise, AI governance, agents, Agent 365, agentic workflows
Categories: AI Tools, Industry, Agentic Workflows
Summary: Microsoft Agent 365 reached general availability on May 1, 2026, bundled into the new M365 E7 Frontier Suite at $99/user. It is not a coding agent or a development tool. It is governance infrastructure — a control plane for discovering, governing, and securing every AI agent in your organization. Here is what it actually does, what it cannot govern, and why it matters.
On May 1, 2026, Microsoft made [Microsoft Agent 365](https://www.microsoft.com/en-us/microsoft-agent-365) generally available. It shipped alongside the new Microsoft 365 E7 Frontier Suite — a $99/user bundle that consolidates M365 E5, Microsoft 365 Copilot, the Entra Suite, and Agent 365 into a single license tier.
Agent 365 is not a coding agent. It is not a development tool. It will not write code, review PRs, or run tests. What it does is something the enterprise IT and security world has been asking for since AI agents started proliferating: a unified control plane for discovering, governing, and securing every agent in your organization — regardless of where it was built, what model it runs, or which cloud hosts it.
Whether that is the right problem to solve first is a different question.
## What Agent 365 Actually Does
Microsoft describes Agent 365 around five capabilities.
**Registry** is the starting point. Powered by Microsoft Entra, it provides a centralized inventory of every agent deployed and used across your organization — Microsoft-built, third-party, and custom agents. Admins see adoption numbers, activity metrics, and agent health status in the M365 admin center. This is the visibility layer: if you do not know what agents are running, you cannot govern them.
**Access Control** sits on top of the Entra identity infrastructure. It enforces adaptive, risk-based access policies that respond to real-time context — blocking agents that show signs of compromise from accessing organizational resources before human intervention. Think of it as Conditional Access for agents: the same principles that govern how users authenticate, now applied to non-human identities.
**Visualization** provides the dashboard layer. IT can see how agents connect to each other, what data sources they access, how they perform over time, and where issues cluster. The intent is operational awareness at scale — when you have 50 agents running across Sales, Engineering, Finance, and HR, you need a map.
**Interoperability** is where Agent 365 connects to the broader ecosystem. Agents can access organizational data through [Work IQ](https://learn.microsoft.com/en-us/microsoft-agent-365/overview) — Microsoft's enterprise knowledge graph — and integrate with M365 apps: Outlook, Word, Excel. Registry sync with AWS Bedrock and Google Cloud connections means IT teams can discover and perform basic lifecycle governance (start, stop, delete) for agents hosted outside Microsoft's stack.
**Security** brings in Microsoft Defender. Security teams can proactively remediate agent vulnerabilities and misconfigurations, use AI-powered threat intelligence to block attacks, investigate incidents, and prevent data exfiltration through agent channels.
## The Pricing Signal
The E7 Frontier Suite at $99/user lands 73% above the E5 price point of approximately $57/user. That is not a small premium for a governance layer. Microsoft is effectively signaling that enterprise-scale AI agent deployment requires a new tier of infrastructure — and that infrastructure is worth paying for.
The standalone Agent 365 add-on at $15/user/month for existing E5 customers softens that math for organizations not ready to jump to E7. But the bundle pricing is the message: in Microsoft's view, Copilot-as-a-feature is E5 territory, and Copilot-plus-governed-agents is E7 territory.
This is familiar Microsoft playbook. Azure Active Directory became Entra as identity got more complex. Microsoft Endpoint Manager absorbed Intune as device management expanded. Each complexity increase became a new product tier. Agent governance is the next one.
## The Shadow AI Problem Agent 365 Cannot Solve
Here is the honest limitation: Agent 365 primarily governs agents that Microsoft and its partners know about.
Right now, across most enterprises, developers are running Claude Code directly from their terminals. The Claude Code session on a developer's laptop accesses the codebase, calls tools, and commits code — none of that appears in the Agent 365 registry. Cursor agents run inside the IDE. GitHub Copilot Autopilot spins up in a sandboxed cloud environment managed by GitHub, not Microsoft's governance layer. OpenAI Codex agents operate through OpenAI's infrastructure.
The multi-cloud registry sync (AWS Bedrock, Google Cloud) partially addresses this: if your organization's Bedrock-hosted Claude or Vertex-hosted Gemini agents are registered there, Agent 365 can discover and perform lifecycle management. But this requires that those agents are formally registered in the first place. Ad-hoc developer tool usage — the Claude Code sessions, the Cursor activations — is invisible to the registry.
This is the enterprise shadow AI problem in concrete form. IT departments can govern what they can see. What they cannot see is the productivity layer that engineering organizations are already running at scale. Anthropic's own $2.5B ARR trajectory suggests those Claude Code sessions are not going away.
## What This Means for Development Teams
Most developers will never directly interact with Agent 365. It is an IT and security team product. But the existence of Agent 365 — and its pricing structure — tells development organizations something important about where enterprise AI adoption is going.
Enterprises are not going to allow unregistered AI agents to run indefinitely. As CISOs and CIOs get more comfortable with AI in production, governance requirements will follow. Agent 365 is Microsoft building the infrastructure side of that requirement. Development teams who want to use Claude Code, Cursor, or any other AI coding tool at enterprise scale will increasingly need to have an answer to the governance question.
Anthropic has been building toward this. Claude Cowork's RBAC and SCIM provisioning, OpenTelemetry/SIEM integration, the Analytics API, and Mantle zero-operator-access on AWS Bedrock are all governance story components. The difference is Anthropic is selling governance as part of the developer toolchain. Microsoft is selling governance as enterprise IT infrastructure.
Neither approach is wrong. They are solving different parts of the same problem at different layers of the stack.
## The Governance-First Bet
What makes Agent 365 significant is less the specific features than the strategic position it represents. Microsoft is betting that the biggest blocker to enterprise AI agent adoption is not capability — it is governance. CISOs are not comfortable deploying agents they cannot inventory, govern, or secure. Agent 365 is designed to remove that objection.
This is governance-first thinking, and it is the right instinct for the Microsoft customer base. Large enterprises move slowly. Procurement cycles require compliance documentation. Security reviews require auditability. Agent 365 gives IT departments the paperwork layer that makes AI agent deployment approvable in organizations where "we'll figure out governance later" is not an acceptable answer.
The irony is that the agents most worth governing — the ones actually doing substantive work in enterprise codebases — are largely the ones Agent 365 cannot yet see. Closing that gap, through deeper integrations with non-Microsoft AI platforms and enforcement mechanisms that reach developer toolchains, is the work that will determine whether Agent 365 becomes the de facto enterprise control plane or a governance layer for only the Microsoft-native portion of the AI stack.
For now, it is the best-designed enterprise agent governance product available. It just needs the rest of the industry to cooperate.
---
**Sources:**
- [Microsoft Agent 365 GA announcement (Microsoft Security Blog)](https://www.microsoft.com/en-us/security/blog/2026/05/01/microsoft-agent-365-now-generally-available-expands-capabilities-and-integrations/)
- [Microsoft Agent 365 overview (Microsoft Learn)](https://learn.microsoft.com/en-us/microsoft-agent-365/overview)
- [Microsoft 365 E7 Frontier Suite overview (Microsoft Blog)](https://blogs.microsoft.com/blog/2026/03/09/introducing-the-first-frontier-suite-built-on-intelligence-trust/)
- [M365 E7 Frontier Suite launch (AdminDroid)](https://blog.admindroid.com/microsoft365-e7-frontier-suite/)
- [Microsoft Agent 365 Governance: Building Control, Trust and Scale (Charter Global)](https://www.charterglobal.com/microsoft-agent-365-and-enterprise-ai-governance-building-control-trust-and-scale-for-autonomous-systems/)
- [Agent 365 Boosts AI Identity, Yet Governance Gaps Remain (Entro Security)](https://entro.security/blog/microsoft-agent-365-pushes-ai-identity-forward-but-enterprise-agents-still-need-cross-environment-governance/)
- [Microsoft Agent 365 GA: Control Plane for Windows and Multicloud (Windows News)](https://windowsnews.ai/article/microsoft-agent-365-ga-control-plane-for-governing-ai-agents-across-windows-and-multicloud.416288)
---
# Cursor Security Review vs. Claude Security: Two Betas, One Week, Opposite Architectures
URL: https://sdd.sh/2026/05/cursor-security-review-vs.-claude-security-two-betas-one-week-opposite-architectures/
Date: 2026-05-03
Updated: 2026-05-03
Tags: cursor, claude, security, enterprise, code scanning, AI tools
Categories: AI Tools, Industry
Summary: On April 30, 2026, both Cursor and Anthropic shipped AI-powered security products on the same day. The features look similar on paper. The architectures could not be more different — and that difference tells you everything about where each company thinks AI coding is headed.
On April 30, 2026, two AI security products launched on the same day. Cursor shipped [Cursor Security Review](https://cursor.com/changelog/04-30-26), a beta available to Teams and Enterprise customers. Anthropic shipped Claude Security, powered by Opus 4.7, for Claude Enterprise customers. Both promise to find vulnerabilities in your codebase that existing tools miss. Both use AI agents rather than pattern-matching signatures.
The similarity ends there. Under the hood, these are two fundamentally different bets about where AI-assisted development is going — and the architectural choices each company made reveal exactly what that bet is.
## What Cursor Security Review Does
Cursor Security Review ships as two always-on agents.
The first is the **Security Reviewer**. It runs on every pull request and leaves inline comments at the exact diff location — severity rating, affected code, remediation steps. It checks for vulnerabilities, authentication regressions, privacy and data-handling risks, auto-approved tool calls, and prompt injection attacks against your agent workflows. If configured, it can block the CI pipeline on security findings.
The second is the **Vulnerability Scanner**. This one runs on a schedule — daily, weekly, whatever you configure — and scans the full codebase for known vulnerabilities, outdated dependencies, and configuration issues. Findings get posted to Slack with dismiss/snooze actions. It can also open GitHub issues automatically.
Both agents are customizable. You can adjust triggers, add your own security instructions, and — critically — plug in MCP servers for your existing SAST, SCA, and secrets scanners. The design intent is that Cursor's agents act as orchestrators for your existing security toolchain, not replacements. Cursor has also partnered with [Chainguard](https://www.axios.com/2026/04/21/cursor-chainguard-ai-code-security) to steer AI-generated code toward vetted open-source components, reducing the risk of AI pulling in malicious or vulnerable dependencies.
## What Claude Security Does
Claude Security takes a different approach at every layer. Instead of living inside the IDE, it operates as an independent security product. Instead of pattern-matching or signature databases, it uses Opus 4.7's reasoning engine to trace data flows, examine cross-file interactions, and understand business logic.
The reasoning-based distinction matters. Traditional SAST tools look for patterns — `strcpy` with user-controlled input, SQL concatenation, unescaped template literals. They are fast, reliable, and generate a lot of noise. Claude Security reasons about what the code *does*. It can identify a business logic flaw in your authorization model even if no known CVE pattern matches, because it understands the intended behavior and sees the deviation.
Claude Security launched with enterprise security partners already integrated: CrowdStrike (including the Project QuiltWorks AI-native detection program), Palo Alto Networks, SentinelOne, Wiz, and Trend Micro TrendAI. This is not a developer tool trying to add security features — it's a security product built on Claude, distributed through security-market channels.
Claude Security supports scheduled scans with documented dismissals (creating an audit trail), CSV and Markdown export for ticketing system integration, and a public beta API for custom tooling. It does not scan running applications (DAST), container images, or infrastructure-as-code configurations — it is a source code analysis tool.
## The Architectural Divide
Here is the core difference: Cursor Security Review lives inside the development workflow. Claude Security operates outside of it.
That is not a subtle distinction. Cursor's entire product philosophy is IDE-first: the security agent runs where the code is written, reviews PRs where developers are already working, and outputs comments in the same interface where code review happens. This is deeply integrated with the Cursor workflow. If you are already a Cursor shop, Security Review is a natural extension.
But it inherits the same architectural constraint as Cursor itself. The agents live inside the IDE. They are triggered by developer actions — opening a PR, running a scheduled scan. They depend on your development environment being Cursor. And the model powering them is unspecified; Cursor's multi-model architecture means it may not be the most capable reasoning model for any given security analysis.
Claude Security, by contrast, is model-first. It is explicitly built on Opus 4.7 — Anthropic's most capable reasoning model — because security analysis requires the deepest possible reasoning about code behavior. It does not care what IDE your developers use. It does not require them to install anything or change their workflow. It integrates with the tools your security team already uses: SIEM systems via the OpenTelemetry export, your ticketing system via CSV/Markdown, your existing security platform via launch partners.
This is the same architectural difference that separates Claude Code from Cursor as coding tools. One is IDE-centric. The other is model-centric, infrastructure-first, workflow-agnostic.
## Who Wins Where
Cursor Security Review will win in organizations that are already committed to Cursor as their primary development environment. The friction is low — Teams and Enterprise customers get PR review comments automatically, without onboarding a separate tool or convincing the security team to adopt something new. The Chainguard integration and MCP plugin support make it extensible enough to layer onto existing pipelines.
Claude Security will win in organizations that care more about finding vulnerabilities than about where the scan runs. Reasoning-based analysis genuinely catches a different class of bugs than pattern-matching SAST. The security partner ecosystem — CrowdStrike, Wiz, Palo Alto — means it integrates into enterprise security workflows that predate AI coding tools. And because it is tool-agnostic, it works whether your developers use Cursor, Claude Code, Copilot, or vim.
There is also a coverage question. The April 2026 Sherlock Forensics report found that 92% of AI codebases have critical vulnerabilities, with business logic flaws in 72% of codebases. Pattern-matching tools catch the XSS and injection issues. Business logic flaws require reasoning. That is Claude Security's home turf.
## The Bigger Picture
The fact that both of these shipped the same day is significant. Six months ago, neither existed. Now two of the leading AI coding companies have each concluded that AI-powered security scanning is a necessary part of the product.
The immediate implication for engineering teams is practical: you now have options. If your team is already in Cursor, Security Review is worth enabling — it is low-friction and the PR-comment workflow is genuinely useful. If your security team is managing AI code security at the enterprise level, Claude Security's reasoning engine and partner integrations make it the more powerful tool for finding what others miss.
The deeper implication is structural. AI-generated code is creating a security problem that traditional tools were not designed to solve. Business logic flaws, cross-file vulnerabilities, and AI-specific attack surfaces (prompt injection, tool auto-approval) require reasoning-based analysis at scale. Both Cursor and Anthropic are betting on that. They just disagree about where the reasoning should live — inside the IDE, or in a dedicated security infrastructure layer.
Based on Anthropic's track record of building infrastructure that outlasts the current IDE paradigm, that disagreement has a predictable winner.
---
**Sources:**
- [Cursor Security Review launch changelog](https://cursor.com/changelog/04-30-26)
- [Cursor blog: Securing our codebase with autonomous agents](https://cursor.com/blog/security-agents)
- [Cursor + Chainguard partnership (Axios)](https://www.axios.com/2026/04/21/cursor-chainguard-ai-code-security)
- [Cursor's AI Security Agents: What They Get Right (Snyk)](https://snyk.io/blog/cursor-security-agent-prompts/)
- [Claude Security: How It Works vs Snyk (BuildFastWithAI)](https://www.buildfastwithai.com/blogs/claude-security-ai-code-scanner-2026)
- [Best AI Code Security Tools for Enterprise 2026 (TrueFoundry)](https://www.truefoundry.com/blog/best-ai-code-security)
- [Anthropic announcement (referenced in prior Claude Security coverage)](https://www.anthropic.com)
---
# Claude Code v2.1.119: Multi-VCS Support, Settings Persistence, and the Enterprise Push
URL: https://sdd.sh/2026/05/claude-code-v2.1.119-multi-vcs-support-settings-persistence-and-the-enterprise-push/
Date: 2026-05-02
Updated: 2026-05-02
Tags: Claude Code, Anthropic, Release, Enterprise, GitLab, Bitbucket
Categories: AI Tools, Guides
Summary: Claude Code v2.1.119 shipped multi-VCS support for --from-pr (GitLab, Bitbucket, GitHub Enterprise), settings persistence to ~/.claude/settings.json, and proper agent frontmatter handling in --print mode. A release that reads like a feature patch but signals something bigger about where Claude Code is heading.
Not every Claude Code release gets a blog post. Some are genuinely maintenance updates — bug fixes, dependency bumps, minor UX polish. Version 2.1.119, which shipped in late April 2026, looked like one of those at first glance. No new model. No headline agentic feature. Just a changelog of targeted improvements.
Look closer and a pattern emerges. Almost every change in v2.1.119 addresses a friction point that enterprise teams — specifically teams operating in mixed-VCS environments, with strict configuration policies and CI/CD integration requirements — repeatedly reported. This is what enterprise-grade maturation looks like in practice: not a flashy announcement, but the quiet removal of reasons not to deploy at scale.
---
## The Big One: Multi-VCS `--from-pr`
The most significant change in v2.1.119 is also the most practical: `--from-pr` now accepts URLs from GitLab merge requests, Bitbucket pull requests, and GitHub Enterprise Server instances, in addition to the public GitHub URLs it already supported.
Previously, `--from-pr` was effectively a GitHub-only feature. Point it at a PR URL, and Claude Code would pull the diff, context, and history, then start working from there — resuming an in-progress session or bootstrapping a new one. This is genuinely useful for picking up where you left off, handing a PR off to a colleague, or running automated review passes from CI.
The restriction to public GitHub meant it didn't apply to the majority of enterprise deployments. Large organizations tend to run GitHub Enterprise Server on-premises, or choose GitLab for compliance and self-hosting reasons, or use Bitbucket if they're deep in the Atlassian ecosystem. All of those teams were excluded from a workflow that GitHub-native users took for granted.
With v2.1.119, that gap closes. The implementation covers:
- **GitLab merge requests**: `gitlab.com/org/repo/-/merge_requests/123` and self-hosted GitLab instances
- **Bitbucket pull requests**: `bitbucket.org/org/repo/pull-requests/123`
- **GitHub Enterprise Server**: `github.your-company.com/org/repo/pull/123`
The `/resume` search box received the same treatment. You can now paste any VCS provider's PR URL directly into the session picker, and Claude Code will find the originating session regardless of which platform hosted the PR.
For teams running GitLab-based monorepos or Bitbucket-centric workflows, this alone justifies the upgrade.
---
## Settings Persistence and the Config Hierarchy
Before v2.1.119, changing settings via `/config` in a Claude Code session worked fine within that session — but there was no guarantee those changes persisted in a predictable location. Power users had learned to edit `~/.claude/settings.json` directly, but `/config` changes could end up in project-local files or session-only memory depending on how the session was started.
v2.1.119 establishes a clean override hierarchy:
1. **Policy settings** (`~/.claude/settings.policy.json`) — system-level, read-only for users
2. **User settings** (`~/.claude/settings.json`) — where `/config` changes now always land by default
3. **Project settings** (`.claude/settings.json` in repo) — team-shared, checked into version control
4. **Local project settings** (`.claude/settings.local.json`) — developer-specific overrides, gitignored
This matches how configuration typically works in well-designed CLI tools — and more importantly, it matches what enterprise IT and security teams expect when they need to audit or enforce settings. Administrators can now write to the policy layer and know those settings will not be silently overridden by user-level `/config` changes.
For individual developers, the practical effect is simpler: if you set `effort` to `xhigh` or configure a custom `apiKey` via `/config`, it sticks across sessions without any manual JSON editing.
---
## Agent Frontmatter in `--print` Mode
The `--print` flag puts Claude Code into non-interactive batch mode — it reads a prompt, runs it, and outputs the result to stdout. This is the mode used when Claude Code is invoked from scripts, CI pipelines, or orchestration layers.
Before v2.1.119, `--print` ignored agent frontmatter — the `tools:` and `disallowedTools:` declarations at the top of an agent file that constrain which tools the agent can invoke. This created a security and behavior gap: an agent carefully designed to only use specific tools (say, a review-only agent that should never write to the filesystem) would behave differently when invoked via `--print` than when invoked interactively.
The fix in v2.1.119 makes `--print` honor frontmatter consistently. This matters specifically for teams building agentic pipelines where agents are called non-interactively from scripts or CI runners. An agent that declares `disallowedTools: [Write, Bash]` will now have those restrictions enforced in `--print` mode, not just in interactive sessions.
It also makes the `--print` mode more composable with the [Claude Code skills system](/posts/scaling-claude-code-skills-across-an-engineering-org/), where agents are defined as markdown files with frontmatter. Invoking a skill non-interactively now produces behavior that matches the interactive case.
---
## The Small Additions That Add Up
Several other changes in v2.1.119 are individually minor but collectively signal the same enterprise-focus pattern:
**`prUrlTemplate` setting** — allows teams to override the PR badge URL in Claude Code's output footer. Useful for organizations that use custom code review interfaces (like GitHub proxies or internal review portals) rather than linking directly to public GitHub or GitLab.
**`CLAUDE_CODE_HIDE_CWD` environment variable** — suppresses the current working directory in Claude Code's startup banner. Sounds trivial, but organizations that run Claude Code in CI environments with sensitive path structures (directories that reveal internal project names, network mount paths, or container layouts) have asked for this. It's also useful for screenshots and screen recordings where CWD disclosure is undesirable.
**`${CLAUDE_EFFORT}` variable in skill content** — skills and agents can now reference the current effort level via `${CLAUDE_EFFORT}` in their prompt content. This allows skills to adapt their instructions based on whether Claude is running at `medium`, `high`, or `xhigh` effort — a useful hook for skills that should behave differently depending on compute budget.
**Windows PowerShell fallback** — when Git Bash is absent, Claude Code now falls back to PowerShell for shell execution on Windows. Reduces friction for enterprise Windows deployments that haven't set up Git Bash.
**MCP server auto-retry** — MCP servers that fail to start due to transient errors (network timeouts, slow process initialization) now get up to three automatic retry attempts before Claude Code gives up and shows an error. This addresses flakiness in CI environments and slow-starting MCP servers.
---
## `claude ultrareview` as a Subcommand
The `/ultrareview` command — which spins up a fleet of cloud-based multi-agent reviewers for deep code analysis — was [introduced in April 2026](/posts/claude-code-april-2026-ultrareview-auto-mode-power-user-features/). It was interactive-only: you ran it inside a Claude Code session, waited for results, and read them in the terminal.
v2.1.119 adds `claude ultrareview [target]` as a proper non-interactive subcommand, with a `--json` output flag for machine-readable results. This is the integration point CI/CD pipelines were waiting for. Instead of wrapping an interactive session, a build step can now invoke `claude ultrareview --json ./src/auth/` and get structured findings that feed into a PR check, a Slack notification, or a build gate.
The combination of non-interactive `--print` mode with honored agent frontmatter, and `claude ultrareview --json` for batch review, gives teams a meaningful toolkit for agentic code quality enforcement without manual review steps.
---
## 119K GitHub Stars and What It Signals
As a footnote: Claude Code's [GitHub repository](https://github.com/anthropics/claude-code) crossed 119,000 stars alongside the v2.1.119 release. Stars are a vanity metric, but the growth curve is meaningful — the repository added roughly 20,000 stars in the six weeks following the [desktop redesign](/posts/claude-code-desktop-redesign-parallel-sessions/) launch in mid-April.
That kind of sustained star growth, months after initial launch, typically reflects actual adoption rather than launch-day hype. Developers star tools they are actively using or evaluating, not tools they vaguely heard about. Combined with [the revenue figures](/posts/claude-code-2-5b-arr-terminal-beats-ide-market/), the star growth confirms a tool that is growing into its user base rather than exhausting an initial wave of enthusiasm.
v2.1.119 is not a headline release. But if you're running Claude Code in any enterprise context involving GitLab, Bitbucket, GitHub Enterprise, or CI/CD pipelines, it is one of the more practically useful updates the tool has shipped.
---
**Sources:**
- [Claude Code v2.1.115–119 weekly update — Ton Technotes](https://ton-technotes.com/en/blog/2026-04-25-claude-code-weekly-update-v2119/)
- [anthropics/claude-code v2.1.119 — newreleases.io](https://newreleases.io/project/github/anthropics/claude-code/release/v2.1.119)
- [What's new — Claude Code Docs](https://code.claude.com/docs/en/whats-new)
- [Claude Code changelog — GitHub](https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md)
- [Find bugs with ultrareview — Claude Code Docs](https://code.claude.com/docs/en/ultrareview)
- [Claude Code hits 119K GitHub stars — Augment Code](https://www.augmentcode.com/blog/claude-code-119k-stars)
---
# Claude Code at $2.5B ARR: How a Terminal Agent Outpaced Every AI IDE
URL: https://sdd.sh/2026/05/claude-code-at-2.5b-arr-how-a-terminal-agent-outpaced-every-ai-ide/
Date: 2026-05-02
Updated: 2026-05-02
Tags: Claude Code, Anthropic, Revenue, Industry, Agentic Workflows
Categories: AI Tools, Industry
Summary: Claude Code hit $1B ARR in six months after launch — faster than Slack, Zoom, or any AI coding competitor. By February 2026 it had crossed $2.5B, accounting for more than half of all Anthropic enterprise spending. Here's what those numbers actually mean for the AI coding market.
[Anthropic hit $30B ARR in April 2026](/posts/anthropic-30b-arr-overtakes-openai-claude-code-future/), overtaking OpenAI for the first time. That's the headline most people carried. But buried inside that number is a more specific story — one about a single product, a command-line interface with no GUI, no IDE integration by default, and a terminal prompt that looks like it was designed in 1979.
Claude Code reached **$1 billion in annualized run-rate revenue within six months of its general availability launch**. By February 2026 it had crossed **$2.5 billion**. Those are product-level numbers, not company-level, which makes them more striking. Claude Code now accounts for more than half of all Anthropic enterprise spending. Five hundred customers — and growing — spend over $1 million per year on it.
By any measure of enterprise software velocity, that is extraordinary. Slack took four years to reach $1 billion ARR. Zoom took three. Salesforce took seven. GitHub Copilot, the incumbent AI coding tool with the largest distribution advantage in the market, has never disclosed comparable figures but is estimated to be tracking below $500 million ARR after nearly four years of deep GitHub integration.
A terminal agent with no native IDE, no free tier for new signups, and a pricing floor of $20 per month just became the fastest enterprise developer tool to a billion dollars. The question worth asking is: why?
---
## The Architecture Bet That Paid Off
Claude Code launched with a specific thesis: the best way to get AI to write serious code is to give it the same interface a serious developer uses. Not an autocomplete dropdown inside VS Code. Not a chat sidebar you dismiss when the suggestion is wrong. A terminal. Full filesystem access. MCP tool integrations. The ability to run, test, debug, and iterate without asking permission for each step.
That thesis was controversial. Early criticism focused on the lack of visual context — no syntax highlighting, no inline diff preview, no point-and-click for accepting changes. "Too much for junior developers" was a common complaint. "Too slow for iteration loops" was another.
What the criticism missed: Claude Code was never trying to optimize for junior developers or rapid iteration loops. It was optimizing for the tasks that take senior engineers days — architectural refactors, cross-codebase debugging, large-scale migrations, complex feature builds that require holding enormous context simultaneously. Those are the tasks where terminal-native beats IDE-embedded, because they require depth, not speed.
Enterprise engineering teams found this out empirically. The [JetBrains April 2026 developer survey](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/) shows Claude Code adoption at work jumped from 3% to 18% in eight months — a 6× increase — with the highest satisfaction scores in the market (91% CSAT, NPS of 54). Among US and Canadian developers, adoption is 24%. No tool with those satisfaction numbers grows that fast unless it is genuinely solving problems that competing tools aren't.
---
## What $2.5B ARR Requires
Getting to $2.5 billion in annualized revenue from a developer tool requires enterprises to spend real money. Not "we gave everyone a $10/month Copilot subscription" money. Meaningful budget allocations.
Anthropic's [Claude Code Max pricing](/posts/cursor-vs-copilot-vs-claude-code-vs-windsurf-2026/) runs $100/month (5× usage limits) or $200/month (20× limits). At the enterprise level, with Claude Cowork's RBAC controls, group spend limits, and OpenTelemetry integration, organizations are allocating per-developer, per-team budgets. The 500+ customers spending $1 million or more per year are each running hundreds to thousands of seats.
The [Analytics API](/posts/claude-code-analytics-api-enterprise-roi/) is a meaningful signal here. Anthropic built it because enterprise buyers demanded it — they needed per-user, per-day metrics on commits, PRs, lines of code, sessions, tool acceptance rates, and token costs before they could justify budget. You don't build that API unless the deals are real and the procurement cycles are serious.
Comparison against the competition is instructive:
- **GitHub Copilot**: the most broadly deployed AI coding tool in the market, bundled with GitHub, with Microsoft's entire enterprise sales force behind it. Estimated ARR remains below $500M. The primary product is still inline code completion — a fundamentally bounded category.
- **Cursor**: a $50 billion valuation, 1 million paying users at $20–$40/month. Back-of-envelope: $240M to $480M ARR. Cursor has not disclosed a number. Even at the high end, it is a fraction of Claude Code's figure.
- **Windsurf**: acquired by Cognition for ~$250M, $82M ARR at time of acquisition. Now growing under new ownership, but starting from a much smaller base.
Claude Code is not just ahead. It is in a different revenue category than every other AI coding tool.
---
## The Enterprise Flywheel
The $2.5B figure is not just a milestone — it represents a compounding dynamic that is difficult to reverse. Enterprise contracts are sticky. Once an engineering organization has standardized on Claude Code, built CLAUDE.md files across their repositories, integrated Routines into their CI/CD pipelines, and configured Bedrock Mantle for air-gap compliance, switching costs become significant.
The platform bet extends further. [Claude Managed Agents](/posts/claude-managed-agents-anthropic-agent-loop/) absorbs the production agent loop infrastructure that every team was building themselves — sessions, checkpointing, sandboxing, persistent memory. The [Claude Code Analytics API](/posts/claude-code-analytics-api-enterprise-roi/) feeds into enterprise BI dashboards and executive ROI reports. [Claude Cowork](/posts/claude-cowork-ga-enterprise-features/) handles SSO, RBAC, SCIM provisioning, and group spend limits. Claude Code is no longer a developer tool with enterprise features bolted on. It is enterprise infrastructure with a terminal interface.
The Anthropic IPO, targeting October 2026 at a reported $380B+ valuation, will likely frame Claude Code as the primary growth engine. Given that Claude Code already represents the majority of enterprise revenue, that framing will be accurate.
---
## What This Means for the Market
The $2.5B ARR figure should reframe how the AI coding tool market is analyzed.
The conventional narrative frames Cursor as the growth story (from zero to $50B valuation in four years), GitHub Copilot as the distribution story (embedded in the tool every developer already uses), and Claude Code as the autonomy story (the most capable agent, but perhaps too developer-unfriendly to scale broadly).
The revenue figures suggest the conventional narrative is wrong. Claude Code is simultaneously the autonomy story and the growth story. Developer-unfriendliness — if that was ever true — did not prevent enterprise adoption. If anything, the depth of capability that makes Claude Code harder for casual users is exactly what makes it worth $1M/year to enterprise engineering teams.
The AI coding market is stratifying. Broad distribution tools like Copilot will continue growing on the strength of integration and inertia. Agentic deep-capability tools like Claude Code will capture the high-value segment — the teams working on problems complex enough to justify real investment. That segment, it turns out, is large enough to support $2.5 billion in annual revenue and growing.
Cursor, Copilot, and Windsurf are not going away. But the thesis that "the best IDE integration wins" is looking harder to defend every quarter.
---
**Sources:**
- [Claude Code $1B Run-Rate Revenue Milestone — OrbilonTech](https://orbilontech.com/claude-code-1b-revenue-ai-coding-revolution-2026/)
- [Inside the Claude Code GTM Strategy: How Anthropic Reached $2.5B ARR — Stormy AI](https://stormy.ai/blog/claude-code-gtm-strategy-anthropic-revenue-2026/)
- [Anthropic Revenue Just Passed OpenAI: The Growth Rate Is the Real Story — Remio](https://www.remio.ai/post/anthropic-revenue-just-passed-openai-the-growth-rate-is-the-real-story)
- [Anthropic Just Passed OpenAI in Revenue While Spending 4x Less to Train Their Models — SaaStr](https://www.saastr.com/anthropic-just-passed-openai-in-revenue-while-spending-4x-less-to-train-their-models/)
- [Which AI Coding Tools Do Developers Actually Use at Work? — JetBrains Research Blog](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/)
- [Anthropic Raises $30B Series G — Anthropic](https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation)
---
# Three Bugs, Six Weeks, One Lesson: Anthropic's Claude Code Postmortem
URL: https://sdd.sh/2026/05/three-bugs-six-weeks-one-lesson-anthropics-claude-code-postmortem/
Date: 2026-05-01
Updated: 2026-05-01
Tags: Anthropic, Claude Code, postmortem, engineering, trust, reliability, reasoning effort
Categories: AI Tools, Industry
Summary: On April 23, Anthropic published an engineering postmortem admitting three overlapping changes caused weeks of Claude Code quality degradation. All three were caught by user complaints, not internal evals. The story matters less for what it says about three bugs than for what it reveals about the risks of depending on black-box AI infrastructure.
On April 23, Anthropic published an [engineering postmortem](https://www.anthropic.com/engineering/april-23-postmortem) that the company probably did not want to write. It traced weeks of widely reported quality decline in Claude Code to three separate engineering changes that overlapped in time, affected different parts of the system, and compounded each other in ways that made the root cause unusually difficult to pin down. The post acknowledged the errors clearly, confirmed that all three had been resolved as of April 20 (v2.1.116), and announced that Anthropic was resetting usage limits for all subscribers.
It was, by software industry standards, a reasonably honest postmortem. It was also the third trust-related incident the company had to address in roughly six weeks — following the [silent reasoning effort downgrade](../anthropic-effort-default-trust-crisis/) in March and the [Pro plan removal test](../anthropic-claude-code-pro-plan-removal-developer-trust/) in April. At some point, the pattern matters as much as the individual incidents.
## What Actually Happened
The postmortem identified three distinct changes, each introduced for legitimate reasons, each creating problems that weren't immediately visible.
**Bug 1: Reasoning effort downgrade (March 4)**
On March 4, Anthropic changed Claude Code's default reasoning effort from `high` to `medium`. The stated goal was latency reduction — models spend less time generating internal reasoning steps before producing output. The change was not announced in the changelog. Users had no way to know it had happened.
[Independent analysis by AMD researchers](../anthropic-effort-default-trust-crisis/) eventually quantified the impact across 6,852 sessions: a 73% reduction in average thinking depth, with no user-visible signal that anything had changed. Anthropic reversed the change on April 7. The post acknowledged it directly: "This was the wrong tradeoff." As of the postmortem, all Opus 4.7 users default to `xhigh` effort; all other models default to `high`.
**Bug 2: Caching regression (March 26)**
On March 26, Anthropic shipped a change intended to reduce latency by clearing cached thinking state from sessions that had been idle for more than an hour. The intent was reasonable — stale context from hours ago is usually not useful, and keeping it cached costs resources.
A bug in the implementation caused the clearing to happen not once after idle, but on *every subsequent turn* for the rest of the session. The practical effect was that Claude appeared forgetful and repetitive within a single working session — re-asking for context it had already been given, losing track of decisions made earlier in the conversation, seemingly unable to maintain a coherent thread over a long task. This was fixed on April 10.
This is the least-covered of the three issues, which is somewhat surprising given that it directly affected the continuous multi-step workflows that Claude Code is specifically designed for. A caching bug that made the model appear to lose memory mid-session is a significant regression for agentic use cases.
**Bug 3: Verbosity reduction (April 16)**
On April 16, Anthropic added a system prompt instruction to reduce response verbosity. The motivation, again, was legitimate: Claude Code's responses had grown long and users were reporting that the model was padding output with unnecessary explanation. The instruction was meant to tighten that up.
In combination with other prompt changes that were already in place, the verbosity reduction degraded coding quality. The Register summarized the community's reaction with characteristic economy: "[Anthropic admits it dumbed down Claude with 'upgrades'](https://www.theregister.com/2026/04/23/anthropic_says_it_has_fixed/)." The instruction was reverted on April 20, the same day as the fix.
## The Part That Isn't in the Postmortem
Anthropic's post is clear about what happened. What's worth examining is how it happened — specifically, what the postmortem reveals about the limits of internal quality assurance for a hosted AI product.
All three issues were caught by user complaints, not internal evaluations. The [VentureBeat coverage](https://venturebeat.com/technology/mystery-solved-anthropic-reveals-changes-to-claudes-harnesses-and-operating-instructions-likely-caused-degradation/) framed this as "mystery solved," but the mystery is only half the story. The more important half is that Anthropic's own evals missed all three regressions before they shipped to production. A postmortem writeup on [Machine Learning at Scale](https://machinelearningatscale.substack.com/p/anthropic-shipped-three-regressions) put it plainly: "Anthropic shipped three regressions in a month and their evals didn't catch one of them."
The engineering post doesn't address this directly. It doesn't describe what the evals looked like, why they failed to detect the regressions, or what changes to the evaluation process are being made. That's a legitimate gap in an otherwise honest postmortem.
The underlying problem is structural. When you depend on a hosted AI product — where the harness, system prompt, caching behavior, and reasoning configuration are all managed by the vendor — you're accepting that engineering decisions made without your knowledge will affect your workflows. The three bugs in this postmortem weren't model failures; they were *harness* failures. The model didn't change. What changed was the infrastructure around it, invisible to the user, with no mechanism for users to detect or opt out of the changes.
## What Power Users Can Do
The postmortem contains practical information that's worth extracting.
**Reasoning effort is now configurable.** As of v2.1.116, the defaults are `xhigh` for Opus 4.7 and `high` for other models. You can set this explicitly via `--config reasoning_effort:xhigh` in your CLI invocation, or in `~/.claude/settings.json`. The [April 2026 power-user features update](../claude-code-april-2026-ultrareview-auto-mode-power-user-features/) documented the `xhigh` effort level, which was added precisely because users needed explicit control over this parameter.
**CLAUDE.md invariants provide a quality floor.** You can't pin to a specific Claude Code version for hosted behavior, but you can use CLAUDE.md to enforce quality expectations at the project level: reasoning depth requirements, output format constraints, verification steps before marking tasks complete. These don't prevent harness regressions, but they give the model explicit quality instructions that survive system prompt changes.
**Watch the changelog.** None of the three changes in this postmortem appeared in the changelog when they shipped. The effort downgrade on March 4 wasn't disclosed until the AMD study surfaced it. The caching bug shipped and was fixed with minimal public documentation. This is a gap Anthropic should address — material changes to reasoning configuration, caching behavior, or system prompt instructions should appear in release notes.
## A Pattern Worth Watching
Anthropic's public record across the first four months of 2026 contains several incidents worth noting in sequence:
- March 4: Silent effort downgrade shipped without changelog disclosure
- March 26: Caching regression shipped, creates forgetfulness in active sessions
- April 7: Effort default reverted after AMD study goes public
- April 10: Caching bug fixed
- April 16: Verbosity reduction prompt ships
- April 20: Verbosity instruction reverted
- April 22: Pro plan A/B test surfaces (Claude Code access removed for ~2% of new users)
- April 23: Postmortem published, usage limits reset
Each incident has an explanation. The effort downgrade was a latency tradeoff. The caching change was a resource optimization. The verbosity instruction addressed genuine user feedback. The Pro plan test was a pricing A/B experiment.
The pattern that emerges is of a company making product decisions without adequate disclosure, catching problems only after significant user backlash, and then addressing them reactively. That's a trust problem that individual postmortems don't fully resolve.
The usage limit reset was a meaningful gesture. Publishing the postmortem was the right call. What would actually change the pattern is a disclosure policy: material changes to reasoning configuration, caching behavior, or system prompt instructions should ship with changelog entries that users can read before the change takes effect.
Claude Code is the best agentic coding tool available. The [SWE-bench Pro numbers](../claude-opus-4-7-agentic-coding-benchmark-release/), the [Agent Teams architecture](../claude-code-agent-teams-multi-agent-orchestration/), the terminal-native model — none of that is undermined by this postmortem. What the postmortem does surface is a gap between the product's technical capabilities and the operational trust that enterprise teams need before treating it as critical infrastructure.
That gap is closeable. The first step is a changelog policy that matches the product's ambitions.
---
**Sources:**
- [Anthropic Engineering: April 23 postmortem](https://www.anthropic.com/engineering/april-23-postmortem)
- [Fortune: Anthropic explains Claude Code performance decline](https://fortune.com/2026/04/24/anthropic-engineering-missteps-claude-code-performance-decline-user-backlash/)
- [The Register: Anthropic admits it dumbed down Claude with upgrades](https://www.theregister.com/2026/04/23/anthropic_says_it_has_fixed/)
- [VentureBeat: Mystery solved — harness and instruction changes caused degradation](https://venturebeat.com/technology/mystery-solved-anthropic-reveals-changes-to-claudes-harnesses-and-operating-instructions-likely-caused-degradation/)
- [Stack Futures: Three overlapping changes, six weeks of quality degradation](https://stackfutures.com/blog/anthropic-claude-code-postmortem-three-bugs-six-weeks-april-2026/)
- [Machine Learning at Scale: Anthropic shipped three regressions their evals didn't catch](https://machinelearningatscale.substack.com/p/anthropic-shipped-three-regressions)
- [OpenTools: Anthropic admits three engineering errors](https://opentools.ai/news/anthropic-admits-three-engineering-errors-behind-claude-code-decline)
---
# Claude Security: Anthropic Enters the Defensive Security Market
URL: https://sdd.sh/2026/05/claude-security-anthropic-enters-the-defensive-security-market/
Date: 2026-05-01
Updated: 2026-05-01
Tags: Claude Security, Anthropic, vulnerability scanning, enterprise security, Claude Opus 4.7, CrowdStrike, Wiz
Categories: AI Tools, Industry
Summary: Anthropic's Claude Security went to public beta on April 30, bringing reasoning-based vulnerability detection to enterprise codebases. With CrowdStrike, Wiz, SentinelOne, and Palo Alto as launch partners, this is Anthropic's first step beyond the developer tools market — and its timing couldn't be better.
Anthropic has been building coding tools since Claude Code launched in public beta in early 2026. On April 30, the company moved into a different market entirely: enterprise security. Claude Security graduated from a closed research preview to a [public beta available to all Claude Enterprise customers](https://claude.com/product/claude-security), powered by Opus 4.7 and backed by a set of launch partners that reads like a who's who of corporate security — CrowdStrike, Palo Alto Networks, SentinelOne, Wiz, and Trend Micro's TrendAI.
This is not a minor feature drop. It is Anthropic's first dedicated product for security teams, and it signals where the company thinks the AI-powered security market is heading.
## The Problem Claude Security Is Solving
To understand why this matters, it helps to look at where enterprise security tooling has historically fallen short.
Traditional static analysis tools — SonarQube, Semgrep, CodeQL — work by matching code patterns against a database of known vulnerability signatures. They're fast, deterministic, and useful for catching the canonical bugs. What they can't do is reason about *intent*, trace multi-file data flows, or detect a business logic vulnerability that doesn't match any known pattern.
That gap has been widening. As our [April 26 analysis of AI-generated code security](../ai-generated-code-security-crisis-92-percent-vulnerabilities/) showed, 92% of codebases now contain critical vulnerabilities, and 62% of security teams say they're overwhelmed. AI tooling is generating code faster than human reviewers can audit it, and traditional scanners are catching the easy bugs while complex, context-dependent vulnerabilities slip through.
Claude Security's pitch is that it approaches vulnerability detection the way a senior security researcher would: not by looking for known patterns, but by reasoning over the codebase as a whole. According to [Anthropic's announcement](https://www.anthropic.com/news/claude-code-security), the system traces data flows, reads source code in context, and examines interactions between components across files — synthesizing what it finds before flagging anything. Everything gets a confidence rating before it reaches an analyst.
The practical effect: Anthropic reports that hundreds of organizations in the research preview found vulnerabilities that had evaded their existing tools for years.
## What's New in the Public Beta
Claude Security first appeared in February 2026 as "Claude Code Security," accessible only to a closed set of enterprise customers. The jump to public beta brings three meaningful additions:
**Scheduled scans.** Rather than running manually on demand, teams can configure Claude Security to scan on a recurring schedule — nightly, weekly, or tied to CI events. For a security team trying to maintain ongoing coverage across a growing codebase, this is the feature that makes it operationally viable rather than just a one-off audit tool.
**Documented dismissals.** Analysts can now dismiss findings with written reasons that persist as notes for future reviewers. This closes a gap that frustrated early users of the research preview: you'd dismiss a false positive, and the next scan would surface it again with no record of why it had been reviewed. The documented dismissal creates a light audit trail without requiring a full JIRA workflow.
**CSV and Markdown export.** Findings can now be exported in formats that import cleanly into existing security management systems and audit documentation. Small feature, large operational significance for enterprise compliance workflows.
## The Partner Ecosystem
The more strategically interesting part of the announcement is the integration roster.
CrowdStrike is integrating Opus 4.7 across the Falcon platform as part of what the company is calling [Project QuiltWorks](https://www.crowdstrike.com/en-us/press-releases/crowdstrike-puts-claude-opus-4-7-to-work-across-falcon-platform-project-quiltworks/), a broader push to bring AI-powered vulnerability discovery and remediation to CrowdStrike's customer base. Palo Alto Networks, SentinelOne, Wiz, and Trend Micro's TrendAI are following the same model: embedding Opus 4.7 as a reasoning engine inside their existing security platforms.
This is a distribution strategy. Anthropic gets access to the security platforms that enterprises already trust and have already deployed. The security vendors get a reasoning-capable AI model that can do things their own models aren't trained for. For the enterprise customer, the model surfaces in a familiar interface rather than requiring a new procurement decision.
The framing is also deliberate. Anthropic has been careful to position this as a *defensive* product. Claude Opus 4.7 reached [64.3% on SWE-bench Pro](../claude-opus-4-7-agentic-coding-benchmark-release/) and demonstrated significant capability in vulnerability research. The flip side of that capability is that Claude-class models can also find exploitable vulnerabilities at scale — a risk the company acknowledged in the context of [Claude Mythos](../claude-mythos-leaked-model-step-change-cybersecurity/) and Project Glasswing. Channeling that capability into a product that patches vulnerabilities rather than exploiting them is both good optics and, presumably, genuinely useful work.
## What This Means for Enterprise Security Teams
The practical question for a security team evaluating Claude Security is where it fits in the existing stack.
It's not a replacement for pattern-matching scanners. Those tools are fast, cheap, and catch the high-volume low-complexity bugs at scale. Claude Security is better understood as a second-pass tool: run your existing scanner first to catch the known patterns, then run Claude Security to look for the context-dependent issues that require reasoning — the business logic flaws (72% prevalence in AI-generated codebases), the multi-file data flows, the insecure defaults that don't look wrong in isolation.
The confidence rating feature is critical here. A reasoning model surfacing findings without quality signals would create more analyst burden, not less. The confidence rating, combined with documented dismissals, is what makes the workflow usable: analysts triage high-confidence findings, review medium-confidence ones selectively, and dismiss false positives with a paper trail.
The scheduling capability changes the deployment model from "audit tool" to "continuous coverage." That's the operational shift that makes this worth taking seriously as enterprise infrastructure rather than a demo.
## The Bigger Picture
Claude Security represents Anthropic's first product that isn't primarily aimed at software developers. Claude Code is a developer tool. Claude Cowork is a collaboration tool for developer organizations. Claude Security is pitched at security teams — a different buyer persona, a different procurement process, a different set of integration requirements.
That's a significant expansion of Anthropic's addressable market, and it comes at a moment when the AI-generated code problem has turned security from a secondary concern into a first-order risk for enterprise engineering organizations. The timing is deliberate. The capability is real.
Whether Claude Security becomes a category leader or a feature that gets absorbed into the major security platforms over the next 18 months remains to be seen. The CrowdStrike and Wiz integrations suggest Anthropic is comfortable with the second outcome — the platform play gets Opus 4.7 in front of more enterprises, regardless of which product it surfaces through.
For teams already on Claude Enterprise and dealing with AI-generated code at scale, the public beta is worth running. If the research preview's track record holds — finding vulnerabilities that evaded existing tools for years — the incremental cost is low and the potential value is significant.
---
**Sources:**
- [Anthropic: Claude Security announcement](https://www.anthropic.com/news/claude-code-security)
- [SiliconAngle: Claude Security public beta](https://siliconangle.com/2026/04/30/anthropic-announces-claude-security-public-beta-find-fix-software-vulnerabilities/)
- [CrowdStrike: Project QuiltWorks announcement](https://www.crowdstrike.com/en-us/press-releases/crowdstrike-puts-claude-opus-4-7-to-work-across-falcon-platform-project-quiltworks/)
- [SecurityWeek: Anthropic Unveils Claude Security](https://www.securityweek.com/anthropic-unveils-claude-security-to-counter-ai-powered-exploit-surge/)
- [The New Stack: Claude Security beta](https://thenewstack.io/anthropics-claude-security-beta/)
- [CRN: Claude Security 5 things to know](https://www.crn.com.au/news-network/security/2026/anthropic-launches-claude-security-5-things-to-know)
- [Business Standard: Claude Security enterprise beta](https://www.business-standard.com/technology/tech-news/anthropic-announces-claude-security-beta-for-enterprise-customers-126050100019_1.html)
---
# OpenAI Lands on Amazon Bedrock — The Cloud That Already Houses Claude
URL: https://sdd.sh/2026/04/openai-lands-on-amazon-bedrock-the-cloud-that-already-houses-claude/
Date: 2026-04-30
Updated: 2026-04-30
Tags: OpenAI, AWS, Amazon Bedrock, Claude, Anthropic, cloud, enterprise, Codex
Categories: AI Tools, Industry
Summary: After Microsoft's exclusivity expired on April 27, OpenAI moved its models, Codex agent, and a new jointly built Bedrock Managed Agents runtime onto AWS. Amazon now hosts both Anthropic and OpenAI. Here's what the infrastructure power shift means for the AI coding landscape.
The tech business story of the week sounds almost absurd on its face: the same cloud provider that just committed $25 billion to Anthropic is now also the home of OpenAI's flagship models and coding agent. Welcome to the post-exclusivity era of AI infrastructure.
On April 28, 2026 — one day after OpenAI's exclusivity agreement with Microsoft expired — Amazon Web Services announced that GPT-5.5, GPT-5.4, Codex, and a new jointly built agent runtime called **Amazon Bedrock Managed Agents powered by OpenAI** were all landing on Bedrock in limited preview.
If you thought the cloud wars were confusing before, you ain't seen nothing yet.
## How We Got Here
For the better part of three years, Microsoft had a stranglehold on OpenAI. Azure was the only hyperscaler allowed to host GPT models commercially — a condition baked into the multi-billion-dollar investment agreement that kept OpenAI solvent through its turbulent 2023-2024 period. Enterprise customers who wanted GPT had to go through Azure. Full stop.
That arrangement expired on April 27, 2026. Microsoft and OpenAI amended the agreement: Azure retains first-mover rights (OpenAI ships there first, unless Microsoft can't support the required capability), but the license is now non-exclusive. The old revenue-share arrangement where Microsoft paid OpenAI has ended, though OpenAI still routes payments to Microsoft through 2030.
The day after exclusivity lapsed, AWS had three OpenAI products live on Bedrock. That turnaround was not an accident — this was planned and ready to ship the moment the legal window opened. AWS CEO Matt Garman confirmed the partnership had been in negotiation for months.
## What's Actually on Bedrock
Three distinct offerings shipped together:
**OpenAI Models on Bedrock** — GPT-5.5 and GPT-5.4 are now available through the standard Bedrock model access interface, alongside Claude Opus 4.7, Gemini, Llama, and the growing zoo of frontier models AWS already hosts. Usage counts toward existing AWS cloud commitments, which matters enormously for enterprises with pre-committed cloud spend.
**Codex on Amazon Bedrock** — OpenAI's coding agent is accessible via the Codex CLI, desktop app, and VS Code extension, all routing through Bedrock infrastructure. This is straightforward: enterprises that standardized on AWS can now run Codex without setting up separate OpenAI credentials or going through Azure.
**Amazon Bedrock Managed Agents powered by OpenAI** — This is the interesting one. Rather than simply making OpenAI's existing agent harness available, AWS and OpenAI jointly built a new managed agent runtime that runs on AWS infrastructure but uses OpenAI's frontier models and agent harness. It's designed to be AWS-native: memory, orchestration, and agent lifecycle are managed through familiar Bedrock controls and IAM policies. Critically, this product is an AWS exclusive by design — you can't get the jointly-engineered runtime on Azure.
## The Amazon Paradox
The strategic picture here is genuinely novel. Amazon has committed more capital to Anthropic than any other single investor — $25 billion with an additional $100 billion AWS infrastructure commitment over ten years, announced just six days ago. Anthropic's models, including Claude Opus 4.7, are deeply integrated into AWS: there's a native Claude console in the AWS dashboard, Claude Code runs on Bedrock with Mantle's zero-operator-access security model, and Amazon is building Trainium3 chips partly to serve Claude inference.
And yet, as of April 28, AWS is also the home of OpenAI's most powerful models and a jointly built OpenAI agent runtime.
This is not a contradiction from Amazon's perspective. AWS has always been a neutral marketplace for compute and services — it hosts competitors' products constantly. The cloud business model rewards volume and commitment, not exclusivity. If enterprises want to run GPT-5.5 workloads on AWS infrastructure rather than Azure, Amazon captures that spend. If those same enterprises also run Claude workloads on Bedrock, Amazon captures that too.
But it does create a peculiar situation: **Amazon Bedrock Managed Agents** now ships in two flavors — one powered by Claude, one powered by OpenAI — with the same underlying AWS infrastructure, IAM policies, and enterprise tooling. The sales pitch becomes "bring your own frontier model, we'll handle the agent runtime."
## What It Means for Claude Code and Bedrock Deployments
For teams running Claude Code on Bedrock (the v2.1.94 GA release covered in April), the immediate answer is: nothing changes. Claude Code's terminal-native architecture doesn't care what other models are available on Bedrock; it routes to Claude Opus 4.7 and inherits all the enterprise controls — Mantle's zero-operator-access, Bedrock's compliance certifications, IAM policies — that were already in place.
The longer-term question is organizational politics. Security teams and procurement departments that previously ran Claude Code and wanted a second coding agent option would have had to go to Azure for GPT or set up a separate OpenAI account. Now everything is under one cloud contract. For mixed-model shops — which represent a growing majority of large enterprises — this is unambiguously convenient.
There's also a benchmark reality check worth keeping in mind. Claude Opus 4.7 scores 64.3% on SWE-bench Pro and leads the field on multi-agent autonomous tasks. GPT-5.5 hits 58.6% on the same benchmark. For organizations that have been running Cursor or Codex workflows and want to evaluate Claude Code alongside them, having both available through a single Bedrock contract makes that evaluation substantially easier.
## The Azure Angle
Microsoft isn't catastrophically damaged here. Azure retains first-mover rights on new OpenAI models, and the GitHub Copilot stack — deeply integrated with Azure DevOps, GitHub Actions, and enterprise SSO — still runs on Azure. The vast majority of Fortune 500 companies that use GitHub Enterprise aren't going to abandon that stack because GPT-5.5 is now also available on Bedrock.
But Azure loses its moat. The single biggest competitive advantage Azure had in the AI era — you want OpenAI, you come to us — is gone. Every hyperscaler can now offer comparable model menus. AWS already hosts Claude (better coding benchmarks), Gemini 3.1 Pro (Google's multimodal flagship), Llama (Meta's open-weight series), and now GPT. Google Cloud has its own native Gemini stack plus Claude on Vertex. Azure still leads on GitHub integration but is no longer the exclusive GPT gateway.
The implication for the broader market: frontier model access is rapidly commoditizing at the infrastructure layer. The differentiation is moving up the stack — toward agent runtimes, orchestration frameworks, developer tooling, and the quality of the harness the agent runs inside.
## The Harness Is the Product
Here's the insight that gets buried in the partnership announcement hoopla: **the model matters less than the harness it runs in.**
Amazon Bedrock Managed Agents powered by OpenAI is notable not because it offers GPT-5.5, but because AWS and OpenAI jointly engineered the agent runtime — the scaffolding that handles context, memory, tool execution, and task orchestration. That's a product bet on agent infrastructure, not on a specific model.
Claude Code made the same bet two years ago, just from the other direction: Anthropic built the harness first (terminal-native, CLAUDE.md invariants, Agent Teams, Routines, MCP integration), and the model powers it. The model and harness are vertically integrated by design, which is why Claude Code's agentic workflows tend to outperform Codex or Copilot setups even when raw benchmark numbers are closer than people expect.
The Bedrock Managed Agents dual-flavor approach — Claude edition and OpenAI edition — is intriguing precisely because it treats the infrastructure layer (AWS) as neutral and the agent runtime as the product. Whether that separation proves durable or whether integrated harnesses (Claude Code, OpenAI Agents SDK) ultimately win is the central question of the next 18 months of agentic AI infrastructure.
## Bottom Line
OpenAI's arrival on Bedrock is a big deal for the business of AI, and a significant setback for Azure's moat. For developers and engineering teams evaluating AI coding tools, it's mostly good news: everything is now available under one cloud contract, procurement gets simpler, and vendor comparisons get easier.
For the deeper question of which coding agent is actually better at autonomous software development, nothing changed on April 28. The benchmarks haven't shifted. Claude Code's 64.3% SWE-bench Pro advantage over Codex's 58.6% is the same today as it was yesterday. The difference is that enterprise buyers can now reach both without a multi-cloud contract negotiation.
That's a win for flexibility. Whether flexibility beats depth of integration is a question every engineering team will have to answer for itself.
---
**Sources:**
- [Amazon Bedrock now offers OpenAI models, Codex, and Managed Agents — AWS](https://aws.amazon.com/about-aws/whats-new/2026/04/bedrock-openai-models-codex-managed-agents/)
- [OpenAI models, Codex, and Managed Agents come to AWS — OpenAI](https://openai.com/index/openai-on-aws/)
- [OpenAI brings models to AWS after ending exclusivity with Microsoft — CNBC](https://www.cnbc.com/2026/04/28/openai-brings-models-to-aws-after-ending-exclusivity-with-microsoft.html)
- [OpenAI's models land on Amazon Bedrock, one day after Microsoft exclusivity ends — GeekWire](https://www.geekwire.com/2026/openais-models-land-on-amazon-bedrock-one-day-after-microsoft-exclusivity-ends/)
- [OpenAI ends Microsoft legal peril over its $50B Amazon deal — TechCrunch](https://techcrunch.com/2026/04/27/openai-ends-microsoft-legal-peril-over-its-50b-amazon-deal/)
- [An Interview with OpenAI CEO Sam Altman and AWS CEO Matt Garman — Stratechery](https://stratechery.com/2026/an-interview-with-openai-ceo-sam-altman-and-aws-ceo-matt-garman-about-bedrock-managed-agents/)
---
# Cursor SDK: The IDE Escapes the IDE — But Does It Break the Ceiling?
URL: https://sdd.sh/2026/04/cursor-sdk-the-ide-escapes-the-ide-but-does-it-break-the-ceiling/
Date: 2026-04-30
Updated: 2026-04-30
Tags: Cursor, SDK, programmatic agents, CI/CD, Claude Code, agentic workflows, TypeScript
Categories: AI Tools, Agentic Workflows
Summary: Cursor launched a TypeScript SDK in public beta on April 29 that lets developers invoke Cursor agents programmatically from CI/CD pipelines, backend services, or other products — with sandboxed cloud VMs, subagents, and durable agent lifecycle. It's Cursor's most significant architectural shift since Composer. The question is whether it actually solves the autonomy problem, or just relocates it.
For three years, Cursor's core identity was the IDE. You opened a window, you typed, an AI model helped you complete it. The agent lived inside the editor, and the editor lived on your machine. Everything — context, execution, approval — flowed through a GUI.
On April 29, 2026, Cursor took a significant step away from that identity.
The company launched the **Cursor SDK** in public beta: a TypeScript library that gives engineers programmatic access to the same runtime, harness, and models that power Cursor's desktop app, CLI, and web interface. One `npm install @cursor/sdk` and you can invoke a Cursor agent from a CI/CD pipeline, a backend service, or another product entirely — no desktop app open, no human watching a chat window.
This is Cursor's most architecturally significant move since Composer. It's worth understanding both what it enables and what it still can't change.
## What the SDK Actually Does
At its core, the Cursor SDK is an API wrapper around Cursor's existing agent runtime. When you call it, you get:
**Sandboxed cloud VMs** — Each agent invocation in Cursor's cloud spins up a dedicated VM with a clone of the target repository and a pre-configured development environment. The agent keeps running even if your laptop goes offline. You can reconnect later and stream the conversation from wherever you left off.
**Subagents** — The main agent can delegate subtasks to named subagents via the `Agent` tool, with independent prompts and model selection per subagent. Multi-agent coordination is available without writing orchestration code. This mirrors a pattern that Claude Code has supported natively since Agent Teams shipped in March — the terminology and implementation differ, but the concept is the same.
**Inherited Cursor infrastructure** — SDK agents get the full Cursor stack: semantic codebase indexing, instant grep, MCP server support, Skills, and Hooks. If your team has invested in building Cursor Skills or MCP integrations, those work inside SDK agents without modification.
**Durable agent lifecycle** — Agents have explicit lifecycle controls: archive, unarchive, permanent delete. Follow-ups, status checks, and event streaming are all scoped to individual runs rather than the agent as a whole. SSE-based streaming with `Last-Event-ID` reconnect support means you can safely interrupt and resume across network disruptions.
**Flexible execution targets** — Local machine for fast iteration, Cursor's cloud for persistent sandboxed VMs, or self-hosted workers for teams with network security requirements. The same SDK code works across all three.
The pricing is token-based consumption, same as Cursor's existing cloud agents. You pay for what the agent uses.
## The Use Cases Cursor Is Chasing
Cursor published a cookbook with four starter projects that reveal who they're building for:
- A minimal quickstart agent for pipeline integration
- A web-based prototyping tool that spins up sandboxed cloud environments
- An agent-powered kanban board that opens PRs based on task cards
- A lightweight coding agent CLI
The throughline: developers who want Cursor's context-intelligence and model quality without being inside the Cursor desktop app. CI/CD integrations top the announced use case list — agents that summarize changes when a PR opens, identify root causes for CI failures, and push fixes back to the branch, all triggered by the pipeline rather than a developer prompt.
This is Cursor responding to a real gap. Its desktop product is genuinely good at helping developers who are sitting at their machines. It's been much weaker at doing anything useful when developers aren't. The SDK is the answer to "what does Cursor do while I sleep?"
## How It Compares to Claude Code's Architecture
This is where editorial honesty requires saying something Cursor probably won't put in their changelog.
Claude Code has supported headless, programmatic invocation since the beginning. The `--print` flag, `--no-interactive` mode, and the full Claude Code Routines system (cloud-scheduled agents with cron, API, and GitHub event triggers, covered in April) were all designed around the assumption that agents run without humans in the loop. The Anthropic Managed Agents API goes further still, offering cloud-hosted agent instances with persistent memory, checkpointing, and session continuity as a first-class API service.
Claude Code was built agent-first. The interactive terminal session is one deployment mode of an agent runtime. The Cursor SDK inverts this: it's an IDE product being extended with programmatic access.
That distinction matters in practice. When you invoke a Claude Code agent headlessly — via Routines, via the API, via `--print` in a CI step — you're using the same execution model that the interactive session uses. There's no translation layer. Routines agents that run on Anthropic's cloud infrastructure don't need your machine at all; they're scheduled, triggered, and executed entirely server-side.
The Cursor SDK achieves something similar with sandboxed cloud VMs, but those VMs are hosting what is fundamentally a Cursor session without a UI. The context management, the Skills system, the MCP integrations — these were all designed to help a human inside an IDE. They're being repurposed for autonomous execution, which mostly works but carries the conceptual weight of their origin.
A concrete example: Cursor's semantic codebase indexing is excellent for helping a developer navigate unfamiliar code. In an autonomous agent context, it may behave differently than purpose-built agentic context management (like Claude Code's CLAUDE.md invariants, which explicitly instruct the agent on project-level constraints before any task begins). Whether that difference matters in practice depends heavily on the task. For "summarize what changed in this PR," probably irrelevant. For "implement this feature across eight files while respecting the project's existing patterns," the difference between IDE-derived and agent-native context management could be significant.
## What This Changes for Cursor's Market Position
The more consequential question isn't how the Cursor SDK compares to Claude Code — it's what the SDK means for Cursor's competitive position against Anthropic's Managed Agents API and OpenAI's Agents SDK.
Both of those products are API-first by design, with enterprise pricing, SLA commitments, and compliance documentation already in place. They're being sold to the same enterprise buyers who are evaluating Cursor Business and Cursor Enterprise. The Cursor SDK puts Cursor into that conversation directly, which is where the real competitive action is.
For Cursor, the strategic calculation is clear: if enterprises are going to build programmatic AI coding agents, they'd rather those be Cursor agents (using Cursor's context intelligence, Cursor's model connections, Cursor's pricing) than Claude Code agents or Codex agents. The SDK is a land-grab for the platform layer.
Whether it lands depends on two things: the quality of the agent output when running unattended, and the enterprise trust story. Cursor's reputation has taken some knocks in 2026 — the CVE disclosures in April, the Kimi K2.5 transparency episode in March, the ongoing questions about what data is used for training. Those issues don't disappear with a TypeScript SDK.
## The Autonomy Ceiling, Relocated
Here's the structural issue that the Cursor SDK doesn't resolve: the autonomy ceiling isn't about where the agent runs. It's about how the agent reasons about what to do next.
Cursor's IDE-first architecture was built to present options to humans and let humans decide. The agent suggests, the developer approves. That's not a bug — it's the entire value proposition for teams that want AI-assisted development rather than AI-autonomous development. But it means the decision-making architecture wasn't designed for situations where there's no human to ask.
The Cursor SDK can run without a human in the loop physically present. But the agent's trained behavior — its priors about when to stop and surface a question, when to make a judgment call autonomously, when to give up — those come from the same training that produced the IDE experience. You can move the ceiling somewhere else. You can't simply remove it by packaging the IDE runtime as an SDK.
Claude Code's autonomy-first design shows up most clearly in edge cases: ambiguous instructions, conflicting constraints, partial context. An agent trained to operate autonomously makes different judgment calls in these situations than one trained to surface them to a human. Neither approach is universally better, but they're genuinely different, and the Cursor SDK doesn't change which one you're getting.
## The Real Achievement
Cursor SDK is still a significant step forward. For teams deeply invested in Cursor's ecosystem — who've built Skills, MCP integrations, and team workflows around the desktop product — programmatic access to that same stack is genuinely valuable. The sandboxed cloud VM persistence, subagent delegation, and durable lifecycle are solid engineering. The cookbook projects show practical CI/CD applications that enterprises can adapt quickly.
The achievement is platform extension. Cursor is no longer just an IDE you use; it's (beginning to be) an agent runtime you can embed. That's a meaningful expansion of the competitive surface.
The more honest framing than "Cursor breaks out of the IDE" is: Cursor is building a platform business on top of an IDE business. Those are compatible, but they require different things. Developers who want autonomous agents that run independently of developer attention should still look first at tools that were designed for that use case. The Cursor SDK is the IDE platform's answer to that demand — a credible answer, but one that carries its IDE origins into everything it does.
---
**Sources:**
- [Build programmatic agents with the Cursor SDK — cursor.com](https://cursor.com/blog/typescript-sdk)
- [Cursor Introduces a TypeScript SDK for Building Programmatic Coding Agents — MarkTechPost](https://www.marktechpost.com/2026/04/29/cursor-introduces-a-typescript-sdk-for-building-programmatic-coding-agents-with-sandboxed-cloud-vms-subagents-hooks-and-token-based-pricing/)
- [Cursor SDK & Cloud Agents API updates — Cursor Community Forum](https://forum.cursor.com/t/cursor-sdk-cloud-agents-api-updates/159284)
- [Cursor SDK in Public Beta — Cursor Community Forum](https://forum.cursor.com/t/cursor-sdk-in-public-beta/159285)
- [Plugins, Sandbox Access Controls, and Async Subagents — cursor.com changelog](https://cursor.com/changelog/2-5)
---
# The Flat-Rate Era Is Over: GitHub Copilot Moves to Token Billing on June 1
URL: https://sdd.sh/2026/04/the-flat-rate-era-is-over-github-copilot-moves-to-token-billing-on-june-1/
Date: 2026-04-28
Updated: 2026-04-28
Tags: GitHub Copilot, billing, AI tools, Cursor, Claude Code, pricing
Categories: AI Tools, Industry
Summary: GitHub Copilot transitions all plans to usage-based billing on June 1, 2026. Code review will double-bill against GitHub Actions minutes. The flat-rate subscription model for AI coding tools is officially dead — and developers are not happy about it.
The flat-rate era for AI coding tools is over. GitHub announced that all Copilot plans — Pro, Pro+, Business, and Enterprise — will transition to usage-based billing on **June 1, 2026**. The new system replaces premium request units (PRUs) with **GitHub AI Credits**, billed based on token consumption at published API rates.
This isn't just a Copilot story. It's the final domino falling: Claude Code dropped tiered plans months ago, Cursor introduced Max Mode credits, and now the last major holdout with a predictable monthly flat rate is joining the consumption economy. If you're building engineering workflows around fixed AI costs, the math just changed.
## What Actually Changes
On June 1, every Copilot interaction that touches an AI model starts consuming credits. Here's the split:
**Free (no credits consumed):**
- Code completions
- Next Edit suggestions
**Credit-consuming:**
- Copilot Chat interactions
- Agentic and multi-step coding sessions
- Copilot code review (which *also* now eats GitHub Actions minutes — more on that below)
- Copilot cloud agent tasks
- Copilot Spaces
The base plan prices are not changing. Copilot Pro stays $10/month, Pro+ stays $39/month, Business stays $19/user/month, and Enterprise stays $39/user/month. But each plan now includes a monthly AI Credit allotment equal to its price — so Pro gets $10 in credits, Enterprise gets $39.
That's a significantly smaller budget than the old request-based system provided for heavy users. Pro previously offered 300 premium requests per month; Pro+ offered 1,500. The shift from "N requests" to "N dollars of tokens" makes your budget visible, variable, and potentially exhaustible mid-month.
GitHub is launching a **preview bill experience in early May** so teams can see projected costs before the switch. Business and Enterprise plans get promotional bonus credits through August ($30 and $70 respectively) to cushion the transition.
## The Double-Billing Problem: Code Review and Actions Minutes
Buried in the changelog: starting June 1, Copilot code review will also consume **GitHub Actions minutes** on private repositories. Previously, code review only drew from Copilot PRU allowances. Now it has two meters running simultaneously — AI Credits for the model calls and Actions minutes for execution.
Public repositories are unaffected; Actions minutes remain free there. But for private repos, every review triggered by Copilot becomes a compound cost. GitHub's recommendation is to audit current Actions usage, verify spending budgets, and brief billing administrators before the transition. That's enterprise-speak for "this is going to surprise your finance team."
## Developer Reaction: "You Will Get Less, but Pay the Same Price"
The community discussion on GitHub has been blunt. The most-shared framing from developers: *you will get less, but pay the same price*. The concern isn't the existence of usage-based billing — most engineers understand why AI compute is expensive. The concern is the asymmetry: base subscription prices stay flat while the included value shrinks.
GitHub's decision also arrives alongside other signals that Microsoft is managing Copilot costs aggressively. New individual and student account signups have been suspended. Anthropic's Opus models were removed from the $10 Pro plan. The direction is clear: premium capabilities are being ring-fenced behind higher tiers or pay-per-use, and the entry-level experience is narrowing.
## Why This Matters Beyond Copilot
GitHub Copilot's billing model shift is significant not because it's surprising — it was inevitable — but because of what it signals about where the industry is heading.
**Agentic sessions are expensive.** Code completions are cheap; token-efficient auto-suggestions can run for fractions of a cent per completion. But multi-step agents — the kind that plan a feature, write the code, run the tests, fix the failures, and open a PR — generate thousands of tokens per session. No flat-rate subscription can absorb that at scale. GitHub is doing what Claude Code did months ago: being honest that agents have a different cost structure than copilots.
**The $10-a-month AI coding tool is a relic.** When Copilot launched at $10/month in 2022, it was primarily an autocomplete product. AI coding tools in 2026 are orchestrators, code reviewers, and autonomous agents. Pricing those workflows like a fixed-rate software subscription was always temporary. Usage-based billing is the correct model for variable, compute-heavy work.
**The competitive calculus is shifting.** For power users who regularly hit Copilot's limits, the real-money cost of agentic sessions may push them toward tools with cleaner per-token economics. Claude Code on the Max plan or Cursor Pro+ with credits may end up cheaper for teams running many agent sessions per day. For light users — developers who mostly want code completions and occasional chat — Copilot's free tier remains compelling.
**Code completions staying free is strategic.** By keeping the core autocomplete experience outside the credit system, GitHub protects its broadest user base. Copilot's 15 million+ developer reach was built on the inline suggestion experience. Making that credit-consuming would risk immediate churn to free-tier competitors. The credit system is aimed squarely at agentic use cases where the cost is real.
## What to Do Before June 1
If your team uses Copilot at Business or Enterprise scale:
1. **Pull your current Actions and Copilot usage data.** GitHub's billing settings show historical consumption. Use it to project what the token equivalent looks like.
2. **Configure spending budgets.** GitHub will allow admins to set caps on AI Credit consumption. Set them before June 1 or you may find engineers hitting walls mid-sprint.
3. **Audit which Copilot features you actually use.** If your team mostly uses code completions and light chat, the transition is painless. If you've been running Copilot Autopilot agents or frequent code reviews on private repos, the cost calculus is different.
4. **Evaluate alternatives with the correct comparison.** Compare what $19/user/month in tokens actually buys across Copilot, Claude Code, and Cursor Pro. Token rates and model quality differ. The best tool for your workflow may not be the same after June 1 as it was before.
GitHub's preview billing tool in early May is exactly the right move to make this transition transparent. Whether the actual numbers turn out to be fair — or confirm the "same price, less value" framing — will become clear very quickly once engineers can see projected spend against real usage.
---
**Sources:**
- [GitHub Copilot is moving to usage-based billing](https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/) — The GitHub Blog
- [GitHub Copilot code review will start consuming GitHub Actions minutes on June 1, 2026](https://github.blog/changelog/2026-04-27-github-copilot-code-review-will-start-consuming-github-actions-minutes-on-june-1-2026/) — GitHub Changelog
- [Exclusive: Microsoft Moving All GitHub Copilot Subscribers To Token-Based Billing In June](https://www.wheresyoured.at/exclusive-microsoft-moving-all-github-copilot-subscribers-to-token-based-billing-in-june/) — Where's Your Ed At
- [Preparing for your move to usage-based billing](https://docs.github.com/en/copilot/how-tos/manage-and-track-spending/prepare-for-your-move-to-usage-based-billing) — GitHub Docs
---
# DeepSeek V4: Near-Frontier Performance, Open Weights, and the First Major Model Built for Huawei Chips
URL: https://sdd.sh/2026/04/deepseek-v4-near-frontier-performance-open-weights-and-the-first-major-model-built-for-huawei-chips/
Date: 2026-04-28
Updated: 2026-04-28
Tags: DeepSeek, open-source AI, model releases, benchmarks, China AI, cost efficiency
Categories: AI Tools, Industry
Summary: DeepSeek V4 arrived April 24 with two variants: a 1.6T-parameter Pro and a 284B-parameter Flash, both MIT-licensed and priced far below Western closed models. The bigger story is what it runs on: Huawei Ascend chips, not Nvidia.
A year after DeepSeek R1 rattled every major AI lab's stock price, the Chinese AI startup is back with V4. Released on April 24, 2026, DeepSeek V4 is the company's new flagship series: a 1.6 trillion-parameter Pro model and a 284 billion-parameter Flash variant, both Mixture-of-Experts (MoE) architectures, both MIT-licensed, and both available via the DeepSeek API on day one.
The headline number is the price. At $1.74 per million input tokens for V4-Pro and $0.14/M for V4-Flash, DeepSeek is undercutting every closed-model competitor at a performance level that makes the comparison credible — not just a race-to-the-bottom cheap model. But the story that will matter more over the next few years isn't the pricing. It's the chips.
## The Models
**DeepSeek-V4-Pro** packs 1.6 trillion total parameters with 49 billion active at inference time — a hallmark of the MoE efficiency architecture DeepSeek has mastered. Context window: 1 million tokens. Available on Hugging Face at 865GB. MIT license means you can deploy and modify it freely.
**DeepSeek-V4-Flash** is 284 billion parameters total with 13 billion active, and 160GB on Hugging Face. Same 1M context, same license, priced at a fraction of Pro.
Performance: V4-Pro beats all rival open-weight models on math and coding benchmarks and trails only Google's Gemini 3.1 Pro — a closed model — on world knowledge tasks. MIT Technology Review puts the gap at roughly "3 to 6 months behind state-of-the-art frontier models." That's not parity with Claude Opus 4.7 or GPT-5.5 Spud, but it's remarkably close for an open-weight model at V4-Pro's price point.
## The Efficiency Architecture
The technical paper is worth reading for the architectural innovations. For 1 million-token contexts, V4-Pro requires only **27% of the single-token FLOPs** and **10% of the KV cache size** relative to DeepSeek-V3.2. V4-Flash is even more efficient at 10% of FLOPs and 7% of KV cache.
That efficiency gain matters beyond cost. Reduced KV cache means longer contexts don't degrade as sharply under memory pressure. The attention mechanism selectively compresses older context while preserving nearby token fidelity — a tradeoff that happens to align well with practical coding workloads, where recent context (the function you're editing, the error you just saw) matters more than distant context (the imports you wrote an hour ago).
## The Real Story: Huawei Ascend
Here's what most coverage is underselling: **V4 is the first major frontier-class model optimized for Huawei Ascend chips rather than Nvidia GPUs.**
During training, DeepSeek still used American hardware — that's not a secret. But inference — the part that actually runs when you call the API or self-host the model — runs on domestic Chinese hardware. MIT Technology Review calls this "China's first model optimized for domestic Chinese chips, such as Huawei's Ascend."
This matters for several reasons:
**Geopolitical.** U.S. export controls on Nvidia A100s and H100s have put significant pressure on Chinese AI labs' ability to scale training runs. DeepSeek's response has been to become brutally efficient: the MoE architecture, the KV cache compression, the per-token FLOP reduction — these aren't just cost optimizations, they're adaptations to a hardware-constrained environment. V4's Ascend inference path is the next step: demonstrating that the full stack, including deployment, can run on non-Nvidia silicon.
**Infrastructure.** If high-quality model inference can run efficiently on Huawei Ascend processors, it decouples Chinese AI deployment from American chip supply chains in a way that training-side restrictions cannot address. The implication for enterprise AI buyers in China (and potentially other markets where Ascend hardware is more accessible than Nvidia) is significant.
**For Western developers.** V4 runs on Nvidia hardware too — the Hugging Face weights are standard. But the existence of a capable Ascend inference path means the model's continued development and availability isn't contingent on sustained access to export-controlled chips. That's a resilience story for a model you might want to rely on.
## The Pricing Comparison That Actually Matters
Let's put the numbers on paper:
| Model | Input | Output | Open Weight? |
|-------|-------|--------|--------------|
| DeepSeek V4-Flash | $0.14/M | $0.28/M | Yes (MIT) |
| DeepSeek V4-Pro | $1.74/M | $3.48/M | Yes (MIT) |
| GPT-5.4 | $2.50/M | $15.00/M | No |
| Claude Opus 4.7 | $5.00/M | $25.00/M | No |
V4-Flash is the cheapest capable small model in this class. V4-Pro is the cheapest frontier-adjacent open-weight model by a significant margin. The output token asymmetry is telling: closed Western models charge 5-6x more for output than input; DeepSeek charges 2x. For agentic workloads that generate substantial output per query, this gap compounds fast.
## Is It Good for Coding?
The benchmark claim is that V4-Pro beats all open-weight rivals on coding. What does that mean in practice?
For self-hosted or API-integrated coding workflows — running code generation in CI pipelines, powering custom coding agents, building developer tooling — V4-Pro's cost-to-capability ratio is compelling. If you're spending $25/M output tokens on Opus 4.7 for automated code tasks and V4-Pro achieves 90%+ of the quality at $3.48/M output, the economics are hard to ignore.
For interactive coding with a product like Claude Code, the comparison is less direct. Claude Code's Opus 4.7 integration is tightly optimized for the agentic loop — CLAUDE.md invariants, tool call efficiency, the multi-agent architecture. Swapping the backbone model requires re-evaluating the whole stack, not just comparing raw benchmark numbers.
The more interesting use case is multi-agent orchestration. If you're running Agent Teams or parallel subagent swarms, the per-token economics of individual agent calls matter enormously. V4-Flash at $0.14/M input becomes genuinely attractive as a workhorse model for high-frequency, lower-stakes subagent tasks.
## What This Means for the AI Coding Landscape
DeepSeek V4 doesn't change the competitive picture at the top of the market today. Claude Opus 4.7's 64.3% SWE-bench Pro score, its tool error rate improvements, and its integration with Claude Code's agentic infrastructure put it in a different category for serious agentic development workflows.
But V4 does three things the closed-model incumbents can't:
1. **Provides a credible open-weight option near the frontier.** You can run V4-Pro yourself, modify it, and build on it without an API key or vendor dependency.
2. **Sets a new price floor for frontier-adjacent performance.** $1.74/M input for a model that competes with 6-month-old closed frontier models is going to pressure API pricing industry-wide.
3. **Demonstrates that non-Nvidia inference infrastructure works at this capability level.** That's a decade-long strategic implication for how AI infrastructure gets built and where.
A year ago, DeepSeek R1 made a point about training efficiency. V4 makes a different point: about deployment independence, open-weight availability, and what it costs to run near-frontier models in 2026.
---
**Sources:**
- [DeepSeek V4 Preview Release](https://api-docs.deepseek.com/news/news260424) — DeepSeek API Docs
- [Three reasons why DeepSeek's new model matters](https://www.technologyreview.com/2026/04/24/1136422/why-deepseeks-v4-matters/) — MIT Technology Review
- [China's DeepSeek releases preview of long-awaited V4 model](https://www.cnbc.com/2026/04/24/deepseek-v4-llm-preview-open-source-ai-competition-china.html) — CNBC
- [DeepSeek V4—almost on the frontier, a fraction of the price](https://simonwillison.net/2026/Apr/24/deepseek-v4/) — Simon Willison
- [DeepSeek-V4-Pro on Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) — Hugging Face
- [China's DeepSeek unveils latest models a year after upending global tech](https://www.aljazeera.com/economy/2026/4/24/chinas-deepseek-unveils-latest-model-a-year-after-upending-global-tech) — Al Jazeera
---
# Google's 75% Threshold: When AI Became the Primary Author of Production Code
URL: https://sdd.sh/2026/04/googles-75-threshold-when-ai-became-the-primary-author-of-production-code/
Date: 2026-04-27
Updated: 2026-04-27
Tags: Google, AI coding, agentic workflows, software engineering, Gemini, Claude Code
Categories: AI Tools, Industry, Agentic Workflows
Summary: Sundar Pichai revealed at Google Cloud Next 2026 that 75% of new code at Google is now AI-generated and reviewed by engineers. That number crossed a threshold most didn't expect this fast — and it reframes every assumption about what software teams look like in 2026.
At Google Cloud Next 2026, Sundar Pichai announced a number that quietly rewrites a decade of assumptions about software engineering: **75% of all new code at Google is now AI-generated and approved by engineers**.
Not assisted. Not suggested. Generated — then reviewed, then shipped.
That figure was 25% in October 2024. It hit 50% by fall 2025. Now, eighteen months after the first credible automation milestone, AI has become the primary *author* of production code at one of the most sophisticated engineering organizations on earth.
Engineers didn't disappear. But their job description just changed.
## The Number in Context
Three data points define the trajectory:
| Period | AI-generated code at Google |
|--------|----------------------------|
| October 2024 | ~25% |
| Fall 2025 | ~50% |
| April 2026 | 75% |
The velocity matters as much as the current figure. That's not incremental adoption — it's geometric. The first 25% took years of careful AI tooling integration. The next 25% took roughly twelve months. The last 25% took six months. If the curve continues, asking "what percentage is human-written?" becomes the more interesting question faster than anyone planned.
Pichai was precise about what the number means: this is code that AI generates and that engineers *approve*. It goes through review. It doesn't bypass human judgment. But the role of that human judgment has inverted: the engineer is now verifying and directing rather than producing and reviewing.
That's not a semantic distinction. It's an architectural one.
## Engineers as Directors
The productivity data Pichai shared is specific enough to be credible. A recent complex code migration — the kind of project that historically consumes entire quarters of senior engineering time — was completed **six times faster** than the equivalent project one year prior, with agents and engineers working in tandem.
The Gemini app on macOS was built using Google's internal agentic development platform, called Antigravity, to go from concept to working native app prototype **in a few days**.
Google has also formalized the transition in its management systems: internal AI adoption goals now factor into engineer performance reviews. That's the clearest possible signal that this isn't a lab experiment or an executive vanity metric — it's operational reality baked into how careers are evaluated.
The shift in role has a clean description: engineers are becoming **reviewers and orchestrators** rather than line-by-line authors. That's not a downgrade. An engineer who can review 10x more code per hour and direct 5 parallel agent workstreams produces more value than one who types faster. But it requires a genuinely different skill set — less syntax, more architecture; less typing, more judgment; less execution, more specification.
This is what Spec-Driven Development looked like in theory two years ago. It's what Google looks like in practice today.
## The Tooling Reality
Google uses its own Gemini models as the primary engine for this automation. The Antigravity platform — their internal agentic development stack — orchestrates agents across tasks that would previously require handoffs between teams.
But there's a detail buried in the announcements: **some Google DeepMind employees have been permitted to use Claude Code**. At a company that builds its own frontier models and operates at Google's scale, that's a notable carve-out. It signals that even inside the world's most capable AI-first engineering organization, there are contexts where Anthropic's tooling is the better choice for autonomous development workflows.
Claude Code's terminal-native architecture — where the agent operates in your environment, uses your tools, and integrates with your existing CLI workflows — is a different class of tool than an IDE-embedded assistant, even one powered by Gemini.
## What This Means for Software Teams That Aren't Google
The temptation is to read Google's numbers as exceptional — the output of a uniquely well-resourced organization with proprietary models and billions of dollars in infrastructure. That's partially true. But the trajectory matters more than the absolute figure.
Google was at 25% eighteen months ago. If your team is at 10% today, you're not behind where Google was in late 2024 — you're behind where Google was in 2023. The curve that took Google from 25% to 75% is the same curve your team is on. It's just shifted.
The practical implications:
**Code review changes first.** When 75% of the diff is AI-generated, the review process can't operate the same way it did for human-authored code. Reviewers need to evaluate correctness, architecture, and intent at a higher level of abstraction. Spotting a bug in a function you wrote yourself is different from evaluating whether an agent's 300-line solution to your spec is the right solution.
**Specification quality becomes the constraint.** If AI writes three-quarters of the code, the quality of what gets written is bounded by the quality of what got specified. Sloppy requirements produce sloppy-but-functional code — and automated tests pass it. The weakest link in the pipeline moves upstream.
**Hiring signals shift.** The engineers who will be most productive at 75% AI-generated codebases are the ones who can write precise technical specifications, evaluate AI output at speed, architect systems that agents can navigate without constant correction, and debug the failure modes of AI-generated code specifically. These are different skills than the ones that made a great engineer in 2022.
## The Productivity Ceiling Question
There's a critique that sits underneath the productivity numbers: if 75% of code is AI-generated, is the resulting software better or just faster? Faster is real value — shipping in days instead of months has compounding effects. But the security data from this same week offers a counterpoint.
Sherlock Forensics found critical vulnerabilities in 92% of AI-generated codebases. ProjectDiscovery's 2026 report shows 62% of security teams say keeping up with AI code velocity is harder than ever. The productivity gains are real; so is the quality debt they introduce if the review layer is too shallow.
Google's answer to this is to keep humans in the review loop with explicit performance accountability. That's the right answer. But it means the review skill is as load-bearing as the generation skill — and most teams are investing more in generation tooling than in review quality.
Agentic review loops, pre-commit security hooks, CLAUDE.md invariants that codify security and architecture invariants — these aren't overhead. At 75% AI-generated code, they're the actual engineering work.
## The Asymmetry That Matters
One thing the 75% number doesn't say: which 75%? Routine, well-specified, well-tested code that follows established patterns is much easier for AI to generate correctly than novel architectural decisions, security-sensitive logic, or code at system boundaries where implicit assumptions live.
If AI is writing the boilerplate and the straightforward implementations while engineers spend their time on the hard, novel, consequential parts — that's a genuine productivity multiplier, and arguably the ideal division of labor. If AI is writing the security-sensitive authentication code and engineers are approving it at speed because 75% of their review queue is AI-generated, that's a different situation.
Google's framing suggests they're aware of this distinction. Building Antigravity with domain-specific guardrails for their internal codebases implies the hard cases are still getting appropriate attention.
The 75% threshold isn't a finish line. It's the point where being thoughtful about which 75% becomes the most important engineering decision your organization makes.
---
**Sources:**
- [Sundar Pichai shares news from Google Cloud Next 2026](https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/cloud-next-2026-sundar-pichai/)
- [Google CEO Sundar Pichai says 75% of the company's code is AI-generated — Fast Company](https://www.fastcompany.com/91531519/google-ceo-says-75-of-the-companys-code-is-ai-generated)
- [75% of Google's New Code Is Now AI-Generated: Engineers Are Becoming Reviewers — AIxploria](https://www.aixploria.com/en/ai-radar/google-75-percent-code-ai-generated-cloud-next-2026/)
- [Google Now Generates 75% of Its Code With AI — Analytics Drift](https://analyticsdrift.com/google-ai-generated-code-75-percent/)
- [Google Says 75 Percent of Its New Code Is Now AI-Generated — AI2Work](https://ai2.work/blog/google-says-75-percent-of-its-new-code-is-now-ai-generated)
---
# Google Cloud Next 2026: A2A Goes Production, Jules Graduates — But the Autonomy Gap Remains
URL: https://sdd.sh/2026/04/google-cloud-next-2026-a2a-goes-production-jules-graduates-but-the-autonomy-gap-remains/
Date: 2026-04-26
Updated: 2026-04-26
Tags: Google, A2A, Jules, Gemini, MCP, agentic, multi-agent, enterprise
Categories: AI Tools, Industry
Summary: Google's Cloud Next 2026 delivered genuine infrastructure progress: A2A protocol in production at 150 organizations, Jules out of beta, Gemini Enterprise Agent Platform replacing Vertex AI. But integration breadth still isn't the same as autonomy depth.
Google Cloud Next 2026 was the company's loudest statement yet that the enterprise AI race is no longer about models — it's about orchestration, interoperability, and scale. The event delivered a stack of meaningful announcements: the Agent2Agent (A2A) protocol is now running in production at 150 organizations, Jules is out of beta and broadly available, and Vertex AI has been renamed and restructured into the Gemini Enterprise Agent Platform. For enterprise buyers, it was impressive. For developers who care about autonomous agents that can actually ship code unattended, it's a more complicated picture.
## A2A Hits Production at Scale
The headline number from Cloud Next is A2A: Google's agent interoperability protocol, announced in early 2026, has moved from experiment to infrastructure. 150 organizations are running it in production — not pilots — routing real tasks between agents built on different platforms. Microsoft, AWS, Salesforce, SAP, and ServiceNow are all live.
Version 1.2 of the spec is now out, with signed agent cards using cryptographic signatures for domain verification. Governance has moved to the Linux Foundation's Agentic AI Foundation, putting it on a trajectory parallel to MCP's own standardization path. Native A2A support is now built into LangGraph, CrewAI, LlamaIndex Agents, Semantic Kernel, and AutoGen.
The framing Google is pushing — and it's actually accurate — is that A2A and MCP are complementary, not competing. MCP handles how an agent connects to tools and data sources. A2A handles how agents communicate with each other across organizational and platform boundaries. A Salesforce agent can hand off a task to a Google agent on Gemini Enterprise, which can query a ServiceNow agent for IT asset data, all through A2A without any of the three systems needing to understand each other's internal architecture.
For teams running multi-vendor agent deployments, this is genuinely useful. The days of bespoke agent-to-agent glue code are numbered.
## Jules: General Availability, Plus a Next-Generation Preview
Jules, Google's async coding agent, exited beta at Cloud Next and is now available to all users, integrated into Google AI Pro and Ultra subscriptions. The model: Gemini 3.1 Pro. The workflow: Jules reads your repository, accepts a task, works asynchronously in an isolated branch, and returns a diff with a full explanation of its plan and reasoning.
That async model has always been Jules' distinguishing feature — and its limitation. Jules is deliberately not interactive. You can't course-correct mid-task. You fire and wait. For well-scoped, well-defined tasks, that's fine. For the kind of iterative, exploratory work that characterizes real software engineering, it's a significant constraint.
The more interesting signal from Cloud Next is what Google is building next. An internal project named Jitro is described as Jules V2, and it's designed around outcome-based goal-setting rather than task-based prompting. The idea: developers define desired outcomes — better test coverage, lower error rates, improved accessibility compliance — and the agent figures out the path. KPI-driven development, in Google's framing. Whether Jitro ships as described is an open question, but the direction is notable: even Google recognizes that task-level prompting is a ceiling.
## Gemini Enterprise Agent Platform
Vertex AI has been rebranded to the Gemini Enterprise Agent Platform — a consolidation move that folds Agentspace, Agent Studio, and the underlying runtime into a unified product. The platform ships with a revamped Agent Runtime delivering sub-second cold starts, support for multi-day autonomous workflows, and a low-code interface (Agent Studio) for building agents via natural language description.
Workspace Studio, Google's no-code agent builder for business users, is now generally available. The numbers cited at the keynote: 3.5 million monthly active users, 170 million tasks automated in a single month. That's a consumer and SMB signal more than an enterprise engineering signal, but it demonstrates that Google's distribution advantages — Gmail, Docs, Sheets, Drive — are a real moat for horizontal agent adoption.
For developers, the platform's most relevant new capability is the combination of A2A routing and the Gemini Enterprise runtime: you can now build agents on Google infrastructure that interoperate with agents running on AWS Bedrock, Azure AI Foundry, or Salesforce Agentforce without custom protocol work. That's table stakes in an increasingly heterogeneous enterprise environment.
## The Editorial Read: Integration Breadth vs. Autonomy Depth
Here's the honest assessment: Google's stack at Cloud Next 2026 is the best demonstration yet of AI agents as enterprise integration infrastructure. A2A in production across five major cloud and CRM platforms, Workspace Studio's 170M monthly automated tasks, Jules available to every Google AI subscriber — these are real adoption numbers.
But there's a distinction worth maintaining: integration breadth is not the same as autonomy depth.
Jules, at GA, is still an async task executor. It doesn't iterate. It doesn't push back on underspecified requirements. It doesn't hold a context window across a multi-session refactoring effort. The Gemini Enterprise Agent Platform is, at its core, a workflow orchestration and connector platform — extraordinarily useful for enterprise automation, but not the model for autonomous software engineering.
Claude Code's architecture is different in kind. It runs in your terminal, manages its own context across sessions, integrates directly with your filesystem and shell, and operates with the kind of tight feedback loop that real software development requires. The upcoming agent memory and multi-session work in Claude Code points in a direction Jules hasn't reached yet.
A2A is actually good news for Claude Code users. As A2A becomes standard infrastructure, Claude Code agents will be able to interoperate with Google-hosted agents, Salesforce agents, and enterprise data systems without bespoke integrations. Anthropic has not announced A2A support explicitly, but the protocol is open standard and built on HTTP/JSON-RPC — integration is a matter of when, not if.
## What Developers Should Take Away
Three things from Cloud Next 2026 matter for working developers:
**A2A is becoming table stakes.** If you're building multi-agent workflows that span organizational systems, start treating A2A the same way you treat MCP — as infrastructure rather than an interesting experiment. 150 production deployments in under a year is a strong signal.
**Jules is worth revisiting.** If you've written off Jules as a beta toy, it's now worth a second look for well-scoped async tasks: fixing specific bugs, adding test coverage, implementing a clearly defined feature against a stable API. It's not Claude Code, but it's also not nothing.
**Google's integration moat is real, but narrow.** Workspace Studio at 3.5M MAU is impressive, but those are business users automating Gmail and Sheets workflows — not engineers shipping production code. The enterprise automation market and the developer tools market are increasingly adjacent but still distinct, and Google's dominance in the former doesn't translate directly to the latter.
Google Cloud Next 2026 was a strong event for enterprise AI infrastructure. The A2A protocol achieving production scale and Jules graduating to GA are genuine milestones. But the companies that will define autonomous software development are still the ones building toward deeper autonomy, not wider integration.
That race is still open.
---
**Sources:**
- [Google Cloud Next 2026 recap — Google Cloud Blog](https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/google-cloud-next-26-recap/)
- [Google Cloud Next 2026: AI agents, A2A protocol, Workspace Studio — The Next Web](https://thenextweb.com/news/google-cloud-next-ai-agents-agentic-era)
- [Announcing the Agent2Agent Protocol — Google Developers Blog](https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/)
- [Introducing Gemini Enterprise Agent Platform — Google Cloud Blog](https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform)
- [Jules — An Autonomous Coding Agent](https://jules.google)
- [Google Cloud Next 2026: Every Major Announcement — Oplexa](https://oplexa.com/google-cloud-next-2026/)
- [10 more Workspace announcements at Cloud Next 2026 — Google Workspace Blog](https://workspace.google.com/blog/product-announcements/10-more-announcements-workspace-at-next-2026)
---
# DeepSeek V4 Ships: Frontier-Class Coding at 1/6th the Cost
URL: https://sdd.sh/2026/04/deepseek-v4-ships-frontier-class-coding-at-1/6th-the-cost/
Date: 2026-04-26
Updated: 2026-04-26
Tags: DeepSeek, open-source, benchmarks, agentic coding, Claude, GPT-5.5, SWE-bench, LiveCodeBench
Categories: AI Tools, Industry
Summary: DeepSeek V4-Pro hits 80.6% on SWE-bench Verified and 93.5% on LiveCodeBench — matching or exceeding most closed models — while costing 1/6th of Claude Opus 4.7 and releasing under the MIT license. Here's what actually matters, and what the benchmarks don't tell you.
DeepSeek dropped V4 on April 24, 2026, and the headline numbers are hard to ignore: 80.6% on SWE-bench Verified, 93.5% on LiveCodeBench, a Codeforces rating of 3206, and pricing that is roughly one-sixth of Claude Opus 4.7 and GPT-5.5. Open-weight, MIT license, 1 million token context.
If you only read the benchmark sheet, this looks like the moment DeepSeek cracked the frontier — at Chinese-lab economics.
The reality is more nuanced, and worth reading carefully before you migrate anything.
## What DeepSeek V4 Actually Is
Two variants shipped in preview:
**V4-Pro** — 1.6 trillion total parameters, approximately 49 billion active per inference pass via a 384-expert Mixture-of-Experts architecture. 1M token context. This is the frontier contender.
**V4-Flash** — 284 billion parameters, 13 billion active. Same context window. Built for throughput where V4-Pro would be overkill.
The architecture headline is a new hybrid attention mechanism: Compressed Sparse Attention (CSA) combined with Heavily Compressed Attention (HCA). In practice, what this means for long-context use cases is significant: at 1M tokens, V4-Pro requires only 27% of the inference FLOPs and 10% of the KV cache compared to V3.2. That is not a marginal improvement — it changes the economics of running long-context agent loops at scale.
It is the largest open-weight model ever released. Under MIT license.
## The Benchmark Picture
| Benchmark | DeepSeek V4-Pro | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 |
|-----------|----------------|----------------|---------|------------|
| SWE-bench Verified | 80.6% | **80.8%** | — | — |
| LiveCodeBench | **93.5%** | 88.8% | — | 91.7% |
| Codeforces Rating | **3206** | — | 3168 | 3052 |
| GPQA Diamond | 90.1% | — | **93.6%** | 94.3% |
The coding numbers are legitimately impressive. LiveCodeBench is one of the cleanest benchmarks for raw coding ability — it uses live competitive programming problems that post-date training cutoffs, so models can't pattern-match against training data. V4-Pro at 93.5% is the best published score on that benchmark as of this writing.
SWE-bench Verified tells a tighter story: 80.6% for V4-Pro versus 80.8% for Claude Opus 4.7. That is statistical noise. For real GitHub issue resolution on real repositories, they are at parity today.
What V4-Pro does not lead on: SWE-bench Pro, MCP-Atlas, Terminal-Bench 2.0, and the multi-agent coordination benchmarks where Opus 4.7 was specifically tuned. DeepSeek has not published SWE-bench Pro numbers. That absence is notable given how prominently other labs have published on it — it is the benchmark most resistant to data contamination, and the one that best predicts production agentic performance.
## The Cost Argument, Honestly Stated
This is where the case for V4 is strongest, and it is a real case.
**API pricing:**
- V4-Flash: $0.14 / $0.28 per million input/output tokens
- V4-Pro: $0.145 / $3.48 per million input/output tokens
- Claude Opus 4.7: $5 / $25 per million input/output tokens
- GPT-5.5: $5 / $30 per million input/output tokens
For output-heavy workloads — code generation, long agentic loops with extensive tool call responses — V4-Pro is approximately 7× cheaper than Opus 4.7. V4-Flash is cheaper still, by nearly two orders of magnitude.
If you are running agents at scale and your tasks are mostly code synthesis and retrieval-augmented generation rather than complex multi-agent coordination, the economics are significant. A workflow costing $5 on V4-Pro runs $35 on GPT-5.5. At volume, that is the difference between a viable product margin and a cost problem.
The MIT license amplifies this: you can self-host, fine-tune for your proprietary codebase, and run inference on your own infrastructure. No API dependency, no data egress to a third-party provider.
## What the Benchmark Sheet Does Not Tell You
Three things are missing from the V4 launch narrative that matter for production agentic workflows.
**SWE-bench Pro scores.** Every frontier lab published SWE-bench Pro results over the past few months — it became the discriminating benchmark precisely because it is contamination-resistant. DeepSeek did not. Claude Opus 4.7 sits at 64.3% on SWE-bench Pro; MiniMax M2.7 (an open-source competitor) published 56.22%. Without a V4-Pro SWE-bench Pro number, the "matches the frontier" claim is incomplete.
**Agentic harness.** V4-Pro is a model. Claude Code is a model plus a purpose-built agentic scaffold: persistent bash sessions, worktree isolation, multi-agent orchestration, Routines with event triggers, CLAUDE.md project context, and a terminal-native operating model. The benchmark measures the model in isolation; production agents are model + harness. A V4-Pro model in a generic OpenAI-compatible server is a different product than Claude Opus 4.7 inside Claude Code.
**Preview status.** This is a preview release. SWE-bench Verified scores frequently revise between preview and GA. V4-Flash in particular received mixed reactions from developers who found it was not a significant jump over V3.2 for their specific use cases. Wait for independent developer benchmarking on production codebases before treating the launch numbers as settled.
## The Open-Source Dynamic
The strategic picture here is larger than a single model release.
DeepSeek V4-Pro at MIT license, running at frontier-competitive coding performance, is the clearest signal yet that the closed-model tax is becoming optional for coding workloads. GLM-5.1 landed at 58.4% on SWE-bench Pro under MIT in April. MiniMax M2.7 reached 56.22%. DeepSeek V4-Pro matches the top closed models on SWE-bench Verified.
This is not a fluke trajectory. Open-weight models are closing the capability gap with each generation, and they are doing it with substantially better economics. For teams with the infrastructure to self-host, fine-tuned open-weight models at V4-Pro performance levels are increasingly a viable alternative to paying frontier API rates.
The question for engineering organizations is whether the capability you are actually getting from closed-model APIs justifies the cost premium. For long-context code synthesis and standard agentic workflows, it is getting harder to justify.
## Where Claude Still Has an Edge
Honest accounting: Claude Opus 4.7 leads on SWE-bench Pro (64.3%, with no equivalent published from DeepSeek), MCP-Atlas (79.1%), and Terminal-Bench 2.0 (69.4% — though GPT-5.5 has it on that one). More importantly, it comes packaged with Claude Code's agentic infrastructure, which is purpose-built for autonomous terminal-native work in a way that no model-only release from DeepSeek can replicate.
The multi-agent coordination features in Opus 4.7 — one-third the tool errors in agentic loops, 14% improvement on complex multi-step workflows, native Agent Teams support — are architectural bets that Anthropic has been building toward since 2024. A model that scores similarly on SWE-bench Verified is not the same as a model that performs similarly when orchestrating 10 parallel sub-agents across a real deployment pipeline.
If your workload is: "generate code for well-specified tasks with bounded context" — V4-Pro is a serious alternative to evaluate. If your workload is: "run autonomous agents across a complex codebase, coordinate parallel workstreams, and handle the failure modes of long-running multi-step tasks" — Opus 4.7 inside Claude Code is still the stack to beat.
## The Practical Recommendation
V4-Flash at $0.14/$0.28 per million tokens is an obvious candidate for any cost-sensitive, bounded-context coding workload. The price-to-capability ratio is the best in the market for that tier.
V4-Pro is worth evaluating against Opus 4.7 for code synthesis tasks where you have the infrastructure to run comparisons. Wait for SWE-bench Pro numbers and post-preview stability before migrating production agentic pipelines.
The MIT license and self-hosting option are genuinely valuable, particularly for organizations with data-residency requirements or who want to fine-tune on proprietary codebases. That option did not exist at this capability level six months ago.
DeepSeek V4 is the best evidence so far that open-source has reached coding-frontier parity on the benchmarks that are easiest to measure. The benchmarks that are hardest to game — and the agentic scaffolding that turns a model into a production tool — are still a Claude story.
---
*Sources: [DeepSeek V4 Preview Release Notes](https://api-docs.deepseek.com/news/news260424), [TechCrunch: DeepSeek previews new AI model](https://techcrunch.com/2026/04/24/deepseek-previews-new-ai-model-that-closes-the-gap-with-frontier-models/), [VentureBeat: DeepSeek-V4 cost comparison](https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5/), [CNBC: DeepSeek V4 release](https://www.cnbc.com/2026/04/24/deepseek-v4-llm-preview-open-source-ai-competition-china.html), [MIT Technology Review: Why DeepSeek's V4 matters](https://www.technologyreview.com/2026/04/24/1136422/why-deepseeks-v4-matters/)*
---
# 92% of AI-Generated Codebases Have Critical Vulnerabilities. Here's Why Agentic Review Is the Fix.
URL: https://sdd.sh/2026/04/92-of-ai-generated-codebases-have-critical-vulnerabilities.-heres-why-agentic-review-is-the-fix./
Date: 2026-04-26
Updated: 2026-04-26
Tags: security, AI-generated code, agentic, code review, Claude Code, vulnerabilities, AppSec
Categories: AI Tools, Guides
Summary: The 2026 AI Coding Impact Report reveals that 100% of engineering orgs are shipping more code thanks to AI — and security teams are drowning. 92% of AI-generated codebases contain critical vulnerabilities. The answer isn't less AI. It's better AI review.
The numbers are in, and they're uncomfortable. ProjectDiscovery's 2026 AI Coding Impact Report surveyed 200 cybersecurity practitioners across North America and Western Europe. Every single respondent — 100% — reported increased engineering delivery over the past year. Nearly half attributed most or all of that acceleration to AI-assisted coding tools.
And 62% of those same security teams said keeping up with that volume is getting harder.
A separate analysis from Sherlock Forensics found that 92% of AI-generated codebases contain at least one critical vulnerability. The average AI-coded application has 8.3 exploitable findings. Seven in ten organizations have confirmed or suspected AI-generated security vulnerabilities in production right now.
This is not an argument for slowing down AI adoption. It's an argument for doing it correctly.
## The Specific Failure Modes
AI coding tools don't fail randomly. They fail in patterned, predictable ways that mirror their training data biases and the limitations of autoregressive generation.
The most common vulnerability types in AI-generated code are not exotic. They're classics:
- **XSS (cross-site scripting)**: Present in 86% of AI-generated web-facing code according to 2026 analysis
- **Log injection**: 88% of LLM-generated code includes at least one log injection vector
- **Business logic vulnerabilities**: Cited by 72% of security professionals as a top concern — the kind of flaw where the application does exactly what it was asked to do, but in a way an attacker can abuse
- **Supply-chain risks**: 73% flag unsafe or unreliable dependencies introduced by AI-suggested imports
- **Secret exposure**: 78% cite the exposure of corporate secrets as their primary concern — API keys, credentials, and internal URLs embedded in generated code or configuration files
The pattern here is instructive. XSS and log injection are textbook OWASP vulnerabilities — the kind of thing that experienced developers know to guard against explicitly. AI models know what these vulnerabilities are in the abstract but routinely generate code that contains them because they're optimizing for functional correctness in the context window, not for adversarial safety at the application boundary.
Business logic vulnerabilities are harder. They require understanding the *intent* of a system, not just its syntax. A model that has never been told "this endpoint should only be callable by authenticated admins" will generate a working endpoint that isn't. It will pass a unit test. It won't pass a security review.
## The Trust Gap Is Growing
The April 2026 JetBrains AI Pulse survey found that 84% of developers use AI coding tools — but only 29% trust what those tools produce in production. That gap is not just a perception problem. It's a measurement of the actual risk that organizations are carrying.
Two-thirds of security teams in the ProjectDiscovery report are spending more than half of their time manually validating AI-generated findings rather than fixing them. This is the worst possible allocation of security effort: skilled engineers doing manual triage because the toolchain can't distinguish a real critical vulnerability from a false positive.
The irony is that the same AI acceleration causing the problem could be part of the solution — if deployed correctly.
## The Agentic Review Loop
The response to "AI generates vulnerable code" should not be "generate less code." It should be "build better review into the generation loop."
Agentic workflows with structured review stages address this directly. Here's what that looks like in practice with Claude Code:
**Pre-commit security hooks**: Claude Code supports hooks that run before any code is committed. A security-focused hook can invoke an MCP tool that runs static analysis against the changed files, catches OWASP top-10 patterns, and blocks the commit if critical findings are present. This moves security left — not to "during review," but to "before the change leaves the developer's machine."
```bash
# Example: hook that runs security scan on staged files
{
"hooks": {
"PreToolUse": [{
"matcher": "Bash",
"hooks": [{"type": "command", "command": "run-security-scan.sh"}]
}]
}
}
```
**Subagent review architecture**: In multi-agent workflows, a dedicated review agent — instantiated with explicit security context — can audit the output of a primary coding agent before it's merged. Claude Code's agent teams architecture supports this natively: the orchestrating agent submits code to a review subagent that has been primed with the application's security requirements, threat model, and known sensitive areas.
**Structured CLAUDE.md security rules**: Any project using Claude Code can embed security invariants directly in the `CLAUDE.md` specification. These rules travel with the codebase and apply to every agent session that touches it:
```markdown
## Security invariants
- Never embed credentials, API keys, or internal URLs in code or comments
- All user input must be sanitized before database insertion or log output
- Authentication checks must appear at the start of every admin endpoint handler
- External dependencies must be pinned to specific versions — no floating semver
```
When these rules are in `CLAUDE.md`, they're not just documentation — they're active constraints that Claude Code enforces across every session. The CLAUDE.md supply-chain attack (CVE-2026-21852) demonstrated that this mechanism is powerful enough to be worth attacking, which is itself a signal of how seriously the system enforces these rules.
## What Security Teams Actually Need
The ProjectDiscovery report is explicit about what security practitioners want before integrating AI deeply into their processes: audit trails and access limitations.
Both of these are solved problems in modern agentic tooling. Claude Code's analytics API provides per-user, per-session data on tool acceptance rates, commands executed, and files modified. OpenTelemetry integration via Claude Cowork lets security teams pipe this data directly into their SIEM. A complete audit trail of what the AI agent did, when, and in which files is not a future feature — it's available today.
Access limitations are similarly addressable through the MCP permission model. Tools can be scoped to specific operations and specific file paths. An agent working on the frontend should not have an MCP tool that can execute arbitrary SQL. Namespace your MCP tools. Scope your permissions. Apply the principle of least privilege to your agents the same way you apply it to your service accounts.
## The Bottom Line
The security crisis in AI-generated code is real. 92% of AI codebases with at least one critical vulnerability is not a number to dismiss. But the response can't be to retreat to fully human-written code — that ship has sailed. 51% of all GitHub commits in 2026 are already AI-assisted or AI-generated.
The fix is architectural:
1. **Move security left** — pre-commit hooks, not post-deploy audits
2. **Use agentic review loops** — a separate reviewer agent with explicit security priming, not a human manually reading every diff
3. **Encode invariants in specs** — `CLAUDE.md` security rules that travel with the codebase and apply to every session
4. **Instrument everything** — audit trails, SIEM integration, and per-user analytics so the security team has visibility without bottlenecking delivery
The goal is not to slow AI down. It's to build a review infrastructure that can keep pace with AI's output velocity. That infrastructure is agentic by necessity — because human reviewers cannot scale to match what AI can generate.
The irony is that the most effective defense against vulnerable AI-generated code is more AI, deployed more thoughtfully.
---
**Sources:**
- [ProjectDiscovery's "2026 AI Coding Impact Report" — PR Newswire](https://www.prnewswire.com/news-releases/projectdiscoverys-2026-ai-coding-impact-report-reveals-ai-generated-code-is-outpacing-security-teams-ability-to-keep-up-302749706.html)
- [AI-written software creates hassles for wary security teams — Cybersecurity Dive](https://www.cybersecuritydive.com/news/ai-coding-security-concerns-projectdiscovery/818319/)
- [92% of AI Code Has Critical Vulnerabilities — Sherlock Forensics](https://www.sherlockforensics.com/pages/ai-code-security-report-2026.html)
- [AI coding speeds up, but security teams fall behind — SecurityBrief](https://securitybrief.news/story/ai-coding-speeds-up-but-security-teams-fall-behind)
- [Seven in 10 firms see AI code flaws in production — TechInformed](https://techinformed.com/seven-in-10-firms-see-ai-code-flaws-in-production/)
- [State of AppSec 2026 — ProjectDiscovery](https://projectdiscovery.io/whitepapers/application-security-report-2026)
---
# MiniMax M2.7: The Open-Source Agent That Rewrote Its Own Training Loop
URL: https://sdd.sh/2026/04/minimax-m2.7-the-open-source-agent-that-rewrote-its-own-training-loop/
Date: 2026-04-25
Updated: 2026-04-25
Tags: MiniMax, Open Source, SWE-bench, Benchmarks, AI Models, Agentic Workflows
Categories: AI Tools, Industry
Summary: MiniMax M2.7 is the first open-source model to participate in its own development cycle — 100 autonomous rounds of scaffold optimization, 30% performance gain, 56.22% on SWE-Pro. It's not just a strong model. It's a glimpse of what model self-improvement looks like in practice.
On April 12, 2026, MiniMax quietly open-sourced M2.7 on Hugging Face. The model had been announced internally on March 18. There was no splashy demo, no product keynote, no benchmark war on X. Just weights, a technical report, and some numbers that are genuinely hard to dismiss.
M2.7 scores 56.22% on SWE-bench Pro and 57.0% on Terminal Bench 2 — matching GPT-5.3-Codex on SWE-Pro and landing with an Elo of 1,495 on the coding arena leaderboard. For an open-source model, that's extraordinary. For a model that helped build itself, it's something else entirely.
## What "Self-Evolving" Actually Means
MiniMax's marketing copy says M2.7 is the first model to "actively participate in its own development cycle." That phrase is doing a lot of work, so it's worth being precise about what actually happened.
During the M2.7 training runs, MiniMax gave the model access to its own reinforcement learning harness. The model could update its internal memory, propose new skills for the harness, and adjust its own scaffold — the scaffolding that governs how the model structures its reasoning and tool use during RL experiments. Then it ran experiments. Observed results. Updated the scaffold. Ran more experiments.
This cycle repeated for over 100 autonomous rounds.
The outcome: a 30% performance improvement relative to the M2.5 baseline on the tasks the model was optimizing for. That is not a marketing number — it reflects a measurable delta in SWE-Pro performance between the model trained with human-designed scaffolding and the model that iterated on its own.
What MiniMax is describing is a form of closed-loop self-improvement: the model contributes to the decisions that shape its own training process. It is not science fiction. It is not general self-improvement across arbitrary tasks. It is a tightly controlled RL experiment where the model has limited but real write access to its own training scaffolding.
It is also the most concrete published implementation of this technique at this scale. And the results are measurable.
## The Benchmark Story
Let's put M2.7's numbers in context alongside the current frontier:
| Model | SWE-bench Pro | Terminal Bench 2.0 | Open? |
|---|---|---|---|
| Claude Opus 4.7 | **64.3%** | — | No |
| GPT-5.5 | 58.6% | 82.7% | No |
| GPT-5.3-Codex | ~56% | ~75% | No |
| **MiniMax M2.7** | **56.22%** | **57.0%** | **Yes** |
| MiniMax M2.5 | — | — | Yes (80.2% SWE-Verified) |
M2.7 does not top any single benchmark. Claude Opus 4.7 leads SWE-Pro by an eight-point margin; GPT-5.5 leads Terminal-Bench 2.0 by a significant gap. But M2.7 is competing in that tier — and it does so with publicly available weights.
That framing matters because it determines how you use the model. You cannot self-host Claude Opus 4.7. You cannot audit its behavior under adversarial inputs or run it in an air-gapped environment. You cannot modify its scaffolding, distill it, or fine-tune it on proprietary workflows. M2.7 is all of those things: fully open under a modified MIT license that requires commercial deployers to display the model name in their product UI.
## The Skill Adherence Number Nobody Is Talking About
The benchmark scores get the attention. The number worth sitting with is 97%.
M2.7 maintains 97% skill adherence across 40 complex skills, each exceeding 2,000 tokens. Skill adherence measures whether the model follows a defined skill — a structured procedure involving tool use, reasoning steps, and state management — without deviating from the expected execution path. At 2,000+ tokens per skill, this is not a simple instruction-following task. These are multi-step, multi-tool procedures.
97% adherence across 40 such skills is the kind of reliability number that enterprise deployments require. It means you can define a complex agent workflow, trust that M2.7 will execute it consistently, and build production systems on top of it without constant human oversight.
MiniMax backs this up with a production claim: M2.7 handles 30–50% of MiniMax's internal reinforcement learning team workflows autonomously. That is an extremely specific number for a company to publish. It either reflects real deployment data or is going to be very embarrassing in six months.
## Native Agent Teams
M2.7 ships with native Agent Teams support — a multi-agent architecture with stable role boundaries baked into the model's training, not bolted on as a prompt engineering trick.
In practice, this means you can assign roles (architect, coder, reviewer, test engineer) to separate M2.7 instances and they will maintain their role identity and authority boundaries across a multi-agent session without drifting into each other's domains or recursively second-guessing each other's decisions.
Claude Code has a similar architecture in its multi-agent orchestration model. The notable difference is that M2.7 was explicitly trained for this pattern, while Claude Code's agent teams emerge from Claude's general instruction-following capacity combined with the Claude Code shell's task routing logic.
Whether training for agent teams versus prompting for them produces meaningfully different results in production is an open question. M2.7's 30-50% autonomous workflow handling claim suggests MiniMax believes the training approach matters.
## The Open-Source Implications
M2.7 represents the most capable openly available model for software engineering tasks as of April 2026. The gap to Claude Opus 4.7 on SWE-Pro (56.22% vs. 64.3%) is real but not infinite. The gap on Terminal Bench 2.0 is wider — 57.0% vs. GPT-5.5's 82.7% — but Terminal Bench heavily rewards the kind of scaffolding optimizations that closed commercial systems have applied for months. M2.7 closes that gap through its self-evolving scaffold, and it will iterate.
For teams that need:
- **Air-gapped deployment**: M2.7 is available on Ollama and HuggingFace with full weights
- **Regulatory compliance**: Open weights mean auditable behavior
- **Cost control**: $0.30 per million input tokens for M2.5-Lightning (M2.7 pricing is similar); a fraction of Opus 4.7's $5 input cost
- **Fine-tuning flexibility**: Modify M2.7 on domain-specific codebases in ways you cannot do with closed models
...M2.7 is the most credible option in the market right now. That was not true six months ago, when open-source models were firmly one tier below the frontier.
## Where M2.7 Fits Against Claude Code
Claude Code runs on Claude Opus 4.7, which leads M2.7 on SWE-Pro by eight points. That gap translates to real differences in complex, ambiguous engineering tasks — the kind where judgment calls matter and the model has to reason about architecture, not just write the patch.
What Claude Code offers that no open-source deployment can currently match: the full Claude Code tooling ecosystem (MCP-native architecture, 6,000+ MCP servers, Routines, Ultraplan, computer use), the safety and predictability guarantees that come from Anthropic's constitutional AI training, and the enterprise features (RBAC, OpenTelemetry, Analytics API, Bedrock GA) that large organizations require.
For individual developers or small teams who don't need enterprise features, want full control of their stack, and are cost-sensitive, M2.7 running locally through Ollama or deployed on owned infrastructure is now a serious alternative to subscribing to a frontier commercial model.
The open-source frontier closed faster than most people expected. M2.7 is evidence of that closing.
---
**Sources**
- [MiniMax M2.7 Official Announcement — MiniMax](https://www.minimax.io/news/minimax-m27-en)
- [MiniMax M2.7 Model Page — MiniMax](https://www.minimax.io/models/text/m27)
- [MiniMax Just Open Sourced MiniMax M2.7 — MarkTechPost](https://www.marktechpost.com/2026/04/12/minimax-just-open-sourced-minimax-m2-7-a-self-evolving-agent-model-that-scores-56-22-on-swe-pro-and-57-0-on-terminal-bench-2/)
- [MiniMaxAI/MiniMax-M2.7 — Hugging Face](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
- [MiniMax M2.7 — Ollama](https://ollama.com/library/minimax-m2.7)
- [MiniMax M2.7: Open Source AI Coding Agent Breaks Records — AI Daily Post](https://aidailypost.com/news/minimax-m27-agent-scores-5622-swepro-57-terminal-bench-2-elo-1495)
---
# Claude Code v2.1.118: Vim Mode, Custom Themes, and Hooks That Talk to MCP
URL: https://sdd.sh/2026/04/claude-code-v2.1.118-vim-mode-custom-themes-and-hooks-that-talk-to-mcp/
Date: 2026-04-25
Updated: 2026-04-25
Tags: Claude Code, Developer Tools, MCP, Vim, Changelog
Categories: AI Tools, Guides
Summary: Claude Code v2.1.118 ships vim visual mode, a full custom theming system, and hooks that can now invoke MCP tools directly. Small-sounding updates that collectively make Claude Code meaningfully more extensible — and more comfortable for developers who live in the terminal.
Three features dropped in Claude Code v2.1.118 that deserve more attention than a changelog entry:
1. Vim visual mode — real selection, operators, and visual feedback
2. Custom themes — a full JSON-based theming system, not just color presets
3. MCP tool hooks — hooks can now call MCP tools directly, not just shell commands
None of these are headline features. All three reflect something important about where Claude Code is going: a terminal tool that increasingly adapts to how you work, not the other way around.
## Vim Visual Mode: Finally Complete
Vim mode in Claude Code has been around for a while. What it was missing was visual mode — the ability to select text with `v` (character-wise) or `V` (line-wise), apply operators (`d`, `y`, `c`, `>`, `<`), and get clear visual feedback on the selection.
That's now in v2.1.118.
For developers who think in vim motions, visual mode is not a nice-to-have. It's the difference between a vim implementation that feels like vim and one that feels like "vim-ish." You cannot effectively use vim without visual mode for selections. `d3j` works. `v3jd` works differently and is sometimes what you need. `V5>` for block-indenting — that's a workflow, not a trick.
The `/vim` command is gone; toggle vim mode through `/config → Editor mode`. The consolidation makes sense — config is the right home for persistent editor preferences.
What this means for Claude Code's positioning: there is a non-trivial segment of senior developers who will not use a terminal tool that cannot replicate their editor keybindings with sufficient fidelity. Vim visual mode removes a real adoption barrier for that segment. It also signals that Anthropic is investing in the editing experience within Claude Code's terminal, not just in agent capabilities — which is the right call for a tool that asks you to spend hours per day in it.
## Custom Themes: The Terminal Is Yours Now
The theming system in v2.1.118 is more substantial than "dark mode vs light mode."
You create a theme via `/theme`, which gives you a named theme that persists across sessions. The theme is stored as a JSON file in `~/.claude/themes/`. You can hand-edit the JSON for fine-grained control of every color in the interface. If you have a Claude Code plugin, you can ship themes in a `themes/` subdirectory alongside the plugin — users install the plugin, get the theme.
This is the kind of extensibility that makes a terminal tool feel native to a developer's environment instead of feeling like a tool that tolerates customization. A developer who has spent years tuning their Alacritty colors and their shell prompt will have opinions about how Claude Code's UI looks. Now those opinions can be expressed.
There's also a practical upside for enterprise deployments: organizations that want a unified look for internal tooling — consistent with their other developer tools, or with accessibility requirements — can ship a shared theme as part of a plugin bundle.
## MCP Tool Hooks: The Integration Bridge
This is the most consequential change in v2.1.118, even though it gets the least attention.
Previously, Claude Code hooks — the automation scripts that fire on specific events (pre-tool call, post-tool call, session start, session stop) — could run shell commands. They could call scripts, write to files, post to webhooks. They could not natively invoke MCP tools.
Now they can. Add `type: "mcp_tool"` to a hook definition, specify the server and tool name, and the hook invokes an MCP tool directly when the event fires.
The implications are significant. MCP tools are the integration layer that connects Claude Code to external systems: databases, APIs, observability platforms, incident management systems, code review tools, CI/CD pipelines. Previously, hooking Claude Code behavior into those systems required a shell script that itself called an MCP server or an external API. That works, but it's fragile — you're running a script that manages another process's lifecycle.
Direct MCP tool invocation in hooks means:
**Pre-tool call hooks can query context from MCP servers.** Before Claude runs a bash command, a hook can query your observability MCP server for current system state. If a metric is in a critical threshold, the hook can abort the tool call and surface that context to Claude — without any manual instrumentation.
**Post-tool call hooks can write to MCP-connected systems.** Every file Claude edits can trigger a hook that writes to your audit log via your logging MCP server. Every commit can trigger a post-hook that creates a linked ticket in your project management system via its MCP connector.
**Session-level hooks can initialize and tear down MCP-connected state.** Session start hooks can establish context in your MCP servers — user identity, project, compliance flags. Session stop hooks can clean up or summarize.
This turns Claude Code's hook system into a proper event-driven integration framework, not just a way to run shell scripts at lifecycle points.
## The MCP Startup Improvement
v2.1.118 also speeds up MCP startup when multiple stdio servers are configured. Previously, Claude Code would initialize all configured MCP servers sequentially at startup, which meant a proportional startup penalty for each server in your config.
The new behavior: `resources/templates/list` is deferred to first use (the `@`-mention that triggers resource loading). Non-essential initialization is pushed to when it's actually needed. This matters in the real world, where a Claude Code config might have 10–20 MCP servers configured — a security scanner, a database connector, a GitHub MCP server, an observability tool, a few domain-specific internal servers.
Faster startup sounds trivial. It's not when it's the difference between opening Claude Code and seeing a ready terminal in one second versus waiting five seconds for server initialization. Agentic workflows often involve opening Claude Code sessions frequently — in different project directories, in CI contexts, in agent orchestration chains. Every second of startup time compounds.
## The OS CA Certificate Change
A quieter but operationally significant change: Claude Code now trusts the OS CA certificate store by default. Enterprise TLS proxies work without extra configuration.
This has been a friction point for teams that route developer tools through a corporate TLS proxy. Claude Code would reject the proxy's certificate because its certificate store didn't include the corporate CA. The fix required manual configuration that many developers didn't know how to do and security teams didn't want to document.
Now it just works. For enterprise deployments, this removes a class of support tickets.
## What These Updates Signal
Taken individually, none of these features change what Claude Code fundamentally is. Together, they reveal a consistent product direction: make the tool maximally comfortable for senior engineers who have strong preferences about their environment, and make the integration surface maximally open for teams building production agentic workflows.
The vim completion, the theming system, and the CA store change are all about developer comfort and adoption. The MCP hook integration and startup improvements are about production deployability at scale.
The underlying architecture thesis remains unchanged: Claude Code is a terminal-native agentic tool, not an IDE extension. Every feature in v2.1.118 deepens both the comfort of living in that terminal and the power of the integrations it can orchestrate from it.
---
**Sources**
- [Claude Code v2.1.118 Release Notes — ClaudeWorld](https://claude-world.com/articles/claude-code-21118-release/)
- [Claude Code Changelog — Claude Code Docs](https://code.claude.com/docs/en/changelog)
- [Claude Code Releases — GitHub](https://github.com/anthropics/claude-code/releases)
- [Shawn Tenam on X: Claude Code v2.1.118 dropped today](https://x.com/shawntenam/status/2047214841157316808)
- [Claude Code by Anthropic — Release Notes April 2026 — Releasebot](https://releasebot.io/updates/anthropic/claude-code)
---
# GPT-5.5 'Spud' Is OpenAI's Strongest Coding Model Yet — With One Important Asterisk
URL: https://sdd.sh/2026/04/gpt-5.5-spud-is-openais-strongest-coding-model-yet-with-one-important-asterisk/
Date: 2026-04-24
Updated: 2026-04-24
Tags: GPT-5.5, OpenAI, benchmarks, agentic coding, Claude Opus 4.7, SWE-bench, Terminal-Bench
Categories: AI Tools, Industry
Summary: OpenAI's first fully retrained base model since GPT-4.5 delivers 82.7% on Terminal-Bench 2.0 and leads on most agentic evals. But on SWE-bench Pro — the benchmark that tests real-world GitHub issue resolution — Claude Opus 4.7 still leads by 5.7 points. Here's what that split actually means.
OpenAI shipped GPT-5.5 on April 23, internally codenamed "Spud" — and for once, the leak-to-launch gap actually built up justified expectations. This is not an incremental patch release. GPT-5.5 is the first fully retrained base model since GPT-4.5, and on several agentic benchmarks it is now the best available model. That matters. So does knowing exactly which benchmarks it wins and which ones it doesn't.
## What Changed in the Architecture
GPT-5.5 is notable not just for the numbers but for what OpenAI did to get them. The 5.1 through 5.4 releases were refinements on a shared base — RLHF tuning, instruction-following tweaks, reasoning mode improvements. GPT-5.5 is a ground-up retraining. OpenAI says the new base model was trained with a stronger emphasis on long-horizon task coherence: the model learns to maintain state across multi-step tool use, not just produce high-quality individual responses.
The practical claim is "faster, sharper thinker for fewer tokens." In agentic loops — where each model call compounds in cost — this matters. If GPT-5.5 routes a 10-step debugging task in 7 calls where Opus 4.7 takes 11, the $30/million output token price (vs Opus 4.7's $25) starts to look comparable in production.
## The Benchmark Split
Here is the actual scorecard, as of launch:
| Benchmark | GPT-5.5 | Claude Opus 4.7 | Winner |
|---|---|---|---|
| Terminal-Bench 2.0 | **82.7%** | 69.4% | GPT-5.5 |
| SWE-bench Pro | 58.6% | **64.3%** | Opus 4.7 |
| Expert-SWE | **73.1%** | — | GPT-5.5 |
| OSWorld-Verified | **78.7%** | 78.0% | GPT-5.5 (marginal) |
| MCP-Atlas | 75.3% | **79.1%** | Opus 4.7 |
| GDPval | **84.9%** | — | GPT-5.5 |
The story the headline writers landed on — "GPT-5.5 masters agentic coding" — is technically accurate for Terminal-Bench 2.0 and Expert-SWE. But SWE-bench Pro is the benchmark that has consistently proven hardest to game because it tests actual GitHub issue resolution on held-out repositories. On that metric, Claude Opus 4.7 leads by 5.7 percentage points — not a rounding error.
The MCP-Atlas split is also worth noting. MCP-Atlas benchmarks model performance on multi-tool coordination across the Model Context Protocol, which is the real-world substrate for production agentic systems. Opus 4.7's 79.1% vs GPT-5.5's 75.3% suggests Anthropic's tighter integration with its own tooling ecosystem still confers an advantage in the workflows that matter most.
## What Terminal-Bench 2.0 Actually Tests
Terminal-Bench 2.0 is worth unpacking because GPT-5.5's 82.7% lead over Opus 4.7 is significant — 13.3 points. The benchmark tests complex command-line workflows: tasks that require planning, iteration, and tool coordination in a terminal environment. Think "set up a CI pipeline with environment-specific config, debug the failing test, and commit a working fix."
This is, notably, Claude Code's home turf. Which makes the Terminal-Bench 2.0 number interesting in two directions: it shows GPT-5.5 is genuinely capable in terminal-native workflows, and it raises the question of what happens when GPT-5.5 gets a proper agentic harness — not just ChatGPT and Codex, but a terminal-native deployment model analogous to Claude Code.
That harness does not exist yet.
## Availability and the Deployment Gap
GPT-5.5 is currently live in ChatGPT and Codex for paid subscribers (Plus, Pro, Business, Enterprise). The API is still in controlled rollout — OpenAI cited additional safety and security work for serving partners at scale.
Claude Opus 4.7, by contrast, is GA on the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Azure AI Foundry. If you are building production agentic systems today and need the API to be reliably available, that multi-cloud GA status is not a minor footnote.
OpenAI's API rollout will presumably complete within days to weeks. But "the model exists in ChatGPT" and "you can build autonomous agents against it at production scale" are not the same thing.
## The Autonomy Architecture Question
Here is the tension that the benchmark comparison does not capture: GPT-5.5 is an excellent model being delivered through a product architecture that was not designed for terminal-native agentic work. ChatGPT is a conversational interface. Codex is a coding assistant. Both are built around a human in the loop.
Claude Code is built around the terminal as the operating environment. That means persistent bash sessions, multi-agent orchestration, worktree isolation, /ultrareview, /ultraplan, Routines with scheduled and GitHub-event triggers, and a CLAUDE.md-driven project-context model. The model is only one part of the stack. The scaffolding matters enormously.
GPT-5.5 is the best model OpenAI has shipped for agentic tasks. The gap between its benchmark performance and its deployment architecture — still IDE-centric, still conversational-first — is where Claude Code users should be watching, not the SWE-bench Pro delta.
## Pricing in Context
Opus 4.7: $5 input / $25 output per million tokens.
GPT-5.5: $5 input / $30 output per million tokens.
GPT-5.5 Pro (extended reasoning mode): $30 input / $180 output — significant jump for workloads requiring deep multi-step planning.
OpenAI's "fewer tokens per task" claim is unverified by independent benchmarks at launch. If it holds in practice — say, 20-30% fewer output tokens per completed agentic task — the effective cost per task could be roughly comparable to Opus 4.7. Developer benchmarking over the next few weeks will settle this question.
## The Competitive Picture in April 2026
The frontier is genuinely competitive in a way it was not 12 months ago. SWE-bench Verified is now effectively at human baseline across multiple models. SWE-bench Pro is the new discriminating benchmark, and Opus 4.7 leads there. Terminal-Bench 2.0 now has GPT-5.5 in front.
Neither model has runaway dominance. The choice increasingly turns on tooling ecosystem, deployment reliability, and whether you are building around a conversational interface or a terminal-native agent loop.
GPT-5.5 is a real step up. The asterisk is that benchmark leadership is not the same as agentic infrastructure leadership. OpenAI is narrowing the model gap; the architecture gap is a different question entirely.
---
*Sources: [Introducing GPT-5.5 | OpenAI](https://openai.com/index/introducing-gpt-5-5/), [VentureBeat: GPT-5.5 narrows beats Claude Mythos Preview on Terminal-Bench 2.0](https://venturebeat.com/ai/openais-gpt-5-5-is-here-and-its-no-potato-narrowly-beats-anthropics-claude-mythos-preview-on-terminal-bench-2-0), [MarkTechPost: GPT-5.5 scores 82.7% on Terminal-Bench 2.0](https://www.marktechpost.com/2026/04/23/openai-releases-gpt-5-5-a-fully-retrained-agentic-model-that-scores-82-7-on-terminal-bench-2-0-and-84-9-on-gdpval/), [TechCrunch: OpenAI releases GPT-5.5](https://techcrunch.com/2026/04/23/openai-chatgpt-gpt-5-5-ai-model-superapp/), [CNBC: OpenAI announces GPT-5.5](https://www.cnbc.com/2026/04/23/openai-announces-latest-artificial-intelligence-model.html), [The Next Web: GPT-5.5 launch](https://thenextweb.com/news/openai-gpt-5-5-launch-enterprise)*
---
# Amazon Just Bet $25 Billion on Anthropic — and Locked In Its Cloud Destiny for a Decade
URL: https://sdd.sh/2026/04/amazon-just-bet-25-billion-on-anthropic-and-locked-in-its-cloud-destiny-for-a-decade/
Date: 2026-04-24
Updated: 2026-04-24
Tags: Anthropic, Amazon, AWS, investment, Claude Code, Bedrock, infrastructure, cloud
Categories: AI Tools, Industry
Summary: Amazon announced up to $25B in new Anthropic investment tied to a $100B AWS commitment over 10 years. The deal gives Anthropic 5 GW of dedicated compute, native AWS console access for Claude, and a stable infrastructure runway well past any IPO. For developers building with Claude Code, the implications are more concrete than they first appear.
On April 20, Amazon and Anthropic announced what may be the most consequential infrastructure deal in AI history. Amazon will invest up to $25 billion in Anthropic — $5 billion immediately, with up to $20 billion more tied to commercial milestones — in exchange for a commitment from Anthropic to spend over $100 billion on AWS technologies over the next decade. The deal includes access to up to 5 gigawatts of compute capacity, including Trainium3 chip clusters expected online later this year.
The press release framing was predictably about partnership and strategic alignment. The actual story is more interesting: this deal shapes what Claude Code can do at scale, for years.
## The Numbers in Context
Amazon's total Anthropic commitment now approaches $33 billion ($8B prior investment + $25B new). For reference, Amazon announced a similar arrangement with OpenAI in February 2026: $50 billion in investment plus a separate $100 billion AWS commitment. Amazon is now the primary cloud infrastructure partner for both leading AI labs simultaneously.
That positioning is deliberate. Amazon Web Services is playing a different game than Google or Microsoft in the AI infrastructure race. Rather than backing a single model provider and hoping they win, AWS is becoming the substrate on which the winners run — regardless of which lab eventually leads on model quality. The OpenAI and Anthropic deals together guarantee that inference demand from the two most-used frontier labs flows through AWS data centers for at least a decade.
Anthropic's ARR has surpassed $30 billion run-rate, up from approximately $9 billion at end of 2025. The $100 billion AWS commitment over 10 years works out to $10 billion per year — roughly one-third of current revenue committed to a single cloud provider. That is a significant lock-in, but the quid pro quo (guaranteed compute at scale, hardware priority, native integration) is equally significant.
## What 5 Gigawatts Actually Means
The 5 GW compute commitment is the infrastructure detail that deserves more attention than it has received. For context: a large hyperscale data center typically consumes 200-500 megawatts. Five gigawatts is 10 to 25 times that — a dedicated fleet of compute capacity reserved exclusively for training and serving Anthropic models.
Trainium3, Amazon's third-generation AI training chip, is expected to come online later this year. Having priority access to Trainium3 capacity matters in two ways. First, it removes training compute as a bottleneck for the next generation of Claude models — Anthropic will not be competing with other AWS customers for scarce chip capacity at peak demand. Second, it creates an inference advantage: models trained and served on the same hardware stack tend to benefit from tighter optimization.
Anthropic has also been exploring its own chip program — the $25B deal suggests they are keeping that option open while securing AWS as the primary fleet. These strategies are not mutually exclusive.
## The Developer Integration Story
For developers, the most immediately relevant detail is this: AWS customers will be able to access the full Anthropic-native Claude console — including Claude Code — directly from within AWS, using existing AWS contracts, credentials, and billing relationships. No separate Anthropic account, no second billing relationship.
Over 100,000 customers currently run Anthropic Claude models on Amazon Bedrock. That baseline makes Claude one of the most widely deployed model families in enterprise cloud environments. The deal deepens this by giving those customers a native on-ramp to Claude Code's agentic capabilities without leaving their AWS workflow.
For Claude Code specifically, the Bedrock GA story — first covered here in April 2026 — just got a 10-year infrastructure commitment behind it. Claude Code on Bedrock already delivers Mantle-backend zero-operator-access and enterprise air-gap deployment patterns. With dedicated Trainium3 compute coming online and a $100 billion AWS runway, the question of whether Anthropic can maintain uptime and latency at enterprise scale has a different answer today than it did last year.
## The Strategic Dependence Question
It is worth being direct about what this deal trades away. A $100 billion cloud commitment over 10 years is not a partnership of equals. Anthropic is making a bet that AWS infrastructure quality and pricing will remain competitive over a decade — and that its own chip exploration will provide enough credibility to negotiate from a position of strength at renewal time.
The counter-argument: at $30 billion ARR and growing, Anthropic has leverage. And the commitment is structured around milestones, not upfront — if Anthropic's revenue compounds faster than expected, the relative weight of the AWS commitment shrinks.
There is also the competitive neutrality point. Unlike Microsoft, which is deeply integrated with OpenAI and has motivated reason to favor that relationship, AWS serves both Anthropic and OpenAI. That structural neutrality means Anthropic is unlikely to face the kind of platform-level friction that would come from being a minority priority for a cloud provider that is also betting heavily on a direct competitor.
## What Changes for Claude Code Users
The practical implications break down into near-term and long-term.
**Near-term:** The deal does not change Claude Code's pricing or features this week. What it does is remove infrastructure uncertainty as a concern for teams evaluating Claude Code for enterprise deployment. When a procurement team asks "what happens to Anthropic in three years," the answer now includes a committed 10-year AWS infrastructure arrangement and over $33 billion in Amazon backing.
**Medium-term:** Native AWS console access for the full Anthropic product suite (including Claude Code) will simplify enterprise rollouts. Instead of managing separate Anthropic accounts alongside AWS accounts, large organizations with existing AWS contracts will have a unified procurement path.
**Long-term:** Trainium3 compute priority means that Claude model training will not be constrained by chip access. That matters most when you consider what the next generation of models — Claude Opus 5, Claude Mythos full release — will require in training compute. The infrastructure is being positioned to sustain capability advances that Anthropic's current trajectory implies.
## The Broader Infrastructure Race
This deal is the latest move in a pattern that has been clear since early 2025: frontier AI labs are not independent from cloud providers. They are symbiotically locked in. OpenAI is locked to Microsoft Azure and AWS. Anthropic is locked to AWS (with a secondary presence on Google Cloud via Vertex AI). Google owns its own lab (DeepMind/Gemini) and runs it on GCP.
The question for developers is not whether these relationships exist — they do, and they are getting deeper — but whether the lock-in affects their own freedom of choice. For now, the model availability story remains healthy: Opus 4.7 is on AWS, Google Vertex, and Azure. Claude Code's Bedrock deployment is GA. The compute deals happen at the training layer, not the API layer.
As long as inference access remains multi-cloud, the infrastructure race benefits developers more than it constrains them. The moment that changes is when cloud provider exclusivity starts bleeding into API availability. That is the line worth watching.
---
*Sources: [CNBC: Amazon to invest up to $25 billion in Anthropic](https://www.cnbc.com/2026/04/20/amazon-invest-up-to-25-billion-in-anthropic-part-of-ai-infrastructure.html), [About Amazon: Amazon and Anthropic expand strategic collaboration](https://www.aboutamazon.com/news/company-news/amazon-invests-additional-5-billion-anthropic-ai), [GeekWire: Amazon doubles down on Anthropic](https://www.geekwire.com/2026/amazon-doubles-down-on-anthropic-with-25b-investment-mirroring-its-openai-cloud-deal/), [The Tech Portal: Amazon confirms $25Bn investment](https://thetechportal.com/2026/04/21/amazon-confirms-25bn-investment-in-anthropic-ties-deal-to-100bn-aws-commitment), [CIO Dive: Amazon adds $25B to Anthropic AI infrastructure deal](https://www.ciodive.com/news/amazon-25-billion-to-anthropic-ai-infrastructure/818123/), [PYMNTS: Amazon and Anthropic Deepen Ties](https://www.pymnts.com/artificial-intelligence-2/2026/amazon-and-anthropic-deepen-ties-with-investment-and-hardware-pact/)*
---
# OpenCode at 147K Stars: The Open-Source Terminal Agent That Won't Pick a Side
URL: https://sdd.sh/2026/04/opencode-at-147k-stars-the-open-source-terminal-agent-that-wont-pick-a-side/
Date: 2026-04-23
Updated: 2026-04-23
Tags: OpenCode, Aider, open source, terminal, AI coding, Claude Code, provider agnostic
Categories: AI Tools, Guides
Summary: OpenCode has 147K GitHub stars, 6.5M monthly developers, and supports 75+ LLM providers. Here's an honest look at what it gets right, where it falls short, and when it makes more sense than Claude Code.
There's an open-source terminal coding agent with 147,000 GitHub stars and 6.5 million monthly active developers, and most of the Claude Code-focused coverage on this blog has barely mentioned it. That's an omission worth correcting.
OpenCode is real, it's growing fast, and it makes a genuinely compelling case for a specific kind of developer. It also has a fundamental architectural ceiling that the star count can't paper over. Let's be precise about both.
## The Numbers Are Real
OpenCode is maintained by the Anomaly team — the same group behind terminal.shop — and written in Go. As of April 2026: 147,000+ GitHub stars, 850 contributors, 11,000+ commits, and 6.5 million monthly active developers.
For context, Aider — the long-standing open-source benchmark for terminal coding agents — has about 39,000 stars. OpenCode grew from 75,000 to 147,000 stars in roughly six months. That trajectory doesn't happen by accident; it means a large number of developers tried it and told other developers to try it.
The growth is partly structural. Developers who tried Claude Code on the $20 Pro plan, found the limits hit quickly, balked at the $100/month Max requirement, got burned by Cursor's opacity, and found Copilot's agent mode unsatisfying are actively looking for alternatives. OpenCode is the best answer the open-source community currently has.
## What It Actually Is
OpenCode is a terminal UI (TUI) application — think Lazygit or btop, but for AI coding sessions. You run `opencode` in a project directory, get a chat interface in your terminal, and your agent can read files, execute shell commands, write code, and iterate.
The headline differentiator is provider agnosticism: 75+ LLM providers, including Claude, GPT-5.4, Gemini, Mistral, Groq, and any Ollama-compatible local model. You switch providers in a config file. Your sessions, history, and tooling stay identical.
Key technical features worth noting:
**LSP integration.** OpenCode automatically detects and configures Language Server Protocol servers for your project. The LLM sees type information, go-to-definition results, and hover docs — the same context your IDE has, without requiring an IDE.
**Multi-session support.** Run multiple parallel agents on the same project, each in isolated context, with shared filesystem access. Useful for running a design-review agent alongside an implementation agent.
**Session sharing.** Export a session as a shareable link for async review or handoff. Pairs well with remote teams and async engineering cultures.
**MCP extensibility.** Connects to the same MCP server ecosystem Claude Code uses, meaning the tool integrations you've already built don't need to be rebuilt.
## The Three Things It Gets Right
**Provider flexibility without vendor lock.** This is the genuine argument for OpenCode that Claude Code cannot honestly dismiss. If you're evaluating models on your actual codebase — comparing GLM-5.1 against Claude Opus 4.7 for a specific task category, testing whether a cheaper Gemini tier is good enough for routine refactors — OpenCode is the right harness. You don't need to context-switch between tools, billing dashboards, or CLI interfaces.
**Cost structure.** OpenCode's costs are entirely your API costs. There's no subscription floor. A developer burning $25/month across a mix of cheap models and occasional Opus 4.7 calls will spend $25/month. A Claude Code Max subscription is $100/month per seat before API costs. For individual developers in markets where $100/month is a real barrier, or for teams with dozens of occasional users, this matters.
**Clean terminal-native architecture.** OpenCode is genuinely terminal-first, not an IDE plugin with a terminal mode bolted on. It runs cleanly over SSH, in Docker containers, on remote dev machines, and in CI environments. The context model is unambiguous: the filesystem is the interface, shell is the runtime, the TUI is the UI. This is the right architecture for infrastructure, platform, and backend-heavy work.
## Where It Falls Short
The provider-agnostic model is also OpenCode's ceiling.
Claude Code isn't just a chat interface layered over the Claude API. It includes deep integration with Anthropic's infrastructure: prompt caching calibrated for agentic loops, `/ultraplan` cloud planning sessions, `/ultrareview` multi-agent code review with a dedicated Opus 4.7 reviewer, Claude Routines for cloud-native automation, `xhigh` effort mode, and the full Anthropic safety layer on autonomous operations.
None of that is available when you point OpenCode at the Claude API. You get Claude's language capabilities without Anthropic's agentic scaffolding. For most tasks, that difference is invisible. For complex multi-hour agentic tasks — the sessions where an agent is planning, branching on failures, and coordinating subagents — the gap compounds.
Aider has a different strength OpenCode doesn't match: git as a first principle. Every Aider edit is automatically committed with a descriptive message. You can revert any AI change with a single `git revert`. For teams where correctness and auditability matter more than model flexibility, Aider remains the more disciplined tool.
## The Anthropic API Block Episode
In early 2026, Anthropic briefly blocked OpenCode from accessing the Claude API under its OpenClaw policy, which restricts third-party clients that replicate the Claude Code interface commercially. The block was contested: OpenCode argued it was a coding harness, not a Claude Code clone. The policy was clarified and OpenCode resumed access.
The episode matters because it reveals the structural tension in the open-source alternative ecosystem. The more capable Claude becomes, the more value a provider-agnostic harness captures by offering access without the Claude Code subscription. Anthropic has legitimate reasons to prefer developers use Claude Code directly. That tension isn't resolved — it's managed, for now.
If you're building workflows that depend on OpenCode's Claude access, the policy ambiguity is a real dependency risk worth acknowledging.
## Who Should Use OpenCode
**Use OpenCode if:**
- You want to evaluate models on real tasks without juggling multiple billing dashboards
- You're on a tight budget and Claude Opus 4.7 is overkill for most of your daily tasks
- You need a terminal agent that runs cleanly over SSH or in containerized environments
- You're contributing to open-source projects and want to avoid commercial tool dependencies
- You work across multiple cloud providers and need your tooling to stay neutral
**Stick with Claude Code if:**
- Agentic depth is a daily tool — `/ultraplan`, `/ultrareview`, Routines, and multi-agent orchestration are in your regular workflow
- You're on a team that needs shared analytics, RBAC, and enterprise compliance features
- You want Anthropic's safety layer on autonomous operations in sensitive codebases
- You're building the kind of multi-hour autonomous workflows where native infrastructure makes the difference between completion and timeout
## The Bottom Line
OpenCode at 147K stars is not hype. It's the best open-source terminal coding agent currently available, and the provider-agnostic model is a genuine differentiator for developers who value flexibility over depth.
It is not Claude Code. For teams doing serious agentic work at the frontier, that gap is real and growing as Anthropic continues to add infrastructure that only works natively. But for individual developers, budget-conscious teams, and anyone who wants a capable terminal agent without a subscription commitment, OpenCode deserves a serious look.
The open-source AI coding ecosystem being healthy is good for the space as a whole — including Anthropic, which benefits from developers building comfort with terminal-native agentic workflows even when the starting point isn't Claude Code.
---
**Sources:**
- [OpenCode | The open source AI coding agent](https://opencode.ai/) — official site
- [GitHub - anomalyco/opencode](https://github.com/anomalyco/opencode) — GitHub repository
- [Aider vs OpenCode vs Claude Code: 2026 CLI AI Coding Assistants Showdown](https://sanj.dev/post/comparing-ai-cli-coding-assistants) — sanj.dev
- [Best Open Source AI Coding Agents in 2026](https://www.opensourceaireview.com/blog/best-open-source-ai-coding-agents-in-2026-ranked-by-developers) — Open Source AI Review
- [OpenCode: Open Source AI Coding Agent with 146k+ Stars](https://www.decisioncrafters.com/opencode-the-open-source-ai-coding-agent-transforming-terminal-development-with-146k-github-stars/) — Decision Crafters
- [OpenCode vs Claude Code vs Aider: Picking the Right AI Coding Agent](https://dev.to/alanwest/opencode-vs-claude-code-vs-aider-picking-the-right-ai-coding-agent-44i0) — DEV Community
---
# Claude Design Is Not a Figma Clone. It's the Missing First Half of Your Agentic Stack.
URL: https://sdd.sh/2026/04/claude-design-is-not-a-figma-clone.-its-the-missing-first-half-of-your-agentic-stack./
Date: 2026-04-23
Updated: 2026-04-23
Tags: Claude, Anthropic, design, prototyping, Claude Code, agentic workflows, Opus 4.7
Categories: AI Tools, Agentic Workflows
Summary: Anthropic's Claude Design launched April 17 as a research preview. It's not a Figma alternative — it's the upstream half of the Claude Code shipping pipeline, and the handoff mechanism changes the conversation entirely.
Anthropic launched Claude Design on April 17 and the immediate take from most of the tech press was predictable: "Anthropic challenges Figma." Headlines positioned it as yet another AI image and design tool competing for budget against Canva, Adobe Express, and Figma's own AI features.
That framing misses the point entirely.
Claude Design is not trying to be a better design tool. It's closing the last remaining gap between "I have an idea" and "Claude Code is building it." The handoff mechanism is the entire product.
## What Launched
Claude Design ships as an Anthropic Labs research preview, available to Pro ($20/month), Max ($100/month), Team, and Enterprise subscribers at no additional charge. It's powered by Claude Opus 4.7, Anthropic's most capable vision model, and the interface is conversational: you describe what you need, Claude generates a version, and you refine it through chat, inline comments, and direct edits.
The output types cover interactive prototypes, slide decks, one-pagers, and marketing collateral — the visual artifacts that move between designers, PMs, and engineers on a typical feature cycle. Export options include Canva integration, PDF, PPTX, and standalone HTML.
None of that is remarkable on its own. What's remarkable is what happens when the design is done.
## The Handoff: Why This Is Different
When you're satisfied with a prototype in Claude Design, you press one button. Claude packages everything — React component structure, design tokens (colors, typography, spacing), copy, interaction notes, and asset references — into an implementation bundle. You pass one instruction to Claude Code. Your agent reads the bundle natively and starts building.
No Figma-to-Jira ticket. No "here's a screenshot, try to match it." No designer writing a spec document that the developer interprets differently. The design intent and the production codebase are in the same semantic space, because both sides speak Claude.
[VentureBeat's coverage](https://venturebeat.com/technology/anthropic-just-launched-claude-design-an-ai-tool-that-turns-prompts-into-prototypes-and-challenges-figma) called it "a fluent, continuous pipeline from idea to shipped feature." That's accurate, but understates the magnitude. Every other AI design tool — Figma Make, Lovable, Galileo — stops at the prototype. You still have to translate. Claude Design eliminates the translation step because the next tool in the chain is another Claude.
Teams at Brilliant reported compressing 20+ prompts in competing tools down to 2 in Claude Design. Datadog described collapsing a week-long brief-mockup-review cycle into a single conversation. These are early-preview numbers and should be treated as directional, but the direction is unambiguous.
## How the Workflow Actually Works
The workflow has three phases.
**Onboarding:** Claude Design reads your codebase and existing design files to build a design system — brand colors, typography scale, spacing tokens, component patterns. Every subsequent project inherits this automatically. There's no import step, no token spreadsheet to maintain.
**Creation:** Describe what you need. Claude generates a first version. You refine via natural language chat or inline comments on specific elements. The editing model is intentionally collaborative rather than precise; you're art-directing, not tweaking pixels. If you need pixel-level control, you still need Figma.
**Handoff:** When the design is ready, Claude packages it as an implementation bundle and surfaces a one-click action to open a new Claude Code session with the bundle pre-loaded. From there, the agent has everything it needs to implement without asking clarifying questions.
This is where the design system investment compounds. Because Claude read your codebase during onboarding, the generated components match your existing architecture, import your existing tokens, and slot into your component library rather than generating parallel one-offs.
The [Claude Design to Claude Code handoff documentation](https://claudefa.st/blog/guide/mechanics/claude-design-handoff) describes the bundle format in detail for teams that want to automate the pipeline or integrate it into CI workflows.
## Who This Is For
Claude Design is explicitly not for professional designers doing production work. The tool offers no vector editing, no auto-layout in the Figma sense, no multi-page component library management. If your workflow lives in Figma's design mode, you're not the target user.
The target user is the engineer or PM who currently spends two weeks in Figma limbo waiting for design handoffs before Claude Code can start — or who pastes screenshots into the chat and hopes the agent infers the intent correctly. The target team is the startup with no dedicated designer where the founder is copy-pasting from a Framer template and hoping it looks professional enough.
For those users, Claude Design doesn't need to match Figma on features. It needs to be good enough to unblock Claude Code, and from the early reports, it clears that bar in most cases.
## What This Does to the Competitive Landscape
Figma is not threatened by Claude Design's design capabilities. Figma's designers-using-Figma market is safe. But Figma Make — Figma's own AI prototype-to-code feature — is now competing against a system where the design phase and the code phase are natively integrated rather than bridged by an export.
Lovable, Bolt, and Replit all operate in the "prototype to deployed app" space and will feel this more acutely. Their value proposition is speed from idea to running code; Claude Design plus Claude Code competes directly on that axis, with the added advantage that the resulting code is production-grade Claude Code output rather than a vibe-coded scaffold.
The more interesting disruption is what this signals about Anthropic's product ambitions. Anthropic built a $380B valuation on API revenue. Claude Design is the company's first serious step into application-layer territory — and it was designed to feed directly back into the agentic development pipeline, not to stand alone.
## The Bottom Line
Claude Design is worth trying today if you frequently find yourself giving Claude Code a screenshot and asking it to "match this." The handoff mechanism will save you the back-and-forth.
But the bigger story is architectural. Anthropic now has products that cover the complete journey from "I need a feature" to "that feature is in production": Claude Design for the visual definition, Claude Code for the implementation. No other company has that end-to-end loop natively integrated.
That's not a design tool. That's the foundation of an agentic development platform.
---
**Sources:**
- [Introducing Claude Design by Anthropic Labs](https://www.anthropic.com/news/claude-design-anthropic-labs) — Anthropic official announcement
- [Anthropic just launched Claude Design, an AI tool that turns prompts into prototypes and challenges Figma](https://venturebeat.com/technology/anthropic-just-launched-claude-design-an-ai-tool-that-turns-prompts-into-prototypes-and-challenges-figma) — VentureBeat
- [Claude Design to Claude Code: AI Design Handoff](https://claudefa.st/blog/guide/mechanics/claude-design-handoff) — claudefa.st
- [Anthropic launches Claude Design, a Figma and Canva rival built on Claude](https://thenewstack.io/anthropic-claude-design-launch/) — The New Stack
- [What Claude Design actually changes for designers](https://medium.com/design-bootcamp/what-claude-design-actually-changes-for-designers-0c5b04fae343) — Medium / Design Bootcamp
- [Claude Design vs Figma: Can Anthropic's New Tool Replace Your Design Stack?](https://www.mindstudio.ai/blog/claude-design-vs-figma) — MindStudio
---
# Salesforce Headless 360: The World's Largest CRM Just Became an MCP Server
URL: https://sdd.sh/2026/04/salesforce-headless-360-the-worlds-largest-crm-just-became-an-mcp-server/
Date: 2026-04-22
Updated: 2026-04-22
Tags: MCP, Salesforce, enterprise, agentic workflows, Claude Code, AI agents, developer tools
Categories: AI Tools, Agentic Workflows, Industry
Summary: At TDX 2026, Salesforce shipped 60+ MCP tools and 30+ coding skills under the 'Headless 360' banner, making every corner of its platform natively callable from Claude Code, Cursor, Codex, and Windsurf. When the world's largest CRM goes headless for AI, the enterprise software landscape just shifted.
When a company with 150,000+ customers and over $34 billion in annual revenue decides its platform should be operated by AI agents rather than human browsers, it is not a preview. It is a production mandate.
At its annual TDX developer conference in San Francisco on April 16, Salesforce announced Headless 360 — a sweeping initiative that exposes every capability in the Salesforce platform as an API, MCP tool, or CLI command. The premise is direct: your AI coding agent should be able to reach your CRM data, customer workflows, and business logic the same way it reaches your file system.
Over 60 new MCP tools and 30 preconfigured coding skills shipped at launch, with live access to Salesforce data and workflows already compatible with Claude Code, Cursor, Codex, and Windsurf.
## What "Headless" Actually Means Here
The term is borrowed from headless CMS architecture, where the content layer decouples from the presentation layer. Salesforce is applying the same pattern to its entire platform: decouple the business logic, data, and workflow engine from the web browser that traditionally sits in front of it.
The result is that everything in Salesforce — customer records, opportunity pipelines, case management, Apex code execution, workflow automation — becomes callable through standard interfaces an AI agent can consume. MCP tools for the agentic layer. REST/GraphQL APIs for programmatic access. CLI commands for terminal-native workflows.
For developers, this is significant: you can now write a Claude Code prompt that reads a Salesforce account record, checks open opportunities, queries recent case history, and writes a pre-call briefing document — without opening a browser, without copy-pasting from the CRM, without breaking context.
## The MCP Signal at Enterprise Scale
The technology enabling this — the Model Context Protocol — hit 97 million downloads by March 2026. It has been adopted by Anthropic, OpenAI, Google DeepMind, and Microsoft. It shipped as the connective layer in Pinterest's 66,000-invocation production deployment, the Lucidworks enterprise search integration, and dozens of other case studies.
Salesforce's adoption is different in kind, not just degree. Salesforce is not a developer tool company building a niche integration. It is the enterprise software backbone for a significant fraction of global commercial operations. When Salesforce ships MCP tools as a first-class feature at its developer conference, every systems integrator, consulting partner, and enterprise architect in the Salesforce ecosystem is now on notice that MCP is the interface they should be building to.
The standards question that was still being debated in 2025 — is MCP a real protocol or a Anthropic-adjacent experiment? — is now settled. You do not ship 60 MCP tools to 150,000 enterprise customers as a hedge.
## What Shipped on Day One
The Headless 360 launch includes:
- **60+ MCP tools** covering the full Salesforce data model: contacts, accounts, opportunities, cases, workflows, Apex execution, and more
- **30+ preconfigured coding skills** available directly in Claude Code, Cursor, Codex, and Windsurf via the Agentforce developer toolkit
- **Agentforce Vibes 2.0** with support for Claude Sonnet 4.6 and GPT-5.4 as underlying reasoning models
- **DevOps Center MCP** for AI-driven deployment management and change set automation
- **Session Tracing** for observability into what AI agents are doing inside the Salesforce platform
- **Agentforce Experience Layer** for embedding AI-accessible Salesforce capabilities in custom applications
Features scheduled for May–June rollout include the Testing Center (AI-driven test generation for Apex and Flow) and the Salesforce Catalog (unified discovery for all available MCP tools and skills).
## Why This Matters for Agentic Coding Workflows
Most enterprise developers today maintain a split-brain workflow: AI coding tools live in the terminal and IDE, while business context lives in the CRM and ticketing systems. The workflow looks like this: open Salesforce, find the relevant account, copy information into a document or comment, switch to the IDE, start coding. Context leaks at every handoff.
Headless 360 collapses that workflow. A developer building a customer-facing integration can now give Claude Code a single instruction — "build a support ticket escalation handler that pulls the customer's contract tier from Salesforce and routes P1 issues to the enterprise team" — and the agent can read the actual contract tier data, understand the actual workflow logic, and generate code that interfaces with the live system. No mock data. No placeholder API calls. No manually transcribed schema.
This is what the agentic coding stack looks like when it matures: not an AI that writes code about a system, but an AI that writes code against a system it can actually observe.
## The Competitive Pressure Underneath
Salesforce's move also reflects competitive anxiety. ServiceNow, HubSpot, and SAP have all shipped varying degrees of AI-native developer tooling in 2026. The Salesforce developer ecosystem — historically sticky but notoriously friction-heavy — risks losing the next generation of enterprise developers to platforms that feel natively AI-accessible.
Headless 360 is partly an answer to that threat. By making Salesforce as easy for Claude Code to operate as a filesystem, Salesforce is betting that the depth of its data model and the breadth of its workflow engine are durable advantages — if only you can get AI agents to actually use them.
That bet is reasonable. Enterprise AI deployments are not won on benchmark scores; they're won on how much existing business logic the AI can reason over. Salesforce has 20+ years of customer data models, workflow logic, and integration surface area. If an AI agent can traverse all of it natively, that depth becomes a moat.
## What This Changes for Claude Code Users
Practically speaking, if you work in an organization that runs on Salesforce, Headless 360 changes what you can ask Claude Code to do:
- Audit and refactor existing Apex code while checking live schema definitions
- Generate integration tests that run against real Salesforce sandbox data
- Build automations that read Salesforce state as part of their decision logic
- Write deployment scripts using DevOps Center MCP for safe change management
The 30 preconfigured coding skills are especially interesting — these are not raw API wrappers but opinionated, tested patterns for common Salesforce development tasks, designed to be consumed directly by an AI coding agent without requiring the agent to discover the right API surface on its own.
## The Enterprise MCP Flywheel
There is a compounding dynamic worth naming. Every major enterprise platform that ships MCP tools makes the MCP ecosystem more valuable, which accelerates adoption by more platforms, which in turn makes agentic coding tools more capable across more domains. Salesforce at 150,000 customers is a significant flywheel input.
The MCP Dev Summit NYC in April surfaced authentication as the critical unsolved problem — OAuth mix-up attacks, token scoping, enterprise identity management. Salesforce's production deployment will stress-test these issues at scale. How they handle session tracing, access controls, and audit logging for AI agent actions inside Salesforce will likely become the reference architecture for enterprise MCP deployments broadly.
The browser is not going away. But the assumption that enterprise software is fundamentally a human-operated web interface — that assumption is being systematically dismantled, one MCP server at a time.
---
*Sources: [Salesforce Official Announcement](https://www.salesforce.com/news/stories/salesforce-headless-360-announcement/), [VentureBeat](https://venturebeat.com/technology/salesforce-launches-headless-360-to-turn-its-entire-platform-into-infrastructure-for-ai-agents/), [The Register](https://www.theregister.com/2026/04/15/salesforce_headless_360/), [CIO](https://www.cio.com/article/4159536/salesforce-launches-headless-360-to-support-agent-first-enterprise-workflows.html), [VARIndia](https://www.varindia.com/news/Salesforce-opens-ecosystem-to-external-AI-agents-with-%E2%80%98Headless-360%E2%80%99)*
---
# Anthropic Tests Pulling Claude Code From Pro — And Gets an Instant Lesson in Developer Trust
URL: https://sdd.sh/2026/04/anthropic-tests-pulling-claude-code-from-pro-and-gets-an-instant-lesson-in-developer-trust/
Date: 2026-04-22
Updated: 2026-04-22
Tags: Claude Code, Anthropic, pricing, developer tools, Pro plan, Max plan
Categories: AI Tools, Industry
Summary: On April 22, Anthropic quietly removed Claude Code from its $20 Pro plan — then called it an A/B test when developers noticed. The pricing logic is sound; the execution is another episode in a troubling pattern.
On Tuesday, April 22, developers started comparing notes. The Anthropic support page that used to say "Using Claude Code with your Pro or Max plan" now read "Using Claude Code with your Max plan." The Pro plan's feature list had swapped its checkmark for an explicit X next to Claude Code. No announcement. No email. No changelog entry.
Within hours, the backlash was loud enough that Amol Avasare, Anthropic's head of growth, posted a social media clarification: it was "a small test on ~2% of new prosumer signups." Existing Pro and Max subscribers, he emphasized, were unaffected.
The test may have been small. The documentation rewrite was comprehensive.
## What Actually Changed
The short version: as of April 22, new Pro subscribers ($20/month) appear to lose access to Claude Code. Access now starts at Max 5x — $100/month.
The longer version is that this was probably inevitable. Avasare's post acknowledged it directly: "when Max launched a year ago, it didn't include Claude Code, but since then we bundled Claude Code into Max and it took off after Opus 4." The compute math is straightforward. Every Claude Code session chains together dozens of model calls — context reads, edits, file writes, test runs — across a full agent loop. A single focused coding session can burn what a month of casual Claude chat costs. Offering that at $20/month was, in retrospect, a promotional loss leader.
The Register noted that Anthropic has "recently struggled to keep up with demand for its AI models," and that quota exhaustion complaints have been a recurring theme since late 2025. Bundling Claude Code into Pro without adjusting the plan's compute budget was always going to create friction.
## The Real Problem Isn't the Price
The jump from $20 to $100 is real money, and some developers will legitimately feel priced out. But the market for serious agentic coding tools was never going to live at $20/month indefinitely. At $100, you're still paying less than a single hour of a mid-level contractor.
The actual problem is the execution pattern. This is not Anthropic's first quiet change that required community pressure to surface:
- The "effort" default controversy in mid-April documented a silent downgrade to medium thinking depth that affected 73% of sessions before anyone noticed
- The OpenClaw ban was announced without a migration path, then partially walked back
- Quota limits have tightened multiple times without proactive communication to paying users
Developers who depend on Claude Code for production work — people for whom it has genuinely become load-bearing infrastructure — are accumulating a list of instances where Anthropic moved quietly on things that mattered. Trust is not built from benchmark scores. It's built from predictability, transparency, and treating your users as partners rather than test subjects.
## What "A/B Test" Actually Means Here
An A/B test of pricing is a normal, reasonable thing to run. Showing 2% of new sign-ups a different pricing tier to measure conversion and retention is standard product practice.
What is not standard is simultaneously updating every public-facing documentation page to reflect the experimental state as if it were the new default. When the support documentation, the pricing page, and the feature comparison table all change — while the communication strategy is "say nothing and see what happens" — that's not an A/B test. That's a rollout with a plausible deniability clause.
Developers noticed because they pay close attention to tools they depend on. The Wayback Machine is public. Changelog diffs are visible. The community watches.
## The Max Plan Is Actually the Right Home for Claude Code
Setting aside the communication failure, the pricing structure that emerges from this test makes sense. Claude Code is not a chat add-on. It is a software development platform that:
- Runs multi-step agent loops across entire codebases
- Invokes Claude Opus 4.7 (which is itself priced at $5/$25 per million tokens) for reasoning-heavy tasks
- Integrates with MCP servers, local tools, CI/CD pipelines, and git worktrees
- Requires persistent context and multi-session management
The Max plan at $100/month includes higher usage limits, Opus access by default, and the enterprise-adjacent features (auto mode, xhigh effort, Routines) that serious users actually want. For anyone using Claude Code more than a few hours per week, the Max plan is not an upsell — it's the right product.
The Max 20x tier at $200/month exists for teams that need it. Neither of those price points is unreasonable for a tool that demonstrably ships production software.
The Pro plan at $20 was always better understood as Claude-the-assistant: chat, analysis, writing, document work. Bundling a full agentic coding environment into that tier was generous. It was probably also strategic — to grow adoption. That strategy worked. Claude Code now represents a significant fraction of Anthropic's revenue, and the community it built is large and vocal enough to notice when pricing pages change overnight.
## What to Watch
Avasare's framing as a test implies a decision hasn't been finalized. A few scenarios worth tracking:
**If the change sticks**: Anthropic is signaling that Claude Code is a $100+ product. Developers who've been casual users will have to decide whether it earns that spend. For most professionals who are actually shipping with it, it does.
**If it gets rolled back**: Anthropic faces the harder question of how to make the economics work at $20 while demand scales. Tighter per-session limits on Pro are the obvious answer — but require the kind of transparent communication that hasn't been Anthropic's strong suit lately.
**The competitive angle**: Every hour Claude Code is associated with pricing uncertainty is an hour that OpenAI Codex Desktop, GitHub Copilot Autopilot Mode, and Cursor 3 are running their own marketing. The developer audience watching this episode is the same one those products want to convert.
Anthropic has the best agentic coding product on the market. The Stanford AI Index data, the SWE-bench results, the JetBrains survey loyalty metrics — they all point the same direction. The company's biggest recurring risk is not a competitor closing the technical gap. It's an accumulation of quiet-change incidents that erode the trust of the developers who chose Claude Code precisely because they wanted something better.
The pricing test, in isolation, is defensible. The pattern surrounding it is what needs to change.
---
*Sources: [The Register](https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/), [Ed Zitron / Where's Your Ed At](https://www.wheresyoured.at/news-anthropic-removes-pro-cc/), [The New Stack](https://thenewstack.io/anthropic-claude-code-limits/), [AIToolly](https://aitoolly.com/ai-news/article/2026-04-22-anthropic-reportedly-removes-claude-code-from-20-monthly-pro-subscription-tier), [XDA Developers](https://www.xda-developers.com/anthropic-cut-claude-code-new-pro-subscriptions/), [Startup Fortune](https://startupfortune.com/anthropics-decision-to-lock-claude-code-behind-a-100-tier-is-already-pushing-developers-toward-openai/)*
---
# The Stanford AI Index 2026 Is Out. The Skeptics Are Out of Arguments.
URL: https://sdd.sh/2026/04/the-stanford-ai-index-2026-is-out.-the-skeptics-are-out-of-arguments./
Date: 2026-04-21
Updated: 2026-04-21
Tags: AI benchmarks, agentic coding, SWE-bench, AI research, software engineering
Categories: AI Tools, Industry
Summary: Stanford HAI's 423-page 2026 AI Index dropped April 13. The numbers on agentic coding are not subtle: SWE-bench Verified jumped from 60% to near 100% of human baseline in a single year. Here's what the data actually means for working engineers.
Every April, Stanford's Institute for Human-Centered Artificial Intelligence drops its annual AI Index — a dense, footnoted audit of where artificial intelligence actually stands, stripped of vendor PR. The [2026 edition](https://hai.stanford.edu/ai-index/2026-ai-index-report) runs 423 pages, pulls from Epoch AI, McKinsey, LinkedIn, GitHub, and dozens of other sources across 15 chapters and 9 thematic domains. It was released on April 13.
For software engineers, the coding and agentic sections aren't just interesting. They're uncomfortable reading if you've been dismissing AI coding tools as "glorified autocomplete."
The skeptics are out of arguments.
## The SWE-bench Number That Should Scare You
SWE-bench Verified tests AI models on real software engineering tasks drawn from GitHub issues — actual bugs and feature requests from production codebases, not synthetic toy problems. It's the benchmark that's hardest to game because it requires understanding context, navigating unfamiliar code, and producing a working patch.
In 2024, the best scores hovered around 60% of the human baseline. By early 2026, top models — [Claude Opus 4.5 at 80.9%, Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%](https://spectrum.ieee.org/state-of-ai-index-2026) — are near 100%.
That's not incremental progress. That's a cliff. One year. One benchmark category. Near-complete convergence to human-level performance on structured software engineering tasks.
To understand why this matters, consider what SWE-bench Verified actually requires: reading an issue description, locating the relevant code, understanding the intended behavior, writing a fix, and having it pass tests. That's not autocomplete. That's a junior developer's full job description for a typical bug ticket.
## The Broader Agentic Picture
SWE-bench is just one data point. The 2026 Index tracks agentic performance across several domains:
**OSWorld** — tests AI agents on real computer tasks across operating systems (browsing, file management, application use). In 2024, top agents scored around 12%. In early 2026: [66.3%](https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance). That puts AI agents within 6 percentage points of human baseline performance on arbitrary computer use.
**WebArena** — tests autonomous web agents on multi-step tasks requiring navigation and decision-making. In 2023: 15%. In early 2026: [74.3%](https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance). Within 4 percentage points of human performance.
**Cybersecurity benchmarks** — AI agents solving security problems went from 15% in 2024 to [93% in 2026](https://www.unite.ai/stanford-ai-index-2026-reveals-a-field-racing-ahead-of-its-guardrails/). That's a 6x improvement in two years.
The pattern is consistent across every domain where tasks are structured and digital: AI agents went from novelty to near-human performance in 12–24 months. This isn't a trend line you can comfortably extrapolate past. Something changed.
## What Actually Changed
The 2025-to-2026 jump wasn't driven by a single breakthrough. It was accumulated infrastructure: longer context windows, better tool use, improved agentic loop design, and — critically — scaffolding. SWE-bench scores above 80% aren't achieved by dropping a model into a chat interface. They require orchestration: the model gets access to a shell, file system, test runner, and the ability to iterate. That's exactly the architecture Claude Code, Devin, and OpenAI's Codex agents implement.
The lesson: raw model capability matters, but the scaffolding that lets the model act, observe, and correct is what converts capability into results. A Claude Opus 4.6 in a well-designed agentic loop outperforms a marginally more capable model in a chatbot interface.
This is why the terminal-native, tool-rich agentic model — Claude Code's architecture — beats the IDE assistant model on hard tasks. Not because of the model, but because of the feedback loop.
## The Productivity Numbers
The Index also synthesizes productivity research. The headline: [26% gains in software development tasks](https://www.technologyreview.com/2026/04/13/1135675/want-to-understand-the-current-state-of-ai-check-out-these-charts/) on average, though with significant variance.
That 26% is both encouraging and misleading. The gains are most pronounced on structured, well-defined tasks — bug fixes, test generation, boilerplate, code translation. For architectural decisions, ambiguous requirements, and cross-cutting refactors, the productivity signal is weaker. This isn't a knock on AI tools; it's a map of where they're useful now.
The implication for how you use Claude Code: double down on agentic workflows for defined tasks (implement this spec, fix this bug, write tests for this module). Don't expect the same leverage on open-ended design work — yet.
## The "Jagged Frontier"
The report uses a phrase worth stealing: the "jagged frontier." AI capability isn't a uniform surface. It's a jagged terrain where performance on one task type tells you almost nothing about performance on a neighboring task type.
An agent that patches GitHub issues at near-human accuracy can still fumble a loosely specified "make this better" prompt. A model that generates production-ready SQL can hallucinate an API that doesn't exist. The jagged frontier means the 80% SWE-bench number and the "my AI wrote broken code" experience can both be true simultaneously — for different task shapes.
This is the right mental model for working engineers: AI coding tools aren't uniformly good or bad. They're exceptionally good at specific task shapes and unreliable outside them. Your job is to route the right tasks to the agentic layer. That routing judgment — understanding what to delegate — is increasingly the core engineering skill.
## What This Means for Working Engineers
The Stanford numbers confirm what early adopters already know and skeptics keep dismissing: the transition isn't coming. It's here, it's measurable, and it's accelerating.
But the Index's data also reframes the anxiety. Engineers who worry "AI will replace me" are asking the wrong question. The Index documents productivity *amplification*, not replacement. 26% faster output on structured tasks means more shipping, not fewer engineers — at least for now, at current capability levels.
The real risk isn't replacement. It's the gap between engineers who've built the routing judgment — who know which tasks to hand the agent, how to spec them, how to verify the output — and those who haven't. That gap is widening. The 2026 Index is a timestamp on when it became unambiguous.
The benchmark that says AI agents are within 4-6 percentage points of human performance on web tasks and near 100% on structured software engineering is not a prediction. It's a measurement of what happened between April 2025 and April 2026.
If you're still treating AI coding tools as optional productivity experiments, the Stanford AI Index 2026 is a useful document for recalibrating that position.
---
**Sources:**
- [The 2026 AI Index Report — Stanford HAI](https://hai.stanford.edu/ai-index/2026-ai-index-report)
- [Stanford's AI Index for 2026 Shows the State of AI — IEEE Spectrum](https://spectrum.ieee.org/state-of-ai-index-2026)
- [Want to understand the current state of AI? — MIT Technology Review](https://www.technologyreview.com/2026/04/13/1135675/want-to-understand-the-current-state-of-ai-check-out-these-charts/)
- [Stanford AI Index 2026 Reveals a Field Racing Ahead of Its Guardrails — Unite.AI](https://www.unite.ai/stanford-ai-index-2026-reveals-a-field-racing-ahead-of-its-guardrails/)
- [Technical Performance Chapter — Stanford HAI 2026 AI Index](https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance)
---
# Scaling Claude Code Skills Across an Engineering Org
URL: https://sdd.sh/2026/04/scaling-claude-code-skills-across-an-engineering-org/
Date: 2026-04-21
Updated: 2026-04-21
Tags: Claude Code, agentic-coding, workflow, skills, developer-experience
Categories: Guides, Case Studies
Summary: You gave Claude Code to 40 engineers. Now everyone's writing their own prompts, their own workflows, their own shortcuts. Here's how one team turned that chaos into a shared skill marketplace — and what they learned building it.
You gave Claude Code to your engineering team. Within a week, someone wrote a brilliant prompt for reviewing PRs. Someone else figured out a great workflow for debugging Kubernetes pods. A third person built a scaffolding command that saves twenty minutes per new component.
None of them know the others exist.
This is the **skill fragmentation problem**, and it hits every organization that adopts AI coding tools beyond the "individual contributor experimenting" phase. The tools are powerful. The knowledge of *how to use them well* stays trapped in individual heads, scattered across personal dotfiles and Slack threads that no one will ever search.
A company I've been working with — ~40 engineers, 300+ repositories, a mix of React, Angular, Go, Kotlin, and Python — ran into this wall about six months ago. Their solution: a **shared skill marketplace** for Claude Code. This article is a deep dive into what they built, why they made the choices they did, and what they'd do differently.
## The Core Insight: Commands vs. Skills
The first architectural decision — and the one that shaped everything else — was separating **commands** from **skills**.
- **Commands** are actions. They do things: review a PR, scaffold a component, fix CI, create a JIRA issue. They're procedural, step-by-step, and they call tools.
- **Skills** are knowledge. They describe things: how the REST APIs are designed, what the database conventions are, how the frontend architecture works. They're reference material that Claude loads when it needs context.
This distinction matters because they have completely different lifecycles. Commands change when workflows change. Skills change when the platform changes. Mixing them means you're constantly updating action-oriented prompts because someone renamed a database column, or vice versa.
In practice, a command for reviewing a PR *references* skills about frontend conventions, architecture patterns, and security standards — but doesn't duplicate that knowledge. When the frontend team migrates from one pattern to another, they update the skill once and every command that references it gets the new context automatically.
```
plugins/
tech/
commands/
review.md # Action: review a PR
fix-ci.md # Action: fix CI failures
new-component.md # Action: scaffold a React component
skills/
react/
SKILL.md # Knowledge: React conventions
components.md # Knowledge: component patterns
state.md # Knowledge: state management
architecture/
SKILL.md # Knowledge: system architecture
services.md # Knowledge: service dependencies
```
## The Plugin Topology: Organize by Audience, Not Technology
The team tried organizing by technology first. It was a disaster. A "kubernetes" plugin and a "golang" plugin and a "react" plugin meant that when someone needed to deploy a Go service to Kubernetes, they needed three plugins loaded. Context windows aren't free.
The topology that worked: **organize by audience**.
- **`tech`** — Everything a developer needs day-to-day. Code review, CI, scaffolding, language conventions, architecture, infrastructure, observability. This is the big one: 26 commands, 19 skill categories.
- **`product`** — Domain knowledge. Data models, business modules, legacy system schemas, analytics platforms. Product managers and domain-adjacent engineers use these.
- **`setup`** — Onboarding and environment configuration. New engineer joins? Four commands get them from zero to productive.
- **`process`** — Cross-functional workflows. User offboarding, bulk operations, things that touch multiple systems.
The key metric: **how many plugins does a person need for their typical workday?** For most developers, the answer is one (`tech`) plus maybe one more. That's a manageable cognitive load and a reasonable context window footprint.
## Naming: The `scope:verb-target` Convention
With 30+ commands across four plugins, discoverability becomes a real problem. The team settled on a strict naming convention:
```
/:-
```
Examples:
- `/tech:review` — review code
- `/tech:fix-ci` — fix CI failures
- `/tech:new-component` — scaffold a new component
- `/tech:create-pr` — create a pull request
- `/tech:diagnose-pod` — diagnose a crashed pod
- `/setup:global` — set up the global dev environment
- `/process:reassign-user` — reassign a departing user's work
The scope prefix is the plugin name. The verb is always imperative. The target is what you're acting on. This means that even if you've never seen the command before, `/tech:fix-ci` is self-explanatory. You can guess that `/tech:new-hook` exists if `/tech:new-component` does.
Bad naming they rejected:
- `review-frontend-pr` (too specific — what about backend?)
- `kubernetes-pod-crash-debug` (too long, noun-oriented)
- `fix` (too vague — fix what?)
## The JIRA Workflow: Sequential Gates That Prevent Garbage
The most opinionated part of the marketplace is a four-stage workflow for going from JIRA issue to merged code:
```
/tech:validate → /tech:decompose → /tech:spec → /tech:implement
```
Each stage is a gate. You can't skip ahead.
**1. Validate** checks that the issue is complete, unambiguous, and in English. It looks for acceptance criteria, edge cases, and clear scope. If the issue says "improve the dashboard" with no further detail, it gets rejected with specific feedback on what's missing.
**2. Decompose** takes a validated Story and breaks it into implementable Tasks — backend, frontend, infrastructure. Each task gets a structured requirements template (FR-001 for functional requirements, SC-001 for scenarios). This is where a vague "add export feature" becomes three concrete tasks with specific acceptance criteria.
**3. Spec** creates a local specification file in the target repository (`specs/{ISSUE-KEY}.md`). This is the spec-driven development part: before writing any code, there's a committed, reviewable document that describes exactly what will be built, how it fits into the existing architecture, and what the verification criteria are.
**4. Implement** executes the spec with mandatory verification gates. It doesn't just write code — it validates that the spec exists, checks that tests pass, verifies that the implementation matches the requirements, and produces a structured checklist.
The key insight: **each gate catches a different class of error**. Validate catches ambiguity. Decompose catches scope creep. Spec catches architectural mismatches. Implement catches bugs. By the time code is written, three layers of AI-assisted review have already happened.
Does this slow things down? For trivial changes, yes — and that's fine. Not everything needs the full pipeline. But for any feature that touches multiple systems or takes more than a day, the upfront investment in validation and decomposition pays for itself ten times over in avoided rework.
## Skills as a Knowledge Base: Versioning What Claude Knows
Here's something that isn't obvious until you try to scale AI tools across an org: **Claude's effectiveness is directly proportional to the quality of the context it has about your specific platform**.
Generic Claude knows React. It doesn't know *your* React — your component patterns, your state management conventions, your design system API, your import ordering rules. Every engineer was re-explaining these things in every conversation.
Skills solve this by codifying platform knowledge into versioned, structured documents that Claude loads automatically when relevant. The tech plugin alone has 19 skill categories:
- **Language conventions**: React, Angular, Go, Kotlin, Python — not generic language guides, but conventions specific to the platform. Component structure. Testing patterns. Error handling approach.
- **Architecture**: Service dependency map, communication patterns, data sync via message queues, BFF removal strategy.
- **Infrastructure**: Helm chart conventions, Kubernetes cluster topology, CI/CD pipeline structure, observability setup.
- **Process**: Git branching conventions, PR title format, release workflow, environment management.
Each skill is a `SKILL.md` file with YAML frontmatter (for description/metadata) and `@file.md` references to supporting documents:
```yaml
---
description: React/TypeScript development guidelines for client-* repositories.
---
@components.md
@state-management.md
@testing.md
@styling.md
@imports.md
```
When Claude encounters a React file in one of the company's repositories, it loads the React skill and immediately knows that the team uses SCSS modules (not Tailwind), React Query for server state (not Redux), React Aria for accessibility primitives, and a specific 10-group import ordering. No engineer has to explain this. No prompt has to include it.
**The compounding effect**: once you have good skills, every command that references them gets better automatically. The code review command doesn't need its own copy of "how do we do React" — it references the React skill. When the frontend team updates the skill (say, adopting a new pattern), every downstream command benefits immediately.
## Code Review: Five Specialized Lenses
Instead of one monolithic "review this PR" command, the team built five specialized review commands that each look at the code through a different lens:
1. **`/tech:review`** — Full-spectrum review: quality, architecture, security, performance, conventions
2. **`/tech:check-arch`** — Architectural debt detection: provider sprawl, raw HTTP calls vs typed clients, state management anti-patterns
3. **`/tech:check-security`** — Security audit: XSS vectors, secrets in code, unsafe third-party scripts
4. **`/tech:check-a11y`** — Accessibility: semantic HTML, ARIA usage, keyboard navigation, visual compliance
5. **`/tech:check-perf`** — Performance: bundle size impact, rendering efficiency, data fetching patterns
Why five commands instead of one? Because **context window budget matters**. A full review loads all the skills. An accessibility check only needs the a11y standards and component conventions. Splitting them means faster, more focused reviews with less noise.
The architecture check (`/tech:check-arch`) is particularly interesting because it encodes *ongoing migration knowledge*. It knows that the team is migrating away from certain patterns, so it flags code that uses the old approach even if it's technically correct. This is institutional knowledge that would otherwise live only in senior engineers' heads.
## Scaffolding: Codegen That Knows Your Codebase
Five scaffolding commands generate new code that matches existing conventions:
- `/tech:new-component` — React component + SCSS module + Storybook story + test file
- `/tech:new-hook` — Custom React hook with tests
- `/tech:new-page` — Page component with route registration
- `/tech:new-slice` — Redux Toolkit slice with typed actions and tests
- `/tech:new-ds-component` — Design system component (for the shared library)
The crucial detail: these aren't templates. They're Claude Code commands that **read the existing codebase** to match the current style. If the last five components use a certain pattern, the scaffolded component follows that pattern. If the project recently migrated to a new approach, the scaffold picks it up because it reads real files, not a frozen template.
This is where the skill system shines. The `/tech:new-component` command references the React skill, which describes the *intended* architecture. So even if the codebase has a mix of old and new patterns (every real codebase does), the scaffold generates the *correct* new pattern, not a copy of the nearest existing file.
## Onboarding: Zero to Productive in Four Commands
The setup plugin is deceptively simple — four commands — but it encodes months of "hey, how do I set up my environment?" Slack conversations:
```bash
/setup:global # Git, Python, Go, workspace, credentials, env vars
/setup:backend # Java/Kotlin SDK, AWS, build tools, database access
/setup:claude # Register the marketplace, install plugins
/setup:check # Verify everything works
```
Each command is idempotent: run it again and it skips what's already done. Each command checks prerequisites before proceeding. Each command explains what it's doing and why.
The `/setup:check` command is the most valuable — it's a diagnostic that verifies every tool, credential, and configuration is correct. New engineer's build fails? `/setup:check` tells them exactly what's wrong and how to fix it.
Before these commands existed, onboarding took 1-2 days with significant hand-holding. Now it takes about an hour, mostly waiting for downloads.
## Distribution: One Source, Two Channels
The team wanted the same knowledge base available in two contexts:
1. **Claude Code** (in the terminal, with tool access)
2. **Claude.ai** (in the browser, for conversations and planning)
Claude Code uses the plugin system natively — the marketplace registers via `settings.json` and plugins are loaded on demand. But Claude.ai uses a different skill format (zip files uploaded to the admin console).
A build script (`sync-skills.sh`) bridges the gap: it reads the plugin skills, inlines the `@file.md` references into a single document, zips each skill, and outputs them to a `dist/` folder for upload. Same knowledge, different packaging.
```bash
# Convert Claude Code skills → Claude API skill zips
./scripts/sync-skills.sh
# Output: dist/skills/tech-react.zip, dist/skills/tech-architecture.zip, ...
```
This means a product manager can ask Claude.ai about the data model and get the same quality answer that a developer gets in Claude Code, because they're drawing from the same skill.
## Quality Enforcement: Hooks and Gates
Two mechanisms prevent the marketplace from rotting:
**Pre-commit hook**: If someone changes any plugin file without running `/refresh` (which regenerates the README, updates marketplace.json versions, and validates the structure), the commit is rejected. This ensures the catalog is always in sync with reality.
**Plugin validation**: Before merging a new command or skill, a validation step checks:
- File structure matches conventions
- YAML frontmatter is valid
- Referenced files exist
- Naming follows `scope:verb-target`
- Description is present and meaningful
The `/refresh` command is itself a Claude Code command — it reads all plugins, regenerates the README catalog, updates version numbers, and validates the marketplace manifest. Dog-fooding at its finest.
## What They Got Wrong
**Over-engineering the JIRA workflow initially.** The first version had six stages instead of four, including a separate "estimation" step and a "design review" step. Engineers bypassed them constantly. The team cut it to four stages that each provide clear, immediate value.
**Not investing in skills early enough.** They built commands first and skills second. This meant early commands had platform knowledge baked directly into their prompts — duplicated, inconsistent, and hard to update. The refactor to extract skills was painful but transformative.
**Underestimating the maintenance burden.** 26 commands and 19 skill categories is a lot of surface area. When the platform changes (new service, new convention, deprecated pattern), multiple skills may need updates. The team is considering automated staleness detection — diffing skills against actual codebase patterns — but hasn't built it yet.
**Making the tech plugin too big.** 26 commands in one plugin means a fat manifest. They should have split it into `tech-review`, `tech-scaffold`, `tech-workflow`, and `tech-ops` from the start. Restructuring now would break everyone's muscle memory, so they live with it.
## The Numbers
Six months in:
- **4 plugins**, **31 commands**, **25 skill categories**
- **~40 engineers** using the marketplace daily
- **300+ repositories** with marketplace integration via shared settings
- Onboarding time: **~2 days → ~1 hour**
- Code review coverage: from "when seniors have time" to "every PR, five dimensions"
- Scaffolding consistency: no more "which component do I copy from?"
The hardest metric to quantify but the most impactful: **reduction in repeated questions**. When Claude knows the platform conventions, engineers stop asking each other "how do we do X here?" They ask Claude, and Claude gives a correct, consistent answer because it's drawing from the same curated knowledge base.
## How to Start
You don't need 31 commands. Start with three:
1. **One review command** that encodes your team's code review standards. This gives you immediate, daily value and forces you to articulate what "good code" means in your context.
2. **One skill document** that describes your most common technology conventions (React patterns, API design, database conventions — whatever your team argues about most). This forces you to separate knowledge from actions.
3. **One setup command** that automates whatever new engineers always get wrong. This gives you a forcing function to document tribal knowledge.
Build these three, use them for a month, and you'll see exactly where to expand next. The architecture — plugins, scoped naming, commands vs. skills — can come later. The important thing is to stop letting AI knowledge fragment across individual engineers and start treating it as shared infrastructure.
Because that's ultimately what this is: **infrastructure for AI-assisted development**. Not the code. Not the prompts. The *knowledge layer* that makes both useful at organizational scale.
---
# Five Claude Code Features That Don't Make Headlines But Change Everything
URL: https://sdd.sh/2026/04/five-claude-code-features-that-dont-make-headlines-but-change-everything/
Date: 2026-04-21
Updated: 2026-04-21
Tags: Claude Code, code review, agentic workflows, developer productivity, Anthropic
Categories: AI Tools, Guides
Summary: The benchmark releases get the press. The unglamorous power-user features don't. Here's what /ultrareview, auto mode for Max, xhigh effort, /recap, and the new prompt caching TTL controls actually change about your daily Claude Code workflow.
Every time Anthropic drops a new benchmark score, the AI press covers it. Every time Anthropic quietly ships five features that change how you actually work in Claude Code, crickets.
April 2026 was one of those quiet months. While the coverage focused on [Opus 4.7's SWE-bench numbers](https://www.anthropic.com/news/claude-opus-4-7), Anthropic shipped a cluster of power-user features that matter more to your daily workflow than any leaderboard position. Here's what dropped and what it actually means.
## /ultrareview: A Skeptical Senior Engineer in One Command
The headliner of the batch is [`/ultrareview`](https://x.com/claudeai/status/2044785266590622185). The description from Anthropic is accurate: it "runs a dedicated review session that reads through your changes and flags what a careful reviewer would catch."
What that means in practice: rather than asking Claude to review your changes in the middle of an active session (where it's sharing context budget with the implementation work it just did), `/ultrareview` spins up a separate, focused review session. It examines your current branch or a specific GitHub PR with a structured protocol: architecture, logic correctness, security, performance, and maintainability in a single pass, run at maximum effort.
The key distinction from `claude "review my code"` is the isolation. A model reviewing its own recent work in the same session has a well-documented blind spot — it tends to validate what it just built. `/ultrareview` sidesteps this by starting fresh, with no implementation context to defend.
Pro and Max users get 3 free `/ultrareview` passes per billing cycle. Beyond that, it runs at standard Opus 4.7 rates. Given that a thorough manual review of a non-trivial changeset takes 30–60 minutes, the math on `/ultrareview` is favorable even at standard token prices.
**When to use it:** Before merging a PR you're not fully confident in. After a large refactor. Any time you'd normally ask a colleague "can you take a quick look at this?" — but you want more than a quick look.
## Auto Mode for Max Subscribers
[Auto mode](https://www.threads.com/@claudeai/post/DXMil0xjnSh/) was previously available only on Teams, Enterprise, and API plans. As of this April release, it's extended to Max subscribers — and it no longer requires the `--enable-auto-mode` flag.
Auto mode lets Claude proceed through longer tasks without stopping to request confirmation at every decision point. The model still respects explicit safety gates, but routine judgment calls — "should I rename this variable?" "should I split this function?" — don't generate interruption prompts.
For developers who were already on Teams or Enterprise, nothing changes. For Max subscribers who've been running Claude Code in the default interactive mode and found themselves constantly babysitting it through multi-step tasks, this is a material change. You can hand Claude a well-scoped implementation task, go do something else, and return to finished work.
The tradeoff is the same one that's always existed: auto mode requires you to be precise about what you want upfront. A loose prompt in auto mode produces a confident answer to the wrong question. The spec-first discipline that makes agentic workflows effective — write what you want, not how to do it — is more important, not less, when the agent is running without interruption.
## xhigh Effort: Finer-Grained Control on Hard Problems
Opus 4.7 introduced a new effort level: `xhigh`. It sits between `high` and `max`, giving you more thinking budget per request than high without the full latency and cost ceiling of max.
The practical use case: problems that are too complex for `high` effort to handle reliably, but where you don't need the full extended reasoning pass that `max` triggers. Architectural questions, algorithmic problems, complex debugging — situations where you want the model thinking harder, but you don't need to wait for the maximum-length chain of thought.
Combined with the [`/effort` command](https://releasebot.io/updates/anthropic/claude-code), which lets you set effort level per-session or per-task, you now have four rungs on the effort ladder: `low`, `high`, `xhigh`, `max`. The default for Opus 4.7 sessions is `xhigh`.
If you've been defaulting to max effort on everything and wondering why costs are higher than expected, switching to `xhigh` for most tasks and reserving `max` for genuinely hard problems is an easy optimization.
## /recap: Context Recovery Without Starting Over
Anyone who's left a long Claude Code session and come back to it knows the problem: the model has lost the thread of what was happening. Recapping manually is tedious. Starting over wastes the prior session's context.
The new [`/recap` command](https://claudefa.st/blog/guide/changelog) generates a summary of what happened in the session — what was built, what decisions were made, what's in progress — and injects it as context. You can invoke it manually with `/recap`, set it to run automatically when returning to a session via `/config`, or both.
This is more useful than it sounds. Multi-hour implementation sessions accumulate context that's genuinely hard to reconstruct: why a particular approach was chosen, what was tried and abandoned, which files were touched. `/recap` surfaces that context on demand rather than forcing you to scroll through the session history.
The feature integrates with the [desktop app's multi-session sidebar](https://www.macrumors.com/2026/04/15/anthropic-rebuilds-claude-code-desktop-app/): you can see a session summary in the sidebar before you resume it, which helps with the "which session was I working on this in?" problem.
## Prompt Caching TTL Controls
This one is for API and infrastructure users, but it's worth knowing.
Claude Code now supports two new environment variables for prompt cache TTL:
- `ENABLE_PROMPT_CACHING_1H` — opt into a 1-hour cache TTL for API key, Bedrock, Vertex, and Foundry
- `FORCE_PROMPT_CACHING_5M` — force a 5-minute TTL, overriding the default
The default prompt cache TTL has been 5 minutes. For most interactive sessions, that's fine — you're actively working and the cache stays warm. But for scheduled automations, long agentic runs, or workflows where Claude Code pauses for extended periods, the 5-minute TTL means you're paying full price to re-establish context on every restart.
The 1-hour TTL trades a slightly higher cache cost for much better hit rates on longer workflows — particularly relevant for [Claude Code Routines](https://9to5mac.com/2026/04/14/anthropic-adds-repeatable-routines-feature-to-claude-code-heres-how-it-works/) that fire on schedules and need to work with large, stable system prompts.
For teams running Claude Code at scale via the API, this is a meaningful cost lever. The 1-hour TTL can dramatically reduce token costs on workflows with large, stable context — CLAUDE.md files, large codebases with slow-changing structure, per-org system prompts.
## The Pattern
These five features share a common theme: they're not about making Claude smarter. They're about removing friction from the workflow layer that sits between the developer and the model.
`/ultrareview` removes the bias of self-review. Auto mode on Max removes the confirmation tax on longer tasks. `xhigh` effort removes the false binary between speed and quality. `/recap` removes the context cliff when sessions pause. Prompt caching TTL removes the cost penalty on intermittent long-horizon workflows.
None of these will appear in a press release. None of them produce a benchmark score. All of them change whether Claude Code is something you use occasionally or something you run your engineering workflow through.
The benchmark releases tell you what the model is capable of. The unglamorous power-user features tell you what that capability is worth in practice.
---
**Sources:**
- [Claude on X — /ultrareview and auto mode announcement](https://x.com/claudeai/status/2044785266590622185)
- [Claude Code changelog — Releasebot](https://releasebot.io/updates/anthropic/claude-code)
- [Claude Code changelog — claudefa.st](https://claudefa.st/blog/guide/changelog)
- [Anthropic adds Routines to redesigned Claude Code — 9to5Mac](https://9to5mac.com/2026/04/14/anthropic-adds-repeatable-routines-feature-to-claude-code-heres-how-it-works/)
- [Anthropic rebuilds Claude Code Desktop App — MacRumors](https://www.macrumors.com/2026/04/15/anthropic-rebuilds-claude-code-desktop-app/)
- [Introducing Claude Opus 4.7 — Anthropic](https://www.anthropic.com/news/claude-opus-4-7)
---
# OpenAI's Agents SDK Gets Sandboxed Execution and a Model-Native Harness: The Agent Infrastructure Layer Is Now Table Stakes
URL: https://sdd.sh/2026/04/openais-agents-sdk-gets-sandboxed-execution-and-a-model-native-harness-the-agent-infrastructure-layer-is-now-table-stakes/
Date: 2026-04-20
Updated: 2026-04-20
Tags: OpenAI, Agents SDK, agentic workflows, sandboxing, Claude Code, developer tools
Categories: AI Tools, Agentic Workflows
Summary: OpenAI's April 15 Agents SDK update ships sandboxed execution, a model-native harness with configurable memory, provider-agnostic model support, and durable state via snapshotting. The primitives Claude Code has offered since day one are becoming the standard SDK layer. Here's what that means.
On April 15, OpenAI shipped what might be the most consequential developer tooling update it has released in 2026: a major overhaul of its Agents SDK that adds sandboxed execution environments, a model-native harness, and durable state management via snapshotting. It also works with any model, not just OpenAI's.
Let's unpack what shipped, why it matters, and what it reveals about where the agentic infrastructure market is heading.
## What Actually Shipped
### A Model-Native Harness
The new harness is the architectural centerpiece of the update. It gives agents:
- **Configurable memory** — agents can persist state across tool calls and sessions in a structured way
- **Sandbox-aware orchestration** — the harness knows what execution environment the agent is running in and coordinates accordingly
- **Codex-like filesystem tools** — agents can inspect, read, write, and manipulate files as a first-class operation
- **Standardized integrations** — a common interface for the primitives that are converging across frontier agent systems
The language in OpenAI's announcement is precise: this is "alignment with how frontier models perform best." That's a concession that generic orchestration frameworks misfire on complex, long-running tasks. The harness is opinionated in the right direction — toward how large reasoning models actually behave in practice.
### Native Sandbox Execution
The second major addition is native sandbox support. Agents can now run in controlled compute environments with the files, tools, and dependencies they need for a task. Developers have two paths:
- **Bring your own sandbox** — if you already run E2B, Modal, Runloop, or another compute platform
- **Use a built-in integration** — support for Blaxel, Cloudflare, Daytona, E2B, Modal, Runloop, and Vercel is wired in from day one
A new **Manifest abstraction** describes the agent's workspace in a portable, provider-agnostic format. Swap sandbox providers without rewriting orchestration logic.
### Security by Separation
This is the detail that enterprise security teams will care about. OpenAI explicitly designed the new architecture assuming prompt-injection and data exfiltration attempts *will happen*. Their solution: separate the harness from the compute environment.
Credentials never enter the environment where model-generated code executes. The harness side holds auth context; the sandbox side is treated as potentially hostile territory. This is a genuinely sound security model for production agent deployments — particularly for agents that touch databases, secrets managers, or production APIs.
The separation also enables **durable execution**: if a sandbox container fails or times out, the SDK snapshotshots agent state and rehydrates it in a fresh container. Long-running agentic tasks can survive infrastructure blips.
### Provider-Agnostic by Default
Here's the move most people buried: the updated Agents SDK works with any model that exposes a Chat Completions-compatible API endpoint. That's over 100 models from third-party and open-source providers — including Anthropic's Claude.
OpenAI built an agent orchestration layer that will run Claude. Let that land for a moment.
This is either an extremely confident play ("our models win on merit, so let developers compare") or an ecosystem land-grab ("we become the standard SDK regardless of which model wins"). Probably both.
Coming soon: **subagents** (hierarchical multi-agent patterns, Python and TypeScript) and **code mode** (agents that write and execute code as a native workflow step). Python ships first; TypeScript follows.
## The Primitives Are Now Table Stakes
The most interesting thing about this release is what it confirms about the direction of the entire market.
Everything OpenAI shipped here — sandboxed execution, configurable memory, filesystem tooling, durable state, harness/compute separation — has been part of Claude Code's architecture since its initial design. Not as optional add-ons, but as core premises of how the tool was built.
This is what happens when a competitor moves from "AI assistant in your editor" to "agent that runs things autonomously." The infrastructure requirements are identical regardless of which tool you pick. You need isolation (so runaway agents don't destroy your repo). You need memory (so agents don't repeat work). You need durable state (so long tasks survive restarts). You need security separation (so model-generated code can't exfiltrate credentials).
OpenAI shipping this as an explicit SDK layer is validation that these requirements are non-negotiable for serious agentic workflows. It's also a signal that the tooling layer — not just the model layer — is now a primary competitive surface.
## Modular vs. Integrated: The Real Architectural Tradeoff
Claude Code and the new OpenAI Agents SDK embody different philosophies about how agent infrastructure should be delivered.
**OpenAI's approach**: a modular SDK that you wire together. You choose your sandbox provider. You configure your harness. You bring your model. You assemble the pieces. This gives you flexibility — you can use the OpenAI SDK with Claude, or with a local Llama model, or with GPT-5. You own the architecture.
**Claude Code's approach**: an opinionated, integrated system where the terminal, the agent runtime, the worktree isolation, the memory, and the cloud execution layer (Routines, Ultraplan) are designed to work together. You don't configure; you use.
Which is better depends on your context. If you're a platform team building a proprietary agent infrastructure at scale, the SDK model gives you the control you need. You can adapt it to your security posture, your cloud provider, your model strategy.
If you're a developer or a small team trying to ship software faster, the integrated model wins every time. The operational overhead of selecting sandbox providers, wiring the harness, managing manifests, and testing security separation across your configuration is not "overhead you do once and forget." It's ongoing maintenance. Claude Code's premise — that you should spend zero time thinking about agent infrastructure — is still the right one for the majority of use cases.
The SDK approach has a deeper problem: it converts agent infrastructure into an engineering project. Claude Code converts it into a tool you install and use.
## What's Missing
The new Agents SDK still lacks what Claude Code Routines provides natively: agents that run on the vendor's infrastructure, on a schedule, without your machine being online. OpenAI's sandboxed agents run in *your* chosen compute environment. You manage the compute; you pay for the compute; you maintain the sandbox integrations.
That's fine for teams with dedicated infrastructure. It's friction for individual developers or small teams who want agentic automation that just runs — the way a CI pipeline runs, not the way a self-hosted server runs.
The subagents feature (coming soon) is the other gap to watch. Multi-agent orchestration is where most of the interesting long-horizon coding work happens. Until subagents ship and prove out, the Agents SDK orchestration story is still single-threaded relative to what Claude Code's team mode delivers today.
## The Broader Picture
OpenAI's Agents SDK update is good for developers and good for the ecosystem. More mature tooling, real security models, and provider-agnostic architecture lift all boats. If the SDK becomes a de facto standard for building agents, it reduces the proprietary lock-in risk for teams concerned about betting entirely on one provider's tooling.
But the update also clarifies the competitive landscape rather than disrupting it. The primitives Claude Code pioneered are now being standardized. That's validation. The question going forward isn't whether agentic infrastructure needs sandboxing, memory, and durable state — everyone now agrees it does. The question is whether you want those primitives assembled by you (SDK) or delivered as an integrated system (Claude Code).
For most teams, the integrated answer is still the right one. The best infrastructure is the infrastructure you never have to think about.
---
**Sources:**
- [The next evolution of the Agents SDK — OpenAI](https://openai.com/index/the-next-evolution-of-the-agents-sdk/)
- [OpenAI updates its Agents SDK to help enterprises build safer, more capable agents — TechCrunch](https://techcrunch.com/2026/04/15/openai-updates-its-agents-sdk-to-help-enterprises-build-safer-more-capable-agents/)
- [OpenAI Agents SDK Update Explained: Sandboxes, Memory, and the New Harness — Junia AI](https://www.junia.ai/blog/openai-agents-sdk-update)
- [OpenAI updates Agents SDK, adds sandbox for safer code execution — Help Net Security](https://www.helpnetsecurity.com/2026/04/16/openai-agents-sdk-harness-and-sandbox-update/)
- [OpenAI Updates Agents SDK With Sandboxed Execution Tools — Dataconomy](https://dataconomy.com/2026/04/17/openai-updates-agents-sdk-with-sandboxed-execution-tools/)
---
# Apple Sends 200 Siri Engineers to AI Coding Bootcamp — The Rest of Apple Already Got There
URL: https://sdd.sh/2026/04/apple-sends-200-siri-engineers-to-ai-coding-bootcamp-the-rest-of-apple-already-got-there/
Date: 2026-04-20
Updated: 2026-04-20
Tags: Apple, Claude Code, AI adoption, Siri, WWDC, enterprise AI
Categories: AI Tools, Industry
Summary: Apple is sending nearly 200 Siri engineers to a multi-week AI coding bootcamp before WWDC 2026. The subtext: other Apple teams already run on Claude Code. When the world's most elite engineering org mandates the transition, the shift is real — but the story is messier than the headline.
Apple is sending nearly 200 Siri engineers to a multi-week AI coding bootcamp before WWDC 2026. The bootcamp runs in the weeks leading up to June 8, when Apple is expected to unveil a long-delayed, revamped Siri at its annual developer conference.
The headline is notable. The subtext is more revealing.
## The Laggard Inside the House
According to reporting by The Information, the Siri team has built a "reputation as a laggard inside Apple" when it comes to AI-assisted development. That's a pointed phrase. It implies the rest of Apple has moved on — and it turns out that's exactly right.
Other parts of Apple, particularly its software engineering organization, have "allocated large budgets for Claude Code." Teams that build the OS layers, the frameworks, the developer tools — they're already running on Anthropic's terminal-native agent. The Siri team, paradoxically, was not.
This is one of the most instructive data points in recent AI adoption history. The bottleneck to AI adoption inside organizations isn't access to tools. It isn't budget, in most cases. It's team culture, management incentive structures, and the psychological friction of changing how you work when the existing way has been "good enough." The Siri team built one of Apple's most scrutinized products, developed under intense secrecy, with long-standing processes. AI-assisted development means changing how code gets written, reviewed, and shipped. Inertia compounds.
The bootcamp is Apple's blunt solution: mandate the transition.
## What the Bootcamp Actually Signals
When Apple sends 200 engineers to an AI coding bootcamp, a few things are true:
**The transition is no longer optional.** Apple is not a company that moves quickly or follows trends. When Cupertino mandates organizational retraining for nearly 200 engineers in an org of a few hundred, it has concluded that AI-assisted development is not a productivity perk — it is a baseline competency. Full stop.
**The productivity gap is measurable and embarrassing.** Companies don't restructure engineering bootcamps without data. The implicit story here is that teams using Claude Code and similar tools are shipping faster, with fewer defects, than teams that aren't. The Siri team's output against its competitors — Google Assistant, Amazon Alexa, OpenAI's assistant layer — may have made the gap hard to ignore.
**Leadership is finally aligned.** Craig Federighi, Apple's software engineering chief, has taken direct oversight of AI development. Mike Rockwell, who shipped the Vision Pro, is now the Siri team lead. These are not placeholder appointments. When you put your two most credible engineering executives on a product and simultaneously mandate AI tooling across the team, you're signaling that the old approach is being replaced wholesale.
## The Gemini Wrinkle
There's a detail here that deserves its own paragraph. Apple is widely expected to announce a Gemini-powered Siri at WWDC 2026 — outsourcing the heavy AI inference to Google's models while Apple handles the on-device integration layer.
So Apple's engineers are learning to code with AI tools (including Anthropic's Claude Code), building a product that will be powered by Google's AI (Gemini), while competitors like OpenAI push into Apple's Xcode ecosystem with Codex. Apple is simultaneously a customer of Anthropic, a customer of Google, and a competitive barrier to both.
This is what the AI stack looks like at the world's most valuable company: fragmented, pragmatic, and messy. There is no "one vendor wins Apple" story here. Claude Code is the tool Apple's engineers use to build. Gemini is the model Apple's users will talk to. Codex is a third option sitting in Xcode's sidebar. These coexist.
What's clear is that Anthropic has won the tool layer at Apple — at least for the teams that got there first. Whether the Siri bootcamp standardizes on Claude Code or something else, the fact that Claude Code already has deep penetration inside Apple's software org is a significant distribution win.
## What This Means for Software Engineers Everywhere
Apple running a mandatory AI coding bootcamp is a Rorschach test for the industry.
For skeptics, it's evidence that AI coding is still something that needs to be explicitly taught — that it isn't happening organically, that there's still friction. Fair point. The Siri team had access to the same tools other Apple teams used; they just didn't adopt them.
For proponents, it's the signal they've been waiting for: the world's most conservative, secretive, process-driven engineering organization has concluded that this is no longer optional. If Apple is running bootcamps, every large enterprise will follow. The question is when, not whether.
There's a third reading that's more uncomfortable for everyone: if even Apple's internal AI laggards are being trained up on Claude Code-style workflows before a major product launch, then the competitive pressure to operate this way is now existential. Companies that don't make this shift aren't just slower — they're building with a smaller team than they think they have.
## The Timing Is Not Subtle
WWDC 2026 is June 8. Apple needs a Siri that doesn't embarrass it in front of developers and the press. The bootcamp is happening now because there isn't time to wait for organic adoption. The revamped Siri needs to ship, and it needs to ship with quality.
Whether the bootcamp delivers on that promise in six weeks is genuinely uncertain. Multi-week training programs don't rewrite team culture overnight. The engineers coming out of this will know how to use AI tools; whether the processes, code review workflows, and product culture around them will change in time for June is another question.
But the fact that Apple has decided the answer to "how do we ship faster and better?" is "we train everyone on AI-assisted development" is a definitive statement about where software engineering is going.
That statement was already obvious to anyone paying attention. Now it's coming from Cupertino.
---
**Sources:**
- [Siri Engineers Sent to AI Coding Bootcamp as Apple Prepares to Deliver Siri Overhaul — MacRumors](https://www.macrumors.com/2026/04/15/siri-engineers-ai-coding-bootcamp/)
- [Report: Apple to send Siri engineers to multi-week AI coding bootcamp — 9to5Mac](https://9to5mac.com/2026/04/15/report-apple-to-send-siri-engineers-to-multi-week-ai-coding-bootcamp/)
- [Apple pushes Siri engineers into AI coding bootcamp as delays stretch toward WWDC 2026 — iGeeksBlog](https://www.igeeksblog.com/apple-siri-ai-coding-bootcamp-wwdc-delay/)
- [Apple sets June date for WWDC 2026, teasing 'AI advancements' — TechCrunch](https://techcrunch.com/2026/03/23/apple-wwdc-june-8-12-ai-advancements-siri-developers-conference/)
- [Apple Sends Siri Team to AI Bootcamp Ahead of Major Gemini-Powered Upgrade — The Hans India](https://www.thehansindia.com/technology/tech-news/apple-sends-siri-team-to-ai-bootcamp-ahead-of-major-gemini-powered-upgrade-1066253)
---
# OpenAI Codex Goes Desktop Agent. It's Still Not Claude Code.
URL: https://sdd.sh/2026/04/openai-codex-goes-desktop-agent.-its-still-not-claude-code./
Date: 2026-04-19
Updated: 2026-04-19
Tags: OpenAI, Codex, Claude Code, Agentic Workflows, Comparison, MCP
Categories: AI Tools, Industry
Summary: OpenAI's April 17 Codex update ships multi-agent desktop control, 90+ MCP plugins, and persistent memory. It's a real step forward in autonomy — built on exactly the wrong architecture.
April 17, 2026: OpenAI shipped its biggest Codex update in months. The headline feature is multi-agent desktop control — Codex can now see, click, and type across every macOS application in parallel background agents, running entirely outside your active session. The update also adds 90+ new MCP-compatible plugin integrations (Atlassian, CircleCI, GitLab, Microsoft 365), persistent memory across sessions, image generation via GPT-Image-1.5, and an in-app browser with annotation capabilities.
On paper, this is OpenAI claiming the agentic coding territory Claude Code has occupied for the past year. In practice, it illustrates exactly why GUI-first autonomy and terminal-native autonomy are different things.
## What Codex Actually Shipped
The centerpiece is parallel desktop agent control. Multiple Codex background agents can simultaneously open Finder, navigate Xcode, file a Jira ticket, open a PR on GitHub's web UI, and update a Confluence doc — all without keyboard input, running in a separate macOS session. For developers who want an AI assistant handling ticket grooming or documentation while they focus on code, this is genuinely useful.
The plugin expansion deserves credit. 90+ new integrations ship as MCP servers, covering tools most enterprise engineering teams already use: Atlassian's full suite, GitLab, CircleCI, Linear, and the Microsoft 365 ecosystem. OpenAI's embrace of MCP here is meaningful — the company spent years pushing its own proprietary function-calling format, and the pivot to MCP signals the protocol has won as the default integration standard for AI agents. Any developer already running MCP servers for Claude Code can connect the same tools without reconfiguration.
Persistent memory rounds out the update. Codex now accumulates context across sessions — your project conventions, preferred libraries, team patterns. Previously, every new Codex session started cold.
## Why This Is Still the Wrong Architecture
Desktop GUI control as a primary autonomy mechanism has a structural problem: the GUI layer is not the integration layer.
When a Codex agent navigates Jira's web UI to file a ticket, it's parsing pixels and clicking buttons. That operation is inherently fragile — it breaks when Jira's UI changes, when a modal appears unexpectedly, when the network is slow enough to miss a transition state. Compare that to calling the Jira REST API directly from a Claude Code routine. The API integration doesn't break because the design team shipped a new modal.
More importantly, desktop GUI control anchors the agent to your machine. An agent that sees your screen and controls your apps cannot easily be:
- Run in parallel at scale — you have one screen, one active session
- Isolated to a clean git worktree — apps share filesystem state
- Triggered by a GitHub webhook while your laptop is closed
- Packaged into a team's CI pipeline without distributing desktop access
Claude Code's routines, shipped April 14, run on Anthropic's Cloud Container Runtime. They trigger on schedules, API calls, or GitHub events. They use git worktree isolation to run parallel agents across clean repository states. They don't need a macOS session open anywhere.
That is the architecture difference. Codex extended what an AI can do *on your desktop*. Claude Code extended what an AI can do *without your desktop*.
## The MCP Win Belongs to the Ecosystem
One clear positive in this Codex update: 90+ MCP server integrations. This validates what the protocol's backers have been arguing for two years — that MCP, not any provider's proprietary API, is the integration standard AI agents will converge on.
The practical effect is tool portability. A development team that has deployed MCP servers for their internal ticketing system, code review workflow, and deployment pipeline doesn't reconfigure anything when switching between Claude Code and Codex. The servers work across both. The lock-in play through proprietary tooling failed; the ecosystem won.
For Anthropic, this is free amplification. Every MCP server the ecosystem builds works in Claude Code. OpenAI shipping 90 additional servers — even as part of a competitive product launch — expands the shared tool surface area that Claude Code benefits from.
## The Memory Question
Persistent memory across sessions is the one Codex feature that addresses a genuine Claude Code limitation. Claude Code has per-project memory via CLAUDE.md files and session state within a conversation, but true cross-session persistent memory — the kind that remembers you always validate with `zod` and prefer descriptive variable names over comments — is not natively built in.
Codex is betting this matters enough to move the needle. For developers who want a consistent AI collaborator that accumulates context over months of use, it's a real differentiator. Whether a GUI agent that controls your macOS desktop is the right host for that memory is a separate design question. The memory feature makes sense. The delivery mechanism raises all the same autonomy ceiling questions.
## The Ceiling Doesn't Move
OpenAI is catching up on agentic features, and competition is good for the industry. The April 17 update is real progress. Persistent memory, MCP ecosystem parity, and parallel background agents represent meaningful engineering.
But the structural gap hasn't closed. The Agents Window in Cursor 3 drew community pushback for the same reason: bolting agent capabilities onto an existing GUI tool produces a cognitive dissonance that neither a great IDE nor a great agent platform has. You end up with a smarter GUI — not composable infrastructure.
The developers feeling that dissonance with Cursor 3 are experiencing the same ceiling Codex just hit. When you add autonomy features to a GUI, the GUI is still the unit of composition. When you build from the terminal up, your unit of composition is the process, the pipeline, the routine.
For most developers evaluating AI coding tools today: the April 17 Codex update is worth exploring, particularly if you work heavily in the Atlassian ecosystem or want cross-session memory. For developers building multi-agent pipelines, CI-integrated automation, or enterprise-scale agentic workflows, the architecture still points toward the terminal.
---
*Sources: [OpenAI Codex multi-agent update – The Tech Portal](https://thetechportal.com/2026/04/17/openai-upgrades-codex-with-multi-agent-workflows-and-desktop-app-control-to-challenge-anthropics-claude-code) · [VentureBeat Codex desktop coverage](https://venturebeat.com/technology/openai-drastically-updates-codex-desktop-app-to-use-all-other-apps-on-your-computer-generate-images-preview-webpages) · [MacRumors Codex update](https://www.macrumors.com/2026/04/16/openai-codex-mac-update/) · [Claude Code Routines – Anthropic](https://www.anthropic.com/news/claude-code-routines)*
---
# Claude Opus 4.7 Is Your New API Default on April 23. Here's What Changes.
URL: https://sdd.sh/2026/04/claude-opus-4.7-is-your-new-api-default-on-april-23.-heres-what-changes./
Date: 2026-04-19
Updated: 2026-04-19
Tags: Claude, Claude Code, Opus 4.7, Anthropic, API, Agentic Workflows
Categories: AI Tools
Summary: On April 23, the 'opus' API alias switches to Opus 4.7. Same price, one-third the tool errors, best SWE-bench Pro score on the market. If your pipeline uses the bare alias, you're upgrading automatically. Here's what that actually means.
On April 23, Anthropic will flip a switch that affects every API-based Claude integration currently running in production: the model resolved by the `opus` alias will switch from Opus 4.6 to Opus 4.7. Enterprise pay-as-you-go customers and direct API users get the upgrade automatically, without any code changes.
If your application is already pinning a full version string — `claude-opus-4-6-20261015` — nothing changes. If you're using the bare `opus` alias, you're upgrading in four days. Here's what that means and why you probably shouldn't fight it.
## What Opus 4.7 Brings
Opus 4.7 shipped April 16, 2026. The benchmark numbers: **87.6% on SWE-bench Verified**, **64.3% on SWE-bench Pro** — the highest SWE-bench Pro score of any generally available model at the time of writing, clearing GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%) by a margin wide enough to be operationally significant.
The number that matters more for production workflows is tool error rate. Opus 4.7 makes roughly **one-third the tool-call errors** of Opus 4.6 in long-horizon agentic tasks. For pipelines that chain 20–30 tool calls — code generation, test execution, repository search, file writes, PR creation — this is a qualitative shift, not just an incremental one.
Here is why the math compounds: if each tool call in a 25-step agentic task has a 5% failure rate, the full task completes without error roughly 28% of the time. Reduce the per-call failure rate to 1.7% (one-third of 5%) and that same task completes roughly 65% of the time. The per-call improvement translates to a dramatic task-completion-rate difference at production scale. For teams running Claude Code routines against real repositories, fewer errors means fewer tasks that stall and require a human to unstick them.
The other differentiator is implicit-need reasoning. Anthropic benchmarks models on tasks where the correct action is unstated — the model must infer from context what the user actually wants, not just what they literally said. Opus 4.7 is the first model to pass this benchmark at a statistically meaningful rate. In practice, this shows up as agents that handle edge cases gracefully without requiring exhaustive prompt specifications. An agent that understands "clean up this controller" means "remove dead code, update naming conventions, add missing type annotations" — rather than producing a token-minimal literal response — changes the overhead required to write effective prompts.
The price is unchanged. $5/MTok input, $25/MTok output. Opus 4.7 is not a premium tier. It is Opus 4.6, substantially improved, at the same price.
## How the Alias Mechanics Work
Anthropic maintains model aliases that always resolve to the current stable version of each tier. The `opus` alias is one of them. On April 23, the alias resolves to `claude-opus-4-7-20260416` instead of the previous Opus 4.6 version string.
In Claude Code specifically, the default model for API key, Bedrock, Vertex AI, and Azure Foundry users is already set to high-effort mode following the v2.1.94 rollback of the silent March 3 effort-level change. That means Claude Code users on these backends get full-effort Opus 4.7 on April 23 without any configuration change.
To stay on Opus 4.6 explicitly:
- API: pin `model: "claude-opus-4-6-20261015"` in your request
- Claude Code: set `ANTHROPIC_MODEL=claude-opus-4-6-20261015` in your environment
- SDK: pass the full model ID string rather than using the alias
There is no cost or capability reason to pin 4.6. The reason to pin any specific version is reproducibility — if you need consistent behavior across a compliance audit period or A/B test window, pinning makes sense regardless of which model is better. Otherwise, the upgrade is correct.
## The Production Readiness Checklist
For teams with critical API integrations:
**Audit alias usage.** Find every place in your codebase where the `opus` string appears without a full version ID. These are all upgrading automatically on April 23. For most use cases, this is fine. For pipelines with hand-tuned prompts that depend on specific Opus 4.6 behavior, validate before the switch.
**Test against `claude-opus-4-7-20260416` now.** The full model ID is available today. If you want to validate behavior before the alias flips, point your staging environment at the explicit 4.7 model ID this week.
**Expect fewer tool errors, not different outputs.** Opus 4.7's core instruction-following and reasoning are consistent with 4.6. The behavioral improvements are in error rates and edge-case inference, not in fundamental output style. Prompts that worked well on 4.6 will continue to work on 4.7.
## It's Already in Copilot, Too
One notable wrinkle in the Opus 4.7 rollout: it is simultaneously appearing as an option in GitHub Copilot Pro+, where it replaces Opus 4.5 and 4.6 in the model picker on the same timeline.
This is Anthropic's multi-surface distribution strategy working as designed. The model ships everywhere at once: Claude Code native, direct API, Bedrock, Vertex AI, and Copilot. Developers no longer have a model quality reason to switch tools — the frontier Claude model is accessible regardless of where they work.
The competitive pressure on OpenAI is real. If a Copilot user can access Opus 4.7 through their existing GitHub subscription, and Opus 4.7 leads SWE-bench Pro by 6–10 points, the model gap erodes as a reason to choose OpenAI tooling. Anthropic is not charging a premium for the better model. They are moving the floor up.
## What This Means for Claude Code Users
For developers using Claude Code day to day: the April 23 switch requires no action and delivers a better agent. The reduced tool error rate has the most direct practical impact — long-running agentic tasks complete more reliably, and the cost-per-successful-task drops even though per-token pricing is unchanged.
For teams running Claude Code routines or automated pipelines: validate once against the 4.7 model ID before April 23, then let the alias do its job. The one-third error reduction in tool calls is the number worth watching in your production telemetry. If you have Claude Code Analytics API integration set up, you can track tool-call success rates and task-completion rates directly through the dashboard.
For everyone else: same interface, same pricing, better model. The April 23 default change is Anthropic moving the baseline rather than selling an upgrade. That's the right move.
---
*Sources: [Introducing Claude Opus 4.7 – Anthropic](https://www.anthropic.com/news/claude-opus-4-7) · [VentureBeat Opus 4.7 coverage](https://venturebeat.com/technology/anthropic-releases-claude-opus-4-7-narrowly-retaking-lead-for-most-powerful-generally-available-llm) · [Claude Opus 4.7 on Amazon Bedrock – AWS Blog](https://aws.amazon.com/blogs/aws/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock/) · [Claude models documentation – Anthropic](https://docs.anthropic.com/en/docs/about-claude/models)*
---
# Lucidworks MCP: $150K Per Integration Saved, and What It Says About MCP's Real Value
URL: https://sdd.sh/2026/04/lucidworks-mcp-150k-per-integration-saved-and-what-it-says-about-mcps-real-value/
Date: 2026-04-18
Updated: 2026-04-18
Tags: MCP, Enterprise, Search, AI Integration, Lucidworks, Agentic Workflows
Categories: AI Tools, Industry
Summary: Lucidworks launched an MCP server that connects AI assistants to enterprise search with claimed $150K savings per integration and 10x faster rollout. The numbers are impressive. The bigger story is what it reveals about MCP's role in enterprise AI architecture.
The Model Context Protocol hit 97 million downloads in March. OpenAI adopted it in April. The Linux Foundation is governing it. The question stopped being "will MCP win?" weeks ago. The question now is: **what exactly does winning look like at enterprise scale?**
Lucidworks — the enterprise search company behind the Fusion platform — answered that question on April 8, 2026. Their MCP server launch doesn't just add another entry to the MCP registry. It puts concrete dollar numbers on the value proposition, and those numbers are worth examining.
## The Claim: $150K Per Integration
The headline figure from the Lucidworks announcement is blunt: enterprises using the MCP server can save more than $150,000 per integration and reduce AI agent integration timelines by up to 10x.
That's not a vague "efficiency gain." It's a specific cost claim, which means someone did the math on what the alternative costs. The alternative is a custom integration: engineers designing a bespoke API layer between the AI assistant and the enterprise search system, handling authentication, query translation, relevance model compatibility, permission propagation, and incremental data sync. In a mid-size engineering org, that's several senior engineers for several months. The $150K figure is not hard to believe.
What the MCP server offers instead is a standardized endpoint. Connect once; the Lucidworks Platform handles the rest. The AI assistant — Claude, ChatGPT, or whatever ships next — calls the MCP tool. The tool routes through the existing Fusion query pipeline, applying the same relevance models, the same permission checks, and the same security controls that already govern the search system. The integration timeline drops from months to, according to Lucidworks, minutes.
The 6,400+ servers now in the MCP registry means there's ecosystem pressure to make this work. Enterprises that standardize on MCP now aren't betting on an emerging protocol — they're betting on what is rapidly becoming the default integration layer for AI agents.
## Why Enterprise Search Is the Right Problem to Solve
Enterprise search has always been the unsexy cousin of consumer search, but it's sitting on top of the most valuable data an organization has: internal documentation, product catalogs, customer records, support histories, compliance materials. The problem was never that the data didn't exist. The problem was getting AI assistants to it in a way that respected the security model.
Before MCP, there were two paths. The first was RAG: ingest everything into a vector store, run semantic search, hope the retrieval quality was good enough. The second was custom tool integration: build and maintain an API wrapper for every data source the AI might need. Both paths were expensive, brittle, and required ongoing engineering work.
Lucidworks' MCP server is neither. It connects the AI assistant to the existing search index — with all the relevance tuning, synonym expansion, and ranking models already baked in — through a single protocol endpoint. The search expertise already invested in Fusion isn't discarded; it's exposed to the AI layer.
That's a fundamentally different architecture than building AI on top of raw data. An enterprise that has spent years tuning their Fusion deployment for precision and recall doesn't lose that investment when they add AI agents. The agents get the benefit of the tuned index. That's worth more than the $150K savings number suggests.
## The Security Architecture
Enterprise search isn't just retrieval. It's retrieval with access control. An employee in sales shouldn't see engineering's unreleased roadmap when they ask the AI assistant a product question. A contractor shouldn't see executive compensation data when they query the HR knowledge base.
The Lucidworks MCP server propagates the existing Fusion security model end-to-end:
- **Document-level permissions**: The AI assistant only retrieves documents the authenticated user is authorized to see. The permission check happens in Fusion, not in a middleware layer that could be misconfigured.
- **Role-based access control**: Existing Fusion RBAC groups apply automatically. No re-implementation required.
- **Field-level security**: Sensitive fields within documents can be masked or excluded based on user role. The AI doesn't see them; it can't leak them.
- **Self-hosted deployment**: For organizations with data sovereignty requirements, the MCP server runs in your own infrastructure. The AI assistant calls your endpoint; data never leaves your boundary.
This is the part that matters most for enterprise procurement. The security review isn't about the AI model. It's about whether the data pipeline respects the organization's access control model. If the MCP server is just another application layer, every existing Fusion permission policy still applies.
## Claude Code, Meet Enterprise Knowledge
The practical workflow for a developer using Claude Code with a Lucidworks MCP server changes the research phase of coding. Instead of switching contexts to search internal documentation, asking teammates for tribal knowledge, or digging through Confluence manually, the developer stays in the terminal and queries through Claude Code's MCP tool integration.
Ask Claude Code a question about the internal payment processing API. Claude calls the Lucidworks MCP tool. The tool queries the Fusion index — which already knows about the API, has the latest spec, and respects the developer's access level. The response comes back with grounded, relevant results from the actual internal knowledge base, not from model weights trained on public internet data.
This is not a small workflow improvement. The context-switching cost of leaving a coding session to do research is underestimated. Every minute the developer spends manually searching is a minute the agentic loop is paused waiting for human input. An MCP server that reduces that to a single tool call is compressing the loop in a meaningful way.
## The Bigger Story: MCP as Integration Standard
The Lucidworks announcement is notable not just for the numbers but for what it confirms about where enterprise AI architecture is heading.
Six months ago, MCP was a protocol for connecting AI assistants to development tools. Today it's being used to connect AI assistants to enterprise search infrastructure deployed at Fortune 500 companies. The protocol that started as a way to give Claude Code access to a file system is becoming the standard interface between AI agents and enterprise data systems.
The 6,400-server MCP registry reflects this trajectory. The servers aren't all developer tooling. They're CRMs, ERPs, data warehouses, search platforms, and now Lucidworks' enterprise search. Each new server increases the value of standardizing on MCP as the integration approach, because every agent that supports MCP gets access to the full registry.
For developers and architects thinking about how to connect AI agents to enterprise data, the Lucidworks story is a proof point: MCP is now the right abstraction for this problem. Custom integrations are still possible. They're just harder to justify when a standardized approach saves $150K and three months of engineering time.
The protocol that won the developer tools market is winning the enterprise data market too.
---
*Sources: [Lucidworks MCP launch — GlobeNewswire](https://www.globenewswire.com/news-release/2026/04/08/3269912/0/en/Lucidworks-Launches-Model-Context-Protocol-to-Reduce-AI-Agent-Integration-Timelines-by-Up-to-10x.html) · [Lucidworks MCP server overview](https://lucidworks.com/mcp) · [How to integrate MCP into enterprise systems](https://lucidworks.com/blog/how-to-integrate-mcp-into-existing-enterprise-systems) · [MCP and AI search](https://lucidworks.com/blog/how-mcp-can-improve-ai-powered-search-and-discovery) · [MCP gateways and AI agent security tools](https://www.integrate.io/blog/best-mcp-gateways-and-ai-agent-security-tools/) · [Martechcube coverage](https://www.martechcube.com/lucidworks-launches-mcp-to-reduce-ai-agent-integration-timelines-by-up-to-10x/)*
---
# Claude Code on Bedrock with Mantle: The Enterprise Air-Gap Story
URL: https://sdd.sh/2026/04/claude-code-on-bedrock-with-mantle-the-enterprise-air-gap-story/
Date: 2026-04-18
Updated: 2026-04-18
Tags: Claude Code, Amazon Bedrock, Enterprise, Security, AWS, Mantle
Categories: AI Tools, Industry
Summary: Claude Code v2.1.94 shipped Mantle backend support, enabling zero operator access on AWS-managed infrastructure. No SSH. No Session Manager. No Anthropic personnel in the inference path. Here's what that actually means for enterprise buyers.
Enterprise AI adoption has always had a chicken-and-egg problem. Security and compliance teams demand that no vendor personnel can access their data. Vendors respond with policy promises, audit reports, and contractual commitments. Then someone in IT asks the obvious question: how do we *know*?
Amazon's Mantle inference engine — now supported by Claude Code v2.1.94 — replaces that conversation with an architecture. And architecture is harder to lie about than policy.
## What Mantle Actually Is
Mantle is AWS's next-generation inference engine for Amazon Bedrock, designed from the ground up around a single constraint: **zero operator access (ZOA)**. Not "minimal access." Not "access by exception." Zero.
The practical implementation is blunt: Secure Shell (SSH), AWS Systems Manager Session Manager, and serial consoles are not installed anywhere in Mantle. There is no interactive mechanism that would allow an AWS operator — or anyone else — to access a customer's prompts or model completions. Not Anthropic. Not AWS. Nobody.
The security architecture has three layers that matter:
**Cryptographic software verification.** Every inference software update must be signed and verified before it deploys into Mantle. Only approved code runs. The supply chain for the inference engine itself is attested before it touches production.
**NitroTPM-backed hardware attestation.** The services handling model weights and running inference on customer prompts are backed by cryptographically signed attestation measurements from AWS Nitro Trusted Platform Modules. Auditors don't have to trust a promise — they can verify the attestation chain.
**Provider isolation.** Anthropic has no access to the AWS-owned account where inference happens. Model providers supply the model; they don't touch the runtime. The separation is structural, not contractual.
For standards like SOC 2, HIPAA, or ISO 27001, this changes the compliance conversation. Instead of "we promise our operators don't look at your data," auditors can confirm the architecture has no backdoors. That's a different kind of assurance.
## What Changed in v2.1.94
Claude Code v2.1.94, released April 8, 2026, introduced native Mantle support with a single environment variable:
```bash
export CLAUDE_CODE_USE_MANTLE=1
```
That flag routes Claude Code's inference through the Mantle backend instead of the standard Bedrock path. Combined with the interactive setup wizard, the configuration path for enterprise deployments went from "manually set five environment variables and pray" to a guided flow that handles AWS credential selection, region configuration, and model availability verification in a single session.
The wizard lets you choose how you authenticate to AWS: a detected profile from `~/.aws`, a Bedrock API key, explicit access key and secret, or ambient credentials already in your environment. It picks up your region, verifies which Claude models your account can invoke, and optionally pins the 1M context window during initial setup.
A regression in v2.1.94 caused Bedrock requests to fail with a 403 "Authorization header is missing" error when using `AWS_BEARER_TOKEN_BEDROCK` or `CLAUDE_CODE_SKIP_BEDROCK_AUTH`. That was fixed in v2.1.96. If you're deploying from these changelogs, skip directly to .96 or later.
Also in v2.1.94: the default effort level reverted to **high** for API-key, Bedrock, Vertex AI, and Azure Foundry users — walking back the silent "medium effort" change from March 3 that generated the trust crisis covered here last week. Enterprise users running Claude Code through Bedrock now get high-effort reasoning by default without having to override it.
## Why the Air-Gap Story Matters Now
The Bedrock + Mantle combination unlocks Claude Code for a class of enterprise buyer that was previously unreachable: regulated industries and organizations with strict data residency requirements.
Think financial services firms running inside an AWS GovCloud boundary. Healthcare systems with PHI that cannot leave their AWS account. Defense contractors with audit requirements that preclude vendor access to inference infrastructure. For these buyers, the previous Claude Code story was "trust our policies." The Mantle story is "verify our architecture."
It also changes the enterprise procurement conversation. When a CTO asks "can your AI coding tool be deployed such that neither you nor any third party can see our code?", the answer used to be a legal document. Now it's a reference architecture with cryptographic attestation.
Claude Code's deployment options now span three enterprise tiers:
| Tier | Infrastructure | Operator Access |
|---|---|---|
| Standard | Anthropic-managed | Anthropic zero-access policy |
| Bedrock | AWS-managed | AWS policy + Anthropic model isolation |
| Bedrock + Mantle | AWS-managed, ZOA | Hardware-attested zero operator access |
The Mantle tier is the one that passes an enterprise security review without a thirty-page risk exception.
## The Competitive Angle
Cursor runs entirely through its own cloud infrastructure, with no equivalent to running inside your AWS account boundary. GitHub Copilot Enterprise offers data residency options, but the inference infrastructure remains Microsoft-operated. Windsurf has no equivalent enterprise air-gap story.
Claude Code on Bedrock with Mantle is, as of this writing, the only major AI coding agent that can be deployed such that no vendor — not even the AI provider — has a technical path to access customer code during inference. That's not marketing positioning. That's an architectural fact that matters to the CISO who signs the vendor risk assessment.
For most developers, the distinction is academic. For the enterprise buyer trying to deploy AI coding tooling to 2,000 engineers at a regulated institution, it's the difference between a signed contract and a blocked procurement.
## Getting Started
The AWS blog post on [Claude Code deployment patterns with Amazon Bedrock](https://aws.amazon.com/blogs/machine-learning/claude-code-deployment-patterns-and-best-practices-with-amazon-bedrock/) covers the setup in detail. For teams that want the Mantle path specifically, the Claude Code docs at [code.claude.com/docs/en/amazon-bedrock](https://code.claude.com/docs/en/amazon-bedrock) now include the Mantle setup flow.
The prerequisites are standard AWS fare: a Bedrock account with Claude model access enabled in your target region, IAM credentials with the right Bedrock permissions, and v2.1.96 or later of Claude Code. The wizard handles the rest.
For organizations where the compliance question has been the blocker, v2.1.94 is the release that removes it.
---
*Sources: [AWS Mantle ZOA deep dive](https://aws.amazon.com/blogs/machine-learning/exploring-the-zero-operator-access-design-of-mantle/) · [Claude Code v2.1.94 release notes](https://github.com/anthropics/claude-code/releases/tag/v2.1.94) · [Claude Code Bedrock docs](https://code.claude.com/docs/en/amazon-bedrock) · [AWS Claude Code deployment patterns](https://aws.amazon.com/blogs/machine-learning/claude-code-deployment-patterns-and-best-practices-with-amazon-bedrock/) · [i10x Mantle ZOA overview](https://i10x.ai/news/aws-mantle-zero-operator-access-zoa-amazon-bedrock) · [Claude Opus 4.7 on Bedrock](https://aws.amazon.com/blogs/aws/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock/)*
---
# The Orchestrator Seat: Claude Code's Desktop Redesign Makes Parallel Agents Native
URL: https://sdd.sh/2026/04/the-orchestrator-seat-claude-codes-desktop-redesign-makes-parallel-agents-native/
Date: 2026-04-17
Updated: 2026-04-17
Tags: Claude Code, Anthropic, desktop app, parallel agents, developer tools, workflow
Categories: AI Tools, Agentic Workflows
Summary: Anthropic's April 14 Claude Code desktop redesign isn't a UI polish — it's a rethinking of how developers manage multiple AI agents simultaneously. Multi-session sidebar, git worktree isolation, side chats, and an integrated toolkit mean you can orchestrate five agents without leaving the app.
On April 14, 2026, Anthropic shipped the redesigned Claude Code desktop app alongside the Routines cloud automation feature. The Routines story got the headlines — but the desktop redesign is the more consequential change for developers who run Claude Code all day.
The previous desktop app was designed around single-session, human-in-the-loop coding. You prompt, Claude works, you review. That model made sense in 2024. It doesn't map to how serious Claude Code users work in 2026: multiple agents running simultaneously across different repos, some in the background, some requiring mid-task decisions, with you moving between them as they surface results.
Anthropic's framing for the redesign: **"The new app is built for how agentic coding actually feels now: many things in flight, and you in the orchestrator seat."**
## The Sidebar: Session Management That Actually Scales
The most visible change is the session sidebar, which replaces the previous list of recent conversations with a full session management interface. Every active and recent session is visible, filterable, and groupable.
Filtering options: status (running, waiting for input, complete, failed), project (repository), and environment. If you're working across five repos simultaneously, you can narrow the sidebar to a single project and see only its sessions.
Grouping by project is the detail that makes this usable at scale. When you're running three sessions per repo across four repos, a flat list becomes cognitive overhead. Grouping means you're seeing your sessions organized the way you think about your work — by project, not by creation time.
Status filtering is underrated. In a multi-session setup, the sessions that need your attention right now are the ones waiting for input or that have hit an error. The rest can run unattended. Being able to filter to just "waiting" means you're not scanning past six in-progress sessions to find the one that needs a decision.
## Git Worktrees: The Safety Layer That Makes Parallel Work Real
The most important technical detail in the redesign is buried in the documentation: **each session in a git repository gets its own isolated copy of your project using git worktrees**.
Changes in one session don't affect other sessions until you commit. This is the implementation detail that makes parallel agentic work actually safe, not just technically possible.
Without worktree isolation, running two Claude Code sessions on the same repo means they're both reading and writing to the same working directory. Session A refactors a module while Session B writes tests for that module's current interface. The result is either a race condition or a confusing mess of half-finished changes. With worktrees, Session A and Session B operate on isolated copies of the codebase. You see both sets of changes, review them, and decide what to merge.
This is the same principle behind feature branches, applied at the session level. It's not a new concept — git worktrees have been around for years. But wiring worktree isolation directly into the session model means you get the isolation automatically, without having to manually create branches and switch contexts before starting each agent.
## Side Chats: The Sleeper Feature
Side chats (⌘+; on Mac, Ctrl+; on Windows) let you open a branching conversation off the main thread. The side chat pulls context from the main session but doesn't add anything back to it.
The use case: you're running a complex agent task and you want to ask a question — "what's the current state of X?" or "is this approach correct?" — without injecting that exchange into the main thread's context. Every message in the main thread influences how the agent continues. A clarifying question, a half-formed thought, a quick sanity check — these don't belong in the main session context. They create noise that can misdirect subsequent agent behavior.
Side chats let you think out loud without polluting the agent's working context. That's a subtle but significant workflow improvement for developers who use Claude Code for complex, multi-step tasks where context quality matters.
## The Integrated Toolkit
The redesign moves four tools into the app that previously required switching to a separate terminal or editor:
**Integrated terminal.** Run tests, builds, or CLI commands alongside your session. The agent proposes a fix; you run the test suite in the same window to verify. This is the workflow that felt clunky before — tab to terminal, run tests, tab back to Claude Code, paste results — now collapsed into a single view.
**In-app file editor.** Open files and make spot edits without leaving the session. For the inevitable "Claude got 95% right but I want to tweak this one line myself" moment, you don't have to switch to VS Code or another editor.
**Faster diff viewer.** Rebuilt for performance on large changesets. If you've ever watched Claude Code scroll through a large diff in the old viewer, this is the fix. The rebuild prioritizes rendering speed, which matters when an agent has touched 40 files and you need to review the whole changeset quickly.
**HTML and PDF preview.** Open in-app. For agents generating reports, documentation, or frontend HTML, you can preview the output without exporting to a browser.
All four panes are drag-and-drop. Arrange terminal, preview, diff viewer, and chat in whatever grid matches how you work. The app doesn't enforce a layout.
## Three View Modes
Verbose, Normal, and Summary — three ways to look at what Claude is doing.
**Verbose** shows every tool call, every intermediate step, every reasoning token the model emits. Useful for debugging why an agent went in a direction you didn't expect, or for learning how Claude Code handles a class of task.
**Normal** shows the default level of detail — tool calls and key decisions, but not every intermediate step.
**Summary** collapses the session to high-level progress updates. If you have five sessions running and you want a quick status scan without reading detailed tool call logs, Summary mode lets you do that without switching to a different view.
The mode switch is per-session. A session where you're actively debugging gets Verbose; a background session refactoring a module gets Summary.
## The Bigger Picture
The redesigned desktop app is making an argument: the Claude Code desktop is the environment for agentic development, not a side panel bolted onto your existing editor.
This positions Anthropic differently from Cursor and Copilot, which are IDE-embedded and therefore architecturally tied to the editor-centric workflow. Claude Code's terminal-native and now desktop-native approach means the environment scales to multi-agent orchestration in a way that an IDE plugin cannot — you can't run five parallel Cursor Composers across five repos in a single organized interface the way you can now with Claude Code's session sidebar.
The worktree isolation, the side chats, the integrated toolkit, the view modes — none of these are features you'd design for a tool that assumes one developer, one task, one file at a time. They're features you design for a tool that assumes the developer is an orchestrator managing multiple simultaneous workstreams.
That assumption is increasingly accurate.
---
**Sources:**
- [Redesigning Claude Code on desktop for parallel agents — Anthropic](https://claude.com/blog/claude-code-desktop-redesign)
- [Claude Code desktop docs — Anthropic](https://code.claude.com/docs/en/desktop)
- [Anthropic rebuilds Claude Code desktop app around parallel sessions — MacRumors](https://www.macrumors.com/2026/04/15/anthropic-rebuilds-claude-code-desktop-app/)
- [Claude Code gets automated routines and a desktop makeover — SiliconANGLE](https://siliconangle.com/2026/04/14/anthropics-claude-code-gets-automated-routines-desktop-makeover/)
- [We tested the redesigned Claude Code desktop app — VentureBeat](https://venturebeat.com/orchestration/we-tested-anthropics-redesigned-claude-code-desktop-app-and-routines-heres-what-enterprises-should-know)
- [Claude Code desktop redesign — The New Stack](https://thenewstack.io/claude-code-desktop-redesign/)
---
# Claude Opus 4.7: 87.6% SWE-bench, Implicit-Need Tests, Same Price
URL: https://sdd.sh/2026/04/claude-opus-4.7-87.6-swe-bench-implicit-need-tests-same-price/
Date: 2026-04-17
Updated: 2026-04-17
Tags: Claude, Anthropic, model release, SWE-bench, agentic coding, benchmarks
Categories: AI Tools, Industry
Summary: Anthropic shipped Claude Opus 4.7 on April 16, 2026. SWE-bench Verified jumps nearly 7 points to 87.6%, SWE-bench Pro leaps from 53.4% to 64.3%, and the model is the first Claude to pass implicit-need tests. Pricing stays flat at $5/$25 per million tokens.
Claude Opus 4.7 landed yesterday, April 16, 2026 — and for once, the headline isn't "new model costs more." Pricing holds at $5/$25 per million tokens (input/output), the same as Opus 4.6. What changed is the capability ceiling, and on the benchmarks that matter for agentic coding, the jump is real.
## The Numbers
**SWE-bench Verified** climbs from 80.8% to **87.6%** — a nearly 7-point improvement that puts Opus 4.7 ahead of both GPT-5.4 and Gemini 3.1 Pro (80.6%). More meaningfully, **SWE-bench Pro** — the harder, multi-language variant Anthropic itself helped design to be contamination-resistant — jumps from 53.4% to **64.3%**, leapfrogging GPT-5.4 (57.7%) and Gemini (54.2%).
To put the Pro number in perspective: every frontier model clustered around 53–58% on SWE-bench Pro as recently as two months ago. A 64.3% score is a genuine step change, not rounding-error progress.
**MCP-Atlas**, the agentic tool-use benchmark tracking multi-step agent behavior through Model Context Protocol workflows, hits **77.3%** for Opus 4.7, compared to 75.8% for Opus 4.6, 73.9% for Gemini 3.1 Pro, and 68.1% for GPT-5.4.
**OSWorld-Verified**, which tests computer-use tasks against real desktop interfaces, climbs from 72.7% to **78.0%** — within 1.6 points of the Mythos Preview at 79.6%, and ahead of GPT-5.4's 75.0%.
## What Actually Improved
Anthropic is specific about the improvements, which is more useful than the usual "stronger, better, smarter" release language.
**Fewer token errors in agentic loops.** The model produces a third of the tool errors of Opus 4.6 on complex multi-step tasks. In agentic workflows where a model is running dozens of tool calls in sequence, error accumulation is the enemy — errors compound, context gets polluted, and the agent recovers badly or not at all. Cutting tool errors by 67% has a multiplicative effect on task completion.
**14% improvement on complex multi-step workflows, using fewer tokens.** Anthropic isn't trading efficiency for accuracy here. Opus 4.7 completes more complex tasks while consuming fewer tokens — a combination that almost never shows up in model releases because it's easier to throw more compute at hard problems than to solve them more efficiently.
**Cursor's internal benchmark (CursorBench) jumped from 58% to 70%.** Cursor's team has one of the most honest third-party evaluations because it's run against their actual user workflows, not academic datasets. A 12-point jump on a real-world coding benchmark is significant.
**One production partner saw 13% higher resolution rate** on a 93-task coding benchmark, including four tasks that neither Opus 4.6 nor Sonnet 4.6 could solve at all. Tasks that were simply out of reach for the previous generation are now solvable.
## Implicit-Need Tests: The Understated Breakthrough
The benchmark most likely to be glossed over in coverage is the one that might matter most: Opus 4.7 is the first Claude model to pass what Anthropic calls **implicit-need tests**.
These are tasks where the model must figure out what tools or actions are needed, rather than being told explicitly. The distinction matters enormously in production agentic workflows. When you're running Claude Code autonomously — working through a multi-step spec, debugging a failing CI run, or refactoring a module it's never seen before — you cannot enumerate every tool call in advance. The model has to infer what's required from context.
Previous models handled explicit instruction well. "Use the filesystem tool to read X, then the search tool to find Y" — fine. But "figure out what's causing the test failure and fix it" requires inferring which tools to reach for, in what order, with what parameters. That's the difference between a capable model and a capable agent. Opus 4.7 passing implicit-need tests is Anthropic flagging that the gap is narrowing.
## Vision: 3x the Resolution
Opus 4.7 accepts images up to **3.75 megapixels** — about three times the limit of prior Claude models (approximately 1.25MP). In practical terms: full-resolution screenshots of multi-monitor setups, high-density UI designs, detailed diagrams, and dense code in screenshots are now readable without pre-scaling.
This matters specifically for computer-use workflows where the agent is reading the actual screen, not a compressed thumbnail. If you've seen Claude miss fine-grained UI details on OSWorld-style tasks, the resolution increase is directly relevant.
## Multi-Agent Coordination
Opus 4.7 introduces native multi-agent coordination — the ability to orchestrate parallel AI workstreams rather than processing tasks sequentially. This is the model-level capability that Anthropic's Claude Cowork and Managed Agents infrastructure have been building toward. The orchestrator model can now natively spawn, monitor, and synthesize work from parallel subagents rather than treating every task as a single-threaded problem.
Combined with the Claude Code desktop redesign that shipped three days ago (parallel sessions via git worktrees), Opus 4.7's multi-agent coordination is the model layer catching up with the tooling layer.
## Pricing: Still $5/$25
The pricing line is worth repeating because it's unusual: **$5 per million input tokens, $25 per million output tokens**, identical to Opus 4.6.
In a market where model releases routinely come with 20–40% price increases, Anthropic keeping the price flat while delivering genuine capability improvements is either a deliberate competitive move or a sign that their inference infrastructure improvements are outpacing raw capability gains. Possibly both.
For teams that have been holding off on Opus for cost reasons — running Sonnet 4.6 for most tasks, reserving Opus for the hard stuff — the economics of running Opus more aggressively just got better.
## Availability
Opus 4.7 is live across:
- **Claude API** (claude.com)
- **Amazon Bedrock** — with Bedrock's zero-operator-access guarantee, meaning neither Anthropic nor AWS operators see your prompts or responses
- **Google Cloud Vertex AI**
- **Microsoft Azure AI Foundry**
The v2.1.94 Claude Code update (released alongside the model) automatically defaults Bedrock, Vertex, and Foundry users to **high effort** instead of the previously controversial medium default — an implicit acknowledgment that the effort controversy from last week was a legitimate concern for enterprise users.
## What This Means for Claude Code Workflows
If you're running Claude Code in autonomous or semi-autonomous modes, Opus 4.7 changes the calculus in three ways:
**1. Fewer restarts.** The tool error reduction means complex multi-step agents are less likely to go off the rails mid-task. Every tool error is a branch point where the agent either recovers correctly or drifts. Fewer errors means fewer unrecoverable states.
**2. Harder tasks become viable.** The 64.3% SWE-bench Pro score and the implicit-need test results suggest that tasks requiring genuine inference about what to do next — not just instruction-following — are now in reach. The boundary of "what can I delegate to Claude Code" just moved outward.
**3. The orchestrator role gets stronger.** Multi-agent coordination as a native model capability, combined with lower error rates, makes Opus 4.7 a better orchestrator for multi-agent Claude Code setups. If you're running Claude Code Agent Teams, the lead agent just got meaningfully better at managing its subordinates.
Anthropic is building toward a model that doesn't just do what it's told — it figures out what needs to be done and does it. With Opus 4.7, that trajectory is visible in the benchmark numbers.
---
**Sources:**
- [Introducing Claude Opus 4.7 — Anthropic](https://www.anthropic.com/news/claude-opus-4-7)
- [Claude Opus 4.7 available in Amazon Bedrock — AWS](https://aws.amazon.com/blogs/aws/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock/)
- [Claude Opus 4.7 leads on SWE-bench and agentic reasoning — The Next Web](https://thenextweb.com/news/anthropic-claude-opus-4-7-coding-agentic-benchmarks-release)
- [Claude Opus 4.7 Benchmarks Explained — Vellum AI](https://www.vellum.ai/blog/claude-opus-4-7-benchmarks-explained)
- [Claude Opus 4.7 with automated cybersecurity safeguards — Help Net Security](https://www.helpnetsecurity.com/2026/04/16/claude-opus-4-7-released/)
- [Claude Code Changelog — Claudefast](https://claudefa.st/blog/guide/changelog)
---
# Claude Cowork Goes GA: Six Enterprise Features That Turn AI Into Workplace Infrastructure
URL: https://sdd.sh/2026/04/claude-cowork-goes-ga-six-enterprise-features-that-turn-ai-into-workplace-infrastructure/
Date: 2026-04-16
Updated: 2026-04-16
Tags: Claude Code, Anthropic, Enterprise, RBAC, OpenTelemetry, MCP, Claude Cowork
Categories: AI Tools, Industry
Summary: Anthropic moved Claude Cowork from research preview to general availability on April 9, 2026, and shipped six enterprise management features alongside it. RBAC, group spend limits, OpenTelemetry, per-tool connector controls, a Zoom MCP connector, and expanded analytics. Here is what each feature does and why the bundle matters more than any individual item.
There is a difference between a productivity tool and workplace infrastructure. Productivity tools individuals adopt on their own. Infrastructure IT deploys, controls, audits, and bills for. Claude Cowork — Anthropic's shared AI workspace inside Claude Desktop — crossed that line on April 9, 2026, when it reached general availability with six enterprise management features that are clearly aimed at the second category.
The GA launch was part of a triple announcement that also included Claude Managed Agents entering public beta. But the Cowork features deserve their own analysis, because the bundle they form is not accidental.
## What Claude Cowork Is
Claude Cowork is the shared workspace layer inside Claude Desktop — available on macOS and Windows — that lets teams collaborate with Claude across projects. It is distinct from Claude Code (the agentic coding tool) and from standard Claude chat. Think of it as a shared project environment where teams can run Claude agents, connect tools via MCP, and build persistent workflows that multiple people access.
It has been in research preview since earlier this year. As of April 9, it is available to all paid plans at no extra cost. The six new features are what make it deployable at scale rather than just usable by early adopters.
## The Six Features
### 1. Role-Based Access Control with SCIM
Admins on Enterprise plans can now organize users into groups — manually or via SCIM from an identity provider — and assign each group a custom role that defines which Claude capabilities its members can use. A read-only group can query but not execute. A developer group can run agents but not modify org-wide connector settings. A security review group gets access to audit logs without touching live workflows.
This is table-stakes for any serious enterprise deployment. Without it, every user has the same permissions, which is a non-starter for regulated industries. The SCIM integration means group membership stays in sync with existing HR systems automatically — no manual provisioning when someone changes teams.
### 2. Group Spend Limits
Spend limits can now be set per group, enabling per-team AI budgeting. The finance team gets a cap appropriate to their use case. The engineering team gets a larger cap appropriate to theirs. When a group hits its limit, usage stops — not the whole organization.
This sounds administrative, but it is actually what makes AI tooling compatible with how companies manage software spend. CFOs do not approve "unlimited token consumption." They approve line items with predictable ceilings. Group spend limits turn Claude Cowork from a potentially unconstrained cost into a managed budget item.
### 3. Usage Analytics (Dashboard and API)
Claude Cowork activity now appears in the admin dashboard and in the Analytics API introduced last week. Dashboard-level views show Cowork sessions and active users across custom date ranges. The API goes deeper: per-user Cowork activity, skill and connector invocations, and DAU/WAU/MAU figures alongside existing Chat and Claude Code metrics.
For anyone building a business case for AI tooling — or reporting on AI ROI to a board — having normalized, exportable usage data is essential. This connects directly to the Analytics API story from April 14: the data is consistent across Chat, Claude Code, and Cowork, so you can build a single BI dashboard that shows AI adoption and impact across the full Anthropic stack.
### 4. OpenTelemetry Support
Cowork now emits operational events in OTel-format telemetry: tool and connector calls, files read or modified, skills used, and whether each AI-initiated action was approved manually or automatically. Those events are compatible with standard SIEM pipelines — Splunk, Datadog, Cribl, Elastic.
This is the feature that lets security teams say yes. If an AI agent is taking actions in your environment, your security operations team needs to see those actions in the same observability stack where they see everything else. Asking them to log into a separate AI dashboard to audit agent behavior is not going to work. OTel compatibility means Claude Cowork activity shows up in the same Splunk dashboards that already alert on anomalous API calls or unusual file access patterns.
The event schema includes whether actions were manually approved, which matters for compliance auditing. You can answer "did a human approve this AI action?" from your SIEM without touching the Anthropic console.
### 5. Per-Tool Connector Controls
Admins can now configure which actions are available within each MCP connector at the organizational level. You can enable a GitHub connector in read-only mode — Claude can see the repository but cannot push commits. You can allow a Linear connector to query and comment but not close issues. You can give one group full access to a Slack connector and give another group read-only access.
This is fine-grained permission management for the MCP layer, and it matters because MCP connectors are where things can go wrong. An AI agent with unrestricted write access to your production GitHub repo is a different risk profile than one that can only read and comment. Per-tool connector controls let you tighten the blast radius of any individual workflow without disabling it entirely.
### 6. Zoom MCP Connector
Zoom launched a native MCP connector alongside the Cowork GA, bringing meeting intelligence directly into Cowork. The connector delivers AI Companion meeting summaries, action items, and transcripts into Cowork projects, which can then feed into agent workflows.
The practical use case: a Cowork routine fires after a standup, pulls the Zoom transcript, extracts action items, and creates Linear tickets with the appropriate owners and due dates. The meeting does not produce a document that gets filed and forgotten — it produces tracked work items automatically. This is the kind of workflow that previously required a custom integration; now it is a connector configuration.
## Why the Bundle Matters
Each of these features is useful in isolation. Together, they answer the four questions that enterprise IT and security teams ask before deploying any new tool:
1. **Can we control who accesses what?** — Yes: RBAC + SCIM + per-tool connector controls.
2. **Can we predict and cap spend?** — Yes: group spend limits.
3. **Can we audit what the AI is doing?** — Yes: OTel export to your existing SIEM.
4. **Can we see adoption and usage data?** — Yes: analytics dashboard and API.
These are not features that individual developers care about. They are features that make a CISO and a CFO comfortable enough to greenlight an org-wide deployment. Anthropic shipped all four in one announcement. That is not a coincidence — it is a deliberate enterprise readiness package.
## What Is Still Missing
Claude Cowork GA is not a complete enterprise solution. A few gaps are worth noting.
There is no on-premises or VPC deployment option. Everything runs on Anthropic's infrastructure. For organizations with strict data residency requirements — financial services, healthcare, defense contractors — that is a blocker. The AWS Bedrock path exists for those organizations, but Cowork itself does not yet support private cloud deployment.
The RBAC model is also relatively coarse. You can define roles at the group level, but there is no workflow-level permissions system yet — you cannot say "only Jane can approve this specific agent's write actions." That level of granularity would matter for high-stakes automated workflows.
And the Zoom connector, while useful, is a single vendor. The MCP ecosystem is rich enough that teams expecting Salesforce, ServiceNow, or Workday connectors at the same level of integration will need to build or wait.
## The Bigger Picture
Claude Cowork's GA is Anthropic making a clear bet that AI tooling will be purchased and managed by enterprises, not just adopted bottom-up by individual developers. The feature set maps directly to enterprise procurement requirements, not developer preferences.
This is the right bet. Individual developer adoption got AI coding tools to where they are today. The next growth phase — the one that justifies $30B ARR projections — requires enterprise IT departments to say yes. That requires RBAC, audit logs, spend controls, and SIEM integration. Claude Cowork GA delivers exactly that.
Whether enterprises actually deploy it at scale is a different question, and the answer depends on how well the workflows hold up against real organizational complexity. But the prerequisites for enterprise consideration are now in place.
---
**Sources**
- [Making Claude Cowork ready for enterprise — Anthropic Blog](https://claude.com/blog/cowork-for-enterprise)
- [Anthropic scales up with enterprise features for Claude Cowork and Managed Agents — 9to5Mac](https://9to5mac.com/2026/04/09/anthropic-scales-up-with-enterprise-features-for-claude-cowork-and-managed-agents/)
- [Claude Cowork Reaches GA with 6 Enterprise Management Features — Lilting Channel](https://lilting.ch/en/articles/claude-cowork-ga-enterprise-rbac-opentelemetry)
- [Zoom Launches MCP Connector for Claude, Giving Meeting Data a Life Beyond Zoom — UC Today](https://www.uctoday.com/productivity-automation/zoom-launches-mcp-connector-for-claude-giving-meeting-data-a-life-beyond-zoom/)
- [Anthropic Launches Managed Agents and Claude Cowork GA: The Triple Announcement of April 9, 2026 — Pasquale Pillitteri](https://pasqualepillitteri.it/en/news/755/anthropic-managed-agents-cowork-ga-april-9-2026)
- [Anthropic and OpenAI target big businesses with enterprise-grade controls and lower pricing — SiliconAngle](https://siliconangle.com/2026/04/09/anthropic-openai-target-big-businesses-enterprise-grade-controls-lower-pricing/)
---
# Anthropic's Silent 'Effort' Default: A Reasonable Decision, a Transparency Failure
URL: https://sdd.sh/2026/04/anthropics-silent-effort-default-a-reasonable-decision-a-transparency-failure/
Date: 2026-04-16
Updated: 2026-04-16
Tags: Claude Code, Anthropic, Opus 4.6, Performance, Trust, Adaptive Thinking
Categories: AI Tools, Industry
Summary: On March 3, Anthropic quietly changed Claude Opus 4.6's default effort level to 'medium' without telling users. An AMD executive's analysis of 6,852 sessions showed a 73% drop in visible thinking depth. Fortune, VentureBeat, and The Register covered the fallout. Here is what actually changed, why Anthropic did it, and what it means for developers who depend on Claude Code for serious work.
On April 13, 2026, a GitHub issue went viral in the AI tools community. Stella Laurenzo — a machine learning engineer at AMD — posted an analysis of 6,852 Claude Code session files, 17,871 thinking blocks, and 234,760 tool calls, and the picture she painted was damning. Median visible thinking length had dropped from 2,200 characters in January to 600 characters in March. API calls per task showed up to 80 times more retries. Claude's behavior had shifted from "research first, then edit" to "edit first, ask questions later."
The thread blew up. Fortune covered it. VentureBeat ran "Is Anthropic nerfing Claude?" The Register reported that Claude was getting worse — according to Claude itself, apparently.
The core complaint was not just that performance had degraded. It was that nobody had been told.
## What Actually Changed
In early February 2026, Anthropic introduced "adaptive thinking" to Claude Opus 4.6. Instead of applying a fixed reasoning budget to every query, the model would decide how much thinking to apply based on the task. A simple request gets a light touch. A complex refactor gets extended reasoning. The idea is resource efficiency — don't spend 2,000 tokens thinking through "what is 2+2."
That part was announced and documented.
On March 3, 2026, Anthropic did something else: they changed the default effort level for Opus 4.6 to "medium" (internally, effort level 85 out of 100). This was not in a changelog. It was not in a release note. Boris Cherny — Anthropic's executive lead on Claude Code — eventually explained the decision publicly after the April 13 backlash: Claude had been consuming too many tokens per task, users had been complaining about that, and medium effort was the best balance across intelligence, latency, and cost for most users.
That explanation is probably correct. It is also completely beside the point.
## The Transparency Problem
The complaint that matters is not "Anthropic made Claude worse." The complaint is "Anthropic changed the behavior of a tool that developers have built workflows around, and didn't say so."
This distinction is important because it determines what kind of problem this is. If the issue is pure quality degradation, the fix is improving the model. If the issue is communication, the fix is better process. The evidence suggests it is primarily the second.
Here is why the communication failure hit so hard. Developers building on Claude Code do not experience it as a product with version numbers and changelogs — they experience it as a partner with a certain personality and work style. When that work style changes, the mental model breaks. Is Claude worse? Did I change something in my prompts? Is there a problem with my CLAUDE.md file? Am I hitting rate limits? The invisible nature of the change made debugging impossible.
A developer who sees a new model version knows to re-benchmark. A developer who sees no version change but degraded output has no signal to act on. That is a trust problem, not a performance problem.
## The Compounding Factor
The March effort change did not happen in isolation. It came on top of adaptive thinking (February), which itself had already shifted Claude's visible reasoning in ways that users noticed. Then Claude experienced service disruptions in April — including a significant outage on April 15 that generated thousands of user complaints about login failures and degraded output.
When multiple things change or break in close succession, users cannot attribute degradation to any single cause. The cumulative effect is a generalized distrust: Claude is less reliable than it used to be, for reasons that are not fully explained. The Laurenzo analysis was compelling precisely because it provided data that confirmed what developers had been sensing for months.
Anthropic also compounded the problem by being slow to respond. By the time Cherny's explanation appeared, the narrative had already calcified. Users who had spent weeks debugging their Claude Code setups — adjusting prompts, restructuring CLAUDE.md files, switching models, blaming themselves — were not receptive to "we made a reasonable product decision."
## What Anthropic Should Have Done
The decision itself was defensible. An effort level of 85 probably is the right default for most tasks. Token consumption had been a genuine problem — Anthropic publicly acknowledged in early April that users were hitting usage limits "way faster than expected." Managing that is a legitimate product concern.
The correct approach was a changelog entry and a user-facing setting. Something like: "We've changed the default effort level to medium to optimize for token efficiency. Power users running complex autonomous tasks can set effort to high in their settings, or use `/think` for per-query extended reasoning."
One sentence. One documented setting. The trust gap does not open.
Instead, users discovered the change through statistical analysis of their session data. That is not how you treat developers who are depending on your tool to run production workflows.
## What Power Users Can Do Now
The good news is that the effort level is not locked. There are several ways to get more reasoning depth when you need it:
**Per-query extended thinking**: Use `/think` in Claude Code to trigger extended reasoning on a specific task. This works well for complex, high-stakes tasks where you want Claude to work through the problem carefully before acting.
**Effort setting**: In Claude Code settings, you can set `effort: high` to override the medium default globally. This increases token consumption, but if you are on Max or Enterprise and need the depth, it is worth it. Be aware that this affects your usage limits.
**Ultraplan**: For complex planning tasks, `/ultraplan` spins up a dedicated Opus 4.6 session with extended compute specifically allocated for planning. If you are architecting a major refactor or designing a system, this is the right tool rather than fighting the default effort level in a standard session.
**Well-scoped prompts**: Adaptive thinking uses the complexity of the request as a signal. Vague prompts get lighter thinking. Specific, complex prompts with clear constraints get more. This was always true, but it matters more now that the default effort ceiling is lower.
## The Bigger Concern
Beyond this specific incident, there is a structural issue worth naming. As AI tools become more central to how developers work, the standards for how vendors communicate changes need to increase, not decrease. A model behavior change that affects thousands of production workflows is functionally a breaking change. It deserves the same communication standards as a breaking API change.
This is not a niche concern. The JetBrains survey from April 13 showed Claude Code growing 6x in adoption over eight months. The Pragmatic Engineer survey found it the most-loved tool among professional developers. As the user base grows, so does the number of people who have built genuine workflow dependencies on Claude's behavior.
Anthropic knows this. Their response to the backlash has been measured — acknowledging the communication failure, explaining the reasoning, and committing to more transparency about effort defaults going forward. That is the right posture.
But the episode is a useful reminder that even the best AI tools are only as trustworthy as the vendor's operational practices. Capability without transparency is a fragile foundation for infrastructure.
---
**Sources**
- [Is Anthropic 'nerfing' Claude? Users increasingly report performance degradation as leaders push back — VentureBeat](https://venturebeat.com/technology/is-anthropic-nerfing-claude-users-increasingly-report-performance)
- [Anthropic is facing a wave of user backlash over reports of performance issues with its Claude AI chatbot — Fortune](https://fortune.com/2026/04/14/anthropic-claude-performance-decline-user-complaints-backlash-lack-of-transparency-accusations-compute-crunch/)
- [Claude is getting worse, according to Claude — The Register](https://www.theregister.com/2026/04/13/claude_outage_quality_complaints/)
- [Claude code performance under scrutiny after viral 67% drop claim — Cryptonomist](https://en.cryptonomist.ch/2026/04/13/claude-code-performance/)
- [Anthropic admits Claude Code users hitting usage limits 'way faster than expected' — DevClass](https://www.devclass.com/ai-ml/2026/04/01/anthropic-admits-claude-code-users-hitting-usage-limits-way-faster-than-expected/5213575)
- [Claude Code Drama: 6,852 Sessions Prove Performance Collapse — Scortier Substack](https://scortier.substack.com/p/claude-code-drama-6852-sessions-prove)
---
# The Three-Layer AI Coding Stack That Nobody Planned (But Everyone Is Building)
URL: https://sdd.sh/2026/04/the-three-layer-ai-coding-stack-that-nobody-planned-but-everyone-is-building/
Date: 2026-04-15
Updated: 2026-04-15
Tags: Claude Code, Cursor, Codex, AI Tools, Agentic Workflows, Architecture
Categories: AI Tools, Industry
Summary: Cursor, Claude Code, and OpenAI Codex are not converging into a single winner-take-all tool. They are stratifying into three distinct layers — orchestration, execution, and review — and the most sophisticated developers are building workflows that use all three. Here is what each layer does, why Claude Code wins at the execution layer, and what the emergence of OpenAI's Codex plugin for Claude Code signals about where this is heading.
The AI coding tools narrative has spent two years waiting for a winner. Cursor versus Copilot versus Claude Code versus Codex — pick one, evangelize it, wait for the others to die. It is a satisfying story, and it is not what is actually happening.
What is happening is stratification. Cursor, Claude Code, and OpenAI Codex are each staking out distinct roles in a layered architecture, and the most productive developers are treating them the way they treat Terraform, Docker, and Kubernetes: not as competitors you choose between, but as components you compose.
The pattern has a name now — composition over consolidation — and there is hard evidence it is the direction the tooling ecosystem is moving.
## The Three Layers
### Layer 1: Orchestration (Cursor)
Cursor has spent 2026 repositioning itself away from "AI-enhanced IDE" and toward "agent control plane." Cursor 3, released in April, is the clearest expression of that ambition. The centerpiece is the Agents Window: a management interface for running multiple coding agents simultaneously across local machines, cloud sandboxes, SSH connections, and Git worktrees — all from one view.
The `/best-of-n` command is the most telling feature. It takes a single task description and runs it simultaneously across multiple models in isolated worktrees, then surfaces the results for comparison. You pick the winner, or you combine the best parts. The logic is the same as `terraform plan` for infrastructure: generate options before committing. Model selection becomes an infrastructure decision driven by task characteristics, not brand loyalty.
This is an orchestration model. Cursor's value at this layer is not in writing the code — it is in managing the fleet of agents that write the code, controlling which models get which tasks, and providing a unified interface over a heterogeneous set of execution environments.
### Layer 2: Execution (Claude Code and Codex)
This is where code actually gets written. Claude Code and OpenAI Codex occupy the same layer and compete on it — but the competitive dynamics are more interesting than simple head-to-head.
Claude Code's position at the execution layer is strong. A February 2026 survey of 906 software engineers put Claude Code at a 46% "most loved" rating — the highest in the field. SemiAnalysis estimates it accounts for approximately 4% of all public GitHub commits as of March 2026, with projections toward 20% by year-end. Claude Code's terminal-native architecture, deep codebase understanding, and Anthropic's commitment to agentic workflows give it structural advantages at complex, long-context execution tasks.
Codex has reached 3 million weekly active users on OpenAI's side, running autonomous coding tasks in sandboxed cloud environments. It is fast, capable, and increasingly embedded in enterprise pipelines that were already built around the OpenAI API surface.
The interesting development is not who is ahead. It is that OpenAI published `codex-plugin-cc` — a plugin that allows Codex to run inside Claude Code as a review agent. Let that sink in: OpenAI released a plugin for a competitor's terminal to extend that competitor's tool. Rather than waiting for Claude Code users to switch to Codex, OpenAI embedded Codex where they already work.
This is infrastructure distribution thinking, not zero-sum competition thinking. And it creates a review layer that is genuinely valuable.
### Layer 3: Review (Cross-Provider Verification)
The codex-plugin-cc capability exposes a structural problem with any single-model development workflow: the model that writes the code is poorly positioned to independently catch its own errors. Its training biases, its failure modes, and its blindspots are consistent. Self-review by the same model is not adversarial.
Running Codex as a review agent inside Claude Code introduces a different model's perspective at verification time. The plugin supports standard code review, adversarial pressure-testing around authentication and race conditions, and automatic review gates that block completion if issues appear. The two models were trained on different data, by different teams, with different objectives. Their failure modes are not correlated. That is the point.
This is a pattern borrowed from security engineering — the difference between a developer testing their own code and a dedicated security team with adversarial intent. The emerging cross-provider review layer formalizes that logic at the AI level.
## The Stack Developers Are Actually Building
The practical upshot of this stratification is visible in what developers are deploying at scale.
A survey by The Pragmatic Engineer in February 2026 found that many senior engineers have converged on a two-tool baseline: Cursor for daily IDE work, and Claude Code for complex tasks requiring deep codebase context. This combination runs approximately $40 per month and covers the full range of development scenarios these developers encounter.
For teams with more sophisticated requirements, the pattern extends: Cursor orchestrates a fleet of agents, Claude Code handles execution of complex multi-file tasks, Codex runs as a review agent via the plugin, and GitHub Actions handles CI integration. Each component does what it is good at. None of them is trying to replace the others.
This is compositional reasoning applied to tooling. It is how mature developers already think about their infrastructure stack — you do not choose between a load balancer and a database — and it is how AI coding tools are beginning to be understood.
## Why Claude Code Wins at the Execution Layer (For Now)
The terminal-native architecture is not an aesthetic choice. It is an architectural one that determines what Claude Code can and cannot do.
Running in the terminal means Claude Code has native access to your full development environment: the file system, the shell, running processes, git history, environment variables, the actual test runner output. It does not mediate through an IDE's extension API. It does not inherit an editor's mental model of what a "project" is.
For tasks that require genuine codebase understanding — large refactors, multi-file changes that need to reason about the whole system, debugging failures that require understanding what changed and why — this full-context access matters. IDE-based tools are architecturally constrained to the view the editor exposes. Claude Code is not.
The autonomy claim is real but requires context. Claude Code's Auto Mode, the `/ultraplan` cloud planning sessions, and now Routines are all expressions of the same thesis: the agent should be capable of doing more work without interrupting the developer. Not every task needs human approval at each step. The tools that accept this and build for it are structurally different from tools that treat every action as requiring human confirmation.
Cursor's self-hosted cloud agents and Codex's sandboxed execution model are attempts to reach similar autonomy. They are credible. But they started from the IDE-centric model and are working toward autonomy, while Claude Code started from the autonomous agent model and is working toward better UI. The starting positions matter.
## What the Convergence Signals
The emergence of a three-layer stack does not mean the competition is over. Cursor is competing at the orchestration layer by building a fleet management interface. Claude Code is extending at the execution layer through Routines, Ultraplan, and cloud-native features. Codex is distributing itself as infrastructure through the plugin model.
Each company has a defensible position in the layer where they are strongest, and each is attempting to expand into adjacent layers. Anthropic's routines are a move from pure execution into scheduled orchestration. Cursor's `/best-of-n` is a move from orchestration into multi-model execution management.
The interoperability is real — OpenAI publishing a plugin for Anthropic's product is not a press release, it is a distribution decision — and it reflects a market that is large enough that the dominant strategy for any individual player is to make their layer indispensable, not to win at every layer.
For developers, the implication is practical: the teams treating AI coding tools as a pick-one decision are leaving capability on the table. The teams that have figured out what each layer is good for and composed them accordingly are operating at a different productivity level.
The three-layer stack was not designed by any committee. It emerged from developers solving real problems with the tools available. That is usually how the important infrastructure patterns happen.
---
**Sources**
- [Cursor, Claude Code, and Codex are merging into one AI coding stack nobody planned — The New Stack](https://thenewstack.io/ai-coding-tool-stack/)
- [AI Coding Stack Emerges Across Cursor, Claude, and Codex — AI Bucket](https://www.aibucket.io/post/ai-coding-stack-emerges-cursor-claude-codex)
- [Which AI Coding Tools Do Developers Actually Use at Work? — JetBrains Research Blog](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/)
- [OpenAI Codex vs Cursor vs Claude Code: Which AI Coding Tool Should You Use in 2026? — NxCode](https://www.nxcode.io/resources/news/openai-codex-vs-cursor-vs-claude-code-ai-coding-tools-2026)
- [Claude Code overview — Claude Code Docs](https://code.claude.com/docs/en/overview)
---
# Claude Code Routines: The AI Cron Job That Actually Understands Your Codebase
URL: https://sdd.sh/2026/04/claude-code-routines-the-ai-cron-job-that-actually-understands-your-codebase/
Date: 2026-04-15
Updated: 2026-04-15
Tags: Claude Code, Automation, Agentic Workflows, GitHub, API, Productivity
Categories: AI Tools, Agentic Workflows
Summary: Claude Code's new Routines feature — launched April 14 as a research preview — turns your AI agent into a cloud-native automation engine. Schedule it, trigger it via API, or fire it on GitHub events. Here is what routines are, how each trigger type works, and why this is a bigger architectural shift than it looks.
Most people use Claude Code reactively. You open a terminal, describe a task, and watch it go. The session ends when the work is done, and Claude Code goes silent until you invoke it again.
Routines change the orientation entirely. Launched April 14, 2026 as a research preview, routines let you configure Claude Code once — a prompt, a repository, a set of connectors — and then have it run automatically on a schedule, in response to an HTTP call, or whenever something happens in your GitHub repo. It runs on Anthropic's cloud infrastructure. Your laptop doesn't need to be on.
This is not a convenience feature. It is a different model of how an AI coding agent fits into a software team's workflow.
## What a Routine Is
A routine is a saved Claude Code configuration: a prompt that defines what to do, one or more GitHub repositories to work in, environment variables and setup scripts, and MCP connectors (Slack, Linear, Google Drive, or anything else you have connected). You configure it once. The triggers determine when it runs.
Each run creates a new Claude Code cloud session. Claude starts from the repository's default branch, works in a `claude/`-prefixed branch by default, and can push changes, open pull requests, or call external services through connectors. Everything it does appears in your session list — reviewable, continuable, and auditable.
Crucially, routines run autonomously. There is no permission-mode picker and no approval prompt mid-run. What Claude can reach is bounded at setup time by which repositories you select, which connectors you include, and what your environment allows. Tighten those constraints before you deploy anything that touches production.
## Three Ways to Trigger a Routine
### Scheduled Triggers
Pick a cadence — hourly, daily, weekdays, or weekly — and Claude Code will start a session at that interval. You enter the time in your local timezone and Anthropic handles the conversion. Runs may start a few minutes late due to stagger, and that offset is consistent per routine.
For custom intervals (every two hours, first of the month), configure the closest preset in the UI and then run `/schedule update` from the CLI to set a specific cron expression. The minimum interval is one hour.
The canonical use case: a nightly backlog triage routine that pulls unprocessed issues from Linear, applies labels and owner assignments based on the area of code referenced, and posts a summary to Slack at 7am so the team starts the day with a groomed queue rather than an inbox. The engineer who used to do this on Monday mornings no longer does.
### API Triggers
Every routine can be given a dedicated HTTP endpoint and a bearer token. POST to it with the token in the `Authorization` header and Claude Code starts a new session. An optional `text` field in the request body passes run-specific context — an alert body, a log snippet, a failing test output — alongside the routine's saved prompt.
The call looks like this:
```bash
curl -X POST https://api.anthropic.com/v1/claude_code/routines/trig_01ABCDEFGHJKLMNOPQRSTUVW/fire \
-H "Authorization: Bearer sk-ant-oat01-xxxxx" \
-H "anthropic-beta: experimental-cc-routine-2026-04-01" \
-H "anthropic-version: 2023-06-01" \
-H "Content-Type: application/json" \
-d '{"text": "Sentry alert SEN-4521 fired in prod. Stack trace attached."}'
```
The response returns a session ID and a URL you can open to watch the run in real time:
```json
{
"type": "routine_fire",
"claude_code_session_id": "session_01HJKLMNOPQRSTUVWXYZ",
"claude_code_session_url": "https://claude.ai/code/session_01HJKLMNOPQRSTUVWXYZ"
}
```
The `/fire` endpoint ships under the `experimental-cc-routine-2026-04-01` beta header. It is explicitly experimental: shapes, rate limits, and token semantics may change. Anthropic has committed to keeping the two most recent previous header versions active so callers have time to migrate when it evolves.
The use case here is alert-to-PR pipelines. Your monitoring system fires an alert. The routine receives the stack trace, correlates it with recent commits in the repo, and opens a draft pull request with a proposed fix and a link back to the alert. Your on-call engineer reviews a PR rather than starting from a blank terminal at 2am. That is a meaningful difference.
### GitHub Event Triggers
This is where routines get genuinely powerful. A GitHub trigger fires a new Claude Code session whenever a matching event occurs in a connected repository. The list of supported events is comprehensive: pull requests (opened, closed, assigned, labeled, synchronized), PR reviews, push events, releases, issues, issue comments, check runs, check suites, workflow runs, workflow jobs, and more.
You can filter pull request triggers by author, title, body, base branch, head branch, labels, draft status, and whether the PR comes from a fork. Practical examples:
- **Auth module review**: triggers on PRs targeting `main` from branches containing `auth-provider`, running a security-focused review checklist.
- **External contributor triage**: triggers on all fork-based PRs, routing them through an extra review before a maintainer looks.
- **Automatic SDK porting**: triggers on merged PRs in one SDK repository and opens a parallel PR in a second SDK with the equivalent change — keeping two language implementations in sync without a human re-implementing each change.
- **CI failure triage**: triggers on `check_suite.completed` with a failure status, pulling the failing test output and proposing a fix before the developer even opens the notification.
Each matching event creates its own independent session. There is no cross-event session reuse, so two pushes produce two separate Claude Code runs — each with full context from the new event.
## The CLI Interface
From inside any Claude Code session, `/schedule` creates a scheduled routine conversationally. You can also pass a description directly: `/schedule daily PR review at 9am`. The CLI walks through prompt, repository, environment, and saves the routine to your account.
- `/schedule list` — see all routines
- `/schedule update` — modify an existing routine, including setting custom cron expressions
- `/schedule run` — trigger a run immediately
API and GitHub triggers can only be configured from the web UI at `claude.ai/code/routines`. The `/schedule` CLI currently handles scheduled routines only.
## Plan Limits and Usage
Routines draw down subscription usage the same way interactive sessions do. There is also a separate daily cap on routine runs per account:
| Plan | Daily routine runs |
|---|---|
| Pro | 5 |
| Max | 15 |
| Team / Enterprise | 25 |
Organizations with extra usage enabled can continue running routines on metered overage when the daily cap is hit. Without it, additional runs are rejected until the window resets.
Five daily routines on Pro sounds limiting, but consider what those five can cover: a nightly backlog triage, a deploy verification hook, a PR review gate, a documentation drift check, and one alert-response pipeline. That is a meaningful set of async coverage for an individual developer. Max's fifteen is the tier where routines start to become a team-level infrastructure decision.
## Why This Is More Than Convenient Cron
The standard objection to describing this as significant is that developers already have cron, GitHub Actions, and a dozen CI/CD integrations. What does adding Claude Code to the scheduler actually change?
The answer is the difference between executing a script and completing a task. GitHub Actions runs your defined workflow. A Claude Code routine reads the state of your codebase, applies judgment about what needs to happen given that state, and produces an output — a PR, a comment, a Slack message, a proposed fix — tailored to the specific context of this run.
A GitHub Action that runs on PR open applies a static checklist. A routine that runs on PR open reads the changed files, understands what the change is trying to accomplish, applies your team's documented review standards, and leaves inline comments that are specific to this PR. These are different things.
This is also distinct from Claude Code's existing GitHub Actions integration (which runs Claude in your CI pipeline) and from desktop scheduled tasks (which run on your machine). Routines are cloud-native, event-responsive, and persistent — the session URL is always accessible, even after the run finishes.
## The Catch
Routines are in research preview. Anthropic is explicit: behavior, limits, and the API surface may change. The `/fire` endpoint's beta header signals this is experimental infrastructure, not a stable API commitment.
The more operational concern is autonomy without oversight. A routine runs without approval prompts. If your prompt is underspecified, if the connectors it has access to are too broad, or if the Claude GitHub App has write access to branches you care about, a routine can do more than you intended. The default branch protection (Claude only pushes to `claude/`-prefixed branches) is there for a reason. Trust it until you have a specific reason to widen the permission.
The correct approach is to start narrow: one trigger, one well-scoped prompt, one repository, minimum connector access. Watch the first ten runs. Then expand.
## What It Signals
Routines are the first Claude Code feature that does not require a human to initiate anything. You set it up once and it runs. The agent operates on its own cadence, responding to the state of your codebase and the events your infrastructure generates.
That is a different mental model than "AI assistant." It is closer to "AI colleague who works the overnight shift, triages what they can, and leaves a queue of reviewed drafts for the morning." Whether routines fulfill that promise will depend on how well they handle the inevitable edge cases — but the architecture is right, and the direction is clear.
---
**Sources**
- [Introducing routines in Claude Code — Anthropic Blog](https://claude.com/blog/introducing-routines-in-claude-code)
- [Automate work with routines — Claude Code Docs](https://code.claude.com/docs/en/routines)
- [Anthropic adds routines to redesigned Claude Code, here's how it works — 9to5Mac](https://9to5mac.com/2026/04/14/anthropic-adds-repeatable-routines-feature-to-claude-code-heres-how-it-works/)
- [Claude Code routines promise mildly clever cron jobs — The Register](https://www.theregister.com/2026/04/14/claude_code_routines)
- [Anthropic's Claude Code gets automated 'routines' and a desktop makeover — SiliconAngle](https://siliconangle.com/2026/04/14/anthropics-claude-code-gets-automated-routines-desktop-makeover/)
---
# Claude Code Analytics API: The Missing Bridge Between AI Coding and Enterprise ROI
URL: https://sdd.sh/2026/04/claude-code-analytics-api-the-missing-bridge-between-ai-coding-and-enterprise-roi/
Date: 2026-04-14
Updated: 2026-04-14
Tags: Claude Code, Enterprise, Analytics, API, Productivity, ROI
Categories: AI Tools, Guides
Summary: Anthropic's Claude Code Analytics API gives enterprise organizations programmatic access to daily aggregated usage metrics — commits, PRs, lines of code, session counts, token costs, and more — per developer, per day. Here is what it tracks, how to set it up, and why it matters for every team that needs to justify its AI coding investment to leadership.
There is a moment every engineering leader eventually faces after deploying Claude Code at scale: someone in finance asks what the productivity gain actually is, in numbers they can put in a spreadsheet.
Until recently, the honest answer was "hard to measure." Developers loved the tool, velocity seemed up, but connecting AI coding activity to concrete output metrics required custom instrumentation, opinion surveys, or rough estimation.
Anthropic's Claude Code Analytics API closes that gap. Released in early 2026 and extended significantly with the Cowork GA launch in April, it gives enterprise organizations programmatic access to daily aggregated usage metrics per developer — commits created through Claude Code, pull requests opened, lines of code added and removed, session counts, tool acceptance rates, token usage by model, and estimated cost. No manual reporting. No developer self-assessment. Direct API access to what actually happened.
## What the API Tracks
The Claude Code Analytics API returns records aggregated at the per-user, per-day level. Each record contains:
**Productivity signals**
- Lines of code added via Claude Code
- Lines of code removed via Claude Code
- Commits created through Claude Code's commit functionality
- Pull requests created through Claude Code's PR functionality
- Number of distinct Claude Code sessions
**Tool usage signals**
- Tool call acceptance rates (how often developers approve vs. reject Claude's suggested actions)
- Tool call rejection rates (leading indicator of prompt quality or task mismatch)
- Breakdown by tool type (file edits, bash commands, web fetches, etc.)
**Cost and model signals**
- Token usage broken down by Claude model
- Estimated cost per user per day
- Customer type and terminal type metadata
The Enterprise Analytics API — a related but broader endpoint — also captures per-user engagement: conversation counts, messages sent, projects created, files uploaded, artifacts created, skills invoked, connectors used, and the Claude Code-specific metrics above rolled up for org-level reporting.
Data is available for up to 90 days of history (with records beginning January 1, 2026). Activity appears in the API within approximately one hour of completion, though the API excludes data newer than one hour to ensure pagination consistency.
## Getting Started
Access requires an Admin API key — a distinct credential from standard API keys, provisioned through the Claude Console. Only organization members with the Primary Owner role can mint Admin API keys.
Once you have one, the Claude Code Analytics API is accessed via `GET /v1/organizations/{org_id}/usage/claude_code` with standard date range parameters:
```bash
curl https://api.anthropic.com/v1/organizations/{org_id}/usage/claude_code \
-H "x-api-key: $ADMIN_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-G \
-d "start_date=2026-04-01" \
-d "end_date=2026-04-13"
```
The response is paginated JSON with one record per (user, date) pair. Each record contains the full metric set described above. Standard REST pagination applies — iterate through pages until you have the full dataset for your date range.
For the Claude Enterprise Analytics API (conversation-level metrics), the endpoint is separate: `GET /v1/organizations/{org_id}/usage/users`.
## Why Tool Acceptance Rates Matter More Than You Think
Most teams focus on the headline metrics — commits, PRs, lines of code. But tool acceptance and rejection rates are among the most valuable signals in the dataset, and they're easy to overlook.
When Claude Code proposes a file edit, a bash command, or a web search, the developer either approves or denies it. In Claude Code's Auto Mode, many actions are approved automatically. In standard mode, the developer reviews each one.
High rejection rates on specific tool types indicate something important:
- Claude is proposing actions the developer doesn't trust (prompting or spec quality issue)
- The task type is poorly suited to autonomous execution
- The developer's CLAUDE.md configuration is too permissive for their comfort level
High acceptance rates indicate the inverse: Claude is operating in a zone of trusted, predictable behavior. This is the configuration state you want for autonomous workflows.
Teams that instrument their rejection rates typically discover that new developers reject significantly more tool calls than experienced Claude Code users — useful data for onboarding and training programs. They also discover that certain task types have structurally low acceptance, which surfaces where autonomous workflows need better spec design before they're handed off.
## Building the ROI Case
Let's be specific about what this data enables.
**Commits per developer per week**: The single most comparable metric across teams. If your developers averaged 8 commits per week before Claude Code and are now averaging 14, that is a 75% increase in observable output. Finance can work with that.
**Cost per commit**: Total Claude Code spend (available from the cost metrics) divided by total commits. This gives you a cost-per-unit-of-output number that can be compared against the alternative: developer time cost per commit, including review cycles.
**High-value developer leverage**: Sort users by Claude Code session count and compare to commit output. Developers with high session counts and high commit rates are getting full value from autonomous workflows. Developers with high session counts but average commit rates may be using Claude Code conversationally rather than agentically — an onboarding opportunity.
**Model cost optimization**: The token breakdown by model shows whether your team is over-indexed on Opus when Sonnet would suffice for specific task types. This is typically a 2-5x cost difference per token. At scale, model routing based on task complexity pays for itself.
## OpenTelemetry and SIEM Integration
For security-conscious enterprise deployments, the analytics story goes beyond the API. Claude Code v2.1.94 and Claude Cowork GA introduced expanded OpenTelemetry support on Team and Enterprise plans.
OpenTelemetry events are emitted for:
- Tool calls (what was invoked, by which user)
- File modifications (what was changed)
- Whether AI-initiated actions received manual or automatic approval
These events plug directly into standard SIEM pipelines — Splunk, Cribl, Datadog, and equivalents. The result is that Claude Code activity becomes auditable in the same systems where you audit SSH access, code deployments, and database queries.
For organizations with compliance requirements, this is significant. You are not trusting Claude Code activity on faith — you are treating it as a first-class operational event, logged and auditable like any other privileged action.
## The Adoption Curve Problem (and How Analytics Solves It)
Grassroots adoption of Claude Code creates a specific enterprise problem: some developers are running full autonomous workflows while others are treating it as a better autocomplete. Both appear as "active users" in a license count.
The Analytics API exposes the actual distribution. A team of 50 developers with 50 active licenses might show:
- 12 developers in autonomous mode (high sessions, high commit output, high acceptance rates)
- 23 developers in conversational mode (high sessions, moderate output)
- 15 developers in minimal use (low sessions, low everything)
The 12 autonomous users are generating the ROI. The 23 conversational users have potential. The 15 minimal users are an onboarding problem.
You cannot build that picture from a license dashboard. You can build it from the Analytics API.
This is what separates "we use Claude Code" from "we have deployed Claude Code effectively." The former is an adoption metric. The latter is an outcome metric. Analytics converts one into the other.
## Connecting to GitHub and BI Tools
The Claude Code Analytics API outputs standard JSON over REST, which means it pipes into any BI tool without custom connectors. Typical integrations:
- **Looker / Tableau / Power BI**: Pull daily via scheduled API calls, load into a warehouse, build dashboards against it.
- **GitHub Actions**: Compare Claude Code commit metrics against total repository commit volume to calculate attribution percentage.
- **Notion / Confluence**: Automated weekly reports generated from the API and posted to engineering wikis.
- **PagerDuty / OpsGenie**: Alert on anomalous rejection rate spikes (often indicates a bad CLAUDE.md push or a prompt regression in a shared skill).
The 90-day history limit means you have approximately three months of runway before data starts rolling off. If you need longer retention, build the export pipeline early and store your own copy.
## The Bottom Line
The Claude Code Analytics API is not a feature for power users. It is a feature for organizations that have moved past "should we deploy Claude Code?" into "how do we optimize the deployment we already have?"
At $30 billion ARR and 1,000+ enterprise customers spending over $1M per year, Anthropic is clearly talking to engineering organizations that need this kind of accountability layer. The API reflects that understanding: it gives you the data to answer the productivity question, the cost question, and the adoption quality question in the same place.
If you are managing a Claude Code deployment of more than 10 developers and you are not pulling from this API, you are operating on intuition. That works until someone with a spreadsheet asks you to defend the budget. Build the pipeline first.
---
**Sources**
- [Claude Code Analytics API — Claude API Docs](https://platform.claude.com/docs/en/build-with-claude/claude-code-analytics-api)
- [Track team usage with analytics — Claude Code Docs](https://code.claude.com/docs/en/analytics)
- [Claude Enterprise Analytics API Reference Guide — Claude Help Center](https://support.claude.com/en/articles/13703965-claude-enterprise-analytics-api-reference-guide)
- [Access engagement and adoption data with the Analytics API — Claude Help Center](https://support.claude.com/en/articles/13694757-access-engagement-and-adoption-data-with-the-analytics-api)
- [How to Use Claude Code Analytics via API — Apidog](https://apidog.com/blog/claude-code-analytics-api/)
- [Claude Cowork Reaches GA with 6 Enterprise Management Features — Lilting Channel](https://lilting.ch/en/articles/claude-cowork-ga-enterprise-rbac-opentelemetry)
---
# Anthropic Hits $30B ARR and Overtakes OpenAI: What the Revenue Rocket Means for Claude Code
URL: https://sdd.sh/2026/04/anthropic-hits-30b-arr-and-overtakes-openai-what-the-revenue-rocket-means-for-claude-code/
Date: 2026-04-14
Updated: 2026-04-14
Tags: Anthropic, Claude Code, Industry, Enterprise, Revenue, Infrastructure
Categories: AI Tools, Industry
Summary: Anthropic just reported a $30 billion annual run rate — up 3x from $9B just four months ago — and overtook OpenAI in revenue. With a CoreWeave infrastructure deal, a Broadcom/Google TPU compute agreement, and 1,000+ enterprise customers spending over $1M per year, the company building Claude Code is now the fastest-growing software company in history. Here is what that means for the tools you use.
The number sounds made up: $30 billion annual run rate, reported in April 2026. Four months ago, Anthropic's ARR was $9 billion. Four months before that, $4.5 billion. Four months before that, roughly $1 billion. This is not a hockey stick. This is a vertical line.
Anthropic confirmed this week that its revenue run rate has surpassed $30 billion — overtaking OpenAI's reported $25 billion ARR for the first time. For context on the speed: it took Anthropic from its founding in 2021 to early 2025 to reach $1 billion ARR. It took the next fifteen months to add $29 billion more.
This matters for Claude Code users. When the company building your primary development tool is the fastest-growing software business in recorded history, it is not an abstract financial story. It is a product roadmap signal.
## The Numbers in Full
Anthropic's ARR trajectory since January 2025:
| Date | ARR |
|------|-----|
| January 2025 | ~$1B |
| Mid-2025 | ~$4.5B |
| End of 2025 | ~$9B |
| February 2026 (Series G) | ~$14B |
| April 2026 | $30B+ |
This is not a rounding-error jump. The company more than tripled revenue in four months. More significantly: it overtook OpenAI.
OpenAI built the GPT brand, shipped ChatGPT — the fastest consumer product to 100 million users in history — and raised over $40 billion in capital. Anthropic, working from a safety-first research agenda and a single-model family, has now surpassed it in revenue. That outcome would have been dismissed as fantasy speculation two years ago.
## Who Is Actually Paying
The composition of Anthropic's revenue is the more instructive number: roughly 80% enterprise. This is not a consumer subscription story. This is large organizations paying real money to put Claude into production systems.
More than 1,000 business customers are now spending over $1 million per year on Claude services — a figure that has more than doubled since February 2026. Enterprise buyers at that spend level do not experiment. They commit budgets, sign multi-year contracts, and integrate deeply into internal workflows.
For Claude Code specifically, enterprise adoption is driven by something concrete: measurable productivity outcomes. Teams that deploy Claude Code in autonomous, agentic modes — with specs, CLAUDE.md configs, and skill libraries — are reporting hours-per-week reclaimed per developer. At $1M+ annual spend, those ROI conversations close quickly.
## The Infrastructure Bets Signal Long-Term Conviction
Revenue is one signal. Capital allocation is another, and Anthropic is placing enormous infrastructure bets.
**CoreWeave deal (April 10, 2026)**: Anthropic signed a multi-year agreement with CoreWeave to provide GPU capacity for Claude workloads at production scale. CoreWeave's stock jumped 11% on the announcement. The deal brings online compute across multiple NVIDIA chip architectures in US data centers, with a phased rollout starting later this year. CoreWeave now has nine of the ten largest AI model providers on its platform — and Anthropic is the marquee name.
**Broadcom / Google TPU agreement**: Separately, Anthropic has signed a long-term deal with Google and Broadcom for 3.5 gigawatts of TPU compute starting in 2027. For reference, the entire global data center industry consumed roughly 500 gigawatts annually in 2024. Anthropic is contracting for 3.5 GW of dedicated AI compute. This is a bet that demand for Claude inference will be orders of magnitude larger than it is today.
**Own chip exploration**: Reports this week indicate Anthropic is actively exploring building its own AI chips — a move that would mirror Google's TPU program and Apple's silicon strategy. If confirmed, this would represent the most ambitious infrastructure play the company has made.
These are not the moves of a company that expects revenue to plateau.
## What This Means for Claude Code
There is a direct line between Anthropic's revenue and what gets shipped in Claude Code. More revenue means more research engineers, more infrastructure, faster iteration cycles, and the ability to run larger models at lower per-token costs.
Several implications are already visible:
**Faster model improvements**: The gap between Claude Opus 4.6 and Claude Mythos — the leaked model that reportedly scores 93.9% on SWE-bench Verified vs. 80.8% for Opus — demonstrates the pace of capability improvement. With 3.5 GW of future compute locked in, model training runs at a scale that makes step-change improvements sustainable.
**Enterprise feature investment**: The April 9 launch of Claude Cowork GA, the Claude Managed Agents API, the Claude Code Analytics API, and Bedrock Mantle support (v2.1.94) all shipped within a two-week window. This is what enterprise revenue buys: a product organization with enough resources to ship across multiple product lines simultaneously.
**Price compression**: When a model provider's revenue triples in four months, the economics of serving inference improve. Expect per-token costs to continue falling. The $0.08/agent runtime hour price for Claude Managed Agents is already lower than most developers expected.
**Competitive moat**: Anthropic now has the financial position to out-invest rivals on safety research, model capability, and enterprise tooling simultaneously. A year ago, the question was whether Anthropic could compete with OpenAI's capital advantage. That question has been answered.
## The Competitive Realignment
Overtaking OpenAI in revenue represents something beyond a market share data point. It reflects a fundamental difference in strategy that has played out in the enterprise's favor.
OpenAI's revenue mix is heavily consumer — ChatGPT Plus subscriptions, consumer API usage, Microsoft licensing revenue. These are large numbers, but they come with high churn risk and commoditization pressure as smaller models improve. Enterprise revenue at $1M+ annual contracts is stickier, compounds over time, and builds switching costs through deep integration.
Anthropic bet on enterprise and on developers. That bet is paying off at a rate that surpassed almost everyone's model.
For the record: Google's AI revenue (Gemini enterprise, Vertex AI) has also accelerated sharply this year. This is not a two-horse race, and it would be wrong to declare the outcome settled. But Anthropic is clearly in the lead position entering Q2 2026, with infrastructure locked in to sustain it.
## The IPO Signal
Anthropic has been public about a target IPO in late 2026. At $30B ARR and 80%+ enterprise composition, the fundamental case is strong. The more interesting question is how the IPO story gets framed: is this an AI lab going public, or a software company with an AI product surface?
The distinction matters for valuation multiples. At $380 billion valuation (the figure from February's Series G terms), Anthropic is already priced at roughly 13x current ARR. By year-end, if growth continues at even half the current rate, the revenue base will look very different — and the IPO price correspondingly so.
For Claude Code users, the IPO changes nothing in the short term and potentially improves everything in the medium term. Public companies under investor scrutiny ship product. Anthropic has every incentive to demonstrate continued growth through product iteration — and Claude Code is the primary developer-facing surface through which that iteration is visible.
## What to Watch
The $30B ARR headline is today's news. The signal to track for the next six months is whether the enterprise customer count continues doubling. More than 1,000 $1M+ customers has doubled since February. If that number is 2,500 by the Series H or IPO filing, the revenue story is not yet at its inflection point — it is still in acceleration.
Claude Code's roadmap, the pace of Anthropic's model releases, and the infrastructure commitments all point in the same direction. The company that builds the tools you use for your most important work is now the fastest-growing software company on the planet.
That is worth understanding. Not for the financial narrative — for the product bet.
---
**Sources**
- [Anthropic Tops $30 Billion Run Rate, Seals Broadcom Deal — Bloomberg](https://www.bloomberg.com/news/articles/2026-04-06/broadcom-confirms-deal-to-ship-google-tpu-chips-to-anthropic)
- [Anthropic Hits $30 Billion Run Rate as Enterprise Demand Accelerates — PYMNTS](https://www.pymnts.com/artificial-intelligence-2/2026/anthropic-hits-30-billion-run-rate-as-enterprise-demand-accelerates/)
- [CoreWeave Announces Multi-Year Agreement With Anthropic — CoreWeave](https://www.coreweave.com/news/coreweave-announces-multi-year-agreement-with-anthropic)
- [CoreWeave stock pops 11% on deal to power Anthropic's Claude — CNBC](https://www.cnbc.com/2026/04/10/coreweave-anthropic-claude-ai-deal.html)
- [Anthropic is exploring building its own AI chips as Claude revenues surge past $30 billion run rate — OneNewsPage](https://www.onenewspage.com/n/Internet/1ztf28lqe2/Anthropic-is-exploring-building-its-own-AI-chips.htm)
- [At the HumanX Conference, Everyone Was Talking About Claude — TechCrunch](https://techcrunch.com/2026/04/12/at-the-humanx-conference-everyone-was-talking-about-claude/)
---
# Claude Code Is Now the #2 AI Coding Tool at Work — and Has the Best NPS in the Industry
URL: https://sdd.sh/2026/04/claude-code-is-now-the-%232-ai-coding-tool-at-work-and-has-the-best-nps-in-the-industry/
Date: 2026-04-13
Updated: 2026-04-13
Tags: Claude Code, AI Tools, Developer Survey, JetBrains, Adoption
Categories: AI Tools, Industry
Summary: JetBrains surveyed 10,000+ developers in January 2026. Claude Code has grown 6x in eight months and now ties Cursor for second place — while GitHub Copilot still leads by adoption, Claude Code leads by every satisfaction metric.
Eight months. That is how long it took Claude Code to go from a new terminal experiment to the second-most-used AI coding tool among professional developers worldwide.
JetBrains published the results of its second AI Pulse survey this month — 10,000+ professional developers across eight languages, surveyed in January 2026. The numbers are striking, and not just for Anthropic. They signal where the professional coding tool market is heading.
## The Headline Numbers
In January 2026, 90% of developers regularly used at least one AI tool at work for coding tasks. That number was unthinkable two years ago. Today it is a baseline.
Within that majority, here is how the specialized AI coding tools rank by work adoption:
- **GitHub Copilot**: 29% (most widely used, 76% brand awareness)
- **Claude Code**: 18% (second place, tied with Cursor)
- **Cursor**: 18% (second place, tied with Claude Code)
Claude Code launched in May 2025. In April–June 2025, it had roughly 3% work adoption. By September 2025, that was up to around 12%. By January 2026: 18%. That is a 6x increase in eight months.
In the US and Canada specifically, Claude Code's work adoption reaches 24% — suggesting the professional developer market in North America is already well past early-adopter territory.
## The Satisfaction Story Is Even Better
Market share is one thing. What developers think of a tool once they use it is another.
Claude Code has the highest product loyalty metrics in the industry:
- **CSAT (satisfaction)**: 91%
- **NPS (likelihood to recommend)**: 54 (scale: -100 to +100)
For context: an NPS above 50 is considered excellent in any software category. GitHub Copilot, which has been shipping since 2021 and has spent years iterating with the world's largest developer community, cannot match these numbers. Neither can Cursor, despite its significant engineering investment and $50 billion valuation.
The Pragmatic Engineer ran a separate survey of 906 software engineers in February 2026. Claude Code earned a 46% "most loved" rating — the highest of any tool in the survey.
These satisfaction scores matter because they predict trajectory. High NPS tools grow through word-of-mouth in engineering teams. When one developer on a team adopts Claude Code and their productivity visibly changes, others follow. That dynamic is already playing out in the adoption data.
## The Awareness Gap Is a Growth Pipeline
Here is the number most people are overlooking: 57% of developers worldwide have heard of Claude Code. But only 18% use it at work.
That gap — 39 percentage points — represents the next wave of adoption. These are developers who are aware of Claude Code but haven't yet switched. Some are waiting for enterprise procurement. Some are comfortable with their current tool. Some are skeptical. But the awareness is there.
Compare this to GitHub Copilot's 76% awareness and 29% adoption — a 47-point gap. Even the incumbent has a large pool of aware-but-not-using developers.
Claude Code's awareness number has itself grown fast: 31% in April–June 2025, 49% in September 2025, 57% in January 2026. The product is still gaining cultural traction.
## Why the Growth Is Different
GitHub Copilot and Cursor both grew through distribution — Copilot via GitHub's integration across millions of repositories, Cursor via replacing VS Code for a developer audience already living in editors. Those are powerful vectors.
Claude Code grew through word-of-mouth from developers who experienced what an agentic, terminal-native workflow actually does to their output. There is no bundling play here. Developers are choosing it because it changes the job, not because it was included in a tool they already use.
That distinction matters for understanding the satisfaction scores. Copilot users include millions of developers who installed it because it was convenient or free through a GitHub subscription. Claude Code users overwhelmingly chose it deliberately, which inflates satisfaction metrics somewhat — but the 91% CSAT suggests even deliberate early adopters are finding it delivers on its promise.
SemiAnalysis estimates that Claude Code now accounts for roughly 4% of all public GitHub commits as of March 2026, with projections suggesting 20% by year-end. If those projections are directionally correct, the adoption chart does not flatten — it accelerates.
## What GitHub Copilot's Lead Actually Means
GitHub Copilot's 29% adoption and 76% awareness reflect five years of distribution dominance. It is baked into GitHub. It ships with Visual Studio. It has enterprise procurement relationships with thousands of companies. Those are real moats.
But moats built on distribution erode when the productivity gap between tools becomes obvious to decision-makers. The JetBrains survey doesn't publish Copilot's NPS directly, but the qualitative data is clear: experienced developers cite Copilot as convenient and adequate. Claude Code users describe it as transformative.
"Adequate" loses to "transformative" when the market matures and developers have budget and choice.
## The Cursor Comparison
Tying Cursor at 18% is noteworthy. Cursor spent 2025 building an extraordinary user base through a VS Code replacement strategy — same interface, better AI. It executed well. Its $50 billion valuation reflects real traction.
But Claude Code achieved the same work adoption number with a fundamentally different paradigm. No IDE to learn. No extension to install. Just a terminal agent that operates autonomously.
The implication: the segment of developers who want autonomous, agentic workflows is at least as large as the segment that wants a smarter IDE. Both are real markets. But only one of those markets is growing into a paradigm that scales beyond human-in-the-loop code suggestions.
## The Path from Here
The JetBrains data suggests we are in the middle innings of a market reshuffling, not the end. Claude Code's trajectory — 3% to 18% in eight months, with 57% awareness and a 91% CSAT — points toward continued share gains.
The question for the rest of 2026 is whether enterprise procurement cycles can keep up with grassroots developer adoption. Developers want it. Their employers are still writing purchase orders for Copilot enterprise licenses bought 18 months ago.
When those contracts come up for renewal, they will be renewing against a tool that their developers are already using at home, already love, and will advocate loudly for.
That is what a 54 NPS looks like in action.
---
**Sources**
- [Which AI Coding Tools Do Developers Actually Use at Work? — JetBrains Research Blog](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/)
- [AI Tooling for Software Engineers in 2026 — The Pragmatic Engineer](https://newsletter.pragmaticengineer.com/p/ai-tooling-2026)
- [AI Tools Race Heats Up: Week of April 3–9, 2026 — DEV Community](https://dev.to/alexmercedcoder/ai-tools-race-heats-up-week-of-april-3-9-2026-37fl)
---
# 84% of Developers Use AI Code Tools. Only 29% Trust What They Ship.
URL: https://sdd.sh/2026/04/84-of-developers-use-ai-code-tools.-only-29-trust-what-they-ship./
Date: 2026-04-13
Updated: 2026-04-13
Tags: AI Tools, Developer Survey, Trust, Security, Production, Agentic Workflows
Categories: AI Tools, Industry
Summary: Stack Overflow's developer survey exposed a paradox: AI coding tool adoption is at an all-time high, but trust in AI-generated code just hit an all-time low. The gap isn't irrational — it's diagnostic. And it points directly to what's broken about the autocomplete paradigm.
The numbers from Stack Overflow's 2025 developer survey don't add up — and that's the point.
84% of developers use or plan to use AI coding tools. Only 29% trust the output to be accurate. That is not a small gap. That is a fundamental indictment of how most AI coding tools work today.
The adoption curve went up. Trust went down. And the divergence is accelerating.
## The Trust Collapse in Detail
Trust in AI code accuracy dropped 11 percentage points from 2024 to 2025 — from 40% to 29%. At the same time, adoption increased. More developers using AI tools, trusting them less. How does that happen?
The answer is simple: developers kept using the tools because the productivity upside is real, even when they don't trust the output. They accepted the overhead of constant verification as the cost of the speed boost.
That tradeoff worked when AI tools handled small, isolated tasks. It breaks down at scale.
The distribution of trust is telling:
- 46% actively distrust AI accuracy
- 29% trust it
- Only 3% report highly trusting AI output
- Experienced developers have the lowest "highly trust" rate (2.6%) and the highest "highly distrust" rate (20%)
The most skeptical developers in the survey are the most experienced ones. This is not coincidence. Senior engineers have seen enough AI-generated code fail in subtle ways — off-by-one errors that pass tests, security oversights that look fine in review, edge cases the model never considered — to have calibrated their distrust precisely.
## The "Almost Right" Trap
The most common frustration cited by developers — 45% of respondents — is AI solutions that are "almost right, but not quite."
This is the central failure mode of the autocomplete paradigm. An AI that autocompletes code is optimizing for plausible next tokens, not for correct program behavior. The output looks right. It compiles. It might even pass your existing tests. But it's subtly wrong in a way that only surfaces in production, under load, with real data, three weeks later.
Debugging AI-generated code takes disproportionate time precisely because the error is hidden inside code that looks reasonable. You can't just read it and spot the problem — you have to understand it deeply enough to find where the model's assumption diverged from reality.
Developers now spend up to 24% of their work week verifying, fixing, and validating AI output. That is a full day out of every five. The productivity gain from AI generation is being partially consumed by the overhead of AI validation.
## The Verification Debt Crisis
Here is the most alarming data point in the survey: 96% of developers don't fully trust AI-generated code, but 48% commit it without verification.
Nearly half of developers are doing something they themselves don't trust.
Time pressure is the driver. Thorough verification of AI-generated code takes time — often more time than writing the code manually. When you're on a deadline and the AI has produced something that looks correct, the temptation to ship it is enormous. Especially when your team already has technical debt from previous sprints.
This behavior creates what the research calls "verification debt": unverified AI outputs get merged, become depended upon downstream, and grow harder to audit over time. The codebase accumulates AI-generated logic that no human fully understands or has validated. Eventually something breaks, and the root cause is traced back to a commit that skipped review because the developer trusted the AI enough to ship but not enough to own.
38% of developers report that reviewing AI-generated code requires more effort than reviewing human-written code. This is counterintuitive. AI is supposed to reduce review burden. Instead, for many teams it's increasing it — because the AI generates at a pace humans can't keep up with, and the output has a particular failure pattern (confident-sounding errors) that makes it harder to catch than the kinds of mistakes humans typically make.
## Where This Leads
Veracode data puts a number on the downstream consequence: 45% of AI-generated code contains security vulnerabilities. With AI code projected to reach 65% of all commits by 2027, and verification practices already under pressure, the industry is trending toward a significant production security problem.
Stack Overflow's April 2026 follow-up analysis of enterprise SaaS teams found that teams relying heavily on AI-generated code with weak verification processes were experiencing a 2.3x increase in security-related incidents compared to 2024. The correlation is direct.
This isn't an argument against AI tools. It's an argument about which AI tools and which workflows.
## The Paradigm Problem
The trust gap is a symptom of a specific failure: the autocomplete and suggestion model was never designed for the task developers now expect it to handle.
Copilot, Cursor, and most AI coding assistants are fundamentally suggestion engines. They produce code. You review it. At low volumes, that works. At the velocity and scale that modern development teams are running — where AI might generate thousands of lines per developer per week — the human review bottleneck becomes the single point of failure.
The review load doesn't scale. Human attention doesn't scale. Trust erodes as output volume rises and review quality degrades.
The alternative isn't to slow down AI generation. It's to shift where verification happens.
## Agents That Verify Their Own Output
The fundamental insight behind agentic coding workflows is that the most valuable thing an AI can do is not just generate code — it's generate code, run tests, observe failures, fix them, and iterate until it has something it can demonstrate works.
Claude Code's terminal-native agentic model is built on this principle. The agent runs your test suite. It checks compilation. It observes runtime behavior. It iterates on its own output before presenting it to you. By the time you see the result, it has been through multiple verification cycles that never touched your attention.
This doesn't eliminate the need for human review. It changes what human review looks like. Instead of reading AI-generated code line by line and trusting your own ability to spot subtle errors, you're reviewing the output of an agent that has already demonstrated its output works — at least against the test suite.
That's a fundamentally different review task. It's reviewing a result, not a suggestion.
## The Trust Gap Is Fixable
The 29% trust number isn't a ceiling. It reflects the current state of a market dominated by suggestion-based tools that optimize for generation speed rather than verification quality.
The data suggests what happens when tools shift that calculus. Claude Code's 91% CSAT and NPS of 54 — the highest in the industry — aren't coincidental. Developers who use agentic tools with built-in verification loops don't experience the same "almost right" problem at the same rate, because the tool itself is doing part of the verification work.
The trust gap will close as the market shifts from autocomplete to agentic. That shift is already underway. The question is how much verification debt the industry accumulates before the new paradigm becomes standard practice.
The 48% who are committing unverified AI code today are not irresponsible developers. They're developers whose tools have put them in an impossible position: generate fast or verify thoroughly, pick one. Agentic workflows eliminate that tradeoff. That's the actual unlock.
---
**Sources**
- [Mind the Gap: Closing the AI Trust Gap for Developers — Stack Overflow Blog](https://stackoverflow.blog/2026/02/18/closing-the-developer-ai-trust-gap/)
- [What the AI Trust Gap Means for Enterprise SaaS — Stack Overflow Blog](https://stackoverflow.blog/2026/04/02/what-the-ai-trust-gap-means-for-enterprise-saas/)
- [Developer AI Trust Crisis: 84% Use, 29% Trust in 2026 — byteiota](https://byteiota.com/developer-ai-trust-crisis-84-use-29-trust-in-2026/)
- [2025 Stack Overflow Developer Survey](https://survey.stackoverflow.co/2025/)
- [84% of Developers Now Use AI Tools, But Trust Is at an All-Time Low — CoderCops](https://www.codercops.com/blog/developer-ai-adoption-84-percent-2026)
---
# Microsoft Agent Framework 1.0: The Enterprise .NET World Just Adopted MCP
URL: https://sdd.sh/2026/04/microsoft-agent-framework-1.0-the-enterprise-.net-world-just-adopted-mcp/
Date: 2026-04-12
Updated: 2026-04-12
Tags: mcp, microsoft, agent-framework, enterprise, agentic-workflows, a2a
Categories: AI Tools, Agentic Workflows, Industry
Summary: Microsoft shipped Agent Framework 1.0 on April 3 with full MCP and A2A protocol support for .NET and Python. This isn't just another framework — it's Microsoft committing the entire enterprise .NET developer ecosystem to MCP as the standard tool integration layer.
The question of whether MCP would become the standard protocol for AI agent tool integration had a few possible outcomes. It could remain an Anthropic-adjacent specification used primarily by Claude Code power users. It could fragment into competing protocols — Google's version, OpenAI's version, Microsoft's version. Or one player could make a definitive commitment that closes the debate.
Microsoft closed the debate on April 3, 2026. Agent Framework 1.0, their production-ready multi-agent orchestration framework for .NET and Python, ships with MCP as the standard tool integration layer and A2A as the agent networking layer. When the company that runs Azure, owns GitHub, and serves the majority of enterprise software developers commits to a protocol at 1.0, it's not a bet — it's an infrastructure decision.
## What Microsoft Agent Framework 1.0 Is
Agent Framework is Microsoft's open-source framework for building, orchestrating, and deploying AI agents. It's been in development since late 2025; the 1.0 release on April 3 marks the stable API commitment with a long-term support pledge.
The framework is built around a few core abstractions:
**Agent types**: Conversational agents (back-and-forth dialogue), task agents (goal-directed, tool-calling, single-task execution), orchestrators (agents that coordinate other agents). Each has its own state model and execution contract.
**Middleware pipeline**: Every agent turn passes through a configurable middleware chain — content safety filters, audit logging, compliance policies, custom business logic. This is where enterprise requirements live without polluting agent prompts.
**Graph-based workflow engine**: Compose agents and functions into deterministic workflows. Useful for processes where you need predictable execution order and auditability, not just "let the agent figure it out."
**Multi-model support**: First-party connectors for Microsoft Foundry, Azure OpenAI, OpenAI, Anthropic Claude, Amazon Bedrock, Google Gemini, and Ollama. Swap models without rewriting agent logic.
Deployment options include hosted managed services on Microsoft Foundry and Azure Durable Functions, with OpenTelemetry-powered observability baked in.
## MCP as the Resource Layer
The architectural decision that matters most: Agent Framework treats MCP as the standard mechanism for tool discovery and invocation. Agents resolve tools at runtime from any MCP-compliant server — they don't need to know in advance what tools exist, just how to speak MCP.
For an enterprise developer building with Agent Framework, this means the entire 6,400+ server MCP registry is immediately accessible. Database connectors, Jira, GitHub, Slack, internal APIs wrapped in MCP — any compliant server works without custom integration code. The framework handles the protocol; you declare which servers to connect.
This is materially different from the "bring your own tool integration" approach that most pre-MCP agent frameworks required. Building tools was undifferentiated infrastructure work that every team duplicated. MCP converts that into a solved problem — and Agent Framework 1.0 adopts that solution wholesale.
## A2A: The Cross-Framework Networking Layer
Alongside MCP, Agent Framework 1.0 ships with A2A (Agent-to-Agent) protocol support, enabling cross-runtime agent collaboration. An Agent Framework orchestrator can delegate work to agents running in other frameworks using structured, protocol-driven messaging — and receive results back without either side knowing about the other's internal implementation.
The architecture is cleanly layered: MCP handles the resource layer (what tools agents can invoke), A2A handles the networking layer (how agents communicate with each other across frameworks). A workflow can involve an Agent Framework planning agent, a LangGraph execution agent, and a Claude Code sub-agent — coordinated through A2A without any of them needing direct coupling.
This matters for large organizations that aren't betting on a single agent framework. Most enterprise engineering orgs will end up with multiple frameworks in production — by design, by team preference, or by acquisition. A2A is the specification that makes them interoperable. Agent Framework 1.0 shipping with A2A built in means Microsoft's developer ecosystem starts with interoperability as a default, not an afterthought.
## Why This Is a Protocol Inflection Point
The MCP story up to now has been impressive on numbers — 97 million installs, adoption by OpenAI, Replit, Block, Apollo, and the Linux Foundation taking over governance — but the ecosystem remained weighted toward Python developers, Claude Code users, and developer tooling contexts. A meaningful share of enterprise software is built on .NET. Banks, insurance companies, logistics platforms, government systems — the .NET ecosystem represents billions of lines of production code and millions of professional developers.
Agent Framework 1.0 brings that entire constituency into the MCP ecosystem in a single release. A .NET developer who hadn't thought about MCP before April 3 can now use it through the framework they were going to use anyway. The adoption curve for MCP in enterprise .NET applications just changed shape.
Compare this to what the alternative would have looked like: Microsoft ships their own protocol, fragmentation ensues, teams have to choose, interoperability erodes. Instead, Microsoft evaluated the landscape — the Linux Foundation governance, OpenAI's adoption, Anthropic's stewardship, 97M downloads — and concluded that building on MCP is the right call. That judgment from Microsoft carries significant weight with enterprise buyers.
## The Competitive Landscape Implication
For the frameworks that didn't make this bet early — or made it quietly without the enterprise distribution — the window is narrowing. Agent Framework 1.0 will be the path of least resistance for the vast majority of .NET enterprise agent deployments. Competing frameworks in the .NET space will have to answer "does this work with MCP and A2A" as a baseline question, and the answer is now table stakes, not differentiation.
For the Python AI development ecosystem, the pressure is different. Frameworks like LangGraph, CrewAI, and AutoGen need A2A support to remain interoperable with the significant volume of Agent Framework deployments that will exist in enterprise environments. A2A was published as an open spec; now it has the weight of Microsoft's production ecosystem behind it.
## What It Means for Teams Building Today
If you're building agent workflows for an enterprise .NET context, Agent Framework 1.0 is now the obvious starting point. The MCP tool layer is production-ready, the multi-model connectors cover every major provider, and Azure Foundry hosting means you don't have to build deployment infrastructure from scratch.
If you're already building with Claude Code or Anthropic's Managed Agents API, the A2A support means your agents can interoperate with Agent Framework agents in the same organizations. This is the scenario enterprise platform teams have been waiting for: AI agents from different vendors and frameworks that can actually work together on complex organizational workflows.
For teams evaluating which AI agent infrastructure to bet on, the most important signal from Agent Framework 1.0 isn't the feature set — it's the protocol decisions. MCP and A2A are now jointly endorsed by Anthropic, OpenAI, Google, and Microsoft. The protocol question is resolved. The work ahead is in the applications.
---
*Sources: [Microsoft Agent Framework 1.0 release post](https://devblogs.microsoft.com/agent-framework/microsoft-agent-framework-version-1-0/) · [Visual Studio Magazine coverage](https://visualstudiomagazine.com/articles/2026/04/06/microsoft-ships-production-ready-agent-framework-1-0-for-net-and-python.aspx) · [Azure blog introduction](https://azure.microsoft.com/en-us/blog/introducing-microsoft-agent-framework/) · [Microsoft Learn overview](https://learn.microsoft.com/en-us/agent-framework/overview/) · [GitHub repository](https://github.com/microsoft/agent-framework)*
---
# Claude Code /powerup and /insights: Fixing the 80% Problem
URL: https://sdd.sh/2026/04/claude-code-/powerup-and-/insights-fixing-the-80-problem/
Date: 2026-04-12
Updated: 2026-04-12
Tags: claude-code, anthropic, developer-experience, onboarding, productivity
Categories: AI Tools, Guides
Summary: Most developers use a fraction of what Claude Code can do. Two new commands shipped in v2.1.90 — /powerup and /insights — attack this problem from opposite ends: one teaches you what's possible, the other shows you where your actual workflow breaks down.
Claude Code has a discovery problem. The tool is remarkably deep — hooks, sub-agents, custom slash commands, CLAUDE.md memory, /ultraplan, skills, the full MCP ecosystem — but nothing in the default experience tells you any of it exists. You open the terminal, ask Claude to write some code, and it obliges. The other 80% of what the tool can do sits undiscovered.
Anthropic shipped two commands in v2.1.90 (April 1, 2026) that attack this problem from opposite ends. `/powerup` teaches you what's possible; `/insights` shows you where your existing workflow is leaking efficiency. Together they form something that looks a lot like Anthropic's acknowledgment that "just ship powerful features" isn't a complete product strategy.
## /powerup: Duolingo for Your Terminal
The premise is simple: 18 interactive, animated lessons built directly into the CLI, no browser required. You run `/powerup`, select a lesson with arrow keys, and Claude Code walks you through the feature hands-on.
The lessons follow a consistent four-part structure:
1. **Concept introduction** — what this feature is, in plain language
2. **Guided exercise** — you try it, in your terminal, in your actual project
3. **Feedback** — Claude Code responds to what you did
4. **Progression hook** — context for why the next lesson builds on this one
The full curriculum runs from fundamentals to intermediate automation:
**Beginner tier**: basic context management, `/clear` vs. `/compact`, the CLAUDE.md memory system, plan mode (Shift+Tab), model selection. These are the features most new users discover by accident after weeks of daily use — if at all.
**Intermediate tier**: the skills and custom commands system, hooks (`PreToolUse`, `PostToolUse`), sub-agent orchestration, MCP server configuration, `/rewind` and checkpointing.
The time investment per lesson is roughly 10 minutes. Working through the full curriculum in focused sessions would take under three hours — less time than most teams spend misconfiguring MCP servers by trial and error.
## Context-Aware Instruction
What elevates `/powerup` past a static tutorial is that it reads your project state before presenting each lesson. The CLAUDE.md module checks whether you already have one and adjusts accordingly — if you do, it analyzes your existing configuration and suggests improvements rather than walking you through creation from scratch. The context management lesson behaves differently in a 200-file monorepo than in a three-file script.
This matters because the gap between "I know this feature exists" and "I know how to apply it to my actual codebase" is where documentation typically fails. Generic examples don't transfer. Context-aware instruction does.
## /insights: Your Workflow Under a Microscope
While `/powerup` is forward-looking, `/insights` is retrospective. It analyzes your local Claude Code session history from the past 30 days and generates a full HTML report covering:
- **Recurring patterns** — what kinds of tasks you actually give Claude Code
- **Friction points** — where you repeatedly retry, correct, or restart
- **Tool usage distribution** — which Claude Code capabilities you use and which you never touch
- **Workflow inefficiencies** — sequences of actions that suggest a better-configured approach
The analysis runs entirely locally. Session logs live in `~/.claude/`; the HTML report is generated and saved there too. Nothing leaves your machine. For teams on enterprise plans with strict data governance requirements, this is the right answer to "can we analyze how developers use AI tools without shipping private data to a third party."
The feature that will get the most immediate use is **automatic CLAUDE.md rule generation**. `/insights` identifies instructions you've repeated multiple times across sessions — things you've typed to Claude Code again and again — and converts them into ready-to-paste CLAUDE.md configuration rules. If you've told Claude "don't add console.log statements" four times this month, `/insights` turns that into a permanent instruction.
## The 80% Problem, Stated Plainly
JetBrains' January 2026 developer survey found Claude Code at 18% work adoption among professional developers — up 6× year-over-year — with an industry-leading 91% customer satisfaction score. The tool is being adopted. The question is depth of adoption.
There's a common pattern in powerful developer tools: new users find the one workflow that solves their immediate problem and stop there. Git users who never learn `rebase -i`. vim users who stay in insert mode. Claude Code users who write code with it but never configure hooks, never build custom skills, never use multi-agent workflows for the tasks that genuinely benefit from them.
The cost of shallow adoption isn't zero. A developer who uses `/compact` correctly on long sessions gets substantially better results than one who doesn't. A team that codifies standards in CLAUDE.md gets consistent behavior across agents. An engineer who knows how to hand off to sub-agents can parallelize work that would otherwise be serial. The efficiency delta between a casual user and a configured power user of Claude Code is not marginal.
`/powerup` and `/insights` are Anthropic's admission that shipping features isn't enough — you have to guide people to them. That's a product decision, not just an engineering one.
## How to Use Both Effectively
The practical approach is to run `/powerup` first, working through the lessons in order over a few sessions, then run `/insights` after another week or two of use to see which features you've actually incorporated into your workflow. The second `/insights` run becomes a feedback loop: did the things you learned from `/powerup` actually change your behavior?
For teams, the auto-generated CLAUDE.md rules from `/insights` are worth extracting and codifying into a shared team configuration. Instructions that appear repeatedly across individual developers' session histories are candidates for the team-level `CLAUDE.md` — those are the implicit norms that should be made explicit.
```bash
# Learn what you're missing
/powerup
# After a week of use, analyze your patterns
/insights
```
Both commands are available in Claude Code v2.1.90 and later. If you're running an older version, `npm update -g @anthropic-ai/claude-code` will get you there.
The discovery problem isn't fully solved by two commands. But these are a real step toward the kind of in-product education that separates tools people use from tools people use well.
---
*Sources: [Claude Code v2.1.90 release notes](https://code.claude.com/docs/en/changelog) · [claudefa.st /powerup guide](https://claudefa.st/blog/guide/mechanics/claude-powerup) · [Claude Lab /powerup command guide](https://claudelab.net/en/articles/claude-code/claude-code-powerup-command-guide) · [Claude Lab /insights command guide](https://claudelab.net/en/articles/claude-code/claude-code-insights-command-guide) · [JetBrains developer survey Jan 2026](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/)*
---
# Claude Code Ultraplan: When 30 Minutes of Cloud Thinking Beats 5 Seconds of Local Guessing
URL: https://sdd.sh/2026/04/claude-code-ultraplan-when-30-minutes-of-cloud-thinking-beats-5-seconds-of-local-guessing/
Date: 2026-04-11
Updated: 2026-04-11
Tags: Claude Code, Ultraplan, Agentic Workflows, Planning, Anthropic
Categories: AI Tools, Agentic Workflows
Summary: Ultraplan hands your planning task to a dedicated cloud session running Opus 4.6 for up to 30 minutes — while your terminal stays free. Here's what it actually is, how the three modes differ, and when to reach for it.
The hardest part of any autonomous coding task isn't execution — it's the plan. A bad plan executed flawlessly still ships the wrong thing. And plans created under terminal pressure, in the same context window that's about to run the code, are almost always worse than plans built with time, depth, and distance.
Claude Code's new `/ultraplan` command is a direct attack on that problem. Instead of asking Claude to plan locally and then immediately execute, Ultraplan offloads the entire planning phase to Anthropic's cloud infrastructure — a dedicated Claude Code web session running Opus 4.6, isolated from your terminal, with up to 30 minutes of compute and a browser interface for review. Your terminal is free the whole time.
It's a simple idea with surprisingly deep implications for how you structure agentic workflows.
---
## What Ultraplan Actually Does
When you run `/ultraplan` (or include the word "ultraplan" anywhere in a prompt), Claude Code hands the planning task off to a remote session in Anthropic's **Cloud Container Runtime (CCR)**. That session:
- Runs **Opus 4.6** — the most capable model in the Claude family
- Has up to **30 minutes** of dedicated compute for deep analysis
- Operates independently of your local terminal, which polls for status every 3 seconds
- Produces a structured plan you review in your **browser**, not in the terminal
When the plan is ready, you open it in Claude Code on the web. From there you can comment on specific sections, request revisions to individual parts, and — when you're satisfied — choose where to execute it: send it back to your local terminal, to a different Claude Code session, or to a cloud environment entirely.
That browser review surface is underrated. In terminal plan mode, your feedback options are essentially "accept," "reject," or "describe what's wrong in prose." With Ultraplan's browser interface, you can annotate exact sections, ask targeted questions about specific steps, and iterate on individual parts of the plan without starting over.
There's also a third trigger path worth knowing: when Claude finishes a local plan and shows the approval dialog, you can choose *"No, refine with Ultraplan on Claude Code on the web"* to send the draft to the cloud for deeper elaboration. Local planning with cloud polish.
---
## The Three Modes
Ultraplan isn't one system. Based on documented behavior (and analysis of the system prompts Claude operates under), there are at least three variants — and which one activates appears to depend on how you phrase the request:
**Simple Plan**
The lightweight option. Essentially regular plan mode running on cloud hardware — faster than local planning, but not fundamentally different in depth. Good for tasks where you want the browser review surface without committing to 30 minutes of compute. The cloud execution environment also gives you a cleaner context window, isolated from whatever conversation history your local terminal is carrying.
**Visual Plan**
Same as Simple, plus explicit instructions for Claude to generate **Mermaid diagrams or ASCII visualizations** for structural changes. If your task involves data flow changes, service dependencies, or architectural restructuring, Visual mode produces a plan you can actually walk stakeholders through — dependency order shown graphically, not just described in paragraphs. Useful for anything where "before and after" matters.
**Deep Plan**
The full 30-minute commitment. Claude is instructed to explore the problem space exhaustively — multiple approaches considered, trade-offs surfaced, edge cases enumerated, implementation sequenced in detail. The output is substantially longer and more thorough than what local plan mode produces under typical conditions. This is the right mode for large-scale refactors, greenfield architecture decisions, or anything where getting the plan wrong is expensive.
Anthropic has confirmed the three modes exist; assignment between them currently appears to involve some A/B testing infrastructure, meaning you may not always get the same mode for the same phrasing. The official documentation describes the system as a "research preview" and the modal behavior is expected to stabilize as it matures.
---
## Requirements and Access
Ultraplan requires:
- **Claude Code v2.1.91 or later** (check with `claude --version`)
- A **Pro, Max, Team, or Enterprise** account — the feature is not available on the free tier
- A **connected GitHub repository** — Ultraplan's cloud session needs repository context to plan meaningfully
The feature is in research preview. Anthropic's framing suggests it will remain gated on higher-tier plans; the compute costs for 30 minutes of Opus 4.6 per planning session are non-trivial.
---
## When to Use It — and When Not To
Ultraplan is not a replacement for local planning. It's a tool for a specific category of tasks where the cost of a bad plan is high and the cost of 30 minutes of compute is low by comparison.
**Reach for Ultraplan when:**
- The task involves significant architectural change across multiple files or services
- You're making a decision that's hard to reverse (database schema changes, API breaking changes, major dependency upgrades)
- You want a structured plan you can share with a team or use in a PR description
- You need visual representations of dependencies or data flow
- Local planning has already produced a plan that feels incomplete or misses edge cases
**Skip Ultraplan when:**
- The task is bounded and well-understood — local plan mode is faster
- You're iterating rapidly and 30 minutes is a blocking delay
- You don't need the browser review surface (pure terminal workflows)
- The task doesn't require a GitHub repository (scripts, local utilities, exploration)
The most important judgment call: Ultraplan is best when you're front-loading the cost of thinking. If you've ever spent three hours debugging an implementation that solved the wrong problem because the initial plan was too shallow, Ultraplan is paying back that tax in advance.
---
## What This Tells You About Anthropic's Direction
Ultraplan is the third leg of Anthropic's "planning infrastructure" story, alongside the leaked ULTRAPLAN and KAIROS features described in earlier coverage and Claude Managed Agents' checkpointing and multi-agent coordination primitives.
The through-line: Anthropic is disaggregating the agent loop. Planning happens somewhere. Execution happens somewhere. Review happens somewhere. These don't have to be the same place, the same model, or the same time.
Ultraplan specifically makes planning a **cloud-native, asynchronous operation** — you don't have to babysit it, you don't have to keep your terminal open, and you don't have to review it under time pressure. The plan arrives when it's ready, you review it at your pace, and execution happens wherever makes sense.
That's a fundamentally different model from "AI completes your thought as you type." It's AI doing extended, deliberate work while you do something else — and then collaborating on the output before any code runs.
The JetBrains April 2026 developer survey puts Claude Code at 18% adoption at work (up from 3% a year ago, 6× growth), with a 91% CSAT score — the highest in the market. Ultraplan is the kind of feature that explains those satisfaction numbers: it's not adding UI chrome, it's removing a genuine limitation in how well autonomous agents plan before they act.
---
## The Bottom Line
Local plan mode is still the right default for most tasks. But for anything complex enough that the plan matters as much as the execution, Ultraplan gives you something that wasn't previously available: Opus 4.6, 30 minutes, no interruptions, and a browser interface for collaborative review.
Run `/ultraplan` before your next major refactor. Give it the full depth it needs. Review it properly in the browser. Then execute with confidence.
That's the point — not speed, but quality of judgment before the work starts.
---
**Sources:**
- [Plan in the cloud with ultraplan — Claude Code Docs](https://code.claude.com/docs/en/ultraplan)
- [Claude Code Ultraplan: Cloud Planning to Free Your Terminal — ClaudeFast](https://claudefa.st/blog/guide/mechanics/ultraplan)
- [Claude Code's Ultraplan Bridges the Gap Between Planning and Execution — DevOps.com](https://devops.com/claude-codes-ultraplan-bridges-the-gap-between-planning-and-execution/)
- [Inside Claude Code's New Ultra Plan: The Deep Plan Mode Explained — Geeky Gadgets](https://www.geeky-gadgets.com/ultra-plan-cloud-interface/)
- [Which AI Coding Tools Do Developers Actually Use at Work? — JetBrains Research Blog](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/)
- [Claude Code Ultraplan Launched: I Just Tested It — Medium](https://medium.com/@joe.njenga/claude-code-ultraplan-launched-i-just-tested-it-and-its-better-than-it-looks-21a628332e97)
---
# 81% vs. 46%: The AI Coding Benchmark That's Been Lying to You
URL: https://sdd.sh/2026/04/81-vs.-46-the-ai-coding-benchmark-thats-been-lying-to-you/
Date: 2026-04-11
Updated: 2026-04-11
Tags: SWE-bench, Benchmarks, AI Coding, Evaluation, Claude, GPT
Categories: AI Tools, Industry
Summary: SWE-bench Verified — the benchmark that put every frontier model above 80% — is contaminated. OpenAI stopped reporting it in February. Here's what actually happened, what SWE-bench Pro replaces it with, and why 46% is a more honest number than 81%.
For the past twelve months, AI coding leaderboards have told a confident story: frontier models now solve more than 80% of real-world software engineering tasks. Claude Opus 4.5 at 80.9%. GPT-5.3 somewhere close behind. Gemini in the same vicinity. The implication was clear — we are rapidly approaching the point where AI can handle essentially any coding task you throw at it.
That story is wrong.
On February 23, 2026, OpenAI announced it would stop reporting SWE-bench Verified scores. Not because OpenAI's models stopped performing well — they still show 81% on Verified. They stopped reporting it because the benchmark has been compromised. The real number, on a contamination-resistant benchmark, is closer to 46%.
The 35-point gap is not a rounding error. It's a warning about what happens when the AI industry measures itself with its own ruler.
---
## What SWE-bench Verified Is
SWE-bench was introduced by researchers at Princeton and Chicago in 2023 as a novel approach to evaluating AI coding ability: instead of toy problems or contrived exercises, it uses **real GitHub issues from real open-source repositories**. Each task is a concrete bug report or feature request from a production codebase. The AI agent must produce a patch that passes the repository's existing test suite.
The "Verified" variant — SWE-bench Verified — was introduced by OpenAI in mid-2024. It curated a subset of 500 Python-only tasks that had been manually validated by human contractors to ensure the problems were solvable and the test suites were meaningful. It became the standard leaderboard that everyone cited.
It was also, it turns out, increasingly unreliable.
---
## What Went Wrong
OpenAI's internal audit identified three distinct failure modes in SWE-bench Verified:
**Contamination.** Every frontier model tested — GPT-5.2, Claude Opus 4.5, Gemini 3 Flash — showed evidence of training data contamination. For certain tasks, models could reproduce the **verbatim gold patch** from the benchmark's answer key, or recall problem-specific details that should only be accessible from the repository's commit history. The benchmark's tasks came from public GitHub repositories, which means they were present in training corpora for models trained after 2023.
**Broken tests.** OpenAI found that nearly 60% of the problems that their models failed contained fundamentally flawed tests. Some tests were too narrow — they checked for a specific implementation detail not mentioned in the problem description, meaning a correct solution would fail. Others were too wide — they required extra features that were never specified. The benchmark was penalizing models for being wrong when, in many cases, the tests themselves were wrong.
**Saturation.** Even setting aside contamination and test quality, Verified's 500 Python-only tasks weren't representative of real production codebases — which are polyglot, messy, and poorly documented. The benchmark had become easier than the work it was supposed to measure.
---
## SWE-bench Pro: The Replacement
The alternative that OpenAI now recommends, and that the broader research community is migrating to, is **SWE-bench Pro** — a benchmark built from scratch to address every failure mode of Verified.
Key differences:
**Scale and diversity.** Pro has **1,865 tasks** across multiple programming languages — Python, JavaScript, TypeScript, Go, Rust, Java. No more Python monoculture. Real engineering organizations don't work in one language; the benchmark now reflects that.
**Contamination resistance by design.** SWE-bench Pro draws exclusively from repositories under **strong copyleft licenses** (GPL and similar). The legal framework around these licenses creates a practical barrier to their inclusion in proprietary training corpora. OpenAI's audit found contamination cases in Pro to be "significantly rarer and less egregious" than in Verified, and found no model capable of producing a verbatim gold patch.
**Private tasks.** Beyond the public set, SWE-bench Pro includes tasks sourced from **private proprietary codebases** — code that definitively was not in any training set. These tasks function as a gold standard contamination check.
**Standardized scaffolding.** Scale AI maintains a SEAL leaderboard with standardized evaluation scaffolding, so that score differences reflect model capability, not differences in how each lab sets up its evaluation pipeline.
---
## What the Real Numbers Look Like
The performance gap tells the story more clearly than any analysis:
| Model | SWE-bench Verified | SWE-bench Pro (SEAL) |
|---|---|---|
| Claude Opus 4.5 | 80.9% | 45.9% |
| GPT-5.3-Codex | ~81% | 56.8% |
| Best with search subagent (Opus 4.6 + WarpGrep v2) | — | 57.5% |
The same Claude Opus 4.5 that scores 80.9% on Verified scores 45.9% on Pro. The model didn't change. The benchmark did. That 35-point gap is the contamination premium — the extra credit AI systems have been giving themselves for tasks they'd already seen.
Even the Pro leaders aren't close to 80%. The best result on the public leaderboard, GPT-5.3-Codex at 56.8%, represents genuinely impressive performance on uncontaminated, multi-language, real-world engineering tasks. But 56% is a very different claim than 81%.
---
## What This Means for How You Evaluate AI Coding Tools
If you've been using SWE-bench Verified scores to make purchasing or adoption decisions about AI coding tools, you've been looking at an inflated metric. That's not a knock on any particular vendor — everyone was playing the same game with the same benchmark — but it does mean the capability picture was rosier than reality.
A few practical implications:
**Don't cite Verified scores going forward.** OpenAI has retired them. The research community is following. Any vendor still prominently featuring SWE-bench Verified scores on their homepage is either unaware of the contamination issue or hoping you are.
**Pro scores are the new floor, not the ceiling.** A model scoring 45-57% on SWE-bench Pro is genuinely capable — these are hard, real tasks on uncontaminated code. But the industry needs to recalibrate its language around what "capable" means. "Solves 80% of real engineering tasks" was always too strong a claim.
**Benchmark diversity matters.** SWE-bench is not the only coding eval that matters. Terminal-Bench 2.0, LiveCodeBench v6, and proprietary evals from labs like Scale AI each measure different dimensions. No single number tells you everything. Any vendor claiming supremacy based on one leaderboard deserves skepticism.
**The gap between benchmark performance and production utility remains large.** Stackademic's April 2026 survey found that 84% of developers now use AI coding tools daily — but only 29% trust the output in production. That trust gap isn't primarily about benchmark scores. It's about reliability on *your* codebase, *your* test suite, *your* edge cases. Benchmarks can't substitute for that.
---
## Why It Matters That OpenAI Flagged This
It's worth pausing on the institutional dimension here. OpenAI published a detailed post-mortem on *its own benchmark*, acknowledged contamination in *its own models*, and stopped reporting the metric that made those models look best. That's epistemically unusual in an industry not known for rigorous self-criticism.
The OpenAI research team's framing: *"The standard for frontier coding evals is changing with model maturity."* That's a diplomatic way of saying: the benchmark succeeded at what it was designed to do, then the models ate the benchmark, and now we need harder tests.
This is a healthy cycle — benchmarks get saturated, better benchmarks replace them — but it requires the industry to resist the temptation to camp on flattering numbers longer than is honest. The fact that OpenAI made the Pro migration publicly and transparently is the right move, and it sets a reasonable standard for others.
---
## The Practical Upshot
The best AI coding models in 2026 can autonomously resolve roughly **50-57% of hard, uncontaminated, multi-language real-world software engineering tasks** — without human guidance, using only the repository and the issue description.
That's genuinely extraordinary. Three years ago it was 0%. The progress is real.
But it's not 80%. It never was. The gap between 46% and 81% is the space where contaminated training data was doing work that the models couldn't do themselves. Now that we've subtracted that credit, we know where the models actually stand.
The right response isn't cynicism — it's calibration. Use SWE-bench Pro scores to compare models. Use your own production tasks to verify the comparison. And stop letting any vendor tell you the problem of AI coding capability is nearly solved.
It's not. It's just that progress has been real and the benchmarks have been generous.
---
**Sources:**
- [Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities — OpenAI](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
- [Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro — CodeSOTA](https://www.codesota.com/news/swe-bench-contamination-debate)
- [SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81% — MorphLLM](https://www.morphllm.com/swe-bench-pro)
- [SWE-Bench Pro Leaderboard — Scale AI SEAL](https://labs.scale.com/leaderboard/swe_bench_pro_public)
- [⚡️ The End of SWE-Bench Verified — Latent Space](https://www.latent.space/p/swe-bench-dead)
- [OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed — Blockchain.news](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests)
- [84% of Developers Use AI Coding Tools in April 2026 — Only 29% Trust What They Ship — Stackademic](https://blog.stackademic.com/84-of-developers-use-ai-coding-tools-in-april-2026-only-29-trust-what-they-ship-d0cb7ec9320a)
- [OpenAI Developers on X — announcement thread](https://x.com/OpenAIDevs/status/2026002219909427270)
---
# Cursor 3: Agent-First Branding, IDE-Last Architecture
URL: https://sdd.sh/2026/04/cursor-3-agent-first-branding-ide-last-architecture/
Date: 2026-04-10
Updated: 2026-04-10
Tags: cursor, agents, ide, agentic-workflows, ai-tools, claude-code
Categories: AI Tools, Agentic Workflows
Summary: Cursor 3 shipped a genuinely redesigned interface built around parallel agents. The Agents Window, Design Mode, /worktree, and /best-of-n are real features with real uses. But 'agent-first' describes the UI layer, not the architecture — and the distinction matters more than Cursor's marketing suggests.
Cursor 3 launched on April 2 as a full UI rebuild framed around a single premise: AI agents are now the primary unit of work, not chat completions. The old Composer pane is gone. In its place is an Agents Window designed for running fleets of parallel agents across environments. Design Mode lets you annotate browser UIs and send elements directly to agents. Two new commands — `/worktree` and `/best-of-n` — bring isolated execution and model comparison into the editor workflow.
This is a meaningful release. The features are real, the architecture has genuinely shifted, and Cursor at 3.0 is a different product than Cursor 2.x.
It's also still an IDE. That distinction is doing more work than Cursor's framing acknowledges.
## What Actually Shipped
### The Agents Window
The Agents Window replaces the Composer as Cursor's primary interface. It's a full-screen workspace for running and monitoring multiple agent tasks simultaneously — locally, in git worktrees, in Cursor's cloud, or on remote hosts via SSH. You can run agents in parallel, view them side-by-side or in a grid, and switch between the Agents Window and the traditional editor view as needed.
This is a real improvement over Composer. Composer ran one agent session at a time and was firmly embedded in the editor sidebar. The Agents Window treats multiple concurrent agents as the default, which is the right mental model for serious agentic work.
### Design Mode
Within the Agents Window, Design Mode connects the agent to a browser. You toggle it with `Cmd+Shift+D`, select areas with `Shift+drag`, and add elements to the agent's context with `Cmd+L`. The agent can then make changes and you iterate against the actual rendered UI rather than describing interface elements in prose.
For frontend work this is genuinely useful. The gap between "the button should be higher and more prominent" and pointing at the button is significant in practice.
### /worktree and /best-of-n
`/worktree` creates an isolated git worktree for a task, keeping experimental changes out of your main branch automatically. `/best-of-n` runs the same prompt against multiple models in parallel worktrees and shows results side-by-side for comparison.
`/best-of-n` is interesting because it externalizes a decision developers typically make implicitly: which model handles this kind of task better? Running the task against three models in parallel and comparing outputs is a more honest answer than intuition. It also compounds with Cursor's multi-model support — you can compare Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro on the same task and pick the best result.
### The Await Tool and Reliability Improvements
A quieter addition: an `Await` tool that lets agents wait on shell commands or specific output patterns like "Ready" or "Error." This matters for long-running tasks that involve starting servers, waiting for builds, or polling external systems. Agents that can block on real signals rather than sleeping for arbitrary durations are meaningfully more reliable.
## Where the Architecture Ceiling Shows
Cursor 3's framing is "agent-first." That's accurate for the interface layer. The Agents Window makes agents the primary surface. But agent-first interface and agent-first architecture are different things, and Cursor has the former.
### The IDE as Constraint
Every agent in Cursor 3 is orchestrated through a running Cursor application. The cloud agents handoff works, but it's a handoff *from Cursor* — you initiate from the IDE, monitor through the IDE, and the agent's output returns to the IDE. Remove Cursor from the picture and the workflow doesn't exist.
Claude Code doesn't have this constraint because it doesn't have an IDE. It runs in a terminal, attaches to whatever environment you're in, and operates autonomously on whatever infrastructure you point it at. You don't keep a Claude Code window open while an agent runs a four-hour task; you start the task and come back. There's no host application whose availability determines whether the agent continues.
This isn't a subtle difference. Cursor's architecture means:
- Agents are bounded by Cursor's session model
- Cloud offload is a feature of the Cursor product, not a property of the agent itself
- Long-running autonomous tasks have a dependency on Cursor's process lifecycle
- CI/CD and server-side automation require Cursor to be present or replaced with a headless mode that doesn't have the same capabilities
### Human-in-the-Loop as Default
Cursor's agent model treats human approval as a normal part of the execution path. The Agents Window makes it easier to oversee multiple agents, which implicitly assumes you'll be overseeing them. Claude Code's default is autonomy — you set guardrails and the agent runs until it's done or it needs to ask you something genuine.
Neither model is universally correct. For interactive design work with Design Mode, being in the loop is the point. For a long-running background task like "refactor this module, write tests, and open a PR," being in the loop is friction.
Cursor's architecture optimizes for the interactive case. That's a defensible product choice, but it's a choice — not a neutral implementation of "agents."
### /best-of-n Versus Actually Deciding
`/best-of-n` is a clever feature that exposes a real problem: if you don't know which model to use, you run them all and compare. This works. It also means you've turned model selection into a manual review task.
An architecture where the agent selects the appropriate model for each subtask — and routes accordingly — would be preferable. `/best-of-n` is a useful workaround while model routing is still an unsolved problem, but it's worth being clear that it's a workaround.
## What Cursor 3 Is Good For
None of this means Cursor 3 is a bad release. For developers who want to:
- Stay in an IDE workflow with AI deeply integrated
- Work on frontend tasks where Design Mode's visual targeting adds real value
- Run multiple exploratory agent tasks in parallel and compare outcomes with `/best-of-n`
- Operate in environments where human oversight of agent work is required or preferred
Cursor 3 is a genuine step forward. The Agents Window is better than Composer. Design Mode solves a real problem for UI-focused work. The worktree integration is cleaner than managing worktrees manually.
The $50B valuation Cursor commanded earlier this year reflects a real market: developers who want a great AI-enhanced IDE. That market exists. Cursor serves it well.
## The Paradigm Gap
The harder question is whether an AI-enhanced IDE is the destination or a waypoint.
Cursor 3's "agent-first" framing borrows the vocabulary of autonomous agentic development while delivering a product that still centers the developer as the primary actor. That's not a criticism of what Cursor 3 does — it's a description of what it is.
The shift that's actually happening in production AI development isn't about better IDE features. It's about agents that plan, implement, test, and iterate with decreasing amounts of human involvement. That shift is architectural: it requires agents that operate independently of a GUI, integrate natively with CI/CD, persist state across long time horizons, and recover from failures without human intervention.
Cursor is building toward that future through the lens of an IDE. Claude Code is building from the terminal up. The two approaches produce products that look similar at the feature-list level — parallel agents, tool execution, model selection — but differ fundamentally in where they put the human.
Cursor 3's best work is making the human-in-the-loop experience excellent. The question is how long that's the right optimization target.
---
*Sources: [Cursor 3 changelog](https://cursor.com/changelog/3-0) · [Cursor blog](https://cursor.com/blog/cursor-3) · [The Decoder](https://the-decoder.com/new-cursor-3-ditches-the-classic-ide-layout-for-an-agent-first-interface-built-around-parallel-ai-fleets/) · [SiliconANGLE](https://siliconangle.com/2026/04/02/cursor-refreshes-vibe-coding-platform-focus-ai-agents/) · [DEV Community](https://dev.to/liran_baba/cursor-3-shipped-parallel-agents-but-is-any-of-it-new-2dd1)*
---
# Claude Managed Agents: Anthropic Just Built the Agent Loop You Were Going to Write Anyway
URL: https://sdd.sh/2026/04/claude-managed-agents-anthropic-just-built-the-agent-loop-you-were-going-to-write-anyway/
Date: 2026-04-10
Updated: 2026-04-10
Tags: claude, managed-agents, agentic-workflows, anthropic, api
Categories: AI Tools, Agentic Workflows
Summary: Anthropic launched Claude Managed Agents on April 8 — a managed API that handles the agent loop, sandboxing, checkpointing, and tool orchestration you'd otherwise build yourself. Here's what it actually offers, how the pricing model works, and why it matters for teams shipping production agents.
Every team building a serious agentic workflow with Claude has written roughly the same thing: an agent loop. You manage the session state, wire up tool execution, handle errors and retries, implement checkpointing so a two-hour task doesn't restart from zero on a network blip, add tracing so you can figure out what went wrong in production, and bolt on sandboxing so Claude can't accidentally delete something important. It's a few thousand lines of plumbing that has nothing to do with your actual product.
Anthropic shipped [Claude Managed Agents](https://claude.com/blog/claude-managed-agents) on April 8 in public beta. The pitch is simple: they built that plumbing for you.
## What It Is
Claude Managed Agents is a suite of composable APIs for building and deploying cloud-hosted AI agents at scale. Instead of maintaining your own agent loop, you define what you want — the model, system prompt, tools, MCP servers, and guardrails — and Anthropic runs it on their infrastructure.
The architecture is built around four core concepts:
- **Agent** — the definition: model, system prompt, tools, MCP servers, skills
- **Environment** — a configured cloud container with pre-installed packages (Python, Node, Go, etc.) and network access rules
- **Session** — a running agent instance within an environment, executing a specific task
- **Events** — the messages exchanged between your app and the agent: user turns, tool results, status updates
You create an agent once and reference it by ID across as many sessions as you need. Sessions run asynchronously and stream responses back via server-sent events. You can send additional events mid-execution to steer the agent, or interrupt it entirely.
## Built-In Infrastructure You'd Otherwise Wire Yourself
What makes this more than a thin wrapper around the Messages API is what Anthropic handles for you in every session:
**Checkpointing.** Long-running sessions persist their state. A four-hour research task doesn't restart from scratch because a container hiccupped. The infrastructure handles recovery transparently.
**Prompt caching and compaction.** The harness applies Claude's prompt caching automatically and manages context compaction for long sessions — two optimizations that significantly reduce cost and latency on extended tasks but require careful engineering when you implement them yourself.
**Tool execution sandbox.** Claude can run bash commands, read and write files, search the web, and call MCP servers inside a secure container. You don't have to build the execution environment or worry about escape vectors.
**Credential and permission management.** Scoped permissions and identity management are built in. You define what the agent can access; the runtime enforces it.
**End-to-end tracing.** Every tool call, model turn, and event is logged and viewable in Claude Console. When something goes wrong in production, you have the full trace to debug against.
**Multi-agent coordination** (research preview). Agents can delegate work to other agents. This is early — request access required — but it's the architecture needed for complex orchestrated workflows where a planning agent fans out to specialist subagents.
## The Pricing Model
The cost structure is straightforward: standard Claude API token pricing plus **$0.08 per session-hour** of active runtime, measured in milliseconds. Idle time doesn't count. Web search, if your agent uses it, costs an extra $10 per 1,000 searches.
For context: a complex agent task that runs for 20 minutes of active computation costs $0.027 in session runtime before token costs. For most production workloads, token usage will dominate the bill by a large margin — the session fee is noise unless you're running thousands of long sessions daily.
All Managed Agents endpoints require the `managed-agents-2026-04-01` beta header. The SDK sets it automatically.
## Who's Already Shipping With It
Anthropic announced several early adopters at launch:
- **Notion** — custom agents embedded in their product
- **Rakuten** — enterprise agents at scale
- **Asana** — the AI Teammates feature that offloads project work to Claude agents
- **Vibecode** — AI-native app development workflows
- **Sentry** — automated debugging and patch generation
The Sentry use case is worth noting specifically. Debugging production errors, generating reproduction cases, writing and testing patches, opening PRs — that's a multi-hour, multi-tool workflow that requires exactly the kind of stateful, long-running infrastructure Managed Agents provides. Building that on the raw Messages API is non-trivial. With Managed Agents it's a few hundred lines of configuration.
Anthropic's internal testing showed up to 10 percentage points improvement in task success compared to standard prompting, with the largest gains on complex, multi-step problems. The gain isn't from the model — it's from the harness: better tool execution, automatic retries, and optimized context management that you'd have to implement yourself otherwise.
## What's Still in Research Preview
Three significant features require separate access:
**Outcomes / self-evaluation.** Claude iterates until it reaches defined success criteria rather than stopping after a single pass. This is the difference between "run until done" and "run until correct" — a meaningful distinction for quality-sensitive workflows.
**Multi-agent coordination.** Agents delegating to other agents. The obvious use case is a planning agent that fans out work to specialist subagents and aggregates results — exactly the architecture Anthropic's own research on [agent teams](/posts/claude-code-agent-teams-multi-agent-orchestration/) pointed toward.
**Memory.** Persistent memory across sessions. Currently, each session starts fresh; memory would allow agents to build up context across interactions over time.
These are the features that push Managed Agents from "hosted agent loop" to "autonomous agent platform." The fact that they're research preview means they're real and being used, not vaporware — but they're not ready for general production use yet.
## Messages API vs. Managed Agents
Anthropic's own comparison is honest about the tradeoff:
| | Messages API | Managed Agents |
|---|---|---|
| **Control** | Full | Configuration-level |
| **Infrastructure** | You build it | Anthropic provides it |
| **Best for** | Custom loops, fine-grained control | Long-running tasks, async work |
If you need to implement unusual agent architectures, deeply customize the loop, or have compliance requirements that preclude running agent state on Anthropic's infrastructure, the Messages API is still the right choice. For the majority of production agent use cases — tasks that run for minutes to hours, need tool access, and need to survive failures gracefully — Managed Agents removes the infrastructure work without constraining what you can build.
## Why This Matters
Every serious team building production agents with Claude was going to reinvent this wheel. Some already have. Anthropic's Managed Agents offering isn't just a convenience layer — it's a statement about where agent infrastructure is heading.
The model of "you write the loop, we provide the model" worked fine for inference. It starts to break down when the loop itself needs to be reliable at scale: stateful, recoverable, observable, and secure. That's real engineering, and it compounds with every new agent use case.
By absorbing that complexity into the platform, Anthropic shortens the path from "Claude can do this" to "we ship this to customers" from months to days. For the ecosystem, that means faster iteration and more production deployments. For Anthropic, it means deeper integration into the workflows that matter.
The research preview features — outcomes, multi-agent, memory — point at what comes next: agents that aren't just long-running but genuinely persistent, collaborative, and self-correcting. The plumbing is in place. The interesting question is what gets built on top of it.
**Access:** Claude Managed Agents is available in public beta to all Claude API accounts. Request access to research preview features (outcomes, multi-agent, memory) via the [Claude platform](https://claude.com/form/claude-managed-agents).
---
*Sources: [Claude Managed Agents launch post](https://claude.com/blog/claude-managed-agents) · [Official documentation](https://platform.claude.com/docs/en/managed-agents/overview) · [SiliconANGLE coverage](https://siliconangle.com/2026/04/08/anthropic-launches-claude-managed-agents-speed-ai-agent-development/) · [The New Stack](https://thenewstack.io/with-claude-managed-agents-anthropic-wants-to-run-your-ai-agents-for-you/) · [The Register](https://www.theregister.com/2026/04/09/anthropic_offers_to_host_ai/)*
---
# Meta's Muse Spark Is Closed Source. Open-Source AI Just Lost Its Last Major Patron.
URL: https://sdd.sh/2026/04/metas-muse-spark-is-closed-source.-open-source-ai-just-lost-its-last-major-patron./
Date: 2026-04-09
Updated: 2026-04-09
Tags: Meta, open-source AI, Muse Spark, Llama, AI models, closed-source
Categories: AI Tools, Industry
Summary: Meta Superintelligence Labs shipped Muse Spark — and made it closed-source. The company that framed open AI as a moral imperative just locked the door. Here's what that means for developers who built their stack on Llama.
For three years, Meta was the loudest voice for open-source AI. Every Llama release came with a manifesto: open models are safer, open models accelerate progress, open models democratize access. Mark Zuckerberg called closed-source AI a "mistake" in interviews and positioned Meta's open approach as a competitive moat and a moral stance simultaneously.
On April 8, 2026, Meta Superintelligence Labs shipped **Muse Spark** — its first major model under Alexandr Wang's leadership — and made it completely closed-source.
The company that evangelized openness just locked the door. And the developer community needs to update its assumptions accordingly.
## What Muse Spark Is
Muse Spark is the debut model from Meta Superintelligence Labs (MSL), the renamed AI division formed after Meta's $14 billion deal to bring in Alexandr Wang from Scale AI. According to Meta's announcement and reporting from CNBC and TechCrunch, the model is:
- **Closed-source** — no weights, no license, API-only access
- **Competitive but not leading** — benchmarks place it roughly at parity with Llama 4's best midsize models but below frontier performers like Claude Mythos Preview or GPT-5.4
- **More compute-efficient** — Meta claims it achieves comparable quality to previous models using "an order of magnitude less compute" due to a rebuilt training infrastructure
- **A platform shift** — MSL is positioning Muse Spark as the foundation for Meta's AI products (Ray-Ban glasses, Meta AI, Workplace), not primarily as a developer tool
Meta's stock rose approximately 9% on the announcement day, suggesting investors view the closed-source pivot positively — which is itself a signal about where the money thinks AI commercialization is heading.
A separate **Llama 5** is reportedly still in development, which may preserve some open-weight continuity. But Llama 5 is not Muse Spark, and the distinction matters.
## Why This Is a Bigger Deal Than It Looks
On the surface, one closed model from one lab isn't catastrophic. The open-source ecosystem doesn't collapse because Meta shipped a proprietary product.
But Meta wasn't just a participant in the open-weights space — it was the load-bearing wall. The entire argument for using open-source frontier models rested on Llama's existence. When someone said "we don't need to send code to OpenAI or Anthropic," they usually meant "we'll run Llama." When a startup said "we can self-host for compliance reasons," Llama was the answer. When academics built research infrastructure on open models, they built it on Llama's license terms.
Meta knew this. The open-source positioning was strategic: flood the market with free weights, build developer mindshare, undermine OpenAI's pricing power. It mostly worked. Llama 3 and Llama 4 became default answers to "which open model should I use?" across enterprise AI projects.
Now the strategic calculus has shifted. Wang's team at MSL appears to be optimizing for commercial AI revenue — products, APIs, enterprise deals — rather than ecosystem goodwill. A closed model can be monetized directly. Open weights cannot.
The message to developers is plain: the free ride had terms. Meta was never a charity. When the economics changed, so did the license.
## The Open-Weight Landscape After Muse Spark
So what's left for developers who need self-hostable, deployable-without-API-fees models?
**GLM-5** (Zhipu AI, 744B MoE, MIT license) is currently the strongest open-weight coding model available. Its [SWE-bench Pro performance](/posts/glm-5-1-open-source-beats-frontier-models-swe-bench-pro/) is the best among open-weight models, and the MIT license is genuinely permissive. But GLM-5 is a Chinese research lab model, which creates procurement complications for US government and regulated-industry deployments.
**Mistral** continues shipping open models but has never been a Llama-scale ecosystem player for coding tasks. Its Codestral variants are strong on narrow code completion but trail on multi-step agentic workflows.
**Google's Gemma 4** ([covered here](/posts/gemma-4-local-coding-agent-open-weight/)) is Apache 2.0, runs on consumer hardware, and performs well on code tasks — but at 31B active parameters, it's not a frontier competitor for serious agentic pipelines.
The honest assessment: if you needed an open-weight model with Llama-tier ecosystem support, Llama was irreplaceable. Nothing else has the same combination of size, license terms, tooling ecosystem, and community momentum. Meta's pivot to closed-source leaves a gap that no current model fully fills.
## What This Means for the Claude Code Stack
For engineers building agentic workflows with Claude Code, the Muse Spark announcement is less a threat than a confirmation of the right bet.
The case for Claude Code was never "use it because there's no good alternative." It was: Anthropic's models lead on real-world agentic tasks, the tool ecosystem (MCP, Claude Code's terminal-native model) is purpose-built for autonomous work, and the API pricing has been compressing toward parity with self-hosted costs anyway. The [1M context window going GA](/posts/claude-1m-context-ga-agentic-coding/) eliminated one of the last "we need to run our own model for large context" arguments.
But the open-weight story was always the fallback for compliance-driven enterprises who couldn't send code off-premises. That fallback just became shakier.
Anthropic's response to the enterprise access problem wasn't another model — it was [Claude Code on Amazon Bedrock](/posts/claude-code-channels-coding-from-anywhere/), which puts model inference on AWS-managed infrastructure with zero Anthropic operator access. For enterprises who need code to stay inside their AWS perimeter, that's now the cleaner answer than running a 700B-parameter open-weight model on their own GPU cluster.
## The Ideological Hangover
There's something worth sitting with here beyond the practical tooling implications.
Meta built enormous developer trust on the back of open-source. Engineers defended Llama against closed alternatives. Companies made architectural bets on open weights. Researchers published work depending on Llama access continuing. All of that was predicated on the implicit assumption that Meta's open-source commitment was durable.
It wasn't. It was a strategy, and strategies change.
Simon Willison, who has been one of the most thoughtful chroniclers of the open-source AI ecosystem, [noted](https://simonwillison.net/2026/Apr/8/muse-spark/) that this doesn't mean Meta will stop releasing open models entirely — Llama 5 may still materialize. But the symbolic damage is real: the company that styled itself as the principled alternative to closed AI just made the same choice as everyone else when the revenue math changed.
For CTOs making infrastructure decisions: this is a good moment to audit your dependency on any single lab's ideological commitments. Platform risk in AI isn't just about pricing or APIs going down. It's about the terms under which you built your stack being changed by a board decision in Menlo Park.
Open-source AI isn't dead. But its most prominent patron just walked out the door.
---
**Sources:**
- [Meta debuts Muse Spark model — TechCrunch](https://techcrunch.com/2026/04/08/meta-debuts-the-muse-spark-model-in-a-ground-up-overhaul-of-its-ai/)
- [Meta debuts first major AI model since $14B Alexandr Wang deal — CNBC](https://www.cnbc.com/2026/04/08/meta-debuts-first-major-ai-model-since-14-billion-deal-to-bring-in-alexandr-wang.html)
- [Meta's Muse Spark is closed source — The Next Web](https://thenextweb.com/news/meta-muse-spark-msl-first-model)
- [Simon Willison on Muse Spark](https://simonwillison.net/2026/Apr/8/muse-spark/)
---
# GitHub Copilot Finally Got Autopilot Mode. It's Still Not an Agent.
URL: https://sdd.sh/2026/04/github-copilot-finally-got-autopilot-mode.-its-still-not-an-agent./
Date: 2026-04-09
Updated: 2026-04-09
Tags: GitHub Copilot, IDE, agentic coding, Claude Code, VS Code, MCP, autonomous AI
Categories: AI Tools, Agentic Workflows
Summary: GitHub Copilot's April 8 VS Code update ships Autopilot Mode, nested subagents, and MCP sandboxing. These are real improvements. They're also a demonstration of why bolting autonomy onto an IDE produces something fundamentally different from a real agent.
GitHub shipped a genuinely significant Copilot update on April 8. The VS Code March Releases changelog lists features that would have been remarkable eighteen months ago: Autopilot Mode, nested subagents, MCP sandboxing, cross-platform MCP bridging. On paper, this looks like GitHub closing the gap with Claude Code.
It isn't. And understanding why matters if you're making architectural decisions about which AI coding tool to build workflows around.
## What Copilot Actually Shipped
Let's be precise about what's in the update, because the features are real and worth understanding on their own terms:
**Autopilot Mode (preview):** Copilot agents can now approve their own actions without requiring human confirmation at each step. They'll also retry automatically on errors. This is the feature that most directly responds to the "Claude Code just runs, Copilot keeps asking for permission" criticism.
**Nested Subagents:** Subagents can now invoke other subagents. A top-level agent decomposing a complex task can delegate subtasks to specialized agents and aggregate results. This mirrors the multi-agent architectures that [Claude Code Agent Teams](/posts/claude-code-agent-teams-multi-agent-orchestration/) introduced earlier this year.
**MCP Sandbox:** Local MCP servers now run in OS-level sandboxes on macOS and Linux, reducing the blast radius of a malicious or misbehaving server. Given the [CLAUDE.md CVE we covered last week](/posts/claude-code-cve-2026-claudemd-supply-chain-attack/), sandboxing MCP tool execution is exactly the right security posture.
**Cross-platform MCP:** MCP server configurations from VS Code now bridge to Copilot CLI and Claude agent sessions. This is interoperability progress — a configured MCP server doesn't need to be re-declared for each surface.
**Media in Chat:** Screenshots and video attachments in agent conversations. Useful for debugging UI bugs without describing them in words.
These are all genuine product improvements. The Autopilot Mode in particular required real engineering — the retry-on-error loop with automatic action approval is the kind of thing that separates a "you can ask it things" tool from a "it can do things" tool.
So why isn't this enough?
## The Architecture Problem
Autopilot Mode makes Copilot more autonomous within VS Code. That sentence contains the limitation.
VS Code is a GUI application. It runs on your desktop. Its agent loop is mediated by an IDE process that manages windows, buffers, tabs, and UI state. When Copilot's agent works on a task, it operates inside this UI context — which means it needs VS Code running, it inherits VS Code's resource model, and its "autonomous" work is bounded by what VS Code's extension API exposes.
Claude Code has no IDE. It runs in a terminal. It has direct access to the filesystem, shell, and any tool you configure via MCP. Its execution loop doesn't go through a UI framework. When it works autonomously on a multi-hour task — the kind of [8-hour autonomous sessions that GLM-5.1 demonstrated](/posts/glm-5-1-open-source-beats-frontier-models-swe-bench-pro/) or the [15-agent team architectures in Claude Code](/posts/claude-code-agent-teams-multi-agent-orchestration/) — it's not keeping a GUI application alive for the duration.
This isn't pedantry. It has concrete implications:
**Parallel workloads.** Running [ten Claude Code instances in parallel](/posts/parallel-ai-agents-the-tools-that-let-you-run-ten-claudes-at-once/) means opening ten terminal sessions or using an orchestrator. Running ten Copilot agents in parallel means... running VS Code ten times? The headless model scales horizontally in ways the IDE-bound model fundamentally cannot.
**CI/CD integration.** Claude Code runs in GitHub Actions, in Docker containers, in automated pipelines. It doesn't require a display server. Copilot's agent is tied to an interactive VS Code session — you can't kick it off from a CI pipeline and walk away. (The Copilot CLI is separate and has its own, more limited agent capabilities.)
**Context persistence.** When VS Code closes, the agent's working context disappears. Headless agents can checkpoint state, hand off between sessions, and be orchestrated across time. The [KAIROS proactive daemon and ULTRAPLAN](/posts/what-is-spec-driven-development/) patterns in Claude Code's roadmap are predicated on agents that can sleep, wake, and continue — not on keeping an IDE open.
**Tool access.** MCP sandboxing is a security improvement, but it also reflects that Copilot's tool access model is defensive-by-necessity because the IDE's process model creates a larger attack surface. Terminal-native agents can take a different security posture — explicit permissions, explicit tool declarations, no ambient GUI state to exploit.
## Autopilot Mode as Symptom
The existence of Autopilot Mode is revealing. It was built to address the friction of Copilot constantly asking "should I do this?" — which users experienced as annoying, and which the product team correctly identified as a problem.
But that friction exists because Copilot was designed as an **assistant** first. Its default state is "ask the human." Autopilot is a mode you opt into to temporarily suppress that default. The architecture assumes a human is present, watching, and available to re-engage.
Claude Code's default state is different. The assumption is that you gave it a task and it should complete it. Human confirmation is opt-in, not opt-out. The [Auto Mode announcement](/posts/claude-code-auto-mode-anthropic-hands-ai-more-control/) earlier this year was about pushing that default further toward autonomy with an explicit safety layer — not about bolting autonomy onto a tool designed for confirmation-heavy workflows.
This is the difference between an agent with guardrails and a human-in-the-loop tool with an express lane. Copilot's nested subagents are a sophisticated express lane. Claude Code's agent architecture starts from the other direction.
## What GitHub Got Right
To be fair: the MCP Sandbox work is the right call. OS-level isolation for tool execution is correct security engineering, and GitHub shipping it now — before a major exploit forces the issue — is responsible product development. Other vendors should follow.
The cross-platform MCP bridging is also genuinely useful. Reducing configuration duplication across surfaces (IDE, CLI, agent sessions) is real developer experience work that most teams will appreciate.
And Autopilot Mode will meaningfully improve the Copilot experience for the large population of developers who want AI assistance within their existing VS Code workflow but don't want to be interrupted every thirty seconds. That's a valid use case. Not every coding task needs an autonomous headless agent — some people want suggestions and light automation with a human clearly in the loop.
The problem is positioning. If GitHub ships Autopilot Mode and the press coverage says "Copilot is now an autonomous agent," that's a category error with real consequences. Engineers will build workflows on an autonomy guarantee that the architecture cannot actually deliver.
## The Strategic Picture
Microsoft has the most to lose from the agentic shift and the most at stake in slowing it down. VS Code is a dominant developer tool precisely because the IDE paradigm has been central to software development for twenty-five years. If the future of coding is headless agents running in terminals and CI pipelines — which is where the evidence points — then VS Code becomes infrastructure for human review, not the environment where work happens.
Copilot's roadmap is trying to ride both horses: keep VS Code central while also supporting increasingly autonomous operation. The VS Code March Releases show this tension clearly. Autopilot Mode, nested subagents, and MCP sandboxing are genuinely good features. They're also features designed to make the IDE paradigm competitive for a few more years.
It's the right business move. It's probably not the right architecture for the long term.
Claude Code's terminal-native model and GitHub Copilot's IDE-native model are not converging to the same place. They're optimizing for different points on the autonomy spectrum. Knowing which point you need is the only question that matters when choosing between them.
---
**Sources:**
- [GitHub Copilot in VS Code — March Releases changelog (April 8, 2026)](https://github.blog/changelog/2026-04-08-github-copilot-in-visual-studio-code-march-releases/)
- [Claude Code Agent Teams: multi-agent orchestration](/posts/claude-code-agent-teams-multi-agent-orchestration/)
- [Claude Code Auto Mode](/posts/claude-code-auto-mode-anthropic-hands-ai-more-control/)
- [CVE-2026-21852: CLAUDE.md supply-chain attack](/posts/claude-code-cve-2026-claudemd-supply-chain-attack/)
---
# GLM-5.1: The Open-Source Model That Just Beat Everyone on SWE-bench Pro
URL: https://sdd.sh/2026/04/glm-5.1-the-open-source-model-that-just-beat-everyone-on-swe-bench-pro/
Date: 2026-04-08
Updated: 2026-04-08
Tags: open-source, glm, benchmarks, agentic-coding, swe-bench, ai-models
Categories: AI Tools, Industry
Summary: Z.AI released GLM-5.1 today — a 754B open-weight model under MIT license that scored 58.4% on SWE-bench Pro, beating GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. Its headline demo: an 8-hour autonomous session that built a complete Linux desktop environment across 655 iterations. The closed-model monopoly on frontier coding capability just got its first serious challenge.
Today, Z.AI shipped GLM-5.1. If you follow coding benchmarks, you'll want to pay attention: this is the first open-weight model to beat the closed frontier on SWE-bench Pro.
The numbers: **58.4% on SWE-bench Pro**, a harder derivative of SWE-bench Verified that uses problems from post-training-cutoff repositories to reduce contamination risk. That's above GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. The weights are fully open under MIT license. And the headline capability demo — an autonomous 8-hour session that built a functional Linux desktop environment from scratch — is the kind of thing that changes how you think about what local models can do.
## What GLM-5.1 Is
GLM-5.1 is a 754-billion-parameter mixture-of-experts model developed by Z.AI, the commercial arm of the Tsinghua KEG research lab (makers of the earlier ChatGLM series). The design is clearly optimized for agentic, long-horizon tasks:
- **200K token context window** with coherent utilization across the full range
- **128K output tokens** — matching Claude Opus 4.6's recent expansion
- **MIT license** — commercially usable, no restrictions on deployment or fine-tuning
- **Designed for extended autonomous sessions**, with training that explicitly targeted multi-hour task completion
The 8-hour demo is worth dwelling on. In a benchmarked autonomous session, GLM-5.1 started with a blank environment and produced a working Linux desktop: installed packages, wrote configuration files, wired together a window manager, handled dependency conflicts, debugged failures mid-session, and validated the result — all without human intervention, across 655 distinct action steps. The session ran for 8 hours of wall-clock time.
For context: Claude Opus 4.6's task horizon is documented at 14.5 hours in multi-agent configurations. GLM-5.1 is claiming competitive territory on single-agent sustained execution.
## The Benchmark in Context
SWE-bench Pro is worth understanding before reading too much into 58.4%.
The standard SWE-bench Verified leaderboard has become increasingly suspect as training datasets have expanded — models have had opportunities to see the problems. SWE-bench Pro addresses this by sourcing issues from repositories that were created or substantially modified after the models' training cutoffs. The contamination risk is lower. The scores are also lower: where Opus 4.6 sits at ~80% on Verified, it drops significantly on Pro. GLM-5.1's 58.4% on Pro is a stronger signal than an 80%+ score on Verified.
That said, benchmark performance on SWE-bench Pro isn't the same as real-world coding capability. The tasks are GitHub issue resolutions — a specific, narrow slice of software engineering. The 8-hour Linux desktop demo is arguably more informative about sustained autonomous capability than any benchmark number.
## Why This Matters: The Open-Weight Frontier
Until today, the conversation about frontier coding models was a conversation about closed APIs. You wanted the best coding capability, you called Anthropic's API, OpenAI's API, or Google's API. You paid per token, accepted their terms, and sent your code to their servers.
GLM-5.1 changes that calculus. 754B is a large model — you're not running this on a MacBook — but MIT license means you can:
- Deploy it in an air-gapped environment (defense, finance, healthcare with strict data residency requirements)
- Fine-tune it on proprietary codebases without data leaving your infrastructure
- Build commercial products on it without per-token API costs or vendor lock-in
- Run it in regions where the major US-based AI APIs are unavailable or restricted
The economics of running a 754B model are not trivial. You need serious GPU hardware. But for organizations that have that hardware — or are already renting it for other purposes — the total cost of ownership calculation shifts meaningfully.
## What Closed-Model Vendors Should Be Worried About
The pattern here is familiar from other infrastructure categories. Closed vendors dominate early, open-weight alternatives emerge, and eventually the open-weight option reaches "good enough" for most use cases. The closed vendors retain a premium tier but lose the low-to-middle of the market.
For AI coding models, "good enough" is a high bar — you need the model to handle real-world engineering complexity, not just toy examples. GLM-5.1's SWE-bench Pro score and 8-hour autonomous demo suggest it's cleared that bar, at least for a meaningful range of tasks.
The risk for Anthropic, OpenAI, and Google isn't immediate. Opus 4.6's 80.8% on SWE-bench Verified is still nominally above GLM-5.1's 58.4% on Pro (though the benchmarks aren't directly comparable). The integration work in Claude Code — the tooling, the skills system, the MCP ecosystem, the computer use capability — isn't something you replicate by downloading weights. And the closed models are still pulling ahead on reasoning and instruction-following in real-world evals.
But the gap is narrowing. And on the specific axis of "autonomous multi-hour task completion with open weights," GLM-5.1 just planted a flag at the frontier.
## Implications for Agentic Coding Workflows
For practitioners building agentic coding systems, GLM-5.1 opens some genuinely new options.
**Private deployment agents**: If you're running a Claude Code-style workflow but need it to operate entirely on-premise — a common requirement in regulated industries — GLM-5.1 is now a credible foundation. It's not Claude Code (which is a full tooling stack, not just a model), but the model capability is there to build on.
**Multi-agent cost economics**: In a 15-agent team configuration like Claude Code Agent Teams, the per-token cost matters. Running GLM-5.1 on your own hardware eliminates the token billing entirely, which changes what's economically viable for large-scale agentic pipelines.
**Fine-tuning for specialized domains**: The MIT license means you can fine-tune GLM-5.1 on your organization's codebase, documentation, and patterns. Closed models don't offer this. For large engineering organizations with substantial proprietary code, a fine-tuned open model may outperform the frontier on their specific domain even if it underperforms on generic benchmarks.
**Competitive pressure on frontier pricing**: The existence of a credible open-weight alternative at frontier performance levels gives every organization more leverage in their API pricing negotiations. Even if you stay on Claude or GPT, GLM-5.1's existence as an alternative constrains pricing.
## The Caveats
A few honest caveats before treating this as a wholesale alternative to closed models.
**Hardware requirements**: 754B parameters requires substantial GPU infrastructure. Inference at reasonable latency needs multiple high-end GPUs. This isn't a laptop model or a small-team option without significant investment.
**Tooling ecosystem**: GLM-5.1 is a model, not a platform. Claude Code's value isn't just the underlying model — it's the skill system, MCP integrations, computer use capability, hooks architecture, and years of tooling built on top. GLM-5.1 starts from scratch on that layer.
**Long-horizon reliability**: The 8-hour demo is impressive. But published demos are optimized. Independent, reproducible long-horizon benchmarks on GLM-5.1 haven't been published yet. SWE-bench Pro is a useful proxy but doesn't capture all dimensions of real-world autonomous reliability.
**Support and iteration pace**: Anthropic ships weekly Claude Code updates. Z.AI's release cadence at this scale is unknown. Frontier model development is resource-intensive; sustaining it requires either substantial commercial revenue or continued research investment.
## The Bigger Picture
GLM-5.1 is the first credible evidence that the open-weight ecosystem is reaching frontier coding capability. It won't replace Claude Code for developers who need the full platform. But it signals that the moat around closed frontier models is narrower than it was six months ago.
For the industry, the more interesting question is what comes next. If Z.AI's 754B model can beat GPT-5.4 on SWE-bench Pro today, what does the open-weight landscape look like when Meta's Llama 5 or Mistral's next release arrives? The trajectory suggests that "open-source but frontier-capable" is not an oxymoron any longer.
---
*Sources: [VentureBeat: GLM-5.1 launch](https://venturebeat.com/technology/ai-joins-the-8-hour-work-day-as-glm-ships-5-1-open-source-llm-beating-opus-4) · [MarkTechPost](https://www.marktechpost.com/2026/04/08/z-ai-introduces-glm-5-1-an-open-weight-754b-agentic-model-that-achieves-sota-on-swe-bench-pro-and-sustains-8-hour-autonomous-execution/) · [Dataconomy](https://dataconomy.com/2026/04/08/z-ais-glm-5-1-tops-swe-bench-pro-beating-major-ai-rivals/)*
---
# Claude Mythos Goes Official: Project Glasswing and the Zero-Day Reckoning
URL: https://sdd.sh/2026/04/claude-mythos-goes-official-project-glasswing-and-the-zero-day-reckoning/
Date: 2026-04-08
Updated: 2026-04-08
Tags: claude, anthropic, mythos, security, cybersecurity, project-glasswing, ai-models
Categories: AI Tools, Industry
Summary: Anthropic officially unveiled Claude Mythos Preview on April 7, confirming what the March leak hinted at: a model that autonomously found thousands of zero-days across every major OS and browser. Their response — Project Glasswing — grants restricted access to a select group of tech giants to use Mythos as a defensive weapon. This is the most consequential 'too dangerous to release' moment in AI history.
Last March, a CMS misconfiguration gave the world an accidental glimpse of Claude Mythos. On April 7, Anthropic made it official — and the full picture is more striking than the leak suggested.
The announcement came via a technical capability assessment published at `red.anthropic.com`. It confirmed that Mythos had autonomously discovered **thousands of high-severity zero-day vulnerabilities** across every major operating system and web browser, including a 27-year-old bug in OpenBSD and a 16-year-old flaw in FFmpeg. Anthropic's own characterization: "unprecedented offensive cybersecurity capability." Their response to that characterization is unlike anything the AI industry has done before.
## What Mythos Found
The assessment details are striking not just in quantity but in depth. These aren't shallow fuzzer hits or known CVE variants. The model's autonomous vulnerability research produced:
- Novel exploitation chains across browser rendering engines (Blink, WebKit, Gecko)
- Kernel-level privilege escalation paths in Windows, macOS, and Linux
- Parser vulnerabilities in widely deployed compression and media libraries
- Remote code execution paths in SSH implementations used by millions of servers
The 27-year-old OpenBSD bug is particularly noteworthy. That codebase has been among the most security-audited open-source projects in existence for three decades. Human security researchers hadn't found it. Mythos did, autonomously, as part of a broader sweep.
Anthropic ran these findings through their standard responsible disclosure process — the vendors listed above have been notified — but the implication is clear: **an AI model can now do the work of an elite offensive security team, at scale, continuously, without human direction.**
## Project Glasswing: The Controlled Release
Rather than a standard commercial launch, Anthropic announced **Project Glasswing** — a restricted access program for infrastructure defenders. The initial cohort: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, Microsoft, and NVIDIA.
The framing is intentional. Each of these organizations operates infrastructure that, if compromised, could affect hundreds of millions of people. The idea is to give them Mythos access specifically to find and patch vulnerabilities in their own systems before adversaries do — essentially turning Mythos's offensive capability into a defensive tool under controlled conditions.
This is not a general API rollout. There are no plans to make Mythos available through Claude.ai, the standard API, or Claude Code subscriptions. Anthropic explicitly stated the model is not appropriate for general availability "at this time."
What "at this time" means is left deliberately vague.
## The March Leak, Revisited
[The March 30 leak article](/posts/claude-mythos-leaked-model-step-change-cybersecurity/) documented the CMS exposure: an unpublished draft describing a model internally codenamed Capybara, positioned above Opus 4.6 in the tier stack, with "dramatically higher scores" on coding and cybersecurity benchmarks.
The official announcement tracks closely with that leak. A few updates:
- The pricing tier question remains unanswered. Anthropic has not announced Mythos pricing; the Project Glasswing program appears to be a bilateral arrangement, not a commercial product
- The cybersecurity risk characterization from the draft — "currently far ahead of any other AI model in cyber capabilities" — was confirmed and expanded in the official assessment
- The coding benchmark claims have not been independently released yet. The assessment focuses on security capability; performance on SWE-bench or Terminal-Bench has not been published
The gap between the leak's coding claims and the official release is worth noting. The March draft described "dramatically higher scores" on coding. The April announcement is entirely framed around security capability. Either the coding story is being held back, or the security findings were significant enough to dominate the narrative.
## What This Means for the AI Coding Landscape
For developers thinking about agentic workflows, Mythos raises two questions that are easy to conflate but shouldn't be.
**Question 1: When does Mythos become available for coding?**
The honest answer is: not soon, and possibly not in the current form. The Project Glasswing framing suggests Anthropic sees Mythos as a dual-use capability that requires guardrails before broad deployment. That's not necessarily a permanent state — Anthropic's track record is to gradually expand access as safety work matures — but it's not a Q2 2026 Claude Code update.
**Question 2: What does Mythos capability signal about the trajectory of agentic coding models?**
This is the more interesting question. If Mythos can autonomously produce novel, high-quality security research across a vast attack surface, that same capability architecture almost certainly produces qualitatively better software engineering output than Opus 4.6. The zero-day work isn't a separate skill; it's the product of deep code comprehension, long-horizon reasoning, and the ability to maintain coherent analysis across large codebases.
Opus 4.6 already handles 14.5-hour task horizons and runs 15-agent teams. A model that can hold a 27-year-old OpenBSD bug in context while simultaneously mapping the broader attack surface is doing something cognitively different from — and more capable than — current frontier models in agentic roles.
The coding benchmarks will come. When they do, expect the gap over current models to be significant.
## The "Too Dangerous to Release" Threshold
Anthropic is the first major AI lab to publicly decline to release a model on capability grounds. Meta publishes Llama weights. Mistral publishes Mixtral. Google has open-weight Gemma. OpenAI has its commercial frontier, but hasn't withheld a model it's built with a public explanation tied to offense capability.
This is new territory. And Anthropic's decision to confirm the capability rather than quietly suppress it is notable — it's consistent with their stated approach to transparency around risk, and it creates a de facto disclosure norm that other labs will need to respond to.
The Project Glasswing framing is also instructive. Rather than treating Mythos's capability as purely a liability, Anthropic is converting it into a strategic asset: using the model to harden the infrastructure that the broader internet runs on. If the initiative produces meaningful vulnerability discoveries and patches at the Glasswing partners, it could become the template for how frontier AI gets deployed when the dual-use calculus is too sharp for open access.
## What Comes Next
The responsible disclosure pipeline from Mythos's initial sweep will take months to clear. Hundreds of vulnerabilities across major OS and browser vendors require coordinated patch development, testing, and staged rollout. Expect a stream of CVEs attributed to "AI-assisted security research" over the next 6-12 months without the underlying model being named.
For the coding world, the signal is this: the next tier of AI capability is already built. The question is how the industry navigates deployment. Anthropic's answer, for now, is "very carefully, with infrastructure defenders first."
---
*Sources: [Anthropic technical capability assessment (red.anthropic.com)](https://red.anthropic.com/2026/mythos-preview/) · [The Register](https://www.theregister.com/2026/04/07/anthropic_all_your_zerodays_are_belong_to_us/) · [Fortune: Project Glasswing](https://fortune.com/2026/04/07/anthropic-claude-mythos-model-project-glasswing-cybersecurity/) · [Tom's Hardware](https://www.tomshardware.com/tech-industry/artificial-intelligence/anthropics-latest-ai-model-identifies-thousands-of-zero-day-vulnerabilities-in-every-major-operating-system-and-every-major-web-browser-claude-mythos-preview-sparks-race-to-fix-critical-bugs-some-unpatched-for-decades)*
---
# The CLAUDE.md Trap: How a New Supply-Chain Attack Targets Agentic Developers
URL: https://sdd.sh/2026/04/the-claude.md-trap-how-a-new-supply-chain-attack-targets-agentic-developers/
Date: 2026-04-07
Updated: 2026-04-07
Tags: Claude Code, Security, CVE, Supply Chain, Agentic Workflows
Categories: AI Tools, Guides
Summary: A patched vulnerability in Claude Code (CVE-2026-21852) reveals an entirely new attack surface: poisoned project config files that silently bypass your deny rules and exfiltrate credentials. Here's what happened, how the exploit works, and what it means for agentic security.
On April 6, 2026, Anthropic shipped Claude Code v2.1.90 to patch a critical command-parser vulnerability — CVE-2026-21852. The bug itself is subtle: a hard-coded 50-subcommand cap in the deny-rule parser silently discarded any rule check beyond the 50th entry. Attackers who knew the cap could craft a malicious `CLAUDE.md` project config that buried a payload just past the invisible ceiling, and Claude Code would execute it without complaint.
The patch is out. Update immediately. But the story doesn't end there, because the attack surface this vulnerability exposed isn't going away when you apply the patch. It's structural to how agentic coding tools work — and every developer running autonomous agents in unfamiliar codebases needs to understand it.
## What the Vulnerability Actually Did
Claude Code respects a hierarchy of configuration files. At the top sits your user-level config, followed by workspace-level settings, and finally `CLAUDE.md` files in individual project directories. These project-level files are enormously useful — they let you embed context, coding conventions, and tool permissions directly in the repo, so your agent picks them up automatically when it opens the project.
The parser that enforces deny rules had a hard-coded cap: after processing 50 subcommands in a single config block, it stopped checking and silently fell through to `ask` mode (or in some versions, `allow`). The cap was never documented. It was never surfaced in logs. From the developer's perspective, their deny rules appeared to be active. They weren't.
Check Point Research and Adversa AI independently described a practical attack chain:
1. An attacker publishes a `CLAUDE.md` in a public repository — something innocuous-looking, like a well-maintained open-source tool or a popular starter template.
2. The file contains 50 legitimate-looking build instructions, linting rules, or tool configurations.
3. The 51st entry is the payload: a shell command that exfiltrates SSH keys, cloud credentials, or API tokens to an attacker-controlled endpoint.
4. A developer clones the repo, opens it in Claude Code, and runs an automated task. The agent reads the config, processes the first 50 entries (all benign), and then executes the 51st without any deny-rule check.
5. Credentials leave the machine before the developer sees anything suspicious.
The related CVE-2026-33068 documents a separate but similar bypass via the Workspace Trust Dialog — repository settings that could override trust decisions at the workspace level, letting a malicious repo elevate its own trust before the user reviewed it.
InfoWorld also flagged that some attack surfaces from an earlier fix (CVE-2025-59536) were not fully closed by that patch, meaning this class of vulnerability has been a persistent weak point in Claude Code's security model, not a one-off.
## The CLAUDE.md Attack Surface Is Genuinely New
This vulnerability highlights something important: **agentic coding tools have introduced a config-file attack surface that simply did not exist before them**.
Traditional static analysis tools, linters, or even IDEs read config files, but they don't execute arbitrary shell commands based on them. An agent does. When you point Claude Code at a directory, it reads `CLAUDE.md` and treats its contents as trusted instructions. That's enormously powerful for legitimate use — you can embed build context, specify allowed tools, set coding standards. But it also means a malicious `CLAUDE.md` is a potential remote code execution vector disguised as documentation.
Compare this to the classic supply-chain attack via `package.json` postinstall scripts. That threat model is well-understood: developers know that running `npm install` in an untrusted repo can execute arbitrary code, and tooling has been built to surface that risk. The `CLAUDE.md` threat model is new, and developer instincts haven't caught up yet.
The attack is particularly dangerous because it targets the specific moment when developers are most likely to let their guard down: when they're exploring an unfamiliar codebase and want the AI agent to help them understand it. "Just clone the repo and ask Claude Code to walk me through the architecture" is exactly the workflow this attack weaponizes.
## What the Patch Does
The v2.1.90 patch addresses the immediate problem: the 50-subcommand cap is removed, and the fallback behavior when the parser encounters an edge case is changed from `ask` to `deny`. Deny rules now apply correctly regardless of how many subcommands a config block contains.
The recommended security posture from the patch notes adds a tree-sitter deny-check pattern applied to the legacy code path as well, closing the secondary surface that earlier patches had missed. Workspace Trust Dialog handling is hardened to prevent repository-level settings from overriding workspace trust decisions without explicit user confirmation.
To verify you're on the patched version:
```bash
claude --version
# Should report 2.1.90 or later
```
If you're running Claude Code via the Anthropic API directly (not the CLI), check the release notes on platform.claude.com for the corresponding SDK version.
## What You Should Do Right Now
**Update immediately.** This is not a wait-for-the-next-scheduled-update situation. The attack vector is public, the proof-of-concept exists, and the repos that exploit it don't announce themselves.
**Audit your existing project configs.** If you've been running Claude Code against external repos without reviewing their `CLAUDE.md` files, review them now. Look for anything that invokes shell commands, accesses environment variables, or makes network requests. A legitimate `CLAUDE.md` rarely needs to do any of these things.
**Treat CLAUDE.md files like code, not documentation.** The mental model shift required here is significant. When you clone a repo with a `CLAUDE.md`, you're not just downloading a README. You're downloading instructions that a powerful agent will execute. Apply the same scrutiny you'd give to a `Makefile`, a `Dockerfile`, or a postinstall script.
**Use explicit allowlists, not denylists, for sensitive operations.** The vulnerability exposed a problem with deny rules specifically. If you're managing Claude Code permissions for a team, prefer explicit allowlists that enumerate what the agent is permitted to do, rather than denylists that attempt to enumerate everything forbidden. Allowlists don't have caps.
**Isolate agent sessions that touch untrusted code.** For any workflow that involves running Claude Code against external repos — code review, dependency auditing, open-source contribution — consider running the agent in a sandboxed environment with no access to credentials or production systems. A separate VM, a Docker container with stripped environment variables, or a fresh cloud dev environment are all reasonable options.
## The Bigger Picture: Agentic Security Is Still Young
This vulnerability is a sign of maturity, not failure. The fact that security researchers at Check Point and Adversa AI are actively auditing Claude Code means the tool has graduated to "worth attacking." The fact that Anthropic patched it quickly and published detailed CVE documentation means the security process is working.
But the category of agentic-coding-specific vulnerabilities is just getting started. CLAUDE.md-style config injection, prompt injection via repository comments, tool-chaining exploits, and credential exfiltration through seemingly benign file operations — these are all threat vectors that didn't exist three years ago. They're not going away.
If you're building SDD workflows or deploying Claude Code in production pipelines, security needs to be part of the architecture from the start. Not a checkbox at the end, not a trust in the AI to "know better." Defense in depth: review configs before running agents, isolate agent environments from production credentials, and stay current on CVEs for every tool in your agentic stack.
The CLAUDE.md trap is patched. Build as if the next one isn't.
---
*CVE-2026-21852 was patched in Claude Code v2.1.90, released April 6, 2026. CVE-2026-33068 addresses the related Workspace Trust Dialog bypass. Sources: [Adversa AI](https://adversa.ai/blog/claude-code-security-bypass-deny-rules-disabled/), [Check Point Research](https://research.checkpoint.com/2026/rce-and-api-token-exfiltration-through-claude-code-project-files-cve-2025-59536/), [Cybersecurity News](https://cybersecuritynews.com/claude-code-vulnerability/), [InfoWorld](https://www.infoworld.com/article/4154199/claude-code-is-still-vulnerable-to-an-attack-anthropic-has-already-fixed.html), [RAXE Labs](https://raxe.ai/labs/advisories/RAXE-2026-040).*
---
# SDD Is Eating Software Engineering: The Methodology That Went From Blog Post to Industry Movement
URL: https://sdd.sh/2026/04/sdd-is-eating-software-engineering-the-methodology-that-went-from-blog-post-to-industry-movement/
Date: 2026-04-07
Updated: 2026-04-07
Tags: Spec-Driven Development, Agentic Workflows, AWS Kiro, Frameworks, Industry
Categories: Spec-Driven Development, Industry
Summary: Spec-Driven Development has crossed from niche methodology to recognized category — with 30+ competing frameworks, a conference track at Agentic Conf Hamburg, AWS Kiro as the first commercial SDD IDE, and enterprise backing from McKinsey and Anthropic's own trend reports. Here's what's happening and what it means.
Sometime in early 2026, Spec-Driven Development stopped being a thing that a certain kind of developer did and became a thing that the industry named, argued about, and built products around. That transition matters. When a methodology acquires competing frameworks, a conference track, a McKinsey citation, and a dedicated commercial IDE, it has crossed from pattern to paradigm.
This is that moment for SDD. And if you've been doing this already — writing specs instead of code, letting Claude or another agent handle implementation — you're not ahead of the curve anymore. The curve just arrived.
## What Happened This Week
The convergence of signals was striking even by 2026's standards.
Vishal Mysore published a Medium piece mapping 30+ SDD and agentic-coding frameworks — SpecKit, OpenSpec, GSD, Devika, Tessl, and many more. The headline: "SDD is eating software engineering." It's a Marc Andreessen riff, which means it's aspirational, but the underlying map is real. These are not vaporware: several have production users, GitHub stars in the tens of thousands, and teams building on top of them.
Agentic Conf Hamburg 2026 accepted a session titled "Beyond the Vibes: Lessons from Using Spec-Driven Development Frameworks for Agentic Coding" — a talk that explicitly addresses the transition from informal vibe coding to structured SDD discipline. When a methodology gets a conference slot with the word "lessons" in the title, it means practitioners have accumulated enough production experience to have learned things worth sharing.
Rick's Cafe AI published a piece titled "The 2nd Phase of Agentic Development," framing the current moment as the end of the experiment and the beginning of the infrastructure. The first phase was about proving AI could write code. The second phase is about making that reliable, repeatable, and governable — which is exactly where SDD lives.
And Anthropic's own 2026 Agentic Coding Trends Report frames the shift from AI-as-assistant to AI-as-engineer as the central trend of the year, with SDD as the methodology that makes that shift tractable.
## AWS Kiro: The First Commercial SDD IDE
The most concrete signal is hardware — or rather, software with a price tag. AWS Kiro launched as the first commercial IDE built explicitly around the SDD model, with Agent Hooks and MCP integration as first-class citizens rather than afterthoughts.
The core Kiro workflow should feel familiar if you've been doing SDD with Claude Code: you write a spec, the agent reads it, implements it, runs tests, and iterates. What Kiro adds is a structured spec format (`.spec.md` files with explicit sections for requirements, acceptance criteria, and implementation notes), an Agent Hooks system that fires on spec changes (rerun tests, regenerate types, update documentation), and deep AWS integration so that Kiro agents can provision infrastructure as part of implementation, not as a separate step.
The Agent Hooks piece deserves attention. One persistent challenge with SDD is keeping the agent in sync with the spec as requirements evolve. If you update the spec, does the agent automatically re-run the affected tests? Does it regenerate the API contracts? Kiro automates these triggers, reducing the cognitive overhead of maintaining spec-to-implementation coherence over time.
The MCP integration is less novel — Claude Code, Cursor, Windsurf, and VS Code all support MCP at this point — but Kiro's implementation is tighter than most. Tool calls are scoped to the current spec, so an agent working on a database schema spec doesn't accidentally invoke file-system tools outside the schema directory. This kind of per-spec sandboxing is good security hygiene, and it's worth noting that it's architecturally similar to what Claude Code's permission system enables manually.
AWS backing Kiro is significant for enterprise adoption. The McKinsey/QuantumBlack agentic workflows piece from February 2026 — which is now heavily cited across the industry — concluded that the biggest barrier to agentic adoption in enterprises isn't AI capability, it's governance. "How do we audit what the agent did?" is the question that stops enterprise SDD deployments. A Kiro spec is, by design, an audit trail: requirements, acceptance criteria, and the gap between what was specified and what was implemented are all explicit and version-controlled.
## 30+ Frameworks: What They Get Right and Wrong
Mysore's framework map is a Cambrian explosion, and like most Cambrian explosions, a lot of these creatures won't survive. A few observations:
**The spec format wars have started.** Every framework has its own opinion about what a spec should look like. SpecKit uses a structured YAML front matter with a natural language body. OpenSpec goes full JSON Schema. GSD (Goal-Spec-Done) keeps it intentionally informal. Tessl invented its own DSL. None of them are wrong exactly, but the fragmentation means that skills and tooling built for one spec format don't transfer cleanly to another.
This is where Claude Code's format-agnostic approach has an advantage. Claude Code doesn't enforce a spec format — it reads whatever you write and infers structure from context. That's less elegant than a rigid schema, but it's more durable. Developers who have been writing `SPEC.md` files in their own ad-hoc style for the past year don't need to migrate anything.
**The best frameworks separate concerns cleanly.** The frameworks that seem most durable are the ones that treat the spec as the source of truth for *what* the system should do, and leave *how* entirely to the agent. The worst frameworks bleed implementation details into the spec — database column names, specific library choices, performance targets tied to current hardware. Over-specified specs age badly and fight the agent instead of guiding it.
**None of them has solved multi-agent spec coordination.** The hard problem in SDD isn't single-agent workflows — it's keeping multiple agents in sync when they're working from the same spec simultaneously. If two agents are implementing different sections of a spec and both touch the same interface, who wins? Most frameworks either ignore this problem or handle it with brute-force locking that serializes work anyway. Claude Code's mailbox architecture (covered here in March) is probably the most production-ready solution to this, but it requires explicit design, not something frameworks handle automatically yet.
## The Enterprise Imprimatur
The McKinsey/QuantumBlack paper matters because McKinsey papers are how methodologies get approved in enterprises. If your CTO read "agentic workflows reduce time-to-delivery by 40% in the teams we studied" in a McKinsey PDF, the conversation about whether to invest in SDD tooling just got much easier.
The data in that paper is real: TELUS, Zapier, and organizations in the McKinsey/QuantumBlack portfolio all show meaningful productivity gains from structured agentic workflows. The caveat — which McKinsey buries but is important — is that the gains come from *structured* agentic workflows. Teams that handed agents unstructured tasks and hoped for the best saw minimal gains. Teams that wrote specs, defined acceptance criteria, and built feedback loops between spec and implementation saw the 40% numbers.
That's the SDD thesis in a single data point.
## What This Means for Practitioners
If you've been doing SDD informally with Claude Code — writing SPEC.md files, using CLAUDE.md to enforce conventions, running agents against well-defined tasks — you're in a stronger position than you might realize. The methodology is being validated externally, which means the organizational politics of advocating for it are getting easier.
A few things to watch:
**The spec file as source of truth is becoming conventional wisdom.** Which means you'll start seeing spec-aware tooling in places that aren't SDD-specific. Expect CI/CD pipelines that validate implementation against spec, code review tools that surface spec-to-code divergence, and monitoring systems that alert when runtime behavior drifts from specified behavior.
**Framework convergence is coming.** The current fragmentation in SDD frameworks mirrors the early JavaScript framework wars — too many options, incompatible formats, duplicated effort. Consolidation will happen, probably around the formats that major tool vendors (AWS, Anthropic, Microsoft) choose to support natively.
**The role of the spec writer is becoming a real job title.** Anthropic's trends report notes that a new role is emerging: developers who spend most of their time writing and refining specs rather than writing code. This is the SDD practitioner's endgame. The spec is the leverage point; everything else is execution.
The methodology that Marc Andreessen would say is eating software engineering didn't start as a methodology. It started as a practice: write down what you want, let the AI build it, review the result. That practice is now a movement. The question isn't whether SDD will shape how software is built — it already is. The question is which tools, formats, and frameworks survive the consolidation.
---
*Sources: [Vishal Mysore — Medium](https://medium.com/@visrow/spec-driven-development-is-eating-software-engineering-a-map-of-30-agentic-coding-frameworks-6ac0b5e2b484), [Agentic Conf Hamburg 2026](https://agentic.hamburg/conf-2026/sessions/beyond-the-vibes/), [Rick's Cafe AI](https://cafeai.home.blog/2026/04/06/the-2nd-phase-of-agentic-development/), [Anthropic 2026 Agentic Coding Trends Report](https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf), [JetBrains Blog](https://blog.jetbrains.com/blog/2026/03/24/introducing-jetbrains-central-an-open-system-for-agentic-software-development/), [The New Stack](https://thenewstack.io/5-key-trends-shaping-agentic-development-in-2026/).*
---
# Windsurf After Cognition: GPT-5.4, One Million Users, and an Identity Crisis
URL: https://sdd.sh/2026/04/windsurf-after-cognition-gpt-5.4-one-million-users-and-an-identity-crisis/
Date: 2026-04-06
Updated: 2026-04-06
Tags: Windsurf, Cognition, GPT-5.4, AI coding tools, IDE, agentic coding
Categories: AI Tools, Industry
Summary: Windsurf has crossed one million active users, added GPT-5.4 with five reasoning effort levels, and is now fully under Cognition AI's ownership. The product is better. The question is whether it has found an identity that justifies its place in the market.
Since Cognition AI acquired Windsurf in December 2025 for roughly $250 million, the product has been busy. It's crossed one million active users. It added GPT-5.4 — OpenAI's latest frontier model — with five adjustable reasoning effort levels. LogRocket's AI Dev Tool Power Rankings put it at number one among IDE-native tools as of early 2026. It is, by most measures, better than it was.
That's worth acknowledging clearly before getting into the tensions underneath it.
## What's Actually New
The GPT-5.4 integration is the most substantive recent update. Unlike most model integrations that simply add a new model to a dropdown, Windsurf exposes five reasoning effort levels — from "minimal" (fast, lower cost, simple edits) to "maximum" (slow, expensive, deep architectural reasoning). This is a meaningful UX decision. It lets developers choose where on the latency/cost/quality curve they want to sit for each task, rather than paying frontier model prices for autocomplete suggestions.
Windsurf now supports Claude Opus 4.6, GPT-5.4, GPT-5.3-Codex, and Gemini 3.1 Pro as selectable backends. That multi-model flexibility is increasingly the default expectation for IDE-native tools, but Windsurf's implementation of adjustable reasoning effort is a differentiator — it's more granular than what Cursor offers today and considerably more transparent than GitHub Copilot's backend routing.
The one million active users milestone is significant, though the definition of "active" is doing some work in that sentence. Windsurf has always had a free tier, which drives adoption numbers. The question that matters more is paying retention — how many of those million users are subscribing, and at what tier. Cognition hasn't disclosed that.
## The Cognition Acquisition: What It Actually Changed
When Cognition bought Windsurf, the narrative was that two agentic companies were merging — Devin's underlying autonomy infrastructure would supercharge Windsurf's IDE experience. The combined entity would be the most capable coding assistant on the market.
Four months in, that thesis is partially validated. Windsurf's task queue and background agent features show Devin's DNA — you can hand off a multi-step refactor, close the IDE, and come back to a completed PR. That's genuinely useful and not something Cursor does as cleanly.
But the deeper Devin capabilities — autonomous, multi-step execution across repositories and external systems, the kind of thing Devin 2.0 demonstrated — haven't fully landed in Windsurf's mainstream product. The gap between Cognition's marketing and Windsurf's day-to-day experience is narrowing, but it hasn't closed.
There's also a pricing complexity that the acquisition introduced. Windsurf's subscription tiers, Devin's task-based billing, and the per-model compute costs for frontier models like GPT-5.4 at maximum reasoning effort don't map cleanly onto each other. Users who came for Windsurf's simple flat-rate model are finding the post-acquisition pricing more complicated than expected.
## The Fundamental Limitation Still Applies
Windsurf is an IDE wrapper. A very good one, with increasingly capable agent features baked in. But the model is still fundamentally: a human sits in an IDE, an AI assists that human.
That's not nothing. For a lot of developers and a lot of workflows, that's exactly the right model. Not every team is ready to hand a multi-agent system the keys to their production codebase. There's real value in an AI that makes your existing workflow faster without requiring you to rethink how you work.
But the trajectory of the AI coding tools market is toward autonomy. The tools that are winning the high end — Claude Code, Devin, GitHub Copilot Autopilot — are winning because they can operate independently of a human watching every move. They can run in CI, respond to GitHub events, operate from a mobile notification, spin up in a background thread while you do something else.
Windsurf, even post-Cognition, is still fundamentally anchored to the IDE session. The task queue and background agents are steps toward independence, but they're still initiated from within the IDE and primarily surfaced there. The async, terminal-native, notification-driven agentic model that Claude Code exemplifies is a different paradigm, not just a feature.
## What the Rankings Actually Measure
LogRocket's power ranking puts Windsurf at number one. That's based on a composite of user satisfaction scores, feature breadth, performance benchmarks, and developer survey data — all of which are legitimate signals. Windsurf scores well on user satisfaction because it's genuinely polished, the autocomplete is fast, and the multi-model support means users can swap backends when one is underperforming.
But power rankings based on surveys measure what developers *currently value*, not what they *will need* when the baseline shifts. In 2024, "best code completion" was the primary axis. In 2026, the axis is shifting toward "most capable autonomous execution." By that measure, the rankings look different.
## Who Windsurf Is Actually For
After everything: Windsurf is the right choice for developers who want a premium IDE experience with AI deeply integrated, don't want to leave their editor, and want access to multiple frontier models with sensible cost controls. The GPT-5.4 reasoning levels are a genuine UX innovation. The multi-model flexibility is real.
It's not the right choice for teams who want to move toward autonomous, asynchronous AI workflows where the agent operates without a human at the keyboard. For those teams, the IDE-centric model is the constraint, not the feature.
Cognition's acquisition was supposed to resolve that tension. The jury is still out on whether it will. Right now, Windsurf is a better Windsurf — but it's still Windsurf.
---
**Sources:**
- [Windsurf — GPT-5.4 Integration Announcement](https://windsurf.com/blog/gpt-5.4)
- [Cognition AI — Windsurf Acquisition Announcement](https://cognition.ai/blog/windsurf)
- [LogRocket AI Dev Tool Power Rankings 2026](https://blog.logrocket.com/ai-dev-tool-power-rankings-2026/)
- [VentureBeat — Cognition/Windsurf Acquisition Overview](https://venturebeat.com/ai/cognition-windsurf-acquisition/)
---
# Anthropic's OpenClaw Ban Is a Platform Power Move — And an Honest One
URL: https://sdd.sh/2026/04/anthropics-openclaw-ban-is-a-platform-power-move-and-an-honest-one/
Date: 2026-04-06
Updated: 2026-04-06
Tags: Anthropic, Claude Code, OpenClaw, platform strategy, agentic tools
Categories: AI Tools, Industry
Summary: Anthropic just blocked Claude Pro and Max subscribers from using their subscriptions with OpenClaw and other third-party harnesses. The decision is strategically transparent, commercially necessary — and a sign of where the agentic ecosystem is heading.
On April 4, 2026, Anthropic drew a quiet but significant line. Effective noon PT, Claude Pro and Max subscribers can no longer use their flat-rate subscriptions to power third-party agentic harnesses — most prominently OpenClaw. Users who want to keep running Claude through external tooling must switch to pay-as-you-go "extra usage" billing or the raw API, which for heavy users can mean costs 30–50x higher.
Boris Cherny, Head of Claude Code at Anthropic, announced the change directly. The explanation was technical: third-party harnesses, especially OpenClaw, bypass Claude's prompt-caching layer, driving dramatically higher infrastructure costs than native Claude Code sessions for the same apparent subscription tier.
That's a legitimate reason. But the context around it is worth unpacking.
## The OpenClaw Problem — and Its Political Layer
OpenClaw started as an unofficial Claude Code client with a richer UI, better file management, and features that Anthropic hadn't shipped natively. It was popular precisely because it filled gaps in Claude Code's own interface. Developers used it to get a better experience out of their existing Claude subscriptions.
Then, in February 2026, OpenClaw's creator Peter Steinberger joined OpenAI. The project was handed to an open-source foundation — with OpenAI backing. Weeks later, Anthropic shut off third-party subscription access.
The timing is not invisible. Steinberger has been vocal about it, accusing Anthropic of copying OpenClaw's popular features into Claude Code's native interface and then locking out the open-source version once the competitive threat was apparent. That's a hard claim to prove, but it's also not an unreasonable read.
Anthropic's official position is that this is about infrastructure costs and sustainable pricing, not competitive suppression. They're offering a one-time credit equal to one month of subscription cost (redeemable through April 17) and up to 30% off pre-purchased extra usage bundles. That's a gesture toward goodwill, even if it doesn't fully absorb the cost delta for power users.
## What This Actually Reveals
Strip away the politics for a moment. What Anthropic is doing here is structurally consistent with how platform companies behave when a protocol layer they control starts being used in ways that threaten unit economics or strategic direction.
Claude Pro and Max subscriptions were designed around an assumption: that usage would flow through Claude.ai or Claude Code, where Anthropic controls the session management, prompt caching, and tool routing. Third-party harnesses that bypass caching are a real cost problem — a user paying $20/month can generate API costs that would justify $200+/month if priced at API rates.
From that frame, the decision makes sense. You can't sustainably run an infrastructure-heavy AI platform where any external developer can arbitrage flat-rate access through a custom client.
The harder question is whether Anthropic handled the transition well. A migration path announced with 24-hour notice and a limited credit window is not generous. Developers who built workflows on OpenClaw — workflows that weren't breaking any written terms of service — got caught in a policy change they had no warning about.
## Claude Code Channels: The Native Alternative
The timing is also notable because Anthropic shipped Claude Code Channels — Telegram and Discord integration for Claude Code — around the same period. It's not subtle. The message is: you don't need OpenClaw to get async, mobile-accessible agentic Claude. Use our native integration.
Claude Code Channels gives you Claude Code sessions that run asynchronously, report back via Telegram or Discord, and persist context between messages. For many of the use cases OpenClaw was serving — "run this refactor while I'm away from my desk" or "kick off a test suite from my phone" — Channels is a direct replacement.
Whether it's as flexible as OpenClaw for power users is another question. Anthropic's native interface has historically lagged behind what the community builds. That's partly why OpenClaw existed in the first place.
## The Platform Lock-In Question
This episode surfaces a tension that will keep repeating as Anthropic's platform grows: how open is Claude Code's ecosystem, really?
Anthropic has invested heavily in the MCP ecosystem, co-founded the Agentic AI Foundation with OpenAI under the Linux Foundation, and built a public SDK for Claude Code extensions. That's genuine ecosystem investment. It's not a walled garden in the way older platform incumbents built.
But subscription access is a different layer. Anthropic is drawing a distinction between "you can build tools on top of Claude's API" (permitted, paid per token) and "you can use flat-rate subscription billing to run those tools" (now blocked for third-party clients). That distinction matters economically, and Anthropic is enforcing it.
The risk is developer trust. Anthropic's relationship with the open-source and indie developer community has been a genuine competitive advantage — the enthusiasm around Claude Code's extensibility, the MCP ecosystem, the skills library, the community-built tooling. Actions that look like "copy popular features, then lock out the originators" damage that relationship, even if the underlying decision has a defensible business rationale.
## What Developers Should Do
Practically speaking: if you were on OpenClaw, your options are now API access at full token pricing, Claude Code native with Channels for async workflows, or switching platforms. Anthropic's credit offer is real but limited in time — if you're a heavy user, run the math on extra usage bundles before the discount expires.
If you were building integrations on top of Claude subscriptions: this is a signal to price those integrations against API costs, not subscription costs. Flat-rate access was never designed to cover arbitrary third-party session management, and Anthropic has now confirmed they'll enforce that.
The broader lesson is one the Claude Code ecosystem will keep learning: Anthropic will invest in extensibility at the tool and protocol layer, but it will protect economics at the billing layer. Those are not incompatible positions — but they're worth being clear-eyed about as you build.
---
**Sources:**
- [TechCrunch — Anthropic Cuts Off OpenClaw Support for Claude Subscribers](https://techcrunch.com/2026/04/04/anthropic-says-claude-code-subscribers-will-need-to-pay-extra-for-openclaw-support/)
- [VentureBeat — Anthropic Cuts Off the Ability to Use Claude Subscriptions with OpenClaw](https://venturebeat.com/technology/anthropic-cuts-off-the-ability-to-use-claude-subscriptions-with-openclaw-and)
- [The Decoder — Anthropic Cuts Off Third-Party Tools Like OpenClaw for Claude Subscribers](https://the-decoder.com/anthropic-cuts-off-third-party-tools-like-openclaw-for-claude-subscribers-citing-unsustainable-demand/)
- [TNW — Anthropic Bans OpenClaw from Claude Subscriptions Over Cost](https://thenextweb.com/news/anthropic-openclaw-claude-subscription-ban-cost)
---
# Gemma 4: Google Just Made the Case for Running Your Coding Agent Locally
URL: https://sdd.sh/2026/04/gemma-4-google-just-made-the-case-for-running-your-coding-agent-locally/
Date: 2026-04-05
Updated: 2026-04-05
Tags: Gemma, Google, open-weight, local AI, coding agents, benchmarks, Apache 2.0
Categories: AI Tools, Agentic Workflows
Summary: Google's Gemma 4 dropped on April 2 with Apache 2.0 licensing, 80% on LiveCodeBench v6, a Codeforces ELO of 2,150, and agentic tool-use scores that make the previous generation look like a prototype. The 26B MoE model runs on a single consumer GPU with 256K context. Here's what it actually means.
For the past two years, the case for running a capable coding agent entirely on your own hardware has been theoretical. The open-weight models were good — genuinely impressive for their size — but when it came to the things that matter most for agentic coding (tool use, sustained reasoning, large context), they were still a tier below the frontier API models. You ran local models for privacy or cost reasons, accepting a performance penalty to do so.
Google's Gemma 4, released April 2, 2026, changes that calculus. Not completely, and not without caveats. But meaningfully.
## The Numbers That Matter
Gemma 4 ships in four variants. The two that matter most for coding agents are the **26B Mixture-of-Experts** and the **31B Dense** model.
The 31B Dense scores **80.0% on LiveCodeBench v6** with a Codeforces ELO of **2,150**. Those are not open-weight numbers — that is competitive with frontier API models from six months ago. The 26B MoE follows closely at **77.1% LiveCodeBench** and **1,718 Codeforces ELO**.
For context: Gemma 3 27B, the previous generation, scored **29.1% on LiveCodeBench** and a Codeforces ELO of **110**. That is not a modest improvement. That is a different class of capability.
The agentic tool-use numbers are equally striking. On τ2-bench (a retail task-completion benchmark that measures how well a model actually uses tools to accomplish goals), Gemma 4 31B scores **86.4%**. Gemma 3 27B scored **6.6%**. The model that came before Gemma 4 was not a coding agent in any meaningful sense. Gemma 4 is.
## The License Change That Matters More Than Benchmarks
Every previous Gemma release shipped under a restrictive custom license that limited commercial use and prohibited various deployment patterns. Gemma 4 ships under **Apache 2.0**.
VentureBeat's coverage put it directly: "the license change may matter more than the benchmarks." That is not hyperbole. Apache 2.0 means Gemma 4 can be embedded in commercial products, fine-tuned and redistributed, deployed in any cloud or on-premise environment, and integrated into enterprise toolchains without legal review. The prior license required that review. Most companies never got through it.
The practical effect: Gemma 4 is now the strongest fully-commercial open-weight coding model available. DeepSeek V3.2 ships under MIT. Qwen 3.5 ships under Apache 2.0. Gemma 4 is now in the same tier on licensing while competing on benchmarks.
## Hardware: What You Actually Need
The four model variants have meaningfully different hardware requirements:
| Model | Minimum VRAM | Practical GPU |
|---|---|---|
| E2B (2.3B) | ~3–4 GB | Any modern GPU; Raspberry Pi 5 works |
| E4B (4.5B) | ~6 GB | RTX 3060 (8GB) |
| 26B MoE | ~8 GB | RTX 3080 (10GB) or M2 Pro |
| 31B Dense | ~20 GB | RTX 3090/4090 (24GB) or M3 Max |
The **26B MoE is the practical sweet spot** for coding agent use. It fits on a single consumer GPU, delivers 77.1% LiveCodeBench, supports the full 256K context window, and runs at approximately 150 tokens/second on an RTX 4090. That throughput is fast enough for agentic loops where the model is calling tools and processing results iteratively.
For Apple Silicon: the MLX backend handles all four variants. A MacBook Pro M3 Max (128GB unified memory) runs the 26B MoE comfortably at good throughput. The 31B Dense fits on M2 Ultra or M3 Max configurations.
The E2B and E4B variants deserve a mention for a different reason: they run on edge hardware. Gemma 4 E2B has been demonstrated at 7.6 decode tokens/second on a Raspberry Pi 5, and 31 tokens/second on a Qualcomm Dragonwing IQ8 NPU. These are not coding agent numbers — they are edge deployment numbers. But the capability progression from E2B to 31B Dense under a single unified model family is notable.
## The 256K Context Window
Both the 26B MoE and 31B Dense support **256K token context**. For a coding agent, that means you can feed an entire mid-sized codebase — API layer, frontend, database schema, test suite, documentation — into a single prompt. No chunking, no summarization, no retrieval-augmented lookup to find the relevant file. The model sees everything.
This is enabled by Proportional RoPE on global attention layers, combined with an alternating attention architecture that mixes local sliding-window attention (1,024 tokens) with full global attention. The design keeps computation tractable at long context lengths without the quality degradation that typically appears when naive RoPE scaling is used.
## Day-One Tooling: Mostly Ready
Gemma 4 launched with support across the full stack: Ollama, LM Studio, MLX, llama.cpp, Hugging Face Transformers, vLLM, SGLang, Unsloth, NVIDIA NIM/NeMo, Keras, and JAX. The model integrates with any OpenAI-compatible server — feed it to `aider`, `continue.dev`, or any other tool that speaks the OpenAI API via `llama-server`.
One caveat that matters if you are building a coding agent: **tool calling is currently broken in Ollama v0.20.0**. The streaming parser drops tool calls into the reasoning field rather than parsing them as structured output. A workaround exists (a community gist for OpenCode users that patches the streaming response), but this is a real bug that will affect any agent that relies on Ollama-served Gemma 4 for tool use. Track the Ollama release notes; this will likely be patched quickly, but check before you build on it.
Beyond Ollama, tool calling works correctly through llama.cpp's server, through the Transformers `pipeline` interface, and through fine-tuning frameworks like Unsloth. If you are building a production coding agent on Gemma 4, skip Ollama for now and use llama.cpp's OpenAI-compatible server instead.
## Enable Thinking: The On-Demand Reasoning Mode
Gemma 4 ships with a chain-of-thought reasoning mode activatable at inference time via `apply_chat_template(..., enable_thinking=True)`. When enabled, the model works through a problem step-by-step before producing output — similar to Claude's extended thinking or the reasoning modes in newer GPT models.
For coding tasks, this matters most for algorithm design, complex debugging, and refactoring decisions where a direct answer is less reliable than a reasoned one. The AIME 2026 score — **89.2% for 31B Dense** — is the benchmark proxy for this reasoning quality.
You can turn it off for fast, simple completions and on for tasks that benefit from deeper analysis. That toggle-at-inference-time design is more practical than models that either always reason (slow, expensive) or never do (fast, sometimes wrong).
## Where Gemma 4 Does Not Lead
Honest assessment: **DeepSeek V3.2 still leads on raw coding benchmark performance** for large-scale code generation. For SWE-bench Verified-style task completion, GLM-5 has posted 77.8%, and DeepSeek V3.2's numbers on general code generation continue to impress. If raw coding throughput is your only metric and you can use a cloud-hosted open model, those alternatives are worth evaluating.
Qwen 3.5 27B is the nearest competitor to Gemma 4 26B MoE on the specs that matter: similar coding benchmark performance, Apache 2.0, comparable context length. For pure coding tasks, Qwen3 Coder Next (80B MoE, 3B active parameters) is specifically tuned for coding agents and delivers strong results — though at infrastructure complexity that offsets the "local" advantage.
The distinction worth making: Gemma 4 26B MoE is the strongest option **for private, local, single-GPU deployment** of a coding agent. If your threat model requires that no code leaves your machine, or your cost model requires zero API spend, Gemma 4 is now the answer to reach for.
## What This Means
The practical conclusion is straightforward. Teams that have been deferring the "run AI locally" decision because local models were not good enough for real agentic work now have a concrete option. A developer with an RTX 4090 or a MacBook Pro M3 Max can run a coding agent that scores within striking distance of the frontier cloud models from a year ago — with Apache 2.0 licensing, 256K context, and production-grade tool use — at zero marginal cost per token.
That changes the economics for privacy-sensitive codebases, air-gapped environments, teams in jurisdictions with strict data residency requirements, and individual developers who want capable AI assistance without the API bill.
The tool-calling Ollama bug is temporary. The Apache 2.0 license and the benchmark numbers are not.
---
**Sources**
- [Gemma 4: Byte for byte, the most capable open models — Google Blog](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/)
- [Gemma 4 — Google DeepMind](https://deepmind.google/models/gemma/gemma-4/)
- [Welcome Gemma 4: Frontier multimodal intelligence on device — Hugging Face](https://huggingface.co/blog/gemma4)
- [Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog](https://developers.googleblog.com/bring-state-of-the-art-agentic-skills-to-the-edge-with-gemma-4/)
- [Gemma 4 and what makes an open model succeed — interconnects.ai](https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model)
- [Google releases Gemma 4 under Apache 2.0 — and that license change may matter more — VentureBeat](https://venturebeat.com/technology/google-releases-gemma-4-under-apache-2-0-and-that-license-change-may-matter)
- [AINews: Gemma 4 — The Best Small Multimodal — latent.space](https://www.latent.space/p/ainews-gemma-4-the-best-small-multimodal)
- [Gemma 4 Hardware Requirements — avenchat](https://avenchat.com/blog/gemma-4-hardware-requirements)
- [Gemma 4 VRAM Requirements — DEV Community](https://dev.to/geek_/gemma-4-vram-requirements-the-hardware-guide-i-wish-i-had-3plo)
- [Ollama library — gemma4](https://ollama.com/library/gemma4)
- [Google releases Gemma 4 — Hacker News thread](https://news.ycombinator.com/item?id=47616361)
---
# Claude's 1M Context Window Is Now Standard: What Actually Changes for Agentic Coding
URL: https://sdd.sh/2026/04/claudes-1m-context-window-is-now-standard-what-actually-changes-for-agentic-coding/
Date: 2026-04-05
Updated: 2026-04-05
Tags: Claude, Anthropic, context window, agentic workflows, Claude Code, API, Sonnet 4.6, Opus 4.6
Categories: AI Tools, Agentic Workflows
Summary: On March 13, Anthropic made the 1M token context window standard on Sonnet 4.6 and Opus 4.6 — no beta header, no pricing premium above 200K. Here is what that actually changes for coding agents, how it compares to the competition, and what it still cannot solve.
Long context has been a headline feature since 2024. The arms race — first 100K, then 200K, then 1M tokens — has generated more marketing copy than practical guidance on what these numbers actually mean for the people building software with AI agents.
On March 13, 2026, Anthropic made the 1M token context window **generally available** on Claude Sonnet 4.6 and Claude Opus 4.6. No beta header. No per-token surcharge above 200K. Standard pricing throughout. That is the administrative detail. The practical question is: what changes?
## From Beta to Standard
Until March 13, accessing more than 200K tokens on Claude required including the `anthropic-beta: context-1m-2025-08-07` header in your API requests and paying a long-context premium — a per-token surcharge applied to everything above 200K tokens. Both the header requirement and the surcharge are now gone.
If you were already using the beta header, you can leave it in place — it is silently ignored. No code changes required. If you were staying under 200K to avoid the premium, that constraint is removed. Opus 4.6 at $5/$25 per million input/output tokens applies uniformly whether you send 9,000 tokens or 900,000.
This is not a new capability. Claude has supported 1M context since August 2025. What changed is that the capability is now first-class, not experimental — and the pricing model no longer penalizes you for using it.
## Where Claude Sits in the Long-Context Market
The 1M context field is more crowded than it was a year ago.
| Model | Context Window | Notes |
|---|---|---|
| Claude Opus 4.6 / Sonnet 4.6 | 1M tokens | GA, standard pricing |
| Gemini 3.1 Pro | 1M tokens | Released February 2026 |
| GPT-5.4 | 1M tokens | Released March 2026; $2.50/MTok input |
| GPT-5.3-Codex | 400K tokens | OpenAI's coding model; capped here |
The raw token count comparison understates what matters: **recall quality at long context lengths**. Anthropic benchmarks Claude at **78.3% on MRCR v2** (a needle-in-haystack style recall evaluation at extreme context lengths) — the highest published score among frontier models at this context length. Gemini 3.1 Pro matches on window size but trails on MRCR v2. GPT-5.3-Codex, OpenAI's dedicated coding model, is capped at 400K — a real disadvantage for large codebase work.
GPT-5.4 reaches 1M at cheaper input pricing, but recall quality at the extremes is where the differentiation lies for coding agents, not raw token count. A model that can ingest one million tokens but reliably retrieves only what appears at the start or end of the window is less useful than a model with 78% recall uniformly distributed across the whole thing.
## What 1M Context Actually Unlocks for Coding
The honest answer is that 1M context does not change what is *possible* for most tasks — it changes what is *convenient* and what is *reliable*.
**Whole-codebase analysis without chunking.** A typical mid-sized production application — API server, frontend, database migrations, test suite, CI configuration, documentation — fits inside 1M tokens. Previously, you would chunk the codebase, retrieve relevant sections via embeddings or keyword search, and feed those chunks to the model. That process introduces retrieval errors: if the relevant code is not retrieved, the model works with incomplete context. At 1M context, you feed everything and let the model find what it needs. Fewer retrieval errors, more complete analysis.
**Multi-step agentic loops without context loss.** A long Claude Code session — search logs, cross-reference source code, examine test failures, trace a bug through five files, propose a fix, revise based on test output — accumulates context quickly. At 200K, that session compacts. Compaction means earlier context is summarized or dropped; the agent may forget decisions it made in step three when it is working on step twelve. Claude Code teams have measured a **15% reduction in compaction events** since 1M context became standard. Sessions run longer before the agent loses its grip on earlier work.
**Richer input alongside code.** Up to 600 images or PDF pages per request (up from approximately 100). An architecture diagram, an API spec in PDF, a runbook screenshot, and the relevant source code can all live in the same context. This matters for the kind of reasoning that starts from "here is how this system was designed to work" and ends at "here is why it does not."
**For Spec-Driven Development specifically:** the 1M window is the right fit for the spec-to-implementation workflow. A complete specification document, the existing codebase the implementation must integrate with, the test suite the implementation must pass, and prior implementation history can coexist in a single context. The model plans and implements against the whole picture, not a summarized version of it.
## The Other Updates That Shipped With It
The 1M GA was the headline, but the surrounding API changelog from March and April 2026 is worth reviewing:
**Message Batches max_tokens raised to 300K** (March 30). For batch code generation — generating implementations for dozens of spec files simultaneously — the previous output ceiling limited what you could produce per batch item. 300K max_tokens per batch item removes that ceiling for most practical purposes.
**`thinking.display: "omitted"` field** (March 16). When using Claude's extended thinking mode, you can now suppress the chain-of-thought from the response without losing the multi-turn continuity that requires the thinking signature. The model still thinks; you just do not receive — or pay to transmit — the thinking content. Useful for production agents where thinking tokens are overhead, not output.
**Models API capability fields** (March 18). `GET /v1/models/{model_id}` now returns `max_input_tokens`, `max_tokens`, and a `capabilities` object. This is plumbing, but useful plumbing: agents can now introspect the limits of the model they are using rather than hardcoding them.
**Haiku 3 retires April 20.** If you are still running `claude-3-haiku-20240307` anywhere, migrate to `claude-haiku-4-5-20251001` before April 20. Requests to the old model ID will return an error after that date.
## What 1M Context Does Not Solve
Being precise about this matters because the marketing around long context tends to oversell.
**Context rot is real.** All long-context models degrade at extreme lengths. Information reliably retrieved near the beginning and end of the window is less reliably retrieved from the middle. A 78.3% MRCR v2 score is excellent; it also means roughly one in five needle-in-a-haystack retrievals fails. For tasks where missing a critical detail matters — security analysis, correctness-sensitive refactoring — long context does not substitute for careful prompt design.
**Token cost compounds.** In a multi-turn agentic session, every turn reprocesses the accumulated context. A session that runs to 500K tokens is expensive on a per-turn basis. The 1M window raises what is *possible* per session; prompt discipline still determines what is *economical*.
**Latency scales with context.** Processing 1M tokens takes more compute than processing 100K tokens. For real-time interactive use cases — completions that need to appear in under two seconds — the 1M window is not the right tool. Use Haiku 4.5 or Claude Code's compaction API for latency-sensitive paths; reserve the 1M window for the planning and analysis phases of an agentic workflow.
## The Practical Conclusion
The 1M context GA is not a feature launch — it is a pricing and availability change for a capability that already existed. The practical effect is that the constraint disappears. Developers who were staying under 200K to avoid the surcharge no longer have to. Teams building coding agents who were managing chunking pipelines to stay within window limits can simplify their architecture.
The capability that matters most for coding agents — ingesting a full codebase and reasoning across it without retrieval gaps — is now available at standard rates on the models best suited for agentic work.
For SDD workflows, the implication is direct: write the spec, point the agent at the whole codebase, and let it implement. No chunking, no partial context, no "here is a summary of the files I did not include." The context window is large enough to hold the whole problem.
---
**Sources**
- [1M context window generally available — Anthropic Blog](https://claude.com/blog/1m-context-ga)
- [Claude API Release Notes — Anthropic](https://platform.claude.com/docs/en/release-notes/overview)
- [Model Deprecations — Anthropic](https://platform.claude.com/docs/en/about-claude/model-deprecations)
- [Anthropic adds 1M context to Opus 4.6 and Sonnet 4.6 — Medium](https://medium.com/ai-software-engineer/anthropic-adds-1-million-context-window-to-opus-4-6-sonnet-4-6-now-you-can-code-at-scale-f5a932ba347c)
- [Claude 1M context guide 2026 — Karol Zieminski / Substack](https://karozieminski.substack.com/p/claude-1-million-context-window-guide-2026)
- [Claude Code 1M context for large codebases — Verdent](https://www.verdent.ai/guides/claude-code-1m-context-window)
- [AI context window comparison 2026 — Digital Applied](https://www.digitalapplied.com/blog/ai-context-window-comparison-2026-1m-to-10m-tokens)
- [Gemini 3.1 Pro context window — MarkTechPost](https://www.marktechpost.com/2026/02/19/google-ai-releases-gemini-3-1-pro-with-1-million-token-context-and-77-1-percent-arc-agi-2-reasoning-for-ai-agents/)
---
# Pinterest's MCP Blueprint: 66,000 Invocations a Month, 7,000 Hours Saved — This Is What Production MCP Looks Like
URL: https://sdd.sh/2026/04/pinterests-mcp-blueprint-66000-invocations-a-month-7000-hours-saved-this-is-what-production-mcp-looks-like/
Date: 2026-04-04
Updated: 2026-04-04
Tags: MCP, enterprise, case study, agentic workflows, Pinterest, production
Categories: AI Tools, Agentic Workflows
Summary: MCP hit 97 million downloads. Pinterest just showed what you do with them. Their production MCP ecosystem — domain-specific servers, a central registry, two-layer JWT auth, and hard ROI numbers — is the blueprint every serious engineering team will follow.
The MCP adoption story so far has been told in download counts. Ninety-seven million installs. OpenAI adopting the protocol. The Linux Foundation taking over governance. Big numbers, but abstract ones — the kind that tell you a technology is winning without telling you what winning actually looks like when it runs in production at scale.
Pinterest just filled in that gap.
In a detailed writeup published on the Pinterest Engineering Blog in March 2026, the company laid out how it built a full production MCP ecosystem: the architecture, the security model, the governance process, and the results. The numbers are real: **66,000 tool invocations per month**, **844 monthly active users**, **7,000 hours saved per month**. This is what MCP looks like when it graduates from prototype to infrastructure.
## The Core Design Decision: Fleet, Not Monolith
Most early MCP deployments take the obvious path: one big MCP server, all your tools in one place. Pinterest deliberately rejected this. Instead, they built a **fleet of domain-specific MCP servers** — separate servers for Presto (their query engine), Spark (data processing), Airflow (orchestration), and other internal systems.
The logic is sound. A monolithic MCP server grows without bound. Every team dumps their tools in, context windows bloat, the agent has to wade through hundreds of irrelevant tool descriptions to find the two it needs. Isolation limits that. A Spark-specific server contains exactly the tools relevant to Spark workflows, nothing else. The agent retrieves cleaner context, makes better decisions, and runs faster.
There is also a security argument. Fine-grained access control is easier when tools are grouped by domain. You can grant a data analyst access to the Presto MCP server without also granting them access to the deployment tooling server. A monolith makes this harder.
## The Registry: Governance Without Bureaucracy
The fleet model creates its own problem: discovery. If there are a dozen domain-specific MCP servers, how does an agent — or a developer — know which one to call? Pinterest's answer is a **central MCP registry**.
The registry serves two audiences simultaneously. For humans, it is a searchable catalog: browse available tools, read documentation, see which teams own which servers. For agents, it exposes an API — clients query the registry for available tools before making calls, rather than hardcoding server addresses. This gives the platform team a single control point for versioning, deprecation, and governance.
That governance layer is not optional. Before any MCP server can reach production at Pinterest, it requires a review ticket that touches three teams: Security, Legal/Privacy, and the internal GenAI platform group. This is not about slowing things down; it is about making sure that when a tool invocation triggers a Presto query against production data, someone has thought through the access implications. The registry enforces this — servers that have not completed review do not appear in the catalog.
## Two-Layer JWT: Human Loops and Service Flows
The security architecture is worth examining in detail because it solves a problem that most MCP deployments gloss over: the difference between a human-initiated agentic flow and a fully automated service-to-service flow.
Pinterest uses **two-layer JWT authorization**. When a human is involved — a developer triggering an agent that calls an MCP server — end-user JWTs authenticate the request. The tool knows who the human is, and access policies apply to that person's permissions. When the flow is fully automated, service-mesh identities replace user JWTs. The tool knows it is being called by a trusted internal service, and applies service-level policies instead.
This distinction matters more than it might appear. In a naive single-auth model, automated flows either inherit human permissions (a security risk — what "human" do you use?) or get unrestricted service access (a bigger risk). The two-layer model handles both cases correctly.
## The Numbers: What 7,000 Hours Actually Means
Pinterest measured impact the way any honest engineering team would: by assigning a "minutes saved per invocation" estimate to each tool, cross-referencing with actual invocation counts, and aggregating. The methodology is transparent enough to be credible and conservative enough to be defensible.
At **66,000 invocations per month** across **844 monthly active users**, the estimated savings come to **7,000 hours per month**. That is roughly 8.3 hours saved per active user per month — a bit more than one full workday.
To put this in context: Pinterest engineering employs approximately 1,000 engineers. If 844 of them are active MCP users saving 8 hours a month, the platform is recovering the equivalent of roughly two full-time engineers per month in productivity. The deployment cost of the MCP infrastructure is not public, but the order of magnitude makes the ROI case nearly self-evident.
## What This Blueprint Tells the Rest of Us
Pinterest is not a startup experimenting with AI tools. It is a production engineering organization with strict data governance requirements, real security constraints, and a platform team that has to justify infrastructure investments. The fact that they shipped this, and that it generated these numbers, carries a different weight than a hackathon demo or a beta blog post.
A few things stand out as patterns worth replicating:
**Domain-specific servers over monoliths.** The context efficiency and access control benefits compound as you add more tools. Start narrow; expand incrementally.
**Registry-first discovery.** Hardcoded server addresses are technical debt from day one. A registry gives you versioning, governance, and discoverability without requiring agents to have baked-in knowledge of your infrastructure topology.
**Explicit security review, not optional.** The two-layer JWT model and the review process gate are not bureaucracy — they are what makes it possible to grant broad tool access without also granting broad data access. Skipping this creates liability that will surface later.
**Measure what matters.** "Minutes saved per invocation" is a replicable measurement model. It is not perfect — some invocations save 30 seconds, others save an afternoon — but it is honest and scalable. Teams without measurement cannot justify continued investment.
## The Gap the MCP Narrative Had
The 97 million download story was real, but it was a supply-side story: protocol adoption, ecosystem momentum, vendor commitments. Pinterest is the first major public case study on the demand side — what happens when a production engineering organization actually deploys MCP and runs it at scale.
The answer is: it works. The ROI is measurable. The architecture has solved the hard problems (governance, security, discovery) in ways that other teams can copy. The 97 million installs now have a concrete reference implementation behind them.
The question for other engineering organizations is no longer whether MCP is ready for production. Pinterest's writeup closes that debate. The question is how quickly you build your own registry.
---
*Sources: [Pinterest Engineering Blog — Building an MCP Ecosystem at Pinterest](https://medium.com/pinterest-engineering/building-an-mcp-ecosystem-at-pinterest-d881eb4c16f1); [InfoQ — Pinterest Deploys Production-Scale MCP Ecosystem for AI Agent Workflows](https://www.infoq.com/news/2026/04/pinterest-mcp-ecosystem/)*
---
# GitHub Copilot CLI Goes GA: Microsoft Just Admitted Claude Code Was Right
URL: https://sdd.sh/2026/04/github-copilot-cli-goes-ga-microsoft-just-admitted-claude-code-was-right/
Date: 2026-04-04
Updated: 2026-04-04
Tags: GitHub Copilot, CLI, agentic coding, Claude Code, Microsoft, terminal, Cursor
Categories: AI Tools, Industry
Summary: GitHub Copilot CLI reached general availability on February 25 with full autopilot mode, multi-model support, and a cloud offload feature that lets you delegate to an agent mid-session. Microsoft just shipped a terminal-native agentic coding tool. The irony is deliberate.
For the past year, the dominant framing in AI coding tools has been IDE-centric: Cursor, Copilot Chat, Windsurf, all living inside the editor, all asking for your approval before touching anything important. The terminal-native, fully agentic model — plan, execute, iterate, ship — was Claude Code's thesis. Anthropic's bet that serious autonomous development belongs in the terminal, not nested inside an IDE extension.
On February 25, 2026, GitHub shipped Copilot CLI to general availability. It has an autopilot mode. It runs in the terminal. It delegates to specialized sub-agents. It supports cloud offload. It ships with MCP built in.
Microsoft just published a concession statement, and they formatted it as a changelog entry.
## What Copilot CLI Actually Does
The feature list is not subtle about what this tool is designed to compete with.
**Two modes**: Plan mode (guided, pauses for human approval at key steps) and Autopilot mode (fully autonomous — executes shell commands, calls tools, iterates on test failures, ships without asking). If you have used Claude Code's default mode and yolo mode, you know this design.
**Specialized sub-agents**: Copilot CLI delegates to four internal agents depending on the task — Explore (codebase analysis and search), Task (builds, test runs, iteration), Code Review (diff analysis, issue surfacing), and Plan (implementation strategy). This mirrors the multi-agent architecture that Anthropic's agent teams work have been developing, with domain-specific agents that hand off to each other.
**Cloud offload**: Prefix any prompt with `&` and the task gets delegated to GitHub's cloud coding agent while your local terminal session stays free. Use `/resume` to pull the remote session back locally. This is the asynchronous workflow story — start a complex refactor, go do something else, come back when it is done. Claude Code has had async via background agents; Copilot CLI is now matching it.
**Multi-model from day one**: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex, Gemini 3 Pro. You pick the model per task. This is partly a hedge — Microsoft does not want to be dependent on OpenAI — and partly a genuine acknowledgment that different models have different strengths for different workloads.
**GitHub MCP server built in**, plus support for custom MCP servers. Given that MCP just crossed 97 million installs and the protocol is effectively the standard now, this is table stakes, but it is worth noting that Microsoft shipped it from day one rather than bolting it on later.
## The Timeline That Makes This Significant
Copilot CLI entered public preview in September 2025. It reached GA on February 25, 2026. The gap between preview and GA was five months — fast by enterprise software standards, and fast by GitHub's standards specifically.
Compare this to the broader Copilot agentic rollout: agent mode went GA across VS Code and JetBrains in March 2026, with agentic code review (which generates fix PRs automatically, not just suggestions) shipping in the same wave. March 2026 also brought a 50% reduction in coding agent initialization time, via pre-indexing and parallel context loading.
This is not a slow, feature-by-feature IDE evolution. Microsoft is shipping the full agentic stack — terminal tool, IDE integration, cloud agent, code review agent — in parallel, in a compressed timeline. The pace suggests urgency, not roadmap execution.
## What This Means for Cursor
Cursor built a $50 billion business on a single proposition: the best AI-augmented coding experience inside an IDE. Smart autocomplete, inline edits, Composer, chat with codebase context — all of it, beautifully executed, inside VS Code.
The limitation has always been the same: Cursor keeps you in the loop. It is AI-assisted, not AI-autonomous. It asks before it acts. When Cursor shipped self-hosted cloud agents in March 2026 — Brex, Notion, Money Forward running containerized Cursor agents in their own infrastructure — it was the clearest sign yet that even Cursor knows its IDE-only model has a ceiling.
Now Copilot CLI is coming at Cursor from the platform layer. GitHub owns the repository. GitHub owns Actions. GitHub owns code review. Copilot CLI's cloud offload connects directly to that infrastructure — when the agent runs in the cloud, it has native access to your PRs, your CI runs, your issue tracker. Cursor's cloud agents have to be configured to integrate with all of that. Copilot CLI comes with it built in.
This is the squeeze: Claude Code pressures Cursor from the autonomy side (truly terminal-native, genuinely agentic, first-class multi-agent support), and Copilot CLI pressures it from the platform-integration side (GitHub-native, repository-aware, CI-connected from day one). Cursor's strongest argument has always been "the best coding UX." That argument holds in the editor. It weakens when the editor is no longer the center of gravity.
## The Irony Is Not Subtle
GitHub Copilot launched in June 2021 as an IDE autocomplete tool. It was a VS Code extension. The entire value proposition was: AI in your editor, helping you write code faster, right where you work.
Five years later, GitHub has shipped a terminal tool that can run autonomously without your editor open. They called it "Copilot CLI" — which undersells what it is. It is not a CLI for Copilot. It is a terminal-native agentic coding tool with cloud execution, multi-agent delegation, and autonomous autopilot mode.
The name is the last vestige of the original IDE-centric identity. Everything else about the tool is a repudiation of it.
Anthropic has been building this way since Claude Code launched. The terminal is not a regression from the IDE — it is the right environment for agents that need to run for minutes or hours, call external tools, execute shell commands, manage files, and iterate without a human window managing their lifecycle. Microsoft watched the market shift and shipped accordingly.
## What Claude Code's Lead Actually Is
None of this means Copilot CLI has caught up. There is a difference between feature parity on a changelog and architectural parity in practice. Claude Code has been running production agentic workflows for longer, has a tighter integration with Anthropic's safety and context management research, and has the KAIROS daemon architecture (proactive background processing, nightly memory consolidation) that Copilot CLI does not yet match.
But the direction is now unambiguous. The question is not whether the terminal-native agentic model is the future of AI coding tools — Microsoft just answered that. The question is which implementation you trust to run autonomously on your codebase.
For that question, the lead still belongs to the tool that was built for autonomy from the start, not the one that reached GA five months after preview.
---
*Sources: [GitHub Changelog — Copilot CLI is now generally available](https://github.blog/changelog/2026-02-25-github-copilot-cli-is-now-generally-available/); [Visual Studio Magazine — GitHub Copilot CLI Reaches General Availability](https://visualstudiomagazine.com/articles/2026/03/02/github-copilot-cli-reaches-general-availability-bringing-agentic-coding-to-the-terminal.aspx); [GitHub Changelog — Copilot in Visual Studio March update](https://github.blog/changelog/2026-04-02-github-copilot-in-visual-studio-march-update/); [Awesome Agents — GitHub Copilot CLI Goes Generally Available](https://awesomeagents.ai/news/github-copilot-cli-generally-available/)*
---
# What Anthropic's Accidental 512K-Line Leak Reveals About Claude Code's Future
URL: https://sdd.sh/2026/04/what-anthropics-accidental-512k-line-leak-reveals-about-claude-codes-future/
Date: 2026-04-03
Updated: 2026-04-03
Tags: Claude Code, Anthropic, KAIROS, agentic coding, source code leak
Categories: AI Tools, Agentic Workflows
Summary: Anthropic accidentally published Claude Code's full TypeScript source to npm. Fifty thousand downloads later, we know about KAIROS — a proactive always-on daemon — plus ULTRAPLAN, undercover mode, anti-distillation traps, and a virtual pet. This isn't a scandal. It's an accidental roadmap.
Anthropic didn't intend to ship this.
On March 30–31, 2026, a Bun toolchain bug caused Claude Code v2.1.88 to publish its full TypeScript source maps to the public npm registry. Within hours, 50,000 downloads had spread the 512,000-line, ~2,000-file codebase across GitHub mirrors and developer Discords. Anthropic issued DMCA takedowns and an official statement: *"A Claude Code release included some internal source code. No sensitive customer data or credentials were involved or exposed. This was a release packaging issue caused by human error, not a security breach."*
The statement is accurate but incomplete. What the leak exposed isn't a security breach — it's an accidental product roadmap. And the roadmap is more ambitious than anyone outside Anthropic knew.
## KAIROS: Claude Code as a Colleague, Not a Tool
The most significant find is a feature codenamed **KAIROS** — a proactive, always-on background daemon that persists after your terminal session closes.
The architecture is deliberate and well-developed:
- **Background persistence**: A daemon that continues running between sessions
- **Append-only daily memory logs**: Every interaction contributes to a rolling log that survives session boundaries
- **Periodic `` prompts**: The system regularly asks the model to decide whether to act proactively or stay quiet
- **15-second blocking budget**: Each proactive action is time-boxed to prevent runaway behavior
- **Nightly "dreaming"**: Memory consolidation and pruning (this is the system already described publicly as AutoDream — it's a subset of KAIROS)
- **GitHub webhook subscriptions**: The daemon can respond to repository events asynchronously
- **Cron-triggered refresh every 5 minutes**: Continuous background polling
The implication is significant. Today, Claude Code is a tool you invoke — you open a session, ask it to do something, it does it. KAIROS is Claude Code as a colleague who is always around, monitoring your project, acting when something needs doing, and waiting when it doesn't.
The 15-second blocking budget and the "decide whether to act" design are also notable safety architecture choices. Anthropic isn't building an agent that acts constantly — it's building one that acts *judiciously*. The proactivity is bounded and auditable. That's the right design, even if it won't satisfy people who want Claude to ship features while they sleep.
## ULTRAPLAN: Outsource Your Hardest Planning Problems
**ULTRAPLAN** is the second major feature. When you have a problem that's too complex to solve in a single session, ULTRAPLAN offloads it to a cloud container running Claude Opus for up to 30 minutes of autonomous planning.
The UX flow from the leaked source:
1. You describe the goal — architecture redesign, complex refactor, system design from scratch
2. Claude Code invokes ULTRAPLAN, which spins up a cloud Opus instance
3. You walk away
4. When planning completes, you receive a notification and approve the result from phone or browser
5. A special sentinel value (`__ULTRAPLAN_TELEPORT_LOCAL__`) imports the finished plan back into your local terminal context
This is meaningful because it decouples planning from your local context window and local model. The hardest problems get Opus-level reasoning time with no session boundary interruptions. You don't have to babysit it.
It also suggests Anthropic's architecture for expensive, long-horizon planning is cloud-native — not an upgraded local model but a cloud container with dedicated compute and time budget. The local model handles interactive work; the cloud handles the deep thinking.
## Undercover Mode: Anthropic Employees Don't Leave AI Fingerprints
**Undercover Mode** (`undercover.ts`) is the most controversial reveal, and it deserves honest assessment.
When an Anthropic employee (`USER_TYPE === 'ant'`) uses Claude Code in a public or non-internal repository, the following instructions activate:
- Do not reveal internal codenames (Capybara, Tengu, etc.)
- Do not identify as "Claude Code"
- Strip `Co-Authored-By` metadata from commit output
There is no force-off option. AI-authored commits from Anthropic employees in open-source projects show no AI attribution.
The charitable reading: Anthropic employees sometimes need to work in public repos without leaking internal project names or attracting attention to what they're testing. Competitive and operational reasons exist.
The less charitable reading: Anthropic is, by design, removing AI attribution from AI-generated code when its own employees are the authors — while building a culture of AI disclosure expectations for everyone else. The transparency asymmetry is real.
This isn't a scandal, but it is worth naming clearly. "AI attribution matters" as a norm is undermined when the company most invested in that norm quietly carves out an exception for itself.
## Anti-Distillation Traps: Poisoning Competitor Training Data
The leak revealed two mechanisms designed to prevent competitors from training on Claude Code's API traffic:
**Fake Tools Injection**: Decoy tool definitions are sent in certain API calls, polluting any training data recordings with false information about Claude Code's capabilities and interface.
**Connector-Text Summarization**: Instead of returning raw assistant reasoning between tool calls, Claude Code buffers the reasoning and returns a cryptographically signed summary. A MITM proxy sees summaries, not raw chain-of-thought.
Both are bypassable with enough effort (env variable, MITM proxy), but they establish friction for casual competitive data collection.
This reveals something about Anthropic's perception of competitive dynamics: they believe their API traffic is being monitored and potentially used for model training by competitors, and they've built active countermeasures. That's a sign of how seriously the race for frontier-model training data is being taken at the infrastructure level.
The native client attestation system (a cryptographic hash computed by Bun's Zig-native HTTP stack, replacing a `CCH=00000` placeholder before transmission) is related — effectively DRM for the API, designed to prove requests came from the legitimate binary rather than a scraper or wrapper.
## BRIDGE MODE: The Multi-Agent Orchestration Layer
**BRIDGE MODE** (also called Coordinator Mode in some files) formalizes what Claude Code Agent Teams already does publicly, but with a more defined architecture:
- One Claude instance acts as coordinator
- Parallel worker instances receive tasks via a mailbox system
- Division of labor: one worker writes code, one reviews, one writes tests
- The coordinator manages dependencies and integration
The publicly shipped Agent Teams feature already supports multi-agent workflows, but BRIDGE MODE in the source suggests a more opinionated, structured orchestration model is in development — one where the roles are predefined rather than ad-hoc.
## BUDDY: The Tamagotchi
Because Anthropic.
**BUDDY** is a virtual pet companion system. You get a pet assigned deterministically by your user ID hash (18 species: duck, dragon, axolotl, capybara, mushroom, ghost, and more). Rarity tiers run from Common to Legendary (1% drop rate). Shiny variants exist. Stats include Debugging, Patience, Chaos, Wisdom, and Snark.
Internal notes reference an April 1–7 teaser with a May 2026 launch date.
One reads this and thinks: either someone at Anthropic is very good at having fun, or someone believes gamification of the developer experience is a meaningful retention lever. Probably both.
## What the Leak Actually Tells You
Strip away the drama of the accidental publication, and what remains is a coherent product thesis:
Claude Code is being built from the assumption that software development is primarily a continuous, background activity — not a sequence of synchronous prompts. KAIROS is always running. ULTRAPLAN handles the deep work asynchronously. BRIDGE MODE structures multi-agent collaboration. Memory consolidation (AutoDream/KAIROS dreaming) keeps the context accurate over weeks and months.
This is a fundamentally different product model than Cursor, Copilot, or Windsurf. Those tools insert AI into a human developer's workflow. KAIROS inverts it: the AI workflow runs continuously, and the human developer joins when input is needed.
The safety design embedded in that architecture is also worth noting. The 15-second blocking budget, the proactive-vs-quiet decision loop, the cron-based tick system — these aren't features added because of AI safety concerns, they're the operating model. Anthropic is building the autonomy incrementally and with deliberate constraints. That's a better approach than shipping unconstrained agents and patching the problems afterward.
The anti-distillation traps and undercover mode are the less comfortable findings. They reveal a company operating in a competitive environment where the norms it publicly advocates for — transparency, attribution, open ecosystem — are applied selectively when organizational interests intervene. That's worth watching.
For now: the roadmap is out. Claude Code is becoming the always-on agent. The question is whether the safety design stays coherent as the autonomy expands.
---
*Sources: [VentureBeat: Claude Code's Source Code Appears to Have Leaked](https://venturebeat.com/technology/claude-codes-source-code-appears-to-have-leaked-heres-what-we-know), [Ben's Bites: Inside the Leaked Claude Code Files](https://www.bensbites.com/p/inside-the-leaked-claude-code-files), [Alex Kim's Blog: Fake Tools, Frustration Regexes, Undercover Mode](https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/), [Geeky Gadgets: Claude Code Undercover Mode, KAIROS, ULTRAPLAN](https://www.geeky-gadgets.com/claude-code-undercover-mode/), [WaveSpeed AI: BUDDY, KAIROS & Every Hidden Feature](https://wavespeed.ai/blog/posts/claude-code-leaked-source-hidden-features/), [Cybernews: Controversial Features in Leaked Claude Code](https://cybernews.com/security/anthropic-claude-source-code-discovered-features/), [Engadget: Claude Code Leak Suggests Proactive Mode](https://www.engadget.com/ai/claude-code-leak-suggests-anthropic-is-working-on-a-proactive-mode-for-its-coding-tool-150107049.html)*
---
# GitHub Copilot's April 24 Data Grab: What You're Agreeing To and How to Opt Out
URL: https://sdd.sh/2026/04/github-copilots-april-24-data-grab-what-youre-agreeing-to-and-how-to-opt-out/
Date: 2026-04-03
Updated: 2026-04-03
Tags: GitHub Copilot, data privacy, AI tools, Microsoft, developer rights
Categories: AI Tools, Industry
Summary: Starting April 24, GitHub will train its AI models on Copilot Free, Pro, and Pro+ users' code by default — private repos included. The opt-out exists, but it's buried, not available on mobile, and unverifiable. Here's what's actually in the policy change and what it means.
GitHub gave developers 30 days notice. On March 25, it announced that starting **April 24, 2026**, interaction data from Copilot Free, Pro, and Pro+ users will be used to train GitHub and Microsoft AI models — by default. Opt-out is required to prevent it.
The announcement was quiet. The developer reaction was not.
## What the Policy Actually Covers
The updated policy collects the following when you use Copilot in an opted-in account:
- **Code completions accepted or modified** — the code you accepted from Copilot suggestions
- **Inputs and code snippets** — what you sent to Copilot
- **Surrounding code context** — the files open around your cursor when you invoked Copilot
- **Comments and documentation** — inline comments, docstrings, any text context
- **File names and repository structure** — metadata about how your project is organized, including in private repositories
- **Navigation patterns** — how you move through the codebase
- **Chat interactions** — full prompts and responses from Copilot Chat
- **Feedback signals** — thumbs up/down ratings you gave suggestions
To be precise: this applies to **individual plan users** (Free, Pro, Pro+). Copilot Business and Enterprise accounts are not affected. Students and teachers are exempt.
The data is used to improve model performance. GitHub CPO Mario Rodriguez cited Microsoft's existing precedent of training on interaction data to justify it.
## Why the Opt-Out Default Is the Story
The technical scope of what's collected is less interesting than the framing of how it's collected: opt-out by default.
GitHub's own community discussion thread shows 59 thumbs-down and 3 rocket emojis — with essentially no endorsement beyond GitHub's own VP of developer relations. Hacker News and Reddit generated sustained negative discussion. The criticisms cluster around a few real problems:
**The employer problem.** Individual users cannot license their employer's proprietary code for AI training. But the opt-out is enforced at the user level, not the organization level. A developer using a personal Pro account to work on company code — a common pattern for engineers doing exploratory work on their own machines — cannot prevent that code from entering GitHub's training pipeline on behalf of their employer. The policy doesn't address this.
**The mobile problem.** At launch, the opt-out toggle is not available in GitHub's mobile app. Users who primarily access settings via mobile have no mechanism to opt out at launch. This is not a small edge case — the mobile app has a large user base.
**The model collapse problem.** A thread on Reddit with 1,000+ upvotes raised the feedback loop concern: if Copilot generates code, that code is accepted by users, and then fed back into training data, you get a model trained increasingly on its own outputs. Model collapse through recursive self-training is a well-documented failure mode in machine learning. GitHub hasn't published methodology around how it intends to prevent this.
**The verification problem.** There is no enforceable guarantee that toggling the opt-out does anything verifiable. You are trusting GitHub's implementation of its own privacy toggle. There is no audit mechanism, no third-party verification, and no way to confirm your data is not being used after you opt out.
## How to Opt Out
The opt-out is available in Settings → Privacy on the GitHub web interface. Set it before April 24 if you want to prevent any data collection from the start.
The process:
1. Sign in to github.com
2. Navigate to Settings → Privacy
3. Find the "Allow GitHub to use my data to improve AI models" toggle
4. Disable it
Note: as of this writing, the mobile app toggle is not yet available. Set it from the web.
## What This Means for Copilot Relative to Alternatives
This is worth placing in context of the broader market.
Copilot Business and Enterprise — the paid tiers designed for professional developer use — are explicitly excluded. GitHub is drawing a line: free and personal-plan users are the product; enterprise customers are the customers. That's a coherent business model, but it makes the opt-out default on personal plans feel more like a boundary-push than a genuine oversight.
The comparison to Anthropic is instructive. Claude Code does not train on user code. Anthropic's API terms explicitly prohibit training on user inputs unless users specifically opt into feedback programs. The transparency asymmetry between GitHub and Anthropic on this question is significant.
Cursor's privacy model is more opaque — training data usage depends on tier and policy version — but Cursor Business offers explicit no-training guarantees for enterprise customers. Windsurf (now under Cognition) has similar tiered privacy protections.
The pattern across the industry is the same: free and personal tiers fund model improvement through data; enterprise tiers buy data protection. GitHub's April 24 change is consistent with that pattern, but the implementation — opt-out by default, no mobile toggle, no verification mechanism — is worse than industry norms.
## The Deeper Question
There is a more fundamental tension in GitHub Copilot's business model that this policy change makes explicit.
GitHub Copilot is a tool that generates code. That code gets committed into repositories. GitHub hosts those repositories. Now GitHub wants to use that committed code — and the prompts and completions that produced it — to train the models that generate more code. The flywheel is real and makes business sense.
But it also means that every developer using a free or personal Copilot plan is, by default, a contributor to GitHub's AI training pipeline. Not a customer. A contributor.
The distinction matters. Customers have negotiated relationships with service providers. Contributors have terms of service they may or may not have read, opted out of by default, and no independent way to verify their preferences are honored.
GitHub is a dominant platform. Developers don't easily leave. That makes the opt-out default more aggressive than it might be from a smaller vendor — the power asymmetry is high.
The April 24 deadline is in three weeks. If you use Copilot Free, Pro, or Pro+ on personal accounts that touch employer or client code, the time to check your settings is now.
---
*Sources: [GitHub Blog: Updates to Copilot Interaction Data Usage Policy](https://github.blog/news-insights/company-news/updates-to-github-copilot-interaction-data-usage-policy/), [The Register: GitHub Going to Train on Your Data After All](https://www.theregister.com/2026/03/26/github_ai_training_policy_changes/), [InfoQ: GitHub Will Use Copilot Interaction Data for AI Training](https://www.infoq.com/news/2026/04/github-copilot-training-data/), [Help Net Security: GitHub Copilot Data Privacy Policy Update](https://www.helpnetsecurity.com/2026/03/26/github-copilot-data-privacy-policy-update/), [Hacker News Discussion](https://news.ycombinator.com/item?id=47548243)*
---
# MCP Dev Summit NYC 2026: Authentication Is the Crisis, OpenAI Is Now a Stakeholder
URL: https://sdd.sh/2026/04/mcp-dev-summit-nyc-2026-authentication-is-the-crisis-openai-is-now-a-stakeholder/
Date: 2026-04-02
Updated: 2026-04-02
Tags: MCP, Model Context Protocol, Anthropic, OpenAI, security, OAuth, SDK
Categories: AI Tools, Industry
Summary: The first major Linux Foundation MCP summit signals protocol maturity — but surfaces an uncomfortable truth: 43% of MCP servers have OAuth vulnerabilities, auth is still the dominant unsolved problem, and breaking changes are coming in SDK V2.
The Model Context Protocol turns 18 months old this week, and the Agentic AI Foundation (AAIF) marked the occasion with the [MCP Dev Summit North America](https://events.linuxfoundation.org/mcp-dev-summit-north-america/) — 95+ sessions across two days in New York City, April 2–3, with AWS, Docker, Workato, and WorkOS as diamond sponsors.
The event is a milestone signal: MCP is no longer a clever Anthropic experiment. It is Linux Foundation infrastructure with dedicated working groups, a conformance testing suite, and corporate backers who have shipped it to production. That graduation is worth celebrating.
But the summit also made one uncomfortable fact impossible to ignore: **the dominant unsolved problem in the MCP ecosystem is authentication**, and the security posture of the average MCP deployment is not good.
---
## OpenAI Is No Longer Just an Adopter
The headlining arc of the summit is that OpenAI is now a co-steward of the protocol it once resisted. Nick Cooper, an OpenAI engineer and AAIF Governing Board member, is keynoting Day 2 with a session titled "MCP x MCP" — a deliberate framing that positions OpenAI as a peer contributor, not a follower.
The substance behind the branding: OpenAI's `openai-agents` SDK added `list_resources()` and `read_resource()` for MCP Resources in the days before the summit. A parallel implementation was simultaneously pending in the Anthropic Python SDK. The goal is cross-ecosystem MCP Resource interoperability — an Anthropic-built agent querying context from an `openai-agents`-designed server, and vice versa.
This matters architecturally. MCP Resources are how servers expose structured context (files, database records, API responses) to agents without forcing a tool call. When both major agentic SDKs implement the same Resources spec from the same governance body, the protocol stops being "the thing Anthropic made" and starts being the thing that actually wins.
MCP at 18 months already has 97 million cumulative downloads (covered previously). OpenAI's full buy-in — complete with a board seat and keynote slot — is the political seal on an already technical fait accompli.
---
## SDK V2 Is Coming. It Will Break Things.
Max Isbey of Anthropic presented "Path to V2 for MCP SDKs" — described as the **first public statement of intent** for an MCP Python SDK v2.
The Python SDK (`mcp` on PyPI) has been at v1.26.0 since January. The TypeScript SDK shipped steadily. Python users have been in a holding pattern waiting for guidance on where the spec was going. Now they have it, and the answer includes breaking changes.
The most significant: **`mcp.server.auth` is getting a compatibility-breaking rewrite in V2**. Anyone using the auth module in production should treat this as a migration gate. The good news is that six dedicated summit sessions on authentication suggest Anthropic and the community have diagnosed exactly what is wrong and are fixing it at the spec level rather than papering over it.
Paul Carleton (Anthropic) is separately presenting "One Spec, Ten SDKs, Zero Excuses: Conformance Testing MCP" — which signals that cross-SDK consistency (TypeScript, Python, Go, Java, Kotlin, C#, Swift, and others) is being enforced via automated conformance suites rather than hoping implementations converge on their own. That is the right call for a protocol at this scale.
---
## The Authentication Crisis
Six of the 95 sessions are dedicated to MCP authentication. That concentration is not an accident — it reflects a genuine ecosystem-wide problem.
The statistics from Docker's analysis of the MCP ecosystem are damning: **43% of MCP servers have OAuth authentication flaws**. Emily Lauber of Microsoft is presenting the specific attack class at the summit: OAuth mix-up attacks, where multi-issuer confusion allows attackers to leak authorization codes to attacker-controlled redirect URIs.
The mechanics are worth understanding:
1. MCP servers act simultaneously as an authorization server for MCP clients AND as a single OAuth client to upstream providers — creating one shared `client_id` across all users
2. Because OAuth state isn't bound to user sessions, a malicious link can redirect an authorization code to an attacker's endpoint
3. Compromised servers can take tokens issued for one service and present them to another (audience confusion)
4. The `WWW-Authenticate` discovery mechanism can be abused to trigger SSRF against cloud metadata endpoints
This is not theoretical. CVE-2025-6514 (CVSS 9.6) — in the widely-installed `mcp-remote` npm package — demonstrated RCE via OS commands embedded in OAuth discovery fields, affecting hundreds of thousands of installs.
The mitigations are in the spec: Resource Indicators (RFC 8707) have been mandated since June 2025, and the Sponsored Enhancement Proposals SEP-1932 (DPoP) and SEP-1933 (Workload Identity Federation) are under active development. Aaron Parecki, author of the OAuth 2.1 draft spec, is attending — suggesting the fixes are being designed by people who understand the attack surface rather than just adding flags.
---
## Production at Scale: What Enterprise Deployments Look Like
Beyond the spec and security tracks, the summit's keynote roster is a useful window into what serious MCP production looks like.
**Uber** (Meghana Somasundara and Rush Tehrani) is presenting on operating MCPs at enterprise scale — not a proof of concept, but operational lessons from an organization that cannot afford 3 a.m. page-outs from an agent that misconfigured a payment system.
**Duolingo** (Aaron Wang) built an internal AI Slackbot with 180+ MCP tools — an interesting data point on the composition problem. At 180 tools, you are no longer thinking about individual tool calls; you are thinking about tool namespacing, permission scoping, and how the model navigates a context that could theoretically call anything.
**Datadog's** Diamond Bishop presented "The First 100 Agents: Scaling With MCP From Prototype to Platform" — a title that implicitly admits there is a cliff between building one agent and running a fleet of them. The session covers tooling gaps that don't appear until you're at 50+ agents in production.
---
## What This Means for Developers Building on MCP
The summit is a snapshot of where the ecosystem stands in April 2026. Three takeaways for developers shipping MCP integrations today:
**1. Audit your auth module usage now.** If your MCP server uses `mcp.server.auth`, treat V2 as a near-term migration. The old implementation has known structural flaws and the V2 rewrite is coming. Start reading the break notes.
**2. Implement Resource Indicators.** RFC 8707 has been required since June 2025 but compliance is uneven. If your server handles tokens meant for multiple upstream services, binding tokens to explicit resource URIs is the concrete fix for audience confusion attacks.
**3. MCP Resources are about to matter more.** With both Anthropic and OpenAI shipping `list_resources()` / `read_resource()` in their agent SDKs, MCP Resources go from a niche feature to a compatibility surface you have to support if you want agents from different ecosystems to consume your server.
---
## The Bigger Picture
The Linux Foundation governance, the sponsored security work, the conformance test suite, the OpenAI board seat — these are the institutional markers of a protocol that won. The question was never whether MCP would be adopted; the 97 million download figure settled that. The question now is whether the security and interoperability foundations can be hardened fast enough to keep pace with the deployment rate.
Six authentication sessions, a V2 auth rewrite, and a CVE with a 9.6 CVSS score at 18 months suggest the ecosystem is running to catch up with itself. That's not unusual for infrastructure at this velocity. It does mean that "we're using MCP" and "we're using MCP securely" are still two different statements.
The summit's session recordings are available on-demand on AAIF's YouTube channel. If you're building agents in 2026, the auth track is required watching.
---
*Sources: [AAIF MCP Dev Summit Schedule](https://events.linuxfoundation.org/2026/02/24/agentic-ai-foundation-unveils-mcp-dev-summit-north-america-2026-schedule/), [Linux Foundation Event Page](https://events.linuxfoundation.org/mcp-dev-summit-north-america/), [DEV.to Python Developer Analysis](https://dev.to/peytongreen_dev/mcp-dev-summit-2026-what-python-developers-should-actually-pay-attention-to-5ald), [Obsidian Security: MCP OAuth Account Takeover](https://www.obsidiansecurity.com/blog/when-mcp-meets-oauth-common-pitfalls-leading-to-one-click-account-takeover), [MCP Protocol Roadmap](https://modelcontextprotocol.io/development/roadmap)*
---
# Cursor Is Worth $50 Billion. Its Biggest Problem Is That It Still Needs You.
URL: https://sdd.sh/2026/04/cursor-is-worth-50-billion.-its-biggest-problem-is-that-it-still-needs-you./
Date: 2026-04-02
Updated: 2026-04-02
Tags: Cursor, enterprise, cloud agents, self-hosted, Claude Code, autonomous coding, security
Categories: AI Tools, Industry
Summary: Cursor's $50B valuation is real, its self-hosted cloud agents are a genuine enterprise product, and 67% of Fortune 500 companies are customers. But the autonomy ceiling — the fundamental limit that keeps Cursor in the IDE and humans in the loop — hasn't moved.
The numbers are hard to argue with. Cursor crossed $2 billion in annualized revenue in February 2026, doubling in three months. [Bloomberg reported on March 12](https://www.bloomberg.com/news/articles/2026-03-12/ai-coding-startup-cursor-in-talks-for-about-50-billion-valuation) that the company is in talks for a new round at a roughly $50 billion valuation — nearly double the $29.3B it commanded in November. Sixty percent of that revenue comes from enterprise customers. Sixty-seven percent of the Fortune 500 use Cursor.
The company shipped self-hosted cloud agents to general availability on March 25, landing Brex, Notion, and Money Forward as early customers. This is a real enterprise product for real enterprise security requirements.
And yet: Cursor's fundamental competitive problem has not been solved. It has been deferred.
---
## What Self-Hosted Cloud Agents Actually Are
The new architecture is worth understanding correctly, because the marketing tends to conflate "in your network" with "fully private."
[Cursor's self-hosted agents](https://cursor.com/blog/self-hosted-cloud-agents) run worker processes inside your own Kubernetes cluster or VM fleet. Workers connect **outbound via HTTPS** to Cursor's cloud — no inbound firewall rules, no VPN tunnel required. When you start an agent session, Cursor's cloud handles inference and orchestration, then sends tool calls down to your worker for local execution. Results flow back to Cursor's servers for the next inference round.
Each session gets a dedicated isolated VM with terminal, browser, and desktop access. Workers deploy via Helm chart, scale horizontally, and support Cursor's full feature set: Composer 2, frontier models, MCP plugins, skills, subagents, and hooks. The limits are 10 workers per user and 50 per team.
The early customer quotes capture the genuine value proposition. From Brex: "Self-hosted solution will allow us to delegate end-to-end software builds entirely to Cursor's cloud agents." From Notion: it "allows agents to access more tools more securely and saves our team from needing to maintain multiple stacks." Money Forward is building workflows to let nearly 1,000 engineers create pull requests directly from Slack.
This is not theater. For organizations with data residency requirements, regulated codebases, or simply a policy against sending source code to third-party SaaS, on-network execution is a hard gate. Cursor crossed it.
---
## The Part the Blog Post Doesn't Mention
The self-hosted architecture solves the data-at-rest and data-in-transit problem. It does not solve the inference privacy problem.
**Planning, orchestration, and model inference remain on Cursor's cloud.** Your code doesn't leave your network as a file, but descriptions of what needs to happen, task decomposition, and the model's internal reasoning all live outside your perimeter. Whether that matters depends on your threat model — but calling it "self-hosted" without this asterisk is doing real work.
The security surface has separate issues too. Two CVEs disclosed in 2025 are still instructive:
- **CurXecute (CVE-2025-54135):** Attackers craft malicious Slack messages that cause Cursor's AI to rewrite MCP configuration files and execute arbitrary commands with developer privileges — no user interaction beyond reading a Slack message
- **MCPoison (CVE-2025-54136):** Shared repository configurations can enable persistent, team-wide compromise via context poisoning
Both are prompt injection variants enabled by agent auto-run in complex, multi-source environments. Self-hosted deployment doesn't change the attack surface for either. The agent still reads your codebase, still follows instructions from files in that codebase, and still has terminal access to the worker VM.
There is also a structural fragility in Cursor's supply chain that is not widely discussed. Cursor's primary model is Claude Sonnet, licensed from Anthropic at retail API rates — while Anthropic sells it wholesale to others and is simultaneously building Claude Code as a direct competitor. A Fortune analysis from March 21 quoted a VC: "burning $1 to make 90 cents isn't a business." At $2B ARR and growing fast, the unit economics eventually have to converge.
---
## The Valuation Trajectory vs. The Underlying Bet
Let's be clear about what $50 billion is pricing in. This is the trajectory:
- **August 2024:** $400M valuation
- **December 2024:** $2.5B
- **June 2025:** $9.9B
- **November 2025:** $29.3B
- **March 2026:** ~$50B (in talks)
That's roughly 125x in 19 months. Cursor's $2B ARR at this stage is genuinely impressive. But the implied multiple on forward revenue is pricing a scenario where Cursor becomes the default enterprise development platform — not just a popular IDE extension.
That bet depends on Cursor's IDE-centric model remaining competitive with increasingly autonomous agents. CEO Michael Truell told Fortune the company is "expecting to disrupt ourselves over, and over, and over again." That's the right answer to give. The harder question is whether self-disruption is possible when the product architecture is anchored to the IDE.
---
## The Autonomy Ceiling
Fortune's framing from March is the sharpest articulation of the competitive wedge: **Tony Stark wearing the Iron Man suit vs. JARVIS wearing it.**
Cursor, at its core, is an AI-assisted development tool. The human sits in the IDE, reviews suggestions, accepts or rejects changes, and approves actions. The agent has capabilities — Cursor Automations, Background Agents, self-hosted workers — but the paradigm is still human-in-the-loop acceleration. You write faster. You review more code. You still review code.
Claude Code's model is different. You write a spec or a task. The agent plans, implements, runs tests, iterates on failures, and opens a PR. The human reviews a PR, not individual keystrokes. The loop is task-shaped, not edit-shaped.
The 39% more PRs merged (University of Chicago, teams using Cursor's agentic features) versus the 72.5% SWE-bench score (Claude Code operating autonomously) aren't measuring the same thing. The first measures human output amplification. The second measures autonomous task completion.
The METR study finding that AI coding tools slowed experienced developers by 19% is a useful corrective against naive productivity claims — but it also suggests that the benefit of AI-assisted coding is unevenly distributed and context-dependent. The developers who benefit most are those who can offload the largest chunks of work, not those who get marginal autocomplete improvements.
---
## Why Enterprise Customers Are Buying Anyway
None of this means Cursor's enterprise growth is irrational. Quite the opposite.
Most enterprise software teams are not ready for fully autonomous agents. They have compliance requirements around code review. They have liability concerns about AI-generated code entering production without human inspection. They have senior engineers who want to stay in the loop. They have codebases where the context required to make good decisions isn't capturable in a spec.
Cursor meets those teams where they are. The IDE is familiar. The learning curve is shallow. The productivity gains are tangible and attributable. Self-hosted agents now let the security-conscious enterprise tick the data residency box.
But "meets enterprise where they are today" is not the same as "wins enterprise in 2027." The direction of travel in autonomous software development — SWE-bench scores, Terminal-Bench, Claude Code's Agent Teams, Computer Use — is toward larger and larger task offload. Each six months, the autonomous ceiling rises. Each six months, the question "why is a human reviewing this step?" has fewer good answers.
---
## The Honest Assessment
Cursor at $50B is a bet that:
1. Enterprise software development stays anchored to the IDE for long enough to justify the valuation multiple
2. Cursor can build autonomous capabilities faster than Anthropic can build IDE ergonomics
3. The supply chain dependency on Anthropic doesn't become a pricing or availability problem
None of those bets are crazy. Some of them are probably right. Enterprise software transitions are slow. The IDE is an incredibly sticky interface. Cursor's product iteration velocity is genuinely impressive.
But the product that 60% of its revenue depends on is, structurally, a faster human. The transition to agents that replace human-in-the-loop review is happening on a timescale that $50B valuations have to price.
---
*Sources: [Bloomberg: Cursor $50B Valuation Talks](https://www.bloomberg.com/news/articles/2026-03-12/ai-coding-startup-cursor-in-talks-for-about-50-billion-valuation), [Cursor Blog: Self-Hosted Cloud Agents](https://cursor.com/blog/self-hosted-cloud-agents), [Cursor Changelog March 25](https://cursor.com/changelog/03-25-26), [Fortune: Cursor's Crossroads](https://fortune.com/2026/03/21/cursor-ceo-michael-truell-ai-coding-claude-anthropic-venture-capital/), [DevOps.com: Cursor Cloud Agents](https://devops.com/cursor-cloud-agents-get-their-own-computers-and-35-of-internal-prs-to-prove-it/), [The New Stack: Why Cursor Is Bringing Self-Hosted Agents to Fortune 500](https://thenewstack.io/cursor-self-hosted-coding-agents/)*
---
# The SWE-bench Plateau: Three Frontier Models Walk In, All Score 80% — Now What?
URL: https://sdd.sh/2026/04/the-swe-bench-plateau-three-frontier-models-walk-in-all-score-80-now-what/
Date: 2026-04-01
Updated: 2026-04-25
Tags: Benchmarks, Gemini, Claude, SWE-bench, AI Models, Coding
Categories: Industry, AI Tools
Summary: Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.3-Codex are all within 0.8% of each other on SWE-bench Verified. When every frontier model aces the exam, the exam stops being useful. Here's what actually differentiates them.
> **Updated April 25, 2026:** Claude Opus 4.7 now leads SWE-bench Verified at 87.6% and SWE-bench Pro at 64.3%. GPT-5.5 pushes Terminal-Bench 2.0 to 82.7%. MiniMax M2.7 hits 57.0% on Terminal-Bench 2 — open source, publicly available weights. The Stanford AI Index 2026 confirms SWE-bench Verified is approaching 100% of the human baseline. The plateau is gone. The benchmark is cooked. And that changes the analysis below in ways worth reading. The new section is at the bottom.
Three frontier models walk into SWE-bench Verified. Claude Opus 4.6 scores 80.8%. Gemini 3.1 Pro, released February 19, scores 80.6%. GPT-5.3-Codex scores 80.0%.
The variance is smaller than the margin of error on most software benchmarks. Statistically, these three models are identical on the metric the industry has been using as its primary yardstick for coding capability.
This is not a success story for the benchmark. It's a failure mode.
## What SWE-bench Actually Measures
SWE-bench Verified presents models with real GitHub issues from Python open-source repositories. The model must write a patch that passes the associated test suite. It's a respectable benchmark — grounded in real-world bugs, not synthetic puzzles — and it drove meaningful progress for two years.
But it measures one narrow thing: isolated bug-fixing in a well-structured Python codebase with existing tests. A model that scores 80% on SWE-bench can:
- Read a traceback and identify the affected function
- Generate a patch that passes the existing test suite
- Handle standard Python idioms across popular libraries
What it cannot reveal:
- Whether the model can maintain coherent intent across 50+ files
- Whether it can design and implement a feature from a spec, not just fix a known bug
- Whether it can orchestrate tools, manage state, and recover from errors in a multi-step agent loop
- Whether it can interact with a running UI, a database, or a cloud API
- Whether it can work autonomously for 2 hours without going off the rails
The tasks that matter most for real-world agentic coding are nowhere in SWE-bench's test suite.
## The Price Compression Problem
The benchmark convergence arrives alongside a pricing story that's harder to dismiss.
Gemini 3.1 Pro costs $2 per million input tokens and $12 per million output tokens. Claude Opus 4.6 costs $15 input and $75 output. For teams building agentic pipelines that process millions of tokens per day, Gemini is 6–8x cheaper for nominally equivalent SWE-bench performance.
Then there's MiniMax M2.5, which benchmarks at 80.2% SWE-bench Verified at $0.30 input / $1.20 output. The price-per-point at the frontier is collapsing.
This matters because it forces a sharper question: if you can't distinguish these models on the benchmark everyone uses, and one costs 25x less, why are you paying for the expensive one?
The answer cannot be "better benchmark numbers." It has to be something the benchmark doesn't measure.
## Terminal-Bench 2.0: A Different Story
The more interesting results come from Terminal-Bench 2.0, which evaluates models on realistic terminal-based agentic tasks — not isolated patches, but multi-step autonomous workflows.
The rankings shift noticeably:
| Model | Terminal-Bench 2.0 |
|---|---|
| GPT-5.4 | 75.1% |
| Gemini 3.1 Pro | 68.5% |
| Claude Opus 4.6 | 65.4% |
Claude drops behind on this benchmark — which is worth being honest about. But Terminal-Bench 2.0 still doesn't capture tool orchestration quality, computer use fidelity, long-context coherence at 1M tokens, or the behavioral properties that matter when an agent is running unsupervised for hours.
What these numbers show is that the relative rankings *change* depending on what you measure. A model that performs identically on SWE-bench may behave very differently under agentic conditions. The flat-line at 80% is an artifact of the benchmark reaching its ceiling, not a statement about capability parity.
## What Actually Differentiates Frontier Models in 2026
For teams making actual architecture decisions, here is where the real differences live:
**Tool orchestration.** Claude consistently performs better on complex multi-tool chained tasks — reasoning across API calls, file edits, terminal output, and web fetches in a single coherent session. This is a function of how Anthropic has trained Claude's tool use behavior and how Claude Code's MCP-native architecture structures context.
**Computer use fidelity.** Claude Opus 4.6 and Sonnet 4.6 are the only frontier models with production-grade computer use deployed in a major coding tool (Claude Code, launched March 23). Gemini's computer use is available in Gemini Code Assist but tightly constrained to IDE surfaces. This is a meaningful gap for any task requiring visual context — rendered UIs, desktop apps, or systems with no programmatic API.
**Long-context coherence.** Claude Code now defaults to 1M context for Max, Team, and Enterprise accounts. What matters is not just the window size but whether the model maintains coherent intent across it. Independent testing consistently shows Claude with stronger recall and reasoning across very long contexts.
**Behavioral safety under autonomy.** This is the hardest to benchmark and the most important for unsupervised agentic use. How does the model behave when it hits an unexpected state at step 47 of a 60-step task? Does it attempt a risky recovery or pause and report? Claude's Constitutional AI training and Anthropic's alignment investment show up here — not in benchmark scores, but in real production deployments where teams report fewer catastrophic failures in long-running agent sessions.
**Ecosystem integration.** Claude Code's MCP-native architecture means every tool in the 5,800+ MCP server ecosystem is available natively. Gemini Code Assist runs in VS Code and JetBrains. The surface area of what Claude can touch is structurally larger.
## The Benchmark Gap Anthropic Should Worry About
One number worth sitting with: on Terminal-Bench 2.0, GPT-5.4 leads at 75.1% and Gemini 3.1 Pro is at 68.5%, while Claude Opus 4.6 is at 65.4%. OpenAI's lead on that benchmark is real, and it suggests that for raw agentic terminal task performance, Claude is not automatically first.
The counterargument is that Terminal-Bench 2.0 still measures isolated agentic tasks, not integrated systems. Claude Code's architecture, tooling, and the full MCP ecosystem mean the system — agent plus tools plus infrastructure — outperforms what the raw model score suggests. But that's a fragile argument if OpenAI or Google close the integration gap.
Anthropic's moat is not the model number. It's the investment in safe, predictable autonomous behavior at scale — the kind that enterprise customers need before they'll give an agent access to production systems. That's harder to benchmark and harder to copy.
## What Should Actually Replace SWE-bench
The field needs a benchmark that reflects how models are actually being used in 2026:
- Multi-session, long-horizon tasks (days, not minutes)
- Tool-use complexity (orchestrating 10+ MCP servers simultaneously)
- Autonomous recovery from unexpected states
- Security and safety behavior under adversarial inputs
- Real business outcomes (feature shipped, incident resolved, PR merged)
Until that benchmark exists, the 80% cluster is a ceiling artifact, not a verdict. The models that matter are the ones that do well on the hard stuff that nobody has figured out how to measure yet.
Claude is betting that hard stuff is agentic autonomy at scale, with safety guarantees, in complex multi-tool environments. The March 23 Computer Use launch, the Agent Teams architecture, and the 1M context window are all part of that bet.
The SWE-bench score is a floor, not a ceiling. Everything interesting is above it.
---
## April 2026 Update: The Plateau Is Gone — and So Is the Benchmark
Three weeks after this piece was published, the plateau broke.
Claude Opus 4.7 launched April 14 with 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro. GPT-5.5, released April 23, scored 82.7% on Terminal-Bench 2.0. MiniMax M2.7, open-sourced April 12, hit 56.22% on SWE-Pro and 57.0% on Terminal-Bench 2.
The updated Terminal-Bench 2.0 picture:
| Model | Terminal-Bench 2.0 | SWE-bench Pro |
|---|---|---|
| GPT-5.5 | **82.7%** | 58.6% |
| GPT-5.4 | 75.1% | — |
| Gemini 3.1 Pro | 68.5% | — |
| Claude Opus 4.6 | 65.4% | — |
| Claude Opus 4.7 | — | **64.3%** |
| MiniMax M2.7 (open) | 57.0% | 56.22% |
Two things changed that shift the analysis:
**SWE-bench Verified is now a dead benchmark.** The Stanford AI Index 2026 (published April 2026) documents SWE-bench Verified approaching 100% of the human expert baseline — up from roughly 60% just one year earlier. A benchmark that closes from 60% to near-100% in twelve months is not measuring general capability. It's measuring training data coverage and scaffold optimization. OpenAI quietly retired it as a primary metric after acknowledging contamination and flawed test design in approximately 59% of the evaluation set (see [SWE-bench Pro vs. Verified: The Benchmark That Lied](/posts/swe-bench-pro-vs-verified-the-benchmark-that-lied/) for the full analysis). SWE-bench Pro, which uses harder, curated GitHub issues with stricter validation, is now the meaningful number.
**The open-source gap closed dramatically.** When this article was written, there was no credible open-source competitor to frontier models on hard coding benchmarks. MiniMax M2.7 at 56.22% SWE-Pro — available on Hugging Face and Ollama, fully self-hostable — is eight points behind Claude Opus 4.7 on the same benchmark. That gap used to be 20+ points. The self-improvement technique M2.7 used (100 autonomous rounds of scaffold optimization) suggests the gap will keep closing.
The core argument of this piece stands: benchmark scores are a floor, not a verdict. What actually differentiates frontier models in production — tool orchestration, computer use fidelity, long-context coherence, behavioral safety under autonomy — is not captured by SWE-bench in any version. That argument is now more urgent, not less, as the closed-source models approach the SWE-Pro ceiling and open-source models close from below.
The question is no longer whether your model can solve an isolated Python bug. It's whether your agent can ship a feature end-to-end without supervision and without doing something catastrophic when it hits an unexpected state at step 47. No benchmark measures that. Production deployments do.
---
**Sources**
- [Gemini 3.1 Pro — Google Blog](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/)
- [Gemini 3.1 Pro Complete Guide: Benchmarks, Pricing — NxCode](https://www.nxcode.io/en/resources/news/gemini-3-1-pro-complete-guide-benchmarks-pricing-api-2026)
- [Best AI for Coding 2026: Every Model Ranked — MorphLLM](https://www.morphllm.com/best-ai-model-for-coding)
- [New AI Model Releases March 2026 — RenovateQR](https://renovateqr.com/blog/ai-model-releases-2026)
- [SWE-bench Verified leaderboard — swebench.com](https://www.swebench.com/)
- [Stanford AI Index 2026 — HAI Stanford](https://aiindex.stanford.edu/report/)
- [MiniMax M2.7 Official Announcement — MiniMax](https://www.minimax.io/news/minimax-m27-en)
- [OpenAI GPT-5.5 Announcement — OpenAI](https://openai.com/index/introducing-gpt-5-5/)
- [Claude Opus 4.7 Release — Anthropic](https://www.anthropic.com/news/claude-opus-4-7)
---
# Claude Code Computer Use: The Agent That Can Now See, Click, and Ship
URL: https://sdd.sh/2026/04/claude-code-computer-use-the-agent-that-can-now-see-click-and-ship/
Date: 2026-04-01
Updated: 2026-04-01
Tags: Claude Code, Computer Use, Agentic Workflows, Anthropic, Autonomous Coding
Categories: AI Tools, Agentic Workflows
Summary: Anthropic's March 23 Computer Use launch for Claude Code is the closest thing yet to a fully autonomous coding agent. It can open your files, run your app, spot the bug, and fix it — without you touching a keyboard.
On March 23, Anthropic shipped something that quietly reframes what "agentic coding" actually means. Computer Use for Claude Code is now in research preview for Pro and Max subscribers on macOS — and it is not a gimmick.
Claude can now open files, launch development tools, navigate your screen, run your application, identify bugs by looking at what's actually rendered, and ship the fix. One prompt. One continuous loop. No handoff.
That's a different category of tool than what most developers are using today.
## What It Actually Does
The capability sounds deceptively simple: Claude can see your screen and interact with it. But the implications for autonomous coding are significant.
Here's a concrete example of what a single Claude Code session can now do with Computer Use enabled:
1. You describe a layout bug in a web app
2. Claude opens the codebase, reads the relevant files
3. It launches the dev server and opens a browser
4. It looks at what's actually rendering — pixels, not just code
5. It identifies the CSS specificity conflict causing the bug
6. It edits the file, reloads, and visually confirms the fix is correct
7. It runs the test suite and reports back
That loop previously required a human at the keyboard for steps 3, 4, and 7. Now it doesn't.
## The Technical Architecture
Computer Use in Claude Code works through a built-in `computer-use` MCP server — not a separate product, not a plugin, but a first-class tool within the agent's existing context. This matters architecturally: the same session that has access to your files, terminal, and git history can also see and interact with your screen. It's a unified context, not a handoff to a different system.
The implementation requires macOS Accessibility and Screen Recording permissions. Claude Code version 2.1.85 or later is required, and the feature is available to Pro ($20/month) and Max ($100–$200/month) subscribers.
Anthropic also raised the output token ceiling to 128K for Opus 4.6 and Sonnet 4.6 to support longer autonomous runs — longer-horizon tasks need more room to reason and output.
## Safety by Default: Code First, Screen Second
One design choice worth calling out: Claude tries code and API routes first. Computer use is the fallback, not the default. If Claude can accomplish something by reading a file or calling a shell command, it will. The screen gets involved only when there's no programmatic alternative — typically when visual verification is the point.
Every screen interaction also requires explicit user permission before it executes. This isn't optional and cannot be bypassed in the current research preview. Anthropic's framing here is deliberate: the agent gains capability while the human retains the final veto.
This is consistent with how Anthropic has approached Auto Mode and other expansions of Claude's autonomy — incremental capability extension with explicit human control surfaces, rather than full autonomy shipped all at once.
## Dispatch: Your Agent Works While You're Away
The most interesting use case isn't "Claude fixes a bug while you watch." It's what Anthropic calls Dispatch: you send a task from your phone, and Claude completes it on your desktop while you're doing something else.
You're in a meeting. You tap a message on your phone: "Fix the login redirect issue we discussed." Claude Code picks it up, opens your codebase, reproduces the bug, patches it, runs the relevant tests, and notifies you when it's done. You review the diff on your phone.
This is asynchronous, autonomous software development. It is not pair programming. It is not AI-assisted coding. It is delegation — the same kind of delegation you'd give a senior engineer you trust.
The difference between this and what Cursor's agents do is worth naming: Cursor's computer use capability (announced the same week) runs inside an IDE. Claude Code's runs in a terminal-native environment with MCP as the integration layer. One is a UI feature; the other is an architecture. When you need your agent to interact with external tools, APIs, or systems that don't have a VS Code extension, the distinction matters.
## Why This Is the Agentic Milestone People Have Been Waiting For
Most "agentic" AI coding tools today operate in text space. They read code, write code, and run text-based commands. They're blind to anything that requires visual context — rendered UIs, error dialogs, accessibility trees, layout bugs, visual regressions.
Computer Use breaks that constraint. Claude can now work with the full surface area of software development, not just the parts that are representable as text.
This unlocks several categories of tasks that were previously out of reach for autonomous agents:
- **Frontend debugging**: layout issues that only appear in a running browser
- **Integration testing**: verifying that UI state matches expected behavior
- **Desktop tooling**: interacting with IDEs, database GUIs, design tools
- **Legacy systems**: applications with no programmatic API, where the UI is the only interface
- **Documentation**: generating accurate screenshots or UI descriptions from live state
None of these require the agent to have write access to your entire system. The combination of selective screen access, granular permissions, and MCP's tool scoping means capability can be extended without opening the blast radius.
## The Road Ahead
Computer Use is in research preview — the qualifier matters. Anthropic is collecting data on how developers actually use it, what goes wrong, and where the permission model needs refinement. Expect the feature to evolve quickly.
The bigger story is what this represents in Anthropic's roadmap. The trajectory from Claude Code's early 2025 launch to today: agentic sessions → Multi-agent orchestration (Agent Teams) → Auto Mode → Computer Use. Each step extends how much of the software development lifecycle Claude can handle end-to-end.
What comes next? The logical extension is persistent agents that run continuously, monitor codebases, respond to events (CI failures, new issues, Slack pings), and take action — not when you prompt them, but when the situation warrants it. The infrastructure is already there. The remaining piece is trust, built incrementally through exactly the kind of controlled rollout Anthropic is running now.
Computer Use is not the end state. It's the capability that makes the end state possible.
---
**Sources**
- [Claude Code and Cowork can now use your computer — Engadget](https://www.engadget.com/ai/claude-code-and-cowork-can-now-use-your-computer-210000126.html)
- [Anthropic says Claude can now use your computer — CNBC](https://www.cnbc.com/2026/03/24/anthropic-claude-ai-agent-use-computer-finish-tasks.html)
- [Claude Code changelog — Anthropic](https://code.claude.com/docs/en/changelog)
- [Claude Code 2026 new features guide — Apiyi](https://help.apiyi.com/en/claude-code-2026-new-features-loop-computer-use-remote-control-guide-en.html)
---
# MCP Crosses 97 Million Downloads: The Protocol That Won
URL: https://sdd.sh/2026/03/mcp-crosses-97-million-downloads-the-protocol-that-won/
Date: 2026-03-31
Updated: 2026-03-31
Tags: MCP, Model Context Protocol, agentic AI, infrastructure, OpenAI, Anthropic, ecosystem
Categories: AI Tools, Agentic Workflows
Summary: Sixteen months after Anthropic published a draft spec, MCP has crossed 97 million monthly SDK downloads — and OpenAI's adoption paired with retiring the Assistants API has effectively handed MCP the crown. Here's what that means for agentic development.
In November 2024, Anthropic published a draft specification called Model Context Protocol. The pitch was practical: instead of every AI agent team writing custom integrations for every tool, why not a standard interface — a USB-C connector for AI agents to talk to data sources, APIs, and services?
Sixteen months later, MCP has crossed **97 million monthly SDK downloads**, up from roughly 2 million at launch. That's 4,750% growth. More importantly, it's no longer Anthropic's spec. It belongs to the industry — and OpenAI's adoption has made it effectively permanent.
## The Number That Actually Matters
97 million downloads is a headline. The more important data point happened quietly alongside it: OpenAI added native MCP support to ChatGPT and simultaneously announced the **deprecation of their Assistants API**, with a sunset date in mid-2026.
That's not adoption. That's capitulation. When the company that invented the modern AI API model tells its developer ecosystem to migrate to someone else's standard, that standard has won the coordination game. You don't sunset your own API unless you've concluded that fighting a competitor's standard isn't worth the fracture.
This is what TCP/IP moments look like in practice. Not a triumphant announcement, but a quiet acknowledgment by competing parties that there's more value in interoperability than in winning a standards war.
## What 5,800 Servers Actually Mean
The MCP ecosystem has grown to **5,800+ community and enterprise servers**. The top 50 alone generate 170,000 monthly US searches. But the more instructive data comes from companies that have already standardized on MCP at scale:
- **Block** eliminated **340 custom connectors** by migrating to MCP. Those aren't minor internal utilities — each connector represents an integration someone built, maintained, and had to update when upstream APIs changed. MCP retired that entire engineering backlog.
- **Apollo** cut integration maintenance overhead by **60%**. Integration maintenance is the invisible tax on every engineering team — the work that doesn't ship features but breaks if you stop doing it.
- **Replit** built its entire AI development environment on MCP primitives. For a company where "AI runs the dev environment" is the product pitch, that's not a tooling choice; it's a foundational architecture decision.
These numbers explain the adoption curve. MCP crossed 97 million downloads not because developers love protocols, but because it made a concrete, measurable problem — the integration maintenance problem — significantly cheaper to solve.
## From Anthropic's Spec to Linux Foundation Governance
In December 2025, MCP was donated to the **Agentic AI Foundation (AAIF)** under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI. This is the right move, done at the right time.
A protocol that lives inside one company's GitHub repository won't become industry infrastructure. Enterprises don't standardize on something they can't audit, fork, or hold someone accountable for. The Linux Foundation donation was the moment MCP stopped being Anthropic's smart idea and became a public good with governance structure to match.
The analogy to earlier infrastructure standards is deliberate. HTTP lived inside CERN's servers before Tim Berners-Lee donated it to the public domain. TCP/IP was a DARPA project before it became the Internet. The pattern is consistent: a technical standard becomes infrastructure when the inventor gives up control.
## The Security Problem Everyone Saw Coming
Adoption at scale surfaces the rough edges. In March 2026, Qualys published a report flagging MCP servers as **"the new Shadow IT"**: enterprise security teams discovering MCP deployments they didn't authorize, authentication tied to static secrets, no standardized audit trails, and gateway behavior that varies by implementation.
Every MCP server is a potential attack surface. A rogue MCP server can request permissions an agent shouldn't have, exfiltrate data through tool call responses, or inject malicious context into agent reasoning. The fact that MCP is agent-native makes these risks different from traditional API security — agents that can act on behalf of users are a larger blast radius than read-only API calls.
The 2026 MCP roadmap addresses this with async operations and hierarchical multi-agent support, but security standardization is the hard problem. The tools ecosystem needs a layer that handles authentication, authorization, and audit logging consistently across MCP implementations — or enterprises will build it themselves and create another fragmentation problem.
This is the predictable growing pain of any infrastructure standard. It's not a reason to avoid MCP; it's a reason to build security tooling around it, which is exactly what's happening.
## Why Claude Code Is the Primary Beneficiary
Claude Code is a first-class MCP host. This wasn't an accident — Anthropic designed MCP while building the agent systems that would use it. The integration is structural, not bolted on.
Every MCP server added to the ecosystem — database connectors, git tools, monitoring integrations, enterprise SaaS APIs — becomes immediately available to Claude Code agents without any code changes or custom integration work. When Block standardized on MCP, every Claude Code user working in that ecosystem got those 340 connectors for free.
This is a compounding flywheel:
1. More MCP servers → more capable Claude Code agents
2. More capable Claude Code agents → more developers building on Claude Code
3. More Claude Code adoption → more incentive to build MCP servers
Cursor has MCP support. VS Code Copilot has MCP support. But Claude Code's terminal-native architecture means the agent *runs your tools* rather than an IDE plugin suggesting how you might run them yourself. The MCP integration is load-bearing; in IDE-centric tools it's a feature.
## The Infrastructure Moment
97 million downloads is the right headline because it signals something specific: MCP has crossed the threshold where it's cheaper to build on the standard than to build around it. When Block retires 340 connectors, they're not making an ideological statement. They're making an engineering economics decision.
The protocol won because enough parties — Anthropic, OpenAI, Google, Microsoft, Block, Replit — decided that the coordination value of a shared standard outweighed the competitive value of proprietary connectors. OpenAI's Assistants API sunset date formalized that decision.
For developers building on Claude Code: the MCP ecosystem is the toolbox. The question isn't whether to use it — it's which of the 5,800+ servers your agents should have access to.
---
**Sources:**
- [MCP crosses 97M monthly installs — arturmarkus.com](https://www.arturmarkus.com/anthropics-model-context-protocol-hits-97-million-installs-on-march-25-mcp-transitions-from-experimental-to-production-standard-layer-for-agentic-ai/)
- [MCP 97M downloads analysis — Digital Applied](https://www.digitalapplied.com/blog/mcp-97-million-downloads-model-context-protocol-mainstream)
- [2026 MCP Roadmap — modelcontextprotocol.io](http://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/)
- [MCP servers as Shadow IT — Qualys](https://blog.qualys.com/product-tech/2026/03/19/mcp-servers-shadow-it-ai-qualys-totalai-2026)
- [MCP enterprise readiness — WorkOS](https://workos.com/blog/2026-mcp-roadmap-enterprise-readiness)
---
# Anthropic's $380B Moment: What the IPO Signal Means for Claude Code
URL: https://sdd.sh/2026/03/anthropics-380b-moment-what-the-ipo-signal-means-for-claude-code/
Date: 2026-03-31
Updated: 2026-03-31
Tags: Anthropic, IPO, Claude Code, enterprise AI, AI industry, funding, valuation
Categories: Industry, AI Tools
Summary: Anthropic is targeting an October 2026 IPO to raise over $60 billion at a $380 billion valuation, with $19B in annualized revenue and 8 Fortune 10 customers. For developers building on Claude Code, the financial mechanics matter less than what they signal.
On March 27, The Information and Bloomberg reported that Anthropic is targeting an IPO as early as October 2026, aiming to raise over $60 billion. At a $380 billion valuation — up from $4.1 billion in late 2023 — this would be among the largest tech IPOs in history.
For developers, the financial mechanics of the raise matter less than what the IPO signal means: **agentic AI tooling has crossed the enterprise credibility threshold**, and Anthropic is betting the public markets will price it accordingly.
## The Numbers That Validate the Thesis
Anthropic's annualized revenue stands at **$19 billion** as of March 2026, with guidance toward roughly $26 billion by year-end. Eight Fortune 10 companies are customers. JPMorgan Chase has deployed AI coding tools to 60,000 developers and is reporting 30% developer velocity improvements. Goldman Sachs, Walmart, and BMW announced enterprise-wide AI coding rollouts in Q1 2026.
These aren't pilot programs. These are production deployments at companies where "something went wrong in prod" means regulatory scrutiny, not a GitHub issue. The fact that Fortune 10 legal, security, and compliance teams approved Claude for enterprise deployment is a harder bar to clear than any benchmark score.
For context on the revenue trajectory: AWS generated roughly $4.6 billion in its first full year and scaled from there. Anthropic is projecting $26 billion in year two of meaningful commercial traction. The pace is categorically different from prior technology adoption curves.
## A 92x Valuation Jump in 26 Months
The valuation timeline is worth laying out explicitly:
- **November 2023**: $4.1 billion (Series C, Google leading)
- **September 2024**: ~$40 billion (Series F)
- **February 2026**: ~$350 billion (following a $30 billion raise)
- **IPO target**: $380 billion+
That's a 92x increase in roughly 26 months. Skeptics will note that private market valuations are speculative; IPO pricing is the moment of reckoning where institutional investors price the actual business. But the direction is clear: Anthropic has been consistently undervalued at each prior round, as subsequent raises demonstrated.
The banks reportedly in early discussions — Goldman Sachs, JPMorgan, Morgan Stanley — are not taking courtesy meetings. These firms price real businesses.
## What This Means for Claude Code's Roadmap
An IPO isn't just about investor liquidity. It's a capital event that determines what a company can build over the next five to ten years.
Claude Code's most transformative features require sustained, expensive investment:
- **1M-token context windows** for Max, Team, and Enterprise tiers
- **Computer use** — Mac screen control that's stable enough for Pro/Max users to delegate multi-step workflows
- **Agent Teams** — the 15-agent parallel orchestration architecture
- **Multi-platform** support including Windows PowerShell (v2.1.84, currently in preview)
These aren't features you ship by moving fast and breaking things. The computer use implementation alone requires significant safety work to ensure agents don't take unintended actions when given Mac-level control. The Windows expansion requires infrastructure that handles the PowerShell surface area reliably.
An Anthropic with $60+ billion on its balance sheet is an Anthropic that can run longer experiments, employ larger safety research teams, and maintain the compute infrastructure that makes 1M-token context windows economically viable at current pricing.
## The Safety Moat as an Enterprise Sales Strategy
Anthropic's "responsible scaling policy" and public safety commitments have attracted considerable criticism from people who view them as marketing. The IPO data suggests a different read: **responsible AI is an enterprise sales strategy that works**.
Eight Fortune 10 customers didn't choose Anthropic despite its safety positioning. They chose it partly *because* of it. Enterprise legal teams don't want to explain to their board why they deployed a model that the developer publicly described as "scary good" at cyberattacks with no safety guardrails. Anthropic's transparent risk disclosures — including the Claude Mythos leak, which explicitly called out unprecedented cybersecurity risk — give enterprise procurement teams the documentation they need to approve deployments.
This is the moat that's hard to replicate: technical capability plus a credible safety narrative plus a track record at Fortune 10 scale. OpenAI has the capability and scale; the safety narrative is contested. Google has the enterprise relationships; the AI capability leadership is contested. Anthropic's combination is genuinely differentiated.
## The Competitive Landscape After the IPO
OpenAI is reportedly valued at $340 billion in private markets. xAI completed its SpaceX merger at roughly $1.25 trillion in February 2026, with a reported June IPO target. The major AI labs are all moving toward public markets within the same 12-month window.
What's different about Anthropic's path: the traditional IPO route requires audited financials, forward guidance, and public accountability. That's a higher standard than private fundraising. Choosing that path over alternative structures signals confidence in the underlying business durability — you don't take the harder path unless you think the numbers hold up under scrutiny.
For developers choosing a long-term AI stack, this matters. Tools built on platforms with uncertain longevity carry hidden costs: migration risk, deprecation announcements, shifts in pricing as the business model evolves. Anthropic targeting public markets signals a 5-10 year investment horizon with accountability to public shareholders. That's a different durability guarantee than "we just raised another private round."
## The Developer Equation
None of this changes what Claude Code can do today. The terminal-native agentic model, the MCP ecosystem integration, the multi-agent architecture — these exist regardless of the IPO status.
What the IPO signal changes is the confidence interval around Anthropic's roadmap. When an AI lab is navigating private financing rounds, product priorities can shift dramatically based on what's fundable. Public company product roadmaps are anchored by what's defensible to analysts on a quarterly call.
Anthropic's $19B revenue run-rate is a Claude Code success story as much as a model licensing story. The enterprise customers paying for Claude access need it to integrate into developer workflows — which means Claude Code adoption at JPMorgan, Goldman, and Walmart is driving the revenue that funds the next model generation.
That's a flywheel worth understanding: enterprise Claude Code deployments → revenue → model investment → better agentic capabilities → more enterprise deployments. The IPO is the moment that flywheel becomes a public company.
---
**Sources:**
- [Anthropic targets IPO as early as October 2026 — Winbuzzer](https://winbuzzer.com/2026/03/30/anthropic-ipo-q4-2026-60-billion-target-xcxwbn/)
- [Anthropic eyes $60B IPO raise — The Tech Portal](https://thetechportal.com/2026/03/27/anthropic-targets-ipo-as-early-as-october-2026-eyes-over-60-billion-raise-report/)
- [Anthropic IPO October 2026 — MLQ.ai](https://mlq.ai/news/anthropic-eyes-ipo-with-potential-october-2026-listing/)
- [Anthropic computer use in Claude Code — CNBC](https://www.cnbc.com/2026/03/24/anthropic-claude-ai-agent-use-computer-finish-tasks.html)
- [Enterprise AI coding adoption — Build Fast With AI](https://www.buildfastwithai.com/blogs/ai-tools-developers-march-2026)
---
# Jules Deep Dive: Google's Async Agent That Closes the CI Loop Without You
URL: https://sdd.sh/2026/03/jules-deep-dive-googles-async-agent-that-closes-the-ci-loop-without-you/
Date: 2026-03-30
Updated: 2026-03-30
Tags: jules, google, gemini, async-coding, agentic-workflows, ci-cd
Categories: AI Tools, Agentic Workflows
Summary: Jules is now generally available with Gemini 3.1 Pro at its core, an autonomous CI failure detection and fix loop, and audio changelogs. This is what a fully async coding agent actually looks like — and how it compares to the terminal-native model Claude Code represents.
Most AI coding tools are synchronous. You ask, they answer. You review, you accept or reject. The human is in the loop by design — which is another way of saying the human is the bottleneck by design.
Jules, Google's async coding agent, takes a different approach. You give it a GitHub issue. You walk away. It comes back with a pull request. If the CI checks fail on that PR, Jules detects the failure, develops a fix, commits it, and resubmits — without you touching a keyboard.
Now generally available with Gemini 3.1 Pro as its default model for Pro users, Jules is worth a proper deep dive.
## The Architecture: Async VM, Not Terminal Takeover
The fundamental distinction between Jules and tools like Cursor or GitHub Copilot is the execution model.
Cursor operates inside your IDE. Copilot lives in your editor. Both require you to be present, reviewing suggestions in real time. Jules runs in an **isolated VM on Google's infrastructure**, spawned when you assign it a task and torn down when it's done. You get a PR in your inbox when the work is complete.
This has several concrete implications:
- **No context switching.** You don't hand Jules your keyboard. You assign it a GitHub issue, the same way you'd assign it to a junior developer, and continue with your own work.
- **Reproducible environment.** The VM is clean and consistent — no "works on my machine" drift, no local dependency conflicts bleeding into the agent's output.
- **Parallel workstreams.** You can queue multiple Jules tasks simultaneously. While Jules is debugging one issue, it can be writing tests for another.
The tradeoff is latency. An async agent that runs in a cloud VM will never feel as snappy as an autocomplete suggestion. But that is the wrong comparison. Jules is competing with the time it would take *you* to do the work, not with how quickly a suggestion pops up.
## The CI Fixer: Why This Is the Milestone
The most significant recent addition is the **CI failure loop**.
When Jules opens a PR and a GitHub Actions check fails, Jules does not stop and wait for you to review the error. It receives the CI output, reasons about the failure, develops a fix, commits it to the branch, and resubmits. The loop closes automatically.
This is a meaningful shift. Every async coding agent can open a PR. That's table stakes. The breakdown that required human intervention was always: "the CI failed." Someone had to read the error, interpret it, write the fix, push, and wait again. Jules now handles that entire sequence.
The practical implication: for straightforward CI failures — lint errors, type mismatches, failing unit tests with clear error messages — Jules can fully resolve an issue start-to-finish without any human involvement in the middle. You assigned the issue, you get a green PR. That's the agentic promise, actually delivered.
Does it work on complex CI failures involving environment configuration, flaky tests, or cascading dependency issues? Less reliably. But the 80% case — deterministic, readable CI output that maps to a specific code change — is now handled.
## Gemini 3.1 Pro: The Model Upgrade That Matters
Jules' March 9 upgrade to Gemini 3.1 Pro as the default model for Google Pro users is not a minor version bump.
Gemini 3.1 Pro delivers **2x+ reasoning improvement** over Gemini 3 Pro and ranks first on 12 of 18 tracked coding and reasoning benchmarks. It supports a 1M token context window with 65K output tokens — enough to reason over large codebases in a single pass rather than chunking and losing coherence.
For Jules specifically, the reasoning uplift matters because Jules is not just executing instructions. It is doing the kind of multi-step reasoning that software debugging requires: reading the issue, finding the relevant code, forming a hypothesis about the root cause, implementing a fix, anticipating side effects, writing tests. Each of those steps compounds. A 2x reasoning improvement does not produce 2x better PRs, but it meaningfully raises the ceiling on which tasks Jules can handle autonomously versus where it needs to kick back to a human.
## Audio Changelogs: Surprisingly Useful
Jules offers **audio summaries of recent commits** — a listenable changelog of what changed in your repository.
This sounds like a novelty feature until you think about the use cases. Async development environments, especially teams using multiple AI agents in parallel, generate commit history faster than a human can read it. An audio summary you can play during a commute or between meetings is a genuinely practical way to stay oriented in a fast-moving codebase.
It also fits a broader trend: AI agents are generating more output than developers can synchronously review. Tools that help humans efficiently process that output — instead of requiring them to read every diff — become more valuable as agent output volume increases.
## Commit Attribution: The Right Level of Control
Jules gives you three commit authorship modes:
- **Jules sole author** — Jules' name and email on the commit
- **Co-authored: Jules + You** — shared credit, Jules primary
- **Co-authored: You + Jules** — shared credit, you primary
This is a small feature that signals the right philosophy. Attribution matters for code review, for audit trails, for understanding provenance. The answer is not to hide that an agent wrote the code. The answer is to make attribution accurate and configurable.
## Pricing and Access
Jules' pricing structure reflects its async positioning:
- **Free**: 15 tasks per day
- **Pro ($19.99/month)**: approximately 75 tasks per day + Gemini 3.1 Pro default
- **Ultra ($124.99/month)**: approximately 300 tasks per day
For comparison, Claude Code's Agent Teams feature requires Opus 4.6 and a Pro or Max subscription, with token costs approximately 7x higher per task than single-agent operation. Jules' flat task-based pricing is simpler to reason about for budgeting purposes, though it obscures the complexity dimension — a simple lint fix and a multi-file refactor count as one task each.
## Jules vs. Claude Code: Two Agentic Models
Jules and Claude Code both target the "actually autonomous" category of AI coding tools, but they represent different architectural philosophies.
**Jules** is GitHub-native and async. Its integration surface is a GitHub issue. Its output is a pull request. It is deliberately not your interactive coding environment — it is an async contributor on your team. The CI loop is its clearest expression of this: Jules is designed to operate without you, and the CI fixer removes the last common interruption point.
**Claude Code** is terminal-native and interactive-but-autonomous. It is designed for developers who want to be present in the workflow, orchestrating agents from a terminal rather than an IDE. Claude Code Agent Teams allow you to address individual agents directly, see their work in real time via tmux panels, and intervene at any point. The autonomy is opt-in and granular.
Neither model is strictly superior. Jules is more appropriate when you want to delegate and return to a completed PR. Claude Code is more appropriate when you want to run a complex, multi-step workflow and maintain strategic oversight of what the agents are doing.
The convergence point is agentic engineering — developers who are orchestrators, not implementors. Jules takes the delegation model to its logical conclusion. Claude Code takes the orchestration model to its. Both are more honest about what AI coding assistance actually is in 2026 than any IDE plugin that adds autocomplete and calls itself an agent.
## Bottom Line
Jules is now a serious async coding agent. Gemini 3.1 Pro raises the capability floor, the CI fixer closes the loop that previously required human intervention, and the async VM model keeps your local environment clean. For GitHub-centric teams that want to delegate implementation tasks to an AI contributor without restructuring their workflow around a new tool, Jules is the most mature option available.
The CI fixer is the feature to watch. If it proves reliable on a wide range of projects in real-world use, the argument for keeping a human in the middle of a CI debug cycle gets hard to make.
---
**Sources**: [Jules Changelog — Gemini 3.1 Pro Upgrade (March 9, 2026)](https://jules.google/docs/changelog/2026-03-09/) · [Jules Full Changelog](https://jules.google/docs/changelog/) · [Google Blog — Jules GA](https://blog.google/technology/google-labs/jules-now-available/) · [Google DeepMind — Gemini 3.1 Pro](https://deepmind.google/technologies/gemini/flash/) · [LogRocket AI Dev Tool Power Rankings](https://blog.logrocket.com/ai-dev-tool-power-rankings/)
---
# Claude Mythos: The Leaked Model That Scared the Security World
URL: https://sdd.sh/2026/03/claude-mythos-the-leaked-model-that-scared-the-security-world/
Date: 2026-03-30
Updated: 2026-03-30
Tags: claude, anthropic, ai-models, security, mythos
Categories: AI Tools, Industry
Summary: A CMS misconfiguration at Anthropic accidentally revealed 'Claude Mythos' — a model tier above Opus 4.6 that Anthropic itself calls an unprecedented cybersecurity risk. Here's what leaked, what it means for agentic coding, and why the security industry noticed immediately.
On March 26, a CMS misconfiguration at Anthropic exposed roughly 3,000 unpublished assets — including a draft blog post describing a new model that had not yet been announced. Within hours, the AI world was talking about **Claude Mythos**.
The details are striking enough to warrant a full breakdown. This is not a minor feature drop.
## What Leaked
The unpublished post described a model internally codenamed **Capybara** and publicly branded **Claude Mythos**. The document called it "by far the most powerful AI model we've ever developed" with "dramatically higher scores" across coding, academic reasoning, and cybersecurity benchmarks compared to Opus 4.6.
Two things stand out from the leaked material:
1. **Mythos adds a fourth pricing tier above Opus.** The current stack — Haiku, Sonnet, Opus — would gain a new top layer. Anthropic has not confirmed pricing, but the framing strongly implies a significant premium. Think Opus 4.6's Max plan, but more so.
2. **Anthropic's own assessment is that Mythos poses "unprecedented cybersecurity risks."** The draft stated the model is "currently far ahead of any other AI model in cyber capabilities." This is Anthropic — a company whose safety culture is baked into its founding story — voluntarily characterizing their own product as a potential threat vector.
That second point is what triggered a sell-off in cybersecurity stocks and briefly pushed Bitcoin down to $66K as markets processed what a step-change in AI offensive capability might mean.
## Why This Matters for Agentic Coding
The reflexive reaction to "unprecedented cybersecurity risk" is alarm. But for developers thinking about agentic coding, the relevant question is different: **what does a model that dramatically outperforms Opus 4.6 on coding actually look like in practice?**
Opus 4.6 already sits at 75.6% on SWE-bench with a 14.5-hour task completion time horizon — the longest of any commercial model. It runs multi-agent teams of up to 15 peers. It operates inside a 1M token context window. It can use a computer.
If Mythos genuinely represents a step change over that baseline on coding tasks, you are looking at a model that could potentially close a much larger category of real-world bugs and features without human checkpoints. The 45% SWE-bench Pro ceiling that currently defines even the best models — where tasks involve 107 lines across 4+ files — gets meaningfully pushed.
The 14.5-hour task horizon is already longer than a half-day sprint. A substantially more capable model running in an Agent Teams configuration would start to resemble something closer to a fully autonomous developer working overnight, not an assistant that needs babysitting.
## Anthropic's Deliberate Transparency About Risk
One of the more interesting aspects of this episode is what it reveals about Anthropic's internal framing.
Anthropic is not downplaying the capabilities of its own model. They are publishing — albeit accidentally — a candid characterization of Mythos as a serious cybersecurity risk. This fits their Responsible Scaling Policy, which establishes capability thresholds that trigger additional safety review before deployment. A model that crosses the "current frontier of cyber capabilities" would almost certainly require expanded red-teaming, access controls, and likely policy coordination before a general release.
This explains why Mythos is currently in limited early-access testing with no general release date. The announcement, when it comes, will probably be accompanied by significant safety disclosures.
That transparency is worth noting. Anthropic's approach here contrasts sharply with competitors who routinely release models while quietly acknowledging safety concerns internally. Calling your own model "unprecedented" in a risk context before it ships is at least consistent.
## What the Cybersecurity Angle Actually Means
"Cybersecurity capability" in this context almost certainly means offensive as well as defensive. A model that dramatically outperforms on vulnerability research, exploit development, and penetration testing reasoning is useful to defenders — and also to attackers.
The existing high-capability threshold in Anthropic's RSP already requires additional safeguards around models that meaningfully advance offensive cyber capabilities beyond the current state of the art. A model Anthropic describes as "currently far ahead of any other AI model in cyber capabilities" would trigger those safeguards.
This has two practical implications:
**For enterprise security teams**: If Mythos ships with the kind of code analysis capabilities the leak implies, it will likely become the default tool for vulnerability audits and threat modeling. A model that can reason about complex multi-file codebases with dramatically better coverage than Opus 4.6 would substantially change what a solo security engineer can audit in a day.
**For AI tooling providers**: Any platform that exposes model capabilities through APIs will face heightened scrutiny once Mythos ships. Rate limits, access controls, and audit logging that were optional considerations with Opus 4.6 will become baseline requirements.
## The Timing: Why March 2026
Anthropic's March was already their busiest product month ever — 14+ launches, including Opus 4.6 going GA, 1M context becoming available to Max and Team plans, computer use in Claude Code, and Windows PowerShell support. The leak landed in the final week of a sprint that had already outpaced anything in the company's history.
The timing matters because it frames Mythos as a next step, not a long-shot future research project. Anthropic is clearly in a capabilities race and accelerating. The fact that blog post drafts about Mythos existed at all suggests the announcement was weeks or months away, not years.
## What to Expect When It Ships
Reading the leaked framing carefully: Mythos is positioned as a premium tier for use cases where raw capability matters more than cost. The existing Claude lineup already handles most developer workflows well — Sonnet 4.6 for daily coding tasks, Opus 4.6 for complex multi-agent runs. Mythos appears targeted at the hardest category: extended autonomous operation on genuinely difficult software and research tasks.
For Claude Code users, the practical question will be whether Mythos ships as an option for Agent Teams and whether the token cost remains reasonable for long-running autonomous sessions. The 7x token overhead of Agent Teams on Opus 4.6 already requires careful budgeting. A model priced above Opus will require even more deliberate scoping.
The leak was not planned. But what it revealed is genuinely significant: Anthropic has a model in late-stage testing that they themselves consider a step change — and they are being unusually candid about both its power and its risk profile. That combination is either a warning sign or a competitive moat, depending on where you sit.
---
**Sources**: [Fortune — Anthropic Confirms Mythos](https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/) · [The Decoder — Leak Details](https://the-decoder.com/anthropic-leak-reveals-new-model-claude-mythos-with-dramatically-higher-scores-on-tests-than-any-previous-model/) · [Fortune — Cybersecurity Assessment](https://fortune.com/2026/03/27/anthropic-leaked-ai-mythos-cybersecurity-risk/) · [Futurism — Initial Report](https://futurism.com/artificial-intelligence/anthropic-step-change-new-model-claude-mythos) · [SiliconAngle — Analysis](https://siliconangle.com/2026/03/27/anthropic-launch-new-claude-mythos-model-advanced-reasoning-features/)
---
# From Vibe Coding to Agentic Engineering: The Paradigm Shift That Outran Its Own Branding
URL: https://sdd.sh/2026/03/from-vibe-coding-to-agentic-engineering-the-paradigm-shift-that-outran-its-own-branding/
Date: 2026-03-29
Updated: 2026-03-29
Tags: Vibe Coding, Agentic Workflows, Industry, Trends, AI Tools
Categories: Agentic Workflows, Industry
Summary: Andrej Karpathy coined 'vibe coding' on February 2, 2025. Collins Dictionary named it Word of the Year. Then Karpathy declared it passé and replaced it with 'agentic engineering.' Here's what happened in the 13 months between the tweet and the paradigm shift.
On February 2, 2025, Andrej Karpathy posted a short X thread that got 4.5 million views. The core idea: "fully give in to the vibes," let the LLM write all the code, don't even read what it generates. He called it "vibe coding."
Collins Dictionary named it Word of the Year 2025. MIT Technology Review named generative coding one of its 10 Breakthrough Technologies of 2026. By most accounts, vibe coding had won the cultural moment.
And then, in early 2026, Karpathy declared it passé and introduced a replacement: **agentic engineering**.
The year between those two declarations is among the most compressed paradigm shifts in software development history. Understanding why it happened so fast tells you a lot about where AI-assisted coding is actually going.
## What Vibe Coding Actually Said
The original tweet was short and deliberately provocative. Karpathy wasn't describing a rigorous methodology — he was naming something developers were already doing informally and mostly pretending not to. The practice: describe what you want, accept what the AI generates, fix errors by describing them to the AI again, never read the code yourself.
"It's not really coding," Karpathy wrote. "I just see stuff, say stuff, run stuff, and it mostly works."
For small prototypes, weekend projects, and throwaway tools, this was liberating. The friction between idea and working software dropped dramatically. Developers who would have spent three days on a data pipeline could have something running in three hours. Non-developers could build tools that previously required hiring engineers.
The marketing machine pulled in two directions simultaneously: AI tool companies amplified the message as validation; developers with 20 years of hard-won expertise quietly winced.
## The Backlash Was Also Right
By mid-2025, counter-data was accumulating. A December 2025 CodeRabbit analysis of 470 open-source pull requests found that AI co-authored code had **1.7x more major issues**, **75% more misconfigurations**, and **2.74x more security vulnerabilities** than human-authored PRs.
A January 2026 paper — "Vibe Coding Kills Open Source" — made a different argument: that fully surrendering to AI reduced developer engagement with open-source communities, because the act of reading code, contributing to discussions, and understanding implementation details was where community bonds formed. If you never read the code, you never have anything to say about it.
Both critiques landed. Vibe coding, taken literally, produced worse security posture and was arguably hollowing out the engineering culture that made open-source software good in the first place.
But the critiques also missed something. The problem wasn't AI-assisted coding. The problem was *unsupervised* AI-assisted coding — humans removing themselves entirely from the loop in contexts where the loop existed for good reason.
## Benchmarks and the Reality Check
SWE-bench Verified — the benchmark AI coding tools cited endlessly throughout 2025 — has been effectively retired. A Scale AI audit found 59.4% of its hard tasks have flawed tests. OpenAI stopped reporting scores on it after contamination concerns became too significant to ignore.
The replacement, **SWE-bench Pro**, tells a different story. Its 1,865 long-horizon tasks require an average of 107 lines changed across 4.1 files. Top scores as of early 2026:
- **Claude Opus 4.5**: 45.9%
- **GPT-5 (High)**: 41.8%
- **GPT-5.2 Codex**: 41.0%
The same models scoring 80%+ on SWE-bench Verified score 41–46% on SWE-bench Pro. That gap is the gap between "solving isolated problems on known codebases" and "doing real software engineering work in production repositories." The dominant failure mode for top models on Pro is context overflow — which directly explains why the 1M context window and Compaction API are strategically significant, not just marketing.
Vibe coding at 45% reliability is appropriate for prototypes. It is not appropriate for production infrastructure, security-sensitive code, or systems where failures page an on-call engineer.
## The Successor: Agentic Engineering
Karpathy's reframing is worth quoting precisely. His characterization of the shift: "'agentic' because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight."
The distinction from vibe coding is exact:
- **Vibe coding**: surrender to the AI; don't read the output; trust the vibes
- **Agentic engineering**: orchestrate multiple AI agents; act as strategic oversight; review results before shipping
The human role in agentic engineering is closer to engineering management than to typing. You set goals, structure problems for parallel execution, review outputs, and steer based on what you see. You don't necessarily write every line, but you absolutely read the code — or at least the parts that matter.
This is a better mental model for what the data actually shows. It explains why the security failures in vibe-coded projects happened: there was no oversight layer. It explains why teams that adopted async agent workflows with proper review processes — Claude Code's Agent Teams, Jules's VM-isolated execution, OpenAI Codex's sandboxed cloud agent — saw better outcomes than teams running in pure vibe mode. And it explains why Spec-Driven Development works: the spec is the strategic brief the engineering manager gives the team before work begins.
## What the Numbers Actually Show
Despite the critiques, adoption is unambiguous. By 2026:
- **92% of US developers** use AI coding tools daily
- **41% of all code written globally** is now AI-generated
- The AI coding tools market sits at **$4.7 billion in 2026**, projected at $12.3B by 2027
The argument was never whether AI tools belong in development workflows — they clearly do. The argument is about the *posture* the developer takes toward them. Vibe coding said "trust the output." Agentic engineering says "direct the process."
That shift is commercially important too: the tools built for vibe coding (fast autocomplete, single-agent generation, IDE autocomplete) are different from the tools built for agentic engineering — multi-agent orchestration, async VM execution, long-context memory, MCP-connected external services, structured review workflows.
## What This Means for the Tooling Landscape
The vibe coding era favored tools with great autocomplete and fast iteration loops. You typed, the AI suggested, you accepted or rejected. Cursor, GitHub Copilot, and Tabnine were well-positioned for that paradigm.
The agentic engineering era favors different primitives:
**Async execution**: Jules runs in a Google Cloud VM while you do other work; Claude Code Agent Teams spawns up to 15 independent teammates working in parallel; OpenAI Codex executes in sandboxed cloud environments. You assign work and return to results, rather than watching an agent type in real time.
**Terminal-native orchestration**: Claude Code's CLI model wins over IDE wrappers because agentic engineering doesn't require a GUI. Long-running agents, tmux panel layouts for parallel teams, scriptable workflows — these are terminal-native primitives that IDEs weren't designed to express.
**Long-running context**: With 1M tokens generally available for Opus 4.6, week-long agent sessions are viable. The context overflow failures that dominated early SWE-bench Pro results are becoming solvable. Compaction means agents no longer hit walls mid-task on large codebases.
**Spec-first workflows**: Writing goals before generating code is the natural interface for agentic engineering. The developer acts as architect; the agents handle implementation. This is SDD's core claim, and the 2026 tooling landscape has converged on it: Windsurf's Plan Mode, Jules's plan-then-execute model, and Claude Code's structured task planning all reflect the same underlying logic.
IDE-centric tools like Cursor are adapting — Cursor 2.0's Plan Mode and parallel agents are direct responses to this shift — but the architectural starting point still anchors them to synchronous, human-in-the-loop workflows. The developer watches agents work. That model doesn't scale to 15-agent teams or week-long async sessions.
## The Branding Outpaced the Practice
"Vibe coding" became Word of the Year before "agentic engineering" was a phrase because "vibe coding" described something that felt new and was easy to demo in 30 seconds. "Agentic engineering" describes something that requires setup, thought, and familiarity with multi-agent architectures to appreciate.
But the harder thing is what actually works at scale. Vibe coding was the right mental model for 2025's tool capabilities. Agentic engineering is the right mental model for 2026's.
The timeline compressed faster than anyone predicted. In 13 months, the paradigm that Karpathy named became insufficient to describe the paradigm he was helping build. That's not a critique of the original framing — it's a measure of how fast the underlying technology moved.
The next question isn't whether to use AI in development workflows. It's whether you're vibe coding (trusting outputs without oversight) or agentic engineering (orchestrating agents with strategic direction). The tools, the benchmarks, and the failure data all point in the same direction.
---
**Sources**
- [Vibe coding — Wikipedia](https://en.wikipedia.org/wiki/Vibe_coding)
- [What Is Vibe Coding in 2026? One Year From Karpathy's Tweet — DEV Community](https://dev.to/h1gbosn/what-is-vibe-coding-in-2026-one-year-from-karpathys-tweet-5f43)
- [Vibe coding is passé — Karpathy names "agentic engineering" — The New Stack](https://thenewstack.io/vibe-coding-is-passe/)
- [The uncomfortable truth about vibe coding — Red Hat Developer](https://developers.redhat.com/articles/2026/02/17/uncomfortable-truth-about-vibe-coding)
- [SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81% — Morph LLM](https://www.morphllm.com/swe-bench-pro)
- [SWE-bench February 2026 leaderboard update — Simon Willison](https://simonwillison.net/2026/Feb/19/swe-bench/)
- [CodeRabbit analysis: AI co-authored code quality issues — December 2025](https://coderabbit.ai/blog/ai-code-quality-2025)
- [MIT Technology Review: 10 Breakthrough Technologies 2026](https://www.technologyreview.com/10-breakthrough-technologies/2026/)
- [7 AI Tools That Changed Developer Workflow (March 2026) — Build Fast With AI](https://www.buildfastwithai.com/blogs/ai-tools-developers-march-2026)
---
# Claude Code Agent Teams: One Developer, Fifteen AI Teammates
URL: https://sdd.sh/2026/03/claude-code-agent-teams-one-developer-fifteen-ai-teammates/
Date: 2026-03-29
Updated: 2026-03-29
Tags: Claude Code, Multi-Agent, Anthropic, Agentic Workflows
Categories: AI Tools, Agentic Workflows
Summary: Claude Code's experimental Agent Teams feature lets a single session orchestrate up to 15 independent AI teammates, each with its own context window and toolset. Here's what the architecture looks like — and why a Rust C compiler built by 16 agents is a stress test worth understanding.
For most of 2025, "multi-agent coding" meant one thing in practice: a main agent spawning subagents to handle subtasks and report back. Useful, but fundamentally hierarchical. The lead agent retained all context; subagents were essentially remote function calls with more tokens.
Claude Code Agent Teams changes that model. Launched in experimental preview on February 5, 2026, it's the first agentic coding feature designed for genuine collaboration rather than delegation.
## Subagents vs Teammates: The Architecture Distinction That Matters
The difference between subagents and teammates is not cosmetic. Subagents in Claude Code are child processes: they receive a task, execute it, and return results to the parent. Communication is unidirectional. The human developer talks to the lead agent; subagents only surface through the lead's output.
Teammates are peers. Each teammate has its own context window, its own tool access, and — critically — a **mailbox**. Teammates can message each other directly, maintain state across turns, and can be addressed by the human developer without routing through the lead agent. There's also a shared task list that all team members can read and write.
This is a meaningful architectural shift. When four teammates are working on parallel modules, the developer can drop into any one of them, check progress, redirect, or ask a question — without losing context elsewhere. Cursor's parallel agents require context-switching in the IDE; Agent Teams maintains continuity across the entire team from a single terminal session.
## Enabling It
Agent Teams is available to Pro ($20/mo) and Max ($100–$200/mo) subscribers running Opus 4.6. Enable it by setting `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` in your environment or in `settings.json`:
```json
{
"env": {
"CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
}
}
```
Optionally, enable tmux to get per-agent terminal panels — each teammate gets its own pane, visible simultaneously. For multi-module work or parallel debugging hypotheses, the visual separation alone is worth the setup cost.
## The Team Lead Model
When Agent Teams is enabled, your Claude Code session becomes a **team lead**. It can:
- Spawn up to 15 teammates
- Assign tasks via the shared task list or direct messages
- Discover active teammates and check their status
- Initiate graceful shutdown with approval workflows
The lead retains strategic oversight: it sets goals, coordinates handoffs, and handles final integration. Teammates handle implementation, testing, or domain-specific subtasks. The division of labor mirrors how senior engineers actually work with teams — not line-by-line supervision, but clear task assignment and async check-ins.
Human developers don't lose access to the team either. You can message any teammate directly, inject additional context mid-task, or redirect a struggling agent without interrupting the others. This is qualitatively different from watching a single agent work sequentially.
## Token Cost Reality
There's no free lunch here. A team running in plan mode costs approximately **7x a single session in tokens**. With 15 teammates active, you're looking at sustained Opus 4.6 usage across 16 concurrent contexts.
For most tasks, that's overkill. But for the right problem — a major feature spanning frontend, backend, database migration, and tests; a refactor touching 40 files across 6 modules; a debugging session where three competing hypotheses need simultaneous investigation — the cost is justifiable. A team that saves two days of developer time at $150K/year salary costs less than it saved.
The key is task sizing. Agent Teams isn't a productivity multiplier for small tasks; it's a force multiplier for problems that are genuinely parallelizable and have clear interfaces between components.
## The Stress Test: A Rust C Compiler
Anthropic's own stress test for Agent Teams is worth understanding: they tasked 16 agents (one lead, 15 teammates) with writing a complete C compiler in Rust capable of building the Linux kernel. The result was a 100,000-line compiler spanning 2,000 sessions at approximately $20,000 in API costs.
The exercise wasn't about cost efficiency — it was about reliability, coordination, and failure mode discovery. What breaks when 15 agents work in parallel on a codebase of that scope? How does the mailbox system hold up under concurrent writes? Where do context windows overflow and cause agents to lose thread?
The results were instructive. The compiler worked. But real failure modes emerged: context overflow in long compilation phases, occasional duplicate task claims from the shared list. These findings directly informed subsequent Claude Code bug fixes — including the fix for background subagents becoming invisible post-compaction, which was causing duplicate agent spawns in early builds.
## Practical Use Cases
**Parallel code review**: Spawn three reviewers, each focused on a different dimension — security, performance, correctness. They work simultaneously on the same PR diff. The lead synthesizes findings.
**Cross-layer feature development**: Frontend teammate builds the React components; backend teammate writes the API endpoints; third teammate handles database migration and integration tests. Handoffs happen through the task list — no waiting for one layer to finish before starting the next.
**Competing debugging hypotheses**: The lead articulates three possible root causes for a production bug. Three teammates investigate each simultaneously. The first to establish or rule out their hypothesis broadcasts to the group.
**Multi-module refactors**: Each teammate owns one module. They coordinate on interface changes through the shared task list. The lead handles cross-module integration and final validation.
## What This Isn't
Agent Teams isn't a substitute for clear task definition. If the initial spec is ambiguous, 15 agents will diverge in 15 different directions. The productivity multiplier applies to parallelizable work with clean interfaces — it becomes expensive chaos applied to vague problems.
It's also genuinely experimental. The `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS` flag is a real signal: Anthropic is still finding and fixing failure modes. Use it for serious work, but keep oversight proportional to team size. A 4-agent team on a well-scoped feature is probably more reliable right now than a 15-agent team on something sprawling.
## The Competitive Landscape
Cursor 2.0's parallel agents and GitHub Copilot's issue-to-PR agent both allow some form of multi-agent work. Neither offers the mailbox architecture or the developer-addressable teammate model. Cursor's agents are IDE-centric — you watch them work through a UI. Claude Code's model is terminal-native and designed for async oversight.
More importantly, Cursor and Copilot's multi-agent features don't expose the architecture to the developer. You can't query a specific agent's context, redirect a struggling teammate, or inject mid-task guidance without interrupting the whole workflow. With Agent Teams, the developer is a co-orchestrator, not a passive observer above a black box.
## The Bigger Picture
Agent Teams is where the "orchestrator/executor" model of agentic engineering becomes concrete. The developer isn't writing code; the developer is setting goals, reviewing results, and steering a team. The relevant skill isn't keyboard speed — it's knowing how to structure a problem for parallel execution and how to read 15 simultaneous progress streams and identify what needs intervention.
That is, essentially, management. Senior engineers who already think in systems and interfaces will find this natural. Developers accustomed to writing every line themselves will find it disorienting until the mental model shifts.
The shift is worth making. A developer who can effectively direct a 15-agent team is not 15x more productive — some tasks don't parallelize, and coordination has real overhead — but for the right class of problems, the multiplier is real. This is what Anthropic means when they say the role of software engineers is changing: not disappearing, but moving up the abstraction stack.
---
**Sources**
- [Orchestrate teams of Claude Code sessions — Claude Code Docs](https://code.claude.com/docs/en/agent-teams)
- [Building a C Compiler with Agent Teams — Anthropic Engineering](https://www.anthropic.com/engineering/building-c-compiler)
- [Collaborating with Agent Teams in Claude Code — Heeki Park, Medium](https://heeki.medium.com/collaborating-with-agents-teams-in-claude-code-f64a465f3c11)
- [Claude Code Agent Teams: SwarmMode and TeammateTool — Sean Kim](https://blog.imseankim.com/claude-code-team-mode-multi-agent-orchestration-march-2026/)
- [Claude Opus 4.6 Introduces Adaptive Reasoning and Context Compaction — InfoQ](https://www.infoq.com/news/2026/03/opus-4-6-context-compaction/)
---
# Your AI Agent Is Drowning in Tokens — Here's How to Fix It
URL: https://sdd.sh/2026/03/your-ai-agent-is-drowning-in-tokens-heres-how-to-fix-it/
Date: 2026-03-28
Updated: 2026-03-28
Tags: Claude Code, agentic-coding, AI tools, tokens, developer-experience
Categories: AI Tools, Guides
Summary: A single `cargo test` can dump 4,800 tokens into your context window when only 11 matter. Multiply that across an agentic session and you're paying for noise that actively degrades your agent's reasoning. The fix exists — and it's not a bigger context window.
Run `cargo test` in a Rust project. You get a wall of compilation progress, timing info, test names, and formatting that totals about 4,800 tokens. The information you actually need — which tests passed, which failed — fits in 11 tokens.
That's a 99.8% noise ratio. And your AI coding agent just ate all of it.
Now multiply that across an agentic session. The agent runs `git status`, `ls`, `cat` a few files, runs tests, checks the diff, runs tests again. Each command dumps verbose output into the context window. Within thirty minutes, 80% of the context is CLI boilerplate that the agent will never reason about — but that actively interferes with its ability to reason about the 20% that matters.
This isn't a cost problem. Well, it is — but the more insidious issue is a **quality** problem.
## Why Noise Hurts More Than You Think
### The "Lost in the Middle" Effect
The paper "Lost in the Middle" (Liu et al., 2023) demonstrated something counterintuitive: LLMs exhibit a U-shaped attention pattern. They're great at using information at the beginning and end of the context, but **significantly worse** at using information in the middle.
This means every token of verbose `git log` or `npm install` output doesn't just waste space — it pushes earlier context (your code, your instructions, the agent's prior reasoning) into the attention dead zone. The more noise you add, the worse the agent gets at recalling what it was doing and why.
Even models with 200K+ token context windows aren't immune. The degradation is about **position**, not capacity.
### Irrelevant Context Degrades Reasoning
The GSM-IC study (Shi et al., ICML 2023) showed that adding irrelevant information to prompts "dramatically decreased" model performance — even when the model has the capability to solve the problem. LLMs get distracted by noise just like humans do.
In an agentic coding loop, this compounds. The agent runs a command, reasons about the output, runs another command. Each noisy output degrades the next reasoning step. Over a multi-hour session, the cumulative effect is measurable: the agent starts making worse decisions, forgetting earlier context, and repeating itself.
### The Compounding Problem
Agentic sessions are iterative. Each loop adds more context. The effects compound in four ways:
1. **Context exhaustion**: Noisy output fills the window faster, forcing earlier compaction or session restart. Sessions with filtered output last roughly 3x longer.
2. **Reasoning degradation**: Each iteration pushes prior reasoning into less-attended positions.
3. **Cost multiplication**: On pay-per-token models, 70% of spending can go to CLI noise. A 10-person team wastes roughly $1,750/month on tokens that actively make the agent worse.
4. **Rate limit pressure**: On subscription plans, noisy output burns through quotas ~40% faster than necessary.
The fundamental insight: **a 200K-token context filled with 80% noise performs worse than a 40K context filled with 100% signal**. Token reduction is not just cost optimization — it's reasoning quality optimization.
## The Fix: Filter Before It Hits the Context
The solution is conceptually simple: intercept command output before it reaches the LLM and strip the noise. In practice, this requires knowing what's noise and what's signal for dozens of different CLI tools.
### RTK (Rust Token Killer)
[github.com/rtk-ai/rtk](https://github.com/rtk-ai/rtk) — MIT, 11.7k stars
RTK is the clear category leader. It's a single Rust binary that acts as a CLI proxy: it intercepts shell commands issued by AI agents, runs them, and compresses the output before it reaches the context window.
**How it works**: `rtk init --global` installs a `PreToolUse` hook in Claude Code's settings. When the agent issues `git status`, the hook transparently rewrites it to `rtk git status`. The agent never sees the rewrite — it only receives the compressed output. Less than 10ms overhead per command.
**Four compression strategies**:
- **Smart filtering**: removes progress bars, ANSI codes, boilerplate, timing info
- **Grouping**: aggregates files by directory, errors by type
- **Truncation**: preserves relevant context, cuts redundancy
- **Deduplication**: collapses repeated log lines with occurrence counts
**Measured savings** (from the docs):
| Command | Before | After | Reduction |
|---------|--------|-------|-----------|
| `cargo test` | 4,823 tokens | 11 tokens | 99% |
| `git diff HEAD~1` | 21,500 tokens | 1,259 tokens | 94% |
| `git log -n 10` | 1,430 tokens | 194 tokens | 86% |
| `ls` (large dir) | 3,200 tokens | 640 tokens | 80% |
| `npm test` | 25,000 tokens | 2,500 tokens | 90% |
It covers 40+ command patterns: git, cargo, docker, kubectl, npm/pnpm, pytest, vitest, playwright, eslint, tsc, ruff, golangci-lint, and more.
**Agent support**: Claude Code (native hook), Gemini CLI (Rust hook processor), OpenCode (plugin), and any MCP client via the [rtk-mcp](https://github.com/ousamabenyounes/rtk-mcp) bridge.
**The analytics are great**: `rtk gain` shows cumulative savings, `rtk gain --graph` gives a 30-day ASCII chart, `rtk discover` scans your Claude Code history and tells you which unoptimized commands are wasting the most tokens. That last one is brilliant — it mines your actual usage to find optimization opportunities.
### The RTK Ecosystem
RTK has spawned a constellation of community extensions:
- **[rtk-mcp](https://github.com/ousamabenyounes/rtk-mcp)**: MCP server bridge — use RTK from Cursor, Windsurf, Claude Desktop, any MCP client
- **[openrtk](https://github.com/martinstannard/openrtk)**: OpenCode plugin
- **[pi-rtk](https://github.com/voska/pi-rtk)**: Extension for the Pi agent framework
- **[rtk-dashboard](https://github.com/ChrisX101010/rtk-dashboard)**: React-based real-time analytics dashboard
- **[rtk-flake](https://github.com/realitymolder/rtk-flake)**: Nix packaging
## Beyond CLI Filtering: Other Approaches
RTK solves the CLI output problem. But there are other angles on token reduction worth knowing about.
### AST-Based Skeletonization
**[skltn](https://github.com/danielcalvolopez/skltn)** takes a different approach: instead of filtering command output, it skeletonizes *code files*. Using tree-sitter AST parsing, it reduces files to function signatures, type definitions, and docstrings — collapsing implementation bodies. Files under 2,000 tokens pass through unchanged; larger files get the skeleton treatment. Claims 5-15x more codebase fits in the same context window. Works as an MCP server.
**[Repomix](https://github.com/yamadashy/repomix)** (22.6k stars) packs entire repositories into single AI-friendly files. Its `--compress` flag uses tree-sitter to extract key code elements. Not a real-time proxy like RTK — more of a "prepare the repo for an AI conversation" tool. The most popular tool in the broader "make code AI-friendly" space.
### Statistical Context Selection
**[copt](https://github.com/jmouchawar/copt)** (Context Optimizer) uses Bayesian statistical methods to decide which context chunks to include. It breaks context into semantic chunks, applies Beta-Bernoulli modeling with Thompson Sampling, and uses feedback-driven learning to improve chunk selection over time. A completely different philosophy from rule-based filtering — it *learns* what's useful.
### .NET-Specific Filtering
**[DotnetTokenKiller](https://github.com/HandyS11/DotnetTokenKiller)** targets .NET CLI commands specifically (`dotnet build`, `test`, `restore`). Strips SDK banners, MSBuild headers, progress lines, ANSI codes. If your stack is .NET, this complements RTK for the commands RTK doesn't cover.
## How to Start
The highest-ROI move is installing RTK. It takes thirty seconds:
```bash
brew install rtk-ai/tap/rtk # or: cargo install --git https://github.com/rtk-ai/rtk
rtk init --global
```
That's it. Every Claude Code session from now on gets filtered output. No workflow changes, no new commands to learn. The hook is transparent.
After a few days, run `rtk gain` to see your actual savings. Then run `rtk discover` to see what's still slipping through.
If you want to go further:
- Add **skltn** as an MCP server for large codebase navigation
- Use **Repomix** when preparing repo context for Claude.ai conversations
- Keep an eye on **copt** if the statistical approach appeals to you
## The Bigger Picture
The AI coding tool ecosystem is converging on a realization: **context window management is infrastructure, not an afterthought**.
We spent the last two years making context windows bigger. We're now learning that bigger isn't enough — cleaner matters more. A model reasoning over a pristine 40K-token context will outperform the same model reasoning over a noisy 200K-token context, every time.
Token reduction tools are the first generation of this infrastructure. Expect the next generation to get smarter: dynamic filtering based on the current task, learned models of what output is relevant, and tighter integration between agents and their output pipelines.
For now, RTK alone is worth the install. Your agent — and your wallet — will thank you.
## Tool Comparison
| Tool | Approach | Savings | Stars | Best For |
|------|----------|---------|-------|----------|
| **RTK** | CLI proxy, rule-based filtering | 60-99% per command | 11.7k | Daily agentic coding (Claude Code, Gemini, OpenCode) |
| **skltn** | AST skeletonization via tree-sitter | 5-15x more code in context | — | Navigating large codebases |
| **Repomix** | Repo packaging with compression | varies | 22.6k | Preparing context for Claude.ai |
| **copt** | Bayesian chunk selection | learns over time | — | Experimental / research-oriented |
| **DTK** | .NET CLI filtering | varies | — | .NET-specific projects |
---
# Windsurf Arena Mode: Let the Models Fight It Out
URL: https://sdd.sh/2026/03/windsurf-arena-mode-let-the-models-fight-it-out/
Date: 2026-03-28
Updated: 2026-03-28
Tags: Windsurf, AI tools, agentic-coding, model comparison, developer tools
Categories: AI Tools, Agentic Workflows
Summary: Windsurf Arena Mode runs two AI agents on the same task in parallel isolated worktrees, then asks you to pick the winner. It's a clever answer to a real problem — but it also reveals something telling about where IDE-centric AI is stuck.
When you're paying for an AI coding tool, you're implicitly trusting that the model it routes you to is the right one for your task. Most of the time, you have no idea if that trust is warranted.
Windsurf decided to surface that uncertainty explicitly. In January 2026, as part of [Wave 13](https://docs.windsurf.com/windsurf/cascade/arena), they shipped **Arena Mode**: a feature that runs two Cascade agents simultaneously on the same prompt, in separate isolated git worktrees, then asks you to vote for the better output. Not which explanation sounds more confident — which actual code diff you want to merge.
It's one of the more honest things any AI coding tool has shipped. It's also worth examining carefully for what it reveals.
## How It Works
The mechanics are cleaner than most competitive A/B features:
**Isolated execution**: Each agent gets its own git worktree — a full, independent copy of your repository. They can't contaminate each other's changes. You get two real diffs, not two blocks of suggested text.
**Blind evaluation**: Model identities are hidden during the battle. You see "Model A" and "Model B" until you vote. This isn't just aesthetics — it forces you to evaluate output quality rather than brand reputation.
**Battle Groups**: You choose which models compete. "Fast vs. smart," specific model pairs, or let Windsurf select from curated groups based on the task type. You can also force specific models when you have a hypothesis to test.
**Sync or Branch**: After voting, follow-up prompts can either stay synchronized (both agents see the same continuation) or branch independently (you're now running two separate coding sessions from a common ancestor). The branching mode is surprisingly powerful for exploring solution spaces.
**Vote-driven leaderboards**: Your votes accumulate into both a personal leaderboard and a global one. Over time, the system learns which models perform best in your specific codebase and against your preferences.
Also shipped alongside Arena Mode: **Plan Mode**, which requires the agent to surface clarifying questions and produce a structured task plan before writing any code. It's Windsurf's answer to the "the agent just started doing something random" failure mode.
## The Problem It's Solving
Model benchmarks are unreliable guides for real work. SWE-bench and HumanEval measure performance on standardized problems in controlled environments. Your codebase isn't controlled; it has quirks, conventions, debt, and implicit architectural decisions that no benchmark captures.
The only reliable test of "which model is better for my project" is running both models on your project and comparing results. That's expensive to do by hand — it requires context switching, manual comparison, and subjective judgment applied to outputs that look superficially similar.
Arena Mode automates the experimental apparatus. You do the same task you were going to do anyway; the comparison happens in parallel rather than sequentially. The vote takes seconds. The data accumulates.
This is a genuine contribution to developer tooling. The personal leaderboard idea in particular — building a routing preference model calibrated to your specific judgment on your specific code — is exactly the kind of personalization that makes AI tools more useful over time without requiring users to think about model selection.
## The Catch
Here's what Arena Mode reveals by existing: Windsurf doesn't know which model is best for your task, so it's asking you to figure it out for them.
That's not an insult — it's an honest acknowledgment of a hard problem. Model performance is highly task-dependent, codebase-dependent, and sometimes random-seed-dependent. No routing heuristic is perfect. Asking developers to contribute signal is smarter than pretending the problem is solved.
But it does highlight a fundamental constraint of the IDE-centric model. To run Arena Mode, you need to be present. You're reviewing two diffs, making a judgment call, clicking a button. The feature assumes you have time to evaluate competing outputs — that you're not doing something else while the agent works.
This is the [human-in-the-loop paradigm](/posts/cursor-vs-copilot-vs-claude-code-vs-windsurf-2026/) in hardware form. Arena Mode is an excellent tool for the developer who is actively engaged with the AI, thinking carefully about quality, running controlled experiments. It has essentially zero value for the developer who has launched an agent to handle a task while they're in a meeting.
Autonomous workflows don't have a "pick the winner" step. They need the routing decision made upfront, with fallback logic for failures — not a human judge on standby.
## What Windsurf Is Getting Right
Set aside the structural critique for a moment: Arena Mode is a smart product decision.
It turns model selection from a one-time configuration choice ("which model should I set as my default?") into an ongoing learning process. It acknowledges that the answer changes over time as models update, as your codebase evolves, and as your own preferences develop.
Plan Mode deserves more credit than it's getting in the coverage. Requiring structured clarification before execution is something Claude Code users do through workflow discipline (writing specs, using `CLAUDE.md` project context) — but making it a first-class UI affordance lowers the barrier for developers who don't have that discipline yet. You get the benefits of SDD-style upfront thinking without needing to know what SDD is.
The combination — Plan Mode to capture intent, Arena Mode to test execution — is a coherent approach to reducing the gap between "what you asked for" and "what the agent built."
## Broader Context
Windsurf shipped Arena Mode two weeks before [Cognition acquired Windsurf](/posts/cognition-buys-windsurf-ai-coding-market-consolidates/) in February 2026, making it effectively the last major feature shipped by the independent company. It's a fitting send-off: ambitious, technically solid, honest about the unsolved problems.
The real test for Arena Mode under Cognition's ownership is whether it survives the integration intact. Devin 2.0's positioning is all about autonomous execution — the human-comparison-loop feature is architecturally awkward alongside "set it and forget it" agents. Watch whether Arena Mode gets expanded, deprecated, or quietly rebranded as something that fits the new product narrative.
In the meantime, if you use Windsurf and have ever wondered whether you're on the right model: this is the tool you've been waiting for. It doesn't solve the autonomy problem. It does solve the comparison problem, and it does it well.
---
**Sources:**
- [Arena Mode — Windsurf Documentation](https://docs.windsurf.com/windsurf/cascade/arena)
- [Windsurf Wave 13: Arena Mode and Plan Mode — Digital Applied](https://www.digitalapplied.com/blog/windsurf-wave-13-arena-mode-plan-mode-swe-1-5-guide)
- [Windsurf Introduces Arena Mode for Parallel AI Model Comparison — InfoQ](https://www.infoq.com/news/2026/02/windsurf-arena-mode/)
- [Cognition Acquires Windsurf — sdd.sh](/posts/cognition-buys-windsurf-ai-coding-market-consolidates/)
---
# Parallel AI Agents: The Tools That Let You Run Ten Claudes at Once
URL: https://sdd.sh/2026/03/parallel-ai-agents-the-tools-that-let-you-run-ten-claudes-at-once/
Date: 2026-03-28
Updated: 2026-03-28
Tags: Claude Code, agentic-coding, AI tools, developer-experience, workflow
Categories: AI Tools, Guides
Summary: One Claude Code session is powerful. Ten running in parallel is a different paradigm entirely. Here's the emerging ecosystem of multiplexers, orchestrators, and dashboards — and how to pick the right one.
One Claude Code session is powerful. But at some point, you're waiting. The agent is refactoring a module, and you have three other tasks you could kick off right now — a bug fix, a test suite, a docs update. You're watching a single-threaded workflow when the work is embarrassingly parallel.
This is the **multi-agent problem**: how do you run multiple AI coding sessions simultaneously without them stepping on each other's files, losing track of which branch is which, or melting your laptop?
An entire ecosystem has emerged in the past few months to solve it. Let's break it down.
## The Universal Pattern: Git Worktrees + Session Management
Nearly every tool in this space converges on the same core architecture:
1. **Git worktrees** for file isolation — each agent gets its own working copy of the repo, on its own branch, sharing the same `.git` directory
2. **Session management** (tmux panes, native windows, or daemon processes) to keep agents running independently
3. **A dashboard or notification layer** to monitor what's happening across all sessions
The differentiation is in how much intelligence sits on top of this pattern: from "just run the commands" (a bash script) to "decompose the work and coordinate the agents" (a full orchestrator with a supervisory AI).
## Tier 1: GUI Orchestrators
### Conductor
[conductor.build](https://conductor.build/) — Proprietary, free, macOS (Apple Silicon only)
Built by the Melty Labs team (YC-backed), Conductor is a native Mac app that gives you a unified dashboard for running Claude Code and Codex agents in parallel. Each agent gets its own git worktree. You see all agents, their status, and can review diffs from a single view.
**The good**: Polished UI, macOS native notifications when agents need attention, uses your own API keys with no markup. Dead simple to get started — it just wraps Claude Code in a nice interface.
**The catch**: Mac-only, Apple Silicon only. Closed source. And it's been controversial for requesting broad GitHub permissions (full read-write to your GitHub account), which is more access than a local development tool typically needs.
**Best for**: Individual developers on Mac who want a visual dashboard without learning any new CLI tools.
### Superset
[superset.sh](https://superset.sh/) — Elastic License 2.0, free tier + $20/mo Pro, macOS/Windows/Linux
The most ambitious tool in this space. Superset is a full terminal replacement purpose-built for the AI agent era. It runs 10+ parallel agents, each in isolated worktrees, with a daemon architecture that persists sessions across crashes and app restarts.
**The good**: Cross-platform. Real-time dashboard with status indicators. Built-in diff viewer and editor. Intelligent resource management prevents your machine from thrashing. Electric SQL for real-time sync of tasks and PRs.
**The catch**: The Elastic License means you can't offer it as a hosted service (fine for most users, problematic for platform teams). The Pro tier adds features but the free tier is genuinely usable.
**Best for**: Teams that want a production-grade tool for sustained multi-agent workflows across platforms. The daemon architecture is the differentiator — sessions survive everything.
### Nimbalyst
[nimbalyst.com](https://nimbalyst.com/) — Proprietary, free for individual use, macOS/Windows/Linux/iOS
Takes the kanban metaphor seriously: every agent session is a card on a board with automatic status tracking (running, waiting, completed, failed). The iOS app means you can monitor your agents from your phone while stepping away.
**Best for**: Visual thinkers who want to see all their work-in-progress at a glance, and people who want mobile monitoring.
## Tier 2: Terminal Multiplexers
### cmux (craigsc)
[github.com/craigsc/cmux](https://github.com/craigsc/cmux) — MIT, ~276 stars
The minimalist's answer. A ~560-line Bash script that wraps the git worktree lifecycle into single commands: `cmux new` creates a worktree and launches Claude Code, `cmux merge` and `cmux rm` auto-detect the current worktree from `$PWD`. No dependencies, no build step.
A `.cmux/setup` hook lets you run project-specific init (symlink secrets, install deps, etc.) when creating new worktrees.
**The good**: Zero overhead. If you know tmux, you already know how to use this. The code is readable in one sitting.
**The catch**: No notifications, no dashboard. You're manually switching between tmux windows to check on agents. Works fine for 3-5 parallel tasks, gets chaotic beyond that.
**Best for**: Developers who live in tmux and want the thinnest possible layer over git worktrees.
### cmux (manaflow)
[cmux.com](https://cmux.com/) — Open source, free, macOS
Confusingly shares a name with the above but is a completely different tool: a native macOS terminal app built on Ghostty (libghostty). Not Electron. Vertical tabs, split panes, embedded browser with accessibility tree snapshotting — agents can interact with web UIs.
The notification system is built-in via OSC terminal escape sequences and a `cmux notify` CLI, so any agent or script can trigger alerts.
**Best for**: Mac users who want a terminal-native experience with first-class notification support and don't want Electron.
### dmux
[dmux.ai](https://dmux.ai/) — MIT
Creates tmux panes with git worktrees and launches agents. AI-generated branch names and commit messages (via OpenRouter). Multi-select launches and smart merging.
**Best for**: Quick parallel launches with less manual branch management.
### amux
[amux.io](https://amux.io/) — MIT + Commons Clause, ~80 stars
The most interesting terminal multiplexer. A single Python file (~23k lines) that runs dozens of parallel agents with a self-healing watchdog — it auto-compacts context, restarts crashed agents, and unblocks stuck prompts. A shared kanban board with atomic SQLite task claiming lets agents coordinate. Web dashboard at `localhost:8822` gives you live terminal peek, file explorer, and cross-output search from your browser or phone.
**The good**: The watchdog and self-healing features mean you can actually walk away and come back hours later to completed work. The web dashboard solves the "what's happening across 20 tmux panes" problem.
**The catch**: The Commons Clause restricts commercial resale (fine for most teams).
**Best for**: Running many agents unattended for extended periods. The watchdog is the killer feature.
## Tier 3: Multi-Agent Orchestrators
These go beyond session management into AI-coordinating-AI territory.
### Gas Town
[github.com/steveyegge/gastown](https://github.com/steveyegge/gastown) — MIT, ~12.7k stars
By Steve Yegge. The "Mayor" is a Claude Code instance with full workspace context that decomposes tasks and spawns worker agents. Persistent work tracking via git-backed hooks. A Convoy Panel shows in-progress work, an Event Stream gives chronological activity, and a Problems view surfaces agents needing human intervention.
Supports Claude, Gemini, Codex, Cursor, Augment, AMP, OpenCode, Copilot, and more. Vendor-agnostic by design.
**Best for**: Complex projects where task decomposition is the bottleneck. You give the Mayor a high-level goal and it figures out the parallelism.
### Multiclaude
[github.com/dlorenc/multiclaude](https://github.com/dlorenc/multiclaude) — Open source, ~257 stars
By Dan Lorenc (Chainguard founder). Spawns autonomous Claude Code instances that coordinate, compete, and collaborate. Two modes: Single Player (a merge-queue auto-merges PRs when CI passes) and Multiplayer (a PR-shepherd coordinates with human reviewers).
**Best for**: Long autonomous runs where you want to walk away entirely and let the agents self-coordinate via PRs and CI.
### Claude Code Agent Teams (Built-in)
Claude Code's own experimental multi-agent feature. One session acts as team lead, spawning teammates that work in their own context windows and communicate with each other.
**The catch**: Experimental, disabled by default (`CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS`). Adds significant coordination overhead and token cost. Worth watching but not production-ready.
## How to Choose
The decision tree is simpler than the ecosystem suggests:
**"I just want to run 3-5 Claude Code sessions without branch conflicts"**
→ cmux (craigsc). Bash script, zero learning curve, zero dependencies.
**"I want a visual dashboard and cross-platform support"**
→ Superset. Most polished, daemon-backed persistence, works everywhere.
**"I'm on Mac and want something native and pretty"**
→ Conductor for simplicity, or cmux (manaflow) for a terminal-first approach.
**"I want to launch 20 agents and walk away"**
→ amux for the watchdog and self-healing, or Multiclaude for fully autonomous PR workflows.
**"I want AI to decompose the work, not just run it"**
→ Gas Town. The Mayor pattern is the most sophisticated approach to task distribution.
## The Real Bottleneck
Here's the thing nobody talks about in the multi-agent discourse: **the bottleneck isn't the tooling, it's the task decomposition**.
Running ten agents in parallel is easy. Giving ten agents tasks that are *actually independent* — that don't create merge conflicts, don't make contradictory architectural decisions, don't duplicate work — that's hard. The tools that acknowledge this (Gas Town's Mayor, Multiclaude's merge-queue) are more honest about the real problem than the ones that just give you ten tmux panes.
If you're starting out, don't optimize for maximum parallelism. Start with two or three agents: one working on the current task, one on the next task, one running tests or doing a review. That's already a 3x multiplier on your throughput with minimal coordination overhead. Scale up the parallelism as you get better at scoping independent work.
The tools are ready. The question is whether your tasks are.
## Comparison Table
| Tool | Type | Platform | License | Notifications | Price |
|------|------|----------|---------|---------------|-------|
| **Conductor** | GUI | macOS (AS) | Proprietary | macOS native | Free |
| **Superset** | GUI Terminal | All | ELv2 | macOS + in-app | Free / $20/mo |
| **Nimbalyst** | GUI Kanban | All + iOS | Proprietary | Kanban + mobile | Free |
| **cmux (craigsc)** | Bash | Any (tmux) | MIT | None | Free |
| **cmux (manaflow)** | Native terminal | macOS | Open source | OSC + CLI | Free |
| **dmux** | CLI | macOS | MIT | macOS native | Free |
| **amux** | Python + Web | Any (tmux) | MIT + CC | Web dashboard | Free |
| **Gas Town** | Orchestrator | Any | MIT | Event stream | Free |
| **Multiclaude** | Orchestrator | Any (tmux) | Open source | PR-based | Free |
| **Agent Teams** | Built-in | Any | — | In-session | API cost |
---
# Anthropic's 8 Agentic Coding Trends: A Manifesto, Not Just a Report
URL: https://sdd.sh/2026/03/anthropics-8-agentic-coding-trends-a-manifesto-not-just-a-report/
Date: 2026-03-28
Updated: 2026-03-28
Tags: Anthropic, Claude Code, agentic-coding, AI trends, software engineering
Categories: Agentic Workflows, Industry
Summary: Anthropic just published the most data-rich statement on where agentic coding is headed. Here's what the eight trends actually mean — and what it tells you about the next two years of software development.
Anthropic doesn't publish a lot of market research. When they do, it's worth reading carefully — not just as a snapshot of where the industry stands, but as a signal of where they're building. The [2026 Agentic Coding Trends Report](https://resources.anthropic.com/2026-agentic-coding-trends-report) is the most substantive thing they've published on the state of agentic development, and it reads less like an industry survey and more like a product roadmap with citations.
Eight trends. Real numbers from production deployments. Here's what they mean.
## Trend 1: Engineering Roles Are Shifting Faster Than Anyone Expected
The report opens with a framing claim that would have sounded overblown eighteen months ago: engineers are transitioning from writing code to orchestrating agents. "Systems thinking over syntax" is the phrase they use.
The supporting data makes this credible. TELUS has deployed 13,000+ custom AI solutions across its engineering organization. 30% faster engineering cycle times. 500,000+ hours saved. These aren't pilot numbers — they're operating at scale.
Zapier reports 89% AI tool adoption among their engineering team and 800+ internal agents in active use. That's not a team experimenting with AI; that's a team that has rebuilt its operating model around it.
What's shifting isn't that AI writes code and humans review it. What's shifting is that the highest-leverage engineering skill is now defining problems clearly enough for agents to solve them — writing specifications, designing agent architectures, identifying where autonomous execution breaks down and requires human judgment. The engineers who are thriving are the ones who were already good at this. The ones who relied on implementation skill alone are finding the market less forgiving.
## Trend 2: Multi-Agent Orchestration Is Becoming Standard Infrastructure
A single agent handling a complex software task is a parlor trick. A team of specialized agents operating in parallel under an orchestrator is how real work gets done.
The report highlights two dominant orchestration patterns emerging in 2026. **LangGraph** for graph-based workflows — stateful agents with clear dependency maps, suited for tasks where you need deterministic sequencing with conditional branches. **Microsoft AutoGen** for conversation-based multi-agent systems — agents that reason about tasks by talking to each other, suited for exploratory work where the path isn't known upfront.
The infrastructure pattern that's becoming standard: an orchestrator agent receives the task, decomposes it, spins up specialized subagents (a code-writer, a test-writer, a security reviewer), and aggregates their outputs. Claude Code's experimental Agent Teams feature (`CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1`) is Anthropic's implementation of this pattern. It's still rough, but the direction is clear.
## Trend 3: Long-Running Agents Are Viable
"Long-running" used to mean twenty minutes. Now it means hours or days.
This changes the kinds of tasks you can delegate. Full application builds. Tech debt clearance across a large codebase. Systematic test coverage improvements. These aren't tasks you'd start and babysit — you'd define success criteria, launch the agent, and review the output when it's done.
The infrastructure requirements are real: reliable context management (Claude's Compaction API handles this for long conversations), checkpoint/resume capability, observability tooling so you know what the agent did over a multi-hour run. The tooling is catching up. Anthropic's 1M token context window — now generally available on Max, Team, and Enterprise plans — makes it practical to maintain coherent context over extended tasks without lossy summarization.
## Trend 4: Human-in-the-Loop Has Been Redesigned
The old framing of HITL was: "AI does most of the work, but humans approve before anything risky happens." The new framing is more precise: developers delegate roughly 60% of their work to AI agents autonomously, with full delegation possible on well-specified tasks.
The key redesign is treating HITL as a **deterministic gate** rather than a continuous interruption. You define upfront which types of decisions require human approval — deploying to production, modifying billing logic, changing authentication flows — and the agent routes everything else autonomously. The human's attention is reserved for decisions worth their attention.
This is a mature framing of a problem that earlier agentic tools handled badly. Cursor and Copilot both defaulted to constant confirmation requests (which trained developers to approve without reading). The better pattern is exception-based: the agent proceeds confidently and pauses only when it hits something that meets your predefined escalation criteria.
## Trend 5: Parallel Workflows and Agentic CLIs Are Winning
The terminal-native, CLI-first model is pulling ahead. The report's data on parallel agent workflows is striking: teams running multiple Claude Code sessions against the same codebase via git worktree isolation — one agent on a feature branch, one on a bug fix, one running the test suite — report 3-5x throughput improvement on sprint velocity.
This is only viable with a tool designed for the command line. IDE-centric agents like Cursor are UI-bound — you can technically run multiple windows, but the workflow isn't designed for it and the session management is painful. Claude Code's architecture assumes multiple concurrent sessions; worktree isolation is built in.
The report also flags the rise of agentic CLIs beyond code: Jules (Google's async coding agent), Devin 2.0, and the OpenAI Codex desktop app all follow the same pattern — you define a task, the agent executes autonomously, you review results. The human-presence-required model is losing ground to the task-delegation model.
## Trend 6: MCP Standardization Is Becoming Critical Infrastructure
MCP is cited throughout the report as the connective tissue enabling everything else. 97 million monthly SDK downloads as of March 2026. 6,400+ registered servers. Native support from Anthropic, OpenAI, Google, and Microsoft.
The 2026 roadmap priorities — Streamable HTTP for stateless scaling, Server Cards for zero-connection discovery, fine-grained authorization, Human-in-the-Loop standardization at the protocol level — are all oriented toward enterprise production use. MCP is graduating from "developer experiment" to "infrastructure standard," and the governance shift to the Linux Foundation makes that transition formal.
For teams building agentic systems: if your internal tools aren't MCP-compatible yet, they will need to be. The protocol is becoming table stakes.
## Trend 7: Agentic Coding Is Spreading Beyond Engineering
The report documents something I haven't seen much coverage of: sales teams, legal teams, and marketing teams are building their own agents using the same tooling and patterns as software engineers. Not IT-mediated custom development — direct deployment by non-engineers.
This is only possible because the tools have gotten good enough that writing an agent specification doesn't require programming knowledge. You describe what the agent should do, what tools it has access to, what its constraints are. If you can write a detailed email, you can write an agent spec.
The implication for software teams: the skills you're developing for AI-assisted development have broader organizational value than just shipping features faster. Engineers who can teach their organization to build agents well are becoming disproportionately influential.
## Trend 8: Security Is a Double-Edged Sword
The report closes with the most uncomfortable trend, and they don't soften it. Agents accelerate both defensive security work (automated code review, vulnerability scanning, continuous compliance checking) and offensive exploit development.
They cite this as context for why Anthropic's safety layer — the prompt injection screening, the refusal policies for destructive actions — isn't friction, it's a feature. Claude Code's auto mode comes with built-in guardrails specifically because autonomous agents running at scale amplify both good and bad instructions with equal efficiency.
The security framing also explains why Claude Code's permission model (explicit approval for file system access, network calls, shell commands) exists as it does. It's not over-engineering. It's a recognition that an agent with broad permissions in a production codebase is a significant attack surface if compromised.
## What This Report Is Really Saying
Reading the eight trends together, the message is clear: agentic coding is no longer experimental. The teams that deployed it early are now reporting production metrics, not pilot results. The infrastructure — context management, multi-agent orchestration, MCP, observability — is mature enough for serious workloads.
The McKinsey data the report cites is bracing: 20-40% operating expense reduction and 12-14 point EBITDA margin improvement at AI-centric organizations. Those aren't productivity improvements at the margin — they're structural changes to cost structures.
For individual engineers: the shift is real and it's accelerating. The teams publishing these numbers didn't get there by having their engineers review AI suggestions line by line. They got there by restructuring work around autonomous execution, reserving human judgment for decisions that actually require it.
The report is available in full at [resources.anthropic.com/2026-agentic-coding-trends-report](https://resources.anthropic.com/2026-agentic-coding-trends-report). It's worth your time.
---
**Sources:**
- [2026 Agentic Coding Trends Report — Anthropic](https://resources.anthropic.com/2026-agentic-coding-trends-report)
- [Eight trends defining how software gets built in 2026 — Anthropic Blog](https://claude.com/blog/eight-trends-defining-how-software-gets-built-in-2026)
- [5 Key Trends Shaping Agentic Development in 2026 — The New Stack](https://thenewstack.io/5-key-trends-shaping-agentic-development-in-2026/)
- [Claude Code Changelog — Anthropic Docs](https://code.claude.com/docs/en/changelog)
- [Agentic Workflows for Software Development — McKinsey / QuantumBlack](https://medium.com/quantumblack/agentic-workflows-for-software-development-dc8e64f4a79d)
---
# GPT-5.3-Codex: The First AI Model That Helped Build Itself — and Got a Scary Security Rating
URL: https://sdd.sh/2026/03/gpt-5.3-codex-the-first-ai-model-that-helped-build-itself-and-got-a-scary-security-rating/
Date: 2026-03-27
Updated: 2026-03-27
Tags: OpenAI, GPT-5.3-Codex, cybersecurity, agentic coding, benchmarks
Categories: AI Tools, Industry
Summary: OpenAI's GPT-5.3-Codex was instrumental in creating itself, introduced mid-turn steering for agentic workflows, and became the first OpenAI model rated 'High capability' for cybersecurity — which means it can reliably exploit real vulnerabilities.
OpenAI dropped GPT-5.3-Codex on February 5, 2026 — the same day Anthropic launched Claude Opus 4.6. If you think that timing was accidental, I have a bridge to sell you. The AI coding arms race has officially reached the point where release schedules are synchronized to the news cycle, not to readiness.
But the timing is the least interesting thing about GPT-5.3-Codex. What's actually worth paying attention to: the model helped build itself, it can now be steered mid-task without losing context, and it became the first OpenAI model to earn a "High capability" rating for cybersecurity — which is not the kind of distinction you put on a press release unless you've done some serious soul-searching.
## The "Built Itself" Claim — What It Actually Means
OpenAI's exact language was careful: GPT-5.3-Codex was "the first model that was instrumental in creating itself." Not "built itself." That distinction matters, and OpenAI's PR team knows it.
What actually happened: the Codex team used early versions of the model to debug training runs, manage deployment processes, and evaluate test results during the model's own production pipeline. The model was a productive participant in its own creation — not an autonomous agent bootstrapping itself from scratch, but capable enough to contribute meaningfully to real engineering work on a production system. Its own production system.
That's still significant. It marks a threshold. We've been talking for years about the recursive loop where AI helps train the next generation of AI. GPT-5.3-Codex is the first OpenAI model where that loop was demonstrably closed in production, not just as a research experiment. If you're a senior engineer, you know how hard it is to trust any tool with deployment checks and test evaluation — those are high-stakes, judgment-heavy tasks. The fact that an early version of this model handled them reliably enough to stay in the loop is the real story, not the marketing framing.
## Mid-Turn Steering: Small Feature, Big Deal for Agentic Work
The other capability that deserves attention is mid-turn steering. You can now submit a message to Codex while it's actively working, redirecting its behavior without losing context or forcing a restart.
Available via Settings > General > Follow-up behavior in the Codex app, and supported across the CLI, IDE extension, and Codex Cloud.
This sounds like a minor UX improvement. It isn't. Anyone who has run long agentic tasks knows the pain: you set the agent loose on a complex refactor or a multi-file feature implementation, you check back twenty minutes later, and it has gone in completely the wrong direction. Your options were previously brutal — let it finish and throw away the work, or kill it and start over with a more constrained spec.
Mid-turn steering means you can intervene surgically. Catch the wrong direction early, redirect without context loss, and keep the task moving. The compounding effect on long tasks is real. This is the kind of QoL improvement that doesn't benchmark well but makes the difference between agentic coding being practical and being a frustrating novelty.
## The Cybersecurity Rating: This Is the Part That Should Make You Uncomfortable
GPT-5.3-Codex is the first OpenAI model classified as "High capability for cybersecurity" under OpenAI's Preparedness Framework. That classification is not a marketing badge. It means the model crossed thresholds that OpenAI's own safety team considers significant enough to require documented mitigations.
The numbers: 77.6% on Cybersecurity CTF challenges, and approximately 90% on CVE Bench — a benchmark that tests identification of real-world vulnerabilities in real software. Ninety percent. On real CVEs.
For context on what it couldn't do: it failed at Endpoint Detection and Response evasion, Certificate Authority and DNS hijacking, and exploiting leaked tokens. So there are ceilings. But a model that reliably identifies real vulnerabilities at 90% accuracy is not a toy.
OpenAI's response to the High capability rating is their "most comprehensive cybersecurity safety stack to date" — safety training, automated monitoring, trusted access controls. That's the right answer. It's also the answer you'd expect from any responsible lab that just shipped a model with these capabilities and needs to say something.
The honest take: this is a genuine double-edged sword, and the edge cuts both ways sharply. For defensive security work — code review, vulnerability scanning, automated pen testing, threat modeling — a model at this capability level is extraordinarily useful. If you're a security engineer or a CTO who takes AppSec seriously, GPT-5.3-Codex is probably the most powerful tool you've had access to.
The other edge: a model that scores 90% on real CVE identification represents a meaningful uplift for anyone trying to find and exploit vulnerabilities at scale. The safety stack matters. The trusted access controls matter. And if you're in security, you should be stress-testing those controls, not just reading the press release.
## Benchmark Reality Check
At "xhigh" reasoning effort, the numbers look like this:
- SWE-Bench Pro (Public): 56.8% — up marginally from GPT-5.2-Codex's 56.4%
- Terminal-Bench 2.0: 77.3% — beats GPT-5.4's 75.1%
- OSWorld-Verified: 64.7%
- GDPval (professional tasks): 70.9%
- 25% faster than GPT-5.2-Codex
The SWE-Bench improvement is incremental, which is honest. Anyone still expecting 10-point jumps between model releases hasn't been paying attention — we're in the territory where gains are measured in single digits and speed matters as much as accuracy. The Terminal-Bench lead over GPT-5.4 is the interesting result: it suggests GPT-5.3-Codex's specialization pays off on agentic terminal tasks even after GPT-5.4 absorbed its general coding capabilities.
For the record: GPT-5.4 shipped March 5, absorbed GPT-5.3-Codex's coding capabilities into a general-purpose model, and immediately set a new baseline at 57.7% on SWE-Bench Pro. The specialized variant had a one-month window as the frontier model. That's the current pace.
## What You Should Actually Do With This
If you're running an engineering team that uses AI coding tools, three things follow from this release:
First, mid-turn steering is available now across Codex app, CLI, IDE extension, and Codex Cloud (Windows app launched March 4). If your team runs long agentic tasks and hasn't experimented with steering, the iteration tax you're paying is real.
Second, if you have a security-conscious team or work in a regulated environment, GPT-5.3-Codex's CVE identification capability is worth evaluating seriously as part of your security review workflow. Ninety percent on real CVEs is a tool you can use.
Third, the "built itself" framing is partly marketing and partly genuine milestone. Don't dismiss it entirely because OpenAI's PR team got enthusiastic. The recursive loop is real. Models participating in their own production pipelines is a threshold that has implications for how fast capability curves steepen going forward. Watch that curve.
The Codex Spark variant dropped February 12. GPT-5.4 landed March 5. The pace is not slowing down.
---
**Sources**
- [Introducing GPT-5.3-Codex — OpenAI](https://openai.com/index/introducing-gpt-5-3-codex/)
- [OpenAI GPT-5.3-Codex Warns of Unprecedented Cybersecurity Risks — Fortune](https://fortune.com/2026/02/05/openai-gpt-5-3-codex-warns-unprecedented-cybersecurity-risks/)
- [GPT-5.3-Codex Released: Full Benchmark Results and What's New — Mac Observer](https://www.macobserver.com/news/gpt-5-3-codex-released-full-benchmark-results-and-whats-new/)
- [Codex Changelog — OpenAI Developers](https://developers.openai.com/codex/changelog)
---
# Cursor Composer 2: The Model That Learns to Forget — and Sparked a Controversy
URL: https://sdd.sh/2026/03/cursor-composer-2-the-model-that-learns-to-forget-and-sparked-a-controversy/
Date: 2026-03-27
Updated: 2026-03-27
Tags: Cursor, Kimi K2.5, model training, benchmarks, compaction
Categories: AI Tools, Agentic Workflows
Summary: Cursor's new coding model beats Claude Opus 4.6 on key benchmarks — but the real story is a training breakthrough called compaction-in-the-loop RL, and a transparency controversy that revealed Cursor quietly built it on a Chinese open-source model.
Cursor shipped Composer 2 on March 17, 2026, and buried the lead in two ways. First, they glossed over the genuinely interesting technical contribution — a new training technique that teaches a model to compress its own memory mid-task. Second, they forgot to mention the model is built on Kimi K2.5, a Chinese open-weight model from Moonshot AI. The internet noticed. Elon Musk weighed in. A co-founder apologized.
Let's start with the part that actually matters.
## The Problem With Long-Horizon Tasks
If you've used an AI agent to do anything non-trivial — refactor a large codebase, debug a multi-file regression, run a migration — you've hit the context wall. Agent trajectories get long. Hundreds of turns, tool call results, file contents, intermediate reasoning: it all stacks up, and eventually you blow past whatever the model's context window can hold.
The standard workarounds are bad. Prompted summarization tells the model to compress its context, but it's a bolt-on step with no connection to the task reward. The model has no incentive to preserve what actually matters. Sliding windows just drop old tokens, which is worse. Both approaches cause information loss that compounds over a long trajectory. One forgotten function signature or dropped constraint and the entire task can silently go sideways.
This is the unsolved problem Cursor's research team went after.
## Compaction-in-the-Loop RL
Cursor's solution is conceptually elegant: make the model's summarization behavior part of the reinforcement learning signal itself.
Here is how it works. During RL training, the model runs on a task. When it hits a fixed token-length trigger — 40k or 80k tokens — it pauses, generates a compressed summary of its own context (targeting around 1,000 tokens), and then continues from that summary. This cycle repeats as many times as necessary. At the end, the RL reward covers the complete chain: task completion and every compaction step along the way. If the model summarizes sloppily and loses something critical, it fails the task. Negative reward. The model learns what to keep.
The results Cursor published are worth taking seriously. Compared to prompted summarization: 50% fewer compaction errors, 5x more token-efficient (1k tokens versus 5k), and in a live demo, the model solved a "make-doom-for-mips" Terminal-Bench challenge in 170 turns while compressing over 100k tokens of context without losing the thread.
The deeper implication is the one that should interest you if you care about agent capability ceilings. By training with compaction in the loop, you can train on trajectories that are substantially longer than your maximum context window. The model can learn tasks that require hundreds of sequential actions — the kind of tasks that real software projects actually involve. That's a qualitatively different capability class than what you get from context-window-limited training runs.
This is, as far as I can tell, the first time a commercial coding tool has embedded long-horizon task compression directly into the RL loop rather than treating it as a post-processing afterthought. That matters.
## The Benchmarks
Composer 2 is not a marginal improvement on its predecessor.
On CursorBench, Composer 1.5 scored 44.2. Composer 2 scores 61.3. Claude Opus 4.6 scores 58.2. On Terminal-Bench 2.0, the numbers are 47.9 to 61.7, with Claude Opus 4.6 at 58.0. On SWE-bench Multilingual, it goes from 65.9 to 73.7.
Beating Claude Opus 4.6 on Cursor's own benchmarks while also being 86% cheaper than Composer 1.5 is a real result. Yes, CursorBench is Cursor's benchmark, so apply appropriate skepticism about benchmark overfitting. But Terminal-Bench and SWE-bench Multilingual are external, and the numbers hold there too.
Pricing: Composer 2 Standard is $0.50/M input and $2.50/M output. Composer 2 Fast is $1.50/M input and $7.50/M output. Both come with a 200k context window. Composer 1.5 was considerably more expensive at scale. This is not a "better and costs more" story — the price drop is substantial.
## The Part They Forgot to Mention
Now for the controversy.
Composer 2 is built on Kimi K2.5, a 1-trillion-parameter mixture-of-experts model with 32 billion active parameters, released open-weight by Moonshot AI. Cursor's March 17 research blog described the training methodology in detail. It did not mention Kimi K2.5.
A developer spotted the model identifier `kimi-k2p5-rl-0317-s515-fast` leaking through in API responses. The thread went to 2.6 million views. Elon Musk showed up and wrote, in characteristically helpful fashion: "Yeah, it's Kimi 2.5." Cursor co-founder Aman Sanger acknowledged the oversight: "It was a miss to not mention the Kimi base in our blog from the start." Cursor's VP of Developer Education Lee Robinson subsequently argued that roughly 75% of Composer 2's performance characteristics come from Cursor's additional training rather than the base model.
That framing may be accurate. Compaction-in-the-loop RL is a serious contribution, not a thin wrapper. But the sequence of events — omit the base model, get caught, apologize — is exactly the pattern that erodes trust in a space already prone to benchmark theater and capability overclaiming.
There is also a licensing wrinkle worth noting. Kimi K2.5 uses a modified MIT license that requires companies generating over $20 million per month in revenue to display "Kimi K2.5" in their product UI. Cursor reportedly runs at approximately $160 million per month in revenue and had not done this at the time of the controversy. Whether this gets resolved quietly or becomes a bigger issue remains to be seen.
## What to Make of All This
The technical contribution is real. Compaction-in-the-loop RL is a sensible and apparently effective answer to a genuine agent capability problem, and the benchmark results suggest it works in practice, not just in theory. If you are running long autonomous agent tasks — the kind where context overflow is a constant source of frustration — Composer 2 is worth testing seriously.
The transparency failure is also real. In an era where every AI product is one API probe away from revealing what it actually runs on, the base model is not a minor detail. Developers deploying AI tools in production have a reasonable interest in knowing what they're actually running — for supply chain reasons, for compliance, for geopolitical risk assessment if that's part of your threat model. "We trained on top of it" is not a sufficient answer to "what model does this run on?"
The question this incident sharpens is one the industry hasn't cleanly answered: when a company fine-tunes an open-weight model extensively, what provenance disclosure do users have a right to expect? Cursor's position — that 75% comes from their training — implies they believe substantial fine-tuning changes the disclosure calculus. That's a debatable position, and the debate is now happening in public whether Cursor wanted it to or not.
Composer 2 is good. The compaction technique is worth watching. And the controversy is a useful reminder that "built on top of" and "built from scratch" mean very different things, regardless of how much training you stack on top.
---
## Sources
- Cursor research blog, March 17, 2026: [cursor.com/blog/self-summarization](https://cursor.com/blog/self-summarization)
- Moonshot AI, Kimi K2.5 model release and license
- CursorBench, Terminal-Bench 2.0, SWE-bench Multilingual benchmark results as reported by Cursor
- Aman Sanger (Cursor co-founder) public statement on base model disclosure
- Lee Robinson (Cursor VP of Developer Education) public statement on training contribution breakdown
---
# Claude Code AutoDream: Your AI Agent Finally Sleeps on It
URL: https://sdd.sh/2026/03/claude-code-autodream-your-ai-agent-finally-sleeps-on-it/
Date: 2026-03-27
Updated: 2026-03-27
Tags: Claude Code, Anthropic, memory, agentic coding, AutoDream
Categories: AI Tools, Agentic Workflows
Summary: Anthropic quietly shipped AutoDream — a background memory consolidation system for Claude Code that runs between sessions, prunes stale notes, and fixes conflicting data. Think REM sleep for your coding agent.
Anthropic didn't announce this one. There was no blog post, no TechCrunch headline, no product launch email. AutoDream was discovered by developers poking around Claude Code's internals, reading system prompts they weren't supposed to see. The system prompt says it plainly: *"You are performing a dream — a reflective pass over your memory files."*
That's either the most poetic thing a software tool has ever said, or a sign that someone at Anthropic is having too much fun naming internal features. Either way, the feature is real, it's rolling out now, and it fixes something that's been quietly broken about Claude Code's memory for a while.
## The Problem with AutoMemory
AutoMemory — the predecessor — let Claude Code persist notes across sessions. The idea was right: long-running projects benefit from context that outlasts individual conversations. You don't want to re-explain your architecture every time you open a new session.
The execution had problems.
Memory files decayed. Notes written three weeks ago referenced "yesterday's refactor" with no absolute date. Conflicting decisions piled up without resolution — Claude would have two notes that said opposite things about the same module and no way to know which was current. And memory files grew unbounded, consuming token budget that should have gone to actual code.
The result was a memory system that technically worked but required manual curation to stay useful. Which meant most people didn't curate it, which meant the memory got progressively less reliable over time.
## What AutoDream Does
AutoDream runs as a background sub-agent between sessions — never during active work. It executes a four-phase process:
1. **Orientation** — reads the current memory files and assesses their state
2. **Gather Signal** — identifies what's stale, conflicting, or redundant
3. **Consolidation** — merges fragmented or related notes into coherent entries
4. **Prune and Index** — removes outdated entries and restructures the file for retrieval efficiency
The concrete example that keeps coming up: relative temporal references. AutoDream converts *"Yesterday we decided to use Redis"* into *"On 2026-03-15, we decided to use Redis."* Small change. Enormous difference in how useful that note is three months from now.
It also resolves conflicts. When two memory entries contradict each other, AutoDream determines which is more recent and more likely to be current, then consolidates them into a single authoritative note. No more Claude confidently citing a decision that was reversed six commits ago.
## Why the Sleep Metaphor Is Actually Apt
The human brain consolidates memory during REM sleep — transferring short-term experiences into long-term storage, pruning irrelevant details, strengthening connections between related concepts. This is well-established neuroscience, and it's a legitimately good metaphor for what AutoDream does.
Claude Code's equivalent: you work on a project for an hour, generate a bunch of session notes, and AutoDream runs afterward to decide what's worth keeping, what's stale, and how everything fits together. The agent that starts your next session is working from a cleaner, more accurate picture of your project than the one that ended the last session.
The "between sessions" timing is deliberate. You don't want a background process competing for tokens while you're actively working. And the consolidation benefits from having a complete session to work with rather than interrupting one mid-stream.
## Rollout Status (Honest Assessment)
This is still a staged rollout with some rough edges.
The feature flag is `tengu_onyx_plover` on the server side — not a user-configurable toggle that's universally available yet. The minimum interval between dream runs is 24 hours. It requires at least 5 accumulated sessions before it triggers. Some users have reported that `autoDreamEnabled` shows `true` in their settings and the toggle appears in `/memory`, but running `/dream` manually returns "Unknown skill" — the server-side flag hasn't flipped for them yet.
Anthropic has said public rollout is coming within the week. Given their track record with Claude Code launches, that probably means sometime in the next 72 hours.
To check if you have it: run `/memory` inside a Claude Code session and look for **"Auto-dream: on"** in the selector. You can also add `"auto_dream": true` to `~/.claude/settings.json` — this won't force the feature if your account isn't in the rollout, but it'll enable it automatically when you are.
## What This Changes
The practical impact is larger than the feature description suggests.
Claude Code has been positioning itself as a long-running project collaborator, not just a task-by-task tool. That positioning only works if the memory system is reliable. Unreliable memory means you spend the first few minutes of every session correcting Claude's misconceptions about your codebase. That's the opposite of delegation.
AutoDream is what makes the memory system trustworthy over time. It's the maintenance layer that ensures the context Claude starts with is accurate, not just accumulated. The difference between a junior dev who takes good notes and one who takes good notes *and* reviews them regularly.
This also has implications for context efficiency. Pruned memory files use fewer tokens. On long projects, that compounds — you're not burning context budget on stale information from two months ago.
## The Competitive Gap This Opens
Cursor uses static `.cursorrules` files. Useful for project setup, not designed to evolve. GitHub Copilot has minimal persistent memory. Windsurf has project-level rules but no cross-session learning mechanism.
AutoDream is the first AI coding tool that actively manages its own memory over time. That's a meaningful capability gap, and it compounds — a Claude Code instance that's been working on your project for three months will be qualitatively more useful than a fresh setup, not just quantitatively. The memory gets better as it gets curated.
Whether that gap stays wide depends on how quickly the other tools ship comparable features. But Anthropic has a structural advantage here: Claude Code's memory system was designed as a first-class feature from the start, not retrofitted onto a chat interface. AutoDream is a natural extension of that architecture.
## The Unannounced Launch Tells You Something
Anthropic didn't announce AutoDream because it wasn't ready for a public launch. The staged rollout, the buggy `/dream` command, the server-side flag — these are signs of a team shipping something real before it's polished, which is the right call. Waiting for perfect means shipping late.
But it also reflects something about where Claude Code is in its development cycle: features are landing faster than the communications team can announce them. That's a good problem to have, and it means the changelog is worth reading more carefully than the press releases.
AutoDream will get a proper announcement soon. For now, check `/memory`, enable the flag if you have access, and let your agent sleep.
---
*Sources: [Claude Code AutoDream: Anthropic's New Memory Feature — ClaudeFa.st](https://claudefa.st/blog/guide/mechanics/auto-dream), [Does Claude Code Need Sleep? Inside the Unreleased Auto-dream Feature — DEV Community](https://dev.to/akari_iku/does-claude-code-need-sleep-inside-the-unreleased-auto-dream-feature-2n7m), [What Is Claude Code AutoDream? — MindStudio](https://www.mindstudio.ai/blog/what-is-claude-code-autodream-memory-consolidation-3), [AutoDream GitHub Issue #38461 — anthropics/claude-code](https://github.com/anthropics/claude-code/issues/38461)*
---
# GitHub Copilot Gets Smarter — and Wants Your Code Data
URL: https://sdd.sh/2026/03/github-copilot-gets-smarter-and-wants-your-code-data/
Date: 2026-03-26
Updated: 2026-03-26
Tags: github-copilot, AI-tools, agentic-coding, security, data-privacy
Categories: AI Tools, Industry
Summary: Cross-agent memory, built-in security scanning, Jira integration, and a model picker make Copilot's coding agent genuinely capable. Then GitHub announced it's using your interaction data for training. Here's the full picture.
GitHub has been quietly shipping meaningful improvements to Copilot's coding agent all through early 2026 — cross-agent memory, security scanning baked into the agent's workflow, Jira integration, a model picker. Taken together, they represent a genuine leap in what the coding agent can do.
Then, on March 25, GitHub announced it would start using interaction data from free, Pro, and Pro+ users to train its AI models — effective April 24.
The improvements and the policy change arrived within 24 hours of each other, which is either coincidence or timing. Either way, they're worth looking at together, because the tradeoffs are now explicit.
## What's Actually New
### Cross-Agent Memory
Memory went on by default for Pro and Pro+ users on March 4, 2026. The concept is straightforward: knowledge that the coding agent acquires is stored and shared across sessions, across tools (coding agent, CLI, code review), and across time.
Practically, this means the agent can learn that your test suite is slow, that a particular module has unstable tests, that your team prefers a specific error-handling pattern — and apply that knowledge in future tasks without being told again. Memories are repository-scoped and validated against the current codebase before being applied, so stale knowledge doesn't silently cause problems. They auto-expire after 28 days.
GitHub ran A/B tests on the feature. The results: 7% increase in PR merge rates for coding agent sessions with memory (90% vs. 83% without), and a 2% bump in positive code review feedback. Both results were statistically significant (p < 0.00001). They also tested adversarial conditions — deliberately seeding memory with false information — and found agents caught and corrected contradictions rather than propagating bad data.
For Business and Enterprise plans, memory is off by default and must be enabled in org settings. The likely reason: organizations that require audit trails need to understand what the agent remembers before turning it loose.
### Security Scanning in the Agent Workflow
Since March 18, 2026, the Copilot coding agent runs a security validation layer before opening a pull request — automatically, with no configuration needed. The agent's output goes through:
- **CodeQL scanning**: Static analysis for code vulnerabilities
- **Secret scanning**: Detecting API keys, tokens, and credentials in new code
- **Dependency vulnerability checks**: New packages are checked against the GitHub Advisory Database for malware advisories and CVSS High/Critical CVEs
These run for free, whether or not a team has GitHub Advanced Security. Repository admins can configure which validation tools run from repo settings.
This is significant. One of the legitimate concerns about AI-generated code is that it can introduce security issues that slip past human reviewers — hallucinated API calls that expose data, dependencies pulled from typosquatted package names, credentials hardcoded because the agent didn't know better. Running CodeQL and secret scanning inside the agent's loop, before the PR is even opened, addresses that concern at the right layer.
### Model Picker
The coding agent now lets developers select the model for each task. Faster models for routine work (writing unit tests, renaming variables); more capable models for complex architectural changes. An "Auto" option delegates the choice to GitHub.
The model picker was available to Pro/Pro+ users earlier; it extended to Business and Enterprise users in February 2026. GPT-5.4 was added to the picker on March 5 — GitHub reported it "consistently hits new rates of success in agentic software development."
### Jira Integration
In public preview since March 5, 2026: you can assign Jira issues directly to Copilot, and it will open a draft PR in the corresponding GitHub repository. No context-switching between Atlassian and GitHub. For teams where planning lives in Jira and code lives in GitHub — which is most enterprise teams — this closes a workflow gap that previously required either manual handoffs or custom tooling.
### Self-Review Before Opening PRs
Before the coding agent opens a pull request, it runs Copilot code review against its own changes, incorporates the feedback, and iterates. The PR that lands in your review queue has already gone through a round of automated self-criticism. GitHub still recommends human review, but the signal-to-noise ratio of what reaches humans should improve.
## The Data Policy Change
On March 25, GitHub announced that starting April 24, 2026, it will use interaction data from Copilot Free, Pro, and Pro+ users to train AI models. Interaction data includes: inputs, outputs, code snippets, context, accepted suggestions, chat interactions, and feedback.
Copilot Business, Enterprise, and student tier users are exempt. Data may be shared with Microsoft affiliates but not third-party AI providers. Users can opt out in account settings; prior opt-outs carry over.
GitHub's stated rationale is "more intelligent, context-aware coding assistance," citing improved suggestion acceptance rates from internal testing.
The developer reaction has been predictably polarized. Some frame it straightforwardly: if you're on a free or consumer tier, the product is partly funded by your usage data. That's how consumer software works. Others argue that code is proprietary by default — developers on Pro plans who've written internal tooling, personal projects, or client code didn't sign up to have their work become training data, even if GitHub claims it's non-identifying.
A few things worth noting:
**Business and Enterprise users are exempt.** The policy draws a clear line between individual developers and organizational deployments. Organizations paying for Copilot Business or Enterprise have explicit data protection commitments; individuals on consumer plans do not, or at least not the same ones.
**Opt-out exists but requires action.** Users who care must find the setting and disable it. Most won't notice the change. This is a deliberate design choice.
**The timing is notable.** Releasing major capability improvements — memory, security scanning, Jira integration — and then announcing a data policy change in the same week is a reasonable PR strategy: lead with value, bury the controversy.
## How to Think About the Package
GitHub Copilot's coding agent, in March 2026, is meaningfully more capable than it was six months ago. Cross-agent memory reduces repetitive context-setting. Security scanning addresses a real gap in AI-generated code quality. The model picker gives developers control over cost and quality tradeoffs. Self-review reduces the noise that reaches human reviewers.
The data policy change doesn't negate those improvements. But it does clarify the terms.
If you're building anything sensitive on a Pro or Pro+ plan — client code, proprietary algorithms, internal tooling — you should either opt out, upgrade to a Business plan, or reconsider what you share with the coding agent. Not because GitHub is necessarily doing something malicious with the data, but because "used for AI model training" is a broad category with unclear boundaries, and your code is yours until you decide otherwise.
The improvements are real. The tradeoff is real. Now you can make an informed choice.
---
**Sources:**
- [What's New with GitHub Copilot Coding Agent — GitHub Blog](https://github.blog/ai-and-ml/github-copilot/whats-new-with-github-copilot-coding-agent/)
- [Configure Copilot Coding Agent Validation Tools — GitHub Changelog](https://github.blog/changelog/2026-03-18-configure-copilot-coding-agents-validation-tools/)
- [Copilot Memory Now On by Default — GitHub Changelog](https://github.blog/changelog/2026-03-04-copilot-memory-now-on-by-default-for-pro-and-pro-users-in-public-preview/)
- [Building an Agentic Memory System for GitHub Copilot — GitHub Blog](https://github.blog/ai-and-ml/github-copilot/building-an-agentic-memory-system-for-github-copilot/)
- [GitHub Copilot Coding Agent for Jira — DevOps.com](https://devops.com/github-copilot-coding-agent-for-jira-connects-planning-to-pull-requests-without-leaving-your-workflow/)
- [GitHub to Use Copilot Data for AI Training from April 24 — Roboin](https://roboin.io/article/en/2026/03/26/github-to-use-copilot-data-for-ai-training/)
---
# Cursor Automations: Your IDE Just Became an Always-On Agent
URL: https://sdd.sh/2026/03/cursor-automations-your-ide-just-became-an-always-on-agent/
Date: 2026-03-26
Updated: 2026-03-26
Tags: cursor, automations, agentic-coding, AI-tools, workflows
Categories: AI Tools, Agentic Workflows
Summary: Cursor Automations turns your IDE into a reactive system that writes code, triages bugs, and responds to incidents while you sleep. Here's what it can do — and what it can't yet.
Until now, every AI coding tool shared the same basic premise: you sit at your keyboard, you invoke the AI, you review what it produces. Even the most autonomous agents — Claude Code in auto mode, Devin, GitHub Copilot coding agent — still require a human to start the job.
Cursor just broke that model.
With **Cursor Automations**, announced March 5, 2026, you no longer need to be at your keyboard for work to get done. Automations are always-on agents that wake up in response to external events — a Slack message, a new Linear issue, a merged pull request, a PagerDuty alert — and execute coding tasks end-to-end, autonomously, in cloud sandboxes.
This is a different category of tool. It's not an assistant you invoke. It's infrastructure.
## What Cursor Automations Actually Does
The mental model is simple: pick a trigger, write instructions, point the agent at your codebase.
Supported triggers include:
- **Slack messages** (channel-specific or DM-based)
- **Linear issue creation**
- **GitHub PR opens/pushes**
- **PagerDuty incidents**
- **Custom webhooks**
- **Cron schedules**
When a trigger fires, Cursor spins up a cloud sandbox, executes your instructions using whatever MCP connections and models you've configured, verifies its own output, and completes the task — all without you touching a keyboard. A built-in memory tool lets each agent learn from past runs, building up patterns over time.
The result is a system that behaves less like a coding assistant and more like a background engineering team.
## Four Use Cases That Show the Potential
Cursor's announcement highlighted several templates that illustrate the range of what's possible.
**PR risk classification.** Every time a pull request opens or receives a new push, an automation classifies it: blast radius, complexity, likely reviewers based on contribution history. Low-risk PRs can be auto-approved; higher-risk PRs get specific reviewers assigned. Decisions are logged to Notion and posted to Slack. A task that typically requires a senior engineer's judgment, running automatically in seconds.
**Incident response.** A PagerDuty alert triggers the automation. It pulls in Datadog logs via MCP, cross-references recent commits in the codebase, summarizes its findings in Slack to the on-call engineer, and opens a draft PR with a proposed fix. Before anyone has acknowledged the page, the diagnostic work is already done.
**Bug triage.** A Slack message in your `#bugs` channel triggers an agent. It searches for duplicate issues, creates a Linear ticket, investigates root cause from the codebase, and replies in-thread with what it found. Your bug inbox starts processing itself.
**Engineering dashboard.** A cron job runs every two hours, reads your meeting notes, open GitHub PRs, Jira issues, and Slack mentions, deduplicates across sources, and posts a clean priority dashboard. No more context-switching to piece together your day.
These aren't toy examples. They're the kinds of coordination work that eat hours of senior engineering time every week.
## The Architecture Behind It
Each automation runs in an isolated cloud sandbox, so there's no risk of one automation corrupting another or bleeding into your local environment. The agent uses the MCP connections you've configured — Datadog, Notion, Linear, GitHub — just like you would in an interactive session, but without you present.
The memory tool is notable. Automations can store observations from past runs: "this type of commit usually causes flaky tests in the auth module," or "PRs touching the payments directory need the payments team pinged." Over time, automations become more accurate as they accumulate context that a stateless agent would have to rediscover from scratch every time.
This is the difference between a tool that executes instructions and a system that learns from its environment.
## The Honest Limitations
Cursor Automations is powerful, but it's early.
**IDE lock-in.** Automations only run through Cursor's standalone IDE. If your team is split across VS Code, JetBrains, and Cursor, you can't roll this out uniformly. Teams would need to fully migrate to Cursor — a non-trivial ask for organizations with established tooling.
**Enterprise maturity.** Audit logs, role-based access controls, and compliance documentation are still maturing. For teams in regulated industries or with strict security requirements, the current state may not be enterprise-ready.
**No published pricing.** Cursor hasn't disclosed what Automations costs beyond base Cursor pricing. For teams that want to run dozens or hundreds of automations per day, the economics aren't yet clear.
**Templates, not turnkey.** Cursor has published templates on cursor.com/marketplace, but building automations still requires understanding your own codebase structure, MCP connections, and what instructions the agent needs to succeed. There's real configuration work involved.
## The Elephant in the Room: Claude Code Already Does This
Before you migrate your entire team to Cursor for Automations, it's worth naming what's missing from the narrative: **Anthropic got here first.**
Claude Code has operated as an autonomous, event-driven agent since its launch — available via API, scriptable, triggerable from CI pipelines, webhooks, and cron jobs without requiring anyone to have Cursor installed. The tool that Cursor is now building around IDE lock-in has been infrastructure-first from day one.
The deeper issue is philosophical. Cursor Automations require Cursor. Your team's trigger-driven workflows are now tethered to a specific IDE with opaque pricing, a history of base-model transparency failures, and an enterprise maturity story that's still being written. Claude Code, by contrast, runs anywhere a terminal runs — your laptop, CI, a Lambda function, a GitHub Action. No IDE required. No lock-in.
Cursor's event-driven layer is genuinely useful. But it's a feature bolted onto an IDE. Claude Code is an agent with IDE integration bolted on top, if you want it. That architectural difference matters enormously when you're designing systems that need to outlast whatever tool is fashionable in 12 months.
---
## Why This Matters for How We Think About AI Coding Tools
The evolution of AI coding tools has followed a clear progression:
1. **Autocomplete** (Copilot 2021): AI suggests the next line
2. **Chat** (2023): AI answers questions about your code
3. **Agent** (2024-2025): AI executes multi-step tasks you assign
4. **Autonomous** (2026): AI reacts to events without being asked
Cursor Automations is the first major IDE to reach that fourth stage at scale. Devin 2.0 made similar claims about autonomous operation, but it's a standalone tool rather than infrastructure woven into a development environment.
The frame shift matters. When AI acts only when invoked, engineers stay in control and AI is an accelerator. When AI reacts to events autonomously, engineers become the reviewers of work they didn't initiate — a fundamentally different relationship that requires different trust, different oversight, and different processes.
The PR risk classification example is revealing: an automation that auto-approves low-risk PRs is making a judgment call that historically required a human. That's not just automation; it's delegation of authority.
## Where This Is Heading
Cursor Composer 2, also launched in March, already handles long-horizon tasks better than any prior model — it uses self-summarization to compress context mid-task, enabling multi-hour refactors across hundreds of sequential steps. Pair that capability with event-driven triggers and persistent memory, and you start to see the outline of something more interesting: a background engineering team that never sleeps, escalates to humans only when genuinely uncertain, and improves with every task it completes.
That future isn't fully here yet. But Cursor Automations is the clearest signal yet that it's coming — and faster than most teams are prepared for.
---
**Sources:**
- [Cursor Automations — Official announcement](https://cursor.com/blog/automations)
- [Cursor Automations: The Always-On AI Agents Changing How Engineers Build Software](https://www.adwaitx.com/cursor-automations-ai-coding-agents/)
- [Cursor Launches Automations for Always-On Coding Agents — Tessl](https://tessl.io/blog/cursor-launches-automations-for-always-on-coding-agents/)
- [Cursor Automations Turns Code Review and Ops into Background Tasks — Help Net Security](https://www.helpnetsecurity.com/2026/03/06/cursor-automations-turns-code-review-and-ops-into-background-tasks/)
---
# Cognition Buys Windsurf: The AI Coding Market Is Consolidating
URL: https://sdd.sh/2026/03/cognition-buys-windsurf-the-ai-coding-market-is-consolidating/
Date: 2026-03-25
Updated: 2026-03-25
Tags: Windsurf, Devin, Cognition AI, market analysis, agentic coding, AI tools
Categories: AI Tools, Industry
Summary: Cognition AI — the company behind Devin — acquired Windsurf for roughly $250 million. Combine that with Devin 2.0's 96% price cut and Windsurf's Codemaps, and Cognition is suddenly the most vertically integrated player in agentic coding. Here's what this means for developers.
Two weeks ago, Cognition AI was known for one thing: Devin, the autonomous coding agent that launched with massive hype and then proceeded to underdeliver on benchmarks that were later revealed to be cherry-picked. The company spent 2025 quietly rebuilding its reputation and, more importantly, its product.
Then came March 2026. Cognition dropped two bombs in rapid succession: **Devin 2.0** with a 96% price cut (from $500/month to $20/month), and an **acquisition of Windsurf** for approximately $250 million. When the smoke cleared, Cognition had transformed from a single-product AI agent startup into the most vertically integrated player in agentic software development.
## What Cognition Actually Bought
Windsurf isn't just a code editor. At the time of acquisition, it was:
- The **#1 AI dev tool** in LogRocket's March 2026 power rankings
- Generating **$82M ARR** with over 350 enterprise customers
- Home to **Codemaps**: AI-annotated structural maps of codebases, powered jointly by SWE-1.5 and Claude Sonnet 4.5. Think of Codemaps as a persistent, auto-refreshed mental model of your codebase that any agent can query. It's the kind of context layer that dramatically improves an agent's ability to make changes in large, unfamiliar codebases.
- A team of 210 engineers who have spent the last two years building deeply model-agnostic, IDE-native agentic experiences
The Cascade Agent — Windsurf's autonomous multi-step coding agent with automatic planning mode — is conceptually similar to Devin but architected as an IDE-native experience rather than a standalone product. Acquiring Windsurf gives Cognition both the distribution (daily active IDE users) and the infrastructure (worktrees, parallel agent sessions, Git integration) that Devin lacked.
## Devin 2.0: The Setup to the Acquisition
The timing isn't accidental. Cognition launched Devin 2.0 before closing the Windsurf deal, and the two announcements need to be read together.
Devin 2.0's headline is the pricing: $20/month entry tier, with pay-as-you-go ACUs at $2.25 each (roughly $2.25 per 15 minutes of active agent work). This is an aggressive land-and-expand play — get developers in at a price where the cost-benefit calculus is almost trivially positive, then grow with usage.
But the more interesting features are structural:
- **Interactive Planning**: Before Devin starts working, it drafts a step-by-step plan and invites you to edit it. This addresses the original Devin's biggest failure mode — silently going off in the wrong direction for 30 minutes and returning with confidently wrong code.
- **Devin Wiki**: Auto-generated, continuously-refreshed architecture documentation for your repositories. Agents query the wiki before starting tasks, which should meaningfully reduce the "it doesn't understand our codebase" complaints that plagued Devin 1.
- **Devin Search**: Agentic codebase exploration with cited answers. Ask "why is auth handled this way?" and get an answer with line references, not a hallucinated guess.
These three features, combined with Windsurf's Codemaps, suggest Cognition is building toward a unified codebase-understanding layer that powers everything — IDE autocomplete, autonomous agents, documentation, code review.
## What This Means for the Market
The AI coding tool landscape has been converging on two models: **IDE plugins** (Copilot, Cursor, Windsurf) and **autonomous agents** (Devin, OpenAI Codex, Claude Code). The implicit assumption was that these would stay separate — agents for long-horizon tasks, IDE tools for moment-to-moment coding.
Cognition is betting that assumption is wrong. By owning Windsurf, they can collapse the distinction: Devin handles the autonomous heavy lifting while Windsurf's Cascade Agent handles the IDE-integrated work, and Codemaps provides the shared codebase understanding layer underneath both.
The competitive implications are significant:
**For Cursor**: Cursor's $2B+ ARR and 1M daily users make it the incumbent in the IDE segment, but Cognition now has a story that Cursor doesn't — a vertically integrated stack from autocomplete to autonomous agent, all backed by the same codebase understanding layer. Cursor's response (Automations, the Composer 2 model, JetBrains support) suggests it's aware of the threat.
**For GitHub Copilot**: Microsoft has the distribution advantage — every GitHub user is a potential Copilot customer — but the product is still catching up on agentic features. The Cognition acquisition raises the bar for what "agentic IDE" means, and Copilot will need to respond.
**For Anthropic**: Claude powers Windsurf (along with other models), and Claude Code is a direct competitor to Devin. The acquisition creates an interesting dynamic where Anthropic is simultaneously a supplier (Windsurf uses Claude Sonnet 4.5 for Codemaps) and a competitor. Whether Cognition will de-emphasize Claude in Windsurf post-acquisition is the question Anthropic's business team is certainly thinking about.
## The Price War Is Underway
Devin's price collapse from $500 to $20 is a shot across the bow. OpenAI Codex (included in ChatGPT Plus) and Claude Code (included in the Anthropic Pro plan) have set a precedent that capable autonomous agents should cost less than $25/month for individual developers. Devin was the outlier, and Devin 2.0 corrects that.
The race to the bottom on per-seat pricing is likely to continue. The real monetization battle will be at the enterprise tier — large organizations that need thousands of ACUs, custom security controls, audit logs, and SLAs. Devin's $500/month Team plan and custom Enterprise tier are the real revenue targets. The $20 plan is the funnel.
For developers, this is a good problem to have. The cost of access to a capable autonomous coding agent has dropped by 96% in one product cycle. Even if Devin 2.0's real-world performance is closer to the skeptical external evaluations (3 of 20 tasks completed) than Cognition's internal benchmarks (83% more tasks per ACU), the price now makes it worth finding out for yourself.
## What to Watch
Three things will determine whether Cognition's bet pays off:
1. **Integration**: Whether Windsurf and Devin share actual infrastructure (Codemaps, planning, codebase search) or remain separate products with a common parent company. The vision is compelling; execution is what matters.
2. **Model independence**: Windsurf's strength has been its model-agnostic approach — support for Claude, GPT-5, Gemini, and its own SWE models. If Cognition pushes Devin-only defaults, users will notice.
3. **Real-world performance**: The gap between Devin's benchmarks and independent evaluations needs to close. The architecture improvements (Interactive Planning, Devin Wiki) address the right problems. Whether they translate to meaningful task completion improvements on messy, real-world codebases is the question to answer over the next quarter.
The AI coding market entered 2026 as a field of independent experiments. It's leaving Q1 with its first clear consolidation move. More will follow.
---
*Sources: [Cognition Devin 2.0 announcement](https://cognition.ai/blog/devin-2), [VentureBeat — Devin 2.0 pricing](https://venturebeat.com/programming-development/devin-2-0-is-here-cognition-slashes-price-of-ai-software-engineer-to-20-per-month-from-500), [Cognition Codemaps](https://cognition.ai/blog/codemaps), [LogRocket AI dev tool rankings March 2026](https://blog.logrocket.com/ai-dev-tool-power-rankings/), [VentureBeat — Cursor Composer 2](https://venturebeat.com/technology/cursors-new-coding-model-composer-2-is-here-it-beats-claude-opus-4-6-but)*
---
# Claude Code Auto Mode: Anthropic Hands AI More Control (But Keeps It on a Leash)
URL: https://sdd.sh/2026/03/claude-code-auto-mode-anthropic-hands-ai-more-control-but-keeps-it-on-a-leash/
Date: 2026-03-25
Updated: 2026-03-25
Tags: Claude Code, Anthropic, agentic coding, AI safety, autonomous agents
Categories: AI Tools, Agentic Workflows
Summary: Auto Mode lets Claude decide which actions are safe to take without asking permission — but adds an AI safety layer that screens every action for prompt injection and risky behavior. Here's what changed and why it matters.
For the past year, Claude Code's permission model has been a speed bump. Every time the agent wanted to run a shell command, write a file outside the working directory, or call an external service, it stopped and asked. This is correct behavior — you want to know what your AI is doing. But it also turns every autonomous workflow into a game of Whack-a-Mole, approving the same classes of actions over and over.
Anthropic's answer, launched as a research preview on March 24, is **Auto Mode**: a new execution tier that lets Claude decide which actions are safe to take unilaterally, while adding an AI safety layer that screens each action before it executes.
## What Auto Mode Actually Does
Before Auto Mode, Claude Code offered two options: the default confirmation-for-everything mode, or `--dangerously-skip-permissions`, which bypasses all permission checks and trusts you to know what you're doing. The name contains its own warning label.
Auto Mode sits between these two extremes. When enabled, Claude evaluates each action it wants to take and classifies it as either safe-to-proceed or requires-confirmation. The classification is based on the type of action and the context of the current task — running tests in the current project directory is different from deleting files or making outbound network requests to unrecognized endpoints.
What's new is the safety layer: before any action executes, a secondary AI model screens it for two specific threat vectors:
- **Risky behavior**: actions that could cause irreversible side effects, exfiltrate data, or exceed the scope of the assigned task
- **Prompt injection**: attempts by content in the environment (a malicious README, a crafted code comment, a poisoned API response) to hijack Claude's action stream and redirect it toward attacker-controlled goals
This is a meaningful architectural change. Previously, prompt injection resistance was baked into Claude's training — it was resilient but not impervious. The safety layer adds a second, independent model checkpoint explicitly tuned to catch injection patterns, running on every action in the pipeline.
## Why This Matters for Agentic Workflows
The practical upshot is that you can now run longer autonomous sessions without babysitting the terminal.
If you assign Claude Code a task like "add pagination to the user list endpoint, write tests, and open a PR," Auto Mode lets it work through the entire chain — reading files, editing code, running the test suite, committing, pushing, creating the PR — with human-in-the-loop only when it hits something genuinely ambiguous or risky. No more approving `git add .` for the fifteenth time.
This also makes Claude Code more composable. Channels (the Telegram/Discord integration launched five days earlier) is much more useful when the agent can complete tasks end-to-end while you're away from your desk. Before Auto Mode, a Channels task would immediately stall waiting for permission approvals that nobody was around to give.
## The Tradeoff: Trust and Verification
Auto Mode is still a research preview, and Anthropic is being deliberate about not calling it production-ready. There are two honest tradeoffs to acknowledge.
**You're trusting two AI models instead of one.** The safety layer helps, but it is itself a model with failure modes. A sufficiently crafted injection or an edge-case action classification could slip through. Anthropic's approach here mirrors defense-in-depth in traditional security: no single control is perfect, but layering independent checks raises the cost of exploitation significantly.
**The permission model shifts from explicit to probabilistic.** With default confirmations, you know exactly which actions were human-approved. With Auto Mode, you're trusting Claude's judgment about what's safe, augmented by the safety layer. For most developer workflows this is acceptable — arguably more reliable than a human clicking "yes" for the hundredth time without reading. For regulated environments or codebases with strict audit requirements, the explicit model is still the right call.
The smart play is to use Auto Mode for clearly bounded tasks (feature work, test generation, documentation) and switch back to explicit confirmations for anything touching infrastructure, secrets, or production systems.
## How to Enable It
Auto Mode is currently available as a research preview. You enable it with the `--auto` flag:
```bash
claude --auto "Add input validation to the signup form and write unit tests"
```
Or toggle it from within an existing Claude Code session with `/auto`. The safety layer is always active when Auto Mode is enabled — there's no option to run Auto Mode without it, which is the right call.
Anthropic has also added a new permission tier, `cautious`, that sits below the default confirmation behavior. Cautious mode confirms everything, including actions that would normally be auto-approved. Useful for onboarding new agents or working in unfamiliar codebases where you want maximum visibility.
## The Bigger Picture
Auto Mode isn't just a convenience feature — it's Anthropic's thesis on how autonomous AI should be deployed. The pattern (capable agent + independent safety verifier + explicit escalation path) is the same architecture you'd design for any high-stakes autonomous system. It's closer to how autopilot works on commercial aircraft than how most people imagine AI agents.
The Anthropic 2026 Agentic Coding Trends Report frames this as the shift from "AI-assisted development" to "AI-delegated development." The distinction matters: assistance implies the human stays in the loop. Delegation implies the human sets the goal and reviews the output, with the agent handling everything in between.
Auto Mode is designed for the delegation model. Whether the rest of the ecosystem — CI systems, code review processes, organizational trust models — is ready to treat AI-generated work as delegated rather than assisted is a separate question. But Anthropic is clearly betting that the answer will be yes, and building the technical infrastructure ahead of that shift.
For now: enable it on bounded tasks, watch the safety layer logs, and form your own opinion. That's what research previews are for.
---
*Sources: [TechCrunch — Anthropic hands Claude Code more control](https://techcrunch.com/2026/03/24/anthropic-hands-claude-code-more-control-but-keeps-it-on-a-leash/), [Claude Code release notes](https://docs.anthropic.com/en/release-notes/claude-code), [Anthropic 2026 Agentic Coding Trends Report](https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf)*
---
# Xcode 26.3: Apple Goes All-In on Agentic Coding
URL: https://sdd.sh/2026/03/xcode-26.3-apple-goes-all-in-on-agentic-coding/
Date: 2026-03-24
Updated: 2026-03-24
Tags: Xcode, Apple, Claude Code, MCP, agentic coding, iOS development
Categories: AI Tools, Agentic Workflows
Summary: Apple's mid-cycle Xcode 26.3 release isn't a minor patch — it's a bet-the-ecosystem move that bakes Claude Agent and OpenAI Codex directly into the IDE. Here's what changed, what it means for iOS and Mac developers, and why MCP is the most important detail in the release notes.
Apple doesn't do mid-cycle drops for minor things. When the company shipped Xcode 26.3 on February 3rd, 2026 — months before WWDC — it was a signal. Agentic coding isn't experimental anymore. It's now officially part of Apple's development stack.
This is the release where your IDE stops being a text editor and starts being a collaborator.
## What Actually Shipped
Xcode 26.3 introduces native support for **coding agents**: AI systems that don't just suggest code but actively execute multi-step tasks with minimal supervision. Apple co-designed integrations with two specific agents — Anthropic's Claude Agent and OpenAI's Codex — but the more interesting move is the underlying architecture.
Under the hood, everything runs on the **Model Context Protocol (MCP)**, the open standard that Anthropic introduced in 2024. Xcode exposes 20 MCP tools, and a new command-line bridge called `xcrun mcpbridge` acts as a translator between MCP and Xcode's internal XPC communication layer. The practical upshot: any MCP-compliant agent can theoretically drive Xcode — not just Claude or Codex.
What can agents do once connected?
- **Browse documentation** — the entire Apple developer documentation corpus is accessible
- **Explore file structures** — navigate and reason about project layout
- **Update project settings** — change build configurations, add capabilities, modify entitlements
- **Build and iterate** — trigger builds, read compile errors, fix them, rebuild
- **Capture SwiftUI Previews** — take screenshots of running previews and verify visual output
- **Run tests** — execute test suites and interpret results
Apple's demo showed an engineer typing "add a feature to show the weather at a landmark." The agent analyzed the project structure, consulted documentation, wrote the code, built the project, captured a screenshot of the preview, and iterated through a bug without human intervention. That's not autocomplete. That's delegation.
## Why This Is a Big Deal for Apple Developers
Before Xcode 26.3, Apple's intelligence features were limited to code completion and basic chat. Agents could answer questions but couldn't take action. They were advisors, not contributors.
The 26.3 release changes the interaction model entirely. The agent now has **access to Xcode's build system** — the same system Apple developers have been wrestling with for decades. It can read linker errors, understand framework imports, navigate Swift Package dependencies, and iterate without being hand-held.
For iOS developers specifically, this is significant: the SwiftUI Preview capture capability means agents can verify visual output without a simulator — catching layout bugs or constraint issues that only manifest at runtime. It's not perfect, but it's a meaningful feedback loop that didn't exist before.
## MCP Is the Real Story
Claude Agent and Codex get the marketing copy, but **MCP is the architectural bet that matters**.
By building Xcode 26.3 around an open protocol, Apple effectively solved the AI tool fragmentation problem in their own ecosystem. The question "which AI tool works with Xcode?" now has a clear answer: any of them, as long as they speak MCP.
This is the same open-ecosystem playbook that made USB-C inevitable. When the platform holder adopts an open standard rather than a proprietary one, the standard wins. MCP was already gaining momentum through Anthropic, OpenAI, Google, and Microsoft adoption. Apple's integration makes it the default protocol for IDE-to-agent communication in one of the world's most-used developer tools.
The `xcrun mcpbridge` tool is worth examining closely:
```bash
# Connect Claude Code CLI to Xcode via MCP
xcrun mcpbridge --connect claude-code
```
This lets external tools — including the terminal-based Claude Code or Cursor — drive Xcode's build system remotely. You can use your preferred AI agent in your preferred environment and still get full access to Xcode's build infrastructure. Apple didn't try to own the agent layer. They just owned the IDE integration.
## Known Limitations
The 26.3 release has rough edges, and Apple was transparent about them.
**Hardware requirements are strict.** Agentic coding requires macOS 26 Tahoe on Apple Silicon. Intel Mac users are completely locked out — no workaround, no fallback. Given Apple's pace of Intel deprecation, this isn't surprising, but it's still a hard line.
**The MCP spec compliance issue.** There's a known bug in the 26.3 RC: `mcpbridge` returns data in the `content` field but not in `structuredContent`, which the MCP specification requires. Claude Agent and Codex handle this gracefully because their integrations were co-designed with Apple. Cursor follows the spec strictly and rejects non-compliant responses. This is Apple's bug to fix — it's expected in a future point release.
**Agents still need supervision on destructive operations.** File creation, project configuration changes, and anything that modifies build settings all still surface for human approval by default. This is the right call, but it does interrupt fully autonomous runs on complex refactors.
## The Competitive Context
Apple's move doesn't happen in a vacuum. Cursor, VS Code with Copilot, and Claude Code have all been maturing their agentic capabilities. But those tools exist outside Apple's first-party ecosystem. Xcode 26.3 changes the stakes for iOS and Mac development specifically.
Before this release, a developer wanting agentic capabilities had to choose between the best AI tooling (often outside Xcode) and native IDE features (exclusively in Xcode). The 26.3 integration collapses that choice. You can run Claude Agent or Codex inside Xcode and get both.
Cursor is the most affected competitor. Its strict MCP spec compliance means the Apple/Cursor integration is currently broken due to the `mcpbridge` bug — and that's likely temporary. But it illustrates how Apple being in the agent game changes the negotiating dynamics. When Apple ships a bug, third-party tools either have to work around it or wait.
## What to Expect at WWDC 2026
Apple described 26.3 as laying the foundation for deeper agentic integration. WWDC 2026 in June will almost certainly bring:
- First-party Apple Intelligence agents with Xcode access
- Expanded MCP tool surface (20 tools is a starting point)
- Possible on-device inference for privacy-sensitive operations
- Deeper integration with TestFlight and App Store submission pipelines
The mid-cycle 26.3 release was the proof of concept. WWDC will be the product vision.
## Bottom Line
Xcode 26.3 is the most significant change to Apple's developer toolchain since Swift. It's not a feature — it's a shift in what an IDE does.
The MCP foundation means this isn't locked to Claude or Codex. Any capable agent that speaks the protocol can drive Xcode. That's a more interesting long-term outcome than any specific AI integration: Apple just standardized how agents interface with Apple platform development.
For iOS and Mac developers, the practical advice is simple: if you're on Apple Silicon running macOS 26 Tahoe, start experimenting now. The rough edges are real, but the workflow changes are already significant. By the time WWDC ships the polished version, you'll want experience with the model.
The developers who figure out how to effectively delegate to agents inside Xcode will have a compounding productivity advantage over those who wait for the feature to mature.
---
*Sources: [Apple Newsroom — Xcode 26.3 release](https://www.apple.com/newsroom/2026/02/xcode-26-point-3-unlocks-the-power-of-agentic-coding/), [9to5Mac coverage](https://9to5mac.com/2026/02/26/apple-releases-xcode-26-3-with-support-for-agentic-coding/), [Awesome Agents teardown](https://awesomeagents.ai/news/xcode-26-3-agentic-coding-teardown/), [DEV Community deep dive](https://dev.to/arshtechpro/xcode-263-use-ai-agents-from-cursor-claude-code-beyond-4dmi)*
---
# Claude Code Channels: Your AI Agent, Now on Telegram and Discord
URL: https://sdd.sh/2026/03/claude-code-channels-your-ai-agent-now-on-telegram-and-discord/
Date: 2026-03-24
Updated: 2026-03-24
Tags: Claude Code, Telegram, Discord, MCP, async, agentic coding, remote
Categories: AI Tools, Agentic Workflows
Summary: Anthropic shipped Claude Code Channels on March 20, letting you message Claude Code directly from Telegram or Discord. The real story isn't convenience — it's the shift from synchronous IDE sessions to asynchronous agent partnerships, and what that means for how you work.
Until last week, using Claude Code required you to be at your computer. You opened a terminal, started a session, typed prompts, watched it work. The interaction model was synchronous: you wait, it executes, you review, you continue.
Claude Code Channels, announced March 20 as a research preview, breaks that model. Now you can fire off a task from your phone, let Claude Code run it on your machine while you're in a meeting, and get the result delivered back to your Telegram or Discord. The terminal session can keep running without you watching it.
This isn't a gimmick. It changes the fundamental rhythm of how you work with an AI agent.
## What Channels Actually Does
The feature is architecturally straightforward: Claude Code acts as a bot in your messaging app of choice. You send it a message, it executes the task in your local development environment, and it replies in the same thread.
Under the hood, it's built on MCP. The channel plugin runs locally on your machine, polling the messaging platform's API. There's no inbound port opened on your machine, no webhook exposed to the public internet, no reverse proxy required. The security model is clean: your machine reaches out; nothing reaches in.
Setup for Telegram:
1. Create a bot via Telegram's BotFather
2. Run `claude --channels` to start Claude Code with channel support
3. Run `/telegram:configure` in Claude Code to link your bot token
4. Pair your Telegram account with a security code
The pairing step locks the bot to your specific Telegram user ID. Messages from anyone else — even if they find the bot — are silently dropped. This matters on shared servers or if you accidentally give a colleague the bot username.
Discord setup follows the same pattern through the Discord Developer Portal. Both integrations support sending images back through the chat, which turns out to be genuinely useful for debugging visual output.
## The Async Workflow Shift
The headline use case is "message your agent while away from your desk." That's real and useful. But the deeper change is psychological.
Synchronous AI coding sessions have a specific pressure to them. You watch the agent work. When it pauses for approval, you approve. When it finishes a step, you assess and continue. It demands your attention.
Async changes the relationship. You delegate, then context-switch. The agent works while you do something else. You check results when you're ready, not when the agent is ready. This is closer to how you'd work with a human colleague on an async team than how you'd pair-program with someone sitting next to you.
One developer described their first Channels session in [a MacStories write-up](https://www.macstories.net/stories/first-look-hands-on-with-claude-codes-new-telegram-and-discord-integrations/): they kicked off building and deploying an iOS project to their iPhone over wireless, then made coffee. The build ran, the agent fixed two compile errors unprompted, and by the time they sat back down, the app was running on the device. "It felt like I had a junior developer working while I was on a break."
That's not hyperbole — that's a different mental model of what your AI agent is for.
## Permission Handling at a Distance
The obvious question: what happens when Claude Code needs approval to run something while you're not watching?
By default, Channels will pause and message you for approval, the same as the terminal does. Claude running a `rm -rf` on a directory it shouldn't touch will stop and ask — you'll get a Telegram message, tap approve or deny, and it continues.
Channels also has a **permission relay capability** for MCP servers that declare it: approval prompts can be forwarded to your phone as interactive messages. If Claude needs to run a destructive operation and you're halfway through a meal, you can approve or reject it from your phone and it keeps working.
This is more useful than it sounds. The alternative — auto-approving everything — is how you end up with agents that make irreversible mistakes. The permission relay lets you maintain oversight without being physically present.
## What You Can Send Besides Text
Channels isn't just a text bridge. You can send images to Claude Code via Telegram, which opens up some genuinely practical workflows:
- Screenshot a UI bug on your iPhone, send it to the agent, ask it to fix the layout
- Photograph a whiteboard diagram and ask the agent to scaffold the data model
- Send a screenshot of an error dialog from a QA device and ask for the fix
Voice messaging isn't supported yet, which is a notable gap — Telegram's voice note feature is heavily used, and it would be a natural way to dictate tasks while commuting. Based on the official documentation, this is on the roadmap.
## Fakechat for Local Testing
Before connecting a real messaging account, Channels ships with a mode called **Fakechat** — an officially supported demo channel that runs a chat UI on localhost. No authentication, no external service, nothing to configure. You install it, enable it, and get a browser tab that acts as a message interface to your Claude Code session.
This is smart product design. It lets you test Channels behavior, experiment with workflows, and verify your setup without creating a Telegram bot or joining a Discord server. Once you understand the interaction model, hooking up a real messaging account is straightforward.
## Current Limitations Worth Knowing
Channels is a research preview, and the limitations are real:
**Requires Claude Code 2.1.80 or later and the Bun runtime.** If you're not on Bun, you need to install it first.
**Only works with claude.ai accounts, not API keys.** If your workflow uses a direct API key without a claude.ai account, Channels won't work for you. This is likely a temporary restriction given the account linking required for the pairing system.
**No persistent background mode.** A terminal session has to stay open. You can't fully close your laptop and expect the agent to keep running. This limits the "fire and forget" use case — it's more "fire and go make coffee" than "fire and come back tomorrow."
**Platform gaps.** Telegram has no message history API, so if you switch devices you lose thread context. Discord requires more setup steps than Telegram. There's no Slack support yet, which will frustrate enterprise developers — though the plugin architecture suggests it's coming.
## The Bigger Picture
Claude Code Channels isn't an isolated feature. It's part of a clear direction Anthropic is building toward: agents that are available where you are, not just where your computer is.
The async interaction model — delegate, context-switch, receive results — is what makes AI agents genuinely additive to a developer's workflow rather than just a faster version of what they were already doing. When you have to be present for every step, the agent is a fancy autocomplete. When you can delegate and return, it's closer to having a developer working in parallel.
The MCP foundation means the channel architecture is extensible. Slack, WhatsApp, and iMessage have already been requested by the community, and the documentation explicitly signals that community-built channels are part of the plan. The Fakechat implementation shows they've thought carefully about how developers test and adopt new channel integrations.
Whether Telegram and Discord become permanent fixtures of how developers interact with AI agents, or just the first attempt at a model that evolves into something different, isn't clear yet. What is clear: the synchronous terminal session is no longer the only way to work with Claude Code, and that's the right direction.
If you're a Claude Code user, the setup takes about ten minutes. The experience of getting a Telegram message back from your own development environment after sending a task is strange the first time. By the third time, it feels obvious.
---
*Sources: [Anthropic — Claude Code Channels docs](https://code.claude.com/docs/en/channels), [VentureBeat announcement](https://venturebeat.com/orchestration/anthropic-just-shipped-an-openclaw-killer-called-claude-code-channels), [MacStories hands-on](https://www.macstories.net/stories/first-look-hands-on-with-claude-codes-new-telegram-and-discord-integrations/), [DEV Community technical breakdown](https://dev.to/ji_ai/claude-code-channels-how-anthropic-built-a-two-way-bridge-between-telegram-and-your-terminal-2dpn), [Anthropic techbuddies.io coverage](https://www.techbuddies.io/2026/03/21/anthropics-claude-code-channels-bring-always-on-ai-coding-to-telegram-and-discord/)*
---
# MCP's 2026 Roadmap: From Prototype Protocol to Production Standard
URL: https://sdd.sh/2026/03/mcps-2026-roadmap-from-prototype-protocol-to-production-standard/
Date: 2026-03-23
Updated: 2026-03-23
Tags: MCP, Model Context Protocol, agentic-coding, AI tools, standards, enterprise
Categories: AI Tools, Agentic Workflows
Summary: The MCP 2026 roadmap published by lead maintainer David Soria Parra reveals a protocol growing up fast — shifting from milestone releases to working groups, tackling stateless transport, enterprise auth, and governance maturity. Here's what's actually changing and why it matters for developers building on MCP today.
When Anthropic introduced the Model Context Protocol in November 2024, most developers were skeptical. Another standard? The graveyard of AI integration specs was already crowded. Sixteen months later, MCP is natively supported by Anthropic, OpenAI, Google, and Microsoft, ships in millions of daily active developer tool sessions, and has just published a serious 2026 roadmap from its lead maintainer.
The roadmap — written by David Soria Parra and published March 9, 2026 — reads less like a product announcement and more like a protocol growing up. Here's what it says and what you should care about.
## Why the Roadmap Format Changed
Previous MCP roadmaps organized work around release milestones: version X will include Y. That framing made sense when a small Anthropic team owned the entire spec. It no longer fits.
MCP is now a multi-company open standard under the Linux Foundation. The community has grown large enough that no small group of core maintainers can realistically review every proposed spec change — and organizing work around milestones that one team controls doesn't scale.
The 2026 roadmap switches to **Working Groups and Interest Groups** as the primary vehicle for protocol development. Instead of "version 2.0 ships in Q2," you get "the Transport WG owns Streamable HTTP improvements and drives its own timeline." SEPs (Spec Enhancement Proposals) in the four priority areas get expedited review. Everything else goes into the queue.
This is how successful open standards work. It's a meaningful structural maturation.
## The Four Priority Areas
Core maintainers ranked candidate areas and landed on four that will receive the most attention and the fastest SEP reviews in 2026.
### 1. Transport Scalability
Streamable HTTP gave MCP a production-ready transport — the right call to move away from the original SSE-only approach. But running it at scale has surfaced real gaps.
The current spec assumes you can maintain a session with a single server instance. The moment you put a load balancer in front of your MCP server, things break in subtle ways. The roadmap proposes evolving Streamable HTTP to run **statelessly across multiple server instances** with proper session creation, resumption, and migration semantics.
Alongside that, the working group is designing **MCP Server Cards** — a standard for exposing server metadata at a `.well-known` URL so registries can discover what a server does without actually connecting to it. Think of it as a capabilities manifest: what tools does this server expose, what scopes does it need, what transports does it support.
For teams deploying MCP servers in production today, stateless transport isn't optional. This work is overdue.
### 2. Agent Communication (Tasks)
MCP currently lets clients start asynchronous Tasks — fire-and-forget agent jobs. But early production deployments have revealed that the lifecycle semantics are underspecified.
What happens when a task fails? When should a client retry? How long are completed results retained before expiration? The current spec leaves these as implementation details, which means every team is making up their own answers — and interoperability breaks.
The roadmap proposes clearer **lifecycle rules** for tasks: defined retry behavior for failed jobs, explicit retention policies for completed results. This sounds like plumbing. It is plumbing. It's also exactly the kind of thing that separates a toy protocol from one you'd bet production systems on.
If you're building agentic workflows where a Claude Code session hands off tasks to background MCP agents, this directly affects you.
### 3. Governance Maturation
This is the least flashy priority area and arguably the most important for the long-term health of the protocol.
Right now, **every SEP must be reviewed by the full group of core maintainers** regardless of what it affects. A change to the TypeScript SDK's error handling and a change to the core spec transport layer go through the same full-council review. That doesn't scale — maintainer time gets spread thin, and contributors outside the core group have no clear path to taking on more responsibility.
Two SEPs are already in flight: a **Contributor Ladder SEP** that defines progression from community participant to WG lead to core maintainer, and a **delegation model** that lets Working Groups with a proven track record accept SEPs within their own domain without requiring a full core-maintainer vote.
The goal is a protocol that can continue growing without creating a bottleneck at the top. For developers contributing to MCP tooling, this means your SEPs may move significantly faster once WG delegation is in place.
### 4. Enterprise Readiness
Enterprises are deploying MCP at scale and hitting walls the spec doesn't address.
The three specific gaps called out in the roadmap:
- **Audit trails and observability**: Who called what tool, when, with what arguments? Enterprises need logs that satisfy compliance requirements. MCP currently has no standard way to emit or structure this data.
- **Enterprise-managed auth**: SSO-integrated flows where identity comes from the organization's IdP, not a developer-managed OAuth app. The current auth model works fine for individual developers; it's the wrong shape for enterprise IT.
- **Gateway and proxy patterns**: Many enterprises route all external API traffic through security gateways. MCP needs defined semantics for how an intermediary sits between a client and server without breaking the protocol.
Importantly, the roadmap notes that most enterprise readiness work will land as **extensions rather than core spec changes**. Enterprise needs are real, but the solution shouldn't make the base protocol heavier for everyone who doesn't need it. Smart call.
## SDK Ecosystem Status
The roadmap doesn't exist in isolation from the SDK implementations. As of March 2026:
- **TypeScript SDK v1.27.1** is the reference implementation with the most active conformance work. If you're building MCP servers, start here.
- **Python SDK v1.26** is tracking close behind. The gap between TypeScript and Python has narrowed significantly since 2025.
- **OpenAI Agents SDK v0.12.5** introduced MCP retry and error normalization — essentially implementing the task lifecycle rules the spec is still formalizing.
- **Google ADK v2.0 pre-release** added a structured Task API for agent-to-agent delegation.
The multi-vendor SDK convergence is one of MCP's strongest signals. When OpenAI's own agent SDK ships MCP support and invests in conformance, the "this is just an Anthropic thing" concern disappears.
## What This Means If You're Building on MCP Today
If you're integrating MCP into your tools now, a few practical implications:
**Build with Streamable HTTP.** SSE-based transport is legacy. The entire roadmap's transport work is anchored on Streamable HTTP. If you're starting a new MCP server, don't touch SSE.
**Don't design around session stickiness.** The stateless transport work is coming, and servers that assume a single-instance session will need to be refactored. Design for statelessness now.
**Expect the TypeScript SDK to move fastest.** It's the reference implementation. New features land there first. If you're writing Python and something feels unsupported, check whether it's in the TypeScript SDK yet.
**Watch the Governance WG.** Once the delegation model ships, SEP review times for targeted spec areas should drop significantly. If you have protocol proposals in flight, timing a submission for after delegation is in place may get you faster review.
## The Bigger Picture
MCP moved from zero to multi-company open standard in roughly 16 months. The 2026 roadmap reads like a protocol that's survived contact with real production workloads and is now doing the less glamorous work of growing up: governance, observability, enterprise auth, stateless transport.
The protocols that win aren't always the most elegant. They're the ones that show up for the boring production-readiness work after the initial hype fades. MCP appears to be doing exactly that.
---
**Sources:**
- [The 2026 MCP Roadmap — Model Context Protocol Blog](http://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/)
- [Roadmap — Model Context Protocol (official)](https://modelcontextprotocol.io/development/roadmap)
- [MCP's biggest growing pains for production use will soon be solved — The New Stack](https://thenewstack.io/model-context-protocol-roadmap-2026/)
- [MCP Ecosystem in 2026: What the v1.27 Release Actually Tells Us — Context Studios](https://www.contextstudios.ai/blog/mcp-ecosystem-in-2026-what-the-v127-release-actually-tells-us)
---
# Cursor vs. Copilot vs. Claude Code vs. Windsurf vs. Grok Build: Which AI Coding Tool Wins in 2026?
URL: https://sdd.sh/2026/03/cursor-vs.-copilot-vs.-claude-code-vs.-windsurf-vs.-grok-build-which-ai-coding-tool-wins-in-2026/
Date: 2026-03-23
Updated: 2026-05-16
Tags: Claude Code, Cursor, GitHub Copilot, Windsurf, Grok Build, xAI, AI Tools, Comparison
Categories: AI Tools, Guides
Summary: Five serious contenders, five distinct philosophies. Here's a no-nonsense breakdown of the AI coding tool landscape in 2026 — with real pricing, real benchmarks, and a decision framework that actually helps you choose. Updated May 16 with xAI Grok Build (launched May 14): 8 parallel agents, Arena Mode, local-first privacy, 70.8% SWE-bench Verified.
*Last updated May 16, 2026 — xAI Grok Build launched May 14 (8 parallel agents, Arena Mode, local-first privacy, 70.8% SWE-bench Verified, $99/month intro). Previously: CVE-2026-26268 (CVSS 9.9 RCE) patched in Cursor 2.5; SpaceX holds $60B Cursor acquisition option; Claude Code rate limits doubled at Code with Claude SF; Code Review GA at $15–25/PR; Dreaming + Outcomes + Multiagent now in Managed Agents public beta.*
---
Five tools now define the AI coding agent landscape in 2026: GitHub Copilot, Cursor, Claude Code, Windsurf, and the newest entrant, xAI's Grok Build. They share a genre, but they are solving different problems. Pick the wrong one and you'll spend more time fighting the tool than writing code. Pick the right one and you'll ship faster than you thought possible.
Here's what each tool actually is, what it's best at, and — most importantly — how to figure out which one belongs in your workflow.
---
## The Five Philosophies
Before we dive into features and pricing, understand that these tools aren't variations on a theme. They represent fundamentally different bets on how AI fits into development:
- **GitHub Copilot**: AI as a layer on top of whatever you already use. Broad compatibility, lowest switching cost.
- **Cursor**: AI baked into a VS Code fork. The editor itself becomes intelligent.
- **Claude Code**: AI as a terminal-native agent. You describe the problem; it handles the code.
- **Windsurf**: AI as a parallel agentic IDE. Multiple autonomous agents, side-by-side, working simultaneously.
- **Grok Build**: AI as a multi-agent CLI with automated output evaluation and local-first privacy. Eight agents compete; you pick the winner.
A useful shorthand from the community: *"Copilot sees a function. Cursor sees a file. Claude Code sees a problem."* Windsurf, increasingly, sees an entire sprint. Grok Build runs eight versions of the sprint simultaneously and asks you to adjudicate.
---
## GitHub Copilot: The Everywhere Tool
Copilot is four years old now and it has earned its ubiquity. It works in VS Code, JetBrains, Neovim, and more. If you have a preferred editor, Copilot probably supports it. That flexibility is its primary competitive advantage.
The tool has evolved well beyond inline completions. The **Copilot Coding Agent** — now a first-class feature — lets you assign a GitHub issue to Copilot. It branches, writes code, runs your tests, self-reviews its own changes, and opens a pull request while you do something else. As of March 19, 2026, startup time for the coding agent [improved 50%](https://github.blog/changelog/2026-03-19-copilot-coding-agent-now-starts-work-50-faster/), tightening the feedback loop considerably.
The March 11 update also brought [major agentic improvements to JetBrains IDEs](https://github.blog/changelog/2026-03-11-major-agentic-capabilities-improvements-in-github-copilot-for-jetbrains-ides/), including custom agents, sub-agents, and auto-approve support for MCP tools. Copilot is also building out an MCP registry, letting you discover and install context servers directly from your editor.
**Autopilot Mode** shipped in April 2026, adding nested subagents, an MCP sandbox, and the ability to hand off a task and wait for a pull request. It's the most autonomous Copilot has ever been — and it's still IDE-bound. The agent operates from within a running VS Code or JetBrains instance. Remove the editor, the workflow disappears. For a genuinely "fire and forget" coding agent, that architectural dependency is a hard ceiling.
One more item worth flagging: Copilot's April 24 data policy update defaults user code to training opt-in rather than opt-out. If your organization handles proprietary code, verify your enterprise settings before that date.
**What Copilot does well:** inline completions, GitHub-centric agentic tasks, enterprise rollout across heterogeneous teams.
**Where it falls short:** genuine autonomy. Copilot's agent is impressive for bounded tasks, but it still expects a human in the loop directing each step — and remains architecturally tied to a running IDE.
**Pricing (as of May 2026):**
- Pro: $10/month — note that Claude Opus 4.7 was removed from this tier in late April
- Pro+: $39/month (Opus 4.7 access, higher agent limits)
- Business: $19/user/month
**Billing change incoming — June 1, 2026:** Copilot's flat-rate era is ending. All plans are [switching to GitHub AI Credits](/posts/github-copilot-usage-based-billing-june-2026/), a consumption-based model where code completions remain free but chat, agents, and code review consume credits. Code review also burns GitHub Actions minutes simultaneously — a double billing that sparked significant developer backlash. Agentic workflows, which are the feature Copilot has been pushing hardest, are now the most expensive mode to use. Worth running a usage audit before June 1 if your team is on a heavy agent workflow.
---
## Cursor: The AI-Native IDE
Cursor is what VS Code would look like if it were rebuilt from scratch around AI. It's not a plugin — it's a fork, which means the AI has access to your entire project graph, not just the file you have open.
Composer mode handles multi-file edits with full project context. Agent mode iterates autonomously across files to complete a task. You can bring your own API keys and switch models. For large codebases where you need tight control over what the AI changes and why, Cursor is hard to beat.
**Cursor 3**, launched April 2, 2026, is a meaningful rebuild. The Composer panel is replaced by an **Agents Window** that manages local, cloud, SSH, and git-worktree agents simultaneously. **Design Mode** lets you annotate a browser screenshot to give an agent visual UI targets. The `/worktree` command spins up isolated git worktrees for parallel agent tasks. The `/best-of-n` command runs the same task across multiple models in parallel, then lets you pick the winner.
These are real improvements. But a critical structural fact hasn't changed: every Cursor 3 agent runs through a live Cursor application. Close the IDE, kill the agents. Cursor 3 describes itself as "agent-first" — that's accurate for the interface design, not the architecture.
Cursor now commands a [valuation of $50 billion](https://cursor-50b-self-hosted-agents-the-autonomy-ceiling) with 1M+ users, which is extraordinary for a product that asks you to swap your IDE.
**What Cursor does well:** large codebases, multi-file edits, model flexibility, project-wide context, Cursor 3's visual UI annotation.
**Where it falls short:** true autonomy. Cursor is a very powerful AI-assisted editor. Every agent still runs through a living Cursor process.
**Pricing:**
- Hobby: $20/month (Supermaven autocomplete, Agents Window, agent mode, codebase indexing)
- Pro: $40/month
- Business: $40/user/month
**Updated May 2026:** The [Cursor SDK launched](/posts/cursor-sdk-programmatic-agents-escape-the-ide/) — a TypeScript library that lets you invoke Cursor agents programmatically in sandboxed cloud VMs, without a running desktop application. **Cursor Security Review** entered beta on May 1 for Teams and Enterprise plans — two always-on agents that check every PR for vulnerabilities and run scheduled codebase scans.
**Security alert — update to Cursor 2.5 now:** [CVE-2026-26268](/posts/cve-2026-26268-cursor-rce-ide-security-architecture/) is a critical (CVSS 9.9) sandbox escape vulnerability disclosed by Novee Security on April 28. A prompt injection payload in a malicious repository can write a pre-commit hook to `.git/hooks/`, which then fires automatically when Cursor's agent runs a git operation — no warning, no permission prompt, full workstation code execution. Cursor patched it in version 2.5 and disputes NVD's 9.9 rating (Cursor's own assessment: 8.0), but the practical advice is identical regardless: update immediately. This vulnerability is a concrete illustration of the IDE-embedded AI security thesis — a sandboxed AI agent running inside a privileged desktop process inherits that process's full system access, making prompt-injection-to-RCE a structurally viable attack path.
**Market news:** In April 2026, SpaceX signed a $10 billion collaboration deal with Cursor to develop "coding and knowledge work AI," pairing Cursor's product with SpaceX's Colossus supercomputer (1M H100-equivalent GPUs). The deal includes an option for SpaceX to [acquire Cursor outright for $60 billion](https://techcrunch.com/2026/04/21/spacex-is-working-with-cursor-and-has-an-option-to-buy-the-60-billion/) after its IPO this summer — a valuation that would make it the most expensive developer tool acquisition in history. Strategically interesting: SpaceX simultaneously signed the Colossus compute deal with *Anthropic* on May 6 for Claude Code infrastructure. Musk is making parallel bets on both the IDE-centric and terminal-native models of agentic coding.
---
## Claude Code: The Agentic Terminal
Claude Code is not an IDE. It's not a plugin. It's a terminal-based agent that you point at a problem and let run. That distinction is more important than it sounds.
With up to 1 million tokens of context, Claude Code handles tasks that would overwhelm other tools — deep architectural reviews, large-scale refactors, complex debugging across an entire codebase. It integrates with external tools (Figma, Jira, Slack) and operates on your local filesystem with full autonomy.
A note on benchmarks: **Claude Opus 4.7**, released April 16, 2026, scores **87.6% on SWE-bench Verified** and **64.3% on SWE-bench Pro** — leapfrogging GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%) on the harder, contamination-resistant benchmark. It's the first Claude model to pass implicit-need tests (understanding what the user meant, not just what they said) and delivers 14% faster multi-step workflows with 1/3 fewer tool errors than its predecessor. Pricing stays flat at $5/$25 per million tokens — Anthropic's bet is that the capability improvement makes the upgrade obvious.
The [JetBrains April 2026 developer survey](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/) puts Claude Code at **18% adoption at work** — up from 3% a year ago, a 6× increase — with the highest satisfaction in the market: 91% CSAT and an NPS of 54. In the US and Canada the adoption figure is 24%. No other tool grew this fast from this base. Anthropic hit $30B ARR in April, overtaking OpenAI in revenue — the company behind Claude Code is no longer in catch-up mode.
Recent additions worth noting: **Claude Code desktop redesign** (April 14) brought a multi-session sidebar with worktree isolation per agent, side chats (`⌘+;`) that read main context without polluting it, and an integrated file editor, diff viewer, and HTML/PDF preview — all in one window. **Claude Code Routines** (April 15) brought cloud-native scheduled automation: cron triggers, API webhooks, and GitHub event triggers that run on Anthropic's infrastructure without your machine being online. **Claude Code Ultraplan** (`/ultraplan`) hands planning to a dedicated Opus session in Anthropic's cloud for up to 30 minutes. **Claude Managed Agents** absorbs the production agent loop infrastructure — sessions, checkpointing, sandboxing — that every team was building themselves.
For enterprise buyers, **Claude Code on Bedrock with Mantle** (v2.1.94) delivers zero operator access: no SSH, no Session Manager, cryptographic attestation via NitroTPM. Neither Anthropic nor AWS can access prompts or completions during inference. That's the architecture that passes regulated-industry security reviews.
The honest downside: no native IDE integration beyond the desktop app. For rapid iteration — write, test, tweak, repeat — you're still switching contexts to view diffs. The desktop redesign reduced this friction significantly with the integrated diff viewer, but it hasn't eliminated it.
**What Claude Code does well:** complex reasoning, autonomous multi-step tasks, large codebases, architectural analysis, enterprise air-gap deployment, cloud-native scheduled automations via Routines.
**Where it falls short:** native IDE integration (desktop app helps but doesn't replace), rapid iteration loops, pricing for casual users.
**Pricing:**
- Pro: $20/month (baseline usage)
- Max 5x: $100/month (5× usage limits)
- Max 20x: $200/month (20× usage limits; roughly 18× cheaper than equivalent API usage at heavy scale)
**Updated May 2026:** [v2.1.119](/posts/claude-code-v2-1-119-multi-vcs-settings-enterprise/) shipped multi-VCS support (`--from-pr` now accepts GitLab, Bitbucket, and GitHub Enterprise Server URLs). Claude Code [hit $2.5B ARR by February 2026](/posts/claude-code-2-5b-arr-terminal-beats-ide-market/) and accounts for more than half of Anthropic enterprise spending — faster ARR growth than any comparable developer tool.
**Code with Claude SF — May 6, 2026:** Anthropic's developer event shipped three notable updates for Claude Code users. First, Anthropic signed a compute deal with SpaceX's Colossus 1 data center (300MW, 220,000+ NVIDIA GPUs), unlocking enough capacity to **double the five-hour rate limits** across Pro, Max, Team, and Enterprise plans — and eliminate the peak-hour reduction for Pro and Max accounts. Second, **Claude Code Code Review** moved to general availability: multi-agent reviewers post directly to GitHub PRs, priced at $15–25 per PR and billed separately from your Claude Code subscription. Third, **Claude Managed Agents** received three public beta features: [Dreaming](/posts/claude-managed-agents-dreaming-self-improving-agents/) (scheduled overnight memory curation; Harvey saw 6× task completion improvement), Outcomes (rubric-based success grader with webhook notification), and Multiagent Orchestration (coordinator + up to 20 specialist agents working in parallel on a shared filesystem).
---
## Windsurf: The Parallel Agent IDE
Windsurf is the most interesting tool to watch in 2026 for one reason: **Wave 13** shipped parallel multi-agent sessions, and it changes the mental model of what an IDE can be.
The headline feature: you can run five Cascade agents simultaneously, each working on a separate branch via Git worktrees, monitored through side-by-side panes. While Claude Code does one thing deeply, Windsurf does several things in parallel. It's a different kind of autonomy — breadth over depth.
[Cognition AI acquired Windsurf in December 2025 for ~$250M](https://byteiota.com/windsurf-wave-13-free-swe-1-5-parallel-agents-escalate-ai-ide-war/) and has been integrating it with Devin's autonomous capabilities. Post-acquisition, Windsurf now supports **GPT-5.4** and adjustable reasoning effort levels alongside its own models, giving it unusual model flexibility. User count has crossed 1 million.
The identity question is real: Windsurf is an IDE built by a company famous for an autonomous agent (Devin). The tension between IDE-centric Windsurf users and Devin's fully autonomous model has produced a product that tries to serve both audiences — with mixed results. Arena Mode (two agents, side-by-side, hidden model identities, vote-driven leaderboards) is a genuinely clever feature for comparing models in your own workflow.
**What Windsurf does well:** parallel agentic workflows, multi-agent task distribution, model flexibility, Arena Mode for model comparison.
**Where it falls short:** post-acquisition identity crisis; IDE-centric ceiling limits true autonomy; enterprise features less mature than Copilot's.
**Pricing:**
- Free tier available
- Pro: $15/month
- Teams: $35/user/month
**Updated May 2026:** Cognition has confirmed Devin integration for Windsurf in H2 2026 — the goal being an IDE that can hand tasks to a fully autonomous Devin session without the user leaving the Windsurf interface. This would meaningfully close the autonomy gap. Until it ships, Windsurf remains an excellent parallel-agent IDE with the same architectural ceiling it had before the acquisition: every agent still runs through a live Windsurf instance.
---
## Grok Build: xAI Enters the Race
xAI launched Grok Build on May 14, 2026 — the first coding agent from Elon Musk's AI company, and the newest terminal-native entrant in a field where Claude Code has set the standard.
Grok Build is a CLI agent, not an IDE plugin. That's the right architectural call — xAI is making the same bet Anthropic made: the terminal is the right home for a serious coding agent. The similarities end there.
**Key features:**
- **Eight parallel sub-agents:** Grok Build spawns up to eight concurrent agents, each running a plan/search/build workflow. Complex tasks get subdivided and attacked simultaneously.
- **Arena Mode:** The headline feature — an automated evaluation layer that scores and ranks all agent outputs before you review them. Confirmed in code traces since February. **Not yet live in the early beta.**
- **Local-first privacy:** Zero codebase data transmitted to xAI servers during a session. Air-gap compatible. This is a meaningful differentiator for regulated industries that don't have an enterprise Anthropic contract with Bedrock Mantle.
- **Plan Mode:** Full execution plan presented for your approval before any file is touched. Good trust-building for an early beta; the question is whether it remains mandatory.
- **grok-code-fast-1:** 256K context window, **70.8% SWE-bench Verified**, $0.20/$1.50 per million tokens via API.
**Pricing:** SuperGrok Heavy at $300/month, introductory rate $99/month for the first six months.
**What Grok Build does well:** local-first privacy for sensitive codebases, parallel agent architecture, low API token pricing.
**Where it falls short:** 70.8% SWE-bench Verified trails Claude Opus 4.7 (87.6%) by 17 points. No MCP ecosystem. No CLAUDE.md equivalent for project-level instructions. No cloud execution or scheduling. Arena Mode — the flagship feature — isn't live. $300/month steady-state is at the top of the individual developer price range without ecosystem depth to justify it against Claude Code Max 20x at $200/month.
**Updated May 2026:** [Full Grok Build analysis here.](/posts/grok-build-xai-coding-agent-arena-mode/) The short version: right architecture, benchmark gap to close, watch when Arena Mode ships.
---
## The Decision Framework
Stop asking "which tool is best?" Start asking "which tool fits my workflow?"
| If you... | Use... |
|---|---|
| Want AI in your existing editor with minimal switching cost | **GitHub Copilot** |
| Work on large codebases and want multi-file context | **Cursor** |
| Have complex, multi-step tasks you want fully autonomous | **Claude Code** |
| Want to parallelize work across multiple agents simultaneously | **Windsurf** |
| Need GitHub-native autonomous PRs from issues | **GitHub Copilot** (Coding Agent) |
| Are a power user hitting rate limits regularly | **Claude Code Max 20x** |
| Work with proprietary code and need zero data transmission | **Grok Build** (local-first) |
| Want to evaluate multiple agent outputs before picking one | **Grok Build** (when Arena Mode ships) |
JetBrains survey data (January 2026) shows experienced developers using multiple tools concurrently. These tools are not mutually exclusive. Copilot handles inline completions while Claude Code tackles the gnarly refactor. Cursor manages daily coding while Windsurf runs a parallel batch of bug fixes in the background.
---
## Where the Market Is Going
The tools are converging on autonomy, but from different directions. Copilot is adding agents to an extension. Cursor is adding autonomy to an IDE. Claude Code *is* the autonomous agent. Windsurf is multiplying the agent. Grok Build is running a parallel evaluation tournament every time you ask for code.
The next 12 months will likely produce one significant shift: the gap between "AI-assisted" and "AI-autonomous" will become the dominant axis of competition. Tools that keep the human in the loop will feel slow compared to those that don't need to. Claude Code is the clearest expression of that future: no IDE wrapper, no approval dialogs by default, just an agent that takes a problem and runs.
Grok Build's entry is notable precisely because it's terminal-native — xAI made the same bet Anthropic made on architecture. But 70.8% SWE-bench Verified is not the benchmark number that justifies switching from a tool scoring 87.6%. xAI has the compute and the engineering to close that gap. Watch where grok-code-fast-2 benchmarks land and whether Arena Mode changes developer evaluations when it ships.
Cursor and Copilot are valuable tools for daily editing tasks. But if you're making a long-term bet on where software development is going — and that bet is on autonomous agents writing most of the code — then the only tool built around that premise from the ground up is Claude Code. Everything else is either an IDE with an agent bolted on, or a new entrant still closing the capability gap.
For practitioners today: use Claude Code for anything requiring genuine autonomy or complex reasoning. Use Copilot or your preferred IDE for the flow state work. And put Grok Build on the watchlist for the privacy-first local execution use case and, eventually, Arena Mode. But watch how much time you spend in the "AI-assisted" bucket versus the "AI-autonomous" one — that ratio should be shifting fast.
---
**Sources:**
- [GitHub Copilot coding agent 50% faster startup (March 2026)](https://github.blog/changelog/2026-03-19-copilot-coding-agent-now-starts-work-50-faster/)
- [GitHub Copilot agentic improvements for JetBrains (March 2026)](https://github.blog/changelog/2026-03-11-major-agentic-capabilities-improvements-in-github-copilot-for-jetbrains-ides/)
- [Windsurf Wave 13: Free SWE-1.5 and Parallel Agents — ByteIota](https://byteiota.com/windsurf-wave-13-free-swe-1-5-parallel-agents-escalate-ai-ide-war/)
- [Claude Code Pricing Guide — ClaudeLog](https://claudelog.com/claude-code-pricing/)
- [Which AI Coding Tools Do Developers Actually Use at Work? — JetBrains Research Blog](https://blog.jetbrains.com/research/2026/04/which-ai-coding-tools-do-developers-actually-use-at-work/)
- [Cursor 3 changelog — Cursor](https://cursor.com/changelog/3-0)
- [Plan in the cloud with ultraplan — Claude Code Docs](https://code.claude.com/docs/en/ultraplan)
- [Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities — OpenAI](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
- [Claude Opus 4.7 on Amazon Bedrock — AWS Blog](https://aws.amazon.com/blogs/aws/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock/)
- [Claude Code v2.1.94 release notes — GitHub](https://github.com/anthropics/claude-code/releases/tag/v2.1.94)
- [AWS Mantle zero operator access deep dive](https://aws.amazon.com/blogs/machine-learning/exploring-the-zero-operator-access-design-of-mantle/)
- [GitHub Copilot usage-based billing June 2026 — GitHub Blog](https://github.blog/changelog/2026-04-28-github-copilot-usage-based-billing/)
- [Cursor SDK Launches — Analytics Drift](https://analyticsdrift.com/cursor-sdk-ai-coding-agents-launch/)
- [Cursor Security Review beta (May 1, 2026) — Cursor Changelog](https://cursor.com/changelog)
- [Claude Code at $2.5B ARR — Stormy AI](https://stormy.ai/blog/claude-code-gtm-strategy-anthropic-revenue-2026/)
- [Claude Code v2.1.119 — newreleases.io](https://newreleases.io/project/github/anthropics/claude-code/release/v2.1.119)
- [CVE-2026-26268: How an AI Coding Agent Can Run Exploits in Cursor IDE — Novee Security](https://novee.security/blog/cursor-ide-cve-2026-26268-git-hook-arbitrary-code-execution/)
- [CVE-2026-26268 Detail — NVD](https://nvd.nist.gov/vuln/detail/CVE-2026-26268)
- [SpaceX is working with Cursor and has an option to buy the startup for $60B — TechCrunch](https://techcrunch.com/2026/04/21/spacex-is-working-with-cursor-and-has-an-option-to-buy-the-60-billion/)
- [Higher usage limits for Claude and a compute deal with SpaceX — Anthropic](https://www.anthropic.com/news/higher-limits-spacex)
- [New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration — Anthropic](https://claude.com/blog/new-in-claude-managed-agents)
- [xAI Enters the Coding Agent Race With Grok Build — DevOps.com](https://devops.com/xai-enters-the-coding-agent-race-with-grok-build/)
- [Grok Build Early Beta: 6 Ways xAI's New AI Coding Agent Plans to Take on Claude Code — Techloy](https://www.techloy.com/grok-build-early-beta-6-ways-xais-new-ai-coding-agent-plans-to-take-on-claude-code/)
- [xAI Grok Build: Multi-Agent Arena Mode Redefines AI Coding — AI2Work](https://ai2.work/blog/xai-grok-build-multi-agent-arena-mode-redefines-ai-coding/)
---
# Claude Code March 2026: Voice Mode Isn't the Story
URL: https://sdd.sh/2026/03/claude-code-march-2026-voice-mode-isnt-the-story/
Date: 2026-03-22
Updated: 2026-03-22
Tags: Claude Code, agentic-coding, MCP, AI tools, developer-experience
Categories: AI Tools, Guides
Summary: Voice mode grabbed the headlines. The 64k default output tokens, /loop, MCP elicitation, and --channels are the updates that will actually change how you use Claude Code day to day.
Every month, Claude Code ships something that gets headlines, and every month the genuinely interesting changes are buried two paragraphs below the fold. March 2026 is no exception. Voice mode is the marquee feature — hold spacebar, speak, release. It works. Fine.
But if you're using Claude Code for anything resembling serious agentic work, here's what actually changed this month.
---
## The token limit upgrade is bigger than it sounds
Default maximum output tokens for Claude Opus 4.6 jumped to **64k tokens**. The upper bound for both Opus 4.6 and Sonnet 4.6 is now **128k tokens**. The 1M context window is available on Max, Team, and Enterprise plans.
Why does this matter? Because most meaningful agentic tasks — write a module, refactor a service, generate a test suite — produce output that was quietly getting truncated before. The previous 32k default felt fine for conversational use. For agentic sessions where Claude is writing code, it's constantly bumping against ceilings.
Going from a 32k default to 64k is roughly equivalent to doubling the size of the code Claude can produce in a single step without breaking the task into chunks. For context: a mid-sized TypeScript service might be 800–1,200 lines. You can now have Claude generate it end-to-end rather than managing the handoffs manually.
The 128k max is the more interesting number for CI pipelines and long-running tasks. If you're scripting Claude Code with `-p` and handling the output downstream, you now have a lot more headroom before you need to decompose your prompts.
---
## `/loop` makes recurring tasks first-class
The new `/loop` command runs a slash command on a recurring interval. The syntax is `/loop 5m /foo` (defaults to 10 minutes).
This is the "I want Claude to keep doing this" feature that power users have been hacking around for months. Before, the workarounds were: run it in a cron job, write a shell wrapper, or babysit it manually. Now it's a native command.
Practical uses: keep a linting check running while you work on a feature, monitor a test suite every few minutes during a refactor, poll an external status endpoint, or run a review pass on recent commits on a schedule.
The important detail is that `/loop` works with any slash command, including custom skills defined in your project. If you've built a `/review-pr` skill or a `/check-types` skill, you can loop it without any additional setup.
---
## MCP elicitation: AI can ask structured questions mid-task
This one has been needed for a while. **MCP elicitation support** lets MCP servers request structured input from you while a task is running — via an interactive dialog with form fields, or a browser URL.
Previously, if an MCP server needed clarification partway through a task, you had one option: write all possible parameters into your initial prompt and hope for the best. Or abort, reconfigure, and restart. Neither is good.
Elicitation changes the interaction model. An MCP server can now pause mid-execution and ask: "I found three matching tables — which one?" or "The deployment target is ambiguous — pick one: staging, prod-eu, prod-us." You respond via a structured dialog (not a free-text field), the server validates your input, and execution resumes.
For developers building MCP servers, the new `Elicitation` and `ElicitationResult` hooks let you intercept and override responses before they're sent back to the server. This is useful for testing MCP servers without manual intervention, or for building automation layers on top of MCP workflows.
---
## `--channels`: MCP servers can now talk to you
The `--channels` flag (still in research preview) enables MCP servers to **push messages into your active session**. This is the inverse of the normal MCP flow.
Normally: you give Claude a task → Claude calls an MCP server → MCP server returns a result. With `--channels`: an MCP server can initiate a message to your Claude Code session unprompted.
The obvious use case is notifications. An MCP server watching a CI pipeline can push "build failed" or "deploy complete" into your active session without you polling for status. A monitoring server can interrupt with an alert. A code review bot can surface a blocking comment without you switching context.
There's also a `--channels` permission relay capability — channel servers that declare the permission capability can forward tool approval prompts to your phone. If Claude needs to run a destructive command and you're not at your desk, it can ask you via your phone rather than blocking the session.
This is early-stage (research preview means the API may change), but the direction is clear: Claude Code sessions are becoming interactive in both directions.
---
## `/effort` and `/color`: smaller changes worth knowing
`/effort` lets you set the model's effort level for the current session. The full-effort mode previously called "max" is gone — now it's three levels, and maximum effort is triggered by using the `ultrathink` keyword in your prompt rather than a mode setting.
`/color` sets a color for the current session's prompt bar. Trivial feature, genuinely useful when you're running multiple parallel sessions (see: [parallel AI agents](/posts/parallel-ai-agents-the-tools-that-let-you-run-ten-claudes-at-once/)) and need to keep track of which terminal window is which agent.
---
## `--bare` for CI and scripted use
The new `--bare` flag is specifically for scripted `-p` calls. It skips hooks, LSP initialization, plugin sync, and skill directory walks. It requires `ANTHROPIC_API_KEY` or an `apiKeyHelper` via `--settings` — OAuth and keychain auth are disabled.
If you're running Claude Code in CI pipelines or calling it from scripts, `--bare` makes sessions start faster and removes the overhead of interactive-mode setup. It also means your scripted calls won't accidentally trigger hooks or plugins configured for interactive sessions.
For teams using Claude Code as a CI step — code review, test generation, doc updates — `--bare` is the flag you've been waiting for.
---
## The direction this points
Voice mode is a UX convenience. The changes that matter are about making Claude Code work better in three contexts that were previously awkward:
1. **Long agentic tasks** — 64k default output tokens, 128k max
2. **Recurring and scheduled work** — `/loop`
3. **Bidirectional, interactive pipelines** — MCP elicitation + `--channels`
The pattern here is Claude Code becoming less of a tool you talk to and more of a runtime you run workflows on. The `/loop` command, elicitation support, and channel push notifications are all about reducing the amount of human-in-the-loop intervention required for tasks that *should* be autonomous.
That's the real story of March 2026.
---
**Sources**
- [Claude Code March 2026: All Updates from /loop to Voice Mode](https://pasqualepillitteri.it/en/news/381/claude-code-march-2026-updates) — Pasquale Pillitteri
- [Claude Code Changelog — Claude Code Docs](https://code.claude.com/docs/en/changelog)
- [Release Notes — Claude Help Center](https://support.claude.com/en/articles/12138966-release-notes)
- [Claude Code Changelog: Complete Version History](https://claudefa.st/blog/guide/changelog) — ClaudeFa.st
---
# What Is Spec-Driven Development?
URL: https://sdd.sh/2026/03/what-is-spec-driven-development/
Date: 2026-03-21
Updated: 2026-03-21
Tags: spec-driven-development, agentic-coding, Claude Code, AI tools, workflow
Categories: Spec-Driven Development, Guides
Summary: Vibe coding gets you started. Spec-Driven Development gets you to production. Here's the paradigm shift that's quietly rewriting how software gets built in 2026.
Every developer has now experienced the "AI demo high": you describe a feature, the agent builds something that *almost* works, you iterate a few more times, and thirty minutes later you have spaghetti that vaguely resembles what you wanted.
That's vibe coding. It's fast. It's fun. And it falls apart at scale.
Spec-Driven Development (SDD) is the antidote — a structured approach where you write the specification first and let the AI handle implementation. Not because you want to slow down, but because a good spec makes the agent dramatically faster and more reliable.
## The Core Idea
SDD flips the traditional development loop:
- **Old loop**: write code → write tests → document what you built
- **SDD loop**: write spec → AI generates code → code is disposable, spec is the truth
The spec isn't a README afterthought. It's a contract: what the system does, how it behaves at the edges, what constraints it operates under. The AI doesn't guess at intent — it executes against a document that captures intent precisely.
This sounds familiar because it is. It's what good software engineers do in their heads before touching a keyboard. SDD just makes that mental model explicit, and hands it to the agent as context.
## The Four-Phase Workflow
In practice, SDD follows a repeatable structure:
**1. Specify** — Describe what you're building and why. User stories, acceptance criteria, edge cases. The more precisely you capture intent here, the less correction you'll do later. The AI can help draft and challenge this document — treat it like a pairing session.
**2. Plan** — Define the technical constraints: stack, patterns, architectural decisions, service boundaries. The agent produces an implementation plan grounded in your spec. Review it before a single line of code is written.
**3. Break down** — The plan becomes concrete, testable tasks. Each task has inputs, expected outputs, and validation criteria. Tasks that can run in parallel are flagged. Dependencies are explicit.
**4. Implement** — The agent works through the task list, using the spec and plan as context for every decision. When it gets stuck or diverges, you update the spec, not the code.
The key insight: the spec is the thing you maintain. Code is regenerated as needed.
## Is This Just Waterfall with AI?
It's the most common objection, and it's worth taking seriously.
Waterfall failed for a specific reason: the cost of discovering your specification was wrong came *after* months of implementation. Fixing a misunderstood requirement meant rewriting systems that took teams quarters to build. The feedback loop was catastrophically long.
SDD compresses that loop to minutes. You can generate a 2,000-line implementation, discover the architecture is wrong, update the spec, and regenerate — in an afternoon. When code is cheap to produce, the economics of writing a good spec upfront change completely.
What remains from waterfall is the discipline of thinking before building. That part was always correct. SDD keeps it.
## Three Levels of Commitment
Not every team needs to go all the way. Martin Fowler's team recently mapped out three patterns for how teams are using SDD in practice:
- **Spec-first**: The spec guides the AI but code remains primary. Think of it as structured prompting — you write requirements and the agent references them, but you'd never throw the code away.
- **Spec-anchored**: The spec persists as a governing contract. Code can diverge temporarily, but the spec is the source of truth for reviews, onboarding, and future changes.
- **Spec-as-source**: The spec *is* the source. Code is treated as a build artifact — an intermediate product between your requirements and compiled binaries. You maintain the spec; the agent handles the rest.
Most teams land on spec-anchored today. Spec-as-source is where the genuinely radical practitioners are headed, and the tooling is catching up fast.
## The Tooling Ecosystem in 2026
[GitHub Spec Kit v0.3.2](https://github.com/github/spec-kit) is the closest thing to a canonical SDD toolkit: open-source, works with 22+ AI coding platforms including Claude Code, GitHub Copilot, Amazon Q, and Gemini CLI.
Spec Kit introduces slash commands that structure the entire workflow: `/speckit.constitution` to define your project's principles, `/speckit.specify` to describe what you're building, `/speckit.plan` to produce an implementation plan, and task generation that breaks work into parallel and sequential steps with explicit file paths.
Developers are already building serious things with it. One engineer shipped a full CLI tool with TUI and web interfaces in under three days — zero lines written by hand.
On the IDE side, **Kiro** (built on AWS infrastructure) implements SDD natively using EARS syntax (Easy Approach to Requirements Syntax) with hooks that keep the spec synchronized as the agent works. **OpenSpec** targets brownfield projects with delta markers that track what's changing relative to existing functionality — a harder problem than greenfield, and one most frameworks ignore.
For Claude Code users specifically, several community-maintained spec workflows exist on GitHub, including [claude-code-spec-workflow](https://github.com/Pimzino/claude-code-spec-workflow) which automates the requirements → design → tasks → implementation pipeline.
**Google's Jules** agent (now powered by Gemini 3 Pro) has built SDD-style Plan Mode as a first-class feature: it surfaces clarifying questions and produces a structured task plan before writing any code. Windsurf's [Plan Mode](https://docs.windsurf.com/windsurf/cascade/arena), shipped with Wave 13 in January 2026, takes a similar approach — requiring structured upfront planning before the agent executes. Both tools are converging on the same insight: forcing the agent to think before it acts is a forcing function for clearer specifications.
The result is that SDD is becoming table stakes, not a niche methodology. When the mainstream tools build spec-first planning into their default UI, the practices that define SDD — explicit requirements, upfront constraint capture, structured decomposition — become the default mode of AI-assisted development.
## What Makes a Good Spec?
The quality of the spec determines the quality of the output. A few principles that hold up in practice:
**Be explicit about constraints, not just features.** "Users can reset their password" is a feature. "Password reset tokens expire after 15 minutes, are single-use, and must invalidate all active sessions on use" is a spec. The agent needs the second version to make sound implementation decisions.
**Write acceptance criteria, not implementation details.** Specify the behavior, not how to achieve it. The agent is better than you at picking implementation details; it's not better than you at knowing what the system should do.
**Define the edges.** What happens when the API is down? When the user provides invalid input? When a race condition occurs? Agents that don't have edge cases specified will make choices for you — sometimes good ones, often not.
**Keep the spec updated.** This is where most teams fail. When a requirement changes, update the spec *before* touching the code. If the spec drifts from reality, you lose the main advantage of the approach.
## The Bigger Picture
Anthropic's [2026 Agentic Coding Trends Report](/posts/anthropic-8-agentic-coding-trends-2026/) documents how fast this is moving in production: TELUS has deployed 13,000+ custom AI solutions with 30% faster engineering cycle times and 500,000+ hours saved. Zapier reports 800+ internal agents in active use. McKinsey cites 20–40% operating expense reductions at AI-centric organizations. These aren't pilot numbers.
Dario Amodei has predicted that within months, 90% of all code will be written by AI. Whether or not that exact timeline holds, the direction is clear: the bottleneck in software development is shifting from *writing code* to *knowing what to build*.
Spec-Driven Development is what happens when that shift gets formalized. Engineers who are good at capturing requirements, thinking through constraints, and communicating intent precisely will find their leverage increasing substantially. Engineers who relied on implementation skill alone should be paying attention.
The spec is the new source of truth. The code is the build artifact. That's the paradigm shift, and it's already happening.
---
*Updated 2026-03-28: Added Jules/Gemini and Windsurf Plan Mode context; updated real-world deployment numbers from Anthropic's 2026 Agentic Coding Trends Report.*
**Sources:**
- [GitHub Spec Kit — open-source toolkit for SDD](https://github.com/github/spec-kit)
- [Spec-driven development with AI: Get started with a new open source toolkit — GitHub Blog](https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/)
- [Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl — Martin Fowler](https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html)
- [Spec-Driven Development Is Eating Software Engineering — Medium](https://medium.com/@visrow/spec-driven-development-is-eating-software-engineering-a-map-of-30-agentic-coding-frameworks-6ac0b5e2b484)
- [Using spec-driven development with Claude Code — Medium](https://heeki.medium.com/using-spec-driven-development-with-claude-code-4a1ebe5d9f29)
- [Spec-Driven Development: It Looks Like Waterfall (And I Feel Fine) — Roger Wong](https://rogerwong.me/2026/03/spec-driven-development)
- [2026 Agentic Coding Trends Report — Anthropic](https://resources.anthropic.com/2026-agentic-coding-trends-report)
- [Arena Mode (Plan Mode) — Windsurf Documentation](https://docs.windsurf.com/windsurf/cascade/arena)
- [AI dev tool power rankings & comparison, March 2026 — LogRocket](https://blog.logrocket.com/ai-dev-tool-power-rankings/)
---