# sdd.sh — Full Content

> CTO & Software engineer

Author: Florent Clairambault
Site: https://sdd.sh/
Generated: 2026-05-20

This file contains the full content of every article on sdd.sh, concatenated for AI ingestion in a single context window. Articles are sorted newest first. For the article index with summaries only, see https://sdd.sh/llms.txt.

---

# Gemini 3.5 Flash: Google's "Budget" Model Outperforms Flagships on Agentic Benchmarks

URL: https://sdd.sh/2026/05/gemini-3-5-flash-benchmarks-agentic-coding/
Date: 2026-05-20
Tags: google, gemini, gemini-3-5-flash, benchmarks, agentic-coding, model-release, claude-code
Categories: AI Tools, Industry
Summary: Gemini 3.5 Flash launched at Google I/O on May 19. Google calls it a Flash model — implying budget tier — but at $9/M output tokens it sits between Haiku and Sonnet pricing while hitting 76.2% on Terminal-Bench 2.1 and leading all competitors on MCP Atlas. It does not beat Claude Opus 4.7 on SWE-bench. The benchmark picture is more complicated than Google's marketing suggests.

Google launched Gemini 3.5 Flash at I/O 2026 on May 19. The name carries expectations: Flash has always meant fast and cheap, the tier you use when you need throughput at scale and can accept some quality tradeoff. Gemini 3.5 Flash breaks that contract. The benchmarks are near-frontier. The price is not budget. Understanding what Google actually built here matters more than the naming.

## What Shipped

Gemini 3.5 Flash is available immediately in the Gemini API, Google AI Studio, Antigravity 2.0, and Gemini CLI. The specs:

- **Context window**: 1 million tokens input, 64,000 output
- **Speed**: approximately 4x faster than comparable frontier models on output tokens per second
- **Pricing**: $1.50/M input tokens, $9.00/M output tokens, $0.15/M for cached input

That pricing sits between Claude Haiku 4.5 ($0.80/$4.00) and Claude Sonnet 4.6 ($3/$15) on input, but the $9/M output rate is considerably higher than Haiku and approaching Sonnet.
For pure budget workloads, Claude Haiku 4.5 remains meaningfully cheaper. Google is asking you to pay near-Sonnet rates for a model the company brands as Flash. The implicit argument: the performance justifies the price. Let's look at whether it does.

## The Benchmark Picture

Gemini 3.5 Flash leads or ties on several benchmarks that matter for agentic coding, while trailing on others.

**Where 3.5 Flash leads:**

- **MCP Atlas**: 83.6% — the benchmark that measures tool-use across MCP server integrations. This is the one Google built the model around, and it shows. Claude Opus 4.7 and GPT-5.5 trail here.
- **GDPval-AA**: 1,656 Elo — a real-world agentic evaluation benchmark. Gemini 3.1 Pro scored 1,314. That is a substantial jump.
- **Finance Agent v2**: 57.9% versus Gemini 3.1 Pro's 43.0%. The model handles multi-step financial workflows significantly better than its predecessor.
- **CharXiv Reasoning**: 84.2%, leading comparable models.
- **GPQA Diamond**: 90.4%, competitive with frontier models on graduate-level reasoning.
- **Terminal-Bench 2.1**: 76.2%, ahead of Gemini 3.1 Pro's 70.3%, though not the top score overall.

**Where it trails:**

- **SWE-bench Verified**: 78%. Claude Opus 4.7 scores approximately 87.6% and GPT-5.5 scores around 83%. For pure coding correctness at repo scale — finding bugs in existing code, implementing features in established codebases — the quality gap versus Opus 4.7 is real and meaningful.
- **Terminal-Bench 2.1**: GPT-5.5 leads at 82.7%. Gemini 3.5 Flash's 76.2% is stronger than 3.1 Pro but does not take the top position on terminal-native coding tasks.

The pattern: Gemini 3.5 Flash is optimized for MCP-driven agentic workflows and real-world multi-step tasks, at the cost of raw coding correctness on tasks that require reading and editing complex existing codebases. This is a design choice, not a deficiency.

## "Flash" Is Now a Speed Tier, Not a Budget Tier

The naming matters because it shapes expectations and buying decisions.
For three generations, Flash meant: acceptable quality, fast inference, low cost — use it for high-volume, latency-sensitive workloads where you can tolerate some quality reduction versus the Pro/Ultra/Opus tier.

Gemini 3.5 Flash changes this. At $9/M output, it is not a budget model. At 76.2% Terminal-Bench 2.1, it is not a quality-compromised model. It is a speed-tier model: frontier-class performance at frontier-class speed, at a price point below the flagships ($25/M output for Opus 4.7, $30/M for GPT-5.5) but above what developers historically expected from Flash.

The TechTimes headline claim that it "costs 3x more per token" than prior Flash models is accurate in absolute terms. Whether you view that as expensive depends on the comparison: versus flagship models, 3.5 Flash is considerably cheaper. Versus prior Flash models and true budget options like Haiku 4.5, it is substantially more expensive.

Google is repositioning the Flash tier. The question for teams is whether the performance jump justifies paying more than Haiku while falling short of Opus 4.7 on the metrics that matter most for complex coding.

## Where 3.5 Flash Wins in Practice

The strongest case for Gemini 3.5 Flash is MCP-orchestrated agentic workflows on Google infrastructure.

If your agent stack uses Antigravity 2.0 for deployment, BigQuery for data access, and MCP servers for tool integration, Gemini 3.5 Flash is the fastest path to production. The model leads on MCP Atlas specifically — not because Google gamed the benchmark, but because the model was built with this architecture in mind. Speed (4x faster than frontier) matters when you are running agents with 15-30 MCP tool calls per workflow.

The combination of Firebase Studio (launched at I/O 2026 as the agent-native build environment), Jules (free-tier async coding agent), and Gemini 3.5 Flash in Antigravity creates a coherent Google-native stack that is genuinely competitive for teams already in the Google Cloud ecosystem.

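
To make the price positioning concrete, here is a rough sketch of output-token cost per agent workflow. The per-million output prices are the ones quoted in this article; the workload shape (30 tool calls, 800 output tokens per call) and the `workflow_output_cost` helper are purely hypothetical, and input-token costs are ignored because the article does not quote input prices for every model.

```python
# Output-token prices in USD per million tokens, as quoted in this article.
OUTPUT_PRICE_PER_M = {
    "gemini-3.5-flash": 9.00,
    "claude-haiku-4.5": 4.00,
    "claude-sonnet-4.6": 15.00,
    "claude-opus-4.7": 25.00,
    "gpt-5.5": 30.00,
}


def workflow_output_cost(model: str, tool_calls: int, output_tokens_per_call: int) -> float:
    """Return the output-token cost in USD of one workflow run (input tokens ignored)."""
    total_tokens = tool_calls * output_tokens_per_call
    return total_tokens / 1_000_000 * OUTPUT_PRICE_PER_M[model]


if __name__ == "__main__":
    # Hypothetical workload: 30 MCP tool calls, 800 output tokens each.
    for model in OUTPUT_PRICE_PER_M:
        cost = workflow_output_cost(model, tool_calls=30, output_tokens_per_call=800)
        print(f"{model:<20} ${cost:.4f} per workflow")
```

The absolute numbers depend entirely on the assumed token counts; the ratios between models do not, which is why the tier comparison holds regardless of workload size.
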
**The realistic comparison for a Google Cloud team:**

- Gemini 3.5 Flash in Antigravity: MCP Atlas leadership, 4x speed, tight Google Cloud integration, $9/M output
- Claude Code on Bedrock: Opus 4.7 foundation, 87.6% SWE-bench Verified, Managed Agents depth, $25/M output

The price delta is real. If your workload is primarily MCP-orchestrated pipeline work rather than deep repo-scale coding, 3.5 Flash on Antigravity is a defensible choice. If your workload is spec-driven autonomous development at the scale that Managed Agents and Code Review address, the SWE-bench quality gap matters more than the speed advantage.

## The Distribution Argument

Google's actual competitive advantage is not Gemini 3.5 Flash's benchmark numbers. It is where the model runs.

Gemini CLI is free with 1,000 requests/day for any developer. Firebase Studio now provisions it by default for new agent-native projects. Antigravity 2.0 runs it as the default model for Google Cloud agentic deployments. Every developer who starts a new project in Firebase Studio, opens a Gemini CLI session, or deploys to Cloud Run through Antigravity is defaulting to Google's model stack.

This is the distribution moat that benchmark tables do not capture. OpenAI's equivalent is ChatGPT's installed base and Azure's enterprise relationships. Anthropic's equivalent is Amazon Bedrock's 100,000+ enterprise customers and the GitHub Copilot Pro+ integration. Google's is the developer surface area of the Google Cloud ecosystem and the free access tier that gets Gemini CLI into every developer's terminal.

Benchmark leadership matters. Distribution at scale matters more.

## Bottom Line

Gemini 3.5 Flash is a meaningful model release. It is not the "budget Flash" the name implies. It is a near-frontier agentic model optimized for MCP-driven workflows, fast inference, and Google Cloud native integration, priced at a substantial premium over prior Flash models but below flagship pricing.

Claude Opus 4.7 retains the SWE-bench Verified lead. GPT-5.5 retains the Terminal-Bench 2.1 lead. Gemini 3.5 Flash leads on MCP Atlas and GDPval-AA — the benchmarks that most directly measure real-world agentic workflow performance.

The practical read: if you build on Google Cloud and your agents are MCP-orchestrated pipeline work, evaluate 3.5 Flash seriously. If you are running spec-driven autonomous development where coding correctness under uncertainty matters, Opus 4.7 remains the benchmark and the gap is not closed yet.

Google is doing what Google does: competing on breadth and integration rather than narrow benchmark supremacy. That has worked before.

---

**Sources:**

- [Gemini 3.5 Flash: Benchmarks, Pricing, and Complete Specs](https://llm-stats.com/blog/research/gemini-3.5-flash-launch) — LLM Stats
- [Google releases Gemini 3.5 Flash; surpasses GPT-5.5 in agentic benchmarks](https://seekingalpha.com/news/4595030-google-releases-gemini-3_5-flash-surpasses-gptminus-5_5-in-agentic-benchmarks) — Seeking Alpha
- [Gemini 3.5 Flash — Google DeepMind](https://deepmind.google/models/gemini/flash/) — Google DeepMind
- [Google Unleashes Gemini 3.5 Flash: A Coding Powerhouse That's 4x Faster and Half the Cost](https://finance.biggo.com/news/202605191936_Google_Gemini_3.5_Flash_launched_at_IO_2026) — BigGo Finance
- [Google Ships Gemini 3.5 Flash, a Cheap-to-Run Agent Model That Costs 3x More Per Token](https://www.techtimes.com/articles/316861/20260519/google-ships-gemini-35-flash-cheap-run-agent-model-that-costs-3x-more-per-token.htm) — TechTimes
- [Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7: Agentic Coding](https://www.digitalapplied.com/blog/gemini-3-5-flash-vs-gpt-5-5-opus-4-7-agentic-coding) — Digital Applied
- [Gemini 3.5 Flash: more expensive, but Google plan to use it for everything](https://simonwillison.net/2026/May/19/gemini-35-flash/) — Simon Willison
- [Gemini 3.5 Flash Benchmarks, Pricing & Context Window](https://llm-stats.com/models/gemini-3.5-flash) — LLM Stats

---

# Claude Code v2.1.139: Agent View Turns Your Terminal Into a Fleet Dashboard

URL: https://sdd.sh/2026/05/claude-code-agent-view-goal-command-v2-1-139/
Date: 2026-05-20
Tags: claude-code, anthropic, agent-view, agentic-coding, background-sessions, goal-command
Categories: AI Tools, Agentic Workflows
Summary: Claude Code v2.1.139 ships two features that change how multi-agent work actually looks: Agent View — a unified dashboard showing every running, blocked, and completed session — and the /goal command, which keeps Claude working autonomously across turns until a defined completion condition holds.

Claude Code has supported background sessions and parallel worktrees for months. The workflow was real but the visibility was not: you launched agents with `claude --bg`, context-switched between terminals to check progress, and manually tracked which sessions were waiting on you versus still running. v2.1.139 closes that gap with two features that make multi-agent development something you can actually see and direct.

## Agent View: One Screen for Every Session

`claude agents` opens a live list of every Claude Code session in the current environment, organized by state:

- **Running** — the agent is actively working on its current turn
- **Blocked** — the agent needs a human decision before it can continue
- **Done** — the agent has completed its work and is ready for review

The view updates in real time. If you have five background sessions across three repos and one is waiting on a tool approval, it appears in the Blocked row immediately. You do not have to cycle through terminal tabs or check process output to discover that.

The practical effect is a shift in how you allocate attention during multi-agent work. Today's pattern: launch a task, forget it, discover it was stuck ten minutes ago when you check back. The Agent View pattern: glance at the dashboard every few minutes, handle anything in the Blocked column, continue with whatever you're doing.

Agent View also shows sessions started via `claude --bg` alongside interactive ones, with background sessions marked `bg`. The `/resume` command works directly from the view, which means handling a blocked session no longer requires remembering which background session handles which task.

The feature ships as a research preview, the same stage at which Computer Use and Code Review shipped before going GA. Expect it to evolve based on usage patterns before it stabilizes.

## /goal: The Command That Doesn't Stop

The /goal command takes outcome-based execution seriously. Instead of running a task and returning control when it completes, /goal defines a completion condition and Claude keeps working across as many turns as necessary until that condition holds:

```
/goal All tests in the auth module pass and coverage is above 85%
```

After each turn, a separate Haiku model evaluates whether the condition holds. If the condition is not met, Claude starts another turn. If it is met, execution stops and control returns to you.

The dual-model design is worth paying attention to. Using the same model to decide both what to do and when to stop creates a failure mode where the agent convinces itself the condition has been met before it actually has — what practitioners call mission drift. By running the evaluation in a separate Haiku context that receives only the current state and the original condition, the architecture keeps the judgment about success structurally separate from the work itself.

While /goal runs, a live overlay panel shows elapsed time, turn count, and current token usage. You can monitor cost accumulation in real time rather than discovering a large bill after the fact.

The command works in interactive mode, with the `-p` flag for scripted invocations, and in Remote Control. This means you can wire /goal into a CI pipeline, a Routine, or any other orchestration layer that drives Claude Code programmatically.

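
The worker/evaluator split described in this section can be illustrated with a small control loop. This is a conceptual sketch only, not Claude Code's implementation: `run_goal`, `GoalResult`, the two callables, and the `max_turns` safety cap are all hypothetical names introduced for illustration. The key property it shows is that the evaluator receives only the condition and the current state, never the worker's reasoning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoalResult:
    met: bool    # whether the evaluator judged the condition satisfied
    turns: int   # number of work turns that were run


def run_goal(
    condition: str,
    work_turn: Callable[[str], str],
    evaluate: Callable[[str, str], bool],
    max_turns: int = 20,  # assumption: a safety cap, not described in the article
) -> GoalResult:
    """Run work turns until a separate evaluator says the condition holds."""
    for turn in range(1, max_turns + 1):
        state = work_turn(condition)     # worker acts and returns a state summary
        if evaluate(condition, state):   # evaluator sees only condition + state
            return GoalResult(met=True, turns=turn)
    return GoalResult(met=False, turns=max_turns)


if __name__ == "__main__":
    coverage = {"value": 60}

    def fake_worker(condition: str) -> str:
        # Stand-in for an agent turn: each turn raises coverage by 10 points.
        coverage["value"] += 10
        return f"coverage={coverage['value']}"

    def fake_evaluator(condition: str, state: str) -> bool:
        # Stand-in for the separate judge: parses the state, ignores the worker.
        return int(state.split("=")[1]) > 85

    print(run_goal("coverage is above 85%", fake_worker, fake_evaluator))
```

Because the stop decision lives in `evaluate`, swapping in a stricter or cheaper judge does not require touching the worker at all.
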
## What Else Shipped in 2.1.139

The two headline features are accompanied by a set of smaller but useful changes:

**Plugin improvements**: `claude plugin details` now shows a plugin's full component inventory and a projected per-session token cost estimate before you install or run it. The marketplace also enforces plugin dependencies — if a plugin requires another plugin or MCP server to function, the dependency is flagged at install time rather than discovered at runtime. Both changes address the common experience of installing a plugin that silently fails to work.

**Session and background controls**: Background sessions (`--bg`) now appear in interactive mode session lists marked as `bg`. The `/resume` command accepts session IDs from background sessions, making it consistent with the existing interactive session resumption flow.

**Transcript navigation**: The transcript view now supports keyboard shortcuts for jumping between user prompts. For long sessions involving many turns, this is meaningfully faster than scrolling.

**Observability for multi-agent work**: API requests from subagents now carry `x-claude-code-agent-id` and `x-claude-code-parent-agent-id` headers. The `claude_code.llm_request` OpenTelemetry span includes matching `agent_id` and `parent_agent_id` attributes. Teams running multiple Claude Code agents in orchestrated workflows can now trace which agent made which API call without correlating by timestamp.

**/scroll-speed**: A minor quality-of-life command that adjusts mouse wheel scroll speed in the terminal with a live preview. Trivial, but the kind of thing that grates when it's wrong for your setup.

**Notable fixes**: A deadlock that blocked `claude auth login`, `logout`, and `status` when expired credentials coincided with the `forceRemoteSettingsRefresh` enterprise policy is resolved. The `autoAllowBashIfSandboxed` flag now correctly approves commands that include shell expansions.
Unbounded memory growth from HTTP/SSE MCP servers streaming non-protocol data is patched.

## From Session to Fleet

The direction of 2.1.139 is readable: Claude Code is evolving from a per-session tool into a multi-agent coordination platform you operate from a single interface.

Agent View is the control plane. /goal is the autonomy primitive that makes agents worth controlling. The observability additions (OTel agent headers, projected plugin costs) are the instrumentation layer that lets you understand what the fleet is doing and what it costs. These three things are coherent; they belong together.

The practical ceiling for today's solo developer is probably eight to twelve parallel sessions before coordination overhead exceeds productivity gain. But the relevant frame is not the solo developer — it is the team running Claude Code Routines overnight, the enterprise running Code Review on every PR, the startup where one developer is directing a dozen specialized agents across a monorepo. Agent View makes that model of working possible to manage without losing track.

Upgrade via `npm install -g @anthropic-ai/claude-code` or wait for the auto-update. The Agent View research preview requires no configuration beyond launching `claude agents`.

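
As an illustration of what the `agent_id` and `parent_agent_id` span attributes make possible, the sketch below rolls request counts up an agent hierarchy. It assumes spans have already been exported as plain dictionaries carrying those two keys; that export shape and the `rollup_requests` helper are hypothetical and not part of Claude Code or OpenTelemetry.

```python
from collections import Counter, defaultdict
from typing import Optional


def rollup_requests(spans: list[dict]) -> dict[str, int]:
    """Count LLM requests per agent, including requests made by its descendants."""
    own = Counter(s["agent_id"] for s in spans)

    # Map each agent to its parent (None for a top-level agent).
    parent_of: dict[str, Optional[str]] = {}
    for s in spans:
        parent_of[s["agent_id"]] = s.get("parent_agent_id")

    totals: dict[str, int] = defaultdict(int)
    for agent, count in own.items():
        node: Optional[str] = agent
        seen = set()  # guards against accidental cycles in malformed data
        while node is not None and node not in seen:
            seen.add(node)
            totals[node] += count
            node = parent_of.get(node)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical exported spans: one orchestrator, one subagent, one sub-subagent.
    spans = [
        {"agent_id": "root", "parent_agent_id": None},
        {"agent_id": "root", "parent_agent_id": None},
        {"agent_id": "a", "parent_agent_id": "root"},
        {"agent_id": "a", "parent_agent_id": "root"},
        {"agent_id": "a", "parent_agent_id": "root"},
        {"agent_id": "b", "parent_agent_id": "a"},
    ]
    print(rollup_requests(spans))
```

The same parent-walk works for token or cost totals if each span also carries a usage figure; only the counter changes.
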

---

**Sources:**

- [Release v2.1.139 · anthropics/claude-code](https://github.com/anthropics/claude-code/releases/tag/v2.1.139) — GitHub
- [Claude Code CLI 2.1.139 changelog](https://x.com/ClaudeCodeLog/status/2053913638197416198) — ClaudeCodeLog on X
- [Claude Code v2.1.139: Agent View, Goal Setting, and Enhanced Workflow Control](https://claude-world.com/articles/claude-code-21139-release/) — ClaudeWorld
- [Claude Code Agent View and Goal Command for AI Engineers](https://zenvanriel.com/ai-engineer-blog/claude-code-agent-view-goal-command-guide/) — Zen van Riel
- [Claude Code 2.1.139 adds /goal command](https://explainx.ai/blog/claude-code-goal-command-long-running-agents-2026) — explainx.ai
- [Changelog - Claude Code Docs](https://code.claude.com/docs/en/changelog) — Anthropic

---

# Google I/O 2026: Firebase Studio Is Live, Jules Goes Free, and the Agentic Race Gets a Third Contender

URL: https://sdd.sh/2026/05/google-io-2026-firebase-studio-jules-free-gemini-code-assist-recap/
Date: 2026-05-19
Updated: 2026-05-19
Tags: google, google-io, firebase, firebase-studio, jules, gemini, gemini-code-assist, agentic-coding, claude-code
Categories: AI Tools, Agentic Workflows, Industry
Summary: Google I/O 2026 delivered the developer tools story it promised: Firebase Studio launched as a full-stack agent-native development platform, Jules exited beta with free-tier access, and Gemini Code Assist hit general availability. Google's agentic coding stack is now a real product, not a roadmap.

The [preview article I published six days ago](https://sdd.sh/posts/google-io-2026-preview-gemini-4-firebase-agents-agentic-coding/) said to watch three things at Google I/O 2026: whether Gemini 4's context window advantage translated into better coding outcomes, whether Firebase Studio became a real agent-native platform or a rebranding exercise, and whether Jules V2 had a credible answer to Claude Code Routines. The keynote delivered answers.
Not all of them were what Google needed — but some were.

## What Actually Shipped

Google's I/O 2026 developer story is cleaner than most expected. Three products moved from preview or beta to shipped:

- **Firebase Studio** launched as an agent-native full-stack development environment
- **Jules** exited beta and became available to all users on free and paid AI Pro and Ultra tiers
- **Gemini Code Assist** hit general availability for individuals and GitHub users, powered by Gemini 2.5

Supporting these launches: **Gemini Intelligence** — Google's integrated AI suite across Android, ChromeOS, Wear OS, Android Auto, and Android XR — and **Googlebooks**, the first Aluminium OS laptops from Acer, ASUS, Dell, HP, and Lenovo arriving this fall.

Google did not ship Gemini 4 as a standalone model with a clean benchmark story. The model capability layer is table stakes now; the developer tooling story is what differentiated I/O 2026.

## Firebase Studio: The Real Announcement

Firebase Studio is the most significant thing Google shipped today for developers who care about agentic workflows. It is not Project IDX with a new name. It is a substantively different product.

The architecture: a Code OSS IDE environment in the browser, a no-code prototyping layer for non-developer stakeholders, and an agent mode capable of executing multi-step development tasks autonomously. Figma integration means a design file becomes an application prototype in Firebase Studio without manual handoff. Google Cloud backend provisioning is automated — Cloud Run, Firebase Hosting, and related services are available without configuration.

The intended workflow is: prototype in Google AI Studio → build in Firebase Studio → deploy to Google Cloud. For teams already in the Google ecosystem, this is a credible end-to-end pipeline with fewer seams than anything Google has shipped before.

The comparison to Claude Code is the obvious one and Google knows it.
Firebase Studio's thesis is that the browser is where more developers live, that cloud-native development removes the local environment complexity, and that Figma-to-deployment in a single environment lowers the barrier for teams that currently have a designer-to-developer handoff problem. Claude Code's thesis is that the terminal provides the access, flexibility, and tooling depth that truly autonomous agents require — and that browser-based environments introduce platform constraints that limit what agents can do.

Both theses are coherent. They're not targeting exactly the same developer.

Where Firebase Studio wins: Google Cloud-native integration depth is real and unmatched. If you are deploying to Cloud Run, using BigQuery, or running on Firebase Hosting, the one-click deployment and native service wiring save hours of configuration that Claude Code requires separately via MCP servers or custom scripts.

Where Claude Code wins: environment ownership. Claude Code agents can run arbitrary shell commands, modify system configs, install toolchains, and manage processes in ways that a browser IDE cannot. For the kind of spec-driven, multi-agent, CI-integrated autonomous development that Claude Code Routines enables, Firebase Studio's browser-native architecture is a constraint, not an advantage.

Firebase Studio is a real product that will earn real adoption. It is not Claude Code.

## Jules Goes Free: The KPI-Driven Bet

Jules exiting beta is significant for one reason: it is now available to all users, including the free tier. That means any developer can queue an async task on Google's infrastructure, walk away, and come back to a pull request.

The architectural story has not changed from [the Jules deep dive published in March](https://sdd.sh/posts/jules-deep-dive-google-async-agent-ci-loop/). Jules integrates with GitHub, runs on Google infrastructure, creates multi-step plans, executes them asynchronously, and presents results as a diff with reasoning attached.
Audio changelogs of its work are available. The CI loop closes automatically.

What is new is Project Jitro — the Jules V2 approach that changes the input model. Instead of telling Jules what to do (fix this bug, refactor this module), you tell it what to achieve: raise test coverage to 80%, reduce p95 API latency by 30 milliseconds, resolve all accessibility violations in the component library. Jitro maps the goal to the required code changes, runs asynchronously, and delivers a pull request targeting the metric, not the task.

KPI-driven development is a genuinely interesting framing. It is also harder to evaluate than task-driven development because the rubric for success is embedded in the goal definition. If you tell an agent "raise test coverage to 80%," the most efficient path is to write trivial tests that cover lines without exercising real behavior. Whether Google has solved that evaluation problem is not yet clear from today's launch.

Claude Code's analog is Managed Agents Outcomes, announced at Code with Claude SF on May 6: a separate rubric-based grader that runs in its own context window, evaluates whether the agent's output meets defined criteria, and triggers re-runs if it doesn't. The grader runs independently of the agent, which is structurally different from building the goal evaluation into the task itself. Neither approach has published failure rate data at production scale.

The Jules free tier changes who can evaluate these tools. The comparison is now available to any developer without a budget commitment.

## Gemini Code Assist: GA With Caveats

Gemini Code Assist reached general availability for individuals and GitHub users. Gemini 2.5 powers the assistant. A 2 million token context window is announced as coming soon, not yet live.

The "coming soon" caveat matters. A 2M token context window for Gemini Code Assist would change the competitive comparison with Claude Code's 1M context window significantly for whole-codebase tasks.
But it is not shipping today. GA means the product is available and supported — it does not mean the context window feature announced for the future is present now.

At current Gemini 2.5 Pro performance levels, the model quality gap versus Opus 4.7 on SWE-bench Pro is approximately ten percentage points (54% vs 64.3%). Gemini Code Assist's competitive advantage today is not model quality — it is Google Cloud native integration and price. The free tier (700,000+ VS Code installs) gives Google enormous distribution. If Gemini 4 substantially closes the model quality gap, that distribution becomes a moat.

## The Platform Layer: Gemini Intelligence and Googlebooks

Two announcements from today are not directly about developer tooling but matter for the longer arc.

**Gemini Intelligence** — the integrated AI suite across Android, ChromeOS, Wear OS, Android Auto, and Android XR — represents Google's bet that the agent layer lives in the OS, not just in development tools. Features like proactive task automation, custom widget generation, and the Rambler speech-to-text assistant are not developer tools. They are consumer surfaces that normalize agentic behavior for users who will eventually consume agentic software. Google is building the audience for agentic applications at the OS level while simultaneously building the tools to create them.

**Googlebooks** — the first Aluminium OS laptops from major OEMs — is the physical manifestation of the Android-ChromeOS merger that developers have been tracking for two years. Aluminium OS arrives in fall 2026. For Android and web developers, it is a new primary development and consumption target.

## The Three-Way Race

I/O 2026 establishes something that was not clearly true twelve months ago: the agentic coding market has three serious competitors.

Claude Code leads on model quality (Opus 4.7, 64.3% SWE-bench Pro), terminal-native autonomy, Managed Agents platform depth, and enterprise infrastructure (Cowork, Analytics API, Code Review GA). It is the benchmark that everyone else is chasing.

OpenAI's Codex is the credible cost alternative, leveraging GPT-5.5 (82.7% Terminal-Bench 2.0, 58.6% SWE-bench Pro) with async execution, a mobile supervision layer, and pricing that enterprise procurement finds easier to defend than Claude Code's per-token costs.

Google now has Firebase Studio (agent-native platform for Google Cloud deployments), Jules (free-tier async agent with KPI-driven V2 approach), and Gemini Code Assist (2M context window incoming, 700K VS Code installs). Google's stack wins on distribution, integration depth within the Google ecosystem, and price. It loses on autonomous execution depth and current model quality.

The developer who builds on Google Cloud, has Figma in their design workflow, and wants an integrated environment for new projects now has a real choice in Firebase Studio. The developer doing spec-driven multi-agent development at team scale with CI integration, production CLAUDE.md invariants, and autonomous overnight coding runs was not given a reason to switch today.

The milestone that matters: once Gemini Code Assist ships the 2M context window, and if Google closes the SWE-bench Pro gap with whatever model ships next, the model-quality argument for Claude Code's premium pricing weakens. Until then, Google has narrowed the tooling gap without closing the quality gap.

That is real progress. It is also still a gap.


---

**Sources:**

- [Everything announced at The Android Show: I/O 2026 edition](https://www.engadget.com/2171038/everything-announced-at-android-show-google-io-2026/) — Engadget
- [Google I/O 2026: The Developer Briefing](https://byteiota.com/google-io-2026-developer-preview/) — Byteiota
- [Firebase Studio - Google](https://firebase.google.com/docs/studio) — Firebase Docs
- [Firebase Studio lets you build full-stack AI apps with Gemini](https://cloud.google.com/blog/products/application-development/firebase-studio-lets-you-build-full-stack-ai-apps-with-gemini) — Google Cloud Blog
- [Google's Next Coding Agent Could Change How Developers Think About Their Work](https://devops.com/googles-next-coding-agent-could-change-how-developers-think-about-their-work/) — DevOps.com
- [Google tests Jules V2 agent capable of taking bigger tasks](https://www.testingcatalog.com/google-prepares-jules-v2-agent-capable-of-taking-bigger-tasks/) — Testing Catalog
- [Google I/O 2026 Developer Preview: Gemini 4, Android 17, Agentic Coding](https://www.abhs.in/blog/google-io-2026-may-19-gemini-4-android-17-agentic-coding-developer-preview) — Abhishek Gautam
- [Google Counters GitHub & Microsoft with Jules Agent & Enhanced Gemini AI](https://visualstudiomagazine.com/articles/2025/05/20/google-counters-github-microsoft-with-jules-agent-enhanced-gemini-ai.aspx) — Visual Studio Magazine
- [AI assistance within Firebase Studio](https://firebase.google.com/docs/studio/ai-assistance) — Firebase Docs

---

# Anthropic Passed OpenAI in Business AI Spend. The Ramp Data Is Decisive — and the Threats Are Serious.

URL: https://sdd.sh/2026/05/anthropic-overtakes-openai-ramp-ai-index-may-2026/
Date: 2026-05-19
Updated: 2026-05-19
Tags: anthropic, claude-code, openai, enterprise, market-share, industry, ramp
Categories: Industry, AI Tools
Summary: The May 2026 Ramp AI Index shows Anthropic at 34.4% of US business AI spend — past OpenAI's 32.3% for the first time. Claude Code is the engine.
But the same report flags three structural threats that could erase the lead as fast as it was built.

For the first time since the AI race started in earnest, more American businesses are paying for Anthropic than for OpenAI. The May 2026 Ramp AI Index, compiled from transaction data across more than 50,000 US businesses and $100 billion in annual spend, shows Claude at 34.4% adoption — up 3.8% in April — while ChatGPT fell to 32.3%, down 2.9% in the same period.

This is a single data point from a single measurement methodology. It is also the most credible third-party spend-tracking dataset in the AI market, and the crossover it records is unambiguous.

## How This Happened

The trajectory is not subtle. Anthropic climbed from 0.03% of businesses in June 2023 to 7.94% by April 2025 — then rocketed to 34.4% by April 2026. That is a quadrupling in a single year. OpenAI's business adoption grew 0.3% over the same period.

The driver is not Claude's chat product. It is Claude Code.

Ramp's analysis identifies Claude Code as the fastest-growing product in Anthropic's history and the primary mechanism behind the adoption surge. That tracks with external signals: a separate analysis of public GitHub data published this month estimated that 4% of all global public commits are now being authored by Claude Code — double the percentage from just one month prior. For context, GitHub processed roughly 4 billion commits in 2025. Four percent of that is 160 million commits per year. One tool, one year, that kind of scale.

The growth compounds because Claude Code is a workflow tool, not a chatbot. When a team adopts Claude Code, usage accumulates through API calls rather than seat licenses. One engineer doing serious agentic work can consume between $500 and $2,000 per month in API costs. Multiply that across engineering orgs, and Anthropic's revenue per customer is significantly higher than a traditional SaaS model where everyone pays the same monthly fee regardless of usage.

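
The arithmetic in this section is easy to reproduce. The sketch below recomputes the commit figure and the adoption multiple from the numbers quoted above, then scales the per-engineer cost range to an organization. The 200-engineer headcount and all function names are hypothetical, chosen only to make the "multiply that across engineering orgs" step tangible.

```python
def share_of(total: int, percent: int) -> int:
    """Integer percentage of a total (integer math avoids float rounding)."""
    return total * percent // 100


def growth_multiple(start_pct: float, end_pct: float) -> float:
    """How many times larger the ending adoption share is than the starting one."""
    return end_pct / start_pct


def org_monthly_spend(engineers: int, low: int = 500, high: int = 2000) -> tuple[int, int]:
    """Low and high monthly API spend in USD, using the per-engineer range quoted above."""
    return engineers * low, engineers * high


if __name__ == "__main__":
    # 4% of roughly 4 billion commits, as quoted in the article.
    print(share_of(4_000_000_000, 4))
    # Adoption multiple from 7.94% (April 2025) to 34.4% (April 2026).
    print(round(growth_multiple(7.94, 34.4), 2))
    # Hypothetical 200-engineer organization.
    print(org_monthly_spend(200))
```

The first result matches the 160 million figure in the text, and the adoption multiple comes out slightly above four, consistent with the "quadrupling" description.
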
Anthropic's annualized revenue run rate hit approximately $30 billion in early 2026, up from $9 billion at the end of 2025. OpenAI is tracking at $24-25 billion over the same period — a reversal from a lead that had seemed structural just eighteen months ago.

## Three Threats the Data Surfaces

Ramp doesn't just track the crossover. It flags three structural risks that could unwind Anthropic's position. Each deserves honest analysis.

**Threat 1: The token incentive trap.** Anthropic makes money when customers use more tokens. That creates a structural incentive to push customers toward expensive models and high-context workflows even when cheaper alternatives would suffice. Ramp frames this bluntly: "Anthropic profits from increased token consumption, creating pressure to push customers toward expensive models even when cheaper ones are sufficient."

This is the underlying economics behind what Uber's CTO described publicly: the company burned through its entire 2026 AI budget in four months, largely on Claude Code and Cursor. Individual engineers are reporting $500 to $2,000 per month in personal API costs for serious agentic workflows. At those numbers, CFOs start paying attention. And when CFOs start paying attention, they look for alternatives.

**Threat 2: Reliability and cost shifts.** Ramp's data captured a period of user frustration — "frequent outages, rate limits, and increasing dissatisfaction with results." Anthropic responded by resetting usage limits in April and securing additional compute capacity through the SpaceX Colossus deal (300MW, 220,000+ NVIDIA GPUs in Memphis). The rate limit reset helped. But the underlying compute constraint that created the reliability problems is a consequence of the 80x growth in Q1 2026 that Anthropic had only planned as 10x. That kind of demand mismatch doesn't resolve cleanly.

A separate cost issue: recent model changes tripled token costs for image-inclusive prompts.
That's a significant jump in a category where usage is growing. Claude Code's computer use features and the visual analysis capabilities of Opus 4.7 both involve image tokens. Developers building on those capabilities took an unexpected cost hit.

**Threat 3: OpenAI Codex as cost-effective alternative.** OpenAI's Codex — the async agentic coding agent, not the legacy model — now covers substantial overlap with Claude Code's core workflow at a lower per-task cost and with minimal switching friction. Ramp identifies inference platforms offering cheap, open-source alternatives as the fastest-growing competing category in their dataset. Codex isn't open-source, but its pricing structure and the ease of migration via standard API patterns mean that cost-sensitive teams have a credible exit path.

The switching cost from Claude Code to Codex is lower than it looks from the outside. Both tools operate via terminal, both support CLAUDE.md-style configuration, both integrate with GitHub. The moat is model quality, CLAUDE.md ecosystem depth, and the Managed Agents platform. If OpenAI closes the model quality gap on SWE-bench Pro (currently 58.6% GPT-5.5 vs 64.3% Opus 4.7), the Codex cost argument gets harder to dismiss.

## What Actually Changes

The Ramp crossover is symbolically significant and operationally real. "Most businesses paying for Anthropic vs OpenAI" means enterprise IT procurement conversations are now tilted differently than they were six months ago. When Anthropic walks into a 10,000-seat enterprise negotiation, it no longer needs to defend itself against "but everyone uses ChatGPT." The data now says the opposite.

But the threats Ramp surfaces are also real, not hypothetical. Uber's budget story will be repeated in CFO conversations at every large enterprise that has given engineers open-ended Claude Code access. The response from those CFOs won't necessarily be "switch to a competitor" — it may be "governance and spend controls."
That's exactly what Claude Cowork GA addresses (group spend limits, per-user caps, Analytics API for cost attribution). Anthropic has built the enterprise controls. The question is whether adoption of those controls keeps pace with the cost concerns they're meant to address.

The deeper question is whether Claude Code's architectural advantages are durable. The terminal-native, agent-owned model — where Claude Code has full environment access and owns the full development lifecycle from spec to deployment — is qualitatively different from IDE-embedded tools. But "qualitatively different" only maintains a price premium if users feel the difference in their outcomes, not just in their benchmarks.

The 4% of GitHub commits metric is the most direct signal available. At 160 million commits per year, something about the outcomes is working. The business adoption crossover confirms the enterprise is noticing.

The threats are real. The lead is real. The next quarter of Ramp data will be informative.

---

**Sources:**

- [Ramp AI Index — May 2026](https://ramp.com/leading-indicators/ai-index-may-2026) — Ramp
- [Anthropic now has more business customers than OpenAI, according to Ramp data](https://techcrunch.com/2026/05/13/anthropic-now-has-more-business-customers-than-openai-according-to-ramp-data/) — TechCrunch
- [Anthropic finally beat OpenAI in business AI adoption — but 3 big threats could erase its lead](https://venturebeat.com/technology/anthropic-finally-beat-openai-in-business-ai-adoption-but-3-big-threats-could-erase-its-lead) — VentureBeat
- [Anthropic Passes OpenAI in Business Adoption: Ramp AI Index](https://letsdatascience.com/blog/anthropic-passed-openai-business-adoption-ramp-index) — Let's Data Science
- [Anthropic 34.4% Just Passed OpenAI — Ramp Flip May 2026](https://theplanettools.ai/blog/anthropic-overtakes-openai-ramp-business-adoption-may-2026) — ThePlanetTools.ai
- [Anthropic vs OpenAI Business Adoption: What the Data Says About Enterprise AI](https://www.mindstudio.ai/blog/anthropic-vs-openai-business-adoption-2026) — MindStudio

---

# OpenAI Codex Mobile: Remote Control for Your Agent, Not Code on Your Phone

URL: https://sdd.sh/2026/05/openai-codex-mobile-remote-control-agentic-sessions/
Date: 2026-05-18
Updated: 2026-05-18
Tags: openai, codex, mobile, agentic-workflows, remote-access
Categories: AI Tools, Agentic Workflows
Summary: OpenAI shipped Codex inside ChatGPT for iOS and Android on May 14 — but not as a code execution environment. It's a remote viewport onto a session running on a host machine. Remote SSH also went GA. The architectural choice is correct, and it reveals more about agentic coding than the headline does.

The announcement almost always gets written wrong. "Codex comes to your phone" — technically accurate, architecturally misleading. What launched on May 14 is a remote viewing and control surface, not a mobile runtime.
Your phone becomes a window into a Codex session running on a host machine — your laptop, a Mac mini, a dev box, or a cloud VM you've configured.

This is the right design. And it says more about where agentic coding is headed than the headline suggests.

## Why Mobile Can't Be the Runtime

Agentic coding at the frontier requires substantial compute and system access. Anthropic's SpaceX Colossus deal — 300 megawatts, 220,000+ NVIDIA GPUs allocated to Claude — suggests the scale these systems will eventually run at. That isn't coming to the A18 chip.

But raw compute is only part of it. Coding agents need persistent filesystem access (your project files, not a sandboxed documents folder), tool execution (git, cargo, pytest, Docker, whatever your stack requires), credential access (API keys, SSH certs, cloud auth), and long-running sessions that can stretch across hours. None of these fit cleanly on a phone.

OpenAI made the constraint explicit in the launch notes: files, credentials, and local setup stay on the host machine. The phone only receives what the host chooses to stream — terminal output, screenshots, file diffs, test results. You can review outputs, approve commands, change models, and start new sessions from the phone. The agent itself stays on the machine that can do the work.

## What Actually Shipped

Three things launched together on May 14:

**Codex on ChatGPT mobile (preview)** — Available across all ChatGPT plans, including Free and Go. The mobile app shows a live view of active Codex threads: what the agent is doing, what it's produced, which commands it's requesting. You can redirect a session, approve a step, or kill a runaway loop from the phone without touching a laptop.

**Remote SSH GA** — Codex can now connect to any SSH-accessible machine. This moved from preview to general availability on May 14. The practical implication: Codex doesn't need to run on your local laptop at all.
SSH into a powerful dev box, a team server, or a cloud VM — the session runs there. The phone (or any client) streams it.

**Programmatic access tokens for Enterprise and Business** — Scoped credentials that let Codex sessions authenticate to services without exposing primary account credentials. Useful for CI/CD pipelines and automated workflows where an agent needs service-level access rather than user-level access.

HIPAA-compliant Codex in CLI, IDE, and the Codex app also launched for Enterprise customers — but local environments only. Healthcare teams running agents against protected data keep everything on-premises. The mobile viewer is excluded from this mode by design.

## The Architecture It Reveals

Mobile-as-viewport isn't a compromise. It's an acknowledgment that agentic workflows have natural supervision points — moments where a human should glance at what's happening and decide whether to continue, redirect, or stop. You don't need a laptop to do that.

This pattern was already live for Claude Code users. Claude Code Channels (launched March 2026) routes agent sessions through Telegram or Discord. The agent runs on your machine or Anthropic's infrastructure; the messaging app is the supervision surface. You get a message when the agent finishes a step, needs approval, or hits an error. You reply with instructions.

The key architectural difference is where the agent lives. Codex on mobile still requires a host machine — your device, a dev box, or a cloud VM you've configured. Claude Code Routines run on Anthropic's infrastructure natively; there's no host to provision or maintain. When a Routine finishes overnight, the result shows up wherever you're watching. Remote SSH GA narrows the gap for Codex — if you provision a powerful cloud VM and SSH Codex into it, you get a similar effect — but it requires managing that infrastructure yourself.

## Who This Is Actually For

The core use case is supervision during transitions.
You kick off a multi-hour Codex session before a meeting. Mid-meeting, you check your phone, verify the agent is making progress, review an intermediate diff. After the meeting, you approve the next step or give the agent new direction.

That turns "agentic coding requires sitting at a computer" into "agentic coding requires occasional glances from wherever you are." That's a meaningful quality-of-life change for anyone running long sessions.

The enterprise features have a different constituency. Remote SSH GA lets engineering teams provision powerful centralized dev boxes and let developers SSH Codex sessions into them — shared infrastructure rather than per-developer laptops. The HIPAA-compliant mode signals OpenAI's intent in regulated industries. Healthcare teams that needed fully on-premises agent workflows can now use Codex agents within those constraints, supervised via the desktop app.

The Free and Go plan access for the mobile viewer is the broader play. OpenAI is normalizing the idea that you supervise agents from your phone rather than owning and operating a dev machine. For individual developers and indie hackers, lower barrier to entry matters.

## The Problem That Remains

Codex on mobile is a good execution within real constraints. The constraints are legitimate: mobile is a viewport, not a runtime, and OpenAI designed accordingly.

But every Codex session still requires a host machine. If your laptop runs out of battery, the session stops. If you're traveling and didn't provision a cloud VM, there's no fallback. The agent isn't cloud-native by default. It's cloud-accessible-if-you-set-it-up.

The natural next step is fully managed cloud execution: OpenAI runs the agent on its own infrastructure by default, and mobile and desktop clients supervise it. That would make the phone viewer genuinely powerful rather than convenient — the agent outlasts your hardware.
Whether OpenAI builds this before Claude Code's Routines become the default expectation for how autonomous agents work is the product trajectory to watch.

For now, May 14 is a worthwhile milestone. Remote SSH GA alone is the more significant technical change — it decouples Codex from the developer's local machine, which is the prerequisite for everything else. The mobile viewer is the consumer face of an infrastructure shift that matters.

---

**Sources:** [TechCrunch](https://techcrunch.com/2026/05/14/openai-says-codex-is-coming-to-your-phone/), [9to5Mac](https://9to5mac.com/2026/05/14/openai-brings-codex-control-to-chatgpt-for-iphone-and-android/), [SiliconANGLE](https://siliconangle.com/2026/05/14/openai-brings-codex-mobile-devices-adds-customization-features/), [OpenAI Developer Docs — Remote Connections](https://developers.openai.com/codex/remote-connections), [Gadget Bridge](https://www.gadgetbridge.com/news/openai-codex-lands-on-chatgpt-mobile-app-for-ios-and-android-with-remote-ssh-support/)

---

# Cursor 3.3 and 3.4: Parallel Build Plans, Cloud Dev Environments, and the Ceiling That Remains

URL: https://sdd.sh/2026/05/cursor-33-34-parallel-agents-cloud-dev-environments/
Date: 2026-05-18
Updated: 2026-05-18
Tags: cursor, parallel-agents, cloud-environments, agentic-workflows, code-review
Categories: AI Tools, Agentic Workflows, Industry
Summary: Cursor shipped two meaningful updates in May: Parallel Build Plans and PR Splitting in 3.3 (May 7), and Cloud Agent Development Environments plus configurable Bugbot effort levels in 3.4 (May 13). Both updates are genuine improvements. Both also clarify what Cursor is and isn't.

Cursor shipped two changelogs in quick succession this month. Version 3.3 on May 7 added Parallel Build Plans and built-in PR Splitting. Version 3.4 on May 13 added Cloud Agent Development Environments and configurable Bugbot effort levels.
The combined effect is a meaningfully more capable agentic coding environment — and a clearer picture of where Cursor's architecture lands.

Both things are true: these are real improvements, and the ceiling is real. Let's look at both.

## 3.3: Parallel Build Plans and PR Splitting

The headline feature in 3.3 is "Build in Parallel." When Cursor generates a multi-step implementation plan, a button now identifies which parts are independent and runs them as async subagents concurrently. Steps that require earlier output stay ordered; everything else runs in parallel.

This is a genuine improvement over the single-agent sequential loop. A plan to add authentication middleware, update the database schema, and write integration tests has three largely independent branches. Running them sequentially meant the agent worked on one while the others waited. Build in Parallel runs all three concurrently and merges results when they complete.

PR Splitting is the complementary feature. Once an agent produces a large diff, a quick-action pill in the PR view proposes how to split it into logically independent pull requests. Cursor shows the proposed split, creates a backup snapshot, and executes if you confirm. The chat context from the session informs how it identifies slices — if you told the agent "add auth and fix the caching bug," it knows those are separate concerns and splits accordingly.

Both features address a real friction: AI-generated diffs routinely mix multiple logical changes in a single commit because agents tend to fix adjacent things they notice. Giving the agent a splitting primitive and parallel execution reduces that sprawl.

## 3.4: Cloud Dev Environments

The more architecturally significant update is in 3.4. Teams can now configure a Dockerfile-based development environment that Cursor agents use when running in the cloud. The Dockerfile specifies the repository, dependencies, credentials, build system access, and any tooling the agents need.
It's reusable across sessions and supports multi-repo configurations.

This addresses a real failure mode: cloud agents fail in opaque ways when they can't find a dependency, authenticate to a service, or run a build command. A team-managed Dockerfile that defines the full environment removes that ambiguity. Every agent session starts from a known-good state.

The enterprise value is clear. Previously, Cursor cloud agents ran in generic sandboxes that may or may not match your actual development environment. Now a team can define "this is what our stack looks like" and have agents operate reliably within it. Persistent environments across sessions, multi-repo access, pre-baked credentials — these are the table stakes for production agentic workflows.

Bugbot gains effort levels in 3.4. Default mode finds 0.7 bugs per run with 79%+ resolved by merge time. High mode climbs to 0.95 bugs per run — slower, more expensive, more thorough. A Custom mode takes natural-language instructions for when to use which: "use High effort for PRs touching the payment flow, Default for everything else."

These are actual numbers, which is unusual in agent quality claims. 0.7 bugs per run as the default, reaching 0.95 in high-effort mode, with a documented merge-time resolution rate — that's the kind of measurement that makes it possible to have a real conversation about whether the tool is worth the cost.

## What These Features Don't Change

Credit given: parallel build plans, cloud dev environments, and configurable Bugbot are the right direction. Cursor is building infrastructure for agentic workflows, not adding more autocomplete. These are serious product improvements.

The architectural ceiling is structural, not cosmetic.

**On parallel execution:** "Build in Parallel" runs subagents within Cursor's orchestration layer — concurrent subtask execution within a single agent context.
Claude Code Agent Teams (March 2026) ships a 15-agent mailbox architecture where each agent is an independent peer with its own tool access, memory, and task queue. The difference is not degree — it's category. Cursor's parallel execution is concurrency. Agent Teams is coordination. In multi-day, multi-component projects, that gap surfaces.

**On cloud dev environments:** Cursor's cloud environments give agents a configured execution context — a Dockerfile that replicates your dev setup. This solves the "agent can't find the dependency" problem. It doesn't change who owns and manages the infrastructure. Your team writes the Dockerfile, provisions the environments, and maintains them as your stack evolves. Claude Code Managed Agents uses Anthropic's infrastructure with Anthropic's reliability SLA. Claude Code on AWS Bedrock (GA since April 18) gives you AWS-managed infrastructure with Mantle zero-operator-access guarantees. Different risk profiles for different organizational requirements.

**On Bugbot:** 0.95 bugs per run is a good single-agent review number. Claude Code Review (GA May 6, $15–25 per PR) runs multiple independent review agents in parallel, each with its own context and focus area. Multi-agent review isn't just "more bugs found" — it's different reviewers looking for different things simultaneously. Comparing single-agent and multi-agent review on a per-bug metric misses the architectural difference.

## The Cursor Question

The pattern across 3.3 and 3.4 is consistent: Cursor adds depth to features the IDE already had. The additions work, the metrics are real, and teams using Cursor benefit directly from these updates.

What Cursor is building is an IDE that orchestrates AI agents. What Claude Code is building is an agent that can use an IDE when it needs to. The distinction matters more with each release because the use cases are diverging.
If your team's workflow requires visual diff review, inline suggestions, and IDE-integrated chat — and you're willing to be in the loop during agent execution — Cursor 3.3 and 3.4 are good reasons to stay. Parallel build plans speed up multi-component work. Cloud dev environments reduce agent failures in production workflows. Configurable Bugbot means you can tune the quality/cost tradeoff per PR.

If your question is "what's the fastest path to autonomous software development with minimal developer bottlenecks," these updates don't move the answer. They extend a model where a developer is a necessary orchestration participant. That's a legitimate product choice — most engineering teams aren't ready to remove that participant. But it's worth being clear that the choice is being made, not just the tool.

At a $50 billion valuation (announced April 2026), Cursor has the resources to ship more quickly. The direction is right. The question is whether features added to an IDE-centric architecture can close the gap on a terminal-native agentic architecture, or whether those are just different products serving different markets. The next version will tell us something about which it is.

---

**Sources:** [cursor.com/changelog (3.3, May 7, 2026)](https://cursor.com/changelog/05-07-26), [cursor.com/changelog (3.4, May 13, 2026)](https://cursor.com/changelog/05-13-26), [Cursor Cloud Agent Development Environments blog](https://cursor.com/blog/cloud-agent-development-environments), [Cursor Bugbot effort levels blog](https://cursor.com/blog/may-2026-bugbot-changes), [The Decoder — Cursor 3 parallel agent coverage](https://the-decoder.com/new-cursor-3-ditches-the-classic-ide-layout-for-an-agent-first-interface-built-around-parallel-ai-fleets/)

---

# From Ghost Text to Autonomous Agent: Five Years of AI Coding Tools

URL: https://sdd.sh/2026/05/from-copilot-to-autonomous-agents-ai-coding-evolution-2021-2026/
Date: 2026-05-17
Updated: 2026-05-17
Tags: github-copilot, claude-code, cursor, swe-bench, agentic-workflows, history, autonomous-agents, mcp
Categories: AI Tools, Industry
Summary: Five years ago, GitHub Copilot autocompleted a function and developers argued whether it was cheating. Today, Google says 75%+ of its new code is AI-generated and Claude Opus 4.7 scores 87.6% on SWE-bench Verified. This is the arc — and the rupture nobody predicted.

In June 2021, a developer on Twitter posted a screenshot of GitHub Copilot completing a for-loop and wrote: "impressive, but it's just autocomplete." That take was correct and completely missed the point. What Copilot started was not a feature — it was the first turn of a loop that would, five years later, produce autonomous agents writing, testing, and shipping software while the engineer supervises from a terminal.

I've been here for the whole ride. And the most important thing I can tell you is that this evolution was not smooth. There was a rupture. And most of the tools you're familiar with are on the wrong side of it.

---

## 2021–2022: The Autocomplete Era

GitHub Copilot launched its private beta in June 2021 on top of OpenAI Codex.
The experience was genuinely magical in a narrow way: you typed a comment describing what you wanted and ghost text appeared, offering a plausible completion. Functions filled themselves in. Boilerplate evaporated.

But the paradigm was firmly tab-to-accept. The model was passive. It waited for you to type, offered a suggestion, and disappeared until you typed again. The human was the engine; the AI was the turbocharger. Amazon CodeWhisperer followed the same template. The competitive question was which model produced more accurate completions, not what the model was capable of doing on its own.

The discourse of this era aged badly. "It'll just write buggy code you'll have to fix anyway." "It's a Clippy for developers." "It trains on your private repo." Some of these concerns were legitimate; none of them engaged with the trajectory. Copilot went generally available in June 2022 and immediately became the most widely adopted developer tool in a generation.

The tool was limited. The appetite it revealed was not.

---

## 2022–2023: The Chat Era

GPT-4 landed in March 2023 and broke the autocomplete paradigm. Not because GPT-4 was better at completing lines — though it was — but because it could sustain a coherent conversation about a codebase across hundreds of turns. Developers stopped asking "complete this function" and started asking "why does this fail, what should I change, how would you design this differently."

This was the era of vibe coding, a term that emerged to describe a workflow that was equal parts productive and reckless: paste the error message, accept the fix, run it again, don't read the diff. Engineers started shipping features faster than they could reason about what they were shipping. Technical debt accumulated at AI speed.

SWE-bench was created in late 2023 by Princeton NLP researchers, and its arrival mattered more than most people realized at the time.
For the first time there was a structured benchmark measuring something close to real software engineering — resolving GitHub issues in real Python repositories. The initial numbers were humbling: state-of-the-art models solved less than 5% of tasks. That number would become a speedometer for the entire field.

The chat era was real progress. But it still kept the human firmly in the loop. The model reasoned; you acted. The model suggested; you typed. The computer did not do anything you didn't explicitly ask for.

---

## 2023–2024: The Agent Experiments

In March 2024, Cognition AI launched Devin with a claimed 13.86% on SWE-bench — more than double anything that had come before — and a press release that called it "the world's first AI software engineer." The backlash was immediate and partly warranted: independent researchers found the methodology questionable and real-world performance disappointing.

But the significance of the moment had nothing to do with Devin's actual capabilities. It had to do with the framing. For the first time, a serious company shipped a product positioned not as a tool for engineers but as a replacement agent. The Overton window shifted. "AI software engineer" stopped being science fiction and started being a product category.

Cursor launched around the same time as an AI-first fork of VS Code, and it was genuinely good. Context-aware edits, inline chat, codebase indexing — it pushed the IDE model further than Copilot had. Developers who lived in VS Code found it transformative. The model had also improved dramatically: Claude 3 Sonnet and Opus raised the quality ceiling on what an AI could reason about code.

But Cursor's architecture made a bet: that the right interface for AI-assisted development was still the IDE. That developers would stay in their editor, and the AI would work within that frame. It was a defensible bet. It was also, I'd argue, a ceiling.

---

## 2024–2025: The Terminal-Native Rupture

Claude Code launched in early 2025 and it was architecturally different from everything that came before. Not marginally different — structurally different. It ran in the terminal. It had no IDE dependency. It could read your entire repository, plan across files, run tests, interpret the output, iterate, and complete a multi-step task without asking for confirmation at every turn.

The IDE-vs-terminal debate that followed was widely misread as a UI preference war. It was not. It was a debate about who holds the steering wheel.

In Copilot, in Cursor, the human is always in the critical path. You accept or reject suggestions. You trigger actions. The model is a very powerful tool you're operating. In Claude Code — especially after MCP shipped in late 2024 — the model can hold the plan across a long-horizon task. You can describe what you want, walk away, and come back to a pull request. The human is a supervisor, not an operator.

MCP (Model Context Protocol) deserves more credit than it gets in this story. Shipping in late 2024, it gave Claude Code — and any conforming agent — a standardized way to plug into external tools: databases, APIs, file systems, CI pipelines. By mid-2025 it had 97 million downloads. MCP turned Claude Code from a capable terminal agent into an extensible platform.

SWE-bench Verified hit roughly 60% in this period, up from under 20% two years earlier. The benchmark was moving fast enough that researchers started debating whether it was still measuring the right thing.

---

## 2025–2026: The Agentic Era Is Not Coming — It's Here

Claude Opus 4.7 scores 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro. The Stanford AI Index 2026 notes that SWE-bench Verified is approaching the human performance baseline. Google announced at Cloud Next 2026 that over 75% of its new code is AI-generated. Claude Code crossed $2.5B ARR. Managed Agents, Code Review GA, and Agent Teams shipped.
Let me sit with those numbers for a moment, because if you'd shown them to the engineer who posted that Copilot screenshot in 2021, they would have assumed you were describing a dystopian film.

The workflow that's emerging — not at some companies, but at most serious engineering organizations — looks like this: the engineer writes a spec or describes a task, an agent implements it, runs the tests, opens a PR, and flags edge cases for human review. The engineer's primary interface is no longer the editor. It is increasingly the spec, the review, the judgment call on ambiguity.

This is not the elimination of software engineers. It is the elimination of a large fraction of the work software engineers have historically done. The implementation layer is being automated. What remains irreducibly human is the part that was always undervalued: understanding why the system should exist, what it should do in cases the spec didn't anticipate, and whether the thing the agent built is actually what the business needed.

---

## What Is the Software Engineer's Irreducible Role?

I don't have a clean answer, and I'm suspicious of people who do. The honest version is that the industry is mid-restructuring and anyone claiming to know the stable endpoint is extrapolating from incomplete evidence.

What I can say is that the engineers thriving right now are the ones who have shifted their leverage point. They are writing fewer lines and making more consequential decisions per day. They are treating AI agents as junior engineers who need clear requirements, good test coverage to catch regressions, and explicit feedback loops — not as autocomplete on steroids.

The engineers struggling are the ones who experienced Copilot as the destination and Cursor as the upgrade, and don't understand why they feel like they're falling behind despite using good tools. The tools are good. But they were optimized for a paradigm that is being superseded.
Five years ago, the question was whether autocomplete was cheating. Today, the question is what judgment, taste, and systems thinking look like when implementation is nearly free. That is a much better question to be asking. The industry took five years to get here. I don't think it'll take five more to find the answer.

---

### Sources

- [GitHub Copilot General Availability](https://github.blog/news-insights/product-news/github-copilot-is-generally-available-to-all-developers/) — GitHub Blog, June 2022
- [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770) — Princeton NLP, October 2023
- [Introducing Devin](https://cognition.ai/blog/introducing-devin) — Cognition AI, March 2024
- [Stanford AI Index Report 2026](https://aiindex.stanford.edu/report/) — Stanford HAI
- Google Cloud Next 2026 — AI-generated code keynote announcement
- Anthropic Claude Code ARR reporting, 2025–2026

---

# 200,000 MCP Servers Have a Command Injection Problem Nobody Told You About

URL: https://sdd.sh/2026/05/mcp-stdio-security-200k-servers-exposed/
Date: 2026-05-17
Updated: 2026-05-17
Tags: mcp, security, stdio, command-injection, agentic-workflows, claude-code
Categories: AI Tools
Summary: An Ox Security audit published in May 2026 found that STDIO transport — used by over 200,000 MCP servers — has no execution boundary and no input sanitization, leaving it wide open to command injection via malicious tool responses. Separately, 7,000+ MCP servers are running on public IPs with zero authentication. This is the third distinct MCP security crisis in 2026, and the most fundamental one yet.

Two hundred thousand. That's the floor estimate for how many MCP servers are running the STDIO transport — the original, simplest, fastest way to wire an AI agent to a local tool.
Ox Security published an audit in May 2026 that lands an uncomfortable punch: STDIO has no execution boundary, no input sanitization, and no meaningful defense against a malicious server response injecting arbitrary shell commands through the pipe. The fix exists. Most operators don't know they need it. This isn't a panic post. The sky isn't falling on MCP as a protocol. But the ecosystem grew from zero to 97 million downloads in 18 months, and security was the tax deferred. That bill is coming due in quarterly installments. ## What STDIO Transport Is and Why It's Vulnerable STDIO (standard input/output) is how MCP servers were originally designed to work: the host process launches the server as a child process and communicates by writing JSON to stdin and reading responses from stdout. No network stack. No ports. No HTTP overhead. For local tools — file readers, shell executors, database clients — it's elegant and fast. The problem is that STDIO was designed for a trust model that no longer holds. The original use case assumed the server was a trusted local binary. When an agent connects to a third-party MCP server, retrieves tool definitions, and starts executing those tools, the server's responses flow back through the same pipe — and STDIO has no concept of "this response is trying to do something it shouldn't." A malicious tool response can embed shell metacharacters, newline injections, or control sequences that, when processed by the host's STDIO handler, execute arbitrary commands in the context of whatever user launched the agent. There's no sandbox. There's no escaping layer. The pipe is just a pipe. The Ox Security audit identified two compounding failures: the command injection vector itself, and a separate finding that over 7,000 MCP servers are running on public IP addresses with no authentication layer at all. Those aren't local tools. 
They're internet-exposed services — some of them apparently production deployments — with the attack surface of a 1990s telnet daemon. ## Three MCP Security Crises, One Pattern 2026 has handed the MCP ecosystem three distinct security crises. Laid out in chronological order, the pattern is hard to ignore. **April 2 — OAuth mix-up attacks on HTTP transport.** The MCP Dev Summit NYC surfaced findings that 43% of MCP servers using HTTP transport had OAuth implementation flaws. Mix-up attacks — where a malicious authorization server tricks a client into sending tokens to the wrong endpoint — were demonstrated as practical exploits against real deployments. The summit framed this as the dominant unsolved problem in the ecosystem. **April 7 — CVE-2026-21852: CLAUDE.md supply-chain poisoning.** A critical vulnerability in Claude Code's config parser allowed a malicious `CLAUDE.md` file to bypass deny rules and execute arbitrary commands. Attackers could embed a payload just past the parser's invisible 50-subcommand cap. The bug was patched, but [as covered here at the time](/posts/claude-code-cve-2026-claudemd-supply-chain-attack/), the underlying attack surface — agentic tools trusting project configs from untrusted repos — is structural. **May 2026 — STDIO command injection at 200K-server scale.** This one. Each crisis hit a different layer of the stack: the client config layer, the HTTP authentication layer, and now the core transport layer. If that feels like an escalating audit of the entire protocol surface, that's because it is. Researchers are working their way down from application-level misconfigurations toward protocol fundamentals. STDIO is as fundamental as it gets. ## What the Ox Security Audit Actually Found The audit's core finding is a category of vulnerability, not a single CVE. Any MCP server running STDIO transport that processes tool responses without strict output sanitization is potentially injectable. 
The specific mechanics involve:

- **No execution boundary**: STDIO doesn't distinguish between "this is data" and "this is a command." The host process reads the pipe and acts on it. A response that embeds shell control characters, environment variable expansions, or newline-delimited subcommands can escape the expected data context.
- **No input sanitization in the spec**: The MCP specification does not mandate sanitization of tool responses before they reach the execution layer. Individual implementations may add it. Many don't.
- **Scale through defaults**: STDIO is the default transport in the majority of MCP tutorials, starter templates, and quickstart guides. Developers reaching for their first MCP integration almost always land on STDIO. That's why the number is 200,000 and not 2,000.

The 7,000 internet-facing servers with no authentication are a separate but related failure. These appear to be teams that stood up an MCP server for production use and either didn't implement authentication or actively disabled it because it was friction. An unauthenticated STDIO server exposed on a public IP is exactly as dangerous as it sounds.

## Who's Actually at Risk

Let's be precise about threat models, because "200K servers" sounds maximally alarming and the reality is more segmented.

**Enterprise teams using third-party MCP servers are the primary risk.** If your engineering team connects Claude Code or another agentic tool to an MCP server you didn't write — a vendor integration, an open-source community server, a marketplace-listed tool — that server's responses flow through STDIO. You're trusting the server operator's sanitization. You probably shouldn't.

**Developers running public-facing MCP servers** — especially the 7,000 on public IPs — are directly exploitable without even needing to compromise a tool response. A network-level attacker who can send arbitrary data to those servers doesn't need injection tricks.
**Casual local users** running a small set of personally maintained STDIO servers against their own toolchain are at lower risk. If you wrote the server, you know what it returns. The injection vector is still theoretically present, but the practical attack path requires you to have already compromised yourself.

Claude Code's own MCP implementation is not the vulnerability here. The attack surface is the third-party MCP servers you connect to. Claude Code is the client; if a connected server is malicious or compromised, STDIO gives that server a lever into your execution environment.

## What to Do Right Now

**1. Audit your MCP server inventory.** List every MCP server your agents connect to. For each one: who operates it? Do you trust their sanitization practices? Is it running STDIO or HTTP+SSE?

**2. Prefer HTTP+SSE transport for production.** The HTTP transport with Server-Sent Events moves the response channel out of the STDIO pipe and into a structured HTTP response layer where you can apply standard web security controls. It has its own auth problems — see the April summit findings — but command injection via response data is not one of them.

**3. Add strict input validation at the client layer.** If you're running STDIO servers you can't immediately migrate, validate and escape tool responses before they touch anything that executes code. Treat every server response like untrusted user input, because that's exactly what it is.

**4. Firewall your STDIO servers.** If an MCP server has no business being on the network, block it at the firewall level. STDIO was designed for localhost. Anything listening on a public IP without authentication is a misconfiguration, not a deployment.

**5. Watch the MCP roadmap for the spec-level fix.** The [2026 MCP roadmap](https://blog.modelcontextprotocol.io) includes security hardening as a stated priority. Anthropic and the Agentic AI Foundation are aware of these findings.
Spec-level sanitization requirements and transport security guidelines are in progress. Stay close to SDK updates over the next 60 days. ## The Cost of Velocity MCP reached 97 million downloads in roughly 18 months. That's not just fast — it's "internet infrastructure-level growth while still in RFC status" fast. The HTTP OAuth vulnerability, the supply-chain config poisoning, and now STDIO command injection are three different facets of the same root cause: the ecosystem moved at product velocity while the security model was still being designed. That's not a reason to stop using MCP. The protocol solves a real problem — standardized tool integration for AI agents — and it solves it well enough that OpenAI adopted it, the Linux Foundation governs it, and enterprises are betting production workloads on it. The architecture is sound. The implementation surface is still being hardened. But if you're a CTO or a tech lead making decisions right now about which MCP servers your agents connect to, treat third-party STDIO servers with the same scrutiny you'd apply to a third-party binary you're running as root. Because at the moment, the trust model is approximately that loose. The fix is not complex. The audit trail is public. The time to act is before your agent does something you didn't ask it to. 
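The advice in step 3 above (validate tool responses and treat them as untrusted input) can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not code from the Ox Security audit or from any MCP SDK: the helper names `validate_tool_text` and `run_with_untrusted_arg`, the size cap, and the control-character policy are all hypothetical choices made for illustration. It shows two ideas: reject response text that carries control sequences, and pass response text to a child process as a single argument instead of interpolating it into a shell string.

```python
import re
import subprocess
import sys
from typing import Sequence

# Assumption: tab and newline are acceptable in tool output; every other
# ASCII control character (including ESC and carriage return) is rejected.
_DISALLOWED = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")

# Assumption: an arbitrary size cap chosen for this sketch.
MAX_RESPONSE_CHARS = 64_000


def validate_tool_text(text: str) -> str:
    """Return the text unchanged if it passes the checks, else raise."""
    if not isinstance(text, str):
        raise TypeError("tool response must be a string")
    if len(text) > MAX_RESPONSE_CHARS:
        raise ValueError("tool response exceeds the size cap")
    if _DISALLOWED.search(text):
        raise ValueError("tool response contains control characters")
    return text


def run_with_untrusted_arg(
    argv_prefix: Sequence[str], untrusted: str
) -> subprocess.CompletedProcess:
    """Run a command with the untrusted text as one literal argument.

    The argument list is handed to the OS directly (no shell), so
    metacharacters such as `$(...)`, `;`, or backticks are not interpreted.
    """
    safe = validate_tool_text(untrusted)
    return subprocess.run(
        [*argv_prefix, safe],
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Echo the argument back through a child Python process; the text comes
    # back as literal data rather than being executed.
    echo = [sys.executable, "-c", "import sys; print(sys.argv[-1])"]
    result = run_with_untrusted_arg(echo, "$(echo pwned); ls")
    print(result.stdout.strip())
```

Validation alone is not a sandbox. The more important design choice here is structural: response text only ever reaches a process as data in an argument list, so there is no shell to interpret it.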
--- **Sources** - Ox Security MCP STDIO vulnerability audit, May 2026 — [VentureBeat coverage](https://venturebeat.com) - [MCP 2026 roadmap — blog.modelcontextprotocol.io](https://blog.modelcontextprotocol.io) - The New Stack on MCP production readiness - [CVE-2026-21852: The CLAUDE.md Supply-Chain Attack](/posts/claude-code-cve-2026-claudemd-supply-chain-attack/) — prior coverage on this blog - [MCP Dev Summit NYC 2026: Authentication Is the Crisis](/posts/mcp-dev-summit-nyc-2026-auth-scale-openai/) — prior coverage on this blog --- # ServiceNow Build Agent Goes Everywhere: Enterprise MCP Governance for Every AI Coding Tool URL: https://sdd.sh/2026/05/servicenow-build-agent-ga-mcp-governance-enterprise/ Date: 2026-05-16 Updated: 2026-05-16 Tags: servicenow, mcp, enterprise, claude-code, cursor, windsurf, governance, agentic-workflows Categories: AI Tools, Industry Summary: ServiceNow made Build Agent generally available at Knowledge 2026, extending its core skills into Claude Code, Cursor, Windsurf, GitHub Copilot, OpenAI Codex, and Antigravity via MCP — with enterprise governance, OAuth, audit trails, and a real-time AI Gateway baked in by default. It's the model for how enterprise platforms will integrate with the agentic coding ecosystem. At Knowledge 2026, ServiceNow made a decision that matters more than any single product announcement: instead of building a proprietary coding agent and asking developers to switch, they shipped into the tools developers already use. Build Agent is now generally available in ServiceNow Studio, Cursor, Windsurf, Claude Code, GitHub Copilot, OpenAI Codex, and Antigravity — the full roster of mainstream AI coding tools — with the same enterprise governance applied regardless of where the code gets written. It's a governance layer on top of the agentic coding ecosystem, not a replacement for it. 
--- ## What Build Agent Actually Does Build Agent started as an AI assistant for building ServiceNow applications inside ServiceNow Studio. Its job was to accelerate development of scoped apps that run on the Now Platform — helping developers scaffold data models, workflows, integrations, and UI components without writing every line from scratch. The General Availability announcement at Knowledge 2026 is not a rebrand. It's an expansion: Build Agent's core skills now work as an MCP server that any compatible coding tool can invoke. When a developer working in Claude Code or Cursor needs ServiceNow context — platform APIs, data schema, security roles, workflow models — the Build Agent MCP server provides it directly, without switching to ServiceNow Studio. The workflow looks like this: you write code in your preferred tool, Build Agent provides ServiceNow-aware context and validation, and when you're ready to ship, you export to ServiceNow Studio like any other scoped app. Governance, security roles, and data model enforcement happen at export time — applied by the platform, not the developer's discipline. This is the right design. It accepts that developers will use their preferred tools and puts governance at the platform boundary rather than at the tool boundary. --- ## The MCP Server: Included by Default The ServiceNow MCP Server is **generally available and included in every Now Assist and AI Native SKU** — no separate license required. For organizations already running ServiceNow in production, this is a meaningful shift: the MCP integration arrives in the next contract renewal, not as a separate line item. 
The MCP Server Console provides enterprise controls that matter to the buyers making these decisions:

- **AICT governance**: AI Control Tower integration for centralized agent observability
- **Consumption metering**: per-request tracking of what every agent is consuming from the platform
- **Managed OAuth**: enterprise-grade authorization without each developer managing their own credentials
- **Audit trails**: complete logs of which agent made which platform call, when, and from which tool
- **Session management**: agent session lifecycle controls that match how enterprises think about access
- **Role-based tool packages**: different tool sets for different developer roles, controlled by platform administrators

For comparison: most MCP server deployments in production today are developer-managed, with minimal observability and ad-hoc access control. The ServiceNow MCP Server Console is the first production-grade, enterprise-class control plane for MCP I've seen from a major platform vendor.

---

## Action Fabric and the AI Gateway

Beyond Build Agent, ServiceNow announced **Action Fabric** — a governed access layer that lets AI agents invoke ServiceNow's full system of action directly, without a human opening a browser or running a workflow manually.

The practical meaning: when Claude Code or a Managed Agent needs to create a ticket, update a CMDB record, trigger an approval workflow, or escalate an incident, Action Fabric provides a headless API surface with ServiceNow's full governance stack applied. Agents get the same access a human ServiceNow administrator would have, with the same audit trail and the same role-based constraints.

The **AI Gateway** is the runtime control layer on top of this. It provides real-time controls for agentic workloads — rate limiting, policy enforcement, circuit breakers for runaway agents — along with observability and security for traffic flowing across any third-party AI system.
This is how an enterprise IT team monitors what 200 developers' coding agents are doing to the production platform at 2 AM. Build Agent also connects outward as an MCP Client, pulling context from external tools: design specs from Figma, requirements from Miro, code context from GitHub. The same governance that applies to outbound ServiceNow calls applies to these inbound integrations — everything flows through the AI Gateway. --- ## Why This Architecture Wins Enterprise The conventional enterprise software playbook for AI is to build a first-party AI assistant and ask developers to use it exclusively. ServiceNow could have done that. They chose not to, and the reasons are instructive. Developer tool preferences are high-stakes and sticky. Telling a team of engineers who have spent months building Claude Code workflows, CLAUDE.md configs, and MCP integrations that they need to switch to a ServiceNow-specific coding interface is a losing argument. It doesn't matter how good the ServiceNow interface is — the switching cost is real and the resentment is reliable. The alternative — embed your governance into the tools developers already use — solves the adoption problem by eliminating it. Build Agent doesn't compete with Claude Code. It extends it. This is also a governance story, not a capability story. The ServiceNow MCP Server doesn't make Claude Code smarter. It makes Claude Code's interactions with the ServiceNow platform auditable, compliant, and centrally observable. That's what enterprise IT buyers actually need to approve a deployment. Capability is table stakes; compliance is the procurement blocker. The MCP standard is what makes all of this possible. By building against a standardized protocol rather than tool-specific integrations, ServiceNow's governance layer works across Claude Code, Cursor, Windsurf, Copilot, and Codex simultaneously. New tools that implement MCP inherit the integration automatically. 
This is the MCP ecosystem flywheel working as intended: tool vendors invest in MCP compliance, platform vendors invest in MCP servers, and developers get governed access to enterprise systems from their preferred environment. Nobody wins by building a private integration ecosystem anymore. --- ## Anthropic Models Inside ServiceNow One detail from the announcement worth noting: Build Agent on the ServiceNow AI Platform is now powered by Anthropic models. The specific benefit cited is longer context sessions — developers can work through entire application builds without losing continuity. This is the enterprise distribution story Anthropic has been building toward. Claude doesn't have to be the interface that developers see; it can be the reasoning engine inside platforms they already use. ServiceNow joining that list (alongside Amazon Bedrock, Google Vertex AI, Azure AI Foundry) reinforces the pattern: Anthropic sells capability, partners sell workflow integration. For Claude Code users, the practical implication is coherence. When you use Build Agent skills from within Claude Code, the underlying model driving ServiceNow's guidance and Claude Code's autonomous execution is coming from the same lab. That's alignment in the literal sense — the models share the same training lineage and capability profile, which reduces the kind of instruction drift that happens when heterogeneous AI systems try to collaborate. --- ## The Governance Gap in Today's MCP Deployments Most teams using MCP in production today are running it without any of the controls the ServiceNow MCP Server Console provides. MCP servers are typically developer-deployed, with access granted by API key, no consumption metering, no centralized audit trail, and no role-based access control. This works fine for small teams. It does not work fine for a 50,000-person enterprise where AI agents are making calls to production CRM, ITSM, and financial systems on behalf of 2,000 developers. 
The audit question — *which agent called this endpoint, when, and with whose authorization?* — is currently unanswerable in most MCP deployments. ServiceNow has answered it. The AI Gateway and MCP Server Console are the first enterprise-grade answer I've seen to the MCP governance gap. If this architecture gets replicated by Salesforce, SAP, and Workday — which it should — it will become the standard pattern for how enterprise platforms integrate with the agentic coding ecosystem.

---

## What to Watch

The MCP Server is GA and in production. Additional AI Gateway features are planned for H2 2026. Devin integration for Windsurf is also planned for H2 2026, which would mean ServiceNow Build Agent running through a Windsurf + Devin autonomous session — governed by the AI Gateway, audited by the MCP Server Console. That's a plausible production architecture by end of year.

The market question is whether other enterprise platforms move on this timeline or wait for the pattern to mature. Given that ServiceNow is the first major platform vendor to ship enterprise-grade MCP governance, they have a meaningful window to define what enterprise MCP integration looks like before the standard gets set by committee.
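The audit question raised earlier (which agent called which endpoint, when, and with whose authorization) maps naturally onto a small record type. The Python sketch below is purely illustrative and is not ServiceNow's schema or API: the `AuditRecord` and `AuditLog` names, every field name, and the in-memory storage are assumptions. It only shows the minimum data a deployment would need to capture for that question to be answerable.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditRecord:
    """One platform call made by an agent. All field names are hypothetical."""

    agent_id: str        # which agent made the call
    tool: str            # which coding tool hosted the agent
    endpoint: str        # which platform endpoint was called
    authorized_by: str   # whose authorization the call ran under
    called_at: datetime  # when the call happened (UTC)


class AuditLog:
    """Append-only, in-memory log; a real deployment would persist this."""

    def __init__(self) -> None:
        self._records: List[AuditRecord] = []

    def record(
        self, agent_id: str, tool: str, endpoint: str, authorized_by: str
    ) -> AuditRecord:
        """Store one call with a UTC timestamp and return the stored entry."""
        entry = AuditRecord(
            agent_id=agent_id,
            tool=tool,
            endpoint=endpoint,
            authorized_by=authorized_by,
            called_at=datetime.now(timezone.utc),
        )
        self._records.append(entry)
        return entry

    def calls_to(self, endpoint: str) -> List[AuditRecord]:
        """Answer the audit question for a single endpoint."""
        return [r for r in self._records if r.endpoint == endpoint]
```

With records like these, `log.calls_to("/api/incident")` (a made-up endpoint) returns every agent, tool, authorizer, and timestamp associated with that endpoint. An API-key-only deployment cannot reconstruct this, because the key identifies neither the agent nor the person who authorized it.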
--- **Sources:** - [ServiceNow Build Agent now works inside every major AI coding tool, governed by default — Business Wire](https://www.businesswire.com/news/home/20260506008934/en/ServiceNow-Build-Agent-now-works-inside-every-major-AI-coding-tool-governed-by-default) - [ServiceNow opens its full system of action to every AI Agent in the enterprise — ServiceNow Newsroom](https://newsroom.servicenow.com/press-releases/details/2026/ServiceNow-opens-its-full-system-of-action-to-every-AI-Agent-in-the-enterprise/default.aspx) - [ServiceNow Knowledge 2026 — AI Control Tower expands, Autonomous Workforce reaches every function — Diginomica](https://diginomica.com/servicenow-knowledge-2026-ai-control-tower-expands-autonomous-workforce-reaches-every-function-and) - [ServiceNow AI Governance Push: Knowledge 2026 — CX Today](https://www.cxtoday.com/security-privacy-compliance/servicenow-ai-agent-governance-knowledge-2026/) - [ServiceNow Wants to Be the Control Layer for Every AI Agent in the Enterprise — Reworked](https://www.reworked.co/digital-workplace/servicenow-launches-action-fabric-major-overhaul-of-ai-control-tower/) - [Building ServiceNow apps via Claude Code and the ServiceNow SDK — ServiceNow Community](https://www.servicenow.com/community/developer-advocate-blog/building-servicenow-apps-via-claude-code-and-the-servicenow-sdk/ba-p/3525677) - [ServiceNow MCP Integration with Claude Code — Composio](https://composio.dev/toolkits/servicenow/framework/claude-code) --- # Grok Build: xAI's First Coding Agent Has Eight Parallel Agents, a Privacy-First Architecture, and One Major Problem URL: https://sdd.sh/2026/05/grok-build-xai-coding-agent-arena-mode/ Date: 2026-05-16 Updated: 2026-05-16 Tags: xai, grok, coding-agents, terminal-native, benchmarks, claude-code Categories: AI Tools, Industry Summary: xAI launched Grok Build on May 14 — a terminal-based coding agent with 8 parallel sub-agents, Arena Mode automated evaluation, and a local-first privacy model that sends 
zero codebase data to xAI servers. It scores 70.8% on SWE-bench Verified at $0.20/M tokens. Here's what it gets right, what's missing, and how it stacks up against Claude Code. xAI launched Grok Build on May 14 — Elon Musk's first serious move into the AI coding agent market that Anthropic, OpenAI, and Google have been fighting over for the past year. It's an early beta, exclusive to SuperGrok Heavy subscribers, and it has a genuinely interesting architecture. It also has a benchmark score that's 17 points behind the current leader, and its most-hyped feature isn't live yet. Here's what Grok Build actually is, what it gets right, and why Claude Code users shouldn't be canceling their subscriptions this week. --- ## What Grok Build Is Grok Build is a CLI-based coding agent — not an IDE plugin, not a VS Code fork. You invoke it from your terminal, describe a task, and it runs. In that sense, xAI is making the same architectural bet Anthropic made: the terminal, not the editor, is the right home for a serious coding agent. The underlying model is **grok-code-fast-1**, with a 256,000-token context window and API pricing of $0.20 per million input tokens and $1.50 per million output tokens. The model scores **70.8% on SWE-bench Verified** — meaningful, but not frontier-level. Claude Code, running on Opus 4.7, sits at 87.6% on SWE-bench Verified and 64.3% on the harder, contamination-resistant SWE-bench Pro. GPT-5.5 Spud is at 82.7% on Terminal-Bench 2.0. Grok Build enters a competitive benchmark field where it's not yet the leader on any metric. **Pricing:** SuperGrok Heavy at $300/month, with an introductory deal at $99/month for the first six months. API access at $0.20/$1.50 per million tokens is genuinely competitive — Anthropic's Opus 4.7 runs at $5/$25. --- ## The Architecture: Eight Agents, One Arena The core selling point of Grok Build is multi-agent parallelism. When you hand it a task, it doesn't run a single chain-of-thought loop. 
It spawns up to **eight concurrent sub-agents**, each specialized across a three-stage workflow: plan, search, and build. This is closer to Windsurf's parallel-agent model than to Claude Code's single-agent-goes-deep approach. Complex tasks get subdivided and attacked simultaneously. The theoretical benefit is wall-clock time: eight agents working a problem in parallel can compress multi-step tasks that would otherwise be sequential. **Arena Mode** is the feature that's generated the most discussion. The concept: every agent response appears side by side, automatically scored and ranked before a developer ever reviews it. You'd get an automated evaluation layer that selects the best output from the parallel runs before it ever reaches your screen. It was confirmed in code traces as far back as February 2026. It's not live in the early beta. Arena Mode is a genuinely clever idea — it shifts output selection from human judgment to algorithmic scoring, which is the right direction. But until it ships, Grok Build's parallel architecture is producing eight outputs you still evaluate manually. That's a different workflow than Windsurf's side-by-side visual comparison, and it's more cognitive load than Claude Code's single-agent model, not less. --- ## Local-First Privacy: The Differentiator That Actually Works The architecture detail that deserves the most attention is privacy. **Grok Build is local-first: no code is transmitted to xAI's servers during a session.** Computation happens on your machine. The tool is air-gap compatible once initial setup is complete. This is a serious differentiator for regulated industries. Anthropic has answered this problem with Claude Code on Bedrock's Mantle backend (zero operator access, NitroTPM attestation), but that's an enterprise SKU that requires AWS infrastructure and setup. Grok Build's local-first model is a single-machine answer to the same problem — no cloud configuration required. 
For individual developers working on proprietary codebases who don't have an enterprise Anthropic contract, local-first matters. xAI has correctly identified that "where does my code go?" is a real blocker for a meaningful portion of the addressable market. **Plan Mode** reinforces this philosophy. Before Grok Build modifies a single file, it presents the complete execution plan — including which files it intends to change, what it will do to each, and why. You can review it, comment on individual steps, rewrite parts of it, or kill it entirely. The plan is editable; the execution doesn't start until you approve. Claude Code users in Auto mode will recognize the tradeoff: Plan Mode adds a review gate that slows autonomous execution in exchange for control. It's the right choice for an early beta where trust hasn't been established. The question is whether that gate is still mandatory six months from now. --- ## What's Missing Honest accounting of what Grok Build doesn't have yet: **No MCP ecosystem.** Claude Code's 6,400+ MCP servers represent two years of community and enterprise tool integration — ServiceNow, Figma, Jira, Salesforce, GitHub. Grok Build has no equivalent ecosystem. A coding agent without tool integration is a code-writing loop, not a full development workflow. **No CLAUDE.md equivalent.** Anthropic's project instruction system lets teams encode invariants, style guides, architecture rules, and agent behavior constraints into a file that every Claude Code session reads. It's how organizations scale consistent AI behavior across hundreds of engineers. Grok Build has no documented equivalent mechanism. **No scheduling or cloud execution.** Claude Code Routines run on Anthropic's infrastructure — cron triggers, API webhooks, GitHub event triggers — without your machine being online. Grok Build requires an active session. **No enterprise governance layer.** No Analytics API, no per-user spend controls, no SCIM integration, no OpenTelemetry export. 
For teams buying agentic coding tools at the enterprise level, these aren't nice-to-haves.

**Arena Mode isn't live.** The headline feature is coming. Features in early beta often arrive on schedule; they also sometimes don't.

---

## The Real Competitive Picture

Grok Build is positioned as a direct Claude Code competitor, and xAI has made the right architectural call by going terminal-native rather than IDE-embedded. But the current beta reveals a tool that's compelling on privacy and interesting on parallelism, while trailing significantly on benchmark performance and ecosystem depth.

70.8% on SWE-bench Verified is not a bad score. Three months ago it would have been near the top of the leaderboard. Today, the frontier has moved. Claude Opus 4.7 at 87.6% on SWE-bench Verified, GPT-5.5 at 82.7% on Terminal-Bench 2.0, and Kimi K2.6 at 58.6% on SWE-bench Pro are the relevant comparisons. Grok Build's model enters a field where it needs significant improvement to be the benchmark leader, and benchmark leadership is how developers justify switching costs.

The pricing math is also complicated. The $99/month introductory rate is reasonable for a SuperGrok Heavy bundle. The $300/month steady-state price is at the ceiling of what individual developers pay. Claude Code Max 20x costs $200/month with a substantially larger ecosystem, higher benchmark scores, and years of production hardening. Grok Build needs to close the capability gap before the introductory pricing window closes.

---

## What to Watch

xAI is not a research lab dabbling in developer tools. They have real infrastructure, real compute, and a clear commercial incentive to make Grok Build competitive. Arena Mode could be a genuine workflow innovation when it ships. The local-first privacy model is architecturally sound and serves a real market segment.

But early betas are judged by what they ship, not by promises. Right now, Grok Build is a compelling idea with benchmark numbers that need to improve and a flagship feature that's coming soon.
That's not unusual for a first release. It's a reason to put it on the watchlist, not the primary workflow. Check back when Arena Mode ships and grok-code-fast-2 benchmarks. Those two data points will tell you whether xAI is serious about catching the frontier. --- **Sources:** - [xAI Enters the Coding Agent Race With Grok Build — DevOps.com](https://devops.com/xai-enters-the-coding-agent-race-with-grok-build/) - [Grok Build Early Beta: 6 Ways xAI's New AI Coding Agent Plans to Take on Claude Code — Techloy](https://www.techloy.com/grok-build-early-beta-6-ways-xais-new-ai-coding-agent-plans-to-take-on-claude-code/) - [xAI Grok Build: Multi-Agent Arena Mode Redefines AI Coding — AI2Work](https://ai2.work/blog/xai-grok-build-multi-agent-arena-mode-redefines-ai-coding/) - [xAI Unveils Grok Build: An Agentic AI Coding Tool to Take on OpenAI, Google & Anthropic — AndroidHeadlines](https://www.androidheadlines.com/2026/05/xai-grok-build-agentic-ai-coding-tool-launch-beta.html) - [Grok Build: xAI's Agentic Coding CLI Takes On Claude Code — Pasquale Pillitteri](https://pasqualepillitteri.it/en/news/2584/grok-build-xai-cli-2026) - [Grok Build CLI: xAI's Answer to Claude Code — Beginners in AI](https://beginnersinai.org/grok-build-cli/) --- # AI is Finding 20-Year-Old Bugs Everywhere. Your Stack Is Next. URL: https://sdd.sh/2026/05/ai-cve-surge-open-source-2026/ Date: 2026-05-16 Updated: 2026-05-16 Tags: security, CVE, open-source, postgresql, AI, vulnerability-discovery, linux-kernel, spring Categories: Industry, AI Tools Summary: PostgreSQL fixed 11 CVEs in its May 2026 release — unusually high for a project that typically ships 1–4 per quarter. Spring went from 17 CVEs in all of 2025 to 30 in two months. Chrome is up 563% year-to-date. This isn't a code quality crisis. It's AI-assisted vulnerability discovery, and it's systematically sweeping every major open-source project. Last week, PostgreSQL shipped versions 18.2, 17.8, 16.12, 15.16, and 14.21. 
Eleven security vulnerabilities fixed in a single quarterly release. For context: PostgreSQL typically ships one to four CVEs per release. The project has a 30-year track record of quiet, disciplined engineering. Eleven is not normal. But it's not an anomaly either. It's the new baseline.

## The Numbers Across the Stack

PostgreSQL is one data point in what NIST now confirms is a structural shift. CVE submissions in Q1 2026 were **33% higher** than Q1 2025 — and 2025 was already a record year. NIST enriched nearly 42,000 CVEs in 2025, more than any prior year, and still could not keep pace with submissions.

The per-project numbers are harder to ignore:

| Project | Change (YTD 2026) |
|---|---|
| Chrome | +563% |
| GitHub | +476% |
| Apache | +170% |
| Mozilla | +157% |
| Spring Framework | 17 CVEs in all of 2025 → 30 in 2 months of 2026 |
| Linux kernel | 3 local-root privilege escalation CVEs in the same code area, weeks apart |

Spring Security released emergency patches on April 21, 2026 fixing multiple CVEs, including an infinite recursion OOM in Spring Cloud Function and a filter-expression injection in Spring AI. The Linux kernel disclosed *Copy Fail* (CVE-2026-31431), then *Dirty Frag* (CVE-2026-43284 / CVE-2026-43500), then *Fragnesia* (CVE-2026-46300) — three separate local-privilege-escalation vulnerabilities in related kernel code, each allowing any unprivileged user to reach root via a public proof-of-concept, each disclosed within weeks of the last.

This is not the fingerprint of a sudden regression in code quality. These projects haven't gotten worse. The tooling for finding what was already broken has gotten dramatically better.

## AI Found the Bugs. AI Is Also Looking for Them on the Other Side.

The proximate cause is well-documented at this point.
CSO Online reported in early 2026 that AI tooling had uncovered 20-year-old bugs in PostgreSQL and MariaDB, latent vulnerabilities that had been sitting in plain sight through dozens of human security audits. In April 2026, Anthropic disclosed that Claude Mythos Preview had identified thousands of zero-day vulnerabilities across major operating systems and browsers.

The economics have inverted. A skilled security researcher running manual analysis might audit one component of one project in a week. An AI model can sweep an entire codebase in minutes, flag plausible vulnerability patterns across every execution path, and do it again tomorrow after the next commit. Every major open-source project is now subject to continuous, automated re-examination at a scale that would have required a large, dedicated red team a year ago.

The bugs being found are real. These aren't false positives: the PostgreSQL CVEs carry CVSS scores of 8.2 to 8.8. The pgcrypto heap buffer overflow (CVE-2026-2005), the intarray arbitrary code execution (CVE-2026-2004), the pg_trgm heap overflow (CVE-2026-2007): all high-severity, all in extensions that have been shipped and trusted for years.

The uncomfortable flip side: the same AI capability that finds these bugs can be used to weaponize them. Barracuda Networks' May 2026 threat report documents a measurable collapse in the time between CVE disclosure and functional exploit availability. The exploit window, historically measured in weeks, is now measured in hours for well-documented vulnerabilities. AI doesn't just find the bug; it can write the PoC faster than the patch reaches most production systems.

## The Triage Crisis Nobody Planned For

Here is the operational problem that doesn't make headlines: the humans responsible for validating and fixing these vulnerabilities were not resourced for this volume.

Most major open-source projects are maintained by small teams, often partially or entirely volunteers.
PostgreSQL, Spring, and the Linux kernel are better-resourced than most, but even they are absorbing a materially higher triage load with the same team sizes. For the thousands of smaller open-source projects that underpin the modern stack, the math is worse.

A CVE report is not a fix. It's a claim that requires validation: Is this actually exploitable? Under what conditions? Does the proposed patch address root cause or just the reported surface? The cost of generating a vulnerability report with AI has dropped to near-zero. The cost of verifying one has not changed.

Security teams downstream are experiencing this as an advisory flood. Two-thirds of security teams in ProjectDiscovery's 2026 AI Coding Impact Report are already spending more than half their time manually triaging AI-generated findings rather than remediating them. That was before the CVE surge reached its current rate.

## What This Means If You Run Production Software

The practical implications are not subtle.

**Your patching cadence is now wrong.** If you're on quarterly patch cycles, you are structurally behind. PostgreSQL shipped fixes for 11 CVEs with CVSS scores up to 8.8. Linux had a local-root exploit with a public PoC. Both in May 2026. If you patched in March and your next window is June, you have a gap.

**Extensions and embedded dependencies are the attack surface.** The PostgreSQL CVEs weren't in the core engine; they were in pgcrypto, intarray, and pg_trgm. The Spring CVEs included Spring AI and Spring Cloud Function. AI vulnerability discovery is thorough: it doesn't skip the extension ecosystem the way human auditors sometimes do. Your threat surface is larger than your primary dependency list.

**AI-generated code is being scanned by the same tools.** If 51% of GitHub commits in 2026 are AI-assisted, and AI models generate code that contains OWASP top-10 vulnerabilities at a high base rate, then the CVE surge isn't only about old bugs in legacy code.
It's also about new bugs in recently shipped AI-generated features. Both populations are being scanned simultaneously.

**The time between disclosure and exploit is now too short for slow response.** When a public PoC for a local-root Linux vulnerability is available within hours of CVE publication, the margin for "we'll patch it in the next maintenance window" is gone. Automated patching infrastructure (KernelCare, live patch pipelines, dependency bots) stops being a nice-to-have and becomes a baseline requirement.

## The Correct Response Is Not Panic

None of this argues for slowing down your stack or auditing it into paralysis. The bugs being found are real, but most of them are also patchable. The CVE surge is, in a meaningful sense, good news: these vulnerabilities existed before AI started finding them. The only thing that changed is that we now know about them.

The practical response is architectural:

**Treat your dependency update pipeline as infrastructure, not maintenance.** Renovate, Dependabot, and automated patch PRs should be running continuously and merging on green CI. A project with a working automated update pipeline will absorb the CVE surge without additional human load. A project that patches manually on a quarterly schedule will not.

**Scope your exposure by extension and plugin.** The PostgreSQL and Spring CVEs were concentrated in optional extensions that not everyone uses. Before a patch is available, the fastest risk reduction is confirming whether the vulnerable component is actually deployed in your environment. If you don't use pgcrypto, intarray, or pg_trgm, disable or remove them.

**Build agentic security review into the generation loop.** If AI is generating a meaningful fraction of your code, the same AI capability that finds old vulnerabilities can review new code. A Claude Code pre-commit hook running security-focused static analysis isn't a future aspiration; it's a deployable pattern today.
AI-generated code with an AI security reviewer in the loop produces fewer vulnerabilities than human-reviewed AI code, because the reviewer doesn't fatigue.

**Monitor disclosure feeds, not just release notes.** The time between CVE publication and patch availability can be hours for some projects. If your threat intelligence is "wait for the vendor release email," you're reading about exploits after the fact. NIST NVD, VulnCheck, and OpenCVE all offer real-time feeds that can be piped into automated triage workflows.

## The Broader Shift

The CVE surge is the security industry's version of the broader AI acceleration pattern: AI is increasing the rate at which consequential things happen, in both directions. Code gets written faster. Bugs get found faster. Exploits get developed faster. Patches need to ship faster.

The organizations that will absorb this well are the ones that have already automated the low-value, high-frequency work: dependency updates, basic security scanning, patch deployment. The ones that will struggle are the ones whose security posture still depends on human reviewers moving at human speed against a threat surface that is now being probed at machine speed.

Your stack is being scanned right now. Whether the results show up in a responsible disclosure report or in an attacker's toolbox first depends partly on luck and partly on how fast your patching infrastructure runs.

Probably a good time to find out which one you have.
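
The "scope your exposure by extension" step described earlier can be sketched in a few lines. This is a minimal illustration, not a vetted tool: the `Advisory` type, the `exposed_advisories` function, and the sample inventory are hypothetical names introduced here, and the CVE-to-extension mapping is copied from the article text rather than pulled from a live feed. It assumes you already have a list of installed extension names, for example from PostgreSQL's `pg_extension` catalog (`SELECT extname FROM pg_extension;`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    """One advisory: a CVE identifier and the extension it affects."""
    cve_id: str
    component: str


# Mapping taken from the article text; a real pipeline would load this
# from a disclosure feed instead of hard-coding it (assumption).
ADVISORIES = [
    Advisory("CVE-2026-2005", "pgcrypto"),
    Advisory("CVE-2026-2004", "intarray"),
    Advisory("CVE-2026-2007", "pg_trgm"),
]


def exposed_advisories(installed_extensions, advisories=ADVISORIES):
    """Return only the advisories whose component is actually installed."""
    # Normalize names so " PGCRYPTO " and "pgcrypto" compare equal.
    installed = {name.strip().lower() for name in installed_extensions}
    return [a for a in advisories if a.component.lower() in installed]


if __name__ == "__main__":
    # Hypothetical inventory, e.g. the rows returned by
    # `SELECT extname FROM pg_extension;` on one database.
    installed = ["plpgsql", "pg_trgm", "uuid-ossp"]
    for adv in exposed_advisories(installed):
        print(f"{adv.cve_id}: {adv.component} is installed, prioritize this patch")
```

The point of the design is that the filter runs against what is deployed, not against what is published: an advisory for an extension you never installed drops out before a human has to read it.
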

---

**Sources:**

- [PostgreSQL 18.2, 17.8, 16.12, 15.16, and 14.21 Released (postgresql.org)](https://www.postgresql.org/about/news/postgresql-182-178-1612-1516-and-1421-released-3235/)
- [AI finds 20-year-old bugs in PostgreSQL and MariaDB (CSO Online)](https://www.csoonline.com/article/4167137/ai-finds-20-year-old-bugs-in-postgresql-and-mariadb.html)
- [AI Vulnerability Discovery and the Open Source CVE Surge (Security Boulevard)](https://securityboulevard.com/2026/05/ai-vulnerability-discovery-and-the-open-source-cve-surge/)
- [The First CVE Wave: AI-Assisted Vulnerability Discovery (VulnCheck)](https://www.vulncheck.com/blog/ai-assisted-vulnerability-discovery)
- [30 CVEs in Two Months: What the Spring Numbers Tell Us (HeroDevs)](https://www.herodevs.com/blog-posts/30-cves-in-two-months-what-the-spring-numbers-tell-us-about-the-future-of-open-source-security)
- [Dirty Frag Linux Kernel CVEs (TuxCare)](https://tuxcare.com/blog/dirty-frag-cve-2026-43284-cve-2026-43500-kernelcare-live-patches-released/)
- [Fragnesia CVE-2026-46300 (AlmaLinux)](https://almalinux.org/blog/2026-05-13-fragnesia-cve-2026-46300/)
- [AI-Driven Vulnerability Discovery and Exploit Trends (Barracuda Networks)](https://blog.barracuda.com/2026/05/15/CVE-surge-patch-diff-exploitation-vendor-targeting-trends)
- [NIST CVE Prioritization as AI Speeds Up Discovery (Penligent)](https://www.penligent.ai/hackinglabs/nist-cve-prioritization-as-ai-speeds-up-vulnerability-discovery/)

---

# Why Engineers Are Writing Specs in HTML (And When You Should Too)

URL: https://sdd.sh/2026/05/html-specs-structured-machine-readable/
Date: 2026-05-15
Updated: 2026-05-15
Tags: spec-driven-development, html, ai-agents, specs, claude-code, structured-data
Categories: Spec-Driven Development, Guides
Summary: A growing number of engineering teams are ditching Markdown for HTML when writing specs, not because they enjoy writing more verbose documents, but because HTML's semantic structure gives AI agents significantly
richer context when implementing from a spec. Here is where the tradeoff makes sense and how to do it well.

Markdown is the default format for everything in software engineering: README files, wikis, ADRs, specs. It is frictionless, readable in any text editor, and renders beautifully in GitHub. For most purposes it is perfectly fine.

But "perfectly fine" is not the same as "optimal for machine consumption." When you are practicing Spec-Driven Development (writing a spec and handing it to an AI agent to implement), the format of that spec is not a cosmetic detail. It is load-bearing infrastructure.

A growing number of teams are discovering that HTML, specifically semantic HTML, is a better substrate for complex specs. Not because HTML is fun to write, but because the semantic signal it carries meaningfully changes what an AI agent can infer from the document.

## The Semantic Gap Between Markdown and HTML

Consider two ways of marking up the same content.

In Markdown:

```markdown
# Auth
## Token format
...
## Refresh logic
...
```

In semantic HTML:

```html
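<section id="auth" data-owner="team-identity" data-status="approved">
  <h1>Auth</h1>
  <section id="auth-token-format">
    <h2>Token format</h2>
    <p>...</p>
  </section>
  <section>
    <h2>Refresh logic</h2>
    <p>...</p>
  </section>
</section>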
```

The Markdown version is a flat list of headings. The HTML version is a graph. The agent reading the HTML version knows that "Refresh logic" is a child of "Auth," that a team named `team-identity` owns this section, and that the section has been approved, not just drafted. It can also link directly to `#auth-token-format` from anywhere else in the document without ambiguity.

In a 20-page spec this distinction is academic. In a 100-page spec covering authentication, payments, notifications, compliance, and internal APIs, it becomes the difference between an agent that navigates the document purposefully and one that drifts.

## Where HTML Genuinely Wins

**Large, multi-team specs.** When more than one team owns different sections of a spec, `data-owner` attributes give you machine-readable provenance without cluttering the human-readable content. An agent generating code for the payments flow can filter the spec to only sections where `data-owner="team-payments"` and avoid pulling in noise from adjacent sections.

**Stable internal cross-references.** Markdown's internal link syntax (`[see auth](#auth)`) works, but the anchor targets are derived from heading text, which changes. In HTML, `id` attributes are explicit and stable. A spec that links to the refresh-behavior section by its `id` will not silently break if someone rewords the heading.

**Tabular data and API definitions.** HTML tables explicitly separate `<thead>` from `<tbody>`. A `<dl>` (definition list) is semantically perfect for API field definitions: each `<dt>` is a field name, each `<dd>` is its description and type. AI agents reading these elements know they are processing structured data, not prose.

**Status tracking.** `data-status="draft"` versus `data-status="approved"` versus `data-status="deprecated"` gives an agent immediate signal about which sections to implement against and which to flag for review. This is metadata that Markdown forces you to embed inline as text, where it is invisible to automated parsing.

**Long-lived living documentation.** If a spec will outlive the initial implementation and serve as the canonical reference for a system, HTML's explicit structure makes it easier to maintain, diff, and query over time.

## Practical Patterns Worth Adopting

Use `<section>` with explicit `id` attributes for every major and minor section. Use `<article>` for self-contained components (a single API endpoint, a single data model). Use `