From Specs to Shipping: Five Production Teams That Went Spec-First


Spec-Driven Development (SDD) is easy to describe in theory: write specs instead of code, and let AI handle the implementation. It is harder to evaluate in practice because the case studies have historically been either anecdotal (“I tried it on a side project and it worked”) or untraceable (“a large enterprise reduced time-to-deploy by X%”).

That is starting to change. Five teams have published enough detail — through conference talks, engineering blogs, investor materials, and analyst surveys — to build a coherent picture of what SDD looks like when it actually ships software at scale.

Mercado Libre: 90% Autonomous, 23,000 Engineers

Mercado Libre is the largest e-commerce platform in Latin America, with 23,000 engineers spread across Argentina, Brazil, Mexico, Uruguay, and five other countries. In early 2026, they announced a target that most engineering organizations would consider reckless: 90% autonomous code generation by Q3 2026, driven entirely by AI agents operating against internally authored specs.

The methodology disclosed at Code with Claude SF (May 6, 2026) follows the SDD pattern precisely. Engineers write requirements documents that describe the problem, the constraints, the expected behavior, and the acceptance criteria. Claude Code runs in agentic mode against those specs, producing implementations that are gated by automated test suites before human review is even triggered.
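The shape of that workflow can be sketched in a few lines of Python. This is an illustrative sketch only, not Mercado Libre’s tooling: the `Spec` and `Criterion` classes, the `gate` function, and the slug example are all hypothetical names introduced here. It shows a spec carrying the four parts named above, and an automated gate that decides whether a candidate implementation is ready for human review.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical: the function under specification takes and returns a string.
Implementation = Callable[[str], str]


@dataclass
class Criterion:
    """One named, machine-checkable acceptance criterion."""
    name: str
    check: Callable[[Implementation], bool]


@dataclass
class Spec:
    """Hypothetical spec record with the four parts described in the text."""
    problem: str
    constraints: List[str]
    expected_behavior: str
    acceptance_criteria: List[Criterion] = field(default_factory=list)


def gate(spec: Spec, implementation: Implementation) -> List[str]:
    """Return the names of failed criteria; an empty list means review-ready."""
    failed = []
    for criterion in spec.acceptance_criteria:
        try:
            passed = criterion.check(implementation)
        except Exception:
            passed = False  # a crash counts as a failed criterion
        if not passed:
            failed.append(criterion.name)
    return failed


slug_spec = Spec(
    problem="Turn article titles into URL slugs.",
    constraints=["ASCII lowercase letters, digits, and hyphens only"],
    expected_behavior="Words are joined by single hyphens.",
    acceptance_criteria=[
        Criterion("lowercases and joins words",
                  lambda f: f("Hello World") == "hello-world"),
        Criterion("strips punctuation",
                  lambda f: f("Hi, there!") == "hi-there"),
    ],
)


def naive_slugify(title: str) -> str:
    return "-".join(title.lower().split())


def fixed_slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


print(gate(slug_spec, naive_slugify))  # ['strips punctuation']
print(gate(slug_spec, fixed_slugify))  # []
```

In this sketch, only a candidate that returns an empty list from `gate` would be passed on to a human reviewer; anything else goes back to the agent with the failed criterion names.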

What makes the Mercado Libre case instructive is scale. At 23,000 engineers, you cannot run an agentic workflow that requires constant human supervision — the economics collapse. The org’s decision to target 90% autonomous rather than “AI-assisted” is an implicit acknowledgment that the value is not in AI making individual engineers faster; it is in AI eliminating the bottleneck between specification and working software entirely.

The constraint: Mercado Libre’s engineering leadership has been explicit that the 90% target applies to the subset of tasks that are well-specified. They are not trying to automate product strategy or architecture decisions. The target covers implementation of clearly scoped features and bug fixes, the category where spec quality determines outcome quality.

Harvey: 6x Task Completion via Memory-Augmented Agents

Harvey is a legal AI company with roughly 50 engineers supporting a platform used by some of the largest law firms in the world. In May 2026, Anthropic disclosed that Harvey’s adoption of Managed Agents Dreaming produced a 6x increase in task completion rates for their engineering team’s AI-assisted workflows.

The mechanism matters. Harvey was not simply running AI against code tasks. They implemented a spec-first workflow where agents were given structured task descriptions, then Dreaming — Anthropic’s overnight memory consolidation feature — curated what the agents had learned about Harvey’s codebase, patterns, and constraints. Each new session started with accumulated context; agents didn’t re-discover the same patterns session after session.

This is SDD expressed through memory. The “spec” is not just the task description; it is the accumulated organizational knowledge that the agent brings to every session. Harvey’s 6x improvement is not purely from writing better individual prompts. It is from the system knowing, persistently, what good output looks like for Harvey’s specific codebase.

The constraint: This approach requires investing in the memory hygiene layer — deciding what context to preserve, what to purge, and what structure makes accumulated knowledge actually useful. Teams that treat AI as a stateless assistant miss this compounding effect.
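One way to picture the hygiene decision is as a consolidation pass between sessions. The sketch below is not how Dreaming works internally, which the sources do not describe. It is a minimal illustration under one stated assumption: a note is worth preserving only if it was observed in at least two separate sessions. The function names and the sample notes are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def consolidate(session_notes: Iterable[List[str]], min_sessions: int = 2) -> List[str]:
    """Keep notes that appeared in at least `min_sessions` separate sessions."""
    counts = Counter()
    for notes in session_notes:
        counts.update(set(notes))  # count each note at most once per session
    return sorted(note for note, seen in counts.items() if seen >= min_sessions)


def build_context(task: str, memory: List[str]) -> str:
    """Prepend the preserved notes to a new task description."""
    lines = ["Known conventions:"]
    lines += [f"- {note}" for note in memory]
    lines += ["", f"Task: {task}"]
    return "\n".join(lines)


# Hypothetical notes gathered over three agent sessions.
sessions = [
    ["use pytest", "dates are UTC"],
    ["use pytest", "dates are UTC", "temporary workaround in billing"],
    ["use pytest"],
]

memory = consolidate(sessions)
print(memory)  # ['dates are UTC', 'use pytest']
print(build_context("Add an export endpoint", memory))
```

The one-off note is purged and the recurring ones survive, so the next session starts with conventions it would otherwise have to rediscover. A real policy would likely weigh recency and correctness as well as frequency.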

TELUS: 13,000 Agents, 500,000 Hours Saved

TELUS is a Canadian telecommunications company with one of the most publicized AI agent deployments in enterprise engineering. Their internal deployment includes over 13,000 AI agents running across software development, customer operations, and network management. On the software side, Anthropic has cited two specific numbers: 30% faster code review cycles and 500,000 engineer-hours saved annually.

TELUS’s deployment is not fully SDD in the narrow sense, since not every task flows from a spec through an agent to an implementation. But the software development portion of their AI rollout follows the key pattern: agents are given structured task descriptions that act as specifications, and the output is evaluated against predefined criteria before human review.

The 30% code review improvement is notable because code review is typically where AI-generated code creates additional work, not less. Teams that generate AI code carelessly produce PRs with quality issues that overwhelm reviewers. TELUS’s deployment suggests their specs are precise enough that generated code arrives review-ready — the spec is doing the quality work before code generation begins.

The constraint: TELUS’s deployment required significant upfront investment in what their engineering org calls “agent task design,” a practice functionally equivalent to writing good specs. The 500,000 hours saved did not come from deploying AI on loosely described tasks; they came from rigorous attention to how tasks were structured before any agent touched them.

Pinterest: MCP as the Spec-Execution Layer

Pinterest’s case study is different in kind. Rather than practicing SDD at the task level, Pinterest built a platform for SDD: infrastructure that lets specifications travel through MCP (Model Context Protocol) to domain-specific execution layers.

In their April 2026 engineering blog post, Pinterest described a fleet of domain-specific MCP servers: one for ads infrastructure, one for recommendations, one for content systems. Each server exposes the tools, APIs, and context relevant to its domain. When an engineer writes a spec for an ads feature, Claude Code connects to the ads MCP server and has immediate access to the tools, schemas, and constraints relevant to that domain without the engineer having to include them in the prompt.
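The routing idea can be sketched without any MCP library at all. The code below is a plain-Python illustration, not Pinterest’s registry and not the MCP SDK: the `McpServerEntry` class, the `REGISTRY` contents, the URLs, and the tool names are all hypothetical. It shows only the lookup step, where the domain named in a spec determines which server, and therefore which tools, the agent is pointed at.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class McpServerEntry:
    """Hypothetical registry record for one domain-specific server."""
    domain: str
    url: str
    tools: Tuple[str, ...]


# Hypothetical registry contents; real URLs and tool names would differ.
REGISTRY: Dict[str, McpServerEntry] = {
    "ads": McpServerEntry(
        "ads", "https://mcp.example.internal/ads",
        ("lookup_campaign_schema", "query_ad_metrics")),
    "recommendations": McpServerEntry(
        "recommendations", "https://mcp.example.internal/recs",
        ("describe_ranking_features",)),
    "content": McpServerEntry(
        "content", "https://mcp.example.internal/content",
        ("fetch_content_policy",)),
}


def resolve_server(spec_domain: str) -> McpServerEntry:
    """Map the domain named in a spec to its registered server."""
    try:
        return REGISTRY[spec_domain]
    except KeyError:
        raise LookupError(
            f"no MCP server registered for domain {spec_domain!r}") from None


entry = resolve_server("ads")
print(entry.url)    # https://mcp.example.internal/ads
print(entry.tools)  # ('lookup_campaign_schema', 'query_ad_metrics')
```

The point of the design is that the spec only has to name its domain. Schemas, tools, and constraints are attached by the lookup, not pasted into the prompt.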

The result: 66,000 tool invocations per month, 844 monthly active users among engineering staff, 7,000 hours saved per month. The MCP infrastructure effectively encodes organizational knowledge into the execution environment — reducing the spec-writing burden because the context the agent needs is already ambient.
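A quick back-of-the-envelope calculation puts those figures on a per-engineer basis. It uses only the three numbers quoted above and assumes, as a simplification, that the saved hours are attributable to the same 844 users and 66,000 invocations; the variable names are illustrative.

```python
tool_invocations_per_month = 66_000
monthly_active_users = 844
hours_saved_per_month = 7_000

# Average usage and savings per active engineer.
invocations_per_user = tool_invocations_per_month / monthly_active_users
hours_per_user = hours_saved_per_month / monthly_active_users

# Average time saved per tool invocation, in minutes.
minutes_per_invocation = hours_saved_per_month * 60 / tool_invocations_per_month

print(round(invocations_per_user, 1))    # 78.2
print(round(hours_per_user, 1))          # 8.3
print(round(minutes_per_invocation, 1))  # 6.4
```

Under that assumption, each active engineer makes roughly 78 tool calls and saves about one working day per month, or a little over six minutes per call.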

The constraint: Building this infrastructure required significant upfront investment from Pinterest’s platform engineering team. They had to design and maintain a central MCP registry, implement JWT-based authentication per server, and create the tooling for new server onboarding. This is a team that already had strong platform engineering culture; organizations without it will find the Pinterest model difficult to replicate directly.

Zapier: 800+ Internal Agents, 89% Adoption Without Mandate

Zapier’s internal AI deployment is notable primarily because of how it happened: bottom-up, not mandated by leadership. By early 2026, 89% of Zapier’s engineering team was using AI tools regularly, and the org had deployed 800+ internal AI-powered agents across development workflows.

Zapier does not describe its workflow explicitly as SDD, but the structure matches: engineers who adopted AI most deeply shifted their work toward writing detailed task descriptions and letting agents implement. The adoption curve tracks closely with spec quality improvement — engineers who produced the best AI-assisted results were the ones who invested in clearer problem descriptions, not the ones who used the most sophisticated prompts.

The constraint: The 89% adoption rate is impressive but also reveals the floor: 11% of engineers in an automation-native company still do not find AI tools worth integrating into daily work. The most common reason cited internally was that AI output quality did not reliably exceed the cost of the correction cycle — which, on investigation, traced to specification quality rather than model capability.

Patterns Across All Five Teams

These five cases share structural characteristics that are more predictive of success than any specific tool or model choice.

Specs as organizational assets, not throwaway prompts. In every case, the teams that produced the best outcomes treated specifications as documents worth iterating on and preserving. Mercado Libre maintains spec libraries. Pinterest encodes domain knowledge into MCP servers. Harvey’s agents accumulate knowledge across sessions. The spec is the asset; the code is a byproduct.

Evaluation criteria are written before implementation begins. The most common failure mode in AI-assisted coding is discovering that the implementation is wrong after it’s built. All five teams invested heavily in specifying acceptance criteria as part of the spec — not just what to build, but what “done” looks like, in terms the agent can evaluate.

Human review shifts from checking code to auditing specs. The engineering team’s cognitive work does not disappear; it relocates. Engineers at these organizations spend more time designing and reviewing specifications than they did before. Senior engineers in particular report that their work has become more strategic — closer to architecture and requirement design, further from line-by-line implementation.

Volume of AI-generated code correlates with investment in spec infrastructure. The teams achieving the highest autonomous coding percentages are the ones that have invested the most in making specifications machine-readable, contextually rich, and evaluable. There are no shortcuts through this.

The Single Failure Mode

Every team — without exception — traces their AI-assisted development failures to specification quality rather than model capability.

Mercado Libre’s autonomous coding target has a stated precondition: the task must be well-specified. TELUS’s 30% code review improvement disappeared on tasks that arrived without structured specs. Pinterest’s MCP infrastructure only provides value for domains where the MCP server has been properly instrumented. Harvey’s memory compounding only works when the underlying task descriptions are precise enough for the agent to learn from. At Zapier, the engineers who had not adopted AI tools cited a correction cycle whose cost traced back to specification quality.

The failure mode is consistent: underspecified inputs produce underspecified outputs, regardless of the model’s capability. A 69% SWE-bench Pro score does not compensate for a vague task description. The model will try to resolve ambiguity — and will usually resolve it incorrectly, because the ambiguity is in the domain knowledge you have not provided.

The 2026 SDD stack (Claude Code, AWS Kiro, Managed Agents, MCP) has matured to the point where model capability is rarely the bottleneck. Spec quality almost always is.

Starting With SDD

If you are moving a team toward spec-driven workflows, the question to answer first is not which tool to use. It is: what needs to be true about the spec so that the correct implementation is the only plausible implementation?

That question surfaces the right work before any agent runs:

What are the constraints? Data formats, performance requirements, security invariants, third-party API limits — anything that makes some implementations wrong.
What does “done” look like? Specific, testable acceptance criteria, not “it should work.”
What context does the agent need that is not in the codebase? Business rules, domain semantics, organizational conventions — the knowledge that lives in engineers’ heads.

Specs that answer all three consistently produce good AI-assisted output. Specs that skip any one of them produce the failure mode that every team in this article has experienced at some point.
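The three questions can be turned into a mechanical pre-flight check. The sketch below is one possible approach, not a tool any of these teams has described: the section names, the `VAGUE_PHRASES` list, and the `lint_spec` function are assumptions made for illustration. It flags a spec that skips a section or states a criterion too vaguely to test.

```python
from typing import Dict, List

# Hypothetical section names, one per question above.
REQUIRED_SECTIONS = ("constraints", "acceptance_criteria", "context")

# Hypothetical markers of an untestable criterion.
VAGUE_PHRASES = ("should work", "works correctly", "as expected")


def lint_spec(spec: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the spec passes."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not spec.get(section):
            problems.append(f"missing section: {section}")
    for criterion in spec.get("acceptance_criteria", []):
        if any(phrase in criterion.lower() for phrase in VAGUE_PHRASES):
            problems.append(f"vague acceptance criterion: {criterion!r}")
    return problems


draft = {
    "constraints": ["Responses must return within 200 ms"],
    "acceptance_criteria": ["It should work"],
    "context": [],
}

for problem in lint_spec(draft):
    print(problem)
# missing section: context
# vague acceptance criterion: 'It should work'
```

A check like this cannot judge whether a spec is good, only whether it is obviously incomplete. That is still enough to stop the cheapest failures before an agent runs.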

The tools have caught up. The specs are where the work is now.
