Anthropic’s May 6 Claude Managed Agents update shipped three features: Dreaming (already covered), Outcomes, and Multiagent orchestration. Dreaming is the async self-improvement loop that runs overnight and reshapes memory stores. Outcomes and Multiagent are the synchronous production primitives — the ones you wire into your agent code to make it reliable enough to run without supervision.
This is the hands-on guide. If you want the conceptual overview, the Anthropic blog post has it. This article covers how to actually use Outcomes and Multiagent in production, what the failure modes look like, and how to compose the two features together.
## What Outcomes Actually Does
The core idea is simple: you write a rubric describing what success looks like, and a separate Claude instance grades the agent’s output against that rubric in its own context window. If the output fails, the grader tells the agent exactly what needs to change, and the agent takes another pass. This repeats until the output passes or a retry limit is reached.
“Separate context window” is the key design choice. The grader cannot see the agent’s reasoning trace, the tools it called, or the steps it took to produce the output. It evaluates the artifact itself — the document, the diff, the report — against your stated criteria. This eliminates a failure mode common in self-evaluation: models that generated a flawed output tend to rationalize it as correct when asked to review their own work in the same context.
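In request form, the loop might be wired up like this. A minimal sketch, not confirmed API surface: the `outcomes` block, its field names, and the use of `extra_body` are assumptions; only the beta header string comes from the docs.

```python
import anthropic

client = anthropic.Anthropic()

# Sketch of an Outcomes-enabled request. "outcomes", "rubric", and
# "max_retries" are assumed field names, not confirmed schema; check the
# Define Outcomes docs for the real one.
response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    betas=["managed-agents-2026-04-01"],  # beta header named in the article
    messages=[{"role": "user", "content": "Draft the Q3 incident report."}],
    extra_body={
        "outcomes": {
            "rubric": [
                "Executive summary is three paragraphs or fewer",
                "Does not reference internal ticket numbers",
                "Every incident lists a root cause and an owner",
            ],
            "max_retries": 3,  # the recommended starting point (see below)
        }
    },
)
print(response.content)
```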
In Anthropic’s internal benchmarks, Outcomes improved task success rates by up to 10 percentage points over a standard prompting loop, with the largest gains on the hardest tasks. MindStudio’s production data shows a 10.1% improvement in PowerPoint quality when Outcomes was applied to a slide generation agent — a concrete, customer-facing quality improvement with a single rubric change.
## Writing Rubrics That Work
The most common way to waste Outcomes is to write a vague rubric. “Make this excellent, polished, and accurate” is not a rubric; it is a wish. A rubric the grader cannot operationalize produces a false sense of governance: the agent churns through retry cycles, the grader approves whatever it receives, and you end up with a worse output than a single well-prompted pass would have produced.
A rubric that works has three properties (a combined example follows the list):
Observable criteria. Each criterion must be checkable from the document alone, without external knowledge. “The executive summary is three paragraphs or fewer” is observable. “The tone is appropriate” is not.
Explicit constraints. State what the output must not do, not just what it should do. Negative criteria are easier for a grader to evaluate. “No passive voice in the executive summary” is a better criterion than “use active voice.” “Does not reference internal ticket numbers” is cleaner than “appropriate for external distribution.”
Tiered requirements. Separate must-haves from nice-to-haves. If the output fails a must-have criterion, it should retry. If it misses a nice-to-have, the grader can flag it but still approve. The Anthropic documentation recommends 5 to 10 criteria per rubric for most tasks — enough to be specific, not so many that every output fails on minor stylistic grounds.
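Putting the three properties together, a tiered rubric might look like the following. The tier names and dict shape are assumptions for illustration; the real schema lives in the Define Outcomes docs:

```python
# Hypothetical tiered rubric: must_pass failures trigger a retry,
# should_pass failures are flagged but do not block approval.
rubric = {
    "must_pass": [
        "Executive summary is three paragraphs or fewer",   # observable
        "No passive voice in the executive summary",        # explicit negative
        "Does not reference internal ticket numbers",
        "Every recommendation names an owner and a deadline",
    ],
    "should_pass": [
        "Section headings follow the house style guide",
        "Charts include source footnotes",
    ],
}
```

Six criteria total, within the recommended 5-to-10 band, and every one is checkable from the document alone.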
Before running Outcomes in production, test your rubric on known-good and known-bad examples. Generate five outputs manually, grade them yourself, then run the grading agent and compare. If the grader disagrees with your judgment more than twice out of five, rewrite the rubric.
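A minimal way to run that calibration, assuming a hypothetical `grade_with_outcomes` helper that wraps whatever grading call you set up:

```python
# Calibration sketch: compare grader verdicts against your own labels on
# five known outputs. grade_with_outcomes is a hypothetical stand-in for
# the real grading call; wire it to your Outcomes setup.
def grade_with_outcomes(path: str, rubric: dict) -> bool:
    """Stub: send the document at `path` to the grader, return pass/fail."""
    raise NotImplementedError

labeled = [
    ("drafts/good_1.md", True),   # (output, verdict in *your* judgment)
    ("drafts/good_2.md", True),
    ("drafts/bad_1.md", False),
    ("drafts/good_3.md", True),
    ("drafts/bad_2.md", False),
]

rubric = {"must_pass": [...], "should_pass": [...]}  # the rubric under test

disagreements = sum(
    grade_with_outcomes(path, rubric) != human_verdict
    for path, human_verdict in labeled
)

# The article's threshold: more than two disagreements out of five means
# the rubric, not your judgment, is what needs rewriting.
if disagreements > 2:
    print("Rewrite the rubric before running it in production.")
```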
## The Retry Budget
Every Outcomes configuration has a retry limit. Anthropic’s documentation recommends three retries as a starting point. If your agent is failing three times before passing, the problem is almost always the prompt or the rubric, not the retry budget. Increasing retries to compensate for an underspecified rubric burns tokens and time without improving outcomes.
Watch for “rubric drift”: if you add criteria incrementally as new failures surface, you will eventually have a rubric so strict that no output passes on the first attempt even when it is genuinely good. This inflates costs and reduces throughput. Audit your rubric quarterly and remove criteria that never trigger, as in the sketch below.
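One way to make the quarterly audit concrete, assuming you log which criterion each grader rejection cited (the logging itself is yours to build):

```python
# Rubric-drift audit sketch: count how often each criterion actually
# triggered a rejection this quarter. Criteria that never fire are
# removal candidates; criteria that fire constantly point at the prompt.
from collections import Counter

# Assumed inputs: the tiered rubric dict from the earlier example, and a
# hypothetical loader returning one criterion string per grader rejection.
all_criteria = rubric["must_pass"] + rubric["should_pass"]
rejection_log = load_rejection_log()  # hypothetical: your own logging

counts = Counter(rejection_log)
for criterion in all_criteria:
    n = counts[criterion]
    flag = "   <- consider removing" if n == 0 else ""
    print(f"{n:4d}  {criterion}{flag}")
```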
## Multiagent Orchestration: The Architecture
Multiagent orchestration addresses a different problem than Outcomes. Outcomes improves quality for a single agent on a well-defined task. Multiagent addresses scope: when the job is too large, too varied, or too parallel for a single agent to do well.
The structure is a coordinator plus specialists. The coordinator receives the top-level task, decomposes it, and delegates to specialized agents — each with its own model, system prompt, tool access, and context window. Specialists work in parallel on a shared filesystem. When they complete, results flow back to the coordinator, which synthesizes the outputs.
The hard limits: a maximum of 20 unique agents can be listed in `multiagent.agents`, but the coordinator can call multiple copies of each agent. So you can have 20 specialist types with multiple parallel instances of each, which in practice means the ceiling on concurrency is higher than it initially appears. All agents share the same filesystem, which is the coordination primitive — agents write intermediate artifacts that other agents read.
All Managed Agents endpoints require the `managed-agents-2026-04-01` beta header. The API is otherwise standard Claude API syntax.
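A hypothetical session setup consistent with those limits. The `multiagent` block and its fields are assumptions inferred from the article, not the documented schema; only the beta header string is from the docs:

```python
import anthropic

client = anthropic.Anthropic()

# Sketch of a coordinator-plus-specialists session. Field names under
# "multiagent" are assumptions for illustration.
session = client.beta.messages.create(
    model="claude-sonnet-4-5",  # the coordinator's model
    max_tokens=8192,
    betas=["managed-agents-2026-04-01"],
    messages=[{"role": "user",
               "content": "Analyze build logs for the 14:00-15:00 window."}],
    extra_body={
        "multiagent": {
            "agents": [  # up to 20 unique agent types; the coordinator
                {        # may run parallel copies of each
                    "name": "log_batch_analyzer",
                    "model": "claude-haiku-4-5",  # cheaper model for fan-out
                    "system": "Summarize one batch of build logs as structured JSON.",
                },
            ],
        }
    },
)
```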
## Netflix’s Production Pattern
Netflix’s platform team built a log analysis agent using Multiagent orchestration. The problem: their platform generates logs from hundreds of concurrent builds, across multiple infrastructure layers, and the signal-to-noise ratio in raw log output is too low for a single-agent pass to surface the patterns that matter.
Their architecture: a coordinator agent receives a time window and a log scope. It fans out to sub-agents that each process a batch of logs from a specific service or build stage. Sub-agents write structured summaries to the shared filesystem. The coordinator reads those summaries, identifies cross-service patterns, and produces a ranked list of issues worth investigating.
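The shared filesystem is what makes the fan-out compose. Here is a sketch of the contract between specialists and coordinator, with invented paths and schema (Netflix’s actual artifact format is not public):

```python
# Illustrative specialist-side artifact: each sub-agent writes one
# structured summary that the coordinator later reads. All paths and
# fields here are invented for illustration.
import json
from pathlib import Path

summary = {
    "service": "svc-auth",
    "window": "14:00-15:00",
    "anomalies": [
        {"pattern": "OOMKilled spike", "count": 47, "first_seen": "14:12"},
    ],
}
Path("/scratch/summaries/svc-auth.json").write_text(json.dumps(summary))

# Coordinator side: read every summary, then rank cross-service patterns.
summaries = [
    json.loads(p.read_text())
    for p in Path("/scratch/summaries").glob("*.json")
]
ranked = sorted(
    summaries,
    key=lambda s: max((a["count"] for a in s["anomalies"]), default=0),
    reverse=True,
)
```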
What made Multiagent the right choice here: the work was embarrassingly parallel (each log batch can be processed independently), the total context exceeded a single agent’s practical window, and the per-batch tasks were homogeneous enough to use the same specialist agent type in multiple parallel copies.
What would not have worked: a single long-context agent reading all logs sequentially. At the token counts involved, quality degrades in the middle of the context window (the well-documented “lost in the middle” effect), and the sequential processing time would have made the results stale before they could be acted on.
## Composing Outcomes and Multiagent
The two features compose naturally. Apply Outcomes at the coordinator level for end-to-end quality verification. Apply Outcomes at the specialist level for tasks where individual specialists need to meet a quality bar before their output is passed to the coordinator.
For the Netflix pattern: add Outcomes to the coordinator agent with a rubric that checks the final report’s coverage, actionability, and format. Do not add Outcomes to each specialist — per-specialist verification at 300+ specialist calls would cost more in retries than the quality gain is worth. Verify at the level where quality actually matters: the final artifact.
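Concretely, reusing the hypothetical field names from the earlier sketches, the composition puts the rubric on the coordinator and leaves the specialists ungated:

```python
# Composition sketch: Outcomes at the coordinator level only. Field names
# remain assumptions, not confirmed schema.
extra_body = {
    "multiagent": {
        "agents": [
            {
                "name": "log_batch_analyzer",
                "model": "claude-haiku-4-5",
                "system": "Summarize one batch of build logs as structured JSON.",
                # no "outcomes" here: per-specialist grading at 300+ calls
                # costs more in retries than the quality gain is worth
            },
        ],
    },
    "outcomes": {  # gates the coordinator's final report
        "rubric": [
            "Every flagged issue names the affected service and build stage",
            "Issues are ranked by estimated blast radius, highest first",
            "No raw log lines are quoted verbatim in the summary section",
        ],
        "max_retries": 3,
    },
}
```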
A practical composition pattern:
```
coordinator (with Outcomes rubric)
├── specialist_a × N parallel instances
├── specialist_b × M parallel instances
└── specialist_c × P parallel instances
(specialists write to shared filesystem)
coordinator reads results → produces output → Outcomes grader evaluates → retry if needed
```

## When Not to Use These Features
Outcomes adds latency (grader call + potential retry cycles) and cost (grader tokens). For tasks that already produce high-quality output reliably, it adds overhead without benefit. The right candidates are tasks where failure has a real cost: customer-facing documents, security reports, code diffs that will be merged without further human review.
Multiagent adds coordination complexity. For tasks that fit in a single agent’s context window and do not have natural decomposition points, a single well-prompted agent is faster, cheaper, and more debuggable. Premature multiagent decomposition is one of the more common mistakes in early Managed Agents deployments.
The decision heuristic: if you are adding Outcomes to a task where outputs are already passing human review 90%+ of the time, stop and tune the prompt instead. If you are decomposing a task into agents that do not produce intermediate artifacts the coordinator actually reads, you do not have a multiagent workflow — you have parallel single-agent calls, which is cheaper and simpler to build.
## Getting Started
The Managed Agents quickstart covers authentication and the beta header. The Define Outcomes documentation has the rubric schema. The Multiagent sessions documentation has the coordinator/specialist agent config.
Start with one document type you generate repeatedly. Write a rubric with five specific, checkable criteria. Run Outcomes for a week. Measure the retry rate and whether the grader’s rejections match your own quality judgment. That feedback loop will teach you more about effective rubric design than any documentation can.
## Sources
- New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration — Anthropic
- Define outcomes — Claude API Docs
- Multiagent sessions — Claude API Docs
- Claude Managed Agents overview — Claude API Docs
- Claude Outcomes Feature Improved PowerPoint Quality 10.1% — MindStudio
- Anthropic updates Claude Managed Agents with three new features — 9to5Mac
- Codex /goal and Claude Managed Outcomes: The New Control Loops — Developers Digest
- Claude Managed Agents: complete guide to building production AI agents — The AI Corner