Legal AI company Harvey had a problem that every team running autonomous agents eventually hits: the agent kept making the same mistakes. A workaround that a human had corrected in session 47 was gone by session 48. A tool usage pattern that the team had optimized was invisible to the next run. Institutional knowledge evaporated every time the context window closed.
After deploying Anthropic’s new Dreaming feature for Claude Managed Agents, Harvey’s task completion rate increased roughly 6×. That is not a modest improvement to a success metric — it is a structural shift in what the agent is capable of, compounding across every subsequent run.
Dreaming shipped on May 6 at Anthropic’s Code with Claude SF event, alongside two related features: Outcomes (rubric-based success evaluation) and Multiagent Orchestration (a coordinator/specialist model for decomposing large tasks). Together they move Claude Managed Agents from a capable agent loop to something closer to a continuously improving system.
## What Dreaming Actually Does
Memory — launched earlier this year in public beta — stores facts across sessions: preferences, project state, file structures, user-specific context. It is a notepad the agent can write to and read from.
Dreaming operates at a higher level of abstraction. It is a scheduled background process that reviews the agent’s past sessions and memory stores, extracts patterns across them, and curates memories so the agent improves over time. It surfaces things a single session cannot see on its own: recurring mistakes across dozens of runs, workflows the agent consistently converges on, preferences that show up across a team’s sessions.
When a dream runs, it reads the existing memory store alongside past session transcripts, then produces a new, reorganized memory store:
- Duplicate entries merged into single canonical facts
- Stale or contradicted knowledge replaced with the latest validated value
- New patterns surfaced as explicit learnings the next session will act on
The output is written as plain-text notes and structured playbooks — not embedded weights, not opaque vectors. Every insight is readable, auditable, and correctable by a human. You can see exactly what the agent learned and why.
For Harvey’s legal agents — handling long-form drafting, multi-document review, and complex legal research — the patterns Dreaming surfaced included filetype workarounds, tool-specific usage sequences, and recurring failure modes that no single session had enough context to detect. The agents now arrive at each session pre-loaded with the institutional knowledge the team had built up, rather than starting from scratch.
## Control, Not Autopilot
One important design decision: Dreaming does not have to be automatic.
You decide how much control you want. The two modes:
- Auto-update: Dreaming runs on a schedule, updates memory, and the next session benefits immediately. Low friction, suited for established agents with validated memory stores.
- Manual review: Dreaming produces a candidate memory update. A human reviews the proposed changes before they land. Higher friction, appropriate during the early stages when you are still calibrating what the agent should and should not retain.
This is a meaningful distinction. Automatic memory updates in a production agentic system carry real risk: a bad dream could propagate a systematic mistake at scale. The manual review mode gives teams an audit layer, which matters especially in regulated industries like the ones Harvey operates in, where the agent's reasoning chain needs to be defensible.
Schedules are configured through the Managed Agents API; the Dreams API docs cover the YAML structure and available trigger windows.
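As a sketch of what a schedule might look like, with every field name assumed for illustration rather than taken from the documented schema:

```yaml
# Hypothetical dream schedule -- field names are illustrative, not the
# documented schema; the Dreams API docs define the real YAML structure.
dreams:
  schedule: "0 3 * * *"      # cron-style trigger: nightly, outside peak hours
  mode: manual_review        # or auto_update once the memory store is validated
  sources:
    - memory_store
    - session_transcripts
  lookback_sessions: 50      # how much history each dream reads
```

The mode field is where the auto-update versus manual-review choice described above would live.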
## Outcomes: Defining Done Before the Agent Starts
The second feature — Outcomes — solves a different problem. Most agentic workflows define a task, run the agent, and then evaluate the result by eye. That works fine for a demo. It does not scale to production agents running thousands of tasks per week.
Outcomes lets you write a rubric describing what success looks like before the agent starts. The agent then works toward that rubric. When it finishes, a separate grader evaluates the output against your criteria — in its own context window, so it is not influenced by the agent’s reasoning or any in-session confirmation bias.
When the grader finds the output does not meet the rubric, it pinpoints what needs to change and sends the agent back for another pass. You define the success criteria once; the agent iterates until it meets them.
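Anthropic's docs define the exact request shape; as a minimal sketch, assuming a sessions endpoint that accepts a rubric and a retry budget (the path and every field name here are hypothetical):

```python
# Sketch: starting a session with an Outcomes rubric.
# Endpoint path and field names are assumptions for illustration only.
import os
import requests

API = "https://api.anthropic.com/v1"
HEADERS = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-beta": "managed-agents-2026-04-01beta",
}

session = requests.post(
    f"{API}/sessions",  # hypothetical path
    headers=HEADERS,
    json={
        "task": "Draft a summary judgment motion from the attached case file.",
        "outcome": {
            # The rubric the separate grader evaluates against,
            # in its own context window.
            "rubric": [
                "Every cited case appears in the provided case file",
                "All required sections are present: facts, standard, argument",
                "No paragraph exceeds 150 words",
            ],
            "max_retries": 3,  # grader-driven passes before giving up
        },
    },
).json()
```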
You can also add a webhook: `POST /sessions/{id}/outcome` notifies your system when the agent completes (or exhausts its retry budget). No polling. The agent runs, the webhook fires, you process the result.
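On the receiving side, the handler can be a few lines. A minimal sketch, assuming a JSON payload with `session_id` and `status` fields (both names hypothetical):

```python
# Sketch: receiving the outcome webhook. The payload fields shown
# (session_id, status) are assumptions, not documented names.
from flask import Flask, request

app = Flask(__name__)

def ship(session_id: str) -> None:
    # Hypothetical downstream step: publish the approved output.
    print(f"shipping result for session {session_id}")

def escalate(session_id: str) -> None:
    # Hypothetical: route to human review once retries are exhausted.
    print(f"escalating session {session_id}")

@app.post("/hooks/claude-outcome")
def claude_outcome():
    event = request.get_json()
    # Assumed payload shape: {"session_id": "...", "status": "passed" | "failed"}
    if event["status"] == "passed":
        ship(event["session_id"])
    else:
        escalate(event["session_id"])
    return "", 204
```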
This is a production pattern, not a convenience feature. Teams that are building SLA-bound agentic workflows — “this task must meet these criteria before it ships” — have had to implement this evaluation loop themselves until now. Outcomes brings it into the managed layer.
## Multiagent Orchestration: Coordinator + Specialists
The third feature is the most architecturally interesting. When a task is too large or too heterogeneous for a single agent to handle well, Multiagent Orchestration lets a lead agent break the job into pieces and delegate each to a specialist with its own model, prompt, and tools.
The canonical example from Anthropic’s documentation is an incident investigation: a lead agent runs the overall investigation while subagents fan out in parallel through deploy history, error logs, metrics dashboards, and support tickets. The specialists work simultaneously on a shared filesystem and contribute findings to the lead agent’s overall context.
Key constraints (a request sketch follows the list):
- Up to 20 unique agents per multiagent session (the coordinator plus up to 19 specialists)
- Each specialist can have its own model, system prompt, and tool set — you are not restricted to a homogeneous fleet
- Specialists write to a shared filesystem; the coordinator aggregates
- No separate access request required; available via the Claude Platform API with the `managed-agents-2026-04-01beta` header
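Putting those constraints together, a session request might look something like this. The endpoint path, body fields, and model names are assumptions for illustration; only the beta header comes from the announcement:

```python
# Sketch: a coordinator + specialists session for an incident investigation.
# The /multiagent-sessions path, body fields, and model names are assumed.
import os
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/multiagent-sessions",  # hypothetical path
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-beta": "managed-agents-2026-04-01beta",
    },
    json={
        "coordinator": {
            "model": "claude-opus-4",  # placeholder model names throughout
            "system": "Lead the incident investigation; delegate and synthesize.",
        },
        # Up to 19 specialists, each with its own model, prompt, and tools.
        "specialists": [
            {"name": "deploys", "model": "claude-sonnet-4", "tools": ["git"]},
            {"name": "logs",    "model": "claude-sonnet-4", "tools": ["grep"]},
            {"name": "metrics", "model": "claude-haiku-4",  "tools": ["http"]},
        ],
        "shared_filesystem": True,  # specialists write findings for the coordinator
    },
)
resp.raise_for_status()
```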
The 20-agent limit is not a technical ceiling — it is a guardrail while the feature is in public beta. Anthropic will almost certainly adjust it based on what production deployments actually need.
## Why This Matters Beyond the Headline
The Harvey 6× number is striking, but the structural implication is bigger than any single metric.
Until now, Claude Managed Agents was a capable, stateful loop. You got persistence across sessions (memory), sandboxed execution, checkpointing, and solid infrastructure. What you did not get was an agent that could learn from its own history in a systematic, auditable way — or a mechanism to define and enforce success criteria automatically.
Dreaming + Outcomes + Multiagent Orchestration close three of the main gaps between “capable prototype” and “production-grade autonomous system”:
| Gap | Feature that closes it |
|---|---|
| Agent forgets what it learned | Dreaming: scheduled memory curation |
| Success is defined by eyeballing | Outcomes: explicit rubric + grader loop |
| Single agent can’t handle complex parallel work | Multiagent: coordinator + specialist fleet |
Together they describe a system that can improve over time (Dreaming), enforce its own quality bar (Outcomes), and scale its capacity horizontally (Multiagent) — without requiring a human in the loop for each iteration.
## How to Think About It in Practice
If you are running Claude Managed Agents today, here is the sequencing that makes sense:
Start with Outcomes if you have a production workflow with a definable success criterion. This is the lowest-risk addition: you are just specifying what “done” means and letting the agent iterate toward it, with a webhook notification when it gets there.
Add Dreaming once you have enough session history for it to be useful — typically after the agent has run 20-50 sessions with meaningful memory writes. Enable manual review mode first. Review a few dream outputs by hand before switching to auto-update.
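If you want to script that review step, a sketch of the loop, assuming endpoints for listing and approving candidate updates (all paths and fields hypothetical):

```python
# Sketch of a manual-review loop: fetch candidate dream updates, eyeball the
# proposed changes, then approve or reject. All paths and fields are assumed.
import os
import requests

API = "https://api.anthropic.com/v1/agents/my-agent"  # hypothetical agent path
HEADERS = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-beta": "managed-agents-2026-04-01beta",
}

pending = requests.get(f"{API}/dreams?status=pending", headers=HEADERS).json()
for dream in pending["dreams"]:
    print(dream["proposed_changes"])  # plain-text diff of the memory store
    verdict = input("apply this update? [y/N] ")
    action = "approve" if verdict == "y" else "reject"
    requests.post(f"{API}/dreams/{dream['id']}/{action}", headers=HEADERS)
```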
Add Multiagent Orchestration when a single agent is hitting token or time limits on a given class of tasks, or when you have genuinely parallel workstreams that benefit from specialization (legal research + document drafting + citation verification can run simultaneously rather than sequentially).
Combine all three for the full flywheel: multiagent sessions generate more session data for Dreaming to work with; Dreaming curates learnings that improve the specialists; Outcomes catches regressions before they compound.
## The Bigger Picture
Dreaming is Anthropic’s answer to a question the industry has been asking for a year: how do you make an agentic system that actually gets better over time, without retraining the model?
The answer is not fine-tuning and not RL on production traffic — both are expensive, opaque, and hard to govern. It is a scheduled background process that reads past behavior, synthesizes it into plain-text learnings, and writes those learnings back to the memory store that the next session reads. Observable, auditable, reversible.
That is the design philosophy Claude Code has always taken: terminal-native, filesystem-transparent, inspectable at every step. Dreaming extends the same philosophy up the stack to the agent memory layer.
Harvey’s 6× completion rate will not be universal — it reflects a specific agent architecture (long-form legal drafting, complex tool chains, high session volume) where memory curation pays off quickly. Your numbers will depend on your task structure and session volume. But the direction is clear: agents that can learn from their own history are not just more convenient — they are fundamentally more capable in ways that compound.
The question is not whether to use Dreaming. The question is how quickly you can accumulate enough session history to make it valuable.
Sources:
- New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration — Anthropic
- Dreams — Claude API Docs
- Claude Managed Agents overview — Claude API Docs
- Anthropic introduces “dreaming,” a system that lets AI agents learn from their own mistakes — VentureBeat
- Anthropic is letting Claude agents ‘dream’ so they don’t sleep on the job — SiliconANGLE
- Anthropic will let its managed agents dream — The New Stack
- Claude agents can now dream — XDA Developers