DeepSeek dropped V4 on April 24, 2026, and the headline numbers are hard to ignore: 80.6% on SWE-bench Verified, 93.5% on LiveCodeBench, a Codeforces rating of 3206, and pricing that is roughly one-sixth of Claude Opus 4.7 and GPT-5.5. Open-weight, MIT license, 1 million token context.
If you only read the benchmark sheet, this looks like the moment DeepSeek cracked the frontier — at Chinese-lab economics.
The reality is more nuanced, and worth reading carefully before you migrate anything.
## What DeepSeek V4 Actually Is
Two variants shipped in preview:
- **V4-Pro** — 1.6 trillion total parameters, approximately 49 billion active per inference pass via a 384-expert Mixture-of-Experts architecture. 1M token context. This is the frontier contender.
- **V4-Flash** — 284 billion total parameters, 13 billion active. Same context window. Built for throughput where V4-Pro would be overkill.
The architecture headline is a new hybrid attention mechanism: Compressed Sparse Attention (CSA) combined with Heavily Compressed Attention (HCA). In practice, what this means for long-context use cases is significant: at 1M tokens, V4-Pro requires only 27% of the inference FLOPs and 10% of the KV cache compared to V3.2. That is not a marginal improvement — it changes the economics of running long-context agent loops at scale.
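To make the KV-cache claim concrete, here is a back-of-envelope sketch. The baseline bytes-per-token figure is an illustrative assumption, not a published V3.2 spec; only the 10% retention ratio comes from the release claims.

```python
# Back-of-envelope KV-cache sizing at 1M-token context.
# baseline_bpt is an ASSUMED illustrative figure for V3.2, not a
# published spec; only the 10% retention ratio is from the V4 claims.

def kv_cache_gib(tokens: int, bytes_per_token: float) -> float:
    """KV-cache size in GiB for a given context length."""
    return tokens * bytes_per_token / 2**30

CONTEXT = 1_000_000
baseline_bpt = 70_000          # assumed V3.2 bytes per token (fp16 K+V)
v4_bpt = baseline_bpt * 0.10   # V4-Pro keeps ~10% of the KV cache

print(f"V3.2   @ 1M tokens: {kv_cache_gib(CONTEXT, baseline_bpt):6.1f} GiB")
print(f"V4-Pro @ 1M tokens: {kv_cache_gib(CONTEXT, v4_bpt):6.1f} GiB")
```

Under these assumed numbers, a context that needed ~65 GiB of KV cache fits in ~6.5 GiB, which is the difference between sharding across GPUs and fitting on one.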
At 1.6 trillion parameters, V4-Pro is the largest open-weight model ever released, and it ships under the MIT license.
## The Benchmark Picture
| Benchmark | DeepSeek V4-Pro | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 |
|---|---|---|---|---|
| SWE-bench Verified | 80.6% | 80.8% | — | — |
| LiveCodeBench | 93.5% | 88.8% | — | 91.7% |
| Codeforces Rating | 3206 | — | 3168 | 3052 |
| GPQA Diamond | 90.1% | — | 93.6% | 94.3% |
The coding numbers are legitimately impressive. LiveCodeBench is one of the cleanest benchmarks for raw coding ability — it uses live competitive programming problems that post-date training cutoffs, so models can’t pattern-match against training data. V4-Pro at 93.5% is the best published score on that benchmark as of this writing.
SWE-bench Verified tells a tighter story: 80.6% for V4-Pro versus 80.8% for Claude Opus 4.7. That is statistical noise. For real GitHub issue resolution on real repositories, they are at parity today.
What V4-Pro does not lead on: SWE-bench Pro, MCP-Atlas, Terminal-Bench 2.0, and the multi-agent coordination benchmarks where Opus 4.7 was specifically tuned. DeepSeek has not published SWE-bench Pro numbers. That absence is notable given how prominently other labs have published on it — it is the benchmark most resistant to data contamination, and the one that best predicts production agentic performance.
## The Cost Argument, Honestly Stated
This is where the case for V4 is strongest, and it is a real case.
API pricing:
- V4-Flash: $0.14 / $0.28 per million input/output tokens
- V4-Pro: $0.145 / $3.48 per million input/output tokens
- Claude Opus 4.7: $5 / $25 per million input/output tokens
- GPT-5.5: $5 / $30 per million input/output tokens
For output-heavy workloads — code generation, long agentic loops with extensive tool call responses — V4-Pro is approximately 7× cheaper than Opus 4.7. V4-Flash is cheaper still, by nearly two orders of magnitude.
If you are running agents at scale and your tasks are mostly code synthesis and retrieval-augmented generation rather than complex multi-agent coordination, the economics are significant. A workflow costing $5 on V4-Pro runs $35 on GPT-5.5. At volume, that is the difference between a viable product margin and a cost problem.
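The arithmetic behind those figures can be sketched as a toy cost model. The per-million-token rates are the ones quoted above; the workload shape (2M input, 1M output tokens) is an illustrative assumption.

```python
# Toy cost model for an output-heavy agent workload.
# Rates are the quoted (input, output) prices per million tokens.
PRICING = {
    "v4-flash": (0.14, 0.28),
    "v4-pro": (0.145, 3.48),
    "opus-4.7": (5.00, 25.00),
    "gpt-5.5": (5.00, 30.00),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload run on the given model."""
    inp_rate, out_rate = PRICING[model]
    return (input_tokens * inp_rate + output_tokens * out_rate) / 1_000_000

# Example: an agent loop consuming 2M input tokens, emitting 1M output.
for model in PRICING:
    print(f"{model:10s} ${run_cost(model, 2_000_000, 1_000_000):6.2f}")
```

On this assumed workload shape, V4-Pro comes out to $3.77 per run against $35.00 on Opus 4.7, which is where the order-of-magnitude framing in the paragraph above comes from.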
The MIT license amplifies this: you can self-host, fine-tune for your proprietary codebase, and run inference on your own infrastructure. No API dependency, no data egress to a third-party provider.
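Self-hosting in practice usually means serving the weights behind an OpenAI-compatible endpoint (servers like vLLM expose `POST /v1/chat/completions`), so existing client code needs only a base-URL change. A minimal sketch of the request body, where the model label `deepseek-v4-pro` is an assumed deployment name, not an official identifier:

```python
# Sketch of the request body for a self-hosted, OpenAI-compatible
# endpoint. The model name "deepseek-v4-pro" is an assumed deployment
# label; substitute whatever name your inference server registers.
import json

def build_chat_request(prompt: str, model: str = "deepseek-v4-pro") -> dict:
    """Build a /v1/chat/completions request body for a local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }

body = json.dumps(build_chat_request("Write a unit test for parse_config"))
```

Because the wire format is the same, swapping a closed API for a self-hosted model is a configuration change rather than a rewrite, which is what makes the migration calculus worth running at all.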
## What the Benchmark Sheet Does Not Tell You
Three things are missing from the V4 launch narrative that matter for production agentic workflows.
SWE-bench Pro scores. Every frontier lab published SWE-bench Pro results over the past few months — it became the discriminating benchmark precisely because it is contamination-resistant. DeepSeek did not. Claude Opus 4.7 sits at 64.3% on SWE-bench Pro; MiniMax M2.7 (an open-source competitor) published 56.22%. Without a V4-Pro SWE-bench Pro number, the “matches the frontier” claim is incomplete.
Agentic harness. V4-Pro is a model. Claude Code is a model plus a purpose-built agentic scaffold: persistent bash sessions, worktree isolation, multi-agent orchestration, Routines with event triggers, CLAUDE.md project context, and a terminal-native operating model. The benchmark measures the model in isolation; production agents are model + harness. A V4-Pro model in a generic OpenAI-compatible server is a different product than Claude Opus 4.7 inside Claude Code.
Preview status. This is a preview release. SWE-bench Verified scores frequently revise between preview and GA. V4-Flash in particular received mixed reactions from developers who found it was not a significant jump over V3.2 for their specific use cases. Wait for independent developer benchmarking on production codebases before treating the launch numbers as settled.
## The Open-Source Dynamic
The strategic picture here is larger than a single model release.
DeepSeek V4-Pro at MIT license, running at frontier-competitive coding performance, is the clearest signal yet that the closed-model tax is becoming optional for coding workloads. GLM-5.1 landed at 58.4% on SWE-bench Pro under MIT in April. MiniMax M2.7 reached 56.22%. DeepSeek V4-Pro matches the top closed models on SWE-bench Verified.
This is not a fluke trajectory. Open-weight models are closing the capability gap with each generation, and they are doing it with substantially better economics. For teams with the infrastructure to self-host, fine-tuned open-weight models at V4-Pro performance levels are increasingly a viable alternative to paying frontier API rates.
The question for engineering organizations is whether the capability you are actually getting from closed-model APIs justifies the cost premium. For long-context code synthesis and standard agentic workflows, it is getting harder to justify.
## Where Claude Still Has an Edge
Honest accounting: Claude Opus 4.7 leads on SWE-bench Pro (64.3%, with no equivalent published from DeepSeek) and MCP-Atlas (79.1%), and posts 69.4% on Terminal-Bench 2.0, where GPT-5.5 holds the top score. More importantly, it comes packaged with Claude Code’s agentic infrastructure, which is purpose-built for autonomous terminal-native work in a way that no model-only release from DeepSeek can replicate.
The multi-agent coordination features in Opus 4.7 — one-third the tool errors in agentic loops, 14% improvement on complex multi-step workflows, native Agent Teams support — are architectural bets that Anthropic has been building toward since 2024. A model that scores similarly on SWE-bench Verified is not the same as a model that performs similarly when orchestrating 10 parallel sub-agents across a real deployment pipeline.
If your workload is: “generate code for well-specified tasks with bounded context” — V4-Pro is a serious alternative to evaluate. If your workload is: “run autonomous agents across a complex codebase, coordinate parallel workstreams, and handle the failure modes of long-running multi-step tasks” — Opus 4.7 inside Claude Code is still the stack to beat.
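The workload split described above can be written down as an explicit routing policy. The task fields, thresholds, and model identifiers here are assumptions for the sketch, not real API names:

```python
# Illustrative routing policy for the workload split described above.
# Task fields, the 200K-token threshold, and model identifiers are
# assumptions for this sketch, not real API names.

def pick_model(task: dict) -> str:
    """Route a task to a model tier based on its shape."""
    if task.get("multi_agent") or task.get("long_running"):
        return "claude-opus-4.7"    # harness and coordination matter here
    if task.get("well_specified") and task.get("context_tokens", 0) <= 200_000:
        return "deepseek-v4-flash"  # cheapest tier for bounded work
    return "deepseek-v4-pro"        # heavier single-shot synthesis

print(pick_model({"well_specified": True, "context_tokens": 50_000}))
# -> deepseek-v4-flash
```

The point of writing it out is that the routing decision is cheap and mechanical; the expensive part is validating that each tier actually clears your quality bar on your own codebase.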
## The Practical Recommendation
V4-Flash at $0.14/$0.28 per million tokens is an obvious candidate for any cost-sensitive, bounded-context coding workload. The price-to-capability ratio is the best in the market for that tier.
V4-Pro is worth evaluating against Opus 4.7 for code synthesis tasks where you have the infrastructure to run comparisons. Wait for SWE-bench Pro numbers and post-preview stability before migrating production agentic pipelines.
The MIT license and self-hosting option are genuinely valuable, particularly for organizations that have data-residency requirements or want to fine-tune on proprietary codebases. That option did not exist at this capability level six months ago.
DeepSeek V4 is the best evidence so far that open-source has reached coding-frontier parity on the benchmarks that are easiest to measure. The benchmarks that are hardest to game — and the agentic scaffolding that turns a model into a production tool — are still a Claude story.
Sources: DeepSeek V4 Preview Release Notes, TechCrunch: DeepSeek previews new AI model, VentureBeat: DeepSeek-V4 cost comparison, CNBC: DeepSeek V4 release, MIT Technology Review: Why DeepSeek’s V4 matters