
DeepSeek V4: Near-Frontier Performance, Open Weights, and the First Major Model Built for Huawei Chips

·1136 words·6 mins·

A year after DeepSeek R1 rattled every major AI lab’s stock price, the Chinese AI startup is back with V4. Released on April 24, 2026, DeepSeek V4 is the company’s new flagship series: a 1.6 trillion-parameter Pro model and a 284 billion-parameter Flash variant, both Mixture-of-Experts (MoE) architectures, both MIT-licensed, and both available via the DeepSeek API on day one.

The headline number is the price. At $1.74 per million input tokens for V4-Pro and $0.14/M for V4-Flash, DeepSeek is undercutting every closed-model competitor at a performance level that makes the comparison credible — not just a race-to-the-bottom cheap model. But the story that will matter more over the next few years isn’t the pricing. It’s the chips.

The Models

DeepSeek-V4-Pro packs 1.6 trillion total parameters with 49 billion active at inference time — a hallmark of the MoE efficiency architecture DeepSeek has mastered. Context window: 1 million tokens. Available on Hugging Face at 865GB. MIT license means you can deploy and modify it freely.

DeepSeek-V4-Flash is 284 billion total parameters with 13 billion active, and weighs in at 160GB on Hugging Face. Same 1M context, same license, priced at a fraction of Pro.
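
To make the self-hosting option concrete, here is a minimal serving sketch for the Flash variant using vLLM. The repo id, tensor-parallel size, and context cap are all assumptions for illustration, not details confirmed by the release; the point is only that open MIT-licensed weights mean you can stand this up yourself.

```python
# Minimal sketch: serving the Flash variant with vLLM.
# Assumptions (not from the release notes): the weights ship as a standard
# Hugging Face repo (hypothetical id "deepseek-ai/DeepSeek-V4-Flash") and
# vLLM supports the architecture as it does earlier DeepSeek MoE releases.
# At ~160GB of weights you still need several GPUs for experts plus KV cache.

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical repo id
    tensor_parallel_size=8,                 # spread experts and KV cache across 8 GPUs
    max_model_len=131072,                   # cap context well below 1M to fit memory
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Write a Python function that merges two sorted lists."], params)
print(outputs[0].outputs[0].text)
```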

Performance: V4-Pro beats all rival open-weight models on math and coding benchmarks and trails only Google’s Gemini 3.1 Pro — a closed model — on world knowledge tasks. MIT Technology Review puts the gap at roughly “3 to 6 months behind state-of-the-art frontier models.” That’s not parity with Claude Opus 4.7 or GPT-5.5 Spud, but it’s remarkably close for an open-weight model at V4-Pro’s price point.

The Efficiency Architecture

The technical paper is worth reading for the architectural innovations. For 1 million-token contexts, V4-Pro requires only 27% of the per-token FLOPs and 10% of the KV cache size of DeepSeek-V3.2. V4-Flash is even leaner, at 10% of the FLOPs and 7% of the KV cache.
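
A quick back-of-the-envelope makes those ratios tangible. In the sketch below, only the 27%/10% and 10%/7% ratios come from the paper; the V3.2 baseline figures are made-up placeholders used purely to show the scale of the savings.

```python
# Illustrative arithmetic for the paper's relative efficiency numbers at a
# 1M-token context. Baseline values are hypothetical placeholders; only the
# ratios (27%/10% for Pro, 10%/7% for Flash) are from the technical paper.

BASELINE_KV_CACHE_GB = 400.0       # hypothetical V3.2 KV cache at 1M tokens
BASELINE_COMPUTE_PER_TOKEN = 1.0   # V3.2 per-token compute, normalized to 1.0

models = {
    "V4-Pro":   {"flops_ratio": 0.27, "kv_ratio": 0.10},
    "V4-Flash": {"flops_ratio": 0.10, "kv_ratio": 0.07},
}

for name, r in models.items():
    kv_gb = BASELINE_KV_CACHE_GB * r["kv_ratio"]
    compute = BASELINE_COMPUTE_PER_TOKEN * r["flops_ratio"]
    print(f"{name}: ~{kv_gb:.0f} GB KV cache, {compute:.2f}x baseline compute per token")
```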

That efficiency gain matters beyond cost. Reduced KV cache means longer contexts don’t degrade as sharply under memory pressure. The attention mechanism selectively compresses older context while preserving nearby token fidelity — a tradeoff that happens to align well with practical coding workloads, where recent context (the function you’re editing, the error you just saw) matters more than distant context (the imports you wrote an hour ago).

The Real Story: Huawei Ascend

Here’s what most coverage is underselling: V4 is the first major frontier-class model optimized for Huawei Ascend chips rather than Nvidia GPUs.

During training, DeepSeek still used American hardware — that’s not a secret. But inference — the part that actually runs when you call the API or self-host the model — runs on domestic Chinese hardware. MIT Technology Review calls this “China’s first model optimized for domestic Chinese chips, such as Huawei’s Ascend.”

This matters for several reasons:

Geopolitical. U.S. export controls on Nvidia A100s and H100s have put significant pressure on Chinese AI labs’ ability to scale training runs. DeepSeek’s response has been to become brutally efficient: the MoE architecture, the KV cache compression, the per-token FLOP reduction — these aren’t just cost optimizations, they’re adaptations to a hardware-constrained environment. V4’s Ascend inference path is the next step: demonstrating that the full stack, including deployment, can run on non-Nvidia silicon.

Infrastructure. If high-quality model inference can run efficiently on Huawei Ascend processors, it decouples Chinese AI deployment from American chip supply chains in a way that training-side restrictions cannot address. The implication for enterprise AI buyers in China (and potentially other markets where Ascend hardware is more accessible than Nvidia) is significant.

For Western developers. V4 runs on Nvidia hardware too — the Hugging Face weights are standard. But the existence of a capable Ascend inference path means the model’s continued development and availability isn’t contingent on sustained access to export-controlled chips. That’s a resilience story for a model you might want to rely on.

The Pricing Comparison That Actually Matters

Let’s put the numbers on paper:

| Model | Input | Output | Open Weight? |
|---|---|---|---|
| DeepSeek V4-Flash | $0.14/M | $0.28/M | Yes (MIT) |
| DeepSeek V4-Pro | $1.74/M | $3.48/M | Yes (MIT) |
| GPT-5.4 | $2.50/M | $15.00/M | No |
| Claude Opus 4.7 | $5.00/M | $25.00/M | No |

V4-Flash is the cheapest capable small model in this class. V4-Pro is the cheapest frontier-adjacent open-weight model by a significant margin. The output token asymmetry is telling: closed Western models charge 5-6x more for output than input; DeepSeek charges 2x. For agentic workloads that generate substantial output per query, this gap compounds fast.
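
To see how the output asymmetry compounds, the sketch below prices a hypothetical agentic workload. The per-million-token rates are from the table above; the query and token volumes are illustrative numbers, not benchmarks.

```python
# Illustrative cost comparison for an output-heavy agentic workload.
# Rates ($ per million tokens) are from the pricing table; token volumes
# below are made-up assumptions for the sake of the comparison.

PRICES = {  # model: (input $/M, output $/M)
    "DeepSeek V4-Flash": (0.14, 0.28),
    "DeepSeek V4-Pro":   (1.74, 3.48),
    "GPT-5.4":           (2.50, 15.00),
    "Claude Opus 4.7":   (5.00, 25.00),
}

# Hypothetical agent run: 10,000 calls, each reading 8k tokens of context
# and writing 2k tokens of output (agents tend to be output-heavy).
QUERIES, INPUT_TOK, OUTPUT_TOK = 10_000, 8_000, 2_000

for model, (in_rate, out_rate) in PRICES.items():
    cost = QUERIES * (INPUT_TOK * in_rate + OUTPUT_TOK * out_rate) / 1_000_000
    print(f"{model:>18}: ${cost:,.2f} for the run")
```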

Is It Good for Coding?

The benchmark claim is that V4-Pro beats all open-weight rivals on coding. What does that mean in practice?

For self-hosted or API-integrated coding workflows — running code generation in CI pipelines, powering custom coding agents, building developer tooling — V4-Pro’s cost-to-capability ratio is compelling. If you’re spending $25/M output tokens on Opus 4.7 for automated code tasks and V4-Pro achieves 90%+ of the quality at $3.48/M output, the economics are hard to ignore.

For interactive coding with a product like Claude Code, the comparison is less direct. Claude Code’s Opus 4.7 integration is tightly optimized for the agentic loop — CLAUDE.md invariants, tool call efficiency, the multi-agent architecture. Swapping the backbone model requires re-evaluating the whole stack, not just comparing raw benchmark numbers.

The more interesting use case is multi-agent orchestration. If you’re running Agent Teams or parallel subagent swarms, the per-token economics of individual agent calls matter enormously. V4-Flash at $0.14/M input becomes genuinely attractive as a workhorse model for high-frequency, lower-stakes subagent tasks.
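
If you wanted to experiment with that split, a minimal router might look like the sketch below. It assumes DeepSeek's API keeps its existing OpenAI-compatible interface and uses hypothetical V4 model identifiers; the routing rule itself is just an illustration of sending cheap, high-frequency subagent work to Flash and reserving Pro for the expensive synthesis step.

```python
# Minimal sketch of tiered subagent routing by stakes and cost.
# Assumes the DeepSeek API stays OpenAI-compatible (as it has been for
# earlier releases); the model ids below are hypothetical, not confirmed.

from openai import OpenAI

deepseek = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

MODEL_BY_TIER = {
    "cheap": "deepseek-v4-flash",  # hypothetical id: high-frequency, low-stakes calls
    "smart": "deepseek-v4-pro",    # hypothetical id: planning and final-review steps
}

def run_subagent(task: str, tier: str = "cheap") -> str:
    """Send one subagent task to the model tier chosen by the caller."""
    resp = deepseek.chat.completions.create(
        model=MODEL_BY_TIER[tier],
        messages=[{"role": "user", "content": task}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

# Example: lots of cheap grunt work, one expensive synthesis pass.
summaries = [run_subagent(f"Summarize failures in test shard {i}") for i in range(8)]
plan = run_subagent("Synthesize these into a fix plan:\n" + "\n".join(summaries), tier="smart")
print(plan)
```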

What This Means for the AI Coding Landscape

DeepSeek V4 doesn’t change the competitive picture at the top of the market today. Claude Opus 4.7’s 64.3% SWE-bench Pro score, its tool error rate improvements, and its integration with Claude Code’s agentic infrastructure put it in a different category for serious agentic development workflows.

But V4 does three things the closed-model incumbents can’t:

  1. Provides a credible open-weight option near the frontier. You can run V4-Pro yourself, modify it, and build on it without an API key or vendor dependency.
  2. Sets a new price floor for frontier-adjacent performance. $1.74/M input for a model that competes with 6-month-old closed frontier models is going to pressure API pricing industry-wide.
  3. Demonstrates that non-Nvidia inference infrastructure works at this capability level. That’s a decade-long strategic implication for how AI infrastructure gets built and where.

A year ago, DeepSeek R1 made a point about training efficiency. V4 makes a different point: about deployment independence, open-weight availability, and what it costs to run near-frontier models in 2026.

