
Gemma 4: Google Just Made the Case for Running Your Coding Agent Locally

·1431 words·7 mins·

For the past two years, the case for running a capable coding agent entirely on your own hardware has been theoretical. The open-weight models were good — genuinely impressive for their size — but when it came to the things that matter most for agentic coding (tool use, sustained reasoning, large context), they were still a tier below the frontier API models. You ran local models for privacy or cost reasons, accepting a performance penalty to do so.

Google’s Gemma 4, released April 2, 2026, changes that calculus. Not completely, and not without caveats. But meaningfully.

## The Numbers That Matter

Gemma 4 ships in four variants. The two that matter most for coding agents are the 26B Mixture-of-Experts and the 31B Dense model.

The 31B Dense scores 80.0% on LiveCodeBench v6 with a Codeforces ELO of 2,150. Those are not open-weight numbers — that is competitive with frontier API models from six months ago. The 26B MoE follows closely at 77.1% LiveCodeBench and 1,718 Codeforces ELO.

For context: Gemma 3 27B, the previous generation, scored 29.1% on LiveCodeBench and a Codeforces ELO of 110. That is not a modest improvement. That is a different class of capability.

The agentic tool-use numbers are equally striking. On τ2-bench (a retail task-completion benchmark that measures how well a model actually uses tools to accomplish goals), Gemma 4 31B scores 86.4%. Gemma 3 27B scored 6.6%. The model that came before Gemma 4 was not a coding agent in any meaningful sense. Gemma 4 is.

## The License Change That Matters More Than Benchmarks

Every previous Gemma release shipped under a restrictive custom license that limited commercial use and prohibited various deployment patterns. Gemma 4 ships under Apache 2.0.

VentureBeat’s coverage put it directly: “the license change may matter more than the benchmarks.” That is not hyperbole. Apache 2.0 means Gemma 4 can be embedded in commercial products, fine-tuned and redistributed, deployed in any cloud or on-premise environment, and integrated into enterprise toolchains without legal review. The prior license required that review. Most companies never got through it.

The practical effect: Gemma 4 is now the strongest open-weight coding model you can use commercially without restriction. DeepSeek V3.2 ships under MIT. Qwen 3.5 ships under Apache 2.0. Gemma 4 now matches them on licensing while competing with them on benchmarks.

## Hardware: What You Actually Need

The four model variants have meaningfully different hardware requirements:

| Model | Minimum VRAM | Practical GPU |
| --- | --- | --- |
| E2B (2.3B) | ~3–4 GB | Any modern GPU; Raspberry Pi 5 works |
| E4B (4.5B) | ~6 GB | RTX 3060 (8GB) |
| 26B MoE | ~8 GB | RTX 3080 (10GB) or M2 Pro |
| 31B Dense | ~20 GB | RTX 3090/4090 (24GB) or M3 Max |

The 26B MoE is the practical sweet spot for coding agent use. It fits on a single consumer GPU, delivers 77.1% LiveCodeBench, supports the full 256K context window, and runs at approximately 150 tokens/second on an RTX 4090. That throughput is fast enough for agentic loops where the model is calling tools and processing results iteratively.
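The VRAM minimums above track a back-of-envelope estimate: parameter count times bytes per parameter at the chosen quantization, plus headroom for KV cache, activations, and runtime overhead. A rough sketch (the flat 4 GB headroom figure is an assumption for illustration, not a published spec):

```python
def est_vram_gb(params_billion: float, bits: int, overhead_gb: float = 4.0) -> float:
    """Approximate VRAM need: quantized weight size plus a flat
    allowance for KV cache, activations, and runtime overhead."""
    weights_gb = params_billion * bits / 8  # 1e9 params * (bits/8) bytes ~= GB
    return round(weights_gb + overhead_gb, 1)

# 31B dense at 4-bit: ~15.5 GB of weights plus headroom
print(est_vram_gb(31, 4))  # → 19.5
```

That lands close to the ~20 GB figure in the table for the 31B Dense; MoE models come in lower because only a fraction of the experts' parameters are active per token.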

For Apple Silicon: the MLX backend handles all four variants. A MacBook Pro M3 Max (128GB unified memory) runs the 26B MoE comfortably at good throughput. The 31B Dense fits on M2 Ultra or M3 Max configurations.

The E2B and E4B variants deserve a mention for a different reason: they run on edge hardware. Gemma 4 E2B has been demonstrated at 7.6 decode tokens/second on a Raspberry Pi 5, and 31 tokens/second on a Qualcomm Dragonwing IQ8 NPU. These are not coding agent numbers — they are edge deployment numbers. But the capability progression from E2B to 31B Dense under a single unified model family is notable.

## The 256K Context Window

Both the 26B MoE and 31B Dense support 256K token context. For a coding agent, that means you can feed an entire mid-sized codebase — API layer, frontend, database schema, test suite, documentation — into a single prompt. No chunking, no summarization, no retrieval-augmented lookup to find the relevant file. The model sees everything.
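A minimal sketch of what "feed the whole codebase" looks like in practice: walk the repo, concatenate source files, and stop before the estimated token count exceeds the window. The ~4 characters-per-token heuristic is a rough assumption; a real agent would count with the model's tokenizer.

```python
import os

CHARS_PER_TOKEN = 4          # rough heuristic; real tokenizers vary
CONTEXT_BUDGET = 256_000     # Gemma 4's advertised window

def pack_repo(root, exts=(".py", ".ts", ".sql", ".md")):
    """Concatenate source files into one prompt, stopping before the
    estimated token count exceeds the context budget."""
    parts, tokens = [], 0
    for dirpath, _, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            cost = len(text) // CHARS_PER_TOKEN
            if tokens + cost > CONTEXT_BUDGET:
                return "\n".join(parts)  # budget reached, stop here
            parts.append(f"=== {path} ===\n{text}")
            tokens += cost
    return "\n".join(parts)
```

With a 256K window, a mid-sized repo fits without the truncation step ever firing; with a 32K-class model, the same loop would be throwing files away almost immediately.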

This is enabled by Proportional RoPE on global attention layers, combined with an alternating attention architecture that mixes local sliding-window attention (1,024 tokens) with full global attention. The design keeps computation tractable at long context lengths without the quality degradation that typically appears when naive RoPE scaling is used.
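The local-versus-global split is easiest to see as attention masks. A toy sketch (small sizes for readability; the article's real window is 1,024 tokens, and which layers get which mask is the model's business, not shown here):

```python
import numpy as np

def causal_mask(n, window=None):
    """Boolean attention mask: True where query i may attend to key j.
    window=None gives full global causal attention; an integer gives
    sliding-window attention over the last `window` positions."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = j <= i                   # causal: never look ahead
    if window is not None:
        mask &= (i - j) < window    # local: only the recent window
    return mask

# In an alternating design, most layers use the cheap local mask and
# a few use the global one, so long-range links still exist somewhere.
local = causal_mask(8, window=3)
globl = causal_mask(8)
print(local[7].sum(), globl[7].sum())  # → 3 8
```

Local layers cost O(n·w) instead of O(n²), which is what keeps a 256K prompt tractable; the sparse global layers carry the long-range dependencies.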

## Day-One Tooling: Mostly Ready

Gemma 4 launched with support across the full stack: Ollama, LM Studio, MLX, llama.cpp, Hugging Face Transformers, vLLM, SGLang, Unsloth, NVIDIA NIM/NeMo, Keras, and JAX. It also slots into any tool that speaks the OpenAI API: serve it with llama-server (or vLLM) and point aider, continue.dev, or a similar client at the endpoint.

One caveat that matters if you are building a coding agent: tool calling is currently broken in Ollama v0.20.0. The streaming parser drops tool calls into the reasoning field rather than parsing them as structured output. A workaround exists (a community gist for OpenCode users that patches the streaming response), but this is a real bug that will affect any agent that relies on Ollama-served Gemma 4 for tool use. Track the Ollama release notes; this will likely be patched quickly, but check before you build on it.

Beyond Ollama, tool calling works correctly through llama.cpp’s server, through the Transformers pipeline interface, and through fine-tuning frameworks like Unsloth. If you are building a production coding agent on Gemma 4, skip Ollama for now and use llama.cpp’s OpenAI-compatible server instead.
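For the llama.cpp route, the agent talks to the server in the standard OpenAI chat-completions wire format. A minimal tool-calling request body, sketched below; the model name and the `run_tests` tool are illustrative assumptions, not part of any real deployment:

```python
import json

# Tool schema in the OpenAI function-calling format, which llama.cpp's
# server accepts on its /v1/chat/completions endpoint.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical agent tool
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

payload = {
    "model": "gemma-4-26b-moe",  # name depends on how you launched the server
    "messages": [{"role": "user",
                  "content": "Run the tests in ./tests and summarize failures."}],
    "tools": tools,
    "tool_choice": "auto",
}
body = json.dumps(payload)
```

POST that body to your local server's `/v1/chat/completions` endpoint; a well-behaved response carries any tool invocation in the structured `tool_calls` field, which is exactly the part the Ollama v0.20.0 streaming parser currently mangles.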

## Enable Thinking: The On-Demand Reasoning Mode

Gemma 4 ships with a chain-of-thought reasoning mode activatable at inference time via apply_chat_template(..., enable_thinking=True). When enabled, the model works through a problem step-by-step before producing output — similar to Claude’s extended thinking or the reasoning modes in newer GPT models.
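A sketch of how an agent might drive the toggle, building template kwargs per task; the heavy-task categories are an assumption for illustration, and the commented-out `apply_chat_template` call assumes you have a real Gemma 4 tokenizer loaded:

```python
messages = [
    {"role": "user",
     "content": "Why does this binary search loop forever when lo == hi?"},
]

def template_kwargs(hard_task):
    """Toggle chain-of-thought at inference time: enable it for
    debugging/refactoring/algorithm work, skip it for quick completions."""
    return {
        "add_generation_prompt": True,
        "enable_thinking": bool(hard_task),
    }

kwargs = template_kwargs(hard_task=True)
# With a real checkpoint loaded, this becomes:
# prompt = tokenizer.apply_chat_template(messages, tokenize=False, **kwargs)
```

The point of routing it through one helper: the agent decides per request whether to pay the reasoning-token latency, rather than committing to one mode globally.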

For coding tasks, this matters most for algorithm design, complex debugging, and refactoring decisions where a direct answer is less reliable than a reasoned one. The AIME 2026 score — 89.2% for 31B Dense — is the benchmark proxy for this reasoning quality.

You can turn it off for fast, simple completions and on for tasks that benefit from deeper analysis. That toggle-at-inference-time design is more practical than models that either always reason (slow, expensive) or never do (fast, sometimes wrong).

## Where Gemma 4 Does Not Lead

Honest assessment: DeepSeek V3.2 still leads on raw coding benchmark performance for large-scale code generation. For SWE-bench Verified-style task completion, GLM-5 has posted 77.8%, and DeepSeek V3.2’s numbers on general code generation continue to impress. If raw coding throughput is your only metric and you can use a cloud-hosted open model, those alternatives are worth evaluating.

Qwen 3.5 27B is the nearest competitor to Gemma 4 26B MoE on the specs that matter: similar coding benchmark performance, Apache 2.0, comparable context length. For pure coding tasks, Qwen3 Coder Next (80B MoE, 3B active parameters) is specifically tuned for coding agents and delivers strong results — though serving an 80B MoE locally adds infrastructure complexity that offsets some of the “local” advantage.

The distinction worth making: Gemma 4 26B MoE is the strongest option for private, local, single-GPU deployment of a coding agent. If your threat model requires that no code leaves your machine, or your cost model requires zero API spend, Gemma 4 is now the answer to reach for.

## What This Means

The practical conclusion is straightforward. Teams that have been deferring the “run AI locally” decision because local models were not good enough for real agentic work now have a concrete option. A developer with an RTX 4090 or a MacBook Pro M3 Max can run a coding agent that scores within striking distance of the frontier cloud models from a year ago — with Apache 2.0 licensing, 256K context, and production-grade tool use — at zero marginal cost per token.

That changes the economics for privacy-sensitive codebases, air-gapped environments, teams in jurisdictions with strict data residency requirements, and individual developers who want capable AI assistance without the API bill.

The tool-calling Ollama bug is temporary. The Apache 2.0 license and the benchmark numbers are not.

