What makes North Mini Code different from general-purpose coding models?

North Mini Code is trained across multiple scaffolds (agent harnesses) and with reinforcement learning with verifiable rewards (RLVR) targeting software engineering and terminal tasks. The goal is robustness across different agentic workflows rather than optimization for a single harness.

How does North Mini Code's efficiency compare to dense models of similar capability?

With 3B active parameters out of 30B total, North Mini Code activates only 10% of its weights per token, which reduces compute per token. On Artificial Analysis’ Coding Index it still scores ahead of several larger models, including some with roughly 120B total parameters.

Can developers use North Mini Code commercially?

Yes, North Mini Code is released under the Apache 2.0 license, permitting commercial use, modification, and distribution with minimal restrictions.

Cohere Releases North Mini Code, a 30B-Parameter MoE Model for Agentic Software Engineering

Cohere has introduced North Mini Code, the first model in its new family of coding-specialized models, with a 30B-parameter sparse Mixture-of-Experts (MoE) architecture that activates only 3B parameters per token. Available on Hugging Face under the Apache 2.0 license, the model targets agentic software engineering workflows and terminal-based coding tasks.

North Mini Code’s Architecture and Efficiency

According to the Hugging Face Blog, North Mini Code uses a decoder-only Transformer with 128 experts, of which 8 activate per token. The model interleaves sliding-window self-attention (three-quarters of layers) with full global attention (one-quarter), paired with a SwiGLU feed-forward block. This sparse design enables efficient inference: by activating only 10% of total parameters per token, the model needs less compute per token than a dense model of the same total size, while keeping the capacity of the full 30B parameters.
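The ratios above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the article; the layer count (`NUM_LAYERS`) and the position of the global-attention layer within each group of four are hypothetical, since the article gives only the three-to-one proportion.

```python
# Figures quoted in the article.
TOTAL_PARAMS_B = 30       # total parameters, in billions
ACTIVE_PARAMS_B = 3       # parameters active per token, in billions
NUM_EXPERTS = 128
EXPERTS_PER_TOKEN = 8

# Hypothetical: the article does not state the model depth.
NUM_LAYERS = 16


def attention_layout(num_layers, period=4):
    """Label each layer; every `period`-th layer is global, the rest sliding-window.

    Placing the global layer last in each group is an assumption for illustration.
    """
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


layout = attention_layout(NUM_LAYERS)
active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_fraction = EXPERTS_PER_TOKEN / NUM_EXPERTS
sliding_fraction = layout.count("sliding") / len(layout)

print(f"active parameters: {active_param_fraction:.1%}")      # 10.0%
print(f"active experts: {active_expert_fraction:.2%}")        # 6.25%
print(f"sliding-window layers: {sliding_fraction:.0%}")       # 75%
```

Only 6.25% of the experts fire per token, while the active-parameter share is about 10%. A plausible reading is that attention, embeddings, and the dense layer are always active and make up the difference, though the article does not break the numbers down.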

The architecture also employs a sigmoid-gated router applied before top-k expert selection, which sets it apart from the softmax gating used in many earlier MoE routers. A single dense layer precedes the sparse layers, a choice described as stabilizing training and routing decisions.
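A minimal sketch of sigmoid-gated top-k routing for a single token follows, in plain Python. It is an illustration of the general technique, not Cohere's implementation: the function name `route_token`, the random logits, and the renormalization of the selected gates are all assumptions, since the article says only that a sigmoid gate is applied before top-k selection.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_token(router_logits, top_k=8):
    """Return (expert_index, weight) pairs for one token.

    Each logit passes through an independent sigmoid gate, then the
    top_k experts with the highest gate values are selected.
    """
    gates = [sigmoid(z) for z in router_logits]
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected gates so the weights sum to 1.
    total = sum(gates[i] for i in chosen)
    return [(i, gates[i] / total) for i in chosen]


random.seed(0)
# Hypothetical router output for one token over 128 experts.
logits = [random.gauss(0.0, 1.0) for _ in range(128)]
assignment = route_token(logits, top_k=8)

for expert, weight in assignment:
    print(f"expert {expert:3d}  weight {weight:.3f}")
```

Because the sigmoid is monotonic, the set of selected experts is the same one a softmax router would pick from these logits. The difference lies in the gate values: each expert is scored independently instead of competing for a share of a probability mass that sums to one.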

Coding Benchmark Performance

North Mini Code achieves a score of 33.4 on Artificial Analysis’ Coding Index, the Hugging Face Blog reports. This score places it ahead of Qwen 3.5 (35B-A3B), Gemma 4 (26B-A4B), and Devstral Small 2 (24B Dense), all models of comparable total parameter count. Notably, it also outperforms substantially larger models including Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B), suggesting that a sparse model with agentic-focused training can outscore models roughly four times its total size on this index.

Training Strategy for Agent Robustness

Rather than optimizing for a single agent harness, Cohere trained North Mini Code across multiple scaffolds. According to the Hugging Face Blog, the model uses a three-stage post-training pipeline: two phases of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The RLVR phase specifically targets software engineering and terminal-based agentic tasks, so that the model can serve as a foundation for agent frameworks like OpenCode.

This multi-scaffold approach prioritizes robustness, which matters for agents that must adapt to different execution environments and task structures rather than perform well only on static benchmarks.
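The "verifiable" part of RLVR means the reward comes from a programmatic check rather than from a learned judge. The toy sketch below shows the idea for a coding task: a candidate solution earns a reward of 1.0 only if it passes every test case. The function `verifiable_reward`, the `solve` entry point, and the test cases are all hypothetical; the article does not describe Cohere's actual reward functions or environments.

```python
def verifiable_reward(candidate_source, test_cases, func_name="solve"):
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    namespace = {}
    try:
        # Toy only: a real pipeline would run untrusted code in a sandbox.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all score zero.
        return 0.0
    return 1.0


tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a - b\n"
broken = "def solve(a, b) return a + b"

reward_good = verifiable_reward(good, tests)
reward_bad = verifiable_reward(bad, tests)
reward_broken = verifiable_reward(broken, tests)
print(reward_good, reward_bad, reward_broken)  # 1.0 0.0 0.0
```

A binary pass/fail signal like this is hard to game with plausible-looking but incorrect code, which is one reason test-based rewards suit software engineering and terminal tasks.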

Why This Matters

The release of North Mini Code signals Cohere’s commitment to the agentic coding segment, where models must balance inference cost and latency (a sparse MoE activates fewer parameters per token) with reliability across heterogeneous agent architectures. Teams deploying coding agents will likely benefit from a model explicitly trained with verifiable rewards and on multiple harnesses, which should narrow the gap between benchmark performance and production reliability. Because only 10% of its parameters are active per token, North Mini Code is a compelling option for cost-constrained deployments, while the Apache 2.0 license removes licensing friction for enterprises evaluating open-weights alternatives to proprietary coding models.