Ted Hisokawa
Apr 12, 2026 01:37
MiniMax releases M2.7, a 230B-parameter mixture-of-experts model optimized for NVIDIA GPUs with up to 2.7x throughput gains on Blackwell hardware.
MiniMax has released M2.7, a 230-billion-parameter open-weights AI model designed specifically for autonomous agent workflows, now available across NVIDIA’s inference ecosystem, including the company’s latest Blackwell Ultra GPUs.
The model represents a significant efficiency play in enterprise AI. Despite its 230B total parameters, M2.7 activates only 10B parameters per token (a 4.3% activation rate), achieved through a mixture-of-experts (MoE) architecture with 256 local experts. This keeps inference costs manageable while preserving the reasoning capacity of a much larger dense model.
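The arithmetic behind that efficiency claim is straightforward. A minimal sketch using the figures from the article (the FLOP proxy is a standard rough estimate, not MiniMax's published methodology):

```python
# Sparse-activation arithmetic for M2.7, using figures from the article.
TOTAL_PARAMS_B = 230.0   # total parameters, in billions
ACTIVE_PARAMS_B = 10.0   # parameters activated per token, in billions

activation_rate = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Activation rate: {activation_rate:.1%}")  # 4.3%

# Rough per-token compute proxy: a dense model does ~2 * params FLOPs per
# token at inference; an MoE only pays for the active slice.
dense_flops_b = 2 * TOTAL_PARAMS_B
moe_flops_b = 2 * ACTIVE_PARAMS_B
print(f"Per-token compute vs. dense: {moe_flops_b / dense_flops_b:.1%}")
```

In other words, each token costs roughly what a 10B dense model would, while the full 230B parameter pool is available for the router to draw on.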
Performance Numbers on Blackwell
NVIDIA collaborated with open-source communities to optimize M2.7 for production workloads. Two key optimizations—a fused QK RMS Norm kernel and FP8 MoE integration from TensorRT-LLM—delivered substantial throughput improvements on Blackwell Ultra GPUs.
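The article does not detail the fused kernel itself, but the math it computes is standard: RMS-normalizing the query and key projections before attention. A minimal numpy sketch of the unfused reference (shapes and variable names are illustrative):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Normalize by the root-mean-square over the last dimension, then scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

head_dim = 128
q = np.random.randn(4, head_dim)  # (tokens, head_dim)
k = np.random.randn(4, head_dim)
gamma_q = np.ones(head_dim)       # learned scale vectors in the real model
gamma_k = np.ones(head_dim)

# Unfused reference: two separate normalization passes over Q and K.
# A fused kernel performs both in one launch, avoiding extra round-trips
# of Q and K through GPU memory.
q_normed = rms_norm(q, gamma_q)
k_normed = rms_norm(k, gamma_k)
```

The win from fusion is bandwidth, not math: Q and K are read and written once instead of twice, which matters at Blackwell-scale throughput.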
Testing with a 1K/1K input/output sequence length dataset showed vLLM achieved up to 2.5x throughput improvement, while SGLang hit 2.7x gains. Both optimizations were implemented within a single month, suggesting further performance headroom exists.
Technical Architecture
M2.7 supports 200K input context length across 62 layers, using multi-head causal self-attention with Rotary Position Embeddings (RoPE). A top-k expert routing mechanism activates only 8 of the 256 experts for any given input, which is how the model maintains low inference costs despite its scale.
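The top-k routing step described above can be sketched in a few lines. MiniMax has not published its exact router details in this article (e.g. whether the softmax is taken before or after the top-k cut), so this follows the common top-k-then-softmax pattern:

```python
import numpy as np

def route_top_k(logits, k=8):
    """Select the top-k experts for one token and renormalize their gate
    weights with a softmax over just those k scores."""
    top = np.argpartition(logits, -k)[-k:]          # indices of the k highest scores
    gate = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    gate /= gate.sum()
    return top, gate

rng = np.random.default_rng(0)
router_logits = rng.standard_normal(256)  # one score per expert (256 in M2.7)
experts, weights = route_top_k(router_logits, k=8)

print(len(experts))             # 8 experts activated
print(round(weights.sum(), 6))  # gate weights sum to 1.0
```

Only the 8 selected experts' feed-forward weights are touched for this token; the other 248 experts cost nothing, which is where the 10B-active / 230B-total split comes from.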
The architecture targets coding challenges and complex agentic tasks—workflows where AI systems need to plan, execute, and iterate autonomously rather than respond to single prompts.
Deployment Options
Developers can access M2.7 through multiple channels. NVIDIA’s NemoClaw reference stack provides one-click deployment for running autonomous agents with the OpenShell runtime. The model is also available through NVIDIA NIM containerized microservices for on-premise, cloud, or hybrid deployments.
For teams wanting to customize the model, NVIDIA’s NeMo AutoModel library supports fine-tuning with published recipes. Reinforcement learning workflows are available through NeMo RL with sample configurations for 8K and 16K sequence lengths.
Free GPU-accelerated endpoints on build.nvidia.com allow testing before committing to infrastructure. The open weights are also available on Hugging Face for self-hosted deployments.
The release positions MiniMax as a credible alternative to closed models from OpenAI and Anthropic for enterprises building autonomous AI systems, particularly those already invested in NVIDIA infrastructure.
Image source: Shutterstock