The launch of Granite 4.0 initiates a new era for IBM’s family of enterprise-ready large language models, leveraging novel architectural advancements to double down on small, efficient language models that provide competitive performance at reduced costs and latency. The Granite 4.0 models were developed with a particular emphasis on essential tasks for agentic workflows, both in standalone deployments and as cost-efficient building blocks in complex systems alongside larger reasoning models.
The Granite 4.0 collection comprises multiple model sizes and architecture styles to provide optimal performance across a wide array of hardware constraints, including:
- Granite-4.0-H-Small, a hybrid mixture of experts (MoE) model with 32B total parameters (9B active)
- Granite-4.0-H-Tiny, a hybrid MoE model with 7B total parameters (1B active)
- Granite-4.0-H-Micro, a dense hybrid model with 3B parameters

This release also includes Granite-4.0-Micro, a 3B dense model with a conventional attention-driven transformer architecture, to accommodate platforms and communities that do not yet support hybrid architectures.
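As a minimal sketch of getting started, the models can be run through the standard Hugging Face transformers generation API. The repo id below assumes the checkpoints follow an ibm-granite/granite-4.0-* naming pattern on Hugging Face; check the model cards for the exact ids.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id following the ibm-granite/granite-4.0-* naming pattern.
model_id = "ibm-granite/granite-4.0-h-micro"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt and generate a short completion.
messages = [{"role": "user", "content": "Summarize our Q3 support tickets in three bullets."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```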
Granite-4.0-H-Small is a workhorse model for strong, cost-effective performance on enterprise workflows like multi-tool agents and customer support automation. The Tiny and Micro models are designed for low-latency, edge, and local applications, and can also serve as building blocks within larger agentic workflows for fast execution of key tasks such as function calling.
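To illustrate the function-calling pattern, the sketch below passes a tool definition through the transformers chat template's tools parameter, so the tool schema is rendered into the prompt and the model can respond with a structured tool call. The get_order_status function and repo id are hypothetical, and the exact format of the emitted tool call is model-specific; consult the model card's chat template documentation for the authoritative format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-tiny"  # assumed repo id, as above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical tool: transformers builds a JSON schema for it from the
# type hints and Google-style docstring when passed via tools=[...].
def get_order_status(order_id: str) -> str:
    """Look up the shipping status of a customer order.

    Args:
        order_id: The unique identifier of the order.
    """
    ...

messages = [{"role": "user", "content": "Where is order 8813?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_order_status],  # schema is rendered into the prompt
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
# Expect a structured tool call (e.g., JSON) naming get_order_status.
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```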
Granite 4.0 benchmark performance shows substantial improvements over prior generations—even the smallest Granite 4.0 models significantly outperform Granite 3.3 8B, despite being less than half its size—but their most notable strength is a remarkable increase in inference efficiency. Relative to conventional LLMs, our hybrid Granite 4.0 models require significantly less RAM to run, especially for tasks involving long context lengths (like ingesting a large codebase or extensive documentation) and multiple concurrent sessions (like a customer service agent handling many detailed user inquiries at once).
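To see where conventional attention's memory goes, consider the key-value (KV) cache, which grows linearly with both context length and the number of concurrent sessions. The back-of-envelope calculation below uses illustrative parameter values, not Granite 4.0 specifications:

```python
# Rough KV-cache size for a conventional attention-based transformer.
# All parameter values here are illustrative assumptions, not Granite 4.0 specs.
def kv_cache_gib(layers, kv_heads, head_dim, context_len, sessions, bytes_per_val=2):
    # Per token, each layer stores one key and one value vector per KV head.
    per_token = layers * 2 * kv_heads * head_dim * bytes_per_val
    return per_token * context_len * sessions / 1024**3

# One session at a modest 8K context vs. 16 concurrent sessions at 128K context.
print(f"{kv_cache_gib(32, 8, 128, 8_192, 1):.1f} GiB")     # 1.0 GiB
print(f"{kv_cache_gib(32, 8, 128, 131_072, 16):.1f} GiB")  # 256.0 GiB
```

In broad terms, the hybrid layers in the H-series models sidestep much of this growth by maintaining a fixed-size state regardless of context length, which is why the memory savings are most pronounced for long contexts and many concurrent sessions.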
Most importantly, this dramatic reduction in Granite 4.0’s memory requirements entails a similarly dramatic reduction in the cost of hardware needed to run heavy workloads at high inference speeds. Our aim is to lower barriers to entry by providing enterprises and open-source developers alike with cost-effective access to highly competitive LLMs.