
(lafoto/Shutterstock)
The AI revolution has created huge demand for processing power to train frontier models, a demand Nvidia is filling with its high-end GPUs. But the sudden shift to AI inference and agentic AI in 2025 is exposing gaps in the memory pipeline, gaps d-Matrix hopes to address with the innovative 3D stacked digital in-memory compute (3DIMC) architecture it showed off at Hot Chips this week.
Even before the launch of ChatGPT ignited the AI revolution in late 2022, the folks at d-Matrix had identified an unfilled need for bigger and faster memory to serve large language models (LLMs). d-Matrix CEO and co-founder Sid Sheth was predicting that the promising LLMs from OpenAI and Google, which were already turning heads in the AI world and beyond, would drive a surge in AI inference workloads.
“We think this is going to be around for a long time,” Sheth told BigDATAwire in April 2022 about the transformative potential of LLMs. “We think people will essentially kind of gravitate around transformers for the next five to 10 years, and that is going to be the workhorse workload for AI compute for the next five to 10 years.”
Not only did Sheth correctly predict the transformative impact of the transformer model, but he also foresaw that it would eventually result in a surge in AI inference workloads. That presented a business opportunity for Sheth and d-Matrix. The problem was that the GPU-based high-performance computing architectures that worked well for training ever-bigger LLMs and frontier models were not ideal for running AI inference workloads. In fact, d-Matrix had identified that the problem extended all the way down to DRAM, which could not move data efficiently at the speeds needed to support the looming AI inference workloads.

Memory growth lags compute growth (Source: d-Matrix)
d-Matrix’s solution was to focus on innovation at the memory layer. While DRAM could not keep up with AI inference demands, a faster and more expensive form of memory called SRAM, or static random access memory, was up to the task.
d-Matrix utilized digital in-memory compute (DIMC) technology that fuses processing directly into SRAM modules. Its Nighthawk architecture embedded DIMC chiplets directly on SRAM cards that plug into the PCIe bus, while its Jayhawk architecture provided die-to-die connectivity for scale-out processing. Both of these architectures were incorporated into the company’s flagship offering, dubbed Corsair, which today utilizes the PCIe Gen5 form factor and features ultra-high memory bandwidth of 150 TB/s.
Fast forward to 2025, and many of Sheth’s predictions have come to pass. We are firmly in the midst of a big shift from AI training to AI inference, with agentic AI poised to drive huge investments in the years to come. d-Matrix has kept pace with the needs of emerging AI workloads, and this week announced that its next-generation Pavehawk architecture, which uses three-dimensionally stacked DIMC technology (3DIMC), is now working in the lab.
Sheth is confident that 3DIMC will provide the performance boost to help AI inference get past the memory wall.
“AI inference is bottlenecked by memory, not just FLOPs. Models are growing fast and traditional HBM memory systems are getting very costly, power hungry and bandwidth limited,” Sheth wrote in a LinkedIn blog post. “3DIMC changes the game. By stacking memory in three dimensions and bringing it into tighter integration with compute, we dramatically reduce latency, improve bandwidth, and unlock new efficiency gains.”

d-Matrix’s new Pavehawk architecture supports 3DIMC technology (Image source: d-Matrix)
The memory wall has been looming for years, and is due to a mismatch in the advances of memory and processor technologies. “Industry benchmarks show that compute performance has grown roughly 3x every two years, while memory bandwidth has lagged at just 1.6x,” d-Matrix Founder and CTO Sudeep Bhoja shared in a blog post this week. “The result is a widening gap where pricey processors sit idle, waiting for data to arrive.”
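The compounding effect of that mismatch is easy to see with a back-of-the-envelope calculation. This sketch simply compounds the growth rates Bhoja cites (roughly 3x compute and 1.6x memory bandwidth every two years) to show how quickly the gap widens; the rates are the article's figures, and smooth compounding is an assumption for illustration.

```python
# Illustrate the "memory wall": compound the cited growth rates
# (compute ~3x per two years, memory bandwidth ~1.6x per two years)
# and watch the compute-to-bandwidth gap widen over a decade.

COMPUTE_GROWTH_PER_2YR = 3.0
BANDWIDTH_GROWTH_PER_2YR = 1.6

for years in range(0, 12, 2):
    periods = years / 2
    compute = COMPUTE_GROWTH_PER_2YR ** periods
    bandwidth = BANDWIDTH_GROWTH_PER_2YR ** periods
    gap = compute / bandwidth
    print(f"after {years:2d} years: compute {compute:6.1f}x, "
          f"bandwidth {bandwidth:5.1f}x, gap {gap:5.1f}x")
```

At these rates the gap grows by a factor of 3.0/1.6 = 1.875 every two years, so after a decade compute has pulled ahead of memory bandwidth by more than 20x, which is why processors increasingly sit idle waiting for data.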
While it won’t completely close the gap with the latest GPUs, 3DIMC technology promises to substantially narrow it, Bhoja wrote. As Pavehawk comes to market, the company is already developing Raptor, the next generation of its in-memory processing architecture, which also utilizes 3DIMC.
“Raptor…will incorporate 3DIMC into its design–benefiting from what we and our customers learn from testing on Pavehawk,” Bhoja wrote. “By stacking memory vertically and integrating tightly with compute chiplets, Raptor promises to break through the memory wall and unlock entirely new levels of performance and TCO.”
How much better? According to Bhoja, d-Matrix is hoping for 10x better memory bandwidth and 10x better energy efficiency when running AI inference workloads with 3DIMC compared to HBM4.
“These are not incremental gains–they are step-function improvements that redefine what’s possible for inference at scale,” Bhoja wrote. “By putting memory requirements at the center of our design–from Corsair to Raptor and beyond–we are ensuring that inference is faster, more affordable, and sustainable at scale.”
This article first appeared on our sister publication, BigDATAwire.