Breaking The Networking Wall In AI Infrastructure - Microsoft Research


Memory and network bottlenecks are increasingly limiting AI system performance by reducing GPU utilization and overall efficiency, ultimately preventing infrastructure from reaching its full potential despite enormous investments. At the core of this challenge is a fundamental trade-off in the communication technologies used for memory and network interconnects.

Datacenters typically deploy two types of physical cables for communication between GPUs. Traditional copper links are power-efficient and reliable, but limited to very short distances (< 2 meters) that restrict their use to within a single GPU rack. Optical fiber links can reach tens of meters, but they consume far more power and fail up to 100 times as often as copper. A team working across Microsoft aims to resolve this trade-off by developing MOSAIC, a novel optical link technology that can provide low power and cost, high reliability, and long reach (up to 50 meters) simultaneously. This approach leverages a hardware-system co-design and adopts a wide-and-slow design with hundreds of parallel low-speed channels using microLEDs.

The fundamental trade-off among power, reliability, and reach stems from the narrow-and-fast architecture deployed in today’s copper and optical links, comprising a few channels operating at very high data rates. For example, an 800 Gbps link consists of eight 100 Gbps channels. With copper links, higher channel speeds lead to greater signal integrity challenges, which limits their reach. With optical links, high-speed transmission is inherently inefficient, requiring power-hungry laser drivers and complex electronics to compensate for transmission impairments. These challenges grow as speeds increase with every generation of networks. Transmitting at high speeds also pushes the limits of optical components, reducing system margins and increasing failure rates.

These limitations force system designers to make unpleasant choices, limiting the scalability of AI infrastructure. For example, scale-up networks connecting AI accelerators at multi-Tbps bandwidth typically must rely on copper links to meet the power budget, requiring ultra-dense racks that consume hundreds of kilowatts per rack. This creates significant challenges in cooling and mechanical design, which constrain the practical scale of these networks and their end-to-end performance. This imbalance ultimately erects a networking wall akin to the memory wall, in which CPU speeds have outstripped memory speeds, creating performance bottlenecks.

A technology offering copper-like power efficiency and reliability over long distances can overcome this networking wall, enabling multi-rack scale-up domains and unlocking new architectures. This is a highly active R&D area, with many candidate technologies currently being developed across the industry. In our recent paper, “MOSAIC: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs”, which received a Best Paper award at ACM SIGCOMM, we present one such promising approach that is the result of a multi-year collaboration between Microsoft Research, Azure, and M365. This work is centered around an optical wide-and-slow architecture, shifting from a small number of high-speed serial channels towards hundreds of parallel low-speed channels. This would be impractical to realize with today’s copper and optical technologies because of i) electromagnetic interference challenges in high-density copper cables and ii) the high cost and power consumption of lasers in optical links, as well as the increase in packaging complexity. MOSAIC overcomes these issues by leveraging directly modulated microLEDs, a technology originally developed for screen displays.

MicroLEDs are significantly smaller than traditional LEDs (ranging from a few to tens of microns) and, due to their small size, they can be modulated at several Gbps. They are manufactured in large arrays, with over half a million in a small physical footprint for high-resolution displays like head-mounted devices or smartwatches. For example, assuming 2 Gbps per microLED channel, an 800 Gbps MOSAIC link can be realized with a 20×20 microLED array (400 channels), which can fit on a silicon die smaller than 1 mm × 1 mm.
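The array sizing in this example is simple arithmetic, and a minimal sketch makes it explicit. The function name and the choice of a square array are illustrative assumptions, not details taken from the paper.

```python
import math

def square_array_side(link_gbps: int, channel_gbps: int) -> int:
    """Smallest N such that an N x N array has enough channels for the link.

    Assumes one microLED per channel and a square layout (both assumptions
    made for illustration only).
    """
    channels = math.ceil(link_gbps / channel_gbps)
    side = math.isqrt(channels)
    if side * side < channels:
        side += 1  # round up to the next square that fits every channel
    return side

channels = math.ceil(800 / 2)        # 800 Gbps link, 2 Gbps per microLED
side = square_array_side(800, 2)
print(channels, side)                # prints: 400 20
```

For comparison, the narrow-and-fast equivalent of the same link is eight channels at 100 Gbps each.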

MOSAIC’s wide-and-slow design provides four core benefits.

Operating at low speed improves power efficiency by eliminating the need for complex electronics and reducing optical power requirements.

By leveraging optical transmission (via microLEDs), MOSAIC sidesteps copper’s reach issues, supporting distances of up to 50 meters, more than 10x farther than copper.

MicroLEDs’ simpler structure and temperature insensitivity make them more reliable than lasers. The parallel nature of wide-and-slow also makes it easy to add redundant channels, further increasing reliability to up to two orders of magnitude higher than that of today’s laser-based optical links.

The approach is also scalable, as higher aggregate speeds (e.g., 1.6 Tbps or 3.2 Tbps) can be achieved by increasing the number of channels and/or raising per-channel speed (e.g., to 4-8 Gbps).
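The scaling and redundancy points above can be sketched numerically. The snippet below is a rough illustration under stated assumptions: the function names, the number of spare channels, the per-channel survival probability, and the independent-failure model are all hypothetical and do not come from the paper.

```python
import math

def channels_needed(link_gbps: int, channel_gbps: int, spares: int = 0) -> int:
    """Data channels required for the aggregate rate, plus optional spares."""
    return math.ceil(link_gbps / channel_gbps) + spares

def link_survival(n_total: int, n_required: int, p_channel_ok: float) -> float:
    """P(at least n_required of n_total channels work).

    Assumes channels fail independently with the same probability,
    a simplification made only for this sketch.
    """
    p = p_channel_ok
    return sum(
        math.comb(n_total, k) * p**k * (1 - p) ** (n_total - k)
        for k in range(n_required, n_total + 1)
    )

# Channel counts for the aggregate rates and per-channel speeds in the text.
for link in (800, 1600, 3200):
    for per_channel in (2, 4, 8):
        print(link, per_channel, channels_needed(link, per_channel))

# Hypothetical numbers: 400 required channels, 0.999 survival per channel.
no_spares = link_survival(400, 400, 0.999)
four_spares = link_survival(404, 400, 0.999)
print(no_spares < four_spares)  # spares raise the chance the link stays up
```

The design point this illustrates is that a wide link can tolerate a few failed channels at the cost of a small percentage of extra microLEDs, whereas losing one of eight fast channels removes an eighth of the bandwidth.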

Further, MOSAIC is fully compatible with the form factor of today’s pluggable transceivers and provides a drop-in replacement for today’s copper and optical cables, without requiring any changes to existing server and network infrastructure. MOSAIC is protocol-agnostic: it simply relays bits from one endpoint to another without terminating or inspecting the connection, so it is fully compatible with today’s protocols (e.g., Ethernet, PCIe, CXL). We are currently working with our suppliers to productize this technology and scale to mass production.

While conceptually simple, realizing this architecture posed a few key challenges across the stack, which required a multi-disciplinary team with expertise spanning integrated photonics, lens design, optical transmission, and analog and digital design. For example, using individual fibers per channel would be prohibitively complex and costly due to the large number of channels. We addressed this by employing imaging fibers, which are typically used for medical applications (e.g., endoscopy). They can support thousands of cores per fiber, enabling multiplexing of many channels within a single fiber. Also, microLEDs are a less pure light source than lasers, with a larger beam shape (which complicates fiber coupling) and a broader spectrum (which degrades fiber transmission due to chromatic dispersion). We tackled these issues through a novel microLED and optical lens design, and a power-efficient analog-only electronic back end, which does not require any expensive digital signal processing.

Based on our current estimates, this approach can save up to 68% of power, i.e., more than 10 W per cable, while reducing failure rates by up to 100x. With global annual shipments of optical cables reaching into the tens of millions, this translates to over 100 MW of power savings per year, enough to power more than 300,000 homes. While these immediate gains are already significant, the unique combination of low power consumption, reduced cost, high reliability, and long reach opens up exciting new opportunities to rethink AI infrastructure, from network and cluster architectures to compute and memory designs.
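The fleet-level figure follows from the per-cable saving multiplied by shipment volume. The back-of-the-envelope sketch below assumes 10 million cables per year, the low end of "tens of millions"; that volume, the variable names, and the implied baseline are illustrative assumptions rather than numbers from the paper.

```python
SAVINGS_W_PER_CABLE = 10        # "more than 10 W per cable" (lower bound from the text)
SAVINGS_FRACTION = 0.68         # "up to 68%" (from the text)
CABLES_PER_YEAR = 10_000_000    # assumption: low end of "tens of millions"

def fleet_savings_mw(cables: int, watts_per_cable: float) -> float:
    """Total power saved across a year's shipments, in megawatts."""
    return cables * watts_per_cable / 1e6

total_mw = fleet_savings_mw(CABLES_PER_YEAR, SAVINGS_W_PER_CABLE)
print(total_mw)  # prints: 100.0

# If 10 W corresponded to the full 68% saving, the baseline cable power
# would be roughly 10 / 0.68 W. This is an inference, not a quoted figure.
implied_baseline_w = SAVINGS_W_PER_CABLE / SAVINGS_FRACTION
print(round(implied_baseline_w, 1))  # prints: 14.7
```

Because both inputs are lower bounds, higher shipment volumes or per-cable savings would push the total above 100 MW.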

For example, by supporting low-power, high-bandwidth connectivity at long reach, MOSAIC removes the need for ultra-dense racks and enables novel network topologies that would be impractical today. The resulting redesign could reduce resource fragmentation and simplify collective optimization. Similarly, on the compute front, the ability to connect silicon dies at low power over long distances could enable resource disaggregation, shifting from today’s large, multi-die packages to smaller, more cost-effective ones. Bypassing packaging area constraints would also make it possible to drastically increase GPU memory capacity and bandwidth, while facilitating adoption of novel memory technologies.

Historically, step changes in network technology have unlocked entirely new classes of applications and workloads. While our SIGCOMM paper provides possible future directions, we hope this work sparks broader discussion and collaboration across the research and industry communities.
