A Virtual Machine For Arbitrary Low-Precision GPGPU Computation In LLM Serving

arXiv:2504.12984v1 (cross-listed)
Abstract: Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computations. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels on their supported types. Compared to existing compilers like Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, our VM achieves performance improvements of 1.75x, 2.61x, 1.29x and 1.03x, respectively.
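To make the "arbitrary bit widths" point concrete, here is a minimal sketch of what storing weights at a width that is not a power of two (for example 3 or 5 bits) involves: values straddle byte boundaries, so packing and unpacking need explicit shifts and masks. This is only an illustration in plain Python, not the paper's VM, its layout system, or its generated GPU code; the function names `pack_bits` and `unpack_bits` and the little-endian bit order are assumptions made for this example.

```python
def pack_bits(values, bit_width):
    """Pack unsigned ints of `bit_width` bits each into bytes.

    Hypothetical helper for illustration; bits are laid out little-endian,
    so the first value occupies the lowest bits of the first byte.
    """
    acc = 0      # bit accumulator holding not-yet-emitted bits
    nbits = 0    # number of valid bits currently in `acc`
    out = bytearray()
    for v in values:
        if not 0 <= v < (1 << bit_width):
            raise ValueError(f"{v} does not fit in {bit_width} bits")
        acc |= v << nbits
        nbits += bit_width
        while nbits >= 8:          # flush every completed byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # flush the final partial byte, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_bits(data, bit_width, count):
    """Recover `count` unsigned ints of `bit_width` bits from packed bytes."""
    mask = (1 << bit_width) - 1
    acc = 0
    nbits = 0
    pos = 0
    out = []
    for _ in range(count):
        while nbits < bit_width:   # refill until a whole value is available
            if pos >= len(data):
                raise ValueError("packed data is too short")
            acc |= data[pos] << nbits
            pos += 1
            nbits += 8
        out.append(acc & mask)
        acc >>= bit_width
        nbits -= bit_width
    return out


weights = [5, 0, 7, 3, 1, 6, 2, 4]        # eight 3-bit values
packed = pack_bits(weights, 3)             # 24 bits, stored in 3 bytes
restored = unpack_bits(packed, 3, len(weights))
```

Eight 3-bit values occupy 3 bytes instead of the 8 bytes an 8-bit encoding would need, which is the kind of memory-bandwidth saving the abstract refers to. On a GPU the same shift-and-mask work has to be arranged across registers and memory transactions, which is where the abstract says fine-grained register management and optimized memory access patterns become essential.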
