
Huawei's new open source technique shrinks LLMs to make them run on less powerful, less expensive hardware

By Advanced AI Editor | October 6, 2025 | 5 Mins Read

Huawei’s Computing Systems Lab in Zurich has introduced a new open-source quantization method for large language models (LLMs) aimed at reducing memory demands without sacrificing output quality.

The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has released the code on GitHub and Hugging Face under a permissive, enterprise-friendly Apache 2.0 license, allowing organizations to use, modify, and deploy it commercially, free of charge.

Across models of different sizes, SINQ cuts memory usage by 60–70%, depending on architecture and bit-width.

This means models that previously required more than 60 GB of memory can run on roughly 20 GB setups, a critical enabler for serving large models on a single high-end GPU or on multi-GPU consumer-grade hardware.
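As a rough back-of-the-envelope illustration (our own arithmetic, not a figure from the paper), weight memory scales linearly with bits per parameter, so a hypothetical model in the 30-billion-parameter range drops from roughly 64 GB of weights at 16-bit precision to roughly 16 GB at 4-bit, before counting activations, the KV cache, and quantization metadata:

# Illustrative weight-memory estimate; real deployments also need memory
# for activations, the KV cache, and per-group scale factors.
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1e9

params = 32e9  # hypothetical ~32B-parameter model
print(f"16-bit: {weight_memory_gb(params, 16):.0f} GB")  # ~64 GB
print(f"4-bit:  {weight_memory_gb(params, 4):.0f} GB")   # ~16 GB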

In concrete terms, models that previously needed high-end enterprise GPUs can run on significantly more affordable hardware, such as a single NVIDIA GeForce RTX 4090 (around $1,600) instead of an A100 80GB (around $19,000) or an H100 that can exceed $30,000.

For teams using cloud infrastructure, the savings are similarly tangible. A100-based instances often cost $3–4.50 per hour, while 24 GB GPUs like the RTX 4090 are available on many platforms for $1–1.50 per hour.

Over time, especially for extended inference workloads, this difference can add up to thousands of dollars in cost reductions, while also unlocking LLM deployment on smaller clusters, local workstations, or consumer-grade setups previously constrained by memory.

Tackling the Memory Challenge of LLMs

Running large models often requires compromises between performance and size.

In practice, neural networks use floating-point numbers to represent both weights and activations. A floating-point number can express a wide range of values (very small, very large, with fractional parts).

This flexibility is helpful because during training and inference, weights and activations can vary in scale dramatically. Using floating-point lets the model adjust precisely. (For example, a weight could be 0.0023 or 123.45, and floating-point can capture both with decent precision.)

Quantization — a method that reduces the precision of model weights — offers a practical path to lower memory usage, but typically comes with trade-offs in model quality, especially at 4-bit precision and below.

When you convert those floating-point values into lower-precision formats (like 8-bit integers), you’re approximating them.

That means you store and compute with fewer bits, which is faster and more memory-efficient — but you risk losing fidelity (i.e. introducing small errors).

The trick is to do the conversion carefully so the model’s behavior stays nearly the same, even though internally it’s working with rougher approximations of those weights and activations.
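To make the trade-off concrete, here is a minimal round-to-nearest sketch in NumPy that maps a handful of weights to signed 8-bit integers and back. It is a generic illustration of quantization error, not Huawei's SINQ method:

import numpy as np

# Per-tensor round-to-nearest (RTN) quantization to signed 8-bit integers.
weights = np.array([0.0023, 123.45, -7.5, 0.9], dtype=np.float32)

scale = np.abs(weights).max() / 127.0              # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale

print(q)                              # stored in 1 byte per value instead of 4
print(dequantized)                    # approx. [0.0, 123.45, -7.78, 0.97]
print(np.abs(weights - dequantized))  # per-weight quantization error

With a single scale dominated by the outlier 123.45, the tiny weight 0.0023 collapses to zero. This is exactly the kind of outlier effect that more careful scaling schemes try to avoid.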

SINQ addresses these pain points by introducing a plug-and-play solution that delivers strong performance even in low-precision settings—without requiring calibration data or inter-layer dependencies.

How SINQ Works

The SINQ approach introduces two main innovations:

Dual-Axis Scaling: Instead of using a single scale factor for quantizing a matrix, SINQ uses separate scaling vectors for rows and columns. This helps mitigate the effects of outliers and allows the quantization error to be distributed more flexibly across the matrix.

Sinkhorn-Knopp-Style Normalization: A fast algorithm inspired by Sinkhorn iterations is used to normalize the standard deviations of rows and columns in a matrix. This helps minimize what the authors call “matrix imbalance,” a new proxy metric shown to be more effective than alternatives like kurtosis for improving quantization performance.

The combination of these two features allows SINQ to outperform other calibration-free techniques such as Round-To-Nearest (RTN), HQQ, and Hadamard-based quantization across multiple benchmarks.
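The sketch below is a simplified, unofficial illustration of these two ideas: it alternately normalizes row and column standard deviations (a Sinkhorn-Knopp-style loop), keeps the accumulated row and column scale vectors, quantizes the balanced matrix with round-to-nearest, and reapplies the scales at dequantization time. The actual SINQ implementation on GitHub differs in details such as tiling, group-wise scales, and the exact update rule:

import numpy as np

def sinkhorn_style_balance(W, iters=10):
    # Alternately normalize row and column standard deviations, accumulating
    # per-row and per-column scale vectors (dual-axis scaling). Simplified
    # illustration only; not the official SINQ code.
    W = W.astype(np.float64).copy()
    row_scale = np.ones(W.shape[0])
    col_scale = np.ones(W.shape[1])
    for _ in range(iters):
        r = W.std(axis=1) + 1e-12
        W /= r[:, None]
        row_scale *= r
        c = W.std(axis=0) + 1e-12
        W /= c[None, :]
        col_scale *= c
    return W, row_scale, col_scale

def quantize_rtn(W, bits=4):
    # Round-to-nearest quantization with a single per-tensor scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.clip(np.round(W / scale), -qmax, qmax), scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[3, :] *= 50.0                                # inject an outlier row

balanced, rs, cs = sinkhorn_style_balance(W)
q, s = quantize_rtn(balanced, bits=4)
W_hat = (q * s) * rs[:, None] * cs[None, :]    # undo both scalings after dequantization

q_plain, s_plain = quantize_rtn(W, bits=4)
print("plain RTN error:   ", np.abs(W - q_plain * s_plain).mean())
print("balanced RTN error:", np.abs(W - W_hat).mean())

In this toy setup, balancing the matrix before quantization keeps the outlier row from blowing up the shared scale, so the reconstruction error drops noticeably compared with plain round-to-nearest.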

Performance and Compatibility

SINQ has been evaluated across a wide range of architectures and models, including the Qwen3 series, LLaMA, and DeepSeek.

On benchmarks like WikiText2 and C4, SINQ consistently reduces perplexity and flip rates compared to baseline methods, often approaching or matching the performance of calibrated solutions.

It also supports non-uniform quantization schemes such as NF4 and can be combined with calibration methods like AWQ, leading to the variant A-SINQ. In calibrated settings, A-SINQ further narrows the gap with full-precision models.

In terms of runtime efficiency, SINQ quantizes models roughly twice as fast as HQQ and over 30 times faster than AWQ. This makes it well-suited for both research and production environments where quantization time is a practical constraint.

Open Source and Easy to Use

Huawei has released SINQ as an open-source project under a permissive, enterprise-friendly Apache 2.0 license, with implementation instructions and reproducibility tools available on GitHub.

The repository includes support for quantizing Hugging Face models with just a few lines of code, as well as tools for saving and reloading quantized weights. Default settings offer a balance between memory savings and accuracy, and users can customize parameters like bit-width, tiling strategy, and group size based on their needs.
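The exact interface is documented in the SINQ repository. Purely as a hypothetical sketch of what such a workflow could look like (the quantization call and its arguments below are placeholders, not the project's real API), it might resemble loading a model with Hugging Face Transformers and handing it to a quantizer:

from transformers import AutoModelForCausalLM

# Load any Hugging Face causal LM (the model id here is just an example).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

# Hypothetical placeholder call; see the SINQ README for the real function
# name and supported arguments (bit-width, tiling strategy, group size).
# quantized = sinq_quantize(model, bits=4, group_size=64)
# quantized.save_pretrained("qwen2.5-7b-sinq-4bit")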

The authors also provide evaluation integration via the lm-eval library and plan to release pre-quantized models on the Hugging Face Hub in the near future.

Looking Ahead

With growing demand for running large models on consumer-grade hardware, quantization is becoming an essential tool. SINQ aims to lower the entry barrier for LLM deployment, enabling developers and researchers to efficiently shrink models without major trade-offs in quality or compatibility.

Further updates—including integration with Hugging Face Transformers and pre-quantized model releases—are planned, making this a project to watch in the quantization space.


