Advanced AI News
VentureBeat AI

Huawei's new open source technique shrinks LLMs to make them run on less powerful, less expensive hardware

By Advanced AI Editor | October 6, 2025 | 5 min read

Huawei’s Computing Systems Lab in Zurich has introduced a new open-source quantization method for large language models (LLMs) aimed at reducing memory demands without sacrificing output quality.

The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has released the code on GitHub and Hugging Face under a permissive, enterprise-friendly Apache 2.0 license, allowing organizations to use, modify, and deploy it commercially, free of charge.

Across models of different sizes, SINQ cuts memory usage by 60–70%, depending on architecture and bit-width.

This enables models that would previously require >60 GB of memory to run on ~20 GB setups—a critical enabler for running large models on a single high-end GPU or even multi-GPU consumer-grade setups.

This makes it possible to run models that previously needed high-end enterprise GPUs on significantly more affordable hardware: a single Nvidia GeForce RTX 4090 (around $1,600) instead of an A100 80GB (about $19,000) or an H100 that can exceed $30,000.

For teams using cloud infrastructure, the savings are similarly tangible. A100-based instances often cost $3–4.50 per hour, while 24 GB GPUs like the RTX 4090 are available on many platforms for $1–1.50 per hour.

Over time, especially for extended inference workloads, this difference can add up to thousands of dollars in cost reductions, while also unlocking LLM deployment on smaller clusters, local workstations, or consumer-grade setups previously constrained by memory.
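The arithmetic behind that claim is straightforward. The sketch below uses the midpoints of the hourly rate ranges quoted above; the figures are illustrative, not exact market prices.

```python
# Back-of-the-envelope inference-cost comparison using the hourly
# rates quoted in the article (illustrative midpoints, not quotes).
a100_rate = 4.00      # $/hour, A100-based cloud instance (midpoint of $3-4.50)
rtx4090_rate = 1.25   # $/hour, 24 GB consumer GPU (midpoint of $1-1.50)

hours_per_month = 24 * 30  # a continuously running inference workload

monthly_savings = (a100_rate - rtx4090_rate) * hours_per_month
print(f"Monthly savings: ${monthly_savings:,.2f}")  # roughly $2,000/month
```

At these rates, a single always-on endpoint saves on the order of two thousand dollars per month, which is how "thousands of dollars" accumulates over an extended workload.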

Tackling the Memory Challenge of LLMs

Running large models often requires compromises between performance and size.

In practice, neural networks use floating-point numbers to represent both weights and activations. A floating-point number can express a wide range of values (very small, very large, with fractional parts).

This flexibility is helpful because during training and inference, weights and activations can vary in scale dramatically. Using floating-point lets the model adjust precisely. (For example, a weight could be 0.0023 or 123.45, and floating-point can capture both with decent precision.)

Quantization — a method that reduces the precision of model weights — offers a practical path to lower memory usage, but typically comes with trade-offs in model quality, especially at 4-bit precision and below.

When you convert those floating-point values into lower-precision formats (like 8-bit integers), you’re approximating them.

That means you store and compute with fewer bits, which is faster and more memory-efficient — but you risk losing fidelity (i.e. introducing small errors).

The trick is to do the conversion carefully so the model’s behavior stays nearly the same, even though internally it’s working with rougher approximations of those weights and activations.
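A minimal round-to-nearest sketch makes the trade-off concrete. This is the simple single-scale scheme that calibration-free baselines start from, not SINQ itself:

```python
# Minimal sketch of round-to-nearest 8-bit quantization: scale floats
# into the signed 8-bit range, round, then dequantize and check the error.
weights = [0.0023, 123.45, -7.5, 0.91]

# One scale factor for the whole tensor: map the largest magnitude to 127.
scale = max(abs(w) for w in weights) / 127

quantized = [round(w / scale) for w in weights]    # stored as 8-bit ints
dequantized = [q * scale for q in quantized]       # approximate recovery

errors = [abs(w - d) for w, d in zip(weights, dequantized)]
print(max(errors))  # worst-case error is bounded by scale / 2
```

Note how the 0.0023 weight collapses to zero: a single global scale must stretch to accommodate the 123.45 outlier, crushing small values. That outlier sensitivity is exactly what SINQ's dual-axis scaling is designed to mitigate.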

SINQ addresses these pain points by introducing a plug-and-play solution that delivers strong performance even in low-precision settings—without requiring calibration data or inter-layer dependencies.

How SINQ Works

The SINQ approach introduces two main innovations:

Dual-Axis Scaling: Instead of using a single scale factor for quantizing a matrix, SINQ uses separate scaling vectors for rows and columns. This helps mitigate the effects of outliers and allows the quantization error to be distributed more flexibly across the matrix.

Sinkhorn-Knopp-Style Normalization: A fast algorithm inspired by Sinkhorn iterations is used to normalize the standard deviations of rows and columns in a matrix. This helps minimize what the authors call “matrix imbalance,” a new proxy metric shown to be more effective than alternatives like kurtosis for improving quantization performance.
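The two ideas can be illustrated with a toy sketch: alternately rescale rows and columns of a weight matrix so their standard deviations even out, keeping the per-row and per-column scale vectors for exact reconstruction. This is an assumption-laden illustration of the general Sinkhorn-style idea, not the paper's actual algorithm or code.

```python
import math

def std(xs):
    """Population standard deviation of a list of floats."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def sinkhorn_normalize(W, iters=10):
    """Toy Sinkhorn-style balancing: alternately normalize row and
    column standard deviations, accumulating per-axis scale vectors."""
    rows, cols = len(W), len(W[0])
    row_scale = [1.0] * rows
    col_scale = [1.0] * cols
    for _ in range(iters):
        for i in range(rows):                 # normalize each row's std
            s = std(W[i]) or 1.0
            W[i] = [w / s for w in W[i]]
            row_scale[i] *= s
        for j in range(cols):                 # then each column's std
            col = [W[i][j] for i in range(rows)]
            s = std(col) or 1.0
            for i in range(rows):
                W[i][j] /= s
            col_scale[j] *= s
    # Original entry (i, j) == row_scale[i] * col_scale[j] * W[i][j]
    return W, row_scale, col_scale

W = [[0.001, 50.0], [0.002, 60.0]]            # second column is an outlier
Wn, rs, cs = sinkhorn_normalize([row[:] for row in W])
```

After balancing, the matrix is quantized on a much more uniform range, and the row/column scale vectors are applied at dequantization time, which is the sense in which the error is "distributed more flexibly across the matrix."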

The combination of these two features allows SINQ to outperform other calibration-free techniques such as Round-To-Nearest (RTN), HQQ, and Hadamard-based quantization across multiple benchmarks.

Performance and Compatibility

SINQ has been evaluated across a wide range of architectures and models, including the Qwen3 series, LLaMA, and DeepSeek.

On benchmarks like WikiText2 and C4, SINQ consistently reduces perplexity and flip rates compared to baseline methods, often approaching or matching the performance of calibrated solutions.

It also supports non-uniform quantization schemes such as NF4 and can be combined with calibration methods like AWQ, leading to the variant A-SINQ. In calibrated settings, A-SINQ further narrows the gap with full-precision models.

In terms of runtime efficiency, SINQ quantizes models roughly twice as fast as HQQ and over 30 times faster than AWQ. This makes it well-suited for both research and production environments where quantization time is a practical constraint.

Open Source and Easy to Use

Huawei has released SINQ as an open-source project under the Apache 2.0 license, with implementation instructions and reproducibility tools available on GitHub.

The repository includes support for quantizing Hugging Face models with just a few lines of code, as well as tools for saving and reloading quantized weights. Default settings offer a balance between memory savings and accuracy, and users can customize parameters like bit-width, tiling strategy, and group size based on their needs.

The authors also provide evaluation integration via the lm-eval library and plan to release pre-quantized models on the Hugging Face Hub in the near future.

Looking Ahead

With growing demand for running large models on consumer-grade hardware, quantization is becoming an essential tool. SINQ aims to lower the entry barrier for LLM deployment, enabling developers and researchers to efficiently shrink models without major trade-offs in quality or compatibility.

Further updates—including integration with Hugging Face Transformers and pre-quantized model releases—are planned, making this a project to watch in the quantization space.


