
Huawei’s Computing Systems Lab in Zurich has introduced a new open-source quantization method for large language models (LLMs) that aims to reduce memory demands without compromising output quality.
The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has published the code on GitHub and Hugging Face under a permissive, enterprise-friendly Apache 2.0 license, so organizations are free to take it, modify it, and deploy it commercially at no cost.
Across models of different sizes, SINQ cuts memory usage by 60-70%, depending on architecture and bit-width.
This enables models that previously required more than 60 GB of memory to run in roughly 20 GB – a key enabler for running large models on a single high-end GPU or even on multi-GPU consumer-grade setups.
This makes it possible to run models that previously required high-end enterprise GPUs – such as NVIDIA’s A100 or H100 – on significantly more affordable hardware: a single NVIDIA GeForce RTX 4090 (around $1,600), rather than an A100 80GB (around $19,000) or H100 units that exceed $30,000.
For teams using cloud infrastructure, the savings are equally tangible. A100-based instances often cost $3-4.50 per hour, while 24GB GPUs like the RTX 4090 are available for $1-1.50 per hour on many platforms.
Over time, especially for extended inference workloads, that difference can add up to thousands of dollars in savings; for example, 1,000 GPU-hours at roughly $3.50 per hour versus $1.25 per hour is about $3,500 against $1,250. It also unlocks LLM deployment on smaller clusters, local workstations, or consumer-grade setups that were previously constrained by memory.
Tackling the memory challenge of LLMs
Running larger models often requires a compromise between performance and size.
In practice, neural networks use floating-point numbers to represent both weights and activations. A floating-point number can express a wide range of values (very small, very large, with fractional parts).
This flexibility is helpful because during training and inference, the scales of weights and activations can vary dramatically. Using floating point allows the model to represent them precisely. (For example, a weight could be 0.0023 or 123.45, and floating point can capture both with good accuracy.)
Quantization – a method that reduces the precision of model weights – offers a practical route to reducing memory usage, but typically comes with a trade-off in model quality, especially at 4-bit precision and below.
When you convert those floating-point values to lower-precision formats (like 8-bit integers), you’re approximating them.
This means you store and compute with fewer bits, which is faster and more memory-efficient – but you risk losing fidelity (i.e., introducing small errors).
The trick is to do the transformation carefully so that the behavior of the model remains approximately the same, even though internally it is working with a rough estimate of those weights and activations.
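To make the idea concrete, here is a minimal, self-contained sketch of single-scale round-to-nearest quantization in NumPy. This is a generic illustration, not SINQ’s own code; the toy matrix, bit-width, and scaling scheme are chosen purely for demonstration.

```python
import numpy as np

# Illustrative round-to-nearest (RTN) quantization with one scale factor.
# This is NOT SINQ; it shows the baseline idea that SINQ improves on.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(4, 8)).astype(np.float32)  # toy weight matrix

bits = 4
qmax = 2 ** (bits - 1) - 1           # symmetric signed 4-bit range: codes in [-8, 7]
scale = np.abs(W).max() / qmax        # one scale factor for the whole matrix

W_q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)  # 4-bit codes
W_hat = W_q.astype(np.float32) * scale                               # dequantized copy

print("max abs error:", np.abs(W - W_hat).max())
```

Note that a single large outlier weight inflates the shared scale for the entire matrix, which is exactly the failure mode that SINQ’s per-row and per-column scaling is designed to soften.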
SINQ addresses these problems by offering a plug-and-play solution that delivers robust performance even in low-precision settings without the need for calibration data or inter-layer dependencies.
How does SINQ work?
The SINQ approach introduces two main innovations:
- Dual-axis scaling: Instead of using a single scale factor to quantize a matrix, SINQ uses separate scaling vectors for its rows and columns. This helps reduce the impact of outliers and allows the quantization error to be distributed more flexibly across the matrix.
- Sinkhorn-Knopp-style normalization: A fast algorithm inspired by Sinkhorn iterations is used to normalize the standard deviations of the rows and columns of a matrix. This reduces what the authors call “matrix imbalance”, a new proxy metric they show to be more effective than alternatives such as kurtosis for improving quantization performance. (A rough sketch of both ideas follows this list.)
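The sketch below illustrates how these two ideas could fit together, based on the description above: row and column standard deviations are alternately normalized (a Sinkhorn-style balancing step), the balanced matrix is quantized with round-to-nearest, and the accumulated row and column scale vectors are kept for dequantization. The function name, the update rule, and all parameters are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def dual_axis_quantize(W, bits=4, iters=10):
    """Illustrative sketch only: balance row/column std-devs (Sinkhorn-style),
    then apply round-to-nearest quantization. Not the official SINQ code."""
    W = W.astype(np.float64)                 # work on a copy
    row_scale = np.ones(W.shape[0])
    col_scale = np.ones(W.shape[1])

    # Alternate row and column normalization until the matrix is roughly balanced.
    for _ in range(iters):
        r = W.std(axis=1, keepdims=True) + 1e-12   # per-row spread
        W = W / r
        row_scale *= r.ravel()
        c = W.std(axis=0, keepdims=True) + 1e-12   # per-column spread
        W = W / c
        col_scale *= c.ravel()

    # Quantize the balanced matrix with a single round-to-nearest step.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    Q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)

    # Dequantization restores the row/column scales that were divided out.
    W_hat = row_scale[:, None] * (Q * scale) * col_scale[None, :]
    return Q, row_scale, col_scale, scale, W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
W[3, 7] = 25.0                               # inject an outlier weight
_, _, _, _, W_hat = dual_axis_quantize(W)
print("mean abs error:", np.abs(W - W_hat).mean())
```

Because the outlier’s magnitude is absorbed into its row and column scale vectors rather than into one global scale, the rest of the matrix keeps more of the 4-bit grid’s resolution.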
The combination of these two features allows SINQ to outperform other calibration-free techniques such as round-to-nearest (RTN), HQQ, and Hadamard-based quantization in many benchmarks.
Performance and compatibility
SINQ has been evaluated across a wide range of architectures and models, including the Qwen3 series, LLaMA, and DeepSeek.
On benchmarks such as WikiText2 and C4, SINQ consistently reduces error and flip rates compared to baseline methods, often approaching or matching the performance of calibrated solutions.
It also supports non-uniform quantization schemes such as NF4 and can be combined with calibration methods such as AWQ, yielding the variant A-SINQ. In calibrated settings, A-SINQ further reduces the gap with full-precision models.
In terms of runtime efficiency, SINQ quantizes models approximately twice as fast as HQQ and up to 30 times faster than AWQ. This makes it suitable for both research and production environments where quantization time is a practical constraint.
Open source and easy to use
Huawei has released SINQ as an open-source project under a permissive, enterprise-friendly Apache 2.0 license, with implementation instructions and reproduction tools available on GitHub.
The repository includes support for quantizing Hugging Face models with just a few lines of code, as well as tools for saving and reloading quantized weights. The default settings provide a balance between memory savings and accuracy, and users can customize parameters such as bit-width, tiling strategy, and group size to suit their needs.
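To give a sense of what “a few lines of code” might look like in practice, here is a hypothetical usage sketch built around Hugging Face Transformers. The `sinq` import, the `quantize_model` function, and its parameters are placeholders invented for illustration and are left commented out; consult the GitHub README for the repository’s actual API and parameter names. The model id is also just an example from the Qwen3 series mentioned above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# from sinq import quantize_model   # placeholder import, not the real package path

model_name = "Qwen/Qwen3-8B"        # example checkpoint; any causal LM would do
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calibration-free quantization: no calibration dataset is passed in.
# model = quantize_model(model, bits=4, group_size=64)   # placeholder call

# model.save_pretrained("qwen3-8b-sinq-4bit")            # persist quantized weights
```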
The authors also provide evaluation integration through the lm-eval library, and plan to release pre-quantized models on the Hugging Face Hub in the near future.
Looking ahead
With the increasing demand to run large models on consumer-grade hardware, quantization is becoming an essential tool. SINQ aims to lower the barrier to entry for LLM deployment, enabling developers and researchers to shrink models efficiently without major sacrifices in quality or compatibility.
Further updates – including integration with Hugging Face Transformers and pre-quantized model releases – are planned, making this a project worth watching in the quantization space.

