Breakthroughs in Large Model Quantization Algorithms
Introduction: The 'Last Mile' Challenge of Large Model Deployment
Large language models (LLMs) are reshaping the artificial intelligence landscape at an unprecedented pace. From GPT-4 to Llama 3, from Qwen2.5 to DeepSeek-V3, model parameter counts have routinely reached tens of billions or even trillions. However, these massive parameter volumes bring enormous memory footprints and computational overhead — a 70-billion-parameter model requires approximately 140GB of VRAM just to load its weights in FP16 precision, far exceeding the capacity of a single consumer-grade GPU.
How can we dramatically compress model size and reduce inference costs while preserving model performance as much as possible? Quantization is one of the core techniques for addressing this bottleneck. Recent breakthroughs in several advanced quantization algorithms are redefining the technical boundaries of efficient LLM deployment.
Quantization Fundamentals: The Precision Compression Journey from FP16 to INT4
The core idea behind quantization is converting model weights and activation values from high-precision floating-point numbers (such as FP32, FP16) to low-precision representations (such as INT8, INT4, or even lower), thereby reducing storage requirements and computational workload.
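As a rough illustration of this idea, the NumPy sketch below applies symmetric absmax round-to-nearest quantization to a random weight matrix. The function names (`quantize_absmax`, `dequantize`) and the single per-tensor scale are assumptions made for this example, not the API of any particular library.

```python
import numpy as np

def quantize_absmax(w, bits=8):
    """Map floats to signed integers using one absmax scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax                # step size of the integer grid
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

for bits in (8, 4):
    q, scale = quantize_absmax(w, bits)
    err = np.abs(w - dequantize(q, scale)).max()
    print(f"INT{bits}: step = {scale:.6f}, max abs error = {err:.6f}")
```

The worst-case rounding error is about half a grid step, so moving from 8 to 4 bits widens the step and the error by roughly the same factor.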
Based on when quantization is applied, mainstream approaches fall into two categories:
- Post-Training Quantization (PTQ): Weights are quantized directly after model training is complete, requiring no retraining. It is low-cost and fast, making it the dominant approach for LLM quantization today.
- Quantization-Aware Training (QAT): Quantization errors are simulated during the training process, enabling the model to adapt to low-precision representations. Accuracy loss is smaller than with PTQ, but training costs increase significantly.
For LLMs with tens of billions of parameters, the training costs of QAT are often prohibitive, making PTQ the focal point of research in both academia and industry.
A Comprehensive Overview of Advanced Quantization Algorithms
GPTQ: Layer-wise Quantization Based on Second-Order Information
GPTQ is one of the most widely adopted LLM quantization algorithms today. Its core idea originates from the classic OBQ (Optimal Brain Quantization) method, using the inverse of the Hessian matrix to measure each weight's impact on the output after quantization, and distributing the quantization error of individual weights to other unquantized weights in the same layer through a compensation mechanism.
GPTQ's key innovations are quantizing every row of a weight matrix in the same column order, applying the compensation updates lazily in blocks of columns, and replacing repeated inverse-Hessian updates with a Cholesky decomposition. Together these make INT4 quantization of models with tens of billions of parameters feasible within a few GPU hours, with little accuracy loss. Experiments on the Llama series of models show that GPTQ 4-bit quantized models typically drop by only 1-2 percentage points on most benchmarks.
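The compensation mechanism can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: it assumes one fixed symmetric scale per output row, processes columns one at a time without block batching or grouping, and uses 1% dampening. The names `rtn`, `gptq_sketch`, and `output_error`, and the synthetic calibration data, are hypothetical.

```python
import numpy as np

def rtn(w, scale, qmax):
    """Round-to-nearest on a symmetric integer grid, returned in float form."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_sketch(W, X, bits=4, damp=0.01):
    """W: (d_out, d_in) layer weights. X: (n, d_in) calibration inputs."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax            # one fixed scale per output row

    H = 2.0 * X.T @ X                               # Hessian of the layer-wise squared error
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)  # dampening for numerical stability
    Hinv = np.linalg.inv(H)
    Hinv = (Hinv + Hinv.T) / 2.0                    # remove round-off asymmetry
    U = np.linalg.cholesky(Hinv).T                  # upper factor, Hinv = U.T @ U

    Q = np.zeros_like(W)
    for j in range(d_in):                           # same column order for every row
        Q[:, j] = rtn(W[:, j], scale, qmax)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        W[:, j:] -= np.outer(err, U[j, j:])         # push the error onto unquantized columns
    return Q

rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 8))
X = latent @ rng.normal(size=(8, 32)) + 0.5 * rng.normal(size=(256, 32))  # correlated features
W = rng.normal(size=(16, 32))

scale = np.abs(W).max(axis=1) / 7
Q_rtn = rtn(W, scale[:, None], 7)                   # baseline without compensation
Q_gptq = gptq_sketch(W, X)

def output_error(Q):
    """Squared error of the layer output on the calibration inputs."""
    return float(np.sum((X @ (W - Q).T) ** 2))

print("RTN  output error:", output_error(Q_rtn))
print("GPTQ output error:", output_error(Q_gptq))
```

Both variants land on the same 4-bit grid; the difference is that the compensated version chooses grid points that keep the layer output close to the original, which matters most when input features are correlated.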
AWQ: Activation-Aware Weight Quantization
AWQ (Activation-Aware Weight Quantization), proposed by Song Han's team at MIT, is built on a core insight: not all weights are equally important. AWQ analyzes the distribution of activation values to identify the "salient weight channels" that have the greatest impact on model output, and applies scaling factors to protect these channels, significantly reducing their quantization error.
Unlike GPTQ, AWQ does not reconstruct layer outputs from calibration data, and it requires no backpropagation. It needs only a small calibration set to collect activation statistics, which makes quantization faster and reduces the risk of overfitting to the calibration data. The result is strong robustness at INT4 precision, particularly in long-text generation and instruction-following tasks. AWQ has been widely integrated into mainstream inference frameworks such as vLLM and TensorRT-LLM.
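The scaling idea can be illustrated with a small grid search. The sketch below is an assumption-laden simplification: it uses per-row round-to-nearest, derives the per-channel scale as the mean activation magnitude raised to a power `alpha`, and picks the `alpha` that minimizes output error on synthetic calibration data. The names `quantize_rows` and `awq_like_search` are hypothetical.

```python
import numpy as np

def quantize_rows(W, bits=4):
    """Symmetric round-to-nearest with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def awq_like_search(W, X, bits=4, n_grid=11):
    """Grid-search a per-input-channel scale s = mean|x|**alpha that minimizes output error."""
    act_mag = np.abs(X).mean(axis=0)              # average activation magnitude per input channel
    reference = X @ W.T
    best_err, best_alpha, best_W = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha                      # alpha = 0 reduces to plain round-to-nearest
        W_hat = quantize_rows(W * s, bits) / s    # scale up before quantizing, fold 1/s back in
        err = float(np.mean((X @ W_hat.T - reference) ** 2))
        if err < best_err:
            best_err, best_alpha, best_W = err, alpha, W_hat
    return best_err, best_alpha, best_W

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))
X[:, :4] *= 20.0                                  # a few salient input channels
W = rng.normal(size=(16, 64))

rtn_err = float(np.mean((X @ quantize_rows(W).T - X @ W.T) ** 2))
best_err, best_alpha, W_awq = awq_like_search(W, X)
print("round-to-nearest error:", rtn_err)
print("best alpha:", best_alpha, "error:", best_err)
```

Because `alpha = 0` is part of the grid, the search can never do worse than plain round-to-nearest on the calibration data. In a real deployment the factor `1/s` would be folded into the preceding operation rather than into the dequantized weights.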
QuIP# and AQLM: Pushing Toward the 2-bit Precision Frontier
If INT4 quantization is approaching maturity, then 2-bit quantization represents the current "deep end" of quantization research.
QuIP# builds on QuIP (Quantization with Incoherence Processing). It multiplies the weights by random orthogonal matrices (randomized Hadamard transforms in QuIP#), an operation called "incoherence processing" that makes weight distributions more uniform and reduces the interference of extreme values on low-bit quantization. Combined with vector quantization over lattice-based codebooks, QuIP# retains usable accuracy at 2 bits per weight.
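The effect of incoherence processing on extreme values can be seen in a small experiment. The sketch below uses dense random orthogonal matrices drawn via QR decomposition, which is an assumption chosen for brevity, and measures the peak-to-RMS ratio of a weight matrix before and after rotation. The helper names are hypothetical.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw a random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))                # fix column signs so the draw is well defined

def peak_to_rms(W):
    """Largest magnitude divided by the RMS value; lower means fewer extreme entries."""
    return float(np.abs(W).max() / np.sqrt(np.mean(W ** 2)))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[3, 17] = 50.0                                   # inject one extreme weight

U = random_orthogonal(64, rng)
V = random_orthogonal(64, rng)
W_rot = U @ W @ V.T                               # rotated weights, quantized in place of W
W_back = U.T @ W_rot @ V                          # the rotation can be undone exactly

print("peak/RMS before rotation:", peak_to_rms(W))
print("peak/RMS after rotation: ", peak_to_rms(W_rot))
```

The rotation preserves the overall energy of the matrix but spreads the single outlier over all entries, so a low-bit grid no longer has to stretch to cover one extreme value.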
AQLM (Additive Quantization for Language Models) employs a multi-codebook additive quantization strategy, representing each group of weights as the sum of multiple codebook vectors. This approach retains more information than traditional scalar quantization at 2-bit or even lower precision, and it reports better perplexity than other methods at the same bit width.
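To make the "sum of codebook vectors" representation concrete, here is a minimal sketch that fits the codebooks greedily with k-means on residuals. This greedy residual scheme is a stand-in chosen for simplicity and is not how AQLM learns its codebooks; the function names, group size, and codebook size are assumptions, and codebook storage is ignored in the bit count.

```python
import numpy as np

def kmeans(data, k, rng, iters=10):
    """Plain Lloyd's algorithm; returns centroids and the final assignment."""
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        codes = dists.argmin(axis=1)
        for c in range(k):
            members = data[codes == c]
            if len(members) > 0:                  # empty clusters keep their old centroid
                centroids[c] = members.mean(axis=0)
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)

def additive_quantize(W, group=8, n_codebooks=2, codebook_size=256, seed=0):
    """Represent each group of weights as a sum of one vector from each codebook."""
    rng = np.random.default_rng(seed)
    residual = W.reshape(-1, group).copy()
    codebooks, codes, errors = [], [], []
    for _ in range(n_codebooks):
        C, idx = kmeans(residual, codebook_size, rng)
        residual = residual - C[idx]              # the next codebook models what is left over
        codebooks.append(C)
        codes.append(idx)
        errors.append(float(np.mean(residual ** 2)))
    return codebooks, codes, errors

def decode(codebooks, codes, shape):
    """Reconstruct weights by summing the selected codebook vectors."""
    return sum(C[idx] for C, idx in zip(codebooks, codes)).reshape(shape)

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 128))
codebooks, codes, errors = additive_quantize(W)
W_hat = decode(codebooks, codes, W.shape)

bits_per_weight = 2 * np.log2(256) / 8            # two 8-bit codes per group of 8 weights
print("bits per weight (codes only):", bits_per_weight)
print("MSE after each codebook:", errors)
```

Two 8-bit indices per group of eight weights amount to 2 bits per weight for the codes, and each additional codebook lowers the reconstruction error of the group.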
SqueezeLLM and SpQR: Non-uniform and Mixed-Precision Strategies
SqueezeLLM addresses the outlier problem in weight distributions. Research has found that LLM weights contain a small number of outliers with disproportionately large impact, and traditional uniform quantization handles them poorly. SqueezeLLM adopts a non-uniform quantization scheme combined with dense-sparse decomposition, storing outliers separately at high precision while quantizing the remaining weights at low precision.
SpQR (Sparse-Quantized Representation) further systematizes the mixed-precision concept by automatically identifying "sensitive" weights and retaining higher precision for them, while compressing the remaining weights to 3-4 bits, achieving near-lossless compression.
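The shared idea behind both methods, separating a few outliers from the bulk of the weights, can be sketched as follows. This is a simplified illustration: outliers are picked by magnitude alone, and the remaining weights use a uniform 4-bit grid rather than the non-uniform scheme SqueezeLLM describes. The names `quantize_tensor` and `dense_sparse_quantize`, the 1% outlier budget, and the synthetic weights are assumptions.

```python
import numpy as np

def quantize_tensor(W, bits=4):
    """Symmetric round-to-nearest with a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def dense_sparse_quantize(W, outlier_fraction=0.01, bits=4):
    """Keep the largest-magnitude weights in full precision and quantize the rest."""
    k = max(1, int(round(outlier_fraction * W.size)))
    threshold = np.partition(np.abs(W).ravel(), -k)[-k]   # k-th largest magnitude
    outlier_mask = np.abs(W) >= threshold
    sparse_part = np.where(outlier_mask, W, 0.0)          # stored separately at high precision
    dense_part = np.where(outlier_mask, 0.0, W)           # outliers removed, so the range shrinks
    return quantize_tensor(dense_part, bits) + sparse_part, outlier_mask

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))
pos = rng.choice(W.size, size=20, replace=False)
W.flat[pos] = 20.0 * rng.choice([-1.0, 1.0], size=20)     # a handful of large outliers

W_plain = quantize_tensor(W)
W_mixed, mask = dense_sparse_quantize(W)
mse_plain = float(np.mean((W - W_plain) ** 2))
mse_mixed = float(np.mean((W - W_mixed) ** 2))
print("uniform 4-bit MSE:     ", mse_plain)
print("dense + sparse 4-bit MSE:", mse_mixed)
```

Removing the outliers shrinks the range the 4-bit grid has to cover, so the remaining weights are quantized with a much finer step while the outliers themselves are kept exactly.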
Latest Developments: GPTQ-V2, QuaRot, and FP Quantization
Since 2024, quantization algorithms have continued to iterate. QuaRot proposes a quantization method based on rotation matrices and computational invariance: rotating weights and activations by an orthogonal matrix leaves the model's output unchanged while removing activation outliers, which allows both weights and activations to be quantized to 4-bit (W4A4). Unlike weight-only quantization schemes, W4A4 can use INT4 Tensor Core compute on GPUs that provide it, with reported inference throughput gains of roughly 2-3x.
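The invariance property can be checked numerically. The sketch below builds a normalized Hadamard matrix, verifies that rotating both activations and weights leaves the layer output unchanged, and then compares simulated W4A4 error with and without the rotation. It is a single-layer toy under stated assumptions (per-row fake quantization, one synthetic outlier channel) and not the QuaRot pipeline; all names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant(A, bits=4):
    """Simulated symmetric quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(A).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(A / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d = 64
X = rng.normal(size=(128, d))
X[:, 5] *= 20.0                                   # one outlier activation channel
W = rng.normal(size=(32, d))
R = hadamard(d)

Y = X @ W.T
Y_rot = (X @ R) @ (W @ R).T                       # same output, because R @ R.T is the identity

err_plain = float(np.mean((fake_quant(X) @ fake_quant(W).T - Y) ** 2))
err_rot = float(np.mean((fake_quant(X @ R) @ fake_quant(W @ R).T - Y) ** 2))
print("W4A4 error without rotation:", err_plain)
print("W4A4 error with rotation:   ", err_rot)
```

The rotation spreads the outlier channel across all dimensions, so each token's 4-bit activation grid no longer has to cover one dominant value.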
Additionally, floating-point quantization formats such as FP8 and FP4 are gaining increasing hardware support. The NVIDIA Blackwell architecture natively supports FP4 operations, enabling floating-point quantization to strike a new balance between precision and efficiency. Compared to integer quantization, floating-point quantization is naturally better suited to the non-uniform weight distributions commonly found in LLMs.
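The difference between the two kinds of 4-bit grid is easiest to see by listing their representable values. The sketch below assumes the E2M1 layout for FP4 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and snaps Gaussian weights to each grid after absmax scaling. The helper `quantize_to_grid` is hypothetical, and which format gives lower error in practice depends on the weight distribution and the scaling granularity, so the sketch makes no claim about a winner.

```python
import numpy as np

FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_E2M1[:0:-1], FP4_E2M1])   # 15 distinct values, dense near zero
INT4_GRID = np.arange(-8, 8, dtype=np.float64)            # 16 evenly spaced values

def quantize_to_grid(w, grid):
    """Scale so the largest magnitude hits the grid's largest value, then snap to the nearest point."""
    scale = np.abs(w).max() / grid.max()
    idx = np.abs(w[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096)

w_fp4 = quantize_to_grid(w, FP4_GRID)
w_int4 = quantize_to_grid(w, INT4_GRID)
print("FP4 grid spacing: ", np.diff(FP4_E2M1))
print("FP4  MSE:", float(np.mean((w - w_fp4) ** 2)))
print("INT4 MSE:", float(np.mean((w - w_int4) ** 2)))
```

The floating-point grid places more points near zero and fewer at the extremes, which is the property that makes it a natural match for bell-shaped weight distributions.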
Technical Analysis: Core Challenges and Trade-offs in Quantization
The Precision vs. Compression Ratio Trade-off
Quantization is fundamentally lossy compression, so some accuracy loss is inevitable. The current technical consensus is: INT8 quantization is virtually lossless, INT4 quantization incurs manageable losses, while INT3 and below require careful design to maintain usability. Sensitivity to quantization error also varies significantly by task: mathematical reasoning and code generation are more sensitive, while general conversation and summarization are more tolerant.
Choosing Quantization Granularity
Quantization granularity ranges from coarse to fine: per-tensor, per-channel, and per-group quantization. Finer granularity preserves accuracy better, but it also adds storage overhead for the extra scaling factors and some computational complexity. Current mainstream approaches generally adopt per-group quantization with a group size of 128 as a compromise between accuracy and efficiency.
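The trade-off can be quantified with a short experiment. The sketch below quantizes the same matrix at the three granularities and computes the effective bits per weight, assuming one 16-bit scale per group. The function names, the synthetic weights with differently scaled rows, and the 16-bit scale format are assumptions for illustration.

```python
import numpy as np

def quantize_grouped(W, group_size, bits=4):
    """Symmetric round-to-nearest with one scale per `group_size` consecutive weights in a row."""
    qmax = 2 ** (bits - 1) - 1
    rows, cols = W.shape
    G = W.reshape(rows, cols // group_size, group_size)
    scale = np.abs(G).max(axis=2, keepdims=True) / qmax
    return (np.clip(np.round(G / scale), -qmax - 1, qmax) * scale).reshape(rows, cols)

def effective_bits(group_size, bits=4, scale_bits=16):
    """Bits per weight once the per-group scale is amortized over the group."""
    return bits + scale_bits / group_size

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 1024)) * rng.uniform(0.1, 2.0, size=(64, 1))  # rows differ in magnitude

def mse(W_hat):
    return float(np.mean((W - W_hat) ** 2))

mse_tensor = mse(quantize_grouped(W.reshape(1, -1), W.size).reshape(W.shape))
mse_channel = mse(quantize_grouped(W, W.shape[1]))
mse_group = mse(quantize_grouped(W, 128))

print("per-tensor  MSE:", mse_tensor)
print("per-channel MSE:", mse_channel)
print("per-group   MSE:", mse_group, "at", effective_bits(128), "bits per weight")
```

With a group size of 128, the scale adds 16/128 = 0.125 bits per weight, which is why 4-bit group quantization is often described as costing slightly more than 4 bits in practice.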
The Criticality of Hardware Compatibility
The actual speedup achieved by quantization algorithms is highly dependent on hardware support. INT4 matrix multiplication on NVIDIA GPUs requires implementation through specific CUDA kernels, while non-uniform quantization and vector quantization — though theoretically more precise — often fail to translate into actual inference speed improvements due to lack of hardware acceleration support. This is a key reason why "hardware-friendly" approaches like AWQ and GPTQ have achieved widespread industrial adoption.
Industry Applications
Quantization technology has been deeply integrated into every stage of LLM deployment:
- Cloud Inference: Frameworks such as TensorRT-LLM and vLLM support INT8/INT4 quantization, significantly reducing serving costs. Relative to FP16, INT4 quantization reduces weight memory by approximately 75% (from about 140GB to about 35GB for a 70B model), enabling a 70B model to run on a single A100 80GB GPU.
- On-device Deployment: Projects like llama.cpp and MLC-LLM have successfully deployed quantized models to smartphones, laptops, and other edge devices. On Apple M-series chips and Qualcomm Snapdragon platforms, 4-bit quantized 7B models can already achieve real-time conversation.
- Model Distribution: A large number of models on Hugging Face are published in GPTQ, AWQ, and other quantized formats, fostering an active community ecosystem. Community contributors like TheBloke continuously provide multi-precision quantized versions for new models.
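The memory figures quoted for cloud inference follow from simple arithmetic, sketched below. The calculation covers weights only and uses decimal gigabytes; KV cache, activations, and quantization scales are deliberately left out, so real deployments need additional headroom. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 70_000_000_000                         # a 70B-parameter model
fp16 = weight_memory_gb(n_params, 16)
int8 = weight_memory_gb(n_params, 8)
int4 = weight_memory_gb(n_params, 4)
saving = 1 - int4 / fp16                          # fraction of FP16 weight memory saved by INT4

print(f"FP16: {fp16:.0f} GB, INT8: {int8:.0f} GB, INT4: {int4:.0f} GB")
print(f"INT4 saving vs FP16: {saving:.0%}")
```

At 4 bits the weights of a 70B model take about 35 GB, which leaves room on an 80 GB accelerator for the KV cache and runtime buffers.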
Future Outlook: The Next Frontier in Quantization
Looking ahead, LLM quantization technology will continue to evolve along several directions:
First, 1-bit quantization and beyond will push the boundaries of extreme compression, challenging fundamental limits of information representation in neural networks.
📌 Source: GogoAI News (www.gogoai.xin)
🔗 Original: https://www.gogoai.xin/article/large-model-quantization-algorithm-breakthroughs