AI - Machine Learning Blog

VPTQ Quantized 2-Bit Models: Principles, Steps, and Practical Implementation

xinyuwei (Microsoft)
Feb 01, 2025

Please refer to my repo for more AI resources, and feel free to star it:

https://github.com/xinyuwei-david/david-share.git 

This article is from one of my repos:

https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/Quantization-2-bit-VPTQ

Welcome to this comprehensive guide where we delve into the application of VPTQ (Vector Post-Training Quantization) in quantizing models to 2 bits. This article aims to help you understand the core concepts of VPTQ, the key steps involved in the quantization process, and how to achieve efficient model compression and performance optimization using VPTQ.

Introduction

As large language models (LLMs) continue to grow in scale, the demand for storage and computational resources increases accordingly. To run these large models on hardware with limited resources, model compression techniques become crucial. Among them, VPTQ (Vector Post-Training Quantization) stands out as an ultra-low-bit quantization method that can quantize model parameters to 1-2 bits without the need for retraining, all while maintaining high accuracy. Significant advancements in quantization for LLMs have been made recently. Algorithms like AQLM and AutoRound have demonstrated that 4-bit quantization can maintain the accuracy of the original models across most tasks. However, pushing quantization to even lower precision, such as 2-bit, often introduces noticeable accuracy loss. VPTQ addresses this challenge by leveraging advanced techniques to achieve low-bit quantization with minimal degradation in performance.

VPTQ Quantization Effect Demonstration

Memory Savings Evaluation

In this section, I showcase the performance of Llama-3.1-8B-Instruct running on an A100 GPU in two scenarios: at its original 16-bit precision and after VPTQ quantization. My prompt is a math problem: calculating the sum of the numbers from 1 to 100. Both models produce the correct result. After quantization, the model's memory consumption is only 17% of that of the non-quantized model.

Code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load VPTQ-quantized model directly from HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-8B-Instruct", device_map="auto")
#model = AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft", device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-8B-Instruct")
#tokenizer = AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft")

# Simple inference
prompt = "Explain: Do not go gentle into that good night."
output = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device))
print(tokenizer.decode(output[0], skip_special_tokens=True))

Base Model:

Quantized Model:

Accuracy Evaluation

Quantized models' performance depends on the quantization method and parameter choices. In practice, VPTQ-quantized models often maintain accuracy levels comparable to their original 16-bit counterparts in specific tasks.

Code:

# Install the evaluation harness, download both models, then run MMLU (notebook cells)
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness && cd lm-evaluation-harness && pip install -e .

models = ["VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft",  # 2 bits
          "meta-llama/Llama-3.1-8B-Instruct",                              # 16 bits
          ]

for m in models:
    !HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download {m} --exclude *.pth

for m in models:
    !lm_eval --model hf --model_args pretrained={m},dtype=float16 --tasks mmlu --device cuda:0 --num_fewshot 0 --batch_size auto --output_path ./eval/

Output Analysis:

Quantized model: VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft (2-bit)
Non-quantized model: meta-llama/Llama-3.1-8B-Instruct (16-bit)

MMLU Accuracy        Quantized Model    Non-Quantized Model    Gap
Overall              63.88%             68.09%                 4.21 pp
Humanities           58.34%             64.51%                 6.17 pp
STEM                 56.52%             58.71%                 2.19 pp
Social Sciences      73.19%             76.93%                 3.74 pp
Other                70.52%             74.28%                 3.76 pp

Summary: The non-quantized model is 4.21 percentage points more accurate overall. The largest gap is in the Humanities domain (6.17 pp); the smallest is in STEM (2.19 pp).

Understanding Key Concepts: Centroids, Codebooks, and Centroid Quantity

Before diving into the VPTQ quantization process, it's essential to understand several key concepts: Centroids, Codebooks, and Centroid Quantity (k). To illustrate these concepts more intuitively, let's use the analogy of a fruit merchant.

Centroids

Analogy:

Imagine you're a fruit merchant dealing with various fruits: apples, oranges, bananas, grapes, etc. To manage and sell them more efficiently, you decide to categorize these fruits based on their characteristics (color, size, shape, taste). For each category, you select one fruit that best represents the group—this representative fruit is called the centroid.

Mathematical Understanding:

In data processing, a centroid is the center point of a cluster of similar data, representing the common features of that group. In machine learning, centroids are often obtained through clustering algorithms such as k-means clustering.

Codebooks

Analogy:

To efficiently manage your fruit categories, you record all the representative fruits and their corresponding category numbers in a booklet. This booklet is the codebook.

Role in the Model:

In model quantization, the codebook stores all the centroids, each with a unique index (code). During model inference, these indices can be used to quickly retrieve the corresponding centroids to reconstruct approximate model parameters.

Centroid Quantity (k)

Meaning:

The centroid quantity k represents the number of categories into which you have divided the fruits. A larger k means more categories with finer distinctions (each category has similar fruits), while a smaller k means fewer categories with broader groupings (each category contains fruits with more differences).

Role in the Model:

In model quantization, the choice of centroid quantity k affects both the compression ratio and the model's accuracy:

  • Larger k: Provides a better representation of weight distributions, resulting in higher accuracy but requires more memory to store the centroids.
  • Smaller k: Improves memory efficiency and compression ratio but may introduce more quantization errors, potentially reducing accuracy.
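
To make these concepts concrete, here is a tiny, hypothetical NumPy example (not VPTQ's actual implementation): the codebook holds k centroids, each weight vector is stored only as the index of its nearest centroid, and reconstruction is a simple lookup.

import numpy as np

k, v = 4, 2                                       # centroid quantity k, vector length v
codebook = np.array([[0.0, 0.0],                  # toy codebook: k centroids of length v
                     [0.5, 0.5],
                     [1.0, 1.0],
                     [1.5, 1.5]])
vectors = np.array([[0.1, -0.1], [0.9, 1.1], [1.4, 1.6]])  # weight vectors to quantize

# Assign each vector to its nearest centroid; the index is all we store per vector
dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
indices = dists.argmin(axis=1)

# Reconstruction during inference is a codebook lookup
reconstructed = codebook[indices]
print(indices, reconstructed)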

Detailed Steps of VPTQ Quantization

The VPTQ quantization process can be broken down into the following primary steps:

1. Reshape and Group

Operation:

Reshape the model's weight parameter matrix into a series of small vectors based on a fixed vector length v (e.g., v = 8, so every 8 weights form a vector).

Purpose:

This reshaping converts high-dimensional weight matrices into smaller vector groups suitable for Vector Quantization (VQ). By processing vectors instead of individual scalars, VQ can capture correlations between weights, leading to better quantization performance.

2. Clustering

Operation:

Cluster the small vectors obtained from the previous step using a clustering algorithm (e.g., k-means clustering), grouping similar vectors together. Each cluster's central vector is called the centroid.

Core Step:

Clustering is the core step of VPTQ, determining the effectiveness of model quantization. The goal is to minimize the Euclidean distance between the vectors and their assigned centroids, reducing quantization error. During clustering, parameter importance information (e.g., from the Hessian matrix) can be used to perform Hessian-weighted k-means clustering, ensuring that important parameters receive more precise quantization.

3. Constructing the Codebook

Operation:

Store all the centroids and their corresponding indices in a codebook. During model inference, you can quickly retrieve the corresponding centroids using indices to reconstruct approximate weight values.

Role during Inference:

The model reconstructs the weights by looking up centroids from the codebook based on indices. This process involves simple lookups and additions, making inference efficient despite the low-bit representation.
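
To tie steps 1-3 together, here is a simplified sketch that uses scikit-learn's k-means as the clustering algorithm. It is only an illustration of the idea; VPTQ itself uses Hessian-weighted k-means and optimized kernels, and the weight matrix here is toy-sized.

import numpy as np
from sklearn.cluster import KMeans

v, k = 8, 256                                        # vector length and centroid quantity
W = np.random.randn(1024, 1024).astype(np.float32)   # toy weight matrix

# Step 1: Reshape and group -- every v consecutive weights form one vector
vectors = W.reshape(-1, v)                           # shape: (num_vectors, v)

# Step 2: Clustering -- k-means finds k centroids over the vectors
kmeans = KMeans(n_clusters=k, n_init=1, max_iter=25).fit(vectors)

# Step 3: Codebook -- store the centroids once, plus one index per vector
codebook = kmeans.cluster_centers_                   # shape: (k, v)
indices = kmeans.labels_.astype(np.uint8)            # log2(256) = 8 bits per index -> 1 bit per weight

# Inference-time reconstruction: look up centroids, then reshape back to the weight matrix
W_hat = codebook[indices].reshape(W.shape)
print("quantization MSE:", np.mean((W - W_hat) ** 2))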

4. Residual Vector Quantization (RVQ)

Purpose:

Residual Vector Quantization (RVQ) is used to further refine the quantization process. RVQ quantizes the residual errors that remain after the initial quantization, enabling high accuracy with minimal bit overhead.

Operation:

  • Calculate Residuals: Compute the difference between the original vectors and their corresponding centroids: Residual r = Original Vector v - Centroid c.
  • Second-stage Clustering: Apply vector quantization to the residuals using a secondary codebook.
  • Repeat if Necessary: Multiple stages of RVQ can be applied to iteratively minimize residual errors.

Advantages of RVQ:

  • Improved Accuracy: By capturing residual errors, RVQ enhances the model's ability to represent weights accurately.
  • Minimal Bit Overhead: Although additional codebooks are used, the overall increase in bitwidth is minimal, maintaining a good compression ratio.
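
A single RVQ stage can be sketched as follows: cluster the residuals with a second codebook and add its centroids back during reconstruction. Again, this is a simplified illustration on toy data, not the VPTQ implementation.

import numpy as np
from sklearn.cluster import KMeans

# Toy first-stage quantization (same idea as the earlier sketch)
vectors = np.random.randn(20000, 8).astype(np.float32)
stage1 = KMeans(n_clusters=256, n_init=1, max_iter=25).fit(vectors)
codebook, indices = stage1.cluster_centers_, stage1.labels_

# RVQ: quantize the residuals r = v - c with a second codebook
residuals = vectors - codebook[indices]
stage2 = KMeans(n_clusters=256, n_init=1, max_iter=25).fit(residuals)
res_codebook, res_indices = stage2.cluster_centers_, stage2.labels_

# Reconstruction adds the residual centroid on top of the first-stage centroid
approx = codebook[indices] + res_codebook[res_indices]
print("MSE after stage 1:", np.mean(residuals ** 2))
print("MSE after stage 2:", np.mean((vectors - approx) ** 2))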

Understanding the Bit Calculation in VPTQ

In VPTQ quantization, it's important to understand how many bits each weight occupies after quantization. This depends on the centroid quantity K and vector length v.

1. Basic Calculation Method

  • Index Bitwidth: To represent the centroid indices, the number of bits required per index is log2(K).
  • Bits per Weight: Each vector contains v weights, so bits per weight = log2(K) / v.

2. Example Calculations

Example 1:

  • Vector Length (v): 8
  • Centroid Quantity (K): 256
  • Index Bitwidth: log2(256) = 8 bits
  • Bits per Weight: 8 bits / 8 = 1 bit per weight

Example 2:

  • Vector Length (v): 8
  • Centroid Quantity (K): 65,536
  • Index Bitwidth: log2(65,536) = 16 bits
  • Bits per Weight: 16 bits / 8 = 2 bits per weight

Example 3:

  • Vector Length (v): 16
  • Centroid Quantity (K): 256
  • Index Bitwidth: log2(256) = 8 bits
  • Bits per Weight: 8 bits / 16 = 0.5 bits per weight

If Residual Vector Quantization (RVQ) is used, additional bits are required to store the residual indices. For example:

  • Residual Centroid Quantity (K_res): 256
  • Residual Index Bitwidth: log2(256) = 8 bits
  • Residual Bits per Weight: 8 bits / v
  • Total Bits per Weight: initial bits per weight + residual bits per weight
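
These calculations are easy to reproduce in code. The small helper below is just a convenience for checking the numbers above; it is not part of the VPTQ toolchain.

import math

def bits_per_weight(v, k, k_res=None):
    # bits per weight = log2(K) / v, plus log2(K_res) / v if RVQ is used
    bits = math.log2(k) / v
    if k_res is not None:
        bits += math.log2(k_res) / v
    return bits

print(bits_per_weight(8, 256))         # Example 1: 1.0 bit per weight
print(bits_per_weight(8, 65536))       # Example 2: 2.0 bits per weight
print(bits_per_weight(16, 256))        # Example 3: 0.5 bits per weight
print(bits_per_weight(8, 65536, 256))  # with RVQ: 2.0 + 1.0 = 3.0 bits per weight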

3. Summary

By adjusting the centroid quantity K, residual centroid quantity K_res, and vector length v, we can balance between compression ratio and model accuracy:

  • Larger K and K_res: Improves model accuracy but increases bits per weight and memory consumption.
  • Smaller K and K_res: Enhances compression ratio but may reduce model accuracy due to higher quantization errors.

Memory Savings and Performance Evaluation of VPTQ Models

1. Memory Savings

Using the ultra-low-bit quantization of VPTQ, we can significantly reduce a model's memory footprint. For example, a 70-billion-parameter model can be compressed from roughly 140 GB (FP16) to approximately 26 GB with 3-bit quantization, a memory saving of over 80%.

Model Size Estimation Example:

  • Total Model Parameters: 70 billion
  • Bits per Weight:
    • Initial Quantization: 2 bits per weight
    • Residual Quantization: 1 bit per weight
    • Total: 2 + 1 = 3 bits per weight
  • Model Size Calculation: Model Size = (70,000,000,000 parameters × 3 bits) / 8 bits per byte = 26.25 GB

Note: This estimation excludes the size of the codebooks and potential overheads from storing indices.
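
The same estimate in code (codebook size and index-packing overheads ignored, as noted above):

params = 70_000_000_000          # 70B parameters
bits_per_weight = 2 + 1          # 2-bit initial quantization + 1-bit residual quantization
size_gb = params * bits_per_weight / 8 / 1e9
print(f"{size_gb:.2f} GB")       # 26.25 GB, versus roughly 140 GB at FP16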

2. Inference Speed and Memory Consumption

Inference Speed:

  • VPTQ models may experience a slight decrease in inference speed compared to other quantization methods due to the additional computation needed to reconstruct weights.
  • The overhead is minimal since it primarily involves simple lookups and additions during inference.

Memory Consumption:

  • The memory usage of quantized models is significantly reduced, allowing larger models to run on hardware with limited resources.
  • For instance, a VPTQ-quantized model may consume only 17% of the memory required by its unquantized counterpart.

Hessian and Inverse Hessian Matrices in VPTQ

In VPTQ, we introduce the Hessian and Inverse Hessian matrices to assess parameter importance and correct quantization errors. The quantization process is guided by a second-order optimization framework, where the impact of quantization is minimized based on the model's sensitivity to changes in weights.

1. Role of the Hessian Matrix

Analogy:

The Hessian matrix is like a map indicating which parameters in the model have the most significant impact on performance—similar to knowing which fruits are most valuable in our fruit merchant analogy.

Technical Details:

  • Definition: The Hessian matrix contains the second-order derivatives of the loss function with respect to the model parameters, capturing the curvature of the loss landscape. Its diagonal entries are H_ii = ∂²L / ∂θ_i², where L is the loss function and θ_i is the i-th parameter.
  • Interpretation: A larger H_ii value indicates that changes in θ_i have a significant impact on the loss, making θ_i an important parameter.
  • Use in Quantization: By identifying important parameters using the Hessian diagonal, we can prioritize them during quantization to minimize performance degradation.

2. Role of the Inverse Hessian Matrix

Analogy:

The inverse Hessian matrix acts as a tool to precisely adjust parameters when they are perturbed due to quantization errors, similar to how a fruit merchant might carefully handle valuable fruits to prevent damage.

Technical Details:

  • Definition: The inverse of the Hessian's diagonal elements, H_inv_ii = 1 / H_ii, provides a measure of how to correct quantization errors.
  • Error Correction: The quantization error δ_i of each parameter can be corrected using Δθ_i = - H_inv_ii × δ_i.
  • Result: By applying this correction, especially to important parameters, we can reduce the negative impact of quantization on the model's performance.

3. Detailed Procedure

Step 1: Compute the Hessian Matrix

  • Operation:
    • Perform forward and backward passes to collect gradient information.
    • Compute the second derivatives of the loss function with respect to each parameter.
  • Outcome:
    • A Hessian matrix (or its diagonal approximation) indicating the importance of each parameter.
Step 2: Use the Hessian to Guide Clustering and Quantization

  • Weighted Clustering:
    • Apply Hessian-weighted k-means clustering during the vector quantization process.
    • Important parameters (those with higher H_ii values) are given more weight, ensuring they are represented more accurately in the codebook.

Step 3: Compute the Inverse Hessian Matrix

  • Operation:
    • Calculate the inverse of the Hessian diagonal elements.
  • Outcome:
    • An inverse Hessian matrix used for error correction after quantization.

Step 4: Quantize Parameters and Correct Errors

  • Quantization:
    • Quantize the parameters using the codebook indices, resulting in quantized weights θ_quantized_i.
  • Compute Quantization Errors:
    • Calculate the error for each parameter: δ_i = θ_quantized_i - θ_original_i
  • Error Correction:
    • Apply corrections using the inverse Hessian: Δθ_i = - H_inv_ii × δ_i
  • Update Parameters:
    • Obtain corrected parameters: θ_corrected_i = θ_quantized_i + Δθ_i
  • Result:
    • The corrected parameters minimize the impact of quantization errors, especially for the most important weights.
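
The sketch below illustrates Steps 2-4 under simplifying assumptions: the Hessian diagonal is a toy random matrix, its per-vector mean is used as the k-means sample weight, and the inverse-diagonal correction is applied element-wise. It is a conceptual illustration only, not the VPTQ implementation.

import numpy as np
from sklearn.cluster import KMeans

v, k = 8, 256
weights = np.random.randn(20000, v)                   # toy weight vectors
hess_diag = 1.0 + np.abs(np.random.randn(20000, v))   # toy Hessian diagonal (one H_ii per weight)

# Step 2: Hessian-weighted k-means -- more important vectors carry more weight in clustering
importance = hess_diag.mean(axis=1)                   # one importance score per vector
km = KMeans(n_clusters=k, n_init=1, max_iter=25)
km.fit(weights, sample_weight=importance)
codebook, idx = km.cluster_centers_, km.labels_

# Step 3: inverse Hessian diagonal, H_inv_ii = 1 / H_ii
h_inv = 1.0 / hess_diag

# Step 4: quantize, measure the error, and apply the correction Δθ_i = -H_inv_ii × δ_i
quantized = codebook[idx]
delta = quantized - weights                           # δ_i = θ_quantized_i - θ_original_i
corrected = quantized - h_inv * delta                 # θ_corrected_i = θ_quantized_i + Δθ_i

print("error before correction:", np.mean(delta ** 2))
print("error after correction :", np.mean((corrected - weights) ** 2))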

Quantization Steps and Considerations in Practice

1. Example Quantization Command

Below is an example command for performing VPTQ quantization, following the guidelines from the VPTQ GitHub repository:

CUDA_VISIBLE_DEVICES=0 python run_vptq.py \  
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \  
    --output_dir outputs/Meta-Llama-3.1-8B-Instruct/ \  
    --vector_lens -1 8 \  
    --group_num 1 \  
    --num_centroids -1 65536 \  
    --num_res_centroids -1 256 \  
    --npercent 0 \  
    --blocksize 128 \  
    --new_eval \  
    --seq_len 8192 \  
    --kmeans_mode hessian \  
    --num_gpus 1 \  
    --enable_perm \  
    --enable_norm \  
    --save_model \  
    --save_packed_model \  
    --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \  
    --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \  
    --ktol 1e-5 \  
    --kiter 100  

Parameter Explanations:

  • --model_name: Specifies the model to be quantized.
  • --vector_lens -1 8: Sets the vector length v = 8.
  • --num_centroids -1 65536: Sets the number of centroids K = 65,536.
  • --num_res_centroids -1 256: Sets the number of residual centroids K_res = 256.
  • --kmeans_mode hessian: Uses Hessian-weighted k-means clustering.
  • --hessian_path and --inv_hessian_path: Specify paths to precomputed Hessian and inverse Hessian matrices.
  • Other parameters control aspects like sequence length, block size, and whether to enable normalization or permutation.

2. Considerations

  • Centroid Quantity Limitations:
    • Due to CUDA kernel limitations, using more than 4096 centroids can cause illegal memory access errors.
    • It's recommended to set --num_centroids and --num_res_centroids to 4096 or fewer unless the code supports higher values.
  • Hardware Resources:
    • The quantization process can be computationally intensive.
    • Utilizing multiple GPUs or high-memory GPUs can speed up the process.
  • Parameter Adjustments:
    • Adjust vector length v, centroid quantities K and K_res, and other hyperparameters based on the desired balance between accuracy and compression.
  • Hessian and Inverse Hessian Computation:
    • Computing these matrices can be resource-intensive.
    • Precomputed matrices can be used, or tools like quip-sharp may assist in their computation.
  • RVQ Usage:
    • Although Residual Vector Quantization (RVQ) is optional, it significantly improves accuracy, especially in ultra-low-bit settings.
    • Including RVQ adds complexity but is often worthwhile.
  • Inference Considerations:
    • Quantized models may have slower inference speeds due to overhead in reconstructing weights.
    • Optimizations in the implementation can mitigate this issue.

The quantization process does not consume much GPU memory but is intensive on CUDA/Tensor cores. Quantization is performed layer by layer on the model. If multiple GPUs are available, it's best to use them; otherwise, the process will be very slow.

Conclusion

VPTQ is an advanced, ultra-low-bit quantization method that allows for compressing large models to 1-2 bits without retraining, while maintaining high performance. By understanding key concepts like centroids, codebooks, centroid quantity, and leveraging techniques like Hessian-weighted clustering and residual vector quantization, we can effectively apply VPTQ in practice to achieve efficient model compression and deployment. As research into quantization methods continues and hardware advances, ultra-low-bit quantization like VPTQ will unlock more possibilities for deploying large models on resource-constrained devices. It offers a promising avenue for keeping up with the rapid growth of model sizes in natural language processing and other fields.

Published Feb 01, 2025
Version 1.0