EfficientQAT

Official PyTorch implementation of the paper EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

News

  • [2024/10] 🔥 We release a new weight-activation quantization algorithm, PrefixQuant, the first work in which the performance of static activation quantization surpasses that of dynamic activation quantization.
  • [2024/08] The new inference backend T-MAC from Microsoft now supports EfficientQAT models.
  • [2024/08] We support quantization of Mistral-Large-Instruct. With EfficientQAT, w2g64 quantization compresses the 123B Mistral-Large-Instruct to 35 GB with only a 4-point accuracy degradation.
  • [2024/07] New features! We support transferring EfficientQAT-quantized models into GPTQ v2 and BitBLAS formats, which can be loaded directly through GPTQModel.
  • [2024/07] We release EfficientQAT, which pushes the limits of uniform (INT) quantization in an efficient manner.

Contents

  • Installation
  • Model Zoo
  • Training
  • Inference
  • Model Transferring
  • Inference of Other Formats
  • Citation

Installation

  1. Clone this repository and navigate to the EfficientQAT folder
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
  2. Install packages
conda create -n efficientqat python==3.11

conda activate efficientqat

pip install -r requirements.txt

Model Zoo

We provide a number of pre-quantized EfficientQAT models as follows:

  • WikiText2 PPL is measured with a 2048-token context length.
  • Avg. Accuracy indicates the average accuracy across 5 zero-shot reasoning tasks (WinoGrande, PIQA, HellaSwag, ARC-Easy, ARC-Challenge) with lm-eval v0.4.2.
  • 1 GB = $10^9$ bytes (a rough size estimate is sketched below the table).
  • Hub Link: EQAT indicates the original checkpoints. We also transfer the checkpoints into the GPTQ and BitBLAS formats, which can be loaded directly through GPTQModel. (PS: GPTQModel is an official bug-fixed fork of AutoGPTQ, which will be merged into AutoGPTQ in the future.)
| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
| --- | --- | --- | --- | --- | --- |
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.53 | 64.27 | 3.7 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-7B | w3g128 | 5.81 | 64.02 | 3.1 | EQAT |
| Llama-2-7B | w2g64 | 6.86 | 60.14 | 2.3 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-7B | w2g128 | 7.17 | 59.50 | 2.2 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.93 | 67.52 | 6.8 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-13B | w3g128 | 5.12 | 67.28 | 5.6 | EQAT |
| Llama-2-13B | w2g64 | 5.96 | 64.88 | 4.0 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-13B | w2g128 | 6.08 | 63.88 | 3.8 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.39 | 72.62 | 35.8 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-70B | w3g128 | 3.61 | 71.76 | 29.1 | EQAT |
| Llama-2-70B | w2g64 | 4.52 | 69.48 | 20.1 | EQAT\|GPTQ\|BitBLAS |
| Llama-2-70B | w2g128 | 4.61 | 68.93 | 18.9 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.47 | 68.43 | 5.4 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-8B | w3g128 | 7.09 | 67.35 | 4.7 | EQAT |
| Llama-3-8B | w2g64 | 9.41 | 60.76 | 3.9 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-8B | w2g128 | 9.80 | 59.36 | 3.8 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.17 | 74.57 | 38.9 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-70B | w3g128 | 4.19 | 72.42 | 32.2 | EQAT |
| Llama-3-70B | w2g64 | 6.08 | 67.89 | 23.2 | EQAT\|GPTQ |
| Llama-3-70B | w2g128 | 6.38 | 67.57 | 22.0 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 7.93 | 68.39 | 5.4 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-8B-Instruct | w3g128 | 8.55 | 67.24 | 4.7 | EQAT |
| Llama-3-8B-Instruct | w2g64 | 11.19 | 60.66 | 3.9 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-8B-Instruct | w2g128 | 11.73 | 60.16 | 3.8 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.35 | 73.47 | 38.9 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-70B-Instruct | w3g128 | 5.65 | 72.87 | 32.2 | EQAT |
| Llama-3-70B-Instruct | w2g64 | 7.86 | 67.64 | 23.2 | EQAT\|GPTQ\|BitBLAS |
| Llama-3-70B-Instruct | w2g128 | 8.14 | 67.54 | 22.0 | EQAT\|GPTQ\|BitBLAS |
| Mistral-Large-Instruct-2407 | fp16 | 2.74 | 77.76 | 228.5 | - |
| Mistral-Large-Instruct-2407 | w2g64 | 5.58 | 73.54 | 35.5 | GPTQ |
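
As a rough sanity check on the Model Size column, the table values can be approximated from the bits stored per weight. The sketch below is a back-of-envelope estimate only: it assumes 2-bit weights plus one fp16 scale and one 2-bit zero point per group of 64 weights, with embeddings and the LM head kept in fp16. The parameter split and per-group storage are assumptions for illustration, and real checkpoints also include packing and metadata overhead.

# Back-of-envelope size estimate for a w2g64 Llama-2-7B checkpoint.
# The parameter split and per-group storage below are assumptions, not values from this repo.
def estimate_size_gb(quant_params, fp16_params, wbits=2, group_size=64,
                     scale_bits=16, zero_bits=2):
    bits_per_weight = wbits + (scale_bits + zero_bits) / group_size
    total_bits = quant_params * bits_per_weight + fp16_params * 16
    return total_bits / 8 / 1e9  # 1 GB = 10^9 bytes

# ~6.5B weights in quantized linear layers, ~0.26B fp16 embedding/LM-head parameters (assumed)
print(f"estimated w2g64 Llama-2-7B size: {estimate_size_gb(6.5e9, 0.26e9):.1f} GB")  # ~2.4 GB vs. 2.3 GB in the table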

Training

EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). The detailed training scripts can be found in ./examples. Below we give example training scripts for Llama-2-7B with w2g64 quantization.

  1. Block-AP

You should modify --model in the script to point to the folder of the full-precision model before running the following command.

bash examples/block_ap/Llama-2-7b/w2g64.sh

Specifically, --weight_lr is 2e-5 for 2-bit and 1e-5 for 3-/4-bit quantization in our experiments.

Some other important arguments:

  • --train_size: number of training data samples, 4096 as default
  • --val_size: number of validation data samples, 64 as default
  • --off_load_to_disk: save the training dataset to disk, which saves CPU memory but may reduce training speed
  2. E2E-QP

Then, you can load the quantized model from Block-AP for further E2E-QP. Specifically, E2E-QP can adapt to different scenarios by changing the training dataset. You should modify --quant_model_path in the script to point to the folder of the quantized model before running the following commands.

1) Train on RedPajama

bash examples/e2e_qp/Llama-2-7b/w2g64-redpajama.sh

2) Train on Alpaca

bash examples/e2e_qp/Llama-2-7b/w2g128-alpaca.sh

Specifically, --learning_rate is 2e-5 for 2-bit and 1e-5 for 3-/4-bit quantization in our experiments. You can decrease --per_device_train_batch_size to reduce the memory footprint during training; make sure --gradient_accumulation_steps increases by the same factor to maintain the same effective batch size (see the sketch below).
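
For reference, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of devices, so the two flags can be traded off against each other. The numbers below are hypothetical and only illustrate the bookkeeping; check the e2e_qp scripts for the actual defaults.

# Effective batch size bookkeeping (hypothetical numbers, not the script defaults).
num_gpus = 1
configs = [
    {"per_device_train_batch_size": 4, "gradient_accumulation_steps": 8},
    {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 16},  # half the memory, same effective batch
]
for cfg in configs:
    effective = cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"] * num_gpus
    print(cfg, "-> effective batch size:", effective)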

Inference

  1. Download the pre-quantized EfficientQAT models from Hugging Face
pip install huggingface_hub

huggingface-cli download ChenMnZ/Llama-2-7b-EfficientQAT-w2g64 --local-dir ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64
  2. Evaluate the pre-quantized EfficientQAT model
CUDA_VISIBLE_DEVICES=0 python main_block_ap.py \
--resume_quant ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64 \
--net Llama-2 \
--wbits 2 \
--group_size 64 \
--output_dir ./output/inference_results/ \
--eval_ppl \
--eval_tasks  piqa,arc_easy,arc_challenge,hellaswag,winogrande
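
For context, the WikiText2 PPL numbers in the Model Zoo are reported with a 2048-token context. The snippet below is a minimal sketch of how such a perplexity is typically computed for a standard transformers causal LM; it is illustrative only and is not this repo's evaluation code (the quantized checkpoints themselves are evaluated through main_block_ap.py as shown above), and model_id is a placeholder.

# Minimal WikiText2 perplexity sketch at a 2048-token context (illustrative, not this repo's eval code).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any transformers causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids
seqlen = 2048
nlls = []
for i in range(ids.shape[1] // seqlen):
    window = ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
    with torch.no_grad():
        # labels=window returns the mean next-token negative log-likelihood of the window
        nlls.append(model(window, labels=window).loss.float() * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"WikiText2 PPL @ {seqlen} context: {ppl.item():.2f}")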

Model Transferring

First, install the gptqmodel package to support the GPTQ and BitBLAS quantization formats:

git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
bash install.sh
  • In our experiments, we tested with gptqmodel v0.9.8.

Then, we offer three types of transfer:

  1. Transfer EfficientQAT checkpoints to GPTQ format
bash examples/model_transfer/efficientqat_to_gptq/llama-2-7b.sh
  • Note: AutoGPTQ currently has overflow bugs for asymmetric quantization. Therefore, we use the bug-fixed GPTQModel to transfer our asymmetrically quantized models. As a result, the GPTQ models provided by this repo can only be loaded successfully through GPTQModel, not AutoGPTQ.
  2. Transfer EfficientQAT checkpoints to BitBLAS format
bash examples/model_transfer/efficientqat_to_bitblas/llama-2-7b.sh
  • The speedup currently has some issues; refer to this issue for details.
  3. Transfer fp32 data in EfficientQAT checkpoints to their half-precision counterparts. Some parameters are kept in fp32 for training; after training, you can convert them to half precision to further reduce the model size (see the sketch below).
bash examples/model_transfer/fp32_to_16/llama-2-7b.sh
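
Conceptually, this step just casts the remaining fp32 tensors down to half precision. Below is a minimal sketch of that idea on a generic PyTorch state dict; the file names are placeholders, and for real checkpoints you should use the script above, which knows the EfficientQAT checkpoint layout.

# Generic fp32 -> fp16 cast of a state dict (illustration only; not this repo's script).
import torch

state_dict = torch.load("checkpoint.pth", map_location="cpu")  # placeholder path
for name, tensor in state_dict.items():
    if torch.is_tensor(tensor) and tensor.dtype == torch.float32:
        state_dict[name] = tensor.half()  # halves the storage of each fp32 tensor
torch.save(state_dict, "checkpoint_fp16.pth")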

Inference of Other Formats

Below is an example of inference with the GPTQ or BitBLAS quantized formats.

from transformers import AutoTokenizer
from gptqmodel import GPTQModel

quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ"
# quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-BitBLAS"
# or local path

tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)


# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_dir)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("Model quantization is", return_tensors="pt").to(model.device))[0]))

Citation

If you find this work useful, please consider citing:

@article{efficientqat,
  title={EfficientQAT: Efficient Quantization-Aware Training for Large Language Models},
  author={Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Qiao, Yu and Luo, Ping},
  journal={arXiv preprint arXiv:2407.11062},
  year={2024}
}
