GPTQ-for-PULSE

4-bit quantization of LLaMA and BLOOM using GPTQ

This repo is modified from GPTQ-for-LLaMa; the basic usage is the same as in that repo.

GPTQ is a state-of-the-art (SOTA) one-shot weight quantization method.

This version targets the fastest speed, but it relies on both Triton and CUDA. Triton only supports Linux, so Windows users should use WSL2.
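Before installing, it can help to confirm that the host is one where Triton can run. The following is a minimal sketch using only the Python standard library; the function name `triton_platform_status` is hypothetical and not part of this repo, and the WSL check assumes the kernel release string contains "microsoft", as WSL kernels usually report.

```python
import platform


def triton_platform_status(system=None, release=None):
    """Classify the host for Triton use: 'linux', 'wsl', or 'unsupported'."""
    system = platform.system() if system is None else system
    release = platform.release() if release is None else release
    if system != "Linux":
        return "unsupported"
    # Assumption: WSL kernels report a release string containing "microsoft".
    if "microsoft" in release.lower():
        return "wsl"
    return "linux"


if __name__ == "__main__":
    print(triton_platform_status())
```

A result of `unsupported` on Windows suggests running the same check again from inside a WSL2 shell.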

News or Update

Added support for 4-bit quantization of PULSE models fine-tuned with LoRA.
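For context, LoRA stores a fine-tuned layer as a frozen weight matrix W plus a low-rank update, and the standard way to fold the update back in is W + (alpha / r) * B A. The sketch below shows that merge in plain Python on small lists. Whether `bloom_lora.py` merges the adapter this way before quantizing is an assumption, not something this README states, and the helper names `matmul` and `merge_lora` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge.

    Shapes: weight is d_out x d_in, lora_b is d_out x rank,
    lora_a is rank x d_in.
    """
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0], [2.0]]        # d_out x rank, with rank = 1
    a = [[0.5, 0.25]]         # rank x d_in
    print(merge_lora(base, a, b, alpha=2, rank=1))
```

Once merged, the layer is an ordinary dense weight matrix, so it can be quantized like any other layer.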

Result

PULSE results are evaluated on the medical dataset.

Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.
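To see how much RAM and swap are available before starting a quantization run, one option on Linux (including WSL2) is to read `/proc/meminfo`. This is a minimal sketch; the helper names are hypothetical, and it does not estimate how much memory a given model needs.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kiB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def ram_and_swap_gib(text):
    """Return (available RAM, free swap) in GiB from meminfo text."""
    info = parse_meminfo(text)
    kib_per_gib = 1024 ** 2
    return (info.get("MemAvailable", 0) / kib_per_gib,
            info.get("SwapFree", 0) / kib_per_gib)


if __name__ == "__main__":
    # /proc/meminfo only exists on Linux, so skip quietly elsewhere.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            ram, swap = ram_and_swap_gib(f.read())
        print(f"available RAM: {ram:.1f} GiB, free swap: {swap:.1f} GiB")
```

If the process is killed for running out of memory, adding swap space and re-running this check is a quick way to confirm the new swap is active.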

Depending on the GPUs and drivers, there may be a difference in performance, which decreases as the model size increases (IST-DASLab/gptq#1).

According to the GPTQ paper, as the size of the model increases, the difference in performance between FP16 and GPTQ decreases.

Installation

If you don't have conda, install it first.

conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio

git clone -b pulse https://github.com/hanrui1sensetime/GPTQ-for-PULSE.git
cd GPTQ-for-PULSE
pip install -r requirements.txt
python setup_cuda.py install

Dependencies

torch: tested on v2.0.0+cu117
transformers: tested on v4.34.0
datasets: tested on v2.13.1
safetensors: tested on v0.3.1
peft: tested on v0.7.0

All experiments were run on a single NVIDIA RTX3090.
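To compare an environment against the tested versions listed above, a short standard-library check can be used. This is a sketch rather than part of the repo; `compare_versions` is a hypothetical helper, and a differing version is not necessarily a problem, only a deviation from what was tested.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions this README reports as tested.
TESTED = {
    "torch": "2.0.0",
    "transformers": "4.34.0",
    "datasets": "2.13.1",
    "safetensors": "0.3.1",
    "peft": "0.7.0",
}


def compare_versions(tested=TESTED):
    """Map each package to (tested, installed); installed is None if missing."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions().items():
        # Strip a local build tag such as "+cu117" before comparing.
        base = installed.split("+")[0] if installed else None
        status = "ok" if base == wanted else "differs"
        print(f"{name}: tested {wanted}, installed {installed} ({status})")
```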

Language Generation

PULSE

The PULSE-7B model is based on BLOOMZ.

# Generate 4-bit PULSE-7B model
CUDA_VISIBLE_DEVICES=0 python bloom.py ${MODEL_DIR} custom --wbits 4 --act-order --groupsize 128 --save pulse7b-4bit-128g.bin --calib_data ${CALIB_DATA_PATH}

# Generate 4-bit PULSE-7B model with LoRA weights
CUDA_VISIBLE_DEVICES=0 python bloom_lora.py ${MODEL_DIR} custom --wbits 4 --act-order --groupsize 128 --save pulse7b-4bit-128g.bin --calib_data ${CALIB_DATA_PATH} --peft_path ${PEFT_PATH}
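To illustrate what `--wbits 4` and `--groupsize 128` control, the sketch below quantizes one row of weights in groups of 128, giving each group its own scale and offset. It uses plain round-to-nearest, not GPTQ: GPTQ additionally adjusts the remaining weights to compensate for each rounding error, which this example leaves out. All function names are hypothetical and are not taken from this repo.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = (1 << bits) - 1                 # 15 for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0        # avoid a zero scale for constant groups
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, bits=4, groupsize=128):
    """Quantize a row in consecutive groups, each with its own scale and offset."""
    return [quantize_group(row[i:i + groupsize], bits)
            for i in range(0, len(row), groupsize)]


def dequantize_row(groups):
    """Reassemble a row from its quantized groups."""
    out = []
    for codes, scale, lo in groups:
        out.extend(dequantize_group(codes, scale, lo))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    row = [rng.gauss(0.0, 0.02) for _ in range(256)]
    groups = quantize_row(row, bits=4, groupsize=128)
    restored = dequantize_row(groups)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(f"{len(groups)} groups, worst absolute error {worst:.6f}")
```

Smaller groups track the local weight range more closely at the cost of storing more scales and offsets, which is the trade-off the group size setting exposes.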

Acknowledgements

This code is based on GPTQ-for-LLaMa.

Thanks to Meta AI for releasing LLaMA, a powerful LLM.

The Triton GPTQ kernel code is based on GPTQ-triton.

