⭐ Support ⭐
- LLMs: BLOOM (e.g., BLOOM-1b7, BLOOMZ-7b1-mt), LLaMA (e.g., LLaMA-7b, LLaMA-13b), LLaMA2 (e.g., LLaMA2-7b, LLaMA2-13b), and ChatGLM (e.g., ChatGLM2-6b)
- Our proposed TIM [run_clm.py] and vanilla instruction tuning [run_clm_sft.py], with RATE set to -1
- LoRA, Tuning with Embedding Fixed, Full Parameters Tuning
- Data-streaming
- Distributed training with DeepSpeed ZeRO stage 1/2/3
- Please refer to our paper for more details.
⭐ Tips ⭐
- [20231215] We added flash-attention for faster training; set --use_flash_attention to activate it.
- [20230914] We updated the preference loss function of TIM, which makes training more stable.
- [20230914] We fixed a bug when using the data cache (i.e., --streaming=False) for training.
- When data streaming is turned on, it is recommended to shuffle the training data first.
- When training with DeepSpeed ZeRO stage 1/2, you can set --use_low_cpu_mem=True to reduce memory usage.
- After training a model with DeepSpeed ZeRO stage 3, use sft_reward_training/change_param_name.py to convert the model's parameter names before inference (see the sketch after this list).
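A minimal sketch of how these flags fit into a launch; the paths are illustrative, and the arguments passed to change_param_name.py are placeholders rather than the script's confirmed interface:

# flash-attention and reduced CPU memory usage (DeepSpeed ZeRO stage 1/2)
deepspeed run_clm.py --deepspeed deepspeed_config/ds_config_stage2.json \
  --use_flash_attention --use_low_cpu_mem=True ...   # plus the usual model/data arguments

# after ZeRO stage 3 training: convert parameter names before inference
# (placeholder arguments; see the script for its actual interface)
python sft_reward_training/change_param_name.py <stage3_checkpoint_dir> <converted_dir>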
We develop TIM with Hugging Face's transformers and DeepSpeed-Chat.
Requirements:
- Python 3.7.9
- Pytorch 1.10.0+cu111
- Transformers 4.28
- accelerate==0.19.0
- numpy==1.22.4
- deepspeed==0.9.0
- scikit-learn
- flash-attn==2.0.1
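A hedged install sketch based on the versions above (the exact transformers patch release and the CUDA 11.1 wheel index are assumptions; adjust for your environment):

# pinned dependencies as listed above
pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers==4.28.0 accelerate==0.19.0 numpy==1.22.4 deepspeed==0.9.0 scikit-learn
pip install flash-attn==2.0.1   # only needed when training with --use_flash_attention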
Training data: train_data/alpaca_reward.json, train.wmt_hint_dict_revall_alpaca_lm1b.json
An essential ingredient of our method is the construction of samples that provide comparison signals for model learning. In addition to regular translation data, we construct comparison data by introducing dictionary information or translation errors.
Test data: test_data/wmt22, test_data/flores200
We modify add_noise.py in noisy-text.
We use the following setting in our paper:
python add_noise.py data/example --delete_probability 0.15 --replace_probability 0.15 --filler_token '' --permutation_range 1
Then, you can run [run_reward.sh] to get the final training data for TIM.
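Putting the two steps together, a sketch of the data-preparation pipeline (the input path is the example above, and run_reward.sh is assumed to take no arguments; see the script for its actual inputs and outputs):

# 1. introduce synthetic translation errors (settings from the paper)
python add_noise.py data/example --delete_probability 0.15 --replace_probability 0.15 \
  --filler_token '' --permutation_range 1
# 2. assemble the final TIM training data, e.g., train_data/alpaca_reward.json
bash run_reward.sh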
We modify run_clm.py and the Trainer in transformers, and the utils for LoRA in DeepSpeed-Chat.
In addition to vanilla fine-tuning of all model parameters, parameter-efficient fine-tuning methods such as prefix tuning and LoRA have been proposed specifically for large language models.
We adopt three different strategies for tuning the models, listed below from the fewest to the most fine-tuned parameters.
(1) LoRA: Tuning with Low-rank Matrices
LORA_MODULE_NAME="query_key_value" # for BLOOM
LORA_MODULE_NAME="q_proj,k_proj,v_proj,o_proj" # for Llama
--only_optimize_lora # if set, only optimize the LoRA parameters
--lora_dim 8
--lora_alpha 16
--lora_droppout 0.05
--lora_module_name ${LORA_MODULE_NAME}
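These flags plug into the training launch; a hedged sketch for LoRA-tuning a LLaMA-style model, where the model path, data file, and output directory are placeholders and the standard transformers run_clm.py arguments are assumed to apply:

LORA_MODULE_NAME="q_proj,k_proj,v_proj,o_proj"
deepspeed run_clm.py \
  --model_name_or_path /path/to/llama-7b \
  --train_file train_data/alpaca_reward.json \
  --output_dir output/tim-llama-7b-lora \
  --deepspeed deepspeed_config/ds_config.json \
  --lora_module_name ${LORA_MODULE_NAME} \
  --lora_dim 8 \
  --lora_alpha 16 \
  --lora_droppout 0.05 \
  --only_optimize_lora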
(2) FixEmb: Tuning with Embedding Fixed
--only_optimize_layers "9" "8" "7" "6" "5" "4" "3" "2" "1" "0"
(3) Full: Tuning with Full Parameters
- deepspeed_config/ds_config.json, deepspeed_config/ds_config_stage2.json, deepspeed_config/ds_config_stage3.json
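For full-parameter tuning, the ZeRO stage is selected by pointing --deepspeed at one of these config files; a sketch with illustrative paths (the three configs presumably correspond to ZeRO stages 1/2/3):

deepspeed run_clm.py \
  --model_name_or_path /path/to/bloom-1b7 \
  --train_file train_data/alpaca_reward.json \
  --output_dir output/tim-bloom-1b7-full \
  --deepspeed deepspeed_config/ds_config_stage3.json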
Inference scripts: inference/infer_bloom.py, inference/infer_llama.py
-l # using LoRA
--rootmodel # if using LoRA, the path of the foundation model
--ifhint # add a note indicating that there are no mistakes in the hypothesis
--ifsample # if true, use sampling instead of beam search for inference
--ifreranking # use the preference score to select the preferred hypothesis among the candidates
--vocab # the dictionary for dict-guided inference
--reverse # whether to reverse the source and target languages when loading the dictionary
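A hedged inference sketch combining these options for a LoRA-tuned LLaMA model; the paths are placeholders, and the checkpoint/test-file arguments that infer_llama.py additionally expects are omitted here:

# load a LoRA checkpoint on top of the foundation model given by --rootmodel,
# add the hint note, and rerank candidates with the preference score
python inference/infer_llama.py \
  -l \
  --rootmodel /path/to/llama-7b \
  --ifhint \
  --ifreranking \
  --vocab /path/to/dictionary \
  --reverse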
We evaluate TIM's performance on the WMT and FLORES-200 dev-test tasks, comprising four language pairs.
### Citation
Please kindly cite our paper if you find it helpful:
@article{zeng2023tim,
  title={TIM: Teaching LM to Translate with Comparison},
  author={Jiali Zeng and Fandong Meng and Yongjing Yin and Jie Zhou},
  journal={arXiv preprint arXiv:2307.04408},
  year={2023},
  url={https://arxiv.org/pdf/2307.04408.pdf}
}