
Bugs when fine-tuning tiny-llama with instructions using tiny-llama's conversation template #2992

Closed
hychaochao opened this issue Jan 31, 2024 · 7 comments


@hychaochao

Thanks for your great work! I ran into some problems when using train_with_template.py to fine-tune TinyLlama with TinyLlama's conversation template.
This is my script:

torchrun --nproc_per_node=1 --master_port=20001 train_with_template.py \
    --model_name_or_path .../tinyllama  \
    --data_path /data/dummy_conversation.json \
    --bf16 True \
    --output_dir /tinyllama-test \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "steps" \
    --eval_steps 1500 \
    --save_strategy "steps" \
    --save_total_limit 8 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True 

However, the tokenization mismatch warning was reported, and the training loss was always 0.

WARNING: tokenization mismatch: 132 vs. 124. (ignored)
Some of the output:
{'loss': 0.0, 'learning_rate': 1.8477344278896708e-05, 'epoch': 0.21}
 21%|██        | 26/123 [00:25<01:27,  1.11it/s]
 22%|██▏       | 27/123 [00:26<01:26,  1.12it/s]
{'loss': 0.0, 'learning_rate': 1.833313919082515e-05, 'epoch': 0.22}
 22%|██▏       | 27/123 [00:26<01:26,  1.12it/s]
 23%|██▎       | 28/123 [00:27<01:24,  1.12it/s]
{'loss': 0.0, 'learning_rate': 1.818302775908169e-05, 'epoch': 0.23}

Just like #2871.
I've confirmed that I'm using tinyllama's template when training, but it still doesn't work.
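For context: this pairing of the mismatch warning with a loss of exactly 0 matches FastChat's supervised preprocessing, where a sample whose per-turn token count disagrees with its whole-conversation token count has every label masked out. A rough Python sketch of that path (paraphrased, not the literal source):

import torch

IGNORE_TOKEN_ID = -100  # label value that transformers' CrossEntropyLoss ignores

def mask_on_mismatch(target: torch.Tensor, cur_len: int, total_len: int) -> torch.Tensor:
    # Sketch: if the length rebuilt turn by turn (cur_len) disagrees with the length
    # of the fully tokenized conversation (total_len), mask every label so the sample
    # is skipped by the loss. When all samples hit this branch, the reported training
    # loss stays at 0.0.
    if cur_len != total_len:
        target[:] = IGNORE_TOKEN_ID
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
    return target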

@congchan
Contributor

congchan commented Feb 1, 2024

Hi @hychaochao, could you help confirm whether this PR #2996 fixes the issue?

@hychaochao
Author

Hi @hychaochao, could you help confirm whether this PR #2996 fixes the issue?

Yes, it works! Thanks for your great work again!
These are my training arguments:

torchrun --nproc_per_node=1 --master_port=20001 train.py \
    --model_name_or_path .../tinyllama  \
    --data_path FastChat-main/data/dummy_conversation.json \
    --bf16 True \
    --output_dir .../tinyllama-test \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --eval_steps 1500 \
    --save_strategy "steps" \
    --save_total_limit 8 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True

And this is part of the output:

0%|          | 1/500 [00:01<11:28,  1.38s/it]
                                               
{'loss': 3.9269, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}

  0%|          | 1/500 [00:01<11:28,  1.38s/it]
  0%|          | 2/500 [00:01<07:11,  1.15it/s]
                                               
{'loss': 3.0636, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}

  0%|          | 2/500 [00:01<07:11,  1.15it/s]
  1%|          | 3/500 [00:02<05:55,  1.40it/s]

hychaochao reopened this Feb 2, 2024
@hychaochao
Author

The script I used for testing was wrong: I was testing train.py instead of train_with_template.py. I retested and found that it still doesn't work. These are my training arguments:

torchrun --nproc_per_node=4 --master_port=20001 train_with_template.py \
    --model_name_or_path /home/bingxing2/home/scx6203/luckychao/tinyllama  \
    --data_path /data/dummy_conversation.json \
    --bf16 True \
    --output_dir /weight/tinyllama-test \
    --num_train_epochs 4 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "steps" \
    --eval_steps 1500 \
    --save_strategy "steps" \
    --save_total_limit 8 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --resume_from_checkpoint /tinyllama-chat

And this is part of the output:

WARNING: tokenization mismatch: 80 vs. 74. (ignored)
WARNING: tokenization mismatch: 136 vs. 128. (ignored)
WARNING: tokenization mismatch: 61 vs. 57. (ignored)
.........
1%|          | 1/124 [00:03<07:00,  3.42s/it]
                                               
{'loss': 0.0, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.03}

  1%|          | 1/124 [00:03<07:00,  3.42s/it]
  2%|▏         | 2/124 [00:05<05:35,  2.75s/it]
                                               
{'loss': 0.0, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.06}

  2%|▏         | 2/124 [00:05<05:35,  2.75s/it]
  2%|▏         | 3/124 [00:07<05:07,  2.54s/it]

@hychaochao
Author

@congchan I also hit the same error when using train_with_template.py to fine-tune Llama-2 with Llama-2's conversation template.
This is my script:

torchrun --nproc_per_node=4 --master_port=20001 train_with_template.py \
    --model_name_or_path .../llama-2  \
    --data_path /data/dummy_conversation.json \
    --bf16 True \
    --output_dir /weight/llama2-test \
    --num_train_epochs 4 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "steps" \
    --eval_steps 1500 \
    --save_strategy "steps" \
    --save_total_limit 8 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --resume_from_checkpoint /llama2-chat \
    --deepspeed '/home/bingxing2/home/scx6203/luckychao/stanford_alpaca/configs/default_offload_opt_param.json'

This is part of the output:

WARNING: tokenization mismatch: 55 vs. 52. (ignored)
WARNING: tokenization mismatch: 103 vs. 99. (ignored)
WARNING: tokenization mismatch: 44 vs. 42. (ignored)
WARNING: tokenization mismatch: 42 vs. 40. (ignored)
WARNING: tokenization mismatch: 98 vs. 94. (ignored)
......
8%|▊         | 1/12 [01:02<11:23, 62.12s/it]
                                              
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.26}

  8%|▊         | 1/12 [01:02<11:23, 62.12s/it]
 17%|█▋        | 2/12 [02:02<10:11, 61.16s/it]
                                              
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.52}

@congchan
Contributor

congchan commented Feb 3, 2024

Hi @hychaochao, I just tested with Llama 2 and TinyLlama, and they both work with this fix:
#3006

Feel free to confirm the results on your data, and let me know if it works. Thank you.

@hychaochao
Author

@congchan Very happy to see that you have fixed the bug! I tried it on my data and it works. Thank you again for such great work and such efficiency!
By the way, I noticed that the conversation template is selected from the name of the model path, which is a bit inconvenient for local models. Maybe you could add a "model_id" parameter to determine the template, just like when training Vicuna; just a small suggestion.
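A rough sketch of that suggestion: the model_id field and resolve_conv_template helper below are hypothetical, not existing FastChat code; get_conv_template and get_conversation_template are the library's existing lookup helpers.

from dataclasses import dataclass, field
from typing import Optional

from fastchat.conversation import get_conv_template
from fastchat.model.model_adapter import get_conversation_template

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default=None)
    # Hypothetical extra field: name the conversation template explicitly
    # (e.g. "llama-2") instead of inferring it from the local model path.
    model_id: Optional[str] = field(default=None)

def resolve_conv_template(args: ModelArguments):
    # Prefer the explicitly requested template; otherwise fall back to the
    # existing path-based lookup.
    if args.model_id:
        return get_conv_template(args.model_id)
    return get_conversation_template(args.model_name_or_path)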

@congchan
Contributor

congchan commented Feb 3, 2024

@congchan Very happy to see that you have fixed the bug! I tried it on my data and it works. Thank you again for such great work and such efficiency! By the way, I noticed that the conversation template is selected from the name of the model path, which is a bit inconvenient for local models. Maybe you could add a "model_id" parameter to determine the template, just like when training Vicuna; just a small suggestion.

Thanks for your suggestions!
