Custom scripts for training LLaMA

News (pinned)

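🔥 Fine-grained handling of sample length for instruction fine-tuning.
🔥 The data processing code supports sampling each data source by ratio before merging, and reports statistics such as token counts. Arrow file size has been optimized.
🔥 Supports both pre-training and instruction fine-tuning, with either full-parameter training or LoRA.
🔥 Revised the vLLM deployment script to combine the generate interface (for use with LangChain) and the OpenAI interface. It also provides a discriminative-style call that returns next-token probabilities.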
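
The ratio-based sampling and token statistics mentioned in the news items can be pictured roughly as follows. This is a minimal sketch, not the repository's code: `sample_by_ratio`, `count_tokens`, and the whitespace tokenizer used in the usage line are hypothetical names chosen for illustration, and ratios are assumed to be fractions in [0, 1].

```python
import random
from typing import Callable, Dict, List


def sample_by_ratio(
    sources: Dict[str, List[str]],
    ratios: Dict[str, float],
    seed: int = 0,
) -> List[str]:
    """Keep a fraction of each source (ratio in [0, 1]) and merge the results."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    merged: List[str] = []
    for name, samples in sources.items():
        ratio = ratios.get(name, 1.0)  # sources without a ratio are kept whole
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"ratio for {name!r} must be in [0, 1], got {ratio}")
        keep = round(len(samples) * ratio)
        merged.extend(rng.sample(samples, keep))
    rng.shuffle(merged)  # mix the sources before tokenization
    return merged


def count_tokens(
    samples: List[str], tokenize: Callable[[str], List[str]]
) -> Dict[str, float]:
    """Collect simple token statistics for a list of text samples."""
    lengths = [len(tokenize(s)) for s in samples]
    total = sum(lengths)
    return {
        "num_samples": len(samples),
        "total_tokens": total,
        "max_tokens": max(lengths, default=0),
        "mean_tokens": total / len(samples) if samples else 0.0,
    }
```

For example, `count_tokens(sample_by_ratio(sources, {"wiki": 0.5}), str.split)` would report statistics with a whitespace split standing in for the real tokenizer.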

Background

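This repository contains the LLaMA pre-training and fine-tuning code that I modified and organized for personal projects. It is based on Chinese-LLaMA-Alpaca-2. The main change is that the logic for tokenizing the training data and saving it as arrow files is split out into its own step, so that data preparation and training are decoupled. Parts of the code were also optimized, and comments were added.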

Usage

First, install the required packages listed in requirements.txt.

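LoRA fine-tuning
Main steps:
1. Read the fine-tuning data, tokenize it, and save the result in arrow format (run python pre_tokenizer_inst.py).
2. In 01_run_sft.sh, set the dataset_dir parameter to the arrow output directory from the previous step, set the remaining parameters, and start training (run 01_run_sft.sh).
3. Merge the pt_lora_model produced by training with the base model's parameters to obtain the full model (run python merge_llama2_with_chinese_lora_low_mem.py).
4. The path of the full model can then be written into the vLLM launch script for deployment.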

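Detailed steps:
1. See the __main__ block of pre_tokenizer_inst.py.
    1. Specify the fine-tuning data in files. The example data provided is instr_data/cj_instr.json; each record mainly contains the instruction, input, and output fields.
    2. gen_arrow(files, "arrow_data1219", 'tokenizer_chinese_llama', cache_dir=f'{root_path}instr_data/1219/cache')  # use the original chinese_llama vocabulary
        This is the main tokenization function. Its arguments are files, the output directory name for the processed data (arrow_data1219), the tokenizer to use (tokenizer_chinese_llama), and the path where intermediate data is stored (instr_data/1219/cache).
        It ultimately produces the directory instr_data/1219/arrow_data1219.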
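
To make the pre-tokenization step more concrete, here is a minimal sketch of how one instruction/input/output record might be turned into `input_ids` and `labels`. It is not the code in pre_tokenizer_inst.py: `ToyTokenizer`, `build_example`, the plain newline-joined prompt, and the `max_len` default are all assumptions for illustration. The one convention taken as given is masking prompt positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for the real LLaMA tokenizer."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Assign the next free id to every word not seen before.
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]


def build_example(
    record: Dict[str, str], tokenizer: ToyTokenizer, max_len: int = 512
) -> Dict[str, List[int]]:
    """Turn one instruction/input/output record into input_ids and labels."""
    prompt = record["instruction"]
    if record.get("input"):
        prompt = prompt + "\n" + record["input"]
    prompt_ids = tokenizer.encode(prompt)
    output_ids = tokenizer.encode(record["output"])
    # Truncate to max_len so overly long samples do not exceed the context size.
    input_ids = (prompt_ids + output_ids)[:max_len]
    # Only the response contributes to the loss; prompt positions are masked out.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + output_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```

In a real pipeline the resulting examples would presumably be collected into a Hugging Face `datasets.Dataset` and written with `save_to_disk`, which is what produces the arrow files on disk.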

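2. Run the 01_run_sft script.
    1. In 01_run_sft.sh, set dataset_dir to the /instr_data/1219/arrow_data1219 directory from the previous step, and keep chinese_tokenizer_path the same as the tokenizer used in that step.
    2. Other common changes include the number of GPUs used on a single machine. Set them in CUDA_VISIBLE_DEVICES=0,1,2; this example uses 3 GPUs, so nproc-per-node must also be changed to 3.
    3. Set output_dir, the path where the trained model is saved.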
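
Because CUDA_VISIBLE_DEVICES and nproc-per-node have to agree, a small pre-flight check can catch a mismatch before training starts. The helper below is a hypothetical sketch and is not part of 01_run_sft.sh; `visible_gpu_ids` and `check_nproc` are illustrative names.

```python
import os
from typing import List, Optional


def visible_gpu_ids(value: Optional[str] = None) -> List[str]:
    """Parse a CUDA_VISIBLE_DEVICES-style string such as "0,1,2"."""
    if value is None:
        value = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in value.split(",") if part.strip()]


def check_nproc(nproc_per_node: int, value: Optional[str] = None) -> None:
    """Raise if the process count does not match the number of visible GPUs."""
    n_gpus = len(visible_gpu_ids(value))
    if n_gpus != nproc_per_node:
        raise ValueError(
            f"nproc-per-node is {nproc_per_node}, but {n_gpus} GPU(s) are visible"
        )
```

With the settings from the example, `check_nproc(3, "0,1,2")` passes silently, while `check_nproc(2, "0,1,2")` raises a `ValueError`.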


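3. After training, run python merge_llama2_with_chinese_lora_low_mem.py
    # --base_model: path to the base model
    # --lora_model: the pt_lora_model directory inside the model output path from the previous step
    # --output_type: usually set to huggingface
    # --output_dir: path where the merged model is saved
    python merge_llama2_with_chinese_lora_low_mem.py \
    --base_model path/to/llama2-hf-model \
    --lora_model path/to/chinese-llama2-or-alpaca2-lora \
    --output_type [huggingface|pth|] \
    --output_dir path/to/output-dir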
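
Conceptually, merging a LoRA adapter into a base model folds the low-rank update back into each adapted weight matrix: W_merged = W + (alpha / r) * B A, where r is the LoRA rank. The sketch below illustrates this standard formula on plain Python lists. It is only an illustration under that assumption, not the logic of merge_llama2_with_chinese_lora_low_mem.py, and `matmul` and `merge_lora` are hypothetical helper names.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    rank = len(lora_a)  # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

After this fold-in the adapter matrices are no longer needed at inference time, which is why the merged model can be deployed like an ordinary checkpoint.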


Qznan/train_llama

Folders and files

Latest commit

History

Repository files navigation

Customed scripts for training llama

News置顶

背景

使用说明如下

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages