Hello: the model runs, then immediately exits and reports a successful run. What could be the cause? #88

Open
2879982985 opened this issue Oct 16, 2024 · 1 comment

@2879982985

#!/bin/bash
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=1
MP_SIZE=1

script_path=$(realpath "$0")
script_dir=$(dirname "$script_path")
main_dir=$(dirname "$script_dir")
MODEL_TYPE="XrayGLM"
MODEL_ARGS="--max_source_length 64
--max_target_length 256
--lora_rank 10
--pre_seq_len 4"

#OPTIONS_SAT="SAT_HOME=$1" # e.g. "SAT_HOME=/raid/dm/sat_models"; left unset here, so ${OPTIONS_SAT} below expands to nothing
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2"
HOST_FILE_PATH="hostfile"
HOST_FILE_PATH="hostfile_single" # overrides the line above; this is the hostfile actually passed to deepspeed

train_data="./data/Xray/openi-zh.json"
eval_data="./data/Xray/openi-zh.json"

gpt_options="
--experiment-name finetune-$MODEL_TYPE
--model-parallel-size ${MP_SIZE}
--mode finetune
--train-iters 300
--resume-dataloader
$MODEL_ARGS
--train-data ${train_data}
--valid-data ${eval_data}
--distributed-backend nccl
--lr-decay-style cosine
--warmup .02
--checkpoint-activations
--save-interval 3000
--eval-interval 10000
--save "./checkpoints"
--split 1
--eval-iters 10
--eval-batch-size 8
--zero-stage 1
--lr 0.0001
--batch-size 8
--skip-init
--fp16
--use_lora
"

run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} finetune_XrayGLM.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}

set +x
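
A minimal pre-flight check (a sketch, using only the variables already defined in the script above) could be added just before the eval line; it fails fast when the hostfile or data files are missing, instead of letting the launcher silently fall back to local resources:

# Sketch: verify the referenced files exist before launching
for f in "${HOST_FILE_PATH}" "${train_data}" "${eval_data}"; do
    if [ ! -e "${f}" ]; then
        echo "missing file: ${f}" >&2
        exit 1
    fi
done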

@2879982985 (Author)

This is the output:

(Xray) root@qzedu-NF5280M6:/data/ymj/XrayGLM-main/XrayGLM-main# bash finetune_XrayGLM.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_XrayGLM.py --experiment-name finetune-XrayGLM --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./data/Xray/openi-zh.json --valid-data ./data/Xray/openi-zh.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 3000 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --skip-init --fp16 --use_lora
[2024-10-16 11:01:15,299] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 11:01:17,425] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0: setting --include=localhost:0
[2024-10-16 11:01:17,425] [INFO] [runner.py:607:main] cmd = /home/zhuqirui/.conda/envs/Xray/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_XrayGLM.py --experiment-name finetune-XrayGLM --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --pre_seq_len 4 --train-data ./data/Xray/openi-zh.json --valid-data ./data/Xray/openi-zh.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 3000 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 8 --skip-init --fp16 --use_lora
[2024-10-16 11:01:18,931] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 11:01:21,018] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=info
[2024-10-16 11:01:21,018] [INFO] [launch.py:139:main] 0 NCCL_NET_GDR_LEVEL=2
[2024-10-16 11:01:21,018] [INFO] [launch.py:139:main] 0 NCCL_IB_DISABLE=0
[2024-10-16 11:01:21,018] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0]}
[2024-10-16 11:01:21,018] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-10-16 11:01:21,018] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-10-16 11:01:21,018] [INFO] [launch.py:164:main] dist_world_size=1
[2024-10-16 11:01:21,018] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-10-16 11:01:21,030] [INFO] [launch.py:256:main] process 38741 spawned with command: ['/home/zhuqirui/.conda/envs/Xray/bin/python', '-u', 'finetune_XrayGLM.py', '--local_rank=0', '--experiment-name', 'finetune-XrayGLM', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './data/Xray/openi-zh.json', '--valid-data', './data/Xray/openi-zh.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '3000', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '8', '--skip-init', '--fp16', '--use_lora']
[2024-10-16 11:01:22,499] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 11:01:27,036] [INFO] [launch.py:351:main] Process 38741 exits successfully.
(Xray) root@qzedu-NF5280M6:/data/ymj/XrayGLM-main/XrayGLM-main#
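
The launcher reports "Process 38741 exits successfully" with no training output, which suggests the worker itself returned exit code 0 before training began. One way to surface a traceback or early exit that the DeepSpeed launcher may swallow is to run the worker directly. This is only a sketch; it assumes finetune_XrayGLM.py tolerates a single-process invocation with --local_rank=0, the same flag the launcher passes in the log above, and it reuses ${gpt_options} from finetune_XrayGLM.sh:

# Sketch: bypass the deepspeed launcher to expose hidden errors
run_cmd="python finetune_XrayGLM.py --local_rank=0 ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}
echo "worker exit code: $?"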
