slurm submission log: 2024-11-18 08:01:28.643962
created following sbatch script:

###############################

#!/bin/bash

#SBATCH --account=nlp
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:4
#SBATCH --job-name=ram1998-job-4376567
#SBATCH --mem=16G
#SBATCH --nodelist=jagupard33
#SBATCH --open-mode=append
#SBATCH --output=ram1998-job-4376567.out
#SBATCH --partition=jag-standard
#SBATCH --time=14-0

# activate your desired anaconda environment
. /nlp/scr/ram1998/miniconda3/etc/profile.d/conda.sh ; conda activate pyreft_dev

# cd to working directory
cd .

# launch commands
srun --unbuffered run_as_child_processes 'torchrun --nproc_per_node 4 train_multigpu.py --model_name_or_path yahma/llama-7b-hf --data_path ./alpaca_data.json --output_dir ./test_multi_gpu_v2/ --layers "8;19" --rank 4 --position "f1+l1" --num_train_epochs 10 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "no" --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --max_n_train_example 10000'

###############################

submission to slurm complete!

###############################
slurm submission output

Submitted batch job 9113346

###############################

###############################
start time: 2024-11-18 08:01:29.771619
machine: jagupard33.stanford.edu
conda env: pyreft_dev
###############################
running following processes

torchrun --nproc_per_node 4 train_multigpu.py --model_name_or_path yahma/llama-7b-hf --data_path ./alpaca_data.json --output_dir ./test_multi_gpu_v2/ --layers "8;19" --rank 4 --position "f1+l1" --num_train_epochs 10 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "no" --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --max_n_train_example 10000

###############################
command outputs:

W1118 08:01:31.539000 139959025029504 torch/distributed/run.py:779]
W1118 08:01:31.539000 139959025029504 torch/distributed/run.py:779] *****************************************
W1118 08:01:31.539000 139959025029504 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1118 08:01:31.539000 139959025029504 torch/distributed/run.py:779] *****************************************
nnsight is not detected. Please install via 'pip install nnsight' for nnsight backend.
nnsight is not detected. Please install via 'pip install nnsight' for nnsight backend.
nnsight is not detected. Please install via 'pip install nnsight' for nnsight backend.
nnsight is not detected. Please install via 'pip install nnsight' for nnsight backend.
Starting on rank 0
/nlp/scr/ram1998/miniconda3/envs/pyreft_dev/lib/python3.12/site-packages/transformers/training_args.py:1545: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Starting on rank 3
Starting on rank 2
/nlp/scr/ram1998/miniconda3/envs/pyreft_dev/lib/python3.12/site-packages/transformers/training_args.py:1545: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/nlp/scr/ram1998/miniconda3/envs/pyreft_dev/lib/python3.12/site-packages/transformers/training_args.py:1545: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Starting on rank 1
/nlp/scr/ram1998/miniconda3/envs/pyreft_dev/lib/python3.12/site-packages/transformers/training_args.py:1545: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
Loading checkpoint shards: 0%| | 0/2 [00:00