Replies: 1 comment
Have you solved the problem? I have the same problem.
2024-02-23 03:10:18.814529: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-23 03:10:18.814595: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-23 03:10:18.816594: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-23 03:10:20.392356: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Traceback (most recent call last):
File "/content/Chinese-LLaMA-Alpaca-2-4.1/scripts/training/run_clm_pt_with_peft.py", line 720, in <module>
main()
File "/content/Chinese-LLaMA-Alpaca-2-4.1/scripts/training/run_clm_pt_with_peft.py", line 375, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 129, in __init__
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1442, in __post_init__
and (self.device.type != "cuda")
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1887, in device
return self._setup_devices
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 54, in __get__
cached = self.fget(obj)
File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1813, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 171, in __init__
assert (
AssertionError: DeepSpeed is not available => install it using `pip3 install deepspeed` or build it from source
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 60462) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_clm_pt_with_peft.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-02-23_03:10:28
host : cc2c86357935
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 60462)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
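Per the log, the root cause is the assertion in accelerate's `PartialState`: DeepSpeed is requested in the training arguments but the `deepspeed` package cannot be imported, so the launcher dies with a cryptic multi-process failure. A minimal pre-flight sketch (the `module_available` helper is hypothetical, not part of the script) that surfaces the missing dependency before invoking torchrun could look like:

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# accelerate asserts that DeepSpeed is importable when a deepspeed config
# is passed; checking up front gives a clear message instead of a
# ChildFailedError from torch.distributed.elastic.
if not module_available("deepspeed"):
    print("deepspeed is missing; run `pip3 install deepspeed` "
          "(or build it from source) before launching torchrun")
```

If the check fails, installing DeepSpeed as the assertion message suggests (`pip3 install deepspeed`) and re-running the same torchrun command should get past this error; the bitsandbytes CPU-only warning earlier in the log is a separate issue.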