
Question of running errors on multiple GPUs #2

Open
Tourist-Zhang opened this issue Mar 15, 2024 · 3 comments

Comments

@Tourist-Zhang

When I run the code on one GPU, it runs successfully, but when I run it on two GPUs, it starts reporting errors.
My parameters and the error are as follows:

Args in experiment:
Namespace(is_training=1, model='PathFormer', model_id='ETT.sh', data='ETTh1', root_path='./data/ETT/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=96, pred_len=96, individual=False, d_model=16, d_ff=64, num_nodes=7, layer_nums=3, k=2, num_experts_list=[4, 4, 4], patch_size_list=[[16, 12, 8, 32], [12, 8, 6, 4], [8, 6, 4, 2]], do_predict=False, revin=1, drop=0.1, embed='timeF', residual_connection=0, metric='mae', num_workers=10, itr=1, train_epochs=20, batch_size=2, patience=5, learning_rate=0.001, lradj='TST', use_amp=False, pct_start=0.4, use_gpu=True, gpu=0, use_multi_gpu=True, devices='0,1', test_flop=False, dvices='0,1', device_ids=[0, 1])
Use GPU: cuda:0

start training : ETT.sh_PathFormer_ftETTh1_slM_pl96_96>>>>>>>>>>>>>>>>>>>>>>>>>>
train 8449
val 2785
test 2785
Traceback (most recent call last):
  File "/remote-home/projects/004-pathformer/run.py", line 114, in <module>
    exp.train(setting)
  File "/remote-home/projects/004-pathformer/exp/exp_main.py", line 146, in train
    outputs, balance_loss = self.model(batch_x)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/models/PathFormer.py", line 56, in forward
    out, aux_loss = layer(out)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/layers/AMS.py", line 111, in forward
    expert_outputs = [self.experts[i](x)[0] for i in range(self.num_experts)]
  File "/remote-home/projects/004-pathformer/layers/AMS.py", line 111, in <listcomp>
    expert_outputs = [self.experts[i](x)[0] for i in range(self.num_experts)]
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/layers/Layer.py", line 73, in forward
    weights_distinct, biases_distinct = self.weights_generator_distinct()
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/layers/Layer.py", line 319, in forward
    bias = [torch.matmul(memory, self.B[i]).squeeze(1) for i in range(self.number_of_weights)]
  File "/remote-home/projects/004-pathformer/layers/Layer.py", line 319, in <listcomp>
    bias = [torch.matmul(memory, self.B[i]).squeeze(1) for i in range(self.number_of_weights)]
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 462, in __getitem__
    idx = self._get_abs_string_index(idx)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 445, in _get_abs_string_index
    raise IndexError('index {} is out of range'.format(idx))
IndexError: index 0 is out of range

@hzqn1234

Hi, I'm facing the same issue. Is there any resolution for this?

@Tourist-Zhang
Author

I guess updating the PyTorch version may help.
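For context: in several PyTorch versions, nn.DataParallel replicas do not receive the contents of an nn.ParameterList, so each replica sees an empty list and `self.B[0]` raises exactly this "index 0 is out of range" error. Below is a minimal sketch of one possible workaround, registering each tensor under its own attribute name so it is replicated like any ordinary Parameter. This is an assumption about the cause, not the PathFormer authors' fix, and the module/attribute names (WeightGenerator, B_0, mem_dim) are hypothetical:

```python
import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    """Hypothetical stand-in for the bias-generating module in Layer.py."""

    def __init__(self, number_of_weights=3, mem_dim=8):
        super().__init__()
        self.number_of_weights = number_of_weights
        # Instead of: self.B = nn.ParameterList([...])
        # register each matrix as a separately named Parameter, which
        # nn.DataParallel replicates correctly.
        for i in range(number_of_weights):
            self.register_parameter(f"B_{i}", nn.Parameter(torch.randn(mem_dim, 1)))

    def B(self, i):
        # Look up the individually registered parameter by name.
        return getattr(self, f"B_{i}")

    def forward(self, memory):
        # memory: (batch, 1, mem_dim); each matmul yields (batch, 1, 1),
        # squeezed to (batch, 1), mirroring the shape logic in the traceback.
        return [torch.matmul(memory, self.B(i)).squeeze(1)
                for i in range(self.number_of_weights)]

gen = WeightGenerator()
out = gen(torch.randn(4, 1, 8))
print(len(out), out[0].shape)  # 3 biases, each of shape (4, 1)
```

Alternatively, DistributedDataParallel does not have this ParameterList limitation and is the approach PyTorch recommends over DataParallel for multi-GPU training.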

@hzqn1234

Thanks for the advice, let me try it.
