
Question of running errors on multiple GPUs #2

Open
Tourist-Zhang opened this issue Mar 15, 2024 · 3 comments

Comments

@Tourist-Zhang

When I run the code on one GPU, it runs successfully, but when I run it on two GPUs, it starts reporting errors.
My parameters and the error are as follows:

Args in experiment:
Namespace(is_training=1, model='PathFormer', model_id='ETT.sh', data='ETTh1', root_path='./data/ETT/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=96, pred_len=96, individual=False, d_model=16, d_ff=64, num_nodes=7, layer_nums=3, k=2, num_experts_list=[4, 4, 4], patch_size_list=[[16, 12, 8, 32], [12, 8, 6, 4], [8, 6, 4, 2]], do_predict=False, revin=1, drop=0.1, embed='timeF', residual_connection=0, metric='mae', num_workers=10, itr=1, train_epochs=20, batch_size=2, patience=5, learning_rate=0.001, lradj='TST', use_amp=False, pct_start=0.4, use_gpu=True, gpu=0, use_multi_gpu=True, devices='0,1', test_flop=False, dvices='0,1', device_ids=[0, 1])
Use GPU: cuda:0

start training : ETT.sh_PathFormer_ftETTh1_slM_pl96_96>>>>>>>>>>>>>>>>>>>>>>>>>>
train 8449
val 2785
test 2785
Traceback (most recent call last):
  File "/remote-home/projects/004-pathformer/run.py", line 114, in <module>
    exp.train(setting)
  File "/remote-home/projects/004-pathformer/exp/exp_main.py", line 146, in train
    outputs, balance_loss = self.model(batch_x)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/models/PathFormer.py", line 56, in forward
    out, aux_loss = layer(out)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/layers/AMS.py", line 111, in forward
    expert_outputs = [self.experts[i](x)[0] for i in range(self.num_experts)]
  File "/remote-home/projects/004-pathformer/layers/AMS.py", line 111, in <listcomp>
    expert_outputs = [self.experts[i](x)[0] for i in range(self.num_experts)]
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/layers/Layer.py", line 73, in forward
    weights_distinct, biases_distinct = self.weights_generator_distinct()
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/remote-home/projects/004-pathformer/layers/Layer.py", line 319, in forward
    bias = [torch.matmul(memory, self.B[i]).squeeze(1) for i in range(self.number_of_weights)]
  File "/remote-home/projects/004-pathformer/layers/Layer.py", line 319, in <listcomp>
    bias = [torch.matmul(memory, self.B[i]).squeeze(1) for i in range(self.number_of_weights)]
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 462, in __getitem__
    idx = self._get_abs_string_index(idx)
  File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 445, in _get_abs_string_index
    raise IndexError('index {} is out of range'.format(idx))
IndexError: index 0 is out of range

@hzqn1234

Hi, I'm facing the same issue. Is there any resolution for this?

@Tourist-Zhang
Author

I guess updating the PyTorch version may help.
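For context: in several PyTorch versions, nn.DataParallel replicas do not receive the contents of an nn.ParameterList, so each replica sees an empty list and `self.B[0]` raises exactly this "index 0 is out of range" error. Below is a minimal sketch of one possible workaround, registering each tensor under its own attribute name so it is replicated like any ordinary Parameter. This is an assumption about the cause, not the PathFormer authors' fix, and the module/attribute names (WeightGenerator, B_0, mem_dim) are hypothetical:

```python
import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    """Hypothetical stand-in for the bias-generating module in Layer.py."""

    def __init__(self, number_of_weights=3, mem_dim=8):
        super().__init__()
        self.number_of_weights = number_of_weights
        # Instead of: self.B = nn.ParameterList([...])
        # register each matrix as a separately named Parameter, which
        # nn.DataParallel replicates correctly.
        for i in range(number_of_weights):
            self.register_parameter(f"B_{i}", nn.Parameter(torch.randn(mem_dim, 1)))

    def B(self, i):
        # Look up the individually registered parameter by name.
        return getattr(self, f"B_{i}")

    def forward(self, memory):
        # memory: (batch, 1, mem_dim); each matmul yields (batch, 1, 1),
        # squeezed to (batch, 1), mirroring the shape logic in the traceback.
        return [torch.matmul(memory, self.B(i)).squeeze(1)
                for i in range(self.number_of_weights)]

gen = WeightGenerator()
out = gen(torch.randn(4, 1, 8))
print(len(out), out[0].shape)  # 3 biases, each of shape (4, 1)
```

Alternatively, DistributedDataParallel does not have this ParameterList limitation and is the approach PyTorch recommends over DataParallel for multi-GPU training.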

@hzqn1234

Thanks for the advice, let me try it.
