The code runs successfully on one GPU, but when I run it on two GPUs it fails with the error below.
The error and my parameters are as follows:
start training : ETT.sh_PathFormer_ftETTh1_slM_pl96_96>>>>>>>>>>>>>>>>>>>>>>>>>>
train 8449
val 2785
test 2785
Traceback (most recent call last):
File "/remote-home/projects/004-pathformer/run.py", line 114, in <module>
exp.train(setting)
File "/remote-home/projects/004-pathformer/exp/exp_main.py", line 146, in train
outputs, balance_loss = self.model(batch_x)
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
raise exception
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/remote-home/projects/004-pathformer/models/PathFormer.py", line 56, in forward
out, aux_loss = layer(out)
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/remote-home/projects/004-pathformer/layers/AMS.py", line 111, in forward
expert_outputs = [self.experts[i](expert_inputs[i])[0] for i in range(self.num_experts)]
File "/remote-home/projects/004-pathformer/layers/AMS.py", line 111, in <listcomp>
expert_outputs = [self.experts[i](expert_inputs[i])[0] for i in range(self.num_experts)]
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/remote-home/projects/004-pathformer/layers/Layer.py", line 73, in forward
weights_distinct, biases_distinct = self.weights_generator_distinct()
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/remote-home/projects/004-pathformer/layers/Layer.py", line 319, in forward
bias = [torch.matmul(memory, self.B[i]).squeeze(1) for i in range(self.number_of_weights)]
File "/remote-home/projects/004-pathformer/layers/Layer.py", line 319, in <listcomp>
bias = [torch.matmul(memory, self.B[i]).squeeze(1) for i in range(self.number_of_weights)]
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 462, in __getitem__
idx = self._get_abs_string_index(idx)
File "/remote-home/anaconda/envs/python39/lib/python3.9/site-packages/torch/nn/modules/container.py", line 445, in _get_abs_string_index
raise IndexError('index {} is out of range'.format(idx))
IndexError: index 0 is out of range
Args in experiment:
Namespace(is_training=1, model='PathFormer', model_id='ETT.sh', data='ETTh1', root_path='./data/ETT/', data_path='ETTh1.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=96, pred_len=96, individual=False, d_model=16, d_ff=64, num_nodes=7, layer_nums=3, k=2, num_experts_list=[4, 4, 4], patch_size_list=[[16, 12, 8, 32], [12, 8, 6, 4], [8, 6, 4, 2]], do_predict=False, revin=1, drop=0.1, embed='timeF', residual_connection=0, metric='mae', num_workers=10, itr=1, train_epochs=20, batch_size=2, patience=5, learning_rate=0.001, lradj='TST', use_amp=False, pct_start=0.4, use_gpu=True, gpu=0, use_multi_gpu=True, devices='0,1', test_flop=False, dvices='0,1', device_ids=[0, 1])
Use GPU: cuda:0