training problem #40

Open
vision-heng opened this issue Mar 18, 2022 · 9 comments

Comments

@vision-heng

Hello, Professor! I get the following error when running the code on Windows 11. Could you explain what it means and how to fix it? (My GPU has 8 GB of memory.) Thank you very much!

python main.py --nb_cl_fg=50 --nb_cl=10 --gpu=0 --random_seed=1993 --baseline=lucir --branch_mode=dual --branch_1=ss --branch_2=free --dataset=cifar100
Namespace(K=2, base_lr1=0.1, base_lr2=0.1, baseline='lucir', branch_1='ss', branch_2='free', branch_mode='dual', ckpt_dir_fg='-', ckpt_label='exp01', custom_momentum=0.9, custom_weight_decay=0.0005, data_dir='data/seed_1993_subset_100_imagenet/data', dataset='cifar100', disable_gpu_occupancy=True, dist=0.5, dynamic_budget=False, epochs=160, eval_batch_size=128, fusion_lr=1e-08, gpu='0', icarl_T=2, icarl_beta=0.25, imgnet_backbone='resnet18', lr_factor=0.1, lw_mr=1, nb_cl=10, nb_cl_fg=50, nb_protos=20, num_classes=100, num_workers=1, random_seed=1993, resume=False, resume_fg=False, test_batch_size=100, the_lambda=5, train_batch_size=128)
Using gpu: 0
Total memory: 8192, used memory: 829
Occupy GPU memory in advance.
Files already downloaded and verified
Files already downloaded and verified
Order name:./logs/cifar100_nfg50_ncls10_nproto20_lucir_dual_b1ss_b2free_fixed_exp01\seed_1993_cifar100_order.pkl
Loading the saved class order
[68, 56, 78, 8, 23, 84, 90, 65, 74, 76, 40, 89, 3, 92, 55, 9, 26, 80, 43, 38, 58, 70, 77, 1, 85, 19, 17, 50, 28, 53, 13, 81, 45, 82, 6, 59, 83, 16, 15, 44, 91, 41, 72, 60, 79, 52, 20, 10, 31, 54, 37, 95, 14, 71, 96, 98, 97, 2, 64, 66, 42, 22, 35, 86, 24, 34, 87, 21, 99, 0, 88, 27, 18, 94, 11, 12, 47, 25, 30, 46, 62, 69, 36, 61, 7, 63, 75, 5, 32, 4, 51, 48, 73, 93, 39, 67, 29, 49, 57, 33]
Feature: 64 Class: 50
Setting the dataloaders ...
Check point name: ./logs/cifar100_nfg50_ncls10_nproto20_lucir_dual_b1ss_b2free_fixed_exp01\iter_4_b1.pth

Epoch: 0, learning rate: 0.1
Traceback (most recent call last):
  File "main.py", line 88, in <module>
    trainer.train()
  File "E:\AlgSpace\pycharm\AANets\trainer\trainer.py", line 171, in train
    cur_lambda, self.args.dist, self.args.K, self.args.lw_mr)
  File "E:\AlgSpace\pycharm\AANets\trainer\zeroth_phase.py", line 63, in incremental_train_and_eval_zeroth_phase
    outputs = b1_model(inputs)
  File "E:\Anaconda\envs\aanets\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\AlgSpace\pycharm\AANets\models\modified_resnet_cifar.py", line 109, in forward
    x = self.fc(x)
  File "E:\Anaconda\envs\aanets\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\AlgSpace\pycharm\AANets\models\modified_linear.py", line 37, in forward
    F.normalize(self.weight, p=2, dim=1))
  File "E:\Anaconda\envs\aanets\lib\site-packages\torch\nn\functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
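
The traceback above ends in the matrix multiply of the final linear layer, so one way to narrow the problem down is to run that single operation outside the training code. The following is a minimal sketch, not the repository's code: the function names `cosine_linear_shape` and `run_checks` are hypothetical, the shapes (batch 128, 64 features, 50 classes) are taken from the log, and normalizing the input as well as the weight is an assumption, since the traceback only shows the weight being normalized. If the CPU run succeeds and the GPU run fails with the same cuBLAS error, the problem probably lies in the PyTorch/CUDA/GPU combination rather than in the training script.

```python
import importlib.util


def cosine_linear_shape(device_name, batch=128, features=64, classes=50):
    """Run one normalized linear forward pass and return the output shape."""
    import torch
    import torch.nn.functional as F

    device = torch.device(device_name)
    inputs = torch.randn(batch, features, device=device)
    weight = torch.randn(classes, features, device=device)
    # Assumed to mirror the failing call: both operands are L2-normalized
    # along dim=1 before the matrix multiply inside F.linear.
    out = F.linear(F.normalize(inputs, p=2, dim=1),
                   F.normalize(weight, p=2, dim=1))
    if device.type == "cuda":
        # CUDA kernels run asynchronously; synchronize so that any
        # execution error is raised here and not at a later call.
        torch.cuda.synchronize()
    return tuple(out.shape)


def run_checks():
    """Try the operation on the CPU and, if available, on the first GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"error": "PyTorch is not installed in this environment"}
    import torch

    results = {"cpu": cosine_linear_shape("cpu")}
    if torch.cuda.is_available():
        try:
            results["cuda"] = cosine_linear_shape("cuda:0")
        except RuntimeError as exc:
            results["cuda"] = "failed: {}".format(exc)
    return results


if __name__ == "__main__":
    print(run_checks())
```

Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` before starting Python makes CUDA calls synchronous, which can help confirm that the reported line is really where the failure occurs.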

@yaoyao-liu
Owner

Thanks for your interest in our work!

I'm not entirely sure what is causing this problem. It might be due to a mismatch between the PyTorch and CUDA versions.
My NVIDIA driver version is 460.84, and my CUDA version is 11.2. I hope this information helps.

If you have any further questions, please feel free to add comments to this issue.

@vision-heng
Author

vision-heng commented Mar 18, 2022 via email

@cocogt96

Hi, I ran into the same problem. Did you find out how to fix it? I would really appreciate your help.

@yaoyao-liu
Owner

I don't see this issue when I run the code. Could you please provide your GPU model, PyTorch version, and CUDA version? Thanks a lot!
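
The details requested above can be gathered in one place with a short script. This is a minimal sketch under stated assumptions: the helper name `collect_env_info` is hypothetical, and it only uses PyTorch calls that report what the installed build sees. Note that `torch.version.cuda` is the CUDA toolkit version the PyTorch binary was built against, which can differ from the CUDA version shown by `nvidia-smi` (that one is the highest version the driver supports).

```python
import importlib.util
import platform
import sys


def collect_env_info():
    """Collect OS, Python, PyTorch, and GPU details into a dictionary."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    if importlib.util.find_spec("torch") is None:
        # PyTorch is missing; report that instead of raising ImportError.
        info["torch"] = None
        return info

    import torch

    info["torch"] = torch.__version__
    # CUDA toolkit version PyTorch was built with (None for CPU-only builds).
    info["torch_built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of the first GPU.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print("{}: {}".format(key, value))
```

Posting this output together with the driver version makes it easier to compare a failing setup against a working one.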

@cocogt96

Hi, thank you for your reply. I am on Ubuntu 20.04 with driver version 470.103.01 and CUDA version 11.4. I really appreciate your help.

@yaoyao-liu
Owner

Are you using PyTorch 1.2.0 and Python 3.6?

@cocogt96

Yes, I am using exactly those versions.

@yaoyao-liu
Owner

Thanks for providing this information. Currently, I cannot reproduce this issue on my system, so I don't have a solution for it. I am very sorry about that. I will keep you posted if I find one.

If you find a solution, please post it here as well. I would be truly grateful.

@cocogt96

Sure, thank you for your help.
