-
Environment:
I am trying to run the example on a single GPU, with the following modifications made to the code:
The error is raised at the 22nd epoch:
[2021-03-19 10:07:55] INFO (nni.algorithms.nas.pytorch.cream.trainer/MainThread) Epoch [22/120] Step [1/542] prec1 0.000000 (0.000000) prec5 0.015625 (0.015625) loss 6.925241 (6.925241)
I'm new to this field, so the issue I found might be due to my ignorance. Thank you in advance.
Replies: 5 comments
-
Hi,
In our source code we use distributed running mode, and the model is wrapped so that it runs across different GPU cards. So we need to go through module (model.module.xxxxxx) to call functions or access parameters. Because you use DataParallel, you don't need to add module; just call it directly (model.xxxxxx).
Best,
Hao.
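For illustration, a minimal sketch of that difference (not the Cream supernet itself; ToySupernet is a stand-in): a custom method such as forward_meta is reachable directly on an unwrapped single-GPU model, but only through .module once the model is wrapped in DataParallel or DistributedDataParallel.

import torch
import torch.nn as nn

# Minimal sketch, not the Cream supernet: a model exposing a custom method.
class ToySupernet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)
        self.meta_layer = nn.Linear(8, 1)

    def forward(self, x):
        return self.fc(x)

    def forward_meta(self, x):  # custom method, not part of the nn.Module API
        return self.meta_layer(x)

model = ToySupernet()

# Unwrapped (single GPU): custom methods are reachable directly.
out = model.forward_meta(torch.randn(2, 8))

# Wrapped in DataParallel / DistributedDataParallel: the wrapper only forwards
# __call__, so custom methods must be reached through the inner .module.
wrapped = nn.DataParallel(model)
out = wrapped.module.forward_meta(torch.randn(2, 8))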
-
Hi,
Thanks for your reply. I removed module and tried again. However, another error was raised at the same epoch:
Traceback (most recent call last):
File "./train.py", line 215, in <module>
main()
File "./train.py", line 210, in main
trainer.train()
File "C:\Users\Administrator\AppData\Roaming\Python\Python36\site-packages\nni\nas\pytorch\trainer.py", line 142, in train
self.train_one_epoch(epoch)
File "C:\Users\Administrator\AppData\Roaming\Python\Python36\site-packages\nni\algorithms\nas\pytorch\cream\trainer.py", line 345, in train_one_epoch
meta_value, teacher_cand = self._select_teacher()
File "C:\Users\Administrator\AppData\Roaming\Python\Python36\site-packages\nni\algorithms\nas\pytorch\cream\trainer.py", line 132, in _select_teacher
assert teacher_cand is not None
AssertionError
teacher_cand is assigned inside the if weight > meta_value branch (nni\algorithms\nas\pytorch\cream\trainer.py, line 128). I debugged the code like this:
for now_idx, item in enumerate(self.prioritized_board):
    print("entered prioritized_board loop", now_idx)
    inputx = item[4]
    output = torch.nn.functional.softmax(self.model(inputx), dim=1)
    # weight = self.model.module.forward_meta(output - item[5])  # distributed (multi-GPU) version
    weight = self.model.forward_meta(output - item[5])
    if weight > meta_value:
        print("entered if branch:", self.prioritized_board[cand_idx][3])
        meta_value = weight
        cand_idx = now_idx
        teacher_cand = self.prioritized_board[cand_idx][3]
In the output, now_idx is 0 and self.prioritized_board[cand_idx][3] is None.
Best,
Yi.
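For reference, a hedged sketch (not the NNI source) of why the assertion can fire: teacher_cand only ever receives self.prioritized_board[cand_idx][3], so it stays None whenever the board is empty, no weight exceeds meta_value, or, as in the output above, the stored candidate slot itself is None.

# Hedged sketch, not the NNI implementation: the shape of the failure only.
def select_teacher_sketch(prioritized_board, meta_value=float("-inf")):
    teacher_cand = None
    for now_idx, item in enumerate(prioritized_board):
        weight = item["meta_weight"]      # stand-in for forward_meta(...)
        if weight > meta_value:
            meta_value = weight
            teacher_cand = item["cand"]   # stays None if the stored slot holds None
    # Mirrors the assert in _select_teacher: fails if the board is empty,
    # no weight beats meta_value, or the winning candidate slot is None.
    assert teacher_cand is not None
    return meta_value, teacher_cand

# The observed situation: the first (and winning) entry holds a None candidate.
# select_teacher_sketch([{"meta_weight": 0.3, "cand": None}])  # raises AssertionError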
-
Hi,
This error is caused by a failure in training the meta matching network. Our original learning rate for the meta matching network was delicately tuned through experiments, and its training crashes very easily. The learning rate should vary with the magnitude of the batch size (8 GPUs vs. 1 GPU).
Best,
Hao.
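One common heuristic for "varying with the batch size" is linear scaling; the sketch below only illustrates the idea. The reference setup (8 GPUs with a per-GPU batch of 128) is a hypothetical assumption, and the maintainer's concrete suggestion later in the thread does not necessarily follow this exact rule.

# Hedged sketch: linear LR scaling, a common heuristic rather than the rule
# used by the Cream authors. The 8-GPU reference batch of 128 is hypothetical.
def scale_lr(base_lr, base_total_batch, new_total_batch):
    """Scale a learning rate proportionally to the total (effective) batch size."""
    return base_lr * new_total_batch / base_total_batch

# Moving from a hypothetical 8 x 128 distributed batch to the 1-GPU, batch-16 run:
print(scale_lr(0.1, 8 * 128, 16))   # -> 0.0015625, far smaller than the original 0.1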
-
Hi,
I'm sorry to disturb you again. I have tried many different combinations of training parameters; however, the training of the meta matching network still crashes.
The train.yaml used in the last run is as follows:
AUTO_RESUME: False
DATA_DIR: './data/imagenet'
MODEL: 'Supernet_Training'
RESUME_PATH: './experiments/workspace/train/resume.pth.tar'
SAVE_PATH: './'
SEED: 42
LOG_INTERVAL: 10
RECOVERY_INTERVAL: 0
WORKERS: 1
NUM_GPU: 1
SAVE_IMAGES: False
AMP: False
OUTPUT: 'None'
EVAL_METRICS: 'prec1'
TTA: 0
LOCAL_RANK: 0

DATASET:
  NUM_CLASSES: 1000
  IMAGE_SIZE: 224            # image patch size
  INTERPOLATION: 'bilinear'  # image resize interpolation type
  BATCH_SIZE: 16             # batch size

NET:
  GP: 'avg'
  DROPOUT_RATE: 0.0
  EMA:
    USE: True
    FORCE_CPU: False         # force model EMA to be tracked on CPU
    DECAY: 0.9998

OPT: 'sgd'
LR: 0.1
EPOCHS: 50
META_LR: 1e-8

BATCHNORM:
  SYNC_BN: False

SUPERNET:
  UPDATE_ITER: 200
  SLICE: 4
  POOL_SIZE: 10
  RESUNIT: False
  DIL_CONV: False
  UPDATE_2ND: True
  FLOPS_MINIMUM: 0           # minimum FLOPs of architecture
  FLOPS_MAXIMUM: 200         # maximum FLOPs of architecture
  PICK_METHOD: 'meta'
  META_STA_EPOCH: 10
  HOW_TO_PROB: 'pre_prob'
  PRE_PROB: (0.05,0.2,0.05,0.5,0.05,0.15)
In addition, LR=1.0 and 0.5, META_LR=1e-5, BATCH_SIZE=32, and META_STA_EPOCH=20 were also tried. Would you mind telling me how I should set these parameters?
-
Hi,
Thanks for your interest in Cream!
Not only should META_LR vary with the batch size; LR should also vary with it. From your description, the learning rate of the supernet remained 0.1 across your different META_LR settings. You might try reducing LR, e.g., LR=5e-4 with META_LR=4e-4.
Best,
Hao.
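For reference, a minimal sketch of how these suggested values would slot into the train.yaml quoted above (only these keys change; whether BATCH_SIZE or META_STA_EPOCH should also be adjusted is not stated here):

OPT: 'sgd'
LR: 5e-4       # reduced from 0.1, as suggested for the single-GPU, batch-16 run
EPOCHS: 50
META_LR: 4e-4  # replaces the earlier 1e-8 / 1e-5 attempts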