-
Environment:
I am trying to run the example on a single GPU, with the following modifications made to the code:
The error is raised at the 22nd epoch:
[2021-03-19 10:07:55] INFO (nni.algorithms.nas.pytorch.cream.trainer/MainThread) Epoch [22/120] Step [1/542] prec1 0.000000 (0.000000) prec5 0.015625 (0.015625) loss 6.925241 (6.925241)
I'm new to this field, so the issue I found might be due to my ignorance. Thank you in advance.
Replies: 5 comments
-
Hi,
In our source code we use distributed running mode, and the model is wrapped so that it runs across different GPU cards. So we need to go through module (model.module.xxxxxx) to call functions or access parameters. Because you use DataParallel, you don't need to add module; just call it directly (model.xxxxxx).
Best,
Hao.
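For illustration, a minimal sketch of that difference (not the Cream supernet itself; ToySupernet is a stand-in): a custom method such as forward_meta is reachable directly on an unwrapped single-GPU model, but only through .module once the model is wrapped in DataParallel or DistributedDataParallel.

import torch
import torch.nn as nn

# Minimal sketch, not the Cream supernet: a model exposing a custom method.
class ToySupernet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)
        self.meta_layer = nn.Linear(8, 1)

    def forward(self, x):
        return self.fc(x)

    def forward_meta(self, x):  # custom method, not part of the nn.Module API
        return self.meta_layer(x)

model = ToySupernet()

# Unwrapped (single GPU): custom methods are reachable directly.
out = model.forward_meta(torch.randn(2, 8))

# Wrapped in DataParallel / DistributedDataParallel: the wrapper only forwards
# __call__, so custom methods must be reached through the inner .module.
wrapped = nn.DataParallel(model)
out = wrapped.module.forward_meta(torch.randn(2, 8))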
-
Hi,
Thanks for your reply. I removed module and tried again. However, another error was raised at the same epoch:
Traceback (most recent call last):
File "./train.py", line 215, in <module>
main()
File "./train.py", line 210, in main
trainer.train()
File "C:\Users\Administrator\AppData\Roaming\Python\Python36\site-packages\nni\nas\pytorch\trainer.py", line 142, in train
self.train_one_epoch(epoch)
File "C:\Users\Administrator\AppData\Roaming\Python\Python36\site-packages\nni\algorithms\nas\pytorch\cream\trainer.py", line 345, in train_one_epoch
meta_value, teacher_cand = self._select_teacher()
File "C:\Users\Administrator\AppData\Roaming\Python\Python36\site-packages\nni\algorithms\nas\pytorch\cream\trainer.py", line 132, in _select_teacher
assert teacher_cand is not None
AssertionError
teacher_cand is assigned inside the if weight > meta_value branch (nni\algorithms\nas\pytorch\cream\trainer.py, line 128). I debugged the code like this:
for now_idx, item in enumerate(self.prioritized_board):
    print("entered prioritized_board loop", now_idx)
    inputx = item[4]
    output = torch.nn.functional.softmax(self.model(inputx), dim=1)
    # weight = self.model.module.forward_meta(output - item[5])  # distributed (multi-GPU) version
    weight = self.model.forward_meta(output - item[5])
    if weight > meta_value:
        print("entered if branch:", self.prioritized_board[cand_idx][3])
        meta_value = weight
        cand_idx = now_idx
        teacher_cand = self.prioritized_board[cand_idx][3]
In the output, now_idx is 0 and self.prioritized_board[cand_idx][3] is None.
Best,
Yi.
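For reference, a hedged sketch (not the NNI source) of why the assertion can fire: teacher_cand only ever receives self.prioritized_board[cand_idx][3], so it stays None whenever the board is empty, no weight exceeds meta_value, or, as in the output above, the stored candidate slot itself is None.

# Hedged sketch, not the NNI implementation: the shape of the failure only.
def select_teacher_sketch(prioritized_board, meta_value=float("-inf")):
    teacher_cand = None
    for now_idx, item in enumerate(prioritized_board):
        weight = item["meta_weight"]      # stand-in for forward_meta(...)
        if weight > meta_value:
            meta_value = weight
            teacher_cand = item["cand"]   # stays None if the stored slot holds None
    # Mirrors the assert in _select_teacher: fails if the board is empty,
    # no weight beats meta_value, or the winning candidate slot is None.
    assert teacher_cand is not None
    return meta_value, teacher_cand

# The observed situation: the first (and winning) entry holds a None candidate.
# select_teacher_sketch([{"meta_weight": 0.3, "cand": None}])  # raises AssertionError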
-
Hi,
This error is caused by a failure in training the meta matching network. Our original learning rate for the meta matching network was delicately tuned through experiments, and its training crashes very easily. The learning rate should vary with the magnitude of the batch size (8 GPUs vs. 1 GPU).
Best,
Hao.
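One common heuristic for "varying with the batch size" is linear scaling; the sketch below only illustrates the idea. The reference setup (8 GPUs with a per-GPU batch of 128) is a hypothetical assumption, and the maintainer's concrete suggestion later in the thread does not necessarily follow this exact rule.

# Hedged sketch: linear LR scaling, a common heuristic rather than the rule
# used by the Cream authors. The 8-GPU reference batch of 128 is hypothetical.
def scale_lr(base_lr, base_total_batch, new_total_batch):
    """Scale a learning rate proportionally to the total (effective) batch size."""
    return base_lr * new_total_batch / base_total_batch

# Moving from a hypothetical 8 x 128 distributed batch to the 1-GPU, batch-16 run:
print(scale_lr(0.1, 8 * 128, 16))   # -> 0.0015625, far smaller than the original 0.1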
-
Hi,
I'm sorry to disturb you again. I have tried many different combinations of training parameters; however, the training of the meta matching network still crashes.
The train.yaml used in the last run is as follows:
AUTO_RESUME: False
DATA_DIR: './data/imagenet'
MODEL: 'Supernet_Training'
RESUME_PATH: './experiments/workspace/train/resume.pth.tar'
SAVE_PATH: './'
SEED: 42
LOG_INTERVAL: 10
RECOVERY_INTERVAL: 0
WORKERS: 1
NUM_GPU: 1
SAVE_IMAGES: False
AMP: False
OUTPUT: 'None'
EVAL_METRICS: 'prec1'
TTA: 0
LOCAL_RANK: 0

DATASET:
  NUM_CLASSES: 1000
  IMAGE_SIZE: 224            # image patch size
  INTERPOLATION: 'bilinear'  # image resize interpolation type
  BATCH_SIZE: 16             # batch size

NET:
  GP: 'avg'
  DROPOUT_RATE: 0.0
  EMA:
    USE: True
    FORCE_CPU: False         # force model EMA to be tracked on CPU
    DECAY: 0.9998

OPT: 'sgd'
LR: 0.1
EPOCHS: 50
META_LR: 1e-8

BATCHNORM:
  SYNC_BN: False

SUPERNET:
  UPDATE_ITER: 200
  SLICE: 4
  POOL_SIZE: 10
  RESUNIT: False
  DIL_CONV: False
  UPDATE_2ND: True
  FLOPS_MINIMUM: 0           # minimum FLOPs of architecture
  FLOPS_MAXIMUM: 200         # maximum FLOPs of architecture
  PICK_METHOD: 'meta'
  META_STA_EPOCH: 10
  HOW_TO_PROB: 'pre_prob'
  PRE_PROB: (0.05,0.2,0.05,0.5,0.05,0.15)
In addition, LR=1.0 and 0.5, META_LR=1e-5, BATCH_SIZE=32, and META_STA_EPOCH=20 were also tried. Would you mind telling me how I should set these parameters?
-
Hi,
Thanks for your interest in Cream!
Not only should META_LR vary with the batch size; LR should also vary with it. From your description, the learning rate of the supernet remained 0.1 across your different META_LR settings. You might try reducing LR, e.g., LR=5e-4 with META_LR=4e-4.
Best,
Hao.
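For reference, a minimal sketch of how these suggested values would slot into the train.yaml quoted above (only these keys change; whether BATCH_SIZE or META_STA_EPOCH should also be adjusted is not stated here):

OPT: 'sgd'
LR: 5e-4       # reduced from 0.1, as suggested for the single-GPU, batch-16 run
EPOCHS: 50
META_LR: 4e-4  # replaces the earlier 1e-8 / 1e-5 attempts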