Validation datasets support during training #785
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
maskrcnn_benchmark/engine/trainer.py (Outdated)

                ).format(
                    eta=eta_string,
                    iter=iteration,
                    meters=str(meters),
str(meters) needs to be str(meters_val) here, otherwise the training metrics are displayed.
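For context, a minimal sketch of the fix being suggested; the two MetricLogger instances and the logging call are modeled on the trainer code, and the helper below is only illustrative, not the exact patch:

```python
import logging
from maskrcnn_benchmark.utils.metric_logger import MetricLogger

logger = logging.getLogger("maskrcnn_benchmark.trainer")

meters = MetricLogger(delimiter="  ")      # accumulates training losses
meters_val = MetricLogger(delimiter="  ")  # accumulates validation losses

def log_validation(iteration, eta_string):
    # The bug being pointed out: formatting with str(meters) prints the
    # training metrics; the validation logger must be used instead.
    logger.info(
        meters_val.delimiter.join(
            ["[Validation]", "eta: {eta}", "iter: {iter}", "{meters}"]
        ).format(eta=eta_string, iter=iteration, meters=str(meters_val))
    )
```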
Oh... Yes. Fixed.
@fmassa,
Thanks for the PR!
I think that having a separate VAL set might not be necessary.
Also, you only log the losses during validation. While more complicated, in the new release of torchvision I added functionality to progressively compute the mAP during evaluation, and as an example I run the evaluation at the end of every epoch; see https://github.com/pytorch/vision/blob/master/references/detection/coco_eval.py
Maybe something like that could be used here instead?
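A rough sketch of what that could look like with the repository's existing inference helper; data_loader_val, val_period, and the dataset name are assumptions for illustration, not part of this PR:

```python
from maskrcnn_benchmark.engine.inference import inference

def run_periodic_eval(model, data_loader_val, iteration, val_period):
    # Every `val_period` iterations, compute COCO-style metrics (mAP)
    # on a held-out split instead of only logging the losses.
    if iteration % val_period != 0:
        return
    model.eval()
    inference(
        model,
        data_loader_val,
        dataset_name="coco_2014_minival",  # assumed validation split
        iou_types=("bbox",),
        output_folder=None,                # do not write predictions to disk
    )
    model.train()
```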
@@ -30,7 +30,8 @@ MODEL:
     SHARE_BOX_FEATURE_EXTRACTOR: False
   MASK_ON: True
 DATASETS:
-  TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
+  TRAIN: ("coco_2014_train",)
Can you revert this change? All the models have been trained using the new coco_2017train dataset, which corresponds to coco_2014_train + coco_2014_valminusminival. If you want to evaluate every N iterations, you could do it on coco_2014_minival?
I've reverted it and created another config file where the number of iterations between validations is specified:
https://github.com/facebookresearch/maskrcnn-benchmark/pull/828/files#diff-4dd26a63ac00a49aeb10985800d7f21c
args["transforms"] = transforms | ||
# make dataset from factory | ||
dataset = factory(**args) | ||
datasets.append(dataset) | ||
|
||
# for testing, return a list of datasets | ||
if not is_train: | ||
if mode != DatasetMode.TEST: |
Even though not really the best thing to do, I believe in most cases we simply evaluate on the test dataset after N iterations, so I think that we can remove the VAL part altogether.
I added another boolean flag instead to control how the data loader is created:
https://github.com/facebookresearch/maskrcnn-benchmark/pull/828/files#diff-48c338613bdbf422235cdb2ef17201f7R77
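For illustration only, such a flag could steer dataset selection roughly as below; the flag name is_for_period and this helper are hypothetical, see the linked diff for the actual change:

```python
def select_dataset_list(cfg, is_train, is_for_period=False):
    """Pick the dataset split for the data loader being built.

    A loader built with is_for_period=True behaves like a training
    loader (so the model can return a loss dict), but reads the
    evaluation split instead of being concatenated into TRAIN.
    """
    if is_train and not is_for_period:
        return cfg.DATASETS.TRAIN
    return cfg.DATASETS.TEST
```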
                losses = sum(loss for loss in loss_dict.values())
                loss_dict_reduced = reduce_loss_dict(loss_dict)
                losses_reduced = sum(loss for loss in loss_dict_reduced.values())
                meters_val.update(loss=losses_reduced, **loss_dict_reduced)
If I understand it correctly, you only evaluate the loss here, while the metric that is generally more useful to report is the mAP, as we do for testing.
I added inference metric calculation in addition to the loss calculation:
https://github.com/facebookresearch/maskrcnn-benchmark/pull/828/files#diff-29486803add8a1cde2a6e5b741434c7cR128
Hi, I'm working on the same thing right now, and I wonder why it is possible to skip the model.eval() call before the validation starts?
@fmassa, @YradenRavid,
@osanwe I saw now that you created FrozenBatchNorm2d(), a BatchNorm2d where the batch statistics and the affine parameters are fixed. Is that the reason we don't need to use model.eval() before computing the validation loss?
@YradenRavid,
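For reference, a simplified sketch of FrozenBatchNorm2d (based on maskrcnn_benchmark/layers/batch_norm.py): the statistics and affine parameters are plain buffers, so switching between train() and eval() does not change this layer's behavior.

```python
import torch
from torch import nn

class FrozenBatchNorm2d(nn.Module):
    """BatchNorm2d with fixed batch statistics and affine parameters."""

    def __init__(self, n):
        super().__init__()
        # buffers, not nn.Parameters: never updated by the optimizer
        self.register_buffer("weight", torch.ones(n))
        self.register_buffer("bias", torch.zeros(n))
        self.register_buffer("running_mean", torch.zeros(n))
        self.register_buffer("running_var", torch.ones(n))

    def forward(self, x):
        # the same affine transform is applied in train and eval mode
        scale = self.weight * self.running_var.rsqrt()
        bias = self.bias - self.running_mean * scale
        return x * scale.reshape(1, -1, 1, 1) + bias.reshape(1, -1, 1, 1)
```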
                losses = sum(loss for loss in loss_dict.values())
                loss_dict_reduced = reduce_loss_dict(loss_dict)
                losses_reduced = sum(loss for loss in loss_dict_reduced.values())
                meters_val.update(loss=losses_reduced, **loss_dict_reduced)
This line records the loss of a single val-set batch using the model at the current training iteration, right?
So, if the purpose is to check whether our model is overfitting or not, we need to calculate the average loss over the whole val-set with the model at the current training iteration, and use this average loss to decide on early stopping.
Yes, the global average loss is needed, and the MetricLogger class calculates it:
https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/utils/metric_logger.py#L35-L37
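In simplified form, the bookkeeping behind that global average looks roughly like this (a sketch of the SmoothedValue helper used by MetricLogger, reduced to the parts relevant here):

```python
from collections import deque

class SmoothedValue:
    """Tracks a window of recent values and a running total, so both a
    smoothed (windowed) value and a global average are available."""

    def __init__(self, window_size=20):
        self.deque = deque(maxlen=window_size)
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.deque.append(value)
        self.count += 1
        self.total += value

    @property
    def global_avg(self):
        # average over every value ever recorded, not just the recent window
        return self.total / self.count
```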
By inspecting the implementation of the FrozenBatchNorm, it seems all parameters are indeed frozen and the behavior of the layer does not change regardless of the train/eval mode.
The behavior is changed in |
It seems this inference operation increases memory usage? I ran out of memory when gathering results from the GPUs. I am confused that the inference itself can finish, but a lot of memory is consumed when gathering the results.
Yes, the validation step requires additional memory because of the additional data loader, etc. Please refer to #828.
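Independently of the gather step, the validation forward passes themselves can be kept cheaper by running them without autograd; a sketch, assuming the loader, device, and meter names used in the PR's trainer (the distributed reduce_loss_dict step is omitted here):

```python
import torch

def validation_loss_pass(model, data_loader_val, device, meters_val):
    # The model stays in train() mode so it returns the loss dict,
    # but no gradients are needed, so autograd buffers can be skipped.
    with torch.no_grad():
        for images, targets, _ in data_loader_val:
            images = images.to(device)
            targets = [t.to(device) for t in targets]
            loss_dict = model(images, targets)
            meters_val.update(**loss_dict)
```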
It seems that the framework does not have the capability to use validation datasets during training for checking convergence (they are merged into the training dataset instead), and this is a requested feature (#171 and #348).
The solutions proposed in both mentioned issues are not clean because they require additional boilerplate code, so this PR integrates validation into the training process with config support.
This feature will also be useful for visualization once TensorBoard support is added (#163).
A usage example is presented in this config file.
Upd.: The latest version of the PR is #828