This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Validation datasets support during training #785

Open · wants to merge 10 commits into main

Conversation

@osanwe commented May 15, 2019

It seems that the framework does not support using validation datasets during training to check convergence (it simply merges them into the training dataset), even though this is a requested feature (#171 and #348).

The solutions proposed in both issues are not clean, because they require additional boilerplate code. This PR therefore integrates validation into the training process, with config support.

This feature will also be useful for visualization once TensorBoard support is added (#163).

A usage example is presented in this config file.

Update: the latest version of this PR is #828.

@facebook-github-bot

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot added the CLA Signed label (Do not delete this pull request or issue due to inactivity.) on May 15, 2019
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

).format(
eta=eta_string,
iter=iteration,
meters=str(meters),
@droseger May 22, 2019


str(meters) needs to be str(meters_val) here; otherwise the training metrics are displayed instead of the validation metrics.
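For clarity, a minimal sketch of the corrected call (assuming the validation log mirrors the training log's message template, and that meters_val is the MetricLogger accumulating the validation losses, as in the diff above):

    logger.info(
        meters_val.delimiter.join(
            [
                "eta: {eta}",
                "iter: {iter}",
                "{meters}",
            ]
        ).format(
            eta=eta_string,
            iter=iteration,
            meters=str(meters_val),  # was str(meters), which printed the training metrics
        )
    )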

@osanwe (Author)

Oh... Yes. Fixed.

@osanwe (Author) commented May 23, 2019

@fmassa,
Sorry for the ping.
As I understand from previous activity, you are a maintainer of the repo. Is this pull request useful, or can it be closed?
Perhaps I should provide some additional information or tests from my side. The code was tested on 1 and 8 GPUs.

@fmassa (Contributor) left a comment


Thanks for the PR!

I think that having a separate VAL set might not be necessary.

Also, you only log the losses during validation. While more complicated, in the new release of torchvision I added functionality to progressively compute the mAP during evaluation, and as an example I run the evaluation at the end of every epoch; see https://github.com/pytorch/vision/blob/master/references/detection/coco_eval.py

Maybe something like that could be used here instead?
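As a rough illustration of the pattern suggested here (a sketch only, assuming a torchvision-style detection model that returns a list of prediction dicts in eval mode and a CocoEvaluator-style helper with update/accumulate/summarize methods, as in the linked reference script; the exact API may differ):

    import torch

    @torch.no_grad()
    def evaluate_epoch(model, data_loader, coco_evaluator, device):
        # Switch to eval mode so the detector returns predictions, not losses.
        model.eval()
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            outputs = model(images)
            outputs = [{k: v.cpu() for k, v in out.items()} for out in outputs]
            # Map image_id -> prediction dict, as the evaluator expects.
            res = {t["image_id"].item(): out for t, out in zip(targets, outputs)}
            coco_evaluator.update(res)
        coco_evaluator.accumulate()
        coco_evaluator.summarize()  # prints the COCO-style AP / mAP numbers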

@@ -30,7 +30,8 @@ MODEL:
SHARE_BOX_FEATURE_EXTRACTOR: False
MASK_ON: True
DATASETS:
TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
TRAIN: ("coco_2014_train",)
Contributor


Can you revert this change? All the models have been trained using the new coco_2017_train dataset, which corresponds to coco_2014_train + coco_2014_valminusminival. If you want to evaluate every N iterations, you could do it on coco_2014_minival?
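For illustration, reverting while still validating periodically would mean a DATASETS block roughly like this (a sketch using the catalog names from this repo's configs):

    DATASETS:
      TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
      TEST: ("coco_2014_minival",)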

@osanwe (Author)

I've reverted it and created another config file where the number of iterations between validation runs is specified:
https://github.com/facebookresearch/maskrcnn-benchmark/pull/828/files#diff-4dd26a63ac00a49aeb10985800d7f21c

args["transforms"] = transforms
# make dataset from factory
dataset = factory(**args)
datasets.append(dataset)

# for testing, return a list of datasets
-    if not is_train:
+    if mode != DatasetMode.TEST:
Contributor


Even though it's not really the best thing to do, I believe in most cases we simply evaluate on the test dataset every N iterations, so I think we can remove the VAL part altogether.

@osanwe (Author)

I added a boolean flag instead to control how the data loader is created:
https://github.com/facebookresearch/maskrcnn-benchmark/pull/828/files#diff-48c338613bdbf422235cdb2ef17201f7R77
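A rough sketch of that idea (the helper and flag names below are hypothetical and only illustrate the intent; see the linked diff in #828 for the actual implementation):

    from typing import Sequence

    def select_dataset_names(train_datasets: Sequence[str],
                             test_datasets: Sequence[str],
                             is_train: bool,
                             is_for_period: bool = False) -> Sequence[str]:
        # Hypothetical helper: a periodic-validation loader is built like a
        # training loader, but from the TEST dataset names instead of TRAIN.
        if is_train and not is_for_period:
            return train_datasets
        return test_datasets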

losses = sum(loss for loss in loss_dict.values())
loss_dict_reduced = reduce_loss_dict(loss_dict)
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
meters_val.update(loss=losses_reduced, **loss_dict_reduced)
Contributor


If I understand it correctly, you only evaluate the loss here, while a metric that is generally more useful to report is the mAP, as we do for testing.


@YradenRavid

Hi, I'm working on the same thing right now, and I wonder how it is possible to avoid calling model.eval() before the validation starts?

@osanwe (Author) commented May 27, 2019

@fmassa,
Yes, it seems the additional datasets field is redundant. So I created another PR (#828) where the same datasets are used for intermediate and final evaluations, with the suggested AP calculations.

@YradenRavid,
The train and eval methods change the internal state of the module (https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L987-L1011) and can affect the model's behavior.
For validation in this PR, loss values are needed, and they can only be obtained in training mode (https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py#L59).
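A minimal sketch of that pattern: keep the model in train() mode so it returns the loss dict, but wrap the loop in no_grad(). The meters_val usage follows the diff above; the exact loader output is an assumption and may need adapting:

    import torch

    def run_validation(model, data_loader_val, device, meters_val):
        # The model stays in training mode, since GeneralizedRCNN only returns
        # losses while model.training is True; no_grad() avoids building graphs.
        with torch.no_grad():
            for images, targets in data_loader_val:
                images = images.to(device)
                targets = [target.to(device) for target in targets]
                loss_dict = model(images, targets)
                losses = sum(loss for loss in loss_dict.values())
                meters_val.update(loss=losses, **loss_dict)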

@qihao-huang commented May 27, 2019

@osanwe I'm working on a validation-set average loss curve for TensorBoard at every eval step, and it's almost done.
It will support your #828 as well.

This feature will also be useful for visualization once TensorBoard support is added (#163).

@YradenRavid commented May 28, 2019

@osanwe I see now that you created FrozenBatchNorm2d(), a BatchNorm2d where the batch statistics and the affine parameters are fixed. Is that the reason we don't need to call model.eval() before computing the validation loss?

@osanwe (Author) commented May 28, 2019

@YradenRavid,
It seems the FrozenBatchNorm2d class was created by @fmassa, not by me.
As I wrote previously, the eval() and train() methods only change the training boolean flag in nn.Module. And yes, this flag can affect the behavior of batch_norm and some other layers. It also affects the RCNN behavior here (as mentioned previously), so you cannot get loss values if the training flag is False (i.e. if you call the eval() method).

losses = sum(loss for loss in loss_dict.values())
loss_dict_reduced = reduce_loss_dict(loss_dict)
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
meters_val.update(loss=losses_reduced, **loss_dict_reduced)
@qihao-huang May 28, 2019


This line records the batch loss on the validation set using the model at the current training iteration, right?
So, if the purpose is to check whether our model is overfitting or not, we need to calculate the average loss over the whole validation set with the current model, and use this average loss to decide on early stopping.
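A rough sketch of that idea (a hypothetical helper, assuming the average validation loss is appended to a list after each evaluation step):

    from typing import Sequence

    def should_stop_early(avg_val_losses: Sequence[float], patience: int = 3) -> bool:
        # Hypothetical early-stopping rule: stop when the average validation
        # loss has not improved for `patience` consecutive evaluations.
        if not avg_val_losses:
            return False
        best_idx = min(range(len(avg_val_losses)), key=avg_val_losses.__getitem__)
        return (len(avg_val_losses) - 1 - best_idx) >= patience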


@alono88 commented May 28, 2019

@YradenRavid,
It seems the FrozenBatchNorm2d class was created by @fmassa, not by me.
As I wrote previously, the eval() and train() methods only change the training boolean flag in nn.Module. And yes, this flag can affect the behavior of batch_norm and some other layers. It also affects the RCNN behavior here (as mentioned previously), so you cannot get loss values if the training flag is False (i.e. if you call the eval() method).

By inspecting the implementation of FrozenBatchNorm, it seems all parameters are indeed frozen, and the behavior of the layer does not change regardless of .eval() or .train().
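For reference, a simplified sketch of a FrozenBatchNorm2d-style layer (the class in this repo may differ in details such as an epsilon term): the statistics and affine parameters are buffers, and forward() never consults self.training, so eval()/train() indeed have no effect on it:

    import torch
    from torch import nn

    class FrozenBatchNorm2d(nn.Module):
        # BatchNorm2d with fixed statistics and affine parameters (registered
        # as buffers, not learnable Parameters), so its output is identical
        # in training and eval mode.
        def __init__(self, num_features):
            super().__init__()
            self.register_buffer("weight", torch.ones(num_features))
            self.register_buffer("bias", torch.zeros(num_features))
            self.register_buffer("running_mean", torch.zeros(num_features))
            self.register_buffer("running_var", torch.ones(num_features))

        def forward(self, x):
            scale = self.weight * self.running_var.rsqrt()
            shift = self.bias - self.running_mean * scale
            return x * scale.reshape(1, -1, 1, 1) + shift.reshape(1, -1, 1, 1)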

@osanwe (Author) commented May 29, 2019

behavior of the layer does not change regardless of .eval() or .train()

The behavior that does change is in the GeneralizedRCNN forward method:
https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py#L59-L65
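A minimal, self-contained sketch of that branching (a toy module, not the repo's class, just to illustrate why loss values are only available while model.training is True):

    import torch
    from torch import nn

    class ToyDetector(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone = nn.Linear(4, 2)

        def forward(self, x, targets=None):
            features = self.backbone(x)
            if self.training:
                # Training mode: return a dict of losses (requires targets).
                assert targets is not None
                return {"loss": (features - targets).pow(2).mean()}
            # Eval mode: return "detections" instead of losses.
            return features

    model = ToyDetector()
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    print(model.train()(x, y))  # {'loss': tensor(...)}
    print(model.eval()(x))      # raw predictions, no losses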

@JoyHuYY1412


It seems this inference operation will increase memory usage? I ran out of memory when gathering results from the GPUs. I am confused that the inference itself can finish, yet gathering the results costs so much memory?

@osanwe (Author) commented Sep 30, 2019

It seems this inference operation will increase memory usage?

Yes, the validation step requires additional memory because of the additional data loader, etc. Please refer to #828.

@lufficc (Contributor) commented Aug 10, 2023

Preparing pr description...
