This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Validation datasets support during training #785

Open · wants to merge 10 commits into main

Conversation

@osanwe commented May 15, 2019

It seems that the framework does not support using validation datasets during training to check convergence (it simply merges them into the training dataset), even though this is a requested feature (#171 and #348).

The solutions proposed in both issues are not clean, because they require additional boilerplate code. This PR therefore integrates validation into the training process, with config support.

This feature will also be useful for visualization once TensorBoard support is added (#163).

A usage example is presented in this config file.

Update: the latest version of this PR is #828.

@facebook-github-bot

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot added the CLA Signed label (Do not delete this pull request or issue due to inactivity.) on May 15, 2019
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

).format(
eta=eta_string,
iter=iteration,
meters=str(meters),
@droseger May 22, 2019


str(meters) needs to be str(meters_val) here; otherwise the training metrics are displayed instead of the validation metrics.
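For clarity, a minimal sketch of the corrected call (assuming the validation log mirrors the training log's message template, and that meters_val is the MetricLogger accumulating the validation losses, as in the diff above):

    logger.info(
        meters_val.delimiter.join(
            [
                "eta: {eta}",
                "iter: {iter}",
                "{meters}",
            ]
        ).format(
            eta=eta_string,
            iter=iteration,
            meters=str(meters_val),  # was str(meters), which printed the training metrics
        )
    )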

@osanwe (Author)

Oh... Yes. Fixed.

@osanwe (Author) commented May 23, 2019

@fmassa,
Sorry for the ping.
As I understand from previous activity, you are a maintainer of the repo. Is this pull request useful, or can it be closed?
Perhaps I should provide some additional information or tests from my side. The code was tested on 1 and 8 GPUs.

@fmassa (Contributor) left a comment


Thanks for the PR!

I think that having a separate VAL set might not be necessary.

Also, you only log the losses during validation. While more complicated, in the new release of torchvision I added functionality to progressively compute the mAP during evaluation, and as an example I run the evaluation at the end of every epoch; see https://github.com/pytorch/vision/blob/master/references/detection/coco_eval.py

Maybe something like that could be used here instead?
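As a rough illustration of the pattern suggested here (a sketch only, assuming a torchvision-style detection model that returns a list of prediction dicts in eval mode and a CocoEvaluator-style helper with update/accumulate/summarize methods, as in the linked reference script; the exact API may differ):

    import torch

    @torch.no_grad()
    def evaluate_epoch(model, data_loader, coco_evaluator, device):
        # Switch to eval mode so the detector returns predictions, not losses.
        model.eval()
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            outputs = model(images)
            outputs = [{k: v.cpu() for k, v in out.items()} for out in outputs]
            # Map image_id -> prediction dict, as the evaluator expects.
            res = {t["image_id"].item(): out for t, out in zip(targets, outputs)}
            coco_evaluator.update(res)
        coco_evaluator.accumulate()
        coco_evaluator.summarize()  # prints the COCO-style AP / mAP numbers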

@@ -30,7 +30,8 @@ MODEL:
SHARE_BOX_FEATURE_EXTRACTOR: False
MASK_ON: True
DATASETS:
TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
TRAIN: ("coco_2014_train",)
Contributor


Can you revert this change? All the models have been trained using the new coco_2017_train dataset, which corresponds to coco_2014_train + coco_2014_valminusminival. If you want to evaluate every N iterations, you could do it on coco_2014_minival?
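For illustration, reverting while still validating periodically would mean a DATASETS block roughly like this (a sketch using the catalog names from this repo's configs):

    DATASETS:
      TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
      TEST: ("coco_2014_minival",)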

@osanwe (Author)

I've reverted it and created another config file where the number of iterations between validation runs is specified:
https://github.com/facebookresearch/maskrcnn-benchmark/pull/828/files#diff-4dd26a63ac00a49aeb10985800d7f21c

args["transforms"] = transforms
# make dataset from factory
dataset = factory(**args)
datasets.append(dataset)

# for testing, return a list of datasets
-    if not is_train:
+    if mode != DatasetMode.TEST:
Contributor


Even though it's not really the best thing to do, I believe in most cases we simply evaluate on the test dataset every N iterations, so I think we can remove the VAL part altogether.

@osanwe (Author)

I added a boolean flag instead to control how the data loader is created:
https://github.com/facebookresearch/maskrcnn-benchmark/pull/828/files#diff-48c338613bdbf422235cdb2ef17201f7R77
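A rough sketch of that idea (the helper and flag names below are hypothetical and only illustrate the intent; see the linked diff in #828 for the actual implementation):

    from typing import Sequence

    def select_dataset_names(train_datasets: Sequence[str],
                             test_datasets: Sequence[str],
                             is_train: bool,
                             is_for_period: bool = False) -> Sequence[str]:
        # Hypothetical helper: a periodic-validation loader is built like a
        # training loader, but from the TEST dataset names instead of TRAIN.
        if is_train and not is_for_period:
            return train_datasets
        return test_datasets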

losses = sum(loss for loss in loss_dict.values())
loss_dict_reduced = reduce_loss_dict(loss_dict)
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
meters_val.update(loss=losses_reduced, **loss_dict_reduced)
Contributor


If I understand it correctly, you only evaluate the loss here, while a metric that is generally more useful to report is the mAP, as we do for testing.


@YradenRavid

Hi, I'm working on the same thing right now, and I wonder how it is possible to avoid calling model.eval() before the validation starts?

@osanwe (Author) commented May 27, 2019

@fmassa,
Yes, it seems the additional datasets field is redundant. So I created another PR (#828) where the same datasets are used for intermediate and final evaluations, with the suggested AP calculations.

@YradenRavid,
The train and eval methods change the internal state of the module (https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L987-L1011) and can affect the model's behavior.
For validation in this PR, loss values are needed, and they can only be obtained in training mode (https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py#L59).
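A minimal sketch of that pattern: keep the model in train() mode so it returns the loss dict, but wrap the loop in no_grad(). The meters_val usage follows the diff above; the exact loader output is an assumption and may need adapting:

    import torch

    def run_validation(model, data_loader_val, device, meters_val):
        # The model stays in training mode, since GeneralizedRCNN only returns
        # losses while model.training is True; no_grad() avoids building graphs.
        with torch.no_grad():
            for images, targets in data_loader_val:
                images = images.to(device)
                targets = [target.to(device) for target in targets]
                loss_dict = model(images, targets)
                losses = sum(loss for loss in loss_dict.values())
                meters_val.update(loss=losses, **loss_dict)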

@qihao-huang commented May 27, 2019

@osanwe I'm working on a validation-set average loss curve for TensorBoard at every eval step, and it's almost done.
It will support your #828 as well.

This feature will also be useful for visualization once TensorBoard support is added (#163).

@YradenRavid commented May 28, 2019

@osanwe I see now that you created FrozenBatchNorm2d(), a BatchNorm2d where the batch statistics and the affine parameters are fixed. Is that the reason we don't need to call model.eval() before computing the validation loss?

@osanwe (Author) commented May 28, 2019

@YradenRavid,
It seems the FrozenBatchNorm2d class was created by @fmassa, not by me.
As I wrote previously, the eval() and train() methods only change the training boolean flag in nn.Module. And yes, this flag can affect the behavior of batch_norm and some other layers. It also affects the RCNN behavior here (as mentioned previously), so you cannot get loss values if the training flag is False (i.e. if you call the eval() method).

losses = sum(loss for loss in loss_dict.values())
loss_dict_reduced = reduce_loss_dict(loss_dict)
losses_reduced = sum(loss for loss in loss_dict_reduced.values())
meters_val.update(loss=losses_reduced, **loss_dict_reduced)
@qihao-huang May 28, 2019


This line records the batch loss on the validation set using the model at the current training iteration, right?
So, if the purpose is to check whether our model is overfitting or not, we need to calculate the average loss over the whole validation set with the current model, and use this average loss to decide on early stopping.
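A rough sketch of that idea (a hypothetical helper, assuming the average validation loss is appended to a list after each evaluation step):

    from typing import Sequence

    def should_stop_early(avg_val_losses: Sequence[float], patience: int = 3) -> bool:
        # Hypothetical early-stopping rule: stop when the average validation
        # loss has not improved for `patience` consecutive evaluations.
        if not avg_val_losses:
            return False
        best_idx = min(range(len(avg_val_losses)), key=avg_val_losses.__getitem__)
        return (len(avg_val_losses) - 1 - best_idx) >= patience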


@alono88 commented May 28, 2019

@YradenRavid,
It seems the FrozenBatchNorm2d class was created by @fmassa, not by me.
As I wrote previously, the eval() and train() methods only change the training boolean flag in nn.Module. And yes, this flag can affect the behavior of batch_norm and some other layers. It also affects the RCNN behavior here (as mentioned previously), so you cannot get loss values if the training flag is False (i.e. if you call the eval() method).

By inspecting the implementation of FrozenBatchNorm, it seems all parameters are indeed frozen, and the behavior of the layer does not change regardless of .eval() or .train().
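For reference, a simplified sketch of a FrozenBatchNorm2d-style layer (the class in this repo may differ in details such as an epsilon term): the statistics and affine parameters are buffers, and forward() never consults self.training, so eval()/train() indeed have no effect on it:

    import torch
    from torch import nn

    class FrozenBatchNorm2d(nn.Module):
        # BatchNorm2d with fixed statistics and affine parameters (registered
        # as buffers, not learnable Parameters), so its output is identical
        # in training and eval mode.
        def __init__(self, num_features):
            super().__init__()
            self.register_buffer("weight", torch.ones(num_features))
            self.register_buffer("bias", torch.zeros(num_features))
            self.register_buffer("running_mean", torch.zeros(num_features))
            self.register_buffer("running_var", torch.ones(num_features))

        def forward(self, x):
            scale = self.weight * self.running_var.rsqrt()
            shift = self.bias - self.running_mean * scale
            return x * scale.reshape(1, -1, 1, 1) + shift.reshape(1, -1, 1, 1)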

@osanwe (Author) commented May 29, 2019

behavior of the layer does not change regardless of .eval() or .train()

The behavior that does change is in the GeneralizedRCNN forward method:
https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py#L59-L65
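A minimal, self-contained sketch of that branching (a toy module, not the repo's class, just to illustrate why loss values are only available while model.training is True):

    import torch
    from torch import nn

    class ToyDetector(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone = nn.Linear(4, 2)

        def forward(self, x, targets=None):
            features = self.backbone(x)
            if self.training:
                # Training mode: return a dict of losses (requires targets).
                assert targets is not None
                return {"loss": (features - targets).pow(2).mean()}
            # Eval mode: return "detections" instead of losses.
            return features

    model = ToyDetector()
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    print(model.train()(x, y))  # {'loss': tensor(...)}
    print(model.eval()(x))      # raw predictions, no losses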

@JoyHuYY1412


It seems this inference operation will increase memory usage? I ran out of memory when gathering results from the GPUs. I am confused that the inference itself can finish, yet gathering the results costs so much memory?

@osanwe (Author) commented Sep 30, 2019

It seems this inference operation will increase memory usage?

Yes, the validation step requires additional memory because of the additional data loader, etc. Please refer to #828.

@lufficc (Contributor) commented Aug 10, 2023

Preparing pr description...
