This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Extract features from bounding boxes #665

Open
TheShadow29 wants to merge 1 commit into main

Conversation

TheShadow29

Hi. First, thanks for the amazing repository.

Features extracted from a detection network are often used in other tasks (like VQA). The code shows how to extract the features given the bounding boxes. Currently, I have just added some utility functions to demo/predictor.py. This possibly solves #164 with minor changes.
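Roughly, the utility looks like this (a minimal sketch against maskrcnn-benchmark's public API, not the exact code in the commit; `coco_demo` is a `COCODemo` from demo/predictor.py, `original_image` is the BGR numpy array it expects, and `boxes` is an (N, 4) xyxy tensor in original-image coordinates):

```python
import torch
from maskrcnn_benchmark.structures.bounding_box import BoxList
from maskrcnn_benchmark.structures.image_list import to_image_list

def extract_box_features(coco_demo, original_image, boxes):
    # Apply the same transforms COCODemo uses for detection.
    image = coco_demo.transforms(original_image)
    image_list = to_image_list(image, coco_demo.cfg.DATALOADER.SIZE_DIVISIBILITY)
    image_list = image_list.to(coco_demo.device)
    # BoxList sizes are (width, height); the boxes come in original-image
    # coordinates, so rescale them to the resized input the backbone sees.
    orig_size = (original_image.shape[1], original_image.shape[0])
    boxlist = BoxList(boxes, orig_size, mode="xyxy")
    boxlist = boxlist.resize((image.shape[2], image.shape[1]))
    boxlist = boxlist.to(coco_demo.device)
    with torch.no_grad():
        features = coco_demo.model.backbone(image_list.tensors)
        # For an FPN box head this is RoIAlign + two FC layers -> (N, 1024).
        box_feats = coco_demo.model.roi_heads.box.feature_extractor(
            features, [boxlist]
        )
    return box_feats
```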

Currently, I am not sure how to test if everything is correct. A sanity check I have done is to re-classify the extracted boxes, and the results seem to be consistent.

Thanks

@facebook-github-bot

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot added the CLA Signed label on Apr 12, 2019
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@botcs added the enhancement and good first issue labels on Sep 16, 2019
@botcs
Contributor

botcs commented Sep 16, 2019

Hi @TheShadow29

Thanks for the PR.
I would like to see #164 resolved first, and then we can come back to discussing this PR as well.

@TheShadow29
Author

TheShadow29 commented Sep 16, 2019

@botcs sounds good. Actually #164 does two things at once (if I have understood correctly): during the forward pass, it retains the image proposals as well as the image features.

This PR instead requires the boxes to be given first and then uses them to retrieve the image features. There are two advantages:
(i) When you have ground-truth boxes for the objects, you can use them directly (as in MS-COCO-derived datasets like RefCOCO).
(ii) Even without ground-truth boxes, you can first do a full forward pass on the test image, take the final bounding-box predictions, and re-use them to pool the image features via RoIAlign/RoIPool (sketched below); these features would be better than the ones obtained directly during the forward pass.

The only downside is that (ii) takes a bit more time to get the features (I don't have timing comparisons, but I would guess around 1.5x slower). However, this is usually a one-time process, so better features may be worth the extra processing time.
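Roughly, (ii) would look like this (a sketch reusing the hypothetical `extract_box_features` above, not tested code):

```python
# Pass 1: normal detection to get the final boxes.
predictions = coco_demo.compute_prediction(original_image)
top = coco_demo.select_top_predictions(predictions)
# Pass 2: re-pool backbone features at those boxes, which are tighter
# than the RPN proposals the box head originally pooled from.
feats = extract_box_features(coco_demo, original_image, top.bbox)
```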

Let me know what you think. Thank you for your patience.

@kangkang59812

kangkang59812 commented Oct 21, 2019

> This PR instead requires the boxes to be given first and then uses them to retrieve the image features. […]

Thanks for your effort. But when I extracted the features using boxes of shape [13, 4], I got features of shape [15, 1024], so assert len(features) == len(gt_box_list[0].bbox) failed. How can I fix it?
In box_head.py, after result = self.post_processor((class_logits, box_regression), proposals), the bbox in result changes from [13, 4] to [15, 4].

@TheShadow29
Author

@kangkang59812 Thanks for checking it out. What network are you using? I think I tested with the ResNet-50 FPN Mask R-CNN architecture. My guess is that some changes were made to the repo (the PR was made quite a while back) and this PR would have to be updated.

@kangkang59812

@TheShadow29 faster_rcnn_R_101_FPN. In inference.py line 108, I think it unwraps the boxlist to avoid additional overhead, according to the comment; that is why 13 changes to 15. By the way, instead of your sanity check, I tested the extracted features using roi_heads.box.predictor.cls_score(), and I think the results are correct.
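Something like this (a sketch; it assumes `feats` holds the pooled box features and an FPN box head, whose predictor exposes a `cls_score` linear layer):

```python
import torch

with torch.no_grad():
    # Re-run the box head's classifier on the pooled features and compare
    # the predicted labels with the detector's own labels (0 = background).
    logits = coco_demo.model.roi_heads.box.predictor.cls_score(feats)
    labels = logits.softmax(dim=-1).argmax(dim=-1)
```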

@TheShadow29
Author

@kangkang59812 Awesome. Thanks for confirming

@simaiden

simaiden commented Jan 5, 2020

Hi @TheShadow29, I used your implementation to do retrieval, but it gets worse performance than using a pretrained ResNet-50 (ImageNet weights, without further training) to extract features after detection. Maybe the bbox has to be resized according to the image padding?

edit: it seems it is not necessary to pad the boxes, as noted in https://github.com/facebookresearch/maskrcnn-benchmark/issues/965#issuecomment-510926086, so I don't know why the performance is worse.
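For anyone else hitting this, my understanding (a sketch reusing names from the earlier discussion; `gt_boxes` is an (N, 4) xyxy tensor): padding and resizing are separate issues. to_image_list pads only on the bottom/right, so padding does not shift box coordinates, but the resize transform does rescale the image, so boxes in original-image coordinates still need to be rescaled.

```python
from maskrcnn_benchmark.structures.bounding_box import BoxList

image = coco_demo.transforms(original_image)  # resized CHW tensor
orig_size = (original_image.shape[1], original_image.shape[0])  # (w, h)
boxlist = BoxList(gt_boxes, orig_size, mode="xyxy")
# resize() rescales the coordinates; no padding offset is needed because
# to_image_list pads only on the bottom/right edges.
boxlist = boxlist.resize((image.shape[2], image.shape[1]))
```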

@TheShadow29
Author

@simaiden Could you briefly explain your retrieval setup? It doesn't seem to be the same as object detection.

@simaiden

simaiden commented Jan 6, 2020

> @simaiden Could you briefly explain your retrieval setup? It doesn't seem to be the same as object detection.

I use Mask R-CNN to detect clothes in an image, then get a feature vector from the cropped region: the crop becomes the input to a ResNet-50, and I do global average pooling on one of its layers to get the feature vector. With this approach I get good results, but not with the RoI features from your implementation.
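Roughly, this is my baseline (a sketch; the exact layer and input size here are assumptions, and `pil_image`/`box` are a PIL image and an xyxy box from the detector):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained ResNet-50; drop the fc layer but keep the global
# average pool, so the output is a 2048-d vector per crop.
resnet = models.resnet50(pretrained=True).eval()
embed = torch.nn.Sequential(*list(resnet.children())[:-1])
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def crop_feature(pil_image, box):
    # box: (x1, y1, x2, y2) in pixel coordinates on the original image.
    crop = pil_image.crop(box)
    with torch.no_grad():
        return embed(preprocess(crop).unsqueeze(0)).flatten(1)  # (1, 2048)
```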

@TheShadow29
Author

@simaiden Could you verify it for object detection on COCO? There might be a few more things happening under the hood in your use case.
