How to perform captioning on action proposals? #20
@sgarbanti You should directly run …
Thank you @LuoweiZhou for your help. Once I have downloaded the videos and extracted the 10 sampled frames for each action proposal, I need to run Detectron with the extract_features.py script, right? Would I obtain only the region features that way, or also the h5 file with the region proposals?
@sgarbanti There are three parts: proposals (coordinates), features, and class probabilities. You need to modify the script to take in video input (this part will be updated later this month) and revise …
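For reference, here is a minimal sketch of how those outputs could be inspected once extraction has run. The file names, dataset keys, and shapes below are assumptions for illustration, not the repository's actual layout:

```python
# Sketch: inspect extracted region data; file names and dataset layout are assumed.
import h5py
import numpy as np

# Per-segment region features (the fc6_feat_100rois folder mentioned later in this thread).
feat = np.load("fc6_feat_100rois/v_QsfIM28uvHM_segment_02.npy")
print("feature matrix:", feat.shape)  # roughly (num_regions, feature_dim)

# Proposal coordinates and class probabilities, stored in an h5 file (placeholder name).
with h5py.File("region_proposals.h5", "r") as h5:
    key = sorted(h5.keys())[0]
    dets = h5[key][()]
    print(key, dets.shape)  # per-region boxes plus class scores; exact layout depends on the repo
```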
@LuoweiZhou thank you very much for your patience and for your work. I downloaded the videos with https://github.com/activitynet/ActivityNet/tree/master/Crawler, then I extracted the 10 uniformly sampled frames using the segments' timestamps, as you showed here: facebookresearch/ActivityNet-Entities#1 (comment) Finally, using the configuration and checkpoint files for GVD, I extracted the region features with your code https://github.com/LuoweiZhou/detectron-vlp, running extract_features.py on my own segment frames. Is there anything wrong with what I did?
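For anyone following along, a rough sketch of that sampling step, assuming the convention is 10 timestamps spread uniformly across the segment with one frame grabbed per timestamp via ffmpeg (the exact convention is the one described in the linked ActivityNet-Entities comment; the details below are an assumption):

```python
# Sketch: grab 10 uniformly spaced frames from one segment (sampling convention assumed).
import subprocess

def sample_segment_frames(video_path, start, end, out_dir, num_frames=10):
    """Extract num_frames frames spread uniformly over [start, end] seconds."""
    step = (end - start) / num_frames
    for i in range(num_frames):
        t = start + (i + 0.5) * step  # midpoint of each sub-interval (assumed convention)
        out_path = f"{out_dir}/{i:02d}.jpg"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(t), "-i", video_path,
             "-frames:v", "1", "-q:v", "2", out_path],
            check=True,
        )

# Example call; the segment boundaries here are made up for illustration.
# sample_segment_frames("v_QsfIM28uvHM.mp4", 12.3, 45.6, "frames/v_QsfIM28uvHM_segment_02")
```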
@sgarbanti Sorry for the delay. For some reason, I missed your follow-up. We double-checked the code and it turned out there are two differences that might account for the discrepancy. In GVD, we use the box coordinates from the RPN rather than the final coordinates after class-wise regression (you can compare them here). I have added the corresponding script …
@LuoweiZhou Thank you for the new script. Unfortunately, the results I obtained still don't coincide with your data: neither the region information nor the region features match. Is it possible that the weights in the detectron-vlp repository for the GVD model are not exactly the same ones you used?
@sgarbanti I actually double-checked recently and can reproduce the features (despite slight discrepancies at a magnitude of 1e-5 or less, possibly due to device-related differences in floating-point processing). We will work together to debug this. Just to confirm, you have read the updated README and are using the correct scripts, right? Also, I'd suggest you delete the yaml/pkl files from your local machine and re-download them using the links we provided, just to be sure.
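A quick way to check whether two feature dumps agree up to that tolerance is a comparison like the sketch below (the directory names are placeholders for the locally extracted and the released .npy files):

```python
# Sketch: compare a locally extracted feature file against the released one.
import numpy as np

mine = np.load("my_features/v_QsfIM28uvHM_segment_02.npy")
ref = np.load("released_features/v_QsfIM28uvHM_segment_02.npy")

print("shapes:", mine.shape, ref.shape)
print("max abs diff:", np.max(np.abs(mine - ref)))
# Differences around 1e-5 or smaller would be consistent with device-level floating-point variation.
print("match within 1e-5:", np.allclose(mine, ref, atol=1e-5))
```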
@LuoweiZhou thank you for your time. I had read the README file, and I had already tried re-downloading the checkpoint and configuration files, but this didn't change anything. I used the "extract_features_gvd_anet.py" script; I only had to modify a few small things, such as the use of the --list_of_ids parameter, which was not used in the code. I passed the dic_anet.json file to it, since the segments in your .h5 file are ordered according to this file (I saw that by looking at the dataloader). You can see the modified script that I used here: … Thank you very much.
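In case it helps anyone reproduce this, the idea is roughly the following; the key names used below ("videos", "id") are assumptions about dic_anet.json's layout, not its documented schema:

```python
# Sketch: derive the segment ordering from dic_anet.json (key names are assumed).
import json

with open("dic_anet.json") as f:
    dic = json.load(f)

# Assumption: each entry in dic["videos"] identifies one video/segment.
segment_ids = [v["id"] for v in dic["videos"]]
print("first few segment ids:", segment_ids[:5])
# This ordered list would then be supplied (e.g., via --list_of_ids) so the extracted
# features line up with the ordering used by GVD's dataloader and .h5 file.
```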
@sgarbanti What I can do is check whether your feature files are correct, since the changes you made have no impact on the *.npy files. Could you share the following files with me: …
@LuoweiZhou I was able to extract the features only for the segments v_QsfIM28uvHM_segment_02.npy and v_G8gTBLLf8Bo_segment_00.npy, because the video with ID kmWf36zfL7o is no longer available. The "command.txt" file contains the bash commands used to extract the frames, following your instructions: facebookresearch/ActivityNet-Entities#1 (comment). The region features are in the "fc6_feat_100rois" folder. The output of … Thanks for your help.
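For completeness, the checksums being exchanged here can be computed either with the md5sum command or with a few lines of Python, e.g. this sketch:

```python
# Sketch: compute an MD5 checksum of a feature file to compare against someone else's.
import hashlib

def md5sum(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("fc6_feat_100rois/v_QsfIM28uvHM_segment_02.npy"))
```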
@sgarbanti The md5sum values look good. Could you also place the sampled frames in the Gdrive folder? Thanks
@LuoweiZhou OK, I added a folder "Frames" with the sampled frames. EDIT: I performed the evaluation only on the four segments of the v_yACg55C3IlM video, first with your region features (h5 file included) and then with the region features I extracted (h5 file included); the generated captions look good and the scores have even improved.
@sgarbanti I just checked the frames and features, and it turns out the frames look the same as mine but the features are way off. I cannot really imagine why at this point, but I will keep diagnosing later today. In the meantime, you may want to go through some of the commits to see if you have missed anything. BTW, do you need the features for new videos, or …?
@LuoweiZhou I checked, and the detectron-vlp repository that I have should be consistent. I also tried replacing the "convert_cityscapes_to_coco.py" script in the detectron repo with the version before the last commit of January 15th, but nothing changes. I'm using the same videos, but I need to extract features for new segments: I'm using a temporal action proposal generator to get the event segments, and I need to extract their features in order to caption them with GVD.
@sgarbanti To eliminate the possibility that I made any unintended changes to the code, I made a copy of mine here: https://drive.google.com/file/d/1Bt7GXTV6P0pC33bEGPpHMq-Y77ZJDzh1/view?usp=sharing
@LuoweiZhou I tried to re-extract the features several times; I also tried to reinstall and reconfigure everything, including the conda environment and detectron, but I always obtain the same results. … as you suggested here: … because otherwise I'd get the error: … Do you think this could be the problem?
@sgarbanti Thanks for the feedback. I highly suspect this results from a discrepancy between our caffe2 packages. In your case, you're using the caffe2 from torch, while I compiled a stand-alone conda env for caffe2 (before caffe2 was merged into pytorch); let's name that env … I tried to reproduce your output but encountered some problems when trying to convert …
@LuoweiZhou I followed the Detectron instructions starting from the gvd_pytorch1.1 environment; my md5 checksum for libcaffe2_detectron_ops_gpu.so is: … I get your checksum if I compute it on the libcaffe2_detectron_ops_gpu.so from the detectron-vlp repository, but I have to replace that one to get detectron-vlp to run.
@sgarbanti I meant a separate conda env …
@LuoweiZhou I tried to initialize an environment with Python 2.7 and then follow the Detectron instructions, but when I run … Trying to create an environment with your yml file, conda doesn't find your version of Caffe2: …
If I use your yml file, postponing the installation of Caffe2, I get this version: …
But then, when I try to run detectron-vlp, I get: … I can only get everything working when starting from gvd_pytorch1.1; I don't know why I get all these errors.
You need to google around to fix the installation bugs (e.g., cocoapi). A stand-alone …
@LuoweiZhou I managed to create a stand-alone conda environment, but the libcaffe2_detectron_ops_gpu.so in the detectron-vlp repository, the one with your checksum, still produces this error: Traceback: https://pastebin.com/raw/1BDZ17DZ So I have to replace it with the one from the pytorch installation, which gives me my region features.
@sgarbanti I've reproduced the features by directly using the caffe2 from …
@LuoweiZhou I don't know; I downloaded the videos and extracted the frames the way I described here: #20 (comment) However, it seems that my extracted features also work well with the model, so no problem. I'm closing the issue. Best regards.
Hi, thank you for sharing this repository.
My goal is to use your model to generate captions on ActivityNet action proposals.
The dataset is the same, so I don't think I need to retrain the model; however, I would need to generate the region features and detections using Detectron, right?
Is there an easy way, e.g. a script, to do it?
I saw that you kindly provide the code here:
https://github.com/LuoweiZhou/detectron-vlp
Should I download "RGB frames extracted at 5FPS" provided by ActivityNet, segment them by mine action proposals timestamp, uniformly sample 10 frames for each segment and than use the extract_features.py script, that is into your detectron-vlp repository, to extract region features?
Thanks in advance.