Mismatch between loaded features and snippet indexes. #4

Phoenix1327 · 2020-04-27T13:55:53Z

Thanks for releasing the code. I think there may be a bug in how the features are loaded from the h5 file.
In lines 208 and 209 of dataset.py, the features are loaded every 5 frames (self.skip_videoframes=5 for THUMOS):
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L208
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L209
The start frame of the loaded sequence should be 0 (idx=0).

But in line 221, the snippet index starts from start_snippet=3.
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L221

As a result, the anchor region for the first timestamp in the sequence works out to [0.5, 5.5].
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L241
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L242
But when you calculate the start and end regions for the ground-truth boxes, there seems to be no such shift along the temporal dimension:

https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L137
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L138

Phoenix1327 · 2020-04-27T14:22:44Z

Maybe lines 104 and 105 give the correct anchor_xmin and anchor_xmax (measured in seconds here), but they are not used to calculate the training labels.
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L104
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L105
BSN's code has a start_idx. I guess the reason is that the extracted features used by BSN are already sampled at an interval of 5 frames, and the sampling starts at the 3rd frame.

frostinassiky · 2020-04-27T15:01:11Z

Hey @Phoenix1327 Thanks for pointing this out!

When I load the features, I use only the first frame to represent each 5-frame segment.
This is not optimal because, as you mentioned, the 3rd frame (or the average) should be more representative.

In my experience, the features are very similar to those of their temporal neighbours. The improvement might be marginal, but it is still worth discussing!

I would like to apply your suggestion and load the third frame of each segment. Let's keep this issue open, and I will post the new experiment results here.

Phoenix1327 · 2020-04-27T15:40:51Z

Sorry, I may not have explained my idea clearly.
In fact, I think using the first frame is fine. The sampled frames will be [0, 5, 10, ...].
The problem is that the indexes of the sampled frames are incorrect in line 221, where they become [0+3, 5+3, 10+3, ...].
I suggest changing line 221 to:
df_snippet = [skip_videoframes * i for i in range(num_snippet)]

Otherwise, consider how the match scores for a timestamp t (t in [0, 5, 10, ...]) are calculated in line 144:
https://github.com/Frostinassiky/gtad/blob/f4677a2fd8fda0f990e0c05687b07eed24de5688/gtad_lib/dataset.py#L144
Unless I have miscalculated, the anchor region for timestamp t in line 144 is [t+3-2.5, t+3+2.5]. Matching this region against the ground-truth start/end regions causes the mismatch. I think the correct region for timestamp t is [t-2.5, t+2.5].
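
To make the arithmetic above concrete, here is a minimal sketch of the two indexing schemes. It is not the repository's code: the helper anchor_regions and the value of num_snippet are made up for illustration, and it assumes each anchor region spans half a stride on either side of its snippet index, which is what the [0.5, 5.5] figure quoted earlier implies.

```python
skip_videoframes = 5   # sampling stride for THUMOS, as stated in this thread
start_snippet = 3      # offset applied in line 221, as stated in this thread
num_snippet = 4        # small illustrative value, not taken from the repository


def anchor_regions(offset):
    """Hypothetical helper: (index, (xmin, xmax)) for each snippet, in frames.

    Assumes a region covers half a stride on each side of the snippet index.
    """
    half = skip_videoframes / 2
    indexes = [offset + skip_videoframes * i for i in range(num_snippet)]
    return [(idx, (idx - half, idx + half)) for idx in indexes]


# Indexes shifted by start_snippet while the features start at frame 0.
shifted = anchor_regions(start_snippet)
# Indexes that match features sampled at frames 0, 5, 10, ...
aligned = anchor_regions(0)

for (s_idx, s_reg), (a_idx, a_reg) in zip(shifted, aligned):
    print(f"feature frame {a_idx}: shifted region {s_reg}, aligned region {a_reg}")
```

Under these assumptions, every region in the shifted scheme sits 3 frames later than the frame its feature was actually read from.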

Alternatively, you can load the third frame of each segment by changing lines 208 and 209 to:
self.flow_val[video_name][start_snippet:-1:self.skip_videoframes,...]
self.rgb_val[video_name][start_snippet:-1:self.skip_videoframes,...]
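
As a rough illustration of what that slice selects, the sketch below uses a plain Python list in place of the feature array; NumPy basic slicing follows the same start:stop:step semantics along the first axis. The list length of 23 is arbitrary and not taken from the dataset.

```python
skip_videoframes = 5
start_snippet = 3

# Stand-in for a per-frame feature array: element k represents frame index k.
frames = list(range(23))

# Sampling from index 0 (first frame of each 5-frame segment).
first_of_segment = frames[0:-1:skip_videoframes]
# Sampling from index start_snippet, as in the suggested change.
offset_sampling = frames[start_snippet:-1:skip_videoframes]

print(first_of_segment)
print(offset_sampling)
```

Note that a stop of -1 excludes the final element, so the last frame is never sampled in either case. With the offset slice, the selected frame indexes line up with the [0+3, 5+3, 10+3, ...] snippet indexes from line 221.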

frostinassiky · 2020-04-29T13:40:42Z

Hey @Phoenix1327,
Thanks again for your constructive feedback!
I have updated the code based on your suggestion, and the model performance (mAP at tIoU 0.5) increases from 0.427 to 0.430.
Please feel free to re-open the issue if there are still mismatch problems.

frostinassiky added the bug Something isn't working label Apr 27, 2020

frostinassiky self-assigned this Apr 27, 2020

frostinassiky added a commit that referenced this issue Apr 29, 2020

Update experiemntal result based on issue #4

180cdf0

frostinassiky linked a pull request Apr 29, 2020 that will close this issue

Match feature indices #5

Merged

frostinassiky closed this as completed Apr 29, 2020

frostinassiky mentioned this issue May 29, 2021

THUMOS features frame counts #48

Closed
