
Charades Dataset Loading #17

Open
NemioLipi opened this issue May 29, 2020 · 4 comments

@NemioLipi

NemioLipi commented May 29, 2020

Hi,
Thanks for sharing your code. Have you sampled all the videos of the Charades dataset to have 1024 frames before loading? This procedure may take a lot of memory. Isn't it possible to upsample the resulting feature maps of the original 25 fps videos from the provided pretrained I3D, to get (128, 7, 7, 1024) instead of e.g. (45, 7, 7, 1024)? Would it affect the performance of Timeception afterwards?

@noureldien
Owner

Hello Nemio Lipi,

Sorry for the late reply; I didn't notice the GitHub notification. Yes, 45 segments instead of 128 would affect the performance. To reproduce the results of the paper, you have to randomly sample new frames each epoch. Please note that you need to sample features before training each epoch:
https://github.com/noureldien/videograph/blob/master/experiments/exp_epic_kitchens.py#L97

And look here to see how the frames are sampled. Uniform (equidistant) sampling is done for test videos only:
https://github.com/noureldien/videograph/blob/master/datasets/ds_breakfast.py#L601

For training videos, random sampling is used: you have to re-sample segments before each epoch. Sample only segments, but don't sample frames within each segment; each segment should contain 8 successive frames.
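A minimal sketch of that per-epoch sampling, assuming NumPy; the helper name and defaults are my own, not from the repo:

```python
import numpy as np

def sample_segment_frame_indices(n_frames, n_segments=128, segment_len=8, is_training=True):
    """Pick n_segments segments of segment_len successive frames from a video.

    Training: segment start positions are drawn at random (re-drawn every epoch).
    Testing: segment start positions are uniform (equidistant).
    Assumes n_frames >= segment_len.
    """
    max_start = n_frames - segment_len + 1
    if is_training:
        # sort so segments keep temporal order; allow repeats only for very short videos
        starts = np.sort(np.random.choice(max_start, n_segments, replace=max_start < n_segments))
    else:
        starts = np.linspace(0, max_start - 1, n_segments).astype(int)
    # expand each start index into segment_len successive frame indices
    return np.concatenate([np.arange(s, s + segment_len) for s in starts])
```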

And here is how to extract the features:
https://github.com/noureldien/videograph/blob/master/datasets/ds_breakfast.py#L765
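A rough sketch of that per-segment extraction, assuming a pretrained I3D backbone exposed as a Keras-style model; the `backbone` object and its input/output shapes are assumptions, not the repo's exact API:

```python
import numpy as np

def extract_segment_features(backbone, video_frames, segment_len=8):
    """Run the backbone over non-overlapping segments of 8 successive frames.

    video_frames: array of shape (n_frames, 224, 224, 3).
    Assumes `backbone` maps a batch of clips of shape (1, segment_len, 224, 224, 3)
    to feature maps of shape (1, 7, 7, 1024), as an I3D-style model would.
    """
    n_segments = len(video_frames) // segment_len
    feats = []
    for i in range(n_segments):
        clip = video_frames[i * segment_len:(i + 1) * segment_len]
        feats.append(backbone.predict(clip[np.newaxis])[0])
    return np.stack(feats)  # (n_segments, 7, 7, 1024)
```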

So, to answer your question directly: yes, if you train on pre-defined features, the performance drops significantly, because the Timeception layers need to see features of new segments each training epoch.

However, there is a trick that might alleviate this overhead. Do the following (see the sketch after this list):

  1. Pre-train the backbone CNN on the dataset.
  2. Extract features for 1024 segments per video.
  3. When training the Timeception layers, you can then sample from these features, rather than having to sample segments and feed them forward through the backbone.
  4. By doing so, you run the backbone feedforward only once. The downside is that you have to extract and save a lot of features: 1024 feature maps (segments) for each video.
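
A minimal sketch of step 3, assuming the per-video features were saved as an array of shape (1024, 7, 7, 1024); the function name and file layout are my own:

```python
import numpy as np

def sample_timeception_input(feats, n_segments=128, is_training=True):
    """Sample n_segments rows from pre-extracted per-segment features.

    feats: array of shape (n_total_segments, 7, 7, 1024), e.g. 1024 segments per video.
    Training: random segments, re-drawn every epoch.
    Testing: uniform (equidistant) segments.
    """
    n_total = feats.shape[0]
    if is_training:
        idx = np.sort(np.random.choice(n_total, n_segments, replace=n_total < n_segments))
    else:
        idx = np.linspace(0, n_total - 1, n_segments).astype(int)
    return feats[idx]  # (n_segments, 7, 7, 1024)
```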

@NemioLipi
Author

Thanks a lot for the response. As the number of frames may be very large, wouldn't the last trick you mentioned cause OOM problems?

@noureldien
Owner

What do you mean by OOM problem?

@basavaraj-hampiholi

Hi @noureldien,

It's really nice work and a good presentation at CVPR-19 by Efstratios Gavves. And thanks for sharing the code.

I have a couple of queries regarding the Timeception paper and data loading:

  1. The input to I3D for feature extraction is 3×T×224×224, where T is 1024, and I3D yields a feature map of dimension 1024×128×7×7.
    So you sample the entire video of any length (let's say 5268 frames) down to a fixed 1024-frame clip?
    And do all these 1024 frames in the input clip belong to the same class? (Since Timeception does not produce frame-wise probabilities.)

  2. If all of the frames belong to the same class, how are you learning complex actions (consisting of several one-actions) with different temporal extents using multi-scale temporal kernels? (Mentioned in Section 4.2 of the paper.)

Thanks,
Raj
