
Charades Dataset Loading #17

Open
NemioLipi opened this issue May 29, 2020 · 4 comments

@NemioLipi

NemioLipi commented May 29, 2020

Hi,
Thanks for sharing your code. Have you sampled all the videos of the Charades dataset to have 1024 frames before loading? This procedure may take a lot of memory. Isn't it possible to upsample the resulting feature maps of the original 25 fps videos from the provided pretrained I3D, to get (128, 7, 7, 1024) instead of e.g. (45, 7, 7, 1024)? Would it affect the performance of Timeception afterwards?

@noureldien
Owner

Hello Nemio Lipi,

Sorry for the late reply; I didn't notice the GitHub notification. Yes, 45 segments instead of 128 would affect the performance. To reproduce the results of the paper, you have to randomly sample new frames each epoch. Please note that you need to sample features before training each epoch:
https://github.com/noureldien/videograph/blob/master/experiments/exp_epic_kitchens.py#L97

And look here to see how the frames are sampled. Uniform (equidistant) sampling is done for test videos only:
https://github.com/noureldien/videograph/blob/master/datasets/ds_breakfast.py#L601

For training videos, random sampling is used: you have to re-sample segments before each epoch. Sample only segments, but don't sample frames within each segment; each segment should contain 8 successive frames.
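A minimal sketch of that per-epoch sampling, assuming NumPy; the helper name and defaults are my own, not from the repo:

```python
import numpy as np

def sample_segment_frame_indices(n_frames, n_segments=128, segment_len=8, is_training=True):
    """Pick n_segments segments of segment_len successive frames from a video.

    Training: segment start positions are drawn at random (re-drawn every epoch).
    Testing: segment start positions are uniform (equidistant).
    Assumes n_frames >= segment_len.
    """
    max_start = n_frames - segment_len + 1
    if is_training:
        # sort so segments keep temporal order; allow repeats only for very short videos
        starts = np.sort(np.random.choice(max_start, n_segments, replace=max_start < n_segments))
    else:
        starts = np.linspace(0, max_start - 1, n_segments).astype(int)
    # expand each start index into segment_len successive frame indices
    return np.concatenate([np.arange(s, s + segment_len) for s in starts])
```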

And here is how to extract the features:
https://github.com/noureldien/videograph/blob/master/datasets/ds_breakfast.py#L765
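A rough sketch of that per-segment extraction, assuming a pretrained I3D backbone exposed as a Keras-style model; the `backbone` object and its input/output shapes are assumptions, not the repo's exact API:

```python
import numpy as np

def extract_segment_features(backbone, video_frames, segment_len=8):
    """Run the backbone over non-overlapping segments of 8 successive frames.

    video_frames: array of shape (n_frames, 224, 224, 3).
    Assumes `backbone` maps a batch of clips of shape (1, segment_len, 224, 224, 3)
    to feature maps of shape (1, 7, 7, 1024), as an I3D-style model would.
    """
    n_segments = len(video_frames) // segment_len
    feats = []
    for i in range(n_segments):
        clip = video_frames[i * segment_len:(i + 1) * segment_len]
        feats.append(backbone.predict(clip[np.newaxis])[0])
    return np.stack(feats)  # (n_segments, 7, 7, 1024)
```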

So, to answer your question directly: yes, if you train on pre-defined features, the performance drops significantly, because the Timeception layers need to see features of new segments each training epoch.

However, there is a trick that might alleviate this overhead. Do the following (see the sketch after this list):

  1. Pre-train the backbone CNN on the dataset.
  2. Extract features for 1024 segments per video.
  3. When training the Timeception layers, you can then sample from these features, rather than having to sample segments and feed them forward through the backbone.
  4. By doing so, you run the backbone feedforward only once. The downside is that you have to extract and save a lot of features: 1024 feature maps (segments) for each video.
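
A minimal sketch of step 3, assuming the per-video features were saved as an array of shape (1024, 7, 7, 1024); the function name and file layout are my own:

```python
import numpy as np

def sample_timeception_input(feats, n_segments=128, is_training=True):
    """Sample n_segments rows from pre-extracted per-segment features.

    feats: array of shape (n_total_segments, 7, 7, 1024), e.g. 1024 segments per video.
    Training: random segments, re-drawn every epoch.
    Testing: uniform (equidistant) segments.
    """
    n_total = feats.shape[0]
    if is_training:
        idx = np.sort(np.random.choice(n_total, n_segments, replace=n_total < n_segments))
    else:
        idx = np.linspace(0, n_total - 1, n_segments).astype(int)
    return feats[idx]  # (n_segments, 7, 7, 1024)
```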

@NemioLipi
Author

Thanks a lot for the response. As the number of frames may be very large, wouldn't the last trick you mentioned cause OOM problems?

@noureldien
Owner

What do you mean by OOM problem?

@basavaraj-hampiholi

Hi @noureldien,

It's really nice work and a good presentation at CVPR-19 by Efstratios Gavves. And thanks for sharing the code.

I have a couple of queries regarding the Timeception paper and data loading:

  1. The input to I3D for feature extraction is 3×T×224×224, where T is 1024, and I3D yields a feature map of dimension 1024×128×7×7.
    So you sample the entire video of any length (let's say 5268 frames) down to a fixed 1024-frame clip?
    And do all these 1024 frames in the input clip belong to the same class? (Since Timeception does not produce frame-wise probabilities.)

  2. If all of the frames belong to the same class, how are you learning complex actions (consisting of several one-actions) with different temporal extents using multi-scale temporal kernels? (Mentioned in Section 4.2 of the paper.)

Thanks,
Raj
