SeqTransf & meanP #1

Open · celestialxevermore opened this issue Oct 5, 2022 · 5 comments
@celestialxevermore commented Oct 5, 2022

Dear Author,

I really appreciate and am fascinated by your work, and I am thankful that you released your code.

I know that CLIP4Clip + meanP has the best performance among CLIP4Clip + seqTransf, seqLSTM, and tightTransf, but I found that your sh scripts always recommend seqTransf.

Is there any special reason why "sim_header == seqTransf" is the default setting?

I looked at your Table 2 on MSVD, where your model X-CLIP (ViT-B/32) recorded an R@1 score of 47.1.
Does that mean X-CLIP with seqTransf is better than any other mode (meanP, tightTransf)?
I cannot find which sim_header produced the scores in that table.

If X-CLIP + seqTransf is recommended anyway, is there any special reason why seqTransf outperforms meanP, unlike in CLIP4Clip?

Sincerely,

@xuguohai (Owner) commented Oct 5, 2022

We propose a temporal encoder to model the temporal relationship by setting "sim_header == seqTransf" (as shown in Figure 2).
The ablation study of the temporal encoder is shown in Table 8.
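For context, here is a minimal sketch of what a seqTransf-style header does, assuming frame features of shape (batch, num_frames, dim). The class and parameter names below are illustrative rather than the repo's exact identifiers, and details such as attention masking and normalization are omitted:

```python
import torch
import torch.nn as nn

class SeqTransfHeader(nn.Module):
    """Illustrative temporal encoder: a position-aware transformer over frames."""

    def __init__(self, dim=512, max_frames=12, num_layers=4, num_heads=8):
        super().__init__()
        self.frame_position_embeddings = nn.Embedding(max_frames, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, visual_output):
        # visual_output: (batch, num_frames, dim) frame-level CLIP features
        positions = torch.arange(visual_output.size(1),
                                 device=visual_output.device)
        hidden = visual_output + self.frame_position_embeddings(positions)
        hidden = self.transformer(hidden)
        # Residual connection keeps the original CLIP features in the mix;
        # mean pooling over frames yields the video-level representation.
        return (hidden + visual_output).mean(dim=1)
```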

@celestialxevermore (Author) commented Oct 5, 2022

Thank you for replying.
As I understand it, the temporal encoder (a Transformer) is randomly initialized, which causes a sub-optimal phenomenon: the randomly initialized weights of seqTransf harm the CLIP pretrained weights. Am I wrong? Do you have any ideas about this?

Thanks.

@xuguohai (Owner) commented Oct 5, 2022

I agree with you. If seqTransf were randomly initialized (it is actually initialized from CLIP, as shown in line 116 of modules/modeling.py), it might cause a sub-optimal phenomenon. That is why CLIP4Clip + meanP is better than CLIP4Clip + seqTransf on most datasets.
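To illustrate the initialization referred to above, here is a rough sketch of seeding the temporal encoder from the first few transformer blocks of CLIP's text encoder, assuming CLIP-style state-dict keys of the form "transformer.resblocks.&lt;i&gt;....". The function name and key handling are illustrative, not the exact code at line 116:

```python
def init_temporal_from_clip(temporal_state_dict, clip_state_dict, num_layers=4):
    """Seed the temporal encoder with the first `num_layers` CLIP transformer
    blocks so it does not start from purely random weights (illustrative)."""
    for key, value in clip_state_dict.items():
        if key.startswith("transformer.resblocks."):
            # Key layout assumed: "transformer.resblocks.<layer>.<param...>"
            layer_idx = int(key.split(".")[2])
            if layer_idx < num_layers:
                temporal_state_dict[key] = value.clone()
    return temporal_state_dict
```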

Therefore, in our paper, we recommend using the original CLIP to obtain frame-level visual features, as shown in line 298 of modules/modeling_xclip.py. The temporal encoder helps to obtain the global video-level visual representation.
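In other words, the frame-level path bypasses the temporal encoder entirely. Here is a hedged sketch of that split, with illustrative names (`encode_video` and `temporal_encoder` are not the repo's identifiers):

```python
def encode_video(visual_output, temporal_encoder):
    # visual_output: (batch, num_frames, dim) frame features from the CLIP
    # image encoder. Keep an untouched copy for fine-grained, frame-level
    # matching -- the role of visual_output_original around line 298.
    visual_output_original = visual_output
    # The temporal encoder contributes only the coarse, video-level feature,
    # so its weights never distort the original CLIP frame features.
    video_output = temporal_encoder(visual_output)
    return visual_output_original, video_output
```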

@celestialxevermore (Author) commented Oct 6, 2022

Oh, thank you very much for your kind and fast reply.

I didn't notice that line 116 of modules/modeling.py initializes seqTransf from CLIP.

Q1.
Then what about the Cross model in tightTransf?

Q2.
Also, as a novice in deep learning, I cannot understand exactly why the seemingly simple 'copying' of visual_output in line 298 of modules/modeling_xclip.py can be interpreted as using the original CLIP to obtain the frame-level visual features. I guess that visual_output passes everything computed from the CLIP parameters on to visual_output_original.

Q3.
Then, if I build a new model using other layers such as seqTransf, seqLSTM, or tightTransf, is there no need to freeze any layers? Is doing what you did in line 298 of modules/modeling_xclip.py enough to improve performance?
Could you explain this?

Thanks, you're very kind.

@willyfh commented May 20, 2023

I ran an experiment with another language (Indonesian) on MSVD using X-CLIP, and I found that X-CLIP + meanP performs best compared to the others. I haven't tried the English version, though. But my experiment indicates that X-CLIP + seqTransf, i.e., the proposed temporal encoder, doesn't always perform best on a dataset with different characteristics, as in MSVD-Indonesian. I will share my experiment results later.
