Training speed #18

Open
HarukiYqM opened this issue Mar 4, 2021 · 3 comments

@HarukiYqM

HarukiYqM commented Mar 4, 2021

Thanks for this nice work. Could you provide a rough estimate of the running time for this implementation?

Currently, it takes around 2.5 hours to train one epoch, which seems much slower than expected. (Total batch size 2048, 4 x 8 V100, 32 GB)

Thank you!

@HarukiYqM
Author

In addition, one node always seems to be idle during the forward pass.

@antoine77340
Owner

Hi,

If I remember correctly, this code ran somewhat faster for us, at around one hour per epoch.
One key to fast training is very fast IO, because IO is the bottleneck.
Do you have the videos stored on SSD? We had all videos stored on a distributed SSD disk and training was already input bound, so I guess it could be even worse with HDD.
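
To check whether a run is input bound in the sense described above, one option is to time how long the training loop waits for each batch versus how long the training step itself takes. The following is only a minimal sketch and not part of this repository: `measure_io_fraction`, `slow_batches`, and `step_fn` are hypothetical names introduced for illustration, and the loader is simulated with `time.sleep`.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_io_fraction(
    batches: Iterable, step_fn: Callable, max_steps: int = 50
) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent in step_fn).

    Hypothetical helper: `batches` is any iterable of batches (for example a
    data loader) and `step_fn` runs one training step on a batch.
    """
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks until the next batch is ready
        except StopIteration:
            break
        t1 = time.perf_counter()
        # On GPU, kernels launch asynchronously, so step_fn would need to
        # synchronize the device before returning for this timing to be fair.
        step_fn(batch)
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s


def slow_batches(n: int, delay_s: float = 0.01):
    """Simulated loader (assumption): each batch takes delay_s to 'read'."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


if __name__ == "__main__":
    wait_s, compute_s = measure_io_fraction(slow_batches(20), lambda b: None)
    total = wait_s + compute_s
    print(f"waiting for data: {wait_s:.3f}s ({100 * wait_s / total:.0f}% of loop time)")
```

If most of the loop time is spent waiting for data, faster storage or more loading parallelism is likely to help more than anything on the GPU side.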

@HarukiYqM
Author

Thanks for your reply. It is indeed the IO that slows down the training. Currently, I have to store the data in the cloud and cannot use a local SSD. Do you have any suggestions for making training feasible on a cloud computing system like Azure? Also, will the center crop option reduce performance?
