
Abnormal training time when running multiple NeuRad jobs #32

Open
szhang963 opened this issue Jun 11, 2024 · 4 comments

Comments

@szhang963

The training time becomes longer when I run a second job on a multi-GPU cluster.
[screenshot of training times]

And the second job's training time is also slower, as shown below.
[screenshot of training times]

Could you give me some suggestions?
Thank you in advance.

@georghess
Owner

Have you checked that the jobs do not use the same resources (GPU, CPU)?
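(For reference, one common way to keep concurrent jobs on separate resources is to pin them explicitly at launch. A minimal sketch, assuming a nerfstudio-style entry point; `ns-train neurad` and the data paths below are placeholders for whatever command you actually use:)

```bash
# Job 1: pin to GPU 0 and CPU cores 0-15, cap intra-op threads
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 taskset -c 0-15 \
    ns-train neurad --data <path-to-sequence-1> &

# Job 2: pin to GPU 1 and a disjoint set of CPU cores
CUDA_VISIBLE_DEVICES=1 OMP_NUM_THREADS=8 taskset -c 16-31 \
    ns-train neurad --data <path-to-sequence-2> &
```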

@szhang963
Author

szhang963 commented Jun 11, 2024

I can confirm the jobs do not use the same GPU, but I am not sure about the CPU.
The training time is unstable when running multiple jobs on the multi-GPU cluster.
[screenshot of training times]

I did not encounter this problem in the nerfstudio project.

Can you reproduce the issue?

@georghess
Owner

I see. We often train multiple jobs in parallel on our cluster as well and have never had issues with them affecting each other. I know that the multiprocess data loading has given some people issues; not sure if that is the case here as well?
You could try running the training with --pipeline.datamanager.num_processes=0 and see if that helps. Do you see the GPU utilization dropping when running multiple jobs?
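A minimal sketch of the suggested run, with the same placeholder entry point as above (substitute your actual training command and data path):

```bash
# Disable multiprocess data loading to rule out worker contention between jobs
ns-train neurad --data <path-to-sequence> --pipeline.datamanager.num_processes=0
```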

@szhang963
Author

Hi, the issue was solved by setting --pipeline.datamanager.num_processes=0. At the same time, the parameter does not affect the training time of a single job for me. So, why is that? How much of a speedup does it give for you?
Thank you for your help.
