
The problem of getting deadlock when training with zero3. #99

Open
kimhyeonwokk opened this issue Jul 14, 2024 · 1 comment

Comments

@kimhyeonwokk

Hi. I've looked through your impressive results, and I'd like to thank you for releasing such a great model.
From my analysis, the deadlock when training with DeepSpeed ZeRO-3 appears to be caused by a `for` statement in the code. I ran into a similar deadlock in another project when a `for` loop broke GPU parallelism, i.e. the ranks stopped executing the same sequence of collective operations. After rewriting the loop so that every GPU takes the same path, training proceeded normally.
I may be mistaken, but I'd appreciate it if you could take a look.
As a student, I enjoyed reading your paper. I look forward to your future results!
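To illustrate the failure mode described above: under ZeRO-3, parameters are sharded across ranks, so every forward/backward step triggers collective operations (all-gathers) that all ranks must enter together. A data-dependent `for` loop that runs a different number of iterations on different ranks leaves some ranks blocked in a collective forever. The sketch below is not DeepSpeed code; it simulates ranks with threads and a collective with a `threading.Barrier` to show the mismatch, and the common fix of agreeing on a shared iteration count.

```python
import threading

WORLD_SIZE = 2

def run(step_counts, timeout=0.5):
    """Each simulated 'rank' performs step_counts[rank] collective steps.

    Returns True if all ranks finish together, False if any rank is left
    waiting in the 'collective' (i.e., the run would deadlock).
    """
    barrier = threading.Barrier(WORLD_SIZE)  # stands in for an all-gather
    stuck = []

    def rank_fn(rank):
        try:
            # Rank-dependent loop length: the bug pattern under ZeRO-3.
            for _ in range(step_counts[rank]):
                barrier.wait(timeout=timeout)  # needs ALL ranks to arrive
        except threading.BrokenBarrierError:
            stuck.append(rank)  # this rank would hang forever in practice

    threads = [threading.Thread(target=rank_fn, args=(r,))
               for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not stuck

# Buggy: rank 0 runs 3 steps, rank 1 runs 2 -> rank 0 blocks in the
# collective with no peer, which a real ZeRO-3 job experiences as a hang.
assert run([3, 2]) is False

# Fix: agree on a common step count across ranks (in a real job, e.g. an
# all_reduce with MAX) and have every rank run that many iterations,
# padding the shorter ranks with dummy steps.
common = max(3, 2)
assert run([common, common]) is True
```

The same principle applies to any per-rank `if`/`for` around code that touches ZeRO-3 parameters: keep the control flow identical on every GPU, and pad or mask the extra iterations instead of skipping them.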

@HAWLYQ
Collaborator

HAWLYQ commented Jul 14, 2024

Hi @kimhyeonwokk, thanks for acknowledging our work and for the kind advice about the deadlock with ZeRO-3. I will check the `for` statement in our code!
