
The problem of getting deadlock when training with zero3. #99

Open
kimhyeonwokk opened this issue Jul 14, 2024 · 1 comment

Comments

@kimhyeonwokk

Hi. I've looked through your impressive results, and I'd like to thank you for releasing such a great model.
From my analysis, the deadlock when training with DeepSpeed ZeRO-3 appears to be caused by a `for` statement in the code. I ran into a similar deadlock in another project when a `for` loop broke GPU parallelism, i.e. the ranks stopped executing the same sequence of collective operations. After rewriting the loop so that every GPU takes the same path, training proceeded normally.
I may be mistaken, but I'd appreciate it if you could take a look.
As a student, I enjoyed reading your paper. I look forward to your future results!
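To illustrate the failure mode described above: under ZeRO-3, parameters are sharded across ranks, so every forward/backward step triggers collective operations (all-gathers) that all ranks must enter together. A data-dependent `for` loop that runs a different number of iterations on different ranks leaves some ranks blocked in a collective forever. The sketch below is not DeepSpeed code; it simulates ranks with threads and a collective with a `threading.Barrier` to show the mismatch, and the common fix of agreeing on a shared iteration count.

```python
import threading

WORLD_SIZE = 2

def run(step_counts, timeout=0.5):
    """Each simulated 'rank' performs step_counts[rank] collective steps.

    Returns True if all ranks finish together, False if any rank is left
    waiting in the 'collective' (i.e., the run would deadlock).
    """
    barrier = threading.Barrier(WORLD_SIZE)  # stands in for an all-gather
    stuck = []

    def rank_fn(rank):
        try:
            # Rank-dependent loop length: the bug pattern under ZeRO-3.
            for _ in range(step_counts[rank]):
                barrier.wait(timeout=timeout)  # needs ALL ranks to arrive
        except threading.BrokenBarrierError:
            stuck.append(rank)  # this rank would hang forever in practice

    threads = [threading.Thread(target=rank_fn, args=(r,))
               for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not stuck

# Buggy: rank 0 runs 3 steps, rank 1 runs 2 -> rank 0 blocks in the
# collective with no peer, which a real ZeRO-3 job experiences as a hang.
assert run([3, 2]) is False

# Fix: agree on a common step count across ranks (in a real job, e.g. an
# all_reduce with MAX) and have every rank run that many iterations,
# padding the shorter ranks with dummy steps.
common = max(3, 2)
assert run([common, common]) is True
```

The same principle applies to any per-rank `if`/`for` around code that touches ZeRO-3 parameters: keep the control flow identical on every GPU, and pad or mask the extra iterations instead of skipping them.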

@HAWLYQ
Collaborator

HAWLYQ commented Jul 14, 2024

Hi @kimhyeonwokk, thanks for acknowledging our work and for the kind advice about the deadlock with ZeRO-3. I will check the `for` statement in our code!
