Does zero3/zero++ not support large scale pretraining? #3999

cccc0der · 2023-07-20T02:35:05Z

Hi,

ZeRO-3 partitions the model parameters across all ranks and, during the forward and backward passes, gathers the parameters it needs from the other ranks.
Also, because of how gradient accumulation interacts with gradient partitioning, ZeRO-3 is not compatible with pipeline parallelism.
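
To make the partition-then-gather idea concrete, here is a toy single-process sketch. It is not DeepSpeed code: `shard` and `all_gather` are hypothetical helpers that only mimic what each rank stores and what a gather reconstructs, and the real implementation works per layer on GPU tensors with collective communication.

```python
def shard(params, world_size):
    """Split a flat parameter list into world_size equal shards (zero-padded)."""
    shard_len = -(-len(params) // world_size)  # ceiling division
    padded = params + [0.0] * (shard_len * world_size - len(params))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards, numel):
    """Rebuild the full parameter list from every rank's shard, dropping padding."""
    full = [x for s in shards for x in s]
    return full[:numel]


# Hypothetical toy model: 10 parameters spread over 4 ranks.
params = [0.1 * i for i in range(10)]
shards = shard(params, world_size=4)        # each rank keeps only its own shard
restored = all_gather(shards, len(params))  # done before a layer's forward/backward
```

Each rank holds only 1/`world_size` of the parameters between gathers, which is where the memory saving comes from, and the gather is where the extra communication comes from.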

Large-scale pretraining often runs on thousands of GPUs. ZeRO-3 must gather parameters from all of them, so the network cost is huge, even though ZeRO++ reduces communication.
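
A rough back-of-the-envelope estimate of what I mean, as a sketch. The factors are the per-step, per-GPU volumes as I understand them from the ZeRO and ZeRO++ papers (about 2Ψ elements for ZeRO-1/2, about 3Ψ for ZeRO-3, and about 0.75Ψ fp16-equivalent for ZeRO++); the 10B parameter count is a hypothetical example, and `per_gpu_comm_bytes` is just an illustrative helper.

```python
def per_gpu_comm_bytes(num_params, stage, zeropp=False, bytes_per_elem=2):
    """Approximate bytes each GPU sends per training step (fp16 elements by default)."""
    if stage == 3:
        # ZeRO-3: parameter all-gather in forward and backward, plus gradient
        # reduce-scatter. ZeRO++ is reported to cut this from 3x to 0.75x.
        factor = 0.75 if zeropp else 3
    else:
        # Plain data parallel, ZeRO-1 and ZeRO-2: one gradient all-reduce (about 2x).
        factor = 2
    return factor * num_params * bytes_per_elem


num_params = 10_000_000_000  # hypothetical 10B-parameter model
zero1 = per_gpu_comm_bytes(num_params, stage=1)
zero3 = per_gpu_comm_bytes(num_params, stage=3)
zero3_pp = per_gpu_comm_bytes(num_params, stage=3, zeropp=True)
```

This only counts volume. It does not model latency or the lower bandwidth of inter-node links at thousands of GPUs, which is the part I am worried about.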

Given the above (if I'm not wrong), I think [3D parallelism + ZeRO-1] is still the SOTA solution for pretraining, while ZeRO-3/ZeRO++ is much better suited to SFT/RLHF.
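
For concreteness, the two setups I am comparing might look roughly like the DeepSpeed config fragments below, written as Python dicts. This is a sketch: the batch sizes and the partition size of 8 (assuming 8 GPUs per node) are placeholder values, and the pipeline and tensor parallel degrees are set by the training framework (for example Megatron-DeepSpeed), not in this dict.

```python
# Option A: ZeRO-1, meant to be combined with pipeline + tensor parallelism.
zero1_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "gradient_accumulation_steps": 8,     # placeholder value
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}

# Option B: ZeRO-3 with the three ZeRO++ features switched on.
zeropp_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "gradient_accumulation_steps": 8,     # placeholder value
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,    # qwZ: quantized weight all-gather
        "zero_hpz_partition_size": 8,      # hpZ: secondary weight copy within a node
        "zero_quantized_gradients": True,  # qgZ: quantized gradient reduce-scatter
    },
}
```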

So, is my understanding correct?
