Keep Attention Softmax FP32 during FP16/ZeRO Training #1474
-
Hi all,
Recent findings from GLM-130B and researchers at Tsinghua have shown that keeping the attention softmax in fp32 while training with fp16 and ZeRO leads to much greater stability at scale.
Since ColossalAI handles floating-point precision during training, is there a recommended way to make sure the softmax stays in fp32 without the engine automatically overriding it once fp16/ZeRO is initialized? That way fp16 and ZeRO can stay enabled in the configuration while preserving numerical stability.
Thank you,
Enrico
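P.S. The kind of pattern I mean is the usual "fp32 softmax" trick, roughly like the following in plain PyTorch. This is a minimal sketch, not ColossalAI's implementation; the function name, the boolean `mask` argument, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_with_fp32_softmax(q, k, v, mask=None):
    """Scaled dot-product attention that keeps only the softmax in fp32.

    q, k, v: fp16 tensors of shape (batch, heads, seq_len, head_dim).
    mask:    optional boolean tensor, True where positions are masked out.
    The matmuls stay in fp16; only the softmax is upcast.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # fp16 matmul
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    # Upcast before the exponentiation/normalization, then cast back
    # so the downstream ops keep running in fp16.
    probs = F.softmax(scores.float(), dim=-1).to(q.dtype)
    return torch.matmul(probs, v)
```

Since the cast happens inside the module's forward pass, I would expect it to survive an engine that only converts parameters and inputs to fp16, but I am not sure whether ColossalAI's fused kernels replace this code path entirely, hence the question.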
Replies: 2 comments 1 reply
-
Thanks for your advice! We will run more experiments with this technique.
-
Could you create an issue?