Is the delayed scaling computation overlapped with another computation? #1231

avizon-aws · 2024-10-08T16:23:45Z

For the delayed scaling recipe to be efficient, the scales for the next iteration are computed beforehand. I think the computation of the scales happens here:

TransformerEngine/transformer_engine/pytorch/fp8.py, line 314 in f8eb799:

    def reduce_tensor_across_group_op_max(tensor: torch.Tensor, group: dist_group_type) -> None:

However, I don't understand whether this computation runs in parallel with other computation to save time. Could anyone shed some light on this? Also, where is the amax calculation for the inputs and weights done?
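
To make the question concrete, here is a toy sketch of what I understand "computing the scales beforehand" to mean. It is plain Python, not TransformerEngine code: `DelayedScaler`, `record_amax`, `update_scale`, and `reduce_max_across_ranks` are hypothetical names, the formula `fp8_max / amax / 2**margin` is my assumption about the recipe, and the cross-rank max reduction is simulated with a list instead of a real all-reduce.

```python
from collections import deque

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


class DelayedScaler:
    """Toy per-tensor delayed scaling state (hypothetical, not the TE class)."""

    def __init__(self, history_len=4, margin=0):
        # Rolling window of amax values observed in past iterations.
        self.amax_history = deque(maxlen=history_len)
        self.margin = margin
        self.scale = 1.0  # used until the first update

    def record_amax(self, amax):
        # Store the absolute maximum observed in the current iteration.
        self.amax_history.append(abs(amax))

    def update_scale(self):
        # The scale for the NEXT iteration is derived from past amax values,
        # so it does not depend on the next iteration's data.
        amax = max(self.amax_history)
        if amax > 0.0:
            self.scale = FP8_E4M3_MAX / amax / (2 ** self.margin)
        return self.scale


def reduce_max_across_ranks(per_rank_amax):
    # Stand-in for an all-reduce with a MAX op: every rank ends up with
    # the largest amax seen by any rank in the group.
    return max(per_rank_amax)


scaler = DelayedScaler(history_len=2)

# Iteration 1: two simulated ranks observed different local amax values.
scaler.record_amax(reduce_max_across_ranks([1.5, 2.0]))
scale_after_iter1 = scaler.update_scale()  # 448 / 2

# Iteration 2: a larger value appears, so the next scale shrinks.
scaler.record_amax(reduce_max_across_ranks([4.0, 0.25]))
scale_after_iter2 = scaler.update_scale()  # 448 / 4
```

In this sketch the reduction and the scale update depend only on values from iterations that have already finished, which is why I would expect them to be overlappable with other work; whether the library actually does that is what I am asking.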

Replies: 0 comments