Update FP8 scale-inverse in kernels with FP8 output #1083
Conversation
fp8_meta["scaling_fwd"].scale_inv, | ||
tex.FP8FwdTensors.GEMM1_INPUT, | ||
inputmat_scale_inv, | ||
0, |
Why not just remove this item?
Mostly to keep the API backward-compatible. LayerNormMLP is still storing scale-invs in the fp8_meta.
Not sure I follow - this particular call is from an internal autograd function, so we should be able to change its API.
fp8_gemm is used differently in Linear and LayerNormMLP: Linear constructs a new scale-inv tensor, while LayerNormMLP still uses the fp8_meta's scale-inv and requires an offset. I avoided touching the more complicated logic in LayerNormMLP and attention to keep this PR simple.
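As a rough illustration of that distinction, here is a minimal sketch in plain PyTorch. The function fp8_gemm_demo and the enum stand-ins are hypothetical and do not match the real tex.fp8_gemm signature; they only show the two calling conventions being discussed.

```python
import torch

# Hypothetical stand-in for an FP8 GEMM wrapper (NOT the real tex.fp8_gemm):
# each operand is dequantized with its own scale-inverse entry before matmul.
def fp8_gemm_demo(a_q, a_scale_inv, a_offset, b_q, b_scale_inv, b_offset):
    a = a_q.float() * a_scale_inv[a_offset]   # dequantize operand A
    b = b_q.float() * b_scale_inv[b_offset]   # dequantize operand B
    return a @ b

x = torch.randint(0, 127, (16, 32), dtype=torch.uint8)  # placeholder "FP8" bits
w = torch.randint(0, 127, (32, 64), dtype=torch.uint8)

# Linear-style call: the cast produced a fresh one-element scale-inv tensor
# packaged with the FP8 data, so the offset is always 0.
inputmat_scale_inv = torch.full((1,), 0.5)
weight_scale_inv = torch.full((1,), 0.25)
y = fp8_gemm_demo(x, inputmat_scale_inv, 0, w, weight_scale_inv, 0)

# LayerNormMLP-style call: scale-invs still live in the shared fp8_meta
# buffer, so each operand is addressed by its enum offset.
fp8_meta_scale_inv = torch.ones(8)
GEMM1_INPUT, GEMM1_WEIGHT = 0, 1   # stand-ins for tex.FP8FwdTensors values
y = fp8_gemm_demo(x, fp8_meta_scale_inv, GEMM1_INPUT,
                  w, fp8_meta_scale_inv, GEMM1_WEIGHT)
```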
* Perform scale-inv update in cast-transpose kernels
* Perform scale-inv update in cast and activation kernels
* Perform scale-inv update in LayerNorm and RMSNorm kernels
* Perform scale-inv update after FP8 GEMMs
* Fuse casts and scale-inv updates in linear module
* Fuse casts and scale-inv updates in layernorm-linear module
* Simplify kernel to update FP8 scale-inv
* Fix typos
* Debug amax update in layernorm kernels
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Debug test failures
* Debug ONNX export: use quantization scaling factor in ONNX quantize op
* Review suggestion from @ptrendx
* Debug mismatched dtypes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: beinggod <zhangruibin@01.ai>
Description
We currently treat the FP8 scale-inverse (the dequantization scaling factor) as part of the FP8 recipe, along with the FP8 scale (the quantization scaling factor) and the absmax history. However, this is uncomfortable because any change to the FP8 recipe will invalidate the corresponding FP8 data. We work around this by creating copies of the scale-invs whenever there might be a recipe update, e.g. in between the forward and backward passes of the linear layer:
TransformerEngine/transformer_engine/pytorch/module/linear.py, line 318 at commit 6717554
This adds non-trivial CPU overhead (I estimate ~20% for the PyTorch linear layer forward pass on an L40).
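For context, the workaround being removed amounts to snapshotting the scale-inv entries that the backward pass will need at forward time. A minimal sketch of the pattern, with hypothetical names rather than the exact code at the line referenced above:

```python
import torch

# Shared recipe state: one scale-inv entry per FP8 tensor slot in fp8_meta.
fp8_meta_scale_inv = torch.ones(8)
GEMM1_INPUT, GEMM1_WEIGHT = 0, 1   # stand-ins for tex.FP8FwdTensors values

# Forward pass: clone the entries needed later, because an amax/scale update
# between forward and backward would overwrite them. These clones are the
# extra CPU work being measured above.
inputmat_scale_inv = fp8_meta_scale_inv[GEMM1_INPUT].clone()
weight_scale_inv = fp8_meta_scale_inv[GEMM1_WEIGHT].clone()
```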
A better approach is to treat the scale-inv as part of the FP8 data, something that should be output along with the FP8 bits and should never change independently of the FP8 bits. The FP8 recipe tells us how we want to cast into FP8, while the scale-inv tells us how to convert back to higher precision. Note that this generalizes nicely to block-scaling schemes, where the scale-inv tensor may be large and must be packaged with the data during communication.
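As a sketch of the proposed convention in plain PyTorch (not the actual Transformer Engine kernels), the cast step emits the dequantization factor together with the quantized data instead of leaving it behind in the recipe state:

```python
import torch

def quantize_with_scale_inv(x: torch.Tensor, scale: torch.Tensor):
    """Toy FP8-style cast: the step that writes the quantized values also
    writes the matching scale-inverse, so the two can never diverge."""
    fp8_max = 448.0                                  # E4M3 max magnitude
    q = torch.clamp(x * scale, -fp8_max, fp8_max)    # simulated FP8 values
    scale_inv = scale.reciprocal()                   # dequantization factor
    return q, scale_inv

# The recipe still owns the quantization scale (and the amax history)...
scale = torch.tensor(4.0)
x = torch.randn(16, 16)
x_q, x_scale_inv = quantize_with_scale_inv(x, scale)

# ...but dequantization only ever uses the scale-inv packaged with the data,
# so a later recipe update cannot invalidate existing FP8 tensors.
x_dq = x_q * x_scale_inv
```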
This PR takes an initial step toward this scheme by including scale-inv updates in most of the kernels with FP8 output: casting, activations, LayerNorm, RMSNorm. cuBLAS does not seem to support this, so I've added a small kernel that is launched after FP8 GEMMs. I have not attempted to propagate this change into Userbuffers or attention. I've also updated the PyTorch Linear and LayerNormLinear modules to avoid maintaining extra copies of the scale-inv, and I see a 1.12x speedup in the Linear forward pass.

I'm a little apprehensive since this is technically a breaking change: every time we generate FP8 values we overwrite the FP8 recipe's scale-inv. That said, I have a hard time imagining why we would ever use a stale FP8 scale-inv if the FP8 data has already been overwritten.
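The post-GEMM fix-up mentioned above amounts to writing the reciprocal of the output's quantization scale once the FP8 result exists. A minimal sketch of the idea (the actual change is a device-side kernel, not shown here):

```python
import torch

# cuBLAS writes the FP8 GEMM output but not its dequantization factor, so a
# tiny follow-up step derives it from the scale used for the output cast.
out_scale = torch.tensor(8.0)
out_scale_inv = torch.empty_like(out_scale)
out_scale_inv.copy_(out_scale.reciprocal())   # scale_inv = 1 / scale
```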
Type of change
Changes
- Linear module
- LayerNormLinear module
Checklist: