[PyTorch] Custom kernel to compute reciprocal of a single float #1016
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
```python
fwd_scale_inverses,
tex.FP8FwdTensors.GEMM1_INPUT,
inputmat_fp8_scale_inv,
0,
```
Why?
`fwd_scale_inverses` is created with a clone operation (~20 us), while `inputmat_fp8_scale_inv` is created with the scalar reciprocal kernel (~10 us).
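The launch-cost gap can be eyeballed with a small PyTorch timing loop. A hedged sketch: it compares stock `clone` and `reciprocal` launches on a tiny CUDA tensor as stand-ins rather than calling the new `tex.scalar_reciprocal` kernel, so the absolute numbers will differ from the ~20 us / ~10 us figures above.

```python
# Minimal sketch: measure CPU-side launch cost of tiny CUDA ops.
# Stock PyTorch ops stand in for the new tex.scalar_reciprocal kernel.
import time
import torch

scale = torch.rand(16, device="cuda") + 0.5  # stand-in for FP8 scales

def cpu_us_per_call(fn, iters=1000):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start  # no sync inside the loop:
    torch.cuda.synchronize()               # we want launch overhead,
    return elapsed / iters * 1e6           # not GPU execution time

print(f"clone:      {cpu_us_per_call(scale.clone):.1f} us/call")
print(f"reciprocal: {cpu_us_per_call(scale.reciprocal):.1f} us/call")
```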
```diff
@@ -335,7 +343,7 @@ def forward(
     weight,
     weight_fp8,
     weight.main_grad if cpu_offloading and fuse_wgrad_accumulation else None,
-    fp8_meta["scaling_fwd"].scale_inv.clone() if fp8 else None,
+    inputmat_fp8_scale_inv,
```
This is only set for the `Float8Tensor` case.
We handle three cases:
- Non-FP8: scale-inv is `None`:
  `inputmat_fp8_scale_inv = None`
- FP8, `Float8Tensor` input: scale-inv is taken from the `Float8Tensor`:
  `inputmat_fp8_scale_inv = inputmat._scale_inv`
- FP8, non-`Float8Tensor` input: scale-inv is computed with the fast kernel:
  `inputmat_fp8_scale_inv = tex.scalar_reciprocal(…)`
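Putting the three cases together, the dispatch looks roughly like the sketch below. This is hedged: the names `fp8`, `inputmat`, `fp8_meta`, and `tex` come from the diff context, and the `scalar_reciprocal` argument list is left elided because the call is truncated in the comment above.

```python
# Sketch of the three-way dispatch described above; the
# scalar_reciprocal argument list is elided in the original comment.
if not fp8:
    # Non-FP8 execution: no scale-inverse is needed.
    inputmat_fp8_scale_inv = None
elif isinstance(inputmat, Float8Tensor):
    # A Float8Tensor already carries its scale-inverse; reuse it.
    inputmat_fp8_scale_inv = inputmat._scale_inv
else:
    # Plain high-precision input cast to FP8: compute 1/scale with the
    # lightweight single-element kernel instead of cloning the whole
    # scale_inv tensor from the FP8 meta.
    inputmat_fp8_scale_inv = tex.scalar_reciprocal(...)  # args elided
```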
Closed by #1083
Description
FP8 training is frequently bottlenecked by CPU overheads, and a non-trivial fraction of that overhead comes from small PyTorch operations. For example, when I benchmark the forward pass of small `Linear` modules on an L40, I estimate ~20% of runtime is spent handling the FP8 scaling factors (mainly in reciprocal and clone operations). This PR attempts to mitigate these overheads by adding a `scalar_reciprocal` kernel that operates on a single `float`, bringing the kernel launch cost down from ~20 us to ~10 us. In my benchmark of `Linear` forwards, I see an 8% reduction in runtime.
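A `Linear`-forward measurement along these lines can be reproduced with Transformer Engine's public `fp8_autocast` API. The sketch below is illustrative only: module sizes, iteration counts, and the recipe settings are assumptions, not the exact benchmark behind the numbers quoted above.

```python
# Illustrative micro-benchmark of a small FP8 Linear forward pass;
# not the exact script used for the figures quoted in this PR.
import time
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

linear = te.Linear(256, 256)
x = torch.randn(32, 256, device="cuda")
fp8_recipe = recipe.DelayedScaling()

# Warm up, then time wall clock per forward pass.
for _ in range(10):
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        linear(x)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        linear(x)
torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 100 * 1e6:.1f} us per forward")
```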
Alternative approaches:
- `torch.compile` to fuse FP8 scale operations. We would require significant refactoring to avoid incurring extra overhead from graph breaks, especially in how we deal with the `FP8TensorMeta` class.

Type of change
Changes
Checklist: