[PyTorch] Fix FP8 activation recompute #1254

ksivaman · 2024-10-15T13:41:05Z

Description

The amax reduction for all backward tensors is performed by the first module (one of the base modules) in a given fp8_autocast. Each module saves the ctx.reduce_and_update_bwd_fp8_tensors flag by querying FP8GlobalStateManager.is_first_fp8_module(), which returns True only for the first module in the fp8_autocast. This breaks under activation recompute: the recompute phase runs outside the fp8 context, so the first-module flag is never set there. As a result, the amaxes for the gradients are never reduced.
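To make the failure mode concrete, here is a minimal, framework-free sketch. `ToyFP8State`, `module_forward`, and `forward_then_recompute` are hypothetical stand-ins written for illustration; they are not the Transformer Engine API and only mimic the "first caller sees True" behaviour described above.

```python
class ToyFP8State:
    """Hypothetical stand-in for a global FP8 state manager."""

    _first_module_pending = False

    @classmethod
    def enter_autocast(cls):
        # Entering the autocast region arms the first-module flag.
        cls._first_module_pending = True

    @classmethod
    def exit_autocast(cls):
        cls._first_module_pending = False

    @classmethod
    def is_first_module(cls):
        # True only for the first caller after enter_autocast().
        pending = cls._first_module_pending
        cls._first_module_pending = False
        return pending


def module_forward():
    # Each module records whether it owns the backward amax reduction.
    return {"reduce_bwd_amax": ToyFP8State.is_first_module()}


def forward_then_recompute():
    ToyFP8State.enter_autocast()
    forward_ctx = [module_forward(), module_forward()]
    ToyFP8State.exit_autocast()
    # Recompute re-runs the same modules, but outside the autocast region,
    # so nobody is ever marked as the first module.
    recompute_ctx = [module_forward(), module_forward()]
    return forward_ctx, recompute_ctx
```

In this toy model, the forward pass marks exactly one module for the reduction, while the recomputed pass marks none. That is the situation in which the gradient amaxes would go unreduced.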

Fixes #1190

Type of change

Documentation change (change only to the documentation, either a fix or new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

activation_recompute_forward now maintains a queue that passes the value of the IS_FIRST_FP8_MODULE flag from the forward phase to the recompute phase. During the recompute phase, the flag is then reset so that nested autocasts are not disturbed.
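The queue idea can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the code in this PR: `GlobalState`, `recompute_forward`, and `demo` are hypothetical names, and restoring the previous flag value on exit is one assumed way of "resetting" it.

```python
from collections import deque


class GlobalState:
    """Hypothetical stand-in for the global FP8 state holder."""

    is_first_module = False

    @classmethod
    def query_first_module(cls):
        # Return the flag and clear it, so only one module sees True.
        was_first = cls.is_first_module
        cls.is_first_module = False
        return was_first


class recompute_forward:
    """Toy context manager carrying the flag from forward to recompute."""

    _saved_flags = deque()  # shared FIFO queue of flag values

    def __init__(self, recompute_phase=False):
        self.recompute_phase = recompute_phase
        self._previous = None

    def __enter__(self):
        if not self.recompute_phase:
            # Forward phase: remember the flag seen when the region began.
            recompute_forward._saved_flags.append(GlobalState.is_first_module)
        else:
            # Recompute phase: replay the oldest saved flag value.
            self._previous = GlobalState.is_first_module
            GlobalState.is_first_module = recompute_forward._saved_flags.popleft()
        return self

    def __exit__(self, *exc_details):
        if self.recompute_phase:
            # Assumed reset: put the flag back so other regions are unaffected.
            GlobalState.is_first_module = self._previous
        return False


def demo():
    GlobalState.is_first_module = True  # as if an autocast region was entered
    with recompute_forward(recompute_phase=False):
        forward_saw_first = GlobalState.query_first_module()
    GlobalState.is_first_module = False  # autocast region has ended
    with recompute_forward(recompute_phase=True):
        recompute_saw_first = GlobalState.query_first_module()
    return forward_saw_first, recompute_saw_first, GlobalState.is_first_module
```

Calling `demo()` in this sketch shows the recomputed module seeing the same first-module value as the original forward, with the global flag left cleared afterwards.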

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman · 2024-10-15T13:41:29Z

/te-ci pytorch

denera

LGTM!

ptrendx · 2024-10-15T23:28:58Z

@ksivaman Could you put some more information about the bug and the fix in the description?

ksivaman · 2024-10-16T01:19:38Z

/te-ci pytorch

ksivaman added 2 commits October 14, 2024 21:31

Fix FP8 activation recompute

73916a4

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Merge branch 'NVIDIA:main' into fix_fp8_activation_recomp

a6cd47f

ksivaman added the bug label Oct 15, 2024

ksivaman requested a review from denera October 15, 2024 13:41

ksivaman self-assigned this Oct 15, 2024

ksivaman marked this pull request as draft October 15, 2024 13:41

ksivaman mentioned this pull request Oct 15, 2024

[PyTorch] FP8 and activation checkpointing causes training instabilities #1190

Closed

ksivaman marked this pull request as ready for review October 15, 2024 17:53

denera approved these changes Oct 15, 2024

Merge branch 'main' into fix_fp8_activation_recomp

1e119e0

ksivaman merged commit a518151 into NVIDIA:main Oct 16, 2024
14 of 15 checks passed

timmoon10 pushed a commit to timmoon10/TransformerEngine that referenced this pull request Nov 7, 2024

[PyTorch] Fix FP8 activation recompute (NVIDIA#1254)

1fed926

Fix FP8 activation recompute Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
