[PyTorch] Branching operations #1027

timmoon10 · 2024-07-19T00:25:24Z

Description

This PR modifies the operation-based API (#707) to support some simple branching behavior: operations can now accept extra tensor inputs and generate extra tensor outputs. This enables fusions like GEMMs with beta=1:

model = te.Sequential(
    MakeExtraOutput(),
    Linear(...),
    AddInPlace(),
)
y, linear_in = model(x, linear_out)  # GEMM with beta=1 into linear_out
...
loss.backward()  # dgrad GEMM with beta=1 into linear_in.grad
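For context, a BLAS-style GEMM computes `C <- alpha * A @ B + beta * C`, so `beta=1` means the product is accumulated into an existing buffer rather than written to a fresh one. The plain-PyTorch sketch below shows what the forward and backward of the example above are expected to compute numerically. It uses only `Tensor.addmm_` and autograd, not Transformer Engine kernels. The shapes are arbitrary assumptions, and folding the bias into the accumulator first is just one way to express GEMM+bias+add. Running it needs PyTorch (CPU is enough).

```python
import torch

torch.manual_seed(0)

batch, in_features, out_features = 4, 8, 16
x = torch.randn(batch, in_features)
weight = torch.randn(out_features, in_features)  # nn.Linear layout: (out, in)
bias = torch.randn(out_features)
linear_out = torch.randn(batch, out_features)  # extra input consumed by the in-place add

# Unfused reference: Linear writes a new tensor, which is then added to linear_out.
reference = linear_out + (x @ weight.t() + bias)

# GEMM+bias+add expressed as an accumulation with beta=1.
y = linear_out.clone()
y.add_(bias)                        # fold the bias into the accumulator
y.addmm_(x, weight.t(), beta=1.0)   # y <- 1.0 * y + x @ weight^T
assert torch.allclose(y, reference, atol=1e-5)

# Backward: x feeds both Linear and the extra output (linear_in), so its gradient
# is the sum of both branches, i.e. a dgrad GEMM with beta=1 into linear_in's grad.
grad_y = torch.randn(batch, out_features)
grad_linear_in = torch.randn(batch, in_features)
reference_dx = grad_linear_in + grad_y @ weight
dx = grad_linear_in.clone()
dx.addmm_(grad_y, weight, beta=1.0)  # dx <- 1.0 * dx + grad_y @ weight
assert torch.allclose(dx, reference_dx, atol=1e-5)

# Cross-check with autograd: gradients from the two uses of x accumulate.
x_ag = x.clone().requires_grad_(True)
loss = ((x_ag @ weight.t() + bias) * grad_y).sum() + (x_ag * grad_linear_in).sum()
loss.backward()
assert torch.allclose(x_ag.grad, dx, atol=1e-5)
```

The explicit `atol` allows for the different summation order between the fused-style and unfused computations.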

Support for multiple inputs will also be necessary for cross-attention (and SSMs?). Note that we are not planning to support more complicated structures since that will take us down the road of general graph compilers.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Support extra tensor inputs and outputs in operation-based API
Operation for in-place add
Operation for making extra tensor output
Fused operations for GEMM with beta=1
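To make the bookkeeping concrete, here is a toy, pure-Python model of how a sequential container might route extra inputs to operations and collect their extra outputs. Everything in it (`ToyOp`, `ToySequential`, the `num_extra_inputs`/`num_extra_outputs` attributes, and the toy ops) is hypothetical and is not Transformer Engine's fuser. It ignores fusion and autograd and uses Python lists in place of tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ToyOp:
    """Hypothetical op: takes the main input plus its extra inputs and
    returns the main output plus its extra outputs."""
    fn: Callable[[object, Sequence[object]], Tuple[object, List[object]]]
    num_extra_inputs: int = 0
    num_extra_outputs: int = 0


class ToySequential:
    """Hypothetical container: extra inputs are consumed in op order, and
    extra outputs are returned in op order after the main output."""

    def __init__(self, *ops: ToyOp) -> None:
        self.ops = ops
        # Precompute where each op's extra inputs start in the flat argument list,
        # so no index arithmetic is needed on every call.
        self.offsets = []
        offset = 0
        for op in ops:
            self.offsets.append(offset)
            offset += op.num_extra_inputs
        self.num_extra_inputs = offset

    def __call__(self, x, *extra_inputs):
        if len(extra_inputs) != self.num_extra_inputs:
            raise ValueError(
                f"expected {self.num_extra_inputs} extra inputs, got {len(extra_inputs)}"
            )
        extra_outputs = []
        for op, start in zip(self.ops, self.offsets):
            op_extra_in = extra_inputs[start : start + op.num_extra_inputs]
            x, op_extra_out = op.fn(x, op_extra_in)
            if len(op_extra_out) != op.num_extra_outputs:
                raise ValueError("op returned the wrong number of extra outputs")
            extra_outputs.extend(op_extra_out)
        return (x, *extra_outputs) if extra_outputs else x


# Toy ops: expose the current value as an extra output, scale it, then add it
# in place into a caller-provided buffer.
make_extra_output = ToyOp(lambda x, extra: (x, [x]), num_extra_outputs=1)
scale = ToyOp(lambda x, extra: ([2 * v for v in x], []))


def _add_in_place(x, extra):
    (buf,) = extra
    for i, v in enumerate(x):
        buf[i] += v  # mutate the caller's buffer instead of allocating a new one
    return buf, []


add_in_place = ToyOp(_add_in_place, num_extra_inputs=1)

model = ToySequential(make_extra_output, scale, add_in_place)
buf = [10, 20]
y, linear_in = model([1, 2], buf)
print(y, linear_in, y is buf)  # [12, 24] [1, 2] True
```

The call signature mirrors the `y, linear_in = model(x, linear_out)` pattern from the description: one main input plus one extra input in, one main output plus one extra output back.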

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

for more information, see https://pre-commit.ci

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2024-07-19T01:03:53Z

/te-ci pytorch

timmoon10 · 2024-07-22T17:56:22Z

/te-ci pytorch

transformer_engine/pytorch/ops/fused/forward_linear_bias_add.py

transformer_engine/pytorch/ops/basic/basic_linear.py

ptrendx · 2024-08-01T22:32:41Z

transformer_engine/pytorch/ops/basic/basic_linear.py

@@ -389,6 +406,32 @@ def _functional_forward(
                "are not compatible"
            )

+        # Check output tensor dims


I wonder if we need to do this here (same for input) or maybe we could rely on the error checking on the C++ side to minimize CPU overhead?

I think that would be a good optimization in the future, especially since the linear functional API is used in multiple operations.
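For reference, a Python-side output check of the kind under discussion could look like the sketch below. `check_linear_out_shape` is a hypothetical helper, not a Transformer Engine function. It assumes the `torch.nn.Linear` weight layout `(out_features, in_features)` and takes plain shape tuples (a `torch.Size` also works, since it is a tuple subclass). Every such check runs on the host for each call, which is the CPU overhead the reviewer suggests leaving to the C++ layer.

```python
from typing import Sequence


def check_linear_out_shape(
    input_shape: Sequence[int],
    weight_shape: Sequence[int],
    out_shape: Sequence[int],
) -> None:
    """Hypothetical check: out must have shape input_shape[:-1] + (out_features,)."""
    out_features, in_features = weight_shape
    if input_shape[-1] != in_features:
        raise ValueError(
            f"input shape {tuple(input_shape)} and weight shape "
            f"{tuple(weight_shape)} are not compatible"
        )
    expected = (*input_shape[:-1], out_features)
    if tuple(out_shape) != expected:
        raise ValueError(
            f"output shape {tuple(out_shape)} does not match expected shape {expected}"
        )


check_linear_out_shape((32, 128, 64), (256, 64), (32, 128, 256))  # passes silently
try:
    check_linear_out_shape((32, 64), (256, 64), (32, 128))
except ValueError as err:
    print(err)  # output shape (32, 128) does not match expected shape (32, 256)
```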

transformer_engine/pytorch/ops/basic/basic_linear.py

transformer_engine/pytorch/ops/fuser.py

Output tensor dtype and device take precedence over the weight tensor in the linear functional API. Move some index calculations to the fuser constructor. Avoid some unnecessary dereferences.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
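The first change in this commit amounts to a simple precedence rule. Below is a minimal sketch, assuming a hypothetical helper `resolve_output_dtype_device` and tensor-like objects that expose `.dtype` and `.device` (as `torch.Tensor` does). When a preallocated output is provided, its dtype and device win. Otherwise they fall back to the weight's.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def resolve_output_dtype_device(weight: Any, out: Optional[Any] = None) -> Tuple[Any, Any]:
    """Hypothetical helper: a provided output tensor takes precedence over the weight."""
    source = out if out is not None else weight
    return source.dtype, source.device


# Stand-ins for tensors; real torch.Tensor objects expose the same attributes.
weight = SimpleNamespace(dtype="float32", device="cuda:0")
out = SimpleNamespace(dtype="bfloat16", device="cuda:1")
print(resolve_output_dtype_device(weight))       # ('float32', 'cuda:0')
print(resolve_output_dtype_device(weight, out))  # ('bfloat16', 'cuda:1')
```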

timmoon10 · 2024-08-03T02:08:07Z

/te-ci pytorch

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2024-08-05T20:12:59Z

/te-ci pytorch

ptrendx · 2024-08-09T00:07:48Z

Could you comment on how the change from your last commit helped with the unittest failures? The change from list comprehension to the for loop should not change the behavior, right?

transformer_engine/pytorch/ops/fuser.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

timmoon10 · 2024-08-09T00:24:56Z

/te-ci pytorch

timmoon10 added 7 commits July 17, 2024 01:07

Add op for in-place add

9f61bca

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Add op for in-place add

2cda094

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Add op that adds extra output to fuser

6bf2869

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Add fused op for GEMM+bias+add

6d78177

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Add fused op for dgrad+add

da7a981

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Add documentation

872f863

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Merge branch 'main' into branching-ops

93c4040

timmoon10 added the enhancement New feature or request label Jul 19, 2024

timmoon10 requested review from sudhakarsingh27 and ksivaman July 19, 2024 00:25

pre-commit-ci bot and others added 2 commits July 19, 2024 00:25

[pre-commit.ci] auto fixes from pre-commit.com hooks

151df0b

for more information, see https://pre-commit.ci

Fix linter warnings

4e618cd

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Merge branch 'main' into branching-ops

fadbb8a