
[fix auto-microbatch] FSDP reshard and cleanup after OOM to fix the cuda memory leak #3030

Merged — 22 commits into dev from reshard-after-oom on Feb 22, 2024

Conversation


@bigning bigning commented Feb 18, 2024

What does this PR do?

There are two issues with the current auto-microbatching:

  1. NCCL timeout or hang.
  2. It settles on a lower device train microbatch size than necessary, or still OOMs even with device_train_microbatch_size = 1.

This PR targets the 2nd issue; PR 3016 (WIP) targets the 1st.

The 2nd issue is caused by a memory leak: if an OOM happens during FSDP forward/backward after parameters have been unsharded, we need to manually reshard and clean up. Otherwise, memory leaks like this:
[Figure: GPU memory usage plot showing the leaked memory after OOM]

There is an existing FSDP backward callback, `_post_backward_final_callback`, which reshards and cleans up at the end of the FSDP backward pass. This PR simply calls that API after an OOM happens.
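
For illustration, here is a minimal sketch of that cleanup path. Assumptions: it relies on PyTorch's private FSDP internals (`torch.distributed.fsdp._runtime_utils._post_backward_final_callback`), which can change between releases, and the helper name `fsdp_reshard_and_cleanup` and the retry loop in the comments are illustrative rather than Composer's exact code:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel
# Private PyTorch FSDP internals (PyTorch 2.x); subject to change across releases.
from torch.distributed.fsdp._runtime_utils import _post_backward_final_callback


def fsdp_reshard_and_cleanup(model: torch.nn.Module) -> None:
    """Reshard FSDP modules left unsharded by an OOM mid-forward/backward.

    When an OOM interrupts forward/backward, the normal post-backward hooks
    never run, so unsharded parameters leak memory. Invoking FSDP's final
    backward callback on each root FSDP module reshards them and resets the
    backward/prefetch bookkeeping.
    """
    for module in model.modules():
        if isinstance(module, FullyShardedDataParallel) and module.check_is_root():
            # Calling this on a root module traverses and reshards all of its
            # child FSDP modules as well.
            _post_backward_final_callback(module, module)
```

In an auto-microbatching loop, the call would slot in right after the OOM is caught, roughly like:

```python
try:
    loss = model(batch).sum()
    loss.backward()
except torch.cuda.OutOfMemoryError:
    fsdp_reshard_and_cleanup(model)
    torch.cuda.empty_cache()
    # ...then retry with a smaller device_train_microbatch_size
```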

test

unit test

python -m composer.cli.launcher -n 2 -m pytest -m gpu tests/trainer/test_fsdp.py -k test_fsdp_reshard_after_oom
python -m composer.cli.launcher -n 2 -m pytest -m gpu tests/trainer/test_fsdp.py -k test_fsdp_same_state_after_oom_reshard
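The gist of these tests, as a hypothetical sketch (the module, test name, and simulated-OOM trick below are illustrative, not the actual code in tests/trainer/test_fsdp.py; it assumes the launcher has already initialized a 2-rank process group):

```python
import pytest
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class OOMModule(torch.nn.Module):
    """Tiny module that can simulate a CUDA OOM mid-forward."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 8)
        self.raise_oom = True

    def forward(self, x):
        out = self.fc(x)
        if self.raise_oom:
            # Parameters are unsharded at this point, so without manual
            # cleanup they would stay unsharded and leak memory.
            raise torch.cuda.OutOfMemoryError('CUDA out of memory (simulated)')
        return out


@pytest.mark.gpu
def test_reshard_after_simulated_oom():
    model = FSDP(OOMModule().cuda())
    x = torch.randn(4, 8, device='cuda')
    with pytest.raises(torch.cuda.OutOfMemoryError):
        model(x).sum().backward()
    fsdp_reshard_and_cleanup(model)  # helper sketched above
    # After resharding, a normal train step should succeed.
    model.module.raise_oom = False
    model(x).sum().backward()
```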

end-to-end test

llama2 13B finetuning on 16 A100-40G GPUs with global train batch size = 1024

  • baseline: llama2-13b-baseline-new-Ln2FYB OOMed even with device-train-microbatch-size = 1
  • test: llama2-13b-test-16-40g-r7z22-IBHkBe ran well with device-train-microbatch-size = 2

llama2 70B finetuning on 32 H100-80G GPUs with global train batch size = 1024

  • baseline: llama2-70b-8192-32-h100-80g-baseline-npn6Qj OOMed even with device-train-microbatch-size = 1
  • test: llama2-70b-8192-32-h100-80g-test-vbLmLL chose device-train-microbatch-size = 2 and ran well until evaluation. After evaluation it hit an NCCL timeout error, which is a different issue; we'll debug and fix the timeout in a separate PR.
  • fixed device-train-microbatch-size = 4: llama2-70b-32-h100-80g-fixed-4-KKDQ7I OOMed.

@bigning bigning marked this pull request as ready for review February 20, 2024 17:30

@mvpatel2000 mvpatel2000 left a comment


This is great! A few minor nits.

Would like @cli99's thoughts too.

Review threads on composer/trainer/trainer.py and tests/trainer/test_fsdp.py (resolved).

@cli99 cli99 left a comment


Nice test. Looks good to me.


awgu commented Feb 20, 2024

Thanks for calling this out and great catch! I think that calling the _post_backward_final_callback sounds good to me.


bigning commented Feb 20, 2024

> Thanks for calling this out and great catch! I think that calling the _post_backward_final_callback sounds good to me.

Thank you @awgu for the quick response!

Review threads on composer/trainer/trainer.py and tests/trainer/test_fsdp.py (resolved).
bigning and others added 2 commits February 21, 2024 13:09
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
@bigning bigning requested a review from mvpatel2000 February 21, 2024 21:20

@mvpatel2000 mvpatel2000 left a comment


This is a really nice PR. Super great catch.

@bigning bigning enabled auto-merge (squash) February 21, 2024 22:46

@mvpatel2000 mvpatel2000 left a comment


nit

Review thread on tests/trainer/test_fsdp.py (resolved).
@bigning bigning merged commit 2133c17 into dev Feb 22, 2024
14 checks passed
@bigning bigning deleted the reshard-after-oom branch February 22, 2024 02:27