[NaN] Fix nan print issue when running Megatron-Deepspeed with DeepSpeed #434

ys950902 · 2024-08-05T04:40:36Z

This PR fixes this issue: the NaN check should run whether or not the iteration was skipped.
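To illustrate the intent, here is a minimal sketch of a NaN check that runs even when the iteration is flagged as skipped. It is not the code from this PR: the function name `count_nan_iteration`, the key `"nan iterations"`, and the dictionary layout are assumptions made for illustration only.

```python
import math


def count_nan_iteration(loss_dict, total_loss_dict, skipped_iter,
                        nan_key="nan iterations"):
    """Hypothetical helper: accumulate losses only for non-skipped
    iterations, but check for NaN/Inf on every iteration."""
    got_nan = False
    for key, value in loss_dict.items():
        if not skipped_iter:
            # Loss accumulation still depends on the skipped flag.
            total_loss_dict[key] = total_loss_dict.get(key, 0.0) + value
        # The NaN/Inf check does not depend on the skipped flag.
        got_nan = got_nan or math.isnan(value) or math.isinf(value)
    total_loss_dict[nan_key] = total_loss_dict.get(nan_key, 0) + int(got_nan)
    return got_nan
```

With this shape, a NaN loss is counted even if `skipped_iter` is set, while the running loss totals are left untouched for skipped iterations.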

abhilash1910

LGTM!

tjruwase · 2024-08-07T12:47:41Z

@ys950902, can you please share a few more details about why skipped_iter is False in this case?

ys950902 · 2024-08-07T13:05:12Z

Hi @tjruwase, thanks for your reply. When you run Megatron-DeepSpeed with DeepSpeed for 3D parallelism:
https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L674
or with ZeRO-2/3:
https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L762
skipped_iter is set to 0 by default and DeepSpeed does not update this flag, so it is False here.

tjruwase · 2024-08-07T13:41:19Z

@ys950902, thanks for the explanation. I think the correct solution is to use the was_step_applied() API of DeepSpeed. And I noticed that for the non-3D parallelism case, it is already used to set update_successful.

Megatron-DeepSpeed/megatron/training.py

Line 746 in 53b241f

update_successful = model[0].was_step_applied()

The problem is that update_successful is not used to set skipped_iter appropriately, unlike in the non-DeepSpeed code path.

Megatron-DeepSpeed/megatron/training.py

Lines 773 to 778 in 53b241f

    
    if update_successful:
        increment = get_num_microbatches() * \
                    args.micro_batch_size * \
                    args.data_parallel_size
        opt_param_scheduler.step(increment=increment)
        skipped_iter = 0

Can you try setting update_successful and skipped_iter for both DeepSpeed cases consistently with the Megatron case? Thanks
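The suggestion above could be sketched roughly as follows. This is an illustration, not the merged change: `resolve_skipped_iter` and `FakeEngine` are hypothetical names, and the only DeepSpeed call assumed is `was_step_applied()`, which is the API named in this thread.

```python
class FakeEngine:
    """Hypothetical stand-in for a DeepSpeed engine; only
    was_step_applied() is mirrored here."""

    def __init__(self, applied):
        self._applied = applied

    def was_step_applied(self):
        return self._applied


def resolve_skipped_iter(engine):
    """Derive both flags from the engine, as the Megatron path does
    from its own optimizer step result."""
    update_successful = engine.was_step_applied()
    # A step that was not applied counts as a skipped iteration.
    skipped_iter = 0 if update_successful else 1
    return update_successful, skipped_iter
```

Calling the same helper in both the 3D-parallelism and ZeRO-2/3 branches would keep `skipped_iter` consistent with `update_successful` instead of leaving it at its default of 0.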

ys950902 · 2024-08-07T13:53:28Z

Got it, I will fix it as you suggested!

ys950902 · 2024-08-08T05:36:48Z

Hi @tjruwase, could you please take a look at this PR, together with the DeepSpeed change that adds bfloat16 support, microsoft/DeepSpeed#5879?

ys950902 · 2024-08-24T05:17:23Z

Hi @tjruwase, will you merge this PR?

ys950902 requested review from tjruwase, awan-10, eltonzheng, duli2012, arashb and GuanhuaWang as code owners August 5, 2024 04:40

abhilash1910 approved these changes Aug 7, 2024

fix nan issue when running megatron-deepspeed

2985392

ys950902 force-pushed the nan_issue branch from 836a9f3 to 2985392 Compare August 8, 2024 05:35

tjruwase approved these changes Aug 10, 2024

iamdeepakgit approved these changes Aug 13, 2024

tjruwase merged commit 4f9f1f6 into microsoft:main Aug 24, 2024
5 checks passed
