[BUG] Tensor (hidden states) missing across GPUs in Pipeline Parallelism training #5696

Closed
Youngluc opened this issue on Jun 25, 2024 · 3 comments
Labels: bug, training


Youngluc commented Jun 25, 2024

Describe the bug
I am training an LLM with DeepSpeed pipeline parallelism (ZeRO-0 or ZeRO-1), and I have run into a tricky issue:

Assume global_batch_size=4 on a single machine with 8 GPUs, with PP=8, so DP=1 and micro_batch_size=4.
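
For reference, these batch settings correspond to roughly the following DeepSpeed config fragment (a simplified sketch, not my actual config file; with DP=1 and a single micro batch per step, the global and micro batch sizes are both 4):

# Sketch of the relevant DeepSpeed config values (assumed, not the actual file).
# train_batch_size = micro_batch_size * gradient_accumulation_steps * DP = 4 * 1 * 1 = 4
ds_config = {
    "train_batch_size": 4,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 1},  # ZeRO-0 or ZeRO-1, as noted above
}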

Further assume the first batch contains an input sequence of shape (4, 2262), whose corresponding hidden_states have shape (4, 2262, C), and the second batch contains an input sequence of shape (4, 2361), whose hidden_states have shape (4, 2361, C).

We also have the following stage partition:

stage=0 layers=14
     0: TokenizerPipeLayer
     1: InternLMBlockPipeLayer
     2: InternLMBlockPipeLayer
     3: InternLMBlockPipeLayer
     4: InternLMBlockPipeLayer
     5: InternLMBlockPipeLayer
     6: InternLMBlockPipeLayer
     7: InternLMBlockPipeLayer
     8: InternLMBlockPipeLayer
     9: InternLMBlockPipeLayer
    10: InternLMBlockPipeLayer
    11: InternLMBlockPipeLayer
    12: InternLMBlockPipeLayer
    13: InternLMBlockPipeLayer
stage=1 layers=7
    14: InternLMBlockPipeLayer
    15: InternLMBlockPipeLayer
    16: InternLMBlockPipeLayer
    17: InternLMBlockPipeLayer
    18: InternLMBlockPipeLayer
    19: InternLMBlockPipeLayer
    20: InternLMBlockPipeLayer
......

But the following RuntimeError occurs:
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2361, 6144]).

Actually, I print the tensor shape in my customized InternLMBlockPipeLayer(nn.Module):

import deepspeed
import torch.distributed as dist
from torch import nn

# InternVLChatModel comes from the InternVL codebase (its import is omitted here).
class InternLMBlockPipeLayer(nn.Module):
    def __init__(self, model: InternVLChatModel, layer_idx: int, gradient_checkpointing: bool = False):
        super().__init__()
        self.idx = layer_idx
        self.layer = model.language_model.model.layers[layer_idx]
        self.gradient_checkpointing = gradient_checkpointing

    def forward(self, ipt):
        hidden_states, attention_mask, position_ids, labels, random_id = ipt

        print('WARNING: ', hidden_states.shape, f'{self.__class__.__name__}.{self.idx}', f'cuda:{dist.get_rank()}', random_id, '\n')

        if self.gradient_checkpointing and self.training:
            output_attentions = False

            def create_custom_forward(module):
                def custom_forward(*inputs):
                    # None for past_key_value
                    return module(*inputs, output_attentions, None)

                return custom_forward

            # DeepSpeed checkpointing automatically unwraps to outputs[0] when len(outputs) == 1
            outputs = deepspeed.checkpointing.checkpoint(
                create_custom_forward(self.layer),
                hidden_states,
                attention_mask,
                position_ids,
                None,
            )
            layer_outputs = [outputs]
        else:
            layer_outputs = self.layer(
                hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
            )

        hidden_states = layer_outputs[0]

        return hidden_states, attention_mask, position_ids, labels, random_id
        # random_id is defined in collate_fn and equals len(input_ids); it is just a tag.
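
For context, a hypothetical sketch of how such a collate_fn could attach this tag (an illustration of the description above, not my actual collate_fn; the field names and padding values are assumptions):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(samples):
    # Hypothetical: each sample is a dict holding "input_ids" and "labels" tensors.
    input_ids = pad_sequence([s["input_ids"] for s in samples], batch_first=True, padding_value=0)
    labels = pad_sequence([s["labels"] for s in samples], batch_first=True, padding_value=-100)
    attention_mask = (input_ids != 0).long()
    position_ids = torch.arange(input_ids.size(1)).unsqueeze(0).expand(input_ids.size(0), -1)
    # The "random_id" tag simply records the padded sequence length of this batch.
    random_id = torch.tensor(input_ids.size(1))
    return input_ids, attention_mask, position_ids, labels, random_id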

The first step (batch (4, 2262)) is normal, but during the second step the log looks like this:

Epoch 1:   0%| | 0/31397 [00:00<?, ?it/s]
0 begin, current rank: 0
dynamic ViT batch size: 29, images per sample: 7.25, dynamic token length: 2262 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor(2262, device='cuda:0') 

[2024-06-25 12:43:23,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor(2262, device='cuda:0') 

[2024-06-25 12:43:23,163] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor(2262, device='cuda:0') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor(2262, device='cuda:0') 

[2024-06-25 12:43:24,122] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-25 12:43:24,648] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-25 12:43:24,657] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.13 cuda:1 tensor(2262, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor(2262, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor(2262, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor(2262, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor(2262, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor(2262, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor(2262, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.20 cuda:2 tensor(2262, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor(2262, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor(2262, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor(2262, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 tensor(2262, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.25 cuda:2 tensor(2262, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.26 cuda:2 tensor(2262, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.27 cuda:3 tensor(2262, device='cuda:3') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.28 cuda:3 tensor(2262, device='cuda:3') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.29 cuda:3 tensor(2262, device='cuda:3') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.30 cuda:3 tensor(2262, device='cuda:3') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.31 cuda:3 tensor(2262, device='cuda:3') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.32 cuda:3 tensor(2262, device='cuda:3') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.33 cuda:3 tensor(2262, device='cuda:3') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.34 cuda:4 tensor(2262, device='cuda:4') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.35 cuda:4 tensor(2262, device='cuda:4') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.36 cuda:4 tensor(2262, device='cuda:4') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.37 cuda:4 tensor(2262, device='cuda:4') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.38 cuda:4 tensor(2262, device='cuda:4') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.39 cuda:4 tensor(2262, device='cuda:4') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.40 cuda:4 tensor(2262, device='cuda:4') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.41 cuda:5 tensor(2262, device='cuda:5') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.42 cuda:5 tensor(2262, device='cuda:5') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.43 cuda:5 tensor(2262, device='cuda:5') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.44 cuda:5 tensor(2262, device='cuda:5') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.45 cuda:5 tensor(2262, device='cuda:5') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.46 cuda:5 tensor(2262, device='cuda:5') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.47 cuda:5 tensor(2262, device='cuda:5') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.48 cuda:6 tensor(2262, device='cuda:6') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.49 cuda:6 tensor(2262, device='cuda:6') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.50 cuda:6 tensor(2262, device='cuda:6') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.51 cuda:6 tensor(2262, device='cuda:6') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.52 cuda:6 tensor(2262, device='cuda:6') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.53 cuda:6 tensor(2262, device='cuda:6') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.54 cuda:6 tensor(2262, device='cuda:6') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.55 cuda:7 tensor(2262, device='cuda:7') 

[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-06-25 12:43:39,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 80.73 | optimizer_step: 401.20

06/25/2024 12:43:43 - INFO - __main__ - {'loss': 2.084827423095703, 'learning_rate': 0.0, 'epoch': 0.0}


Epoch 1:   0%| | 0/31397 [00:24<?, ?it/s, loss=2.08, learning_rate=0, epoch=0]

Epoch 1:   0%| | 1/31397 [00:24<210:13:33, 24.11s/it, loss=2.08, learning_rate=0, epoch=0]

dynamic ViT batch size: 31, images per sample: 7.75, dynamic token length: 2361 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor(2361, device='cuda:0') 

WARNING:  torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor(2361, device='cuda:0') 


Epoch 1:   0%|                                                                                                        | 1/31397 [00:26<234:07:37, 26.85s/it, loss=2.08, learning_rate=0, epoch=0]
tensor(2361, device='cuda:1') 

Traceback (most recent call last):
  File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 863, in <module>
    main()
  File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 842, in main
    loss = model.train_batch(data_iter=train_iter)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 373, in train_batch
    self._exec_schedule(sched)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1373, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 789, in _exec_backward_pass
    torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 244, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 88, in _make_grads
    raise RuntimeError(
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2361, 6144]).
WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor(2361, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor(2361, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor(2361, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor(2361, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor(2361, device='cuda:1') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor(2361, device='cuda:1') 

tensor(2361, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor(2361, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor(2361, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor(2361, device='cuda:2') 

WARNING:  torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 tensor(2361, device='cuda:2') 

As you can see, when crossing GPUs the print() output is also truncated, and the tensor of shape (4, 2361, 6144) seems to go missing when sent from stage 0 (GPU 0) to stage 1 (GPU 1): stage 1 still receives hidden_states of shape (4, 2262, 6144) even though random_id is 2361.

What should I do to fix this?

If anything else is required, please tell me. Thank you very much!

Expected behavior
Correct communication among stages (GPUs).


System info:

  • OS: Ubuntu 22.04
  • GPU count and types [8xA100-80G]
  • Python 3.9
  • torch 2.1.2 (2.0.1 also tried) / DeepSpeed 0.13.5 / CUDA 12.2 / accelerate 0.31.0 / transformers 4.37.2

tohtana commented Sep 9, 2024

Hi @Youngluc, can you try dynamic_shape=True? A similar issue was reported in #5568 and was fixed by this option.

model = PipelineModule(
    ...
    dynamic_shape=True,
)
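
For example, a minimal sketch of where this flag goes (the layer list and num_stages=8 here are placeholders matching the setup described above, not your actual training script):

from deepspeed.pipe import PipelineModule

# layers: the TokenizerPipeLayer followed by the InternLMBlockPipeLayer instances.
model = PipelineModule(
    layers=layers,
    num_stages=8,
    # Lets the pipeline engine handle activations whose shapes change
    # from batch to batch (variable sequence lengths).
    dynamic_shape=True,
)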

Youngluc (Author) commented

@tohtana I will give it a try, thanks!


tohtana commented Sep 20, 2024

Since we haven’t received any additional information, we’re closing this issue for now. Please feel free to reopen it if you have more details to share.

tohtana closed this as completed on Sep 20, 2024