Need help on Micro Batch Size, Global Batch Size, Pipeline Parallel size calculation #6795
-
I am using the https://huggingface.co/nvidia/GPT-2B-001 model, and I need to do data & model parallelization. For that I used:

config.model.micro_batch_size = 4
config.model.tensor_model_parallel_size = 1

Total no. of GPUs = 4 (a single machine with multiple GPUs).

With that I got an error from https://github.com/NVIDIA/NeMo/blob/v1.18.1/nemo/collections/nlp/modules/common/megatron/megatron_init.py

Can anyone help me to solve this? Thanks
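For reference, a minimal sketch of the configuration described above, assuming OmegaConf-style config objects and the standard NeMo Megatron key names; pipeline_model_parallel_size is shown at its usual default of 1, which is an assumption since the question does not set it:

```python
from omegaconf import OmegaConf

# The two values quoted above, plus the pipeline-parallel key at its assumed default.
config = OmegaConf.create({
    "model": {
        "micro_batch_size": 4,
        "tensor_model_parallel_size": 1,
        "pipeline_model_parallel_size": 1,  # assumption: not set explicitly in the question
    }
})
num_gpus = 4  # single machine with 4 GPUs
```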
-
In short:

TP_size * PP_size * DP_size == num_GPUs

So with 4 GPUs, if you want to do PP=4, you cannot also do DP.

For the equation there (the batch-size check, where num_micro_batches is the gradient-accumulation factor):

global_batch_size == micro_batch_size * DP_size * num_micro_batches
Therefore they should be equal.
Now that you understand the equation, the issue is probably because the DP_size is wrong. A possible reason is that you didn't partition the model into PP=4, so the program uses DP=4. Simply setting PP=4 in the config wouldn't work; you need to use …
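To make the arithmetic concrete, here is a small self-contained sketch (illustrative names only, not NeMo's actual API; the global_batch_size value is hypothetical) showing how DP_size falls out of the other two sizes, and why the batch-size check then complains:

```python
def derive_dp_size(num_gpus: int, tp_size: int, pp_size: int) -> int:
    """DP_size is whatever is left after TP and PP claim their GPUs,
    since TP_size * PP_size * DP_size == num_GPUs."""
    assert num_gpus % (tp_size * pp_size) == 0, "num_GPUs must be divisible by TP_size * PP_size"
    return num_gpus // (tp_size * pp_size)

# The setup from the question: 4 GPUs, TP=1, and PP left at its default of 1.
dp_size = derive_dp_size(num_gpus=4, tp_size=1, pp_size=1)
print(dp_size)  # 4 -> the model is replicated on all 4 GPUs (pure data parallelism)

# Batch-size check (global_batch_size = 4 is a hypothetical value for illustration):
micro_batch_size, global_batch_size = 4, 4
if global_batch_size % (micro_batch_size * dp_size) != 0:
    print(f"fails: global_batch_size={global_batch_size} is not a multiple of "
          f"micro_batch_size * DP_size = {micro_batch_size * dp_size}")

# If the model really were partitioned into 4 pipeline stages, DP would collapse to 1,
# and the same global_batch_size would pass the check:
print(derive_dp_size(num_gpus=4, tp_size=1, pp_size=4))  # 1 -> no data parallelism left
```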