Could I also check how the number of batches is decided? In the MarbleNet_3x2x64.yaml file, batch_size is clearly stated as 128. The log shows:

[NeMo I 2023-07-02 03:08:18 collections:298] Filtered duration for loading collection is 0.000000.

but during training:

Epoch 3: 100%|████████████████████████████████████████████████████| 18750/18750 [59:59<00:00, 5.21it/s, loss=0.322, v_num=8-42

If there are 18750 batches at a batch size of 128, I would have to have 2,400,000 items, which is far more than my 2,164,876 training items. Even if batches were counted in MFCC frames rather than files (each file gives a 64 x 64 matrix, i.e. 64 frames), the number of batches should be 1,082,438. If counted purely by audio file, it should be 16,913 or 16,914 depending on rounding. Please, someone help. Thank you.
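To make my arithmetic concrete, here is a quick sanity check, assuming a standard PyTorch DataLoader with drop_last=False, and assuming (this is a guess about the setup) that DDP splits the data evenly per GPU:

```python
# Sanity check of expected steps per epoch (numbers from my run).
import math

num_items = 2_164_876   # training items
batch_size = 128        # from MarbleNet_3x2x64.yaml

# Counting purely by audio file (drop_last=False assumed):
print(math.ceil(num_items / batch_size))        # 16914

# Counting MFCC frames, 64 frames per file (my 64 x 64 reading):
print(num_items * 64 // batch_size)             # 1082438

# Assumed DDP split across 2 GPUs, counting by file:
print(math.ceil(num_items / (batch_size * 2)))  # 8457

# None of these matches the 18750 steps the progress bar shows
# (18750 * 128 = 2,400,000 items).
```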
-
Hi all,
I followed the VAD tutorial at https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/tutorials.html, which uses branch r1.18.0. I have set up all the dependencies, i.e. Megatron and Apex, and my PyTorch compatibility issues are also resolved.
According to the tutorial, I can speed up training with devices = 2 since I have 2 devices.
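For reference, this is roughly what I mean; a minimal sketch assuming the usual PyTorch Lightning Trainer arguments that NeMo's trainer config maps onto (the epoch count is just illustrative, not from the tutorial):

```python
# Minimal sketch, assuming the standard PyTorch Lightning Trainer API;
# max_epochs is an illustrative value.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,        # use both GPUs
    strategy="ddp",   # one process per GPU, gradients synced each step
    max_epochs=50,    # illustrative only
)
```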
While the image I provide is a snapshot where both GPUs have the same load, most of the time one GPU is at 80-100% and the other at 0-30%, or the other way around. So both GPUs are used, but at any one time the load on them is vastly different. I am not sure if this is expected behavior.
In addition, when I use the default MarbleNet config as given in the tutorial, GPU utilisation is at 14-20%, which is very different from what I see when DDP is used.
Using watch -n 0.5 nvidia-smi (please ignore the circled area)
Lastly, unlike other discussions where the PID was found to be allocated to just one GPU, it seems both of my GPUs are running. Unfortunately I am still trying to get the PIDs per GPU to show, so I can't say for sure either way.
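In case it helps, this is the query I am attempting; a sketch wrapping nvidia-smi's documented CSV query mode in Python (note that inside a container, PIDs may not be visible to nvidia-smi):

```python
# Sketch: list compute processes per GPU via nvidia-smi's query mode.
# The flags below are part of nvidia-smi's documented CSV query interface.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```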