Could I also check how the number of batches is decided? In the MarbleNet_3x2x64.yaml file, batch_size is clearly stated as 128. The log shows:

[NeMo I 2023-07-02 03:08:18 collections:298] Filtered duration for loading collection is 0.000000.

but during training:

Epoch 3: 100%|████████████████████████████████████████████████████| 18750/18750 [59:59<00:00, 5.21it/s, loss=0.322, v_num=8-42

If there are 18750 batches at a batch size of 128, I would have to have 2,400,000 items, which is far more than my 2,164,876 training items. Even if batches were counted in MFCC frames rather than files (each file gives a 64 x 64 matrix, i.e. 64 frames), the number of batches should be 1,082,438. If counted purely by audio file, it should be 16,913 or 16,914 depending on rounding. Please, someone help. Thank you.
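To make my arithmetic concrete, here is a quick sanity check, assuming a standard PyTorch DataLoader with drop_last=False, and assuming (this is a guess about the setup) that DDP splits the data evenly per GPU:

```python
# Sanity check of expected steps per epoch (numbers from my run).
import math

num_items = 2_164_876   # training items
batch_size = 128        # from MarbleNet_3x2x64.yaml

# Counting purely by audio file (drop_last=False assumed):
print(math.ceil(num_items / batch_size))        # 16914

# Counting MFCC frames, 64 frames per file (my 64 x 64 reading):
print(num_items * 64 // batch_size)             # 1082438

# Assumed DDP split across 2 GPUs, counting by file:
print(math.ceil(num_items / (batch_size * 2)))  # 8457

# None of these matches the 18750 steps the progress bar shows
# (18750 * 128 = 2,400,000 items).
```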
-
Hi all,
I followed the VAD tutorial at https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/tutorials.html, which uses branch r1.18.0. I have set up all the dependencies, i.e. Megatron and Apex, and my PyTorch compatibility issues are also resolved.
According to the tutorial, I can speed up training with devices = 2 since I have 2 devices.
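For reference, this is roughly what I mean; a minimal sketch assuming the usual PyTorch Lightning Trainer arguments that NeMo's trainer config maps onto (the epoch count is just illustrative, not from the tutorial):

```python
# Minimal sketch, assuming the standard PyTorch Lightning Trainer API;
# max_epochs is an illustrative value.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,        # use both GPUs
    strategy="ddp",   # one process per GPU, gradients synced each step
    max_epochs=50,    # illustrative only
)
```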
While the image I provide is a snapshot where both GPUs have the same load, most of the time one GPU is at 80-100% and the other at 0-30%, or the other way around. So both GPUs are used, but at any one time the load on them is vastly different. I am not sure if this is expected behavior.
In addition, when I use the default MarbleNet config as given in the tutorial, GPU utilisation is at 14-20%, which is very different from what I see when DDP is used.
Using watch -n 0.5 nvidia-smi (please ignore the circled area)
Lastly, unlike other discussions where the PID was found to be allocated to just one GPU, it seems both of my GPUs are running. Unfortunately I am still trying to get the PIDs per GPU to show, so I can't say for sure either way.
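In case it helps, this is the query I am attempting; a sketch wrapping nvidia-smi's documented CSV query mode in Python (note that inside a container, PIDs may not be visible to nvidia-smi):

```python
# Sketch: list compute processes per GPU via nvidia-smi's query mode.
# The flags below are part of nvidia-smi's documented CSV query interface.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```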