-
Hi, thank you for building this amazing open source tool! I'm trying to train a NeMo ASR model on a large dataset of about 10K hours (roughly 10 million wav files) ranging from 1 to 60 seconds. Are there any recommended settings for num_shards and buckets_num when generating sharded data for efficient ASR training at this scale? I have been trying (num_shards, buckets_num) = (1024, 32), (2048, 32), etc. However, when training an ASR model on the generated data, data loading took too long or training never actually started. This is the command I used:

python scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path=${path_manifest_file} \
    --target_dir=${path_tar_file} \
    --num_shards=1024 \
    --max_duration=60.0 \
    --min_duration=0.5 \
    --shuffle \
    --shuffle_seed=1 \
    --sort_in_shards \
    --workers=-1 \
    --buckets_num=32

Please note that I am going to train the ASR model on A100 GPUs.
-
The number of tarfiles and buckets depends somewhat on your compute cluster. We generally train on 128-256 GPUs, so we need enough tarfiles to allocate at least one tarfile per GPU, which sets a lower bound on the number of tarfiles * buckets. Generally, we use 8 buckets and around 512-8192 tarfiles depending on the dataset size. @nithinraok can give more accurate numbers.

About data loading, make sure you use the sharded manifest option in the config: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#sharded-manifests

Another speedup would be to convert all audio to flac, which is a new option added recently: https://github.com/NVIDIA/NeMo/blob/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py#L186 @pzelasko can we add this flag to all the examples that show how to call the script, please?

Finally, @pzelasko has integrated the Lhotse data loader into NeMo and it's showing very strong performance. We do need to add it to the NeMo docs, but for the time being, @pzelasko, can you show an example config of what you would have to update to use the Lhotse data loader?
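As a rough illustration of the sizing rule above (the numbers are illustrative, not official NeMo guidance, and the exact semantics of num_shards vs. buckets_num may differ across versions of the conversion script):

# Sketch of the "at least one tarfile per GPU" lower bound.
num_gpus = 128          # a typical multi-node job size mentioned above
num_buckets = 8
# Every GPU should be able to claim at least one tarfile from every bucket,
# so the per-bucket shard count needs to be at least the world size.
min_shards_per_bucket = num_gpus
min_total_tarfiles = num_buckets * min_shards_per_bucket
print(min_total_tarfiles)   # 1024, consistent with the 512-8192 range above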
-
Sure.
The doc is available in the repo trunk, as we have yet to release the new version of NeMo that includes the Lhotse integration. You can find it here: https://github.com/NVIDIA/NeMo/blob/main/docs/source/asr/datasets.rst#lhotse-dataloading

@eesungkim Lhotse+NeMo might indeed be a good fit for your use case, since it scales more easily to larger data. Please see the documentation above for steps to get started. It will "just work" with your existing NeMo data, whether it's tarred or not. You don't need to bucket the data, as Lhotse can bucket it dynamically, but it will work with pre-bucketed data too. This is a very recent feature in NeMo and we have yet to add an end-to-end usage example or tutorial, but in the meantime I'm happy to help and answer any questions you might have.
-
Hi @titu1994 @pzelasko. In my case, I have been running experiments on a single node with 8 GPUs, so I ran into the issue that loading the manifest file was very time-consuming, since the dataset has almost 10 million entries. After changing this part [1, 2] to parallel processing, I was able to start training much faster. Second, regarding my data configuration, I noticed that loading all the data before training took up 400GB of memory, so I increased my server memory accordingly and made sure the trainer ran properly. I will try Lhotse as you suggested and ask more questions if I run into anything. Thanks again!
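For readers hitting the same slow manifest load: this is not the poster's actual patch (the references [1, 2] are in the original thread), just one possible way to parallelise parsing of a large JSON-lines manifest with the standard library:

# Illustrative sketch: parse a JSON-lines manifest across multiple processes.
import json
from concurrent.futures import ProcessPoolExecutor

def _parse_chunk(lines):
    # Each worker decodes its own slice of manifest lines.
    return [json.loads(line) for line in lines]

def load_manifest_parallel(path, workers=16, chunk_size=100_000):
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    entries = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for parsed in pool.map(_parse_chunk, chunks):
            entries.extend(parsed)
    return entries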
-
Actually, there are quite a few bugs in NeMo's RAM management that need to be fixed: the current dataset format is held in memory as Python named tuples, and the 400GB+ memory usage comes from how Python handles that.
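A back-of-the-envelope sketch of why this adds up (illustrative only; real manifest entries carry more fields and extra container overhead, and the total is then multiplied by the number of dataloader worker processes per GPU):

# Rough estimate of per-entry memory when a manifest is kept as Python named tuples.
import sys
from collections import namedtuple

Entry = namedtuple("Entry", ["audio_filepath", "duration", "text"])
e = Entry("/data/wav/utt_000001.wav", 12.3, "some transcript " * 5)  # hypothetical entry

per_entry = (
    sys.getsizeof(e)
    + sys.getsizeof(e.audio_filepath)
    + sys.getsizeof(e.duration)
    + sys.getsizeof(e.text)
)
print(per_entry, "bytes per entry (lower bound)")
print(per_entry * 10_000_000 / 1e9, "GB for ~10M entries, per in-memory copy")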