-
Hi, thank you for building this amazing open source tool! I'm trying to train a NeMo ASR model on a large dataset of about 10K hours (roughly 10 million wav files) ranging from 1 to 60 seconds. Are there any recommended settings for num_shards and buckets_num when generating sharded data for efficient ASR training at this scale? I have been trying (num_shards, buckets_num) = (1024, 32), (2048, 32), etc. However, when training an ASR model on the generated data, data loading took too long or training never actually started. This is the command I used:

python scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path=${path_manifest_file} \
    --target_dir=${path_tar_file} \
    --num_shards=1024 \
    --max_duration=60.0 \
    --min_duration=0.5 \
    --shuffle \
    --shuffle_seed=1 \
    --sort_in_shards \
    --workers=-1 \
    --buckets_num=32

Please note that I am going to train the ASR model on A100 GPUs.
-
The number of tarfiles and buckets depends somewhat on your compute cluster. We generally train on 128-256 GPUs, so we need enough tarfiles to allocate at least one tarfile per GPU, which sets a lower bound on the number of tarfiles * buckets. Generally, we use 8 buckets and around 512-8192 tarfiles depending on the dataset size. @nithinraok can give more accurate numbers.

About data loading, make sure you use the sharded manifest option in the config: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#sharded-manifests

Another speedup would be to convert all audio to flac, which is a new option added recently: https://github.com/NVIDIA/NeMo/blob/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py#L186 @pzelasko can we add this flag to all the examples that show how to call the script, please?

Finally, @pzelasko has integrated the Lhotse data loader into NeMo and it's showing very strong performance. We do need to add it to the NeMo docs, but for the time being, @pzelasko, can you show an example config of what you would have to update to use the Lhotse data loader?
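As a rough illustration of the sizing rule above (the numbers are illustrative, not official NeMo guidance, and the exact semantics of num_shards vs. buckets_num may differ across versions of the conversion script):

# Sketch of the "at least one tarfile per GPU" lower bound.
num_gpus = 128          # a typical multi-node job size mentioned above
num_buckets = 8
# Every GPU should be able to claim at least one tarfile from every bucket,
# so the per-bucket shard count needs to be at least the world size.
min_shards_per_bucket = num_gpus
min_total_tarfiles = num_buckets * min_shards_per_bucket
print(min_total_tarfiles)   # 1024, consistent with the 512-8192 range above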
-
Sure.
The doc is available in the repo trunk, as we have yet to release the new version of NeMo that includes the Lhotse integration. You can find it here: https://github.com/NVIDIA/NeMo/blob/main/docs/source/asr/datasets.rst#lhotse-dataloading

@eesungkim Lhotse+NeMo might indeed be a good fit for your use case, since it scales more easily to larger data. Please see the documentation above for steps to get started. It will "just work" with your existing NeMo data, whether it's tarred or not. You don't need to bucket the data, as Lhotse can bucket it dynamically, but it will work with pre-bucketed data too. This is a very recent feature in NeMo and we have yet to add an end-to-end usage example or tutorial, but in the meantime I'm happy to help and answer any questions you might have.
-
Hi @titu1994 @pzelasko. In my case, I have been running experiments on a single node with 8 GPUs, so I ran into the issue that loading the manifest file was very time-consuming, since the dataset has almost 10 million entries. After changing this part [1, 2] to parallel processing, I was able to start training much faster. Second, regarding my data configuration, I noticed that loading all the data before training took up 400GB of memory, so I increased my server memory accordingly and made sure the trainer ran properly. I will try Lhotse as you suggested and ask more questions if I run into anything. Thanks again!
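For readers hitting the same slow manifest load: this is not the poster's actual patch (the references [1, 2] are in the original thread), just one possible way to parallelise parsing of a large JSON-lines manifest with the standard library:

# Illustrative sketch: parse a JSON-lines manifest across multiple processes.
import json
from concurrent.futures import ProcessPoolExecutor

def _parse_chunk(lines):
    # Each worker decodes its own slice of manifest lines.
    return [json.loads(line) for line in lines]

def load_manifest_parallel(path, workers=16, chunk_size=100_000):
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    entries = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for parsed in pool.map(_parse_chunk, chunks):
            entries.extend(parsed)
    return entries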
-
Actually, there are quite a few bugs in NeMo's RAM management that need to be fixed: the current dataset format is held in memory as Python named tuples, and the 400GB+ memory usage comes from how Python handles that.
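A back-of-the-envelope sketch of why this adds up (illustrative only; real manifest entries carry more fields and extra container overhead, and the total is then multiplied by the number of dataloader worker processes per GPU):

# Rough estimate of per-entry memory when a manifest is kept as Python named tuples.
import sys
from collections import namedtuple

Entry = namedtuple("Entry", ["audio_filepath", "duration", "text"])
e = Entry("/data/wav/utt_000001.wav", 12.3, "some transcript " * 5)  # hypothetical entry

per_entry = (
    sys.getsizeof(e)
    + sys.getsizeof(e.audio_filepath)
    + sys.getsizeof(e.duration)
    + sys.getsizeof(e.text)
)
print(per_entry, "bytes per entry (lower bound)")
print(per_entry * 10_000_000 / 1e9, "GB for ~10M entries, per in-memory copy")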