Merge branch 'main' into export_wordlist_fix
oyilmaz-nvidia authored May 2, 2024
2 parents 7678ff4 + 9e2325d commit 0a04163
Showing 54 changed files with 309 additions and 268 deletions.
53 changes: 24 additions & 29 deletions docs/source/asr/datasets.rst
@@ -261,11 +261,6 @@ Semi Sorted Batching

Sorting samples by duration and splitting them into batches speeds up training, but can degrade the quality of the model. To avoid quality degradation and maintain some randomness in the partitioning process, we add pseudo noise to the sample length when sorting.

.. image:: images/ssb.png
    :align: center
    :alt: semi sorted batching
    :scale: 50%

It may result in a training speedup of more than 40 percent with the same quality. To enable semi sorted batching, add the following lines to the config:

.. code::
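
    # A hedged sketch: the option names use_semi_sorted_batching and
    # randomization_factor are assumptions to verify against the NeMo ASR
    # configuration docs.
    model:
      train_ds:
        use_semi_sorted_batching: true
        randomization_factor: 0.1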
@@ -772,30 +767,30 @@ To enable multimodal dataloading, we provide several configuration options:

Example 3. Combine an ASR (audio-text) dataset with an MT (text-only) dataset so that mini-batches have some examples from both datasets. Provide a custom prompt field for both datasets (to be leveraged by a relevant dataset class):

```yaml
use_multimodal_sampling: true
batch_tokens: 1024
token_equivalent_duration: 0.08 # 0.01 frame shift * 8 subsampling factor
quadratic_factor: 50
num_buckets: 30
use_bucketing: true
input_cfg:
  - type: nemo_tarred
    manifest_filepath: /path/to/manifest__OP_0..512_CL_.json
    tarred_audio_filepath: /path/to/tarred_audio/audio__OP_0..512_CL_.tar
    weight: 0.5
    tags:
      lang: en
      prompt: "Given the following recording, transcribe what the person is saying:"
  - type: txt_pair
    source_path: /path/to/en__OP_0..512_CL_.txt
    target_path: /path/to/pl__OP_0..512_CL_.txt
    source_language: en
    target_language: pl
    weight: 0.5
    tags:
      prompt: "Translate the following text to Polish:"
```
.. code-block:: yaml

    use_multimodal_sampling: true
    batch_tokens: 1024
    token_equivalent_duration: 0.08 # 0.01 frame shift * 8 subsampling factor
    quadratic_factor: 50
    num_buckets: 30
    use_bucketing: true
    input_cfg:
      - type: nemo_tarred
        manifest_filepath: /path/to/manifest__OP_0..512_CL_.json
        tarred_audio_filepath: /path/to/tarred_audio/audio__OP_0..512_CL_.tar
        weight: 0.5
        tags:
          lang: en
          prompt: "Given the following recording, transcribe what the person is saying:"
      - type: txt_pair
        source_path: /path/to/en__OP_0..512_CL_.txt
        target_path: /path/to/pl__OP_0..512_CL_.txt
        source_language: en
        target_language: pl
        weight: 0.5
        tags:
          prompt: "Translate the following text to Polish:"

.. caution:: We strongly recommend using multiple shards for text files as well, so that different nodes and dataloading workers are able to randomize the order of text iteration. Otherwise, multi-GPU training has a high risk of duplicating text examples.

4 changes: 2 additions & 2 deletions docs/source/asr/intro.rst
@@ -156,11 +156,11 @@ Canary-1B is a multi-lingual, multi-task model, supporting automatic speech-to-t

.. raw:: html

<iframe src="https://hf.space/embed/nvidia/canary-1b/+"
<iframe src="https://nvidia-canary-1b.hf.space"
width="100%" class="gradio-asr" allow="microphone *"></iframe>

<script type="text/javascript" language="javascript">
$('.gradio-asr').css('height', $(window).height()+'px');
$('.gradio-asr').css('height', $(window).height() * 0.8+'px');
</script>


4 changes: 3 additions & 1 deletion docs/source/asr/models.rst
@@ -46,12 +46,14 @@ HuggingFace Spaces to try out Parakeet models in your browser:
* `Parakeet-TDT-1.1B <https://huggingface.co/spaces/nvidia/parakeet-tdt-1.1b>`__ space

.. _Conformer_model:

Conformer
---------

.. _Conformer-CTC_model:

Conformer-CTC
~~~~~~~~~~~~~
-------------

Conformer-CTC is a CTC-based variant of the Conformer model introduced in :cite:`asr-models-gulati2020conformer`. Conformer-CTC has a
similar encoder to the original Conformer but uses CTC loss and decoding instead of RNNT/Transducer loss, which makes it a non-autoregressive model.
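
As a quick, hedged illustration (the checkpoint name ``stt_en_conformer_ctc_large`` is one of the published Conformer-CTC models, and the audio path is a placeholder), such a model can be loaded and run through the standard NeMo ASR API:

.. code-block:: python

    import nemo.collections.asr.models as asr_models

    # Download (or load from cache) a pretrained Conformer-CTC checkpoint.
    model = asr_models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

    # Transcribe a 16 kHz mono WAV file; returns a list of transcriptions.
    transcripts = model.transcribe(["/path/to/sample.wav"])
    print(transcripts[0])
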
2 changes: 2 additions & 0 deletions docs/source/asr/speech_intent_slot/api.rst
@@ -15,8 +15,10 @@ Mixins
.. autoclass:: nemo.collections.asr.parts.mixins.ASRModuleMixin
    :show-inheritance:
    :members:
    :no-index:

.. autoclass:: nemo.collections.asr.parts.mixins.ASRBPEMixin
    :show-inheritance:
    :members:
    :no-index:

2 changes: 2 additions & 0 deletions docs/source/asr/ssl/api.rst
@@ -15,10 +15,12 @@ Mixins
.. autoclass:: nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin
    :show-inheritance:
    :members:
    :no-index:

.. autoclass:: nemo.core.classes.mixins.access_mixins.AccessMixin
    :show-inheritance:
    :members:
    :no-index:



4 changes: 2 additions & 2 deletions docs/source/ckpt_converters/dev_guide.rst
@@ -48,7 +48,7 @@ Script Placement and Naming Conventions
Code Template
-------------

The template below addresses the 11 steps in the guideline section. Please also use the `Gemma Huggingface to NeMo converter <https://github.com/NVIDIA/NeMo/tree/main/scripts/checkpoint_converters/convert_gemma_hf_to_nemo.py>`_ as a full example for development.
The template below addresses the 11 steps in the guideline section. Please also use the `Gemma Huggingface to NeMo converter <https://github.com/NVIDIA/NeMo/tree/main/scripts/checkpoint_converters/convert_gemma_hf_to_nemo.py>`__ as a full example for development.

.. code-block:: python
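
    # A hedged sketch of the converter skeleton, not the exact template:
    # the argument names and step comments are assumptions to check against
    # the Gemma converter linked above.
    from argparse import ArgumentParser


    def get_args():
        parser = ArgumentParser()
        parser.add_argument("--input_name_or_path", type=str, required=True,
                            help="Path or name of the community checkpoint.")
        parser.add_argument("--output_path", type=str, required=True,
                            help="Where to write the converted .nemo file.")
        return parser.parse_args()


    def convert(args):
        # 1. Load the community (e.g. Huggingface) model and tokenizer.
        # 2. Build the target NeMo/Megatron config and instantiate the model.
        # 3. Map weights key by key, reshaping QKV and fusing gated-FFN
        #    weights where the Megatron Core layout requires it.
        # 4. Verify outputs match on a sample input, then save the result
        #    with model.save_to(args.output_path).
        raise NotImplementedError("Fill in the mapping for your model.")


    if __name__ == "__main__":
        convert(get_args())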
@@ -210,7 +210,7 @@ A Simple Guide for Model Mapping and Conversion

2. **Common issues when converting: results do not match between the community model and the NeMo model**:

a. Megatron Core uses a special QKV layout, which needs careful handling and reshaping from community models, especially when GQA or MQA is used. Refer to the `Gemma Huggingface to NeMo converter <https://github.com/NVIDIA/NeMo/tree/main/scripts/checkpoint_converters/convert_gemma_hf_to_nemo.py#L144>`_ for guidance.
a. Megatron Core uses a special QKV layout, which needs careful handling and reshaping from community models, especially when GQA or MQA is used. Refer to the `Gemma Huggingface to NeMo converter <https://github.com/NVIDIA/NeMo/tree/main/scripts/checkpoint_converters/convert_gemma_hf_to_nemo.py#L144>`__ for guidance.

b. GLU variant weights can also be a common source of error. In Megatron Core, the regular feedforward projection weights and the gated projection weights are fused together, so careful attention must be paid to the order of the two. Refer to the `Gemma Huggingface to NeMo converter <https://github.com/NVIDIA/NeMo/tree/main/scripts/checkpoint_converters/convert_gemma_hf_to_nemo.py#L135>`_ for more details; a sketch of the fusion order is shown below.
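
The sketch below illustrates the fusion-order pitfall (hedged: the ``gate_proj``/``up_proj`` names follow Huggingface conventions, the shapes are Llama-7B-style placeholders, and the gate-first concatenation order is an assumption to verify against the converter linked above):

.. code-block:: python

    import torch

    hidden, ffn_hidden = 4096, 11008

    # Community checkpoints keep the two feedforward branches separate.
    gate_proj = torch.randn(ffn_hidden, hidden)  # gated branch
    up_proj = torch.randn(ffn_hidden, hidden)    # regular projection

    # Megatron Core expects them fused into a single linear_fc1 weight.
    # Concatenating in the wrong order still loads without errors but
    # silently produces wrong activations.
    linear_fc1 = torch.cat([gate_proj, up_proj], dim=0)  # [2 * ffn_hidden, hidden]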

