
Merge branch 'main' into export_wordlist_fix
oyilmaz-nvidia authored Apr 22, 2024
2 parents 494e4a1 + a452a4f commit 37ead2b
Showing 24 changed files with 1,533 additions and 116 deletions.
101 changes: 74 additions & 27 deletions README.rst
@@ -188,12 +188,15 @@
The NeMo Framework can be installed in a variety of ways, depending on your needs.
* This is recommended for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) domains.
  * When using an NVIDIA PyTorch container as the base, this is the recommended installation method for all domains.

* Docker Containers - Refer to the `Docker containers <#docker-containers>`_ section for installation instructions; a quick pull example follows this list.

* This is recommended for Large Language Models (LLM), Multimodal and Vision domains.
* NeMo LLM & Multimodal Container - `nvcr.io/nvidia/nemo:24.03.framework`
* NeMo Speech Container - `nvcr.io/nvidia/nemo:24.01.speech`

* LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for installation instructions.
* It's highly recommended to start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`
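
For example, to fetch the LLM & Multimodal container listed above (a minimal sketch; it assumes Docker is installed and you can reach the NGC registry):

.. code-block:: bash

    docker pull nvcr.io/nvidia/nemo:24.03.framework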

Conda
~~~~~

@@ -330,23 +333,59 @@
Note that RNNT requires numba to be installed from conda.

.. code-block:: bash

    pip uninstall numba
    conda install -c conda-forge numba
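
After reinstalling, you can confirm which numba build is active (an illustrative check, not part of the original instructions):

.. code-block:: bash

    python -c "import numba; print(numba.__version__)"
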
LLM and Multimodal Dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LLM and Multimodal domains require three additional dependencies:
NVIDIA Apex, NVIDIA Transformer Engine, and NVIDIA Megatron Core.

When working with the `main` branch, these dependencies may require a recent commit.
The most recent working versions of these dependencies are:

.. code-block:: bash

    export apex_commit=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
    export te_commit=bfe21c3d68b0a9951e5716fb520045db53419c5e
    export mcore_commit=fbb375d4b5e88ce52f5f7125053068caff47f93f
    export nv_pytorch_tag=24.02-py3

When using a released version of NeMo,
please refer to the `Software Component Versions <https://docs.nvidia.com/nemo-framework/user-guide/latest/softwarecomponentversions.html>`_
for the correct versions.

If starting with a base NVIDIA PyTorch container, first launch the container:

.. code-block:: bash

    docker run \
      --gpus all \
      -it \
      --rm \
      --shm-size=16g \
      --ulimit memlock=-1 \
      --ulimit stack=67108864 \
      nvcr.io/nvidia/pytorch:$nv_pytorch_tag

Then install the dependencies:

Apex
~~~~
The NeMo LLM and Multimodal domains require NVIDIA Apex to be installed.
Apex comes installed in the NVIDIA PyTorch container, but NeMo LLM and Multimodal
may need Apex to be updated to a newer version.

To install Apex, run

.. code-block:: bash

    git clone https://github.com/NVIDIA/apex.git
    cd apex
    git checkout $apex_commit
    pip install . -v --no-build-isolation --disable-pip-version-check --no-cache-dir --config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam --group_norm"

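As a quick sanity check (an illustrative snippet, not part of the official instructions), verify that Apex's compiled extensions import:

.. code-block:: bash

    python -c "from apex.normalization import FusedLayerNorm; print('Apex OK')"
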
It is highly recommended to use the NVIDIA PyTorch or NeMo container if having issues installing Apex or any other dependencies.
Installing Apex outside of the NVIDIA PyTorch container may raise an error if the CUDA version on your system
does not match the CUDA version with which torch was compiled.
This error can be avoided by commenting out the check here: https://github.com/NVIDIA/apex/blob/master/setup.py#L32
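
If you want to script that edit, a hypothetical one-liner follows; the line number tracks Apex's `setup.py` and may drift, so verify it in your checkout before running:

.. code-block:: bash

    # Hypothetical: comment out the CUDA version check at line 32 (confirm the line number first).
    sed -i '32s/^/# /' setup.py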

cuda-nvprof is needed to install Apex. The version should match the CUDA version that you are using:
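
For example, with a CUDA 12.x toolkit (an illustrative pin; substitute the version that matches your system):

.. code-block:: bash

    conda install -c nvidia cuda-nvprof=12.3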
@@ -366,35 +405,43 @@
With the latest versions of Apex, the `pyproject.toml` file in Apex may need to be deleted in order to install it locally.
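
A minimal sketch of that workaround (it assumes you are inside the cloned ``apex`` directory):

.. code-block:: bash

    rm pyproject.toml  # then re-run the pip install command shown above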

Transformer Engine
~~~~~~~~~~~~~~~~~~
The NeMo LLM and Multimodal domains require NVIDIA Transformer Engine to be installed.
Transformer Engine comes installed in the NVIDIA PyTorch container, but NeMo LLM and Multimodal
may need Transformer Engine to be updated to a newer version.

Transformer Engine enables FP8 training on NVIDIA Hopper GPUs and many performance optimizations for transformer-based model training.
Documentation for installing Transformer Engine can be found `here <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_.

It is highly recommended to use the NVIDIA PyTorch or NeMo container if having issues installing Transformer Engine or any other dependencies.

.. code-block:: bash

    git clone https://github.com/NVIDIA/TransformerEngine.git && \
      cd TransformerEngine && \
      git checkout $te_commit && \
      git submodule init && git submodule update && \
      NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .

Transformer Engine requires PyTorch to be built with at least CUDA 11.8.
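
To check which CUDA version your PyTorch build ships with (a standard PyTorch one-liner, added here for convenience):

.. code-block:: bash

    python -c "import torch; print(torch.version.cuda)"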

Megatron Core
~~~~~~~~~~~~~

The NeMo LLM and Multimodal domains require NVIDIA Megatron Core to be installed.
Megatron Core is a library for scaling large transformer-based models.
NeMo LLM and Multimodal models leverage Megatron Core for model parallelism,
transformer architectures, and optimized PyTorch datasets.

NeMo LLM and Multimodal may need Megatron Core to be updated to a recent version.

.. code-block:: bash

    git clone https://github.com/NVIDIA/Megatron-LM.git && \
      cd Megatron-LM && \
      git checkout $mcore_commit && \
      pip install . && \
      cd megatron/core/datasets && \
      make

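As a quick sanity check (an illustrative snippet, not from the NeMo docs), confirm that Megatron Core imports:

.. code-block:: bash

    python -c "from megatron.core import parallel_state; print('Megatron Core OK')"
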
NeMo Text Processing
~~~~~~~~~~~~~~~~~~~~
@@ -404,7 +451,7 @@
Docker containers
~~~~~~~~~~~~~~~~~
We release NeMo containers alongside NeMo releases. For example, NeMo ``r1.23.0`` comes with container ``nemo:24.01.speech``; you may find more details about released containers on the `releases page <https://github.com/NVIDIA/NeMo/releases>`_.

To use a pre-built container, please run

.. code-block:: bash
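
    # Representative invocation (a sketch modeled on NeMo's documentation; adjust mounts, ports, and the tag to your setup):
    docker run --gpus all -it --rm --shm-size=8g \
      -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 \
      nvcr.io/nvidia/nemo:24.01.speech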
2 changes: 1 addition & 1 deletion docs/source/asr/api.rst
@@ -41,7 +41,7 @@ Model Classes

.. _confidence-ensembles-api:

.. autoclass:: nemo.collections.asr.models.confidence_ensemble.ConfidenceEnsembleModel
:show-inheritance:
:members: transcribe

4 changes: 4 additions & 0 deletions docs/source/asr/datasets.rst
@@ -806,13 +806,17 @@ We recommend to pre-compute the bucket duration bins in order to accelerate the
The following script may be used:

.. code-block:: bash

    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 manifest.json
    Use the following options in your config:
            num_buckets=30
            bucket_duration_bins=[1.78,2.34,2.69,...
    <other diagnostic information about the dataset>

For multi-dataset setups, one may provide multiple manifests and even their weights:

.. code-block:: bash

    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 [[manifest.json,0.7],[other.json,0.3]]
    Use the following options in your config:
            num_buckets=30
2 changes: 1 addition & 1 deletion docs/source/asr/models.rst
@@ -314,7 +314,7 @@
For more details about this model, see the `paper <https://arxiv.org/abs/2306.15
or read our `tutorial <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/Confidence_Ensembles.ipynb>`_.

NeMo supports Confidence-based Ensembles through the
:ref:`nemo.collections.asr.models.confidence_ensemble.ConfidenceEnsembleModel <confidence-ensembles-api>` class.

A typical workflow to create and use the ensemble is as follows:

13 changes: 13 additions & 0 deletions nemo/collections/asr/parts/submodules/ctc_decoding.py
@@ -98,6 +98,10 @@ class AbstractCTCDecoding(ConfidenceMixin):
Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg:
A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
@@ -911,10 +915,15 @@ class CTCDecoding(AbstractCTCDecoding):
exclude_blank:
Bool flag indicating that blank token confidence scores are to be excluded
from the `token_confidence`.
aggregation:
Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg:
A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
@@ -1122,6 +1131,10 @@ class CTCBPEDecoding(AbstractCTCDecoding):
Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg:
A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
