Merge branch 'main' into export_wordlist_fix
JimmyZhang12 authored Apr 30, 2024
2 parents 37ead2b + 43ccc1d commit c743937
Showing 121 changed files with 8,003 additions and 1,757 deletions.
517 changes: 288 additions & 229 deletions .github/workflows/cicd-main.yml

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion Dockerfile
@@ -67,6 +67,7 @@ WORKDIR /workspace/
RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
git checkout 36e9b6bf3d8034b10c9bbd9fc357c2df2bd1515c && \
git cherry-pick -n e69187bc3679ea5841030a165d587bb48b56ee77 && \
pip install .

# Performance optimizations for distributed optimizer: https://github.com/NVIDIA/apex/pull/1771
@@ -133,7 +134,7 @@ RUN pip install flash-attn
# install numba for latest containers
RUN pip install numba>=0.57.1
# install ammo
RUN pip install nvidia-ammo~=0.7.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir
RUN pip install nvidia-ammo~=0.9.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir

# copy nemo source into a scratch image
FROM scratch as nemo-src
2 changes: 1 addition & 1 deletion Jenkinsfile
@@ -97,7 +97,7 @@ pipeline {

stage('AMMO installation') {
steps {
sh 'pip install nvidia-ammo~=0.7.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir'
sh 'pip install nvidia-ammo~=0.9.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir'
}
}

6 changes: 3 additions & 3 deletions README.rst
@@ -46,7 +46,7 @@ Latest News
<details open>
<summary><b>Large Language Models and Multimodal</b></summary>
<details>
<summary><a href="https://cloud.google.com/blog/products/compute/gke-and-nvidia-nemo-framework-to-train-generative-ai-models">Accelerate your generative AI journey with NVIDIA NeMo framework on GKE</a> (2024/03/16) </summary>
<summary><a href="https://cloud.google.com/blog/products/compute/gke-and-nvidia-nemo-framework-to-train-generative-ai-models">Accelerate your generative AI journey with NVIDIA NeMo Framework on GKE</a> (2024/03/16) </summary>

An end-to-end walkthrough to train generative AI models on the Google Kubernetes Engine (GKE) using the NVIDIA NeMo Framework is available at https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke. The walkthrough includes detailed instructions on how to set up a Google Cloud Project and pre-train a GPT model using the NeMo Framework.
<br><br>
@@ -71,7 +71,7 @@ Latest News
<details>
<summary><a href="https://blogs.nvidia.com/blog/nemo-amazon-titan/">NVIDIA now powers training for Amazon Titan Foundation models</a> (2023/11/28) </summary>

NVIDIA NeMo framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs). The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock. The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.
NVIDIA NeMo Framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs). The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock. The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.
<br><br>
</details>

@@ -486,7 +486,7 @@ We welcome community contributions! Please refer to `CONTRIBUTING.md <https://gi
Publications
------------

We provide an ever-growing list of `publications <https://nvidia.github.io/NeMo/publications/>`_ that utilize the NeMo framework.
We provide an ever-growing list of `publications <https://nvidia.github.io/NeMo/publications/>`_ that utilize the NeMo Framework.

If you would like to add your own article to the list, you are welcome to do so via a pull request to this repository's ``gh-pages-src`` branch.
Please refer to the instructions in the `README of that branch <https://github.com/NVIDIA/NeMo/tree/gh-pages-src#readme>`_.
48 changes: 48 additions & 0 deletions docs/source/asr/datasets.rst
@@ -823,6 +823,54 @@ For multi-dataset setups, one may provide multiple manifests and even their weig
bucket_duration_bins=[1.91,3.02,3.56,...
<other diagnostic information about the dataset>
Seeds and randomness
~~~~~~~~~~~~~~~~~~~~

The Lhotse dataloading configuration has two parameters controlling randomness: ``seed`` and ``shard_seed``.
Each can be set either to a fixed number or to one of two string options, ``"randomized"`` and ``"trng"``.
Their roles are:

* ``seed`` is the base random seed, and is one of several factors used to initialize various RNGs participating in dataloading.
* ``shard_seed`` controls the shard randomization strategy in distributed data parallel setups when using sharded tarred datasets.

Below are the typical configurations, with an explanation of the expected outcome.

Case 1 (default): ``seed=<int>`` and ``shard_seed="trng"``:

* The ``trng`` setting discards ``seed`` and causes the actual random seed to be drawn using the OS's true RNG. Each node/GPU/dataloading worker draws its own unique random seed when it first needs it.
* Each node/GPU/dataloading worker yields data in a different order (no mini-batch duplication).
* On each training script run, the order of dataloader examples is **different**.
* Since the random seed is unpredictable, the exact dataloading order is not replicable.

Case 2: ``seed=<int>`` and ``shard_seed="randomized"``:

* The ``randomized`` setting uses ``seed`` along with the DDP ``rank`` and dataloading ``worker_id`` to set a unique but deterministic random seed in each dataloading process across all GPUs.
* Each node/GPU/dataloading worker yields data in a different order (no mini-batch duplication).
* On each training script run, the order of dataloader examples is **identical** as long as ``seed`` is the same.
* This setup guarantees 100% dataloading reproducibility.
* Resuming training without changing the ``seed`` value will cause the model to train on data it has already seen. For large data setups, not managing the ``seed`` may cause the model to never be trained on a majority of the data. This is why this mode is not the default.
* If you're combining DDP with model parallelism techniques (Tensor Parallel, Pipeline Parallel, etc.), you need to use ``shard_seed="randomized"``. Using ``"trng"`` will cause different model parallel ranks to desynchronize and cause a deadlock.
* Generally the seed can be managed by the user by providing a different value each time the training script is launched. For example, for most models the option to override would be ``model.train_ds.seed=<value>``. If you're launching multiple tasks queued one after another on a grid system, you can generate a different random seed for each task; e.g., on most Unix systems ``RSEED=$(od -An -N4 -tu4 < /dev/urandom | tr -d ' ')`` would generate a random uint32 number that can be provided as the seed. See the sketch at the end of this section for one way to wire this up.

Other, more exotic configurations:

* With ``shard_seed=<int>``, all dataloading workers will yield the same results. This is only useful for unit testing and maybe debugging.
* With ``seed="trng"``, the base random seed itself will be drawn using a TRNG. It will be different on each GPU training process. This setting is not recommended.
* With ``seed="randomized"``, the base random seed is set to Python's global RNG seed. It might be different on each GPU training process. This setting is not recommended.
Preparing Text-Only Data for Hybrid ASR-TTS Models
--------------------------------------------------
32 changes: 32 additions & 0 deletions docs/source/ckpt_converters/convert_mlm.rst
@@ -0,0 +1,32 @@
Converting from Megatron-LM
===========================

NVIDIA NeMo and NVIDIA Megatron-LM share many underlying technologies. This document provides guidance for migrating your project from Megatron-LM to NVIDIA NeMo.

Converting Checkpoints
----------------------

You can convert your GPT-style model checkpoints trained with Megatron-LM into the NeMo Framework using the provided example script. This script facilitates the conversion of Megatron-LM checkpoints to NeMo-compatible formats.

.. code-block:: bash

    <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
        --checkpoint_folder <path_to_PTL_checkpoints_folder> \
        --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \
        --nemo_file_path <path_to_output_nemo_file> \
        --model_type <megatron_model_type> \
        --tensor_model_parallel_size <tensor_model_parallel_size> \
        --pipeline_model_parallel_size <pipeline_model_parallel_size> \
        --gpus_per_node <gpus_per_node>

Resuming Training
-----------------

To resume training from a converted Megatron-LM checkpoint, it is crucial to correctly set up the training parameters to match the previous learning rate schedule. Use the following setting for the ``trainer.max_steps`` parameter in your NeMo training configuration:

.. code-block:: none

    trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)

This configuration ensures that the learning rate scheduler in NeMo continues from where it left off in Megatron-LM, using the ``lr-warmup-fraction`` and ``lr-decay-iters`` arguments from the original Megatron-LM training setup.
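For example, with purely illustrative Megatron-LM values of ``lr-warmup-fraction=0.01`` and ``lr-decay-iters=300000`` (not taken from any particular setup), the formula works out as follows:

.. code-block:: bash

    # Illustrative arithmetic only:
    # round(0.01 * 300000 + 300000) = round(3000 + 300000) = 303000
    python -c "print(round(0.01 * 300000 + 300000))"   # prints 303000
    # corresponding override: trainer.max_steps=303000
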

22 changes: 22 additions & 0 deletions docs/source/ckpt_converters/intro.rst
@@ -0,0 +1,22 @@
Community Checkpoint Converter
==============================

We provide easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and deployment. For detailed instructions and guidelines, please refer to our documentation.

We offer comprehensive guides to assist both end users and developers:

- **User Guide**: Detailed steps on how to convert community model checkpoints for further training or deployment within NeMo. For more information, please see our :doc:`user_guide`.

- **Developer Guide**: Instructions for developers on how to implement converters for community model checkpoints, allowing for broader compatibility and integration within the NeMo ecosystem. For development details, refer to our :doc:`dev_guide`.

- **Megatron-LM Checkpoint Conversion**: NVIDIA NeMo and NVIDIA Megatron-LM share several foundational technologies. You can convert your GPT-style model checkpoints trained with Megatron-LM into the NeMo Framework using our scripts, see our :doc:`convert_mlm`.

Access the user and developer guides directly through the links below:

.. toctree::
:maxdepth: 1
:caption: Conversion Guides

user_guide
dev_guide
convert_mlm
70 changes: 70 additions & 0 deletions docs/source/collections.rst
@@ -0,0 +1,70 @@
================
NeMo Collections
================

Documentation for the individual collections

.. toctree::
:maxdepth: 1
:caption: Large Language Models (LLMs)
:name: Large Language Models
:titlesonly:

nlp/nemo_megatron/intro
nlp/models
nlp/machine_translation/machine_translation
nlp/megatron_onnx_export
nlp/quantization
nlp/api


.. toctree::
:maxdepth: 1
:caption: Speech AI
:name: Speech AI
:titlesonly:

asr/intro
asr/speech_classification/intro
asr/speaker_recognition/intro
asr/speaker_diarization/intro
asr/ssl/intro
asr/speech_intent_slot/intro


.. toctree::
:maxdepth: 1
:caption: Multimodal Models (MMs)
:name: Multimodal
:titlesonly:

multimodal/mllm/intro
multimodal/vlm/intro
multimodal/text2img/intro
multimodal/nerf/intro
multimodal/api


.. toctree::
:maxdepth: 1
:caption: Text To Speech (TTS)
:name: Text To Speech
:titlesonly:

tts/intro

.. toctree::
:maxdepth: 1
:caption: Vision (CV)
:name: vision
:titlesonly:

vision/intro

.. toctree::
:maxdepth: 1
:caption: Common
:name: Common
:titlesonly:

common/intro
4 changes: 2 additions & 2 deletions docs/source/core/core_index.rst
@@ -1,5 +1,5 @@
=========
NeMo Core
NeMo APIs
=========

You can learn more about the underlying principles of the NeMo codebase in this section.
@@ -30,7 +30,7 @@ Alternatively, you can jump straight to the documentation for the individual col

* :doc:`Automatic Speech Recognition (ASR) <../asr/intro>`

* :doc:`Multimodal (MM) Models <../multimodal/mllm/intro>`
* :doc:`Multimodal Models (MMs) <../multimodal/mllm/intro>`

* :doc:`Text-to-Speech (TTS) <../tts/intro>`

48 changes: 48 additions & 0 deletions docs/source/features/memory_optimizations.rst
@@ -0,0 +1,48 @@
Memory Optimizations
====================

Parallelism
-----------
Refer to :doc:`Parallelism <./parallelism>`.

Flash Attention
---------------

Overview
^^^^^^^^

Flash Attention is a method designed to enhance the efficiency of Transformer models, which are widely utilized in applications such as Natural Language Processing (NLP). Traditional Transformers are slow and consume a lot of memory, especially with long sequences, due to the quadratic time and memory complexity of self-attention. FlashAttention is an IO-aware exact attention algorithm that leverages tiling to minimize the number of memory reads/writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM. This approach is designed to be more efficient in terms of IO complexity compared to standard attention mechanisms.

Turn Flash Attention On and Off
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the NeMo Framework, Flash Attention is supported through the Transformer Engine with the inclusion of Flash Attention 2. By default, Flash Attention is enabled, but the Transformer Engine may switch to a different kernel if the tensor dimensions are not optimal for Flash Attention. Users can completely disable Flash Attention by setting the environment variable ``NVTE_FLASH_ATTN=0``.

For more details on the supported Dot Attention backend, please refer to the Transformer Engine source code available at `Transformer Engine's Attention Mechanism <https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py>`_.
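As a minimal sketch, assuming a standard Hydra-style NeMo launch, the environment variable can be set for a single run. The entry point below follows the ``<NeMo_ROOT_FOLDER>`` placeholder convention used elsewhere in these docs and is illustrative.

.. code-block:: bash

    # Sketch: disable Transformer Engine's Flash Attention kernels for this run only.
    # The entry point and config name are examples; substitute your own training script.
    NVTE_FLASH_ATTN=0 python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
        --config-path=conf --config-name=megatron_gpt_config
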

.. bibliography:: ./nlp_all.bib
    :style: plain
    :labelprefix: nlp-megatron
    :keyprefix: nlp-megatron-

Activation Recomputation
------------------------

Overview
^^^^^^^^

Full Activation Recomputation
"""""""""""""""""""""""""""""
This method recalculates all the intermediate activations during the backward pass of a model's training, instead of storing them during the forward pass. This technique maximizes memory efficiency at the cost of computational overhead, as each activation is recomputed when needed.

Partial Activation Recomputation
""""""""""""""""""""""""""""""""
This method recomputes only a subset of layers during the backward phase. It is a trade-off between the full recomputation and no recomputation, balancing memory savings with computational efficiency.

Selective Activation Recomputation
""""""""""""""""""""""""""""""""""
This method significantly reduces the memory footprint of activations via smart activation checkpointing. It selectively stores only crucial activations and recomputes the others as needed, which is particularly useful in large models for minimizing memory usage while controlling the computational cost.

Refer to "Reducing Activation Recomputation in Large Transformer Models" for more details: https://arxiv.org/abs/2205.05198

.. bibliography:: ./nlp_all.bib
    :style: plain
    :labelprefix: nlp-megatron
    :keyprefix: nlp-megatron-
6 changes: 6 additions & 0 deletions docs/source/features/mixed_precision.rst
@@ -0,0 +1,6 @@
.. _mix_precision:

Mixed Precision Training
------------------------

Mixed precision training significantly enhances computational efficiency by conducting operations in low-precision formats (half precision and FP8), while selectively keeping a minimal amount of data in single precision to preserve critical information throughout key areas of the network. NeMo now supports FP16, BF16, and FP8 (via Transformer Engine) across most models. Further details will be provided shortly.
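As a minimal sketch, assuming a PyTorch Lightning-style trainer config, precision is typically selected with a single override; the values shown are illustrative and should be checked against your NeMo version.

.. code-block:: bash

    # Illustrative precision overrides (pick one per run); <your_training_script>.py is a placeholder.
    python <your_training_script>.py trainer.precision=bf16    # BF16 mixed precision
    python <your_training_script>.py trainer.precision=16      # FP16 mixed precision
    # FP8 additionally requires Transformer Engine support; Megatron-style configs
    # commonly expose it as model.fp8=True (assumed name, verify for your model).
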
@@ -3,13 +3,13 @@
Parallelisms
------------

NeMo Megatron supports 5 types of parallelisms (which can be mixed together arbitraritly):
NeMo Megatron supports 5 types of parallelisms (which can be mixed together arbitrarily):

Distributed Data parallelism
Distributed Data Parallelism
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Distributed Data parallelism (DDP) creates idential copies of the model across multiple GPUs.
Distributed Data Parallelism (DDP) creates identical copies of the model across multiple GPUs.

.. image:: images/ddp.gif
.. image:: ../nlp/nemo_megatron/images/ddp.gif
:align: center
:width: 800px
:alt: Distributed Data Parallel
@@ -20,7 +20,7 @@ Tensor Parallelism
With Tensor Parallelism (TP) a tensor is split into non-overlapping pieces and
different parts are distributed and processed on separate GPUs.

.. image:: images/tp.gif
.. image:: ../nlp/nemo_megatron/images/tp.gif
:align: center
:width: 800px
:alt: Tensor Parallel
@@ -29,15 +29,15 @@ Pipeline Parallelism
^^^^^^^^^^^^^^^^^^^^
With Pipeline Parallelism (PP) consecutive layer chunks are assigned to different GPUs.

.. image:: images/pp.gif
.. image:: ../nlp/nemo_megatron/images/pp.gif
:align: center
:width: 800px
:alt: Pipeline Parallel

Sequence Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/sp.gif
.. image:: ../nlp/nemo_megatron/images/sp.gif
:align: center
:width: 800px
:alt: Sequence Parallel
@@ -47,7 +47,7 @@ Expert Parallelism
Expert Parallelism (EP) distributes experts across GPUs.


.. image:: images/ep.png
.. image:: ../nlp/nemo_megatron/images/ep.png
:align: center
:width: 800px
:alt: Expert Parallelism
@@ -57,7 +57,7 @@ Parallelism nomenclature

When reading and modifying NeMo Megatron code you will encounter the following terms.

.. image:: images/pnom.gif
.. image:: ../nlp/nemo_megatron/images/pnom.gif
:align: center
:width: 800px
:alt: Parallelism nomenclature
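A hedged example of combining these parallelisms in a NeMo Megatron-style launch is sketched below; the option names (``tensor_model_parallel_size`` and friends) are commonly used ones but should be verified against your model's config, and the sizes are purely illustrative.

.. code-block:: bash

    # Sketch: 8-way tensor parallelism x 4-way pipeline parallelism with sequence
    # parallelism enabled; data parallelism fills the remaining GPUs automatically.
    # <your_training_script>.py is a placeholder for your NeMo training entry point.
    python <your_training_script>.py \
        model.tensor_model_parallel_size=8 \
        model.pipeline_model_parallel_size=4 \
        model.sequence_parallel=True
    # MoE models expose an analogous expert-parallel size option (name may vary by version).
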
@@ -1,7 +1,9 @@
Throughput Optimizations
========================

Sequence Packing for SFT/PEFT
-----------------------------


Overview
^^^^^^^^

@@ -133,6 +135,10 @@ To train with packed sequences, you need to change four items in the SFT/PEFT co
Now you are all set to finetune your model with a much improved throughput!

Communication Overlap
---------------------
NeMo leverages Megatron-Core's optimizations to enhance bandwidth utilization and effectively overlap computation with communication. Additional details will be provided soon.
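As a sketch of what such overlap settings look like in practice, the overrides below use option names commonly seen in NeMo configs when the distributed optimizer is enabled; treat them as assumptions to verify against your config rather than a definitive list.

.. code-block:: bash

    # Illustrative only -- availability depends on your NeMo/Megatron-Core version and config.
    # <your_training_script>.py is a placeholder for your NeMo training entry point.
    python <your_training_script>.py \
        model.optim.overlap_grad_sync=True \
        model.optim.overlap_param_sync=True \
        model.ub_tp_comm_overlap=True   # tensor-parallel communication/GEMM overlap
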


.. rubric:: Footnotes

