Merge branch 'main' into fix_weight_only_ckpt_save

JimmyZhang12 · Apr 21, 2024 · 0be76d8
2 parents 32180d3 + 9bafd37
commit 0be76d8
Showing 16 changed files with 335 additions and 90 deletions.
diff --git a/README.rst b/README.rst
@@ -41,17 +41,43 @@
 Latest News
 -----------
 
-- 2023/12/06 `New NVIDIA NeMo Framework Features and NVIDIA H200 <https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility/>`_
+.. raw:: html
 
-.. image:: https://github.com/sbhavani/TransformerEngine/blob/main/docs/examples/H200-NeMo-performance.png
-  :target: https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility
-  :alt: H200-NeMo-performance
-  :width: 600
+  <details open>
+    <summary><b>Large Language Models and Multimodal</b></summary>
+        <details>
+          <summary><a href="https://cloud.google.com/blog/products/compute/gke-and-nvidia-nemo-framework-to-train-generative-ai-models">Accelerate your generative AI journey with NVIDIA NeMo framework on GKE</a> (2024/03/16) </summary>
 
-NeMo Framework has been updated with state-of-the-art features,
-such as FSDP, Mixture-of-Experts, and RLHF with TensorRT-LLM to provide speedups up to 4.2x for Llama-2 pre-training on H200.
-**All of these features will be available in an upcoming release.**
+          An end-to-end walkthrough to train generative AI models on the Google Kubernetes Engine (GKE) using the NVIDIA NeMo Framework is available at https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke. The walkthrough includes detailed instructions on how to set up a Google Cloud Project and pre-train a GPT model using the NeMo Framework.
+          <br><br>
+        </details>
 
+      <details>
+        <summary><a href="https://blogs.nvidia.com/blog/bria-builds-responsible-generative-ai-using-nemo-picasso/">Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso</a> (2024/03/06) </summary>
+
+        Bria, a Tel Aviv startup at the forefront of visual generative AI for enterprises now leverages the NVIDIA NeMo Framework. The Bria.ai platform uses reference implementations from the NeMo Multimodal collection, trained on NVIDIA Tensor Core GPUs, to enable high-throughput and low-latency image generation. Bria has also adopted NVIDIA Picasso, a foundry for visual generative AI models, to run inference.
+        <br><br>
+    </details>
+
+    <details>
+      <summary><a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility/">New NVIDIA NeMo Framework Features and NVIDIA H200</a> (2023/12/06) </summary>
+
+      NVIDIA NeMo Framework now includes several optimizations and enhancements, including: 1) Fully Sharded Data Parallelism (FSDP) to improve the efficiency of training large-scale AI models, 2) Mix of Experts (MoE)-based LLM architectures with expert parallelism for efficient LLM training at scale, 3) Reinforcement Learning from Human Feedback (RLHF) with TensorRT-LLM for inference stage acceleration, and 4) up to 4.2x speedups for Llama 2 pre-training on NVIDIA H200 Tensor Core GPUs.
+      <br><br>
+      <a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility"><img src="https://github.com/sbhavani/TransformerEngine/blob/main/docs/examples/H200-NeMo-performance.png" alt="H200-NeMo-performance" style="width: 600px;"></a>
+      <br><br>
+    </details>
+
+    <details>
+      <summary><a href="https://blogs.nvidia.com/blog/nemo-amazon-titan/">NVIDIA now powers training for Amazon Titan Foundation models</a> (2023/11/28) </summary>
+
+      NVIDIA NeMo framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs). The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock. The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.
+      <br><br>
+    </details>
+
+  </details>
+
+
 
 
 Introduction

diff --git a/nemo/collections/asr/parts/submodules/ctc_decoding.py b/nemo/collections/asr/parts/submodules/ctc_decoding.py
@@ -98,6 +98,10 @@ class AbstractCTCDecoding(ConfidenceMixin):
                     Which aggregation type to use for collapsing per-token confidence into per-word confidence.
                     Valid options are `mean`, `min`, `max`, `prod`.
 
+                tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
+                    attached to the regular frame confidence,
+                    making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
+
                 method_cfg:
                     A dict-like object which contains the method name and settings to compute per-frame
                     confidence scores.
@@ -911,10 +915,15 @@ class CTCDecoding(AbstractCTCDecoding):
                 exclude_blank:
                     Bool flag indicating that blank token confidence scores are to be excluded
                     from the `token_confidence`.
+
                 aggregation:
                     Which aggregation type to use for collapsing per-token confidence into per-word confidence.
                     Valid options are `mean`, `min`, `max`, `prod`.
 
+                tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
+                    attached to the regular frame confidence,
+                    making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
+
                 method_cfg:
                     A dict-like object which contains the method name and settings to compute per-frame
                     confidence scores.
@@ -1122,6 +1131,10 @@ class CTCBPEDecoding(AbstractCTCDecoding):
                     Which aggregation type to use for collapsing per-token confidence into per-word confidence.
                     Valid options are `mean`, `min`, `max`, `prod`.
 
+                tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
+                    attached to the regular frame confidence,
+                    making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
+
                 method_cfg:
                     A dict-like object which contains the method name and settings to compute per-frame
                     confidence scores.
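
The `tdt_include_duration` docstring added above says each TDT frame confidence element becomes a pair (`prediction_confidence`, `duration_confidence`). The following is a minimal sketch, assuming plain Python floats, of how such a pair might be collapsed into a single score with the aggregation options the docstring lists (`mean`, `min`, `max`, `prod`). The names `AGGREGATIONS` and `collapse_pair` are hypothetical and are not NeMo APIs.

```python
import math
from statistics import mean

# Hypothetical lookup table; the keys mirror the documented aggregation options.
AGGREGATIONS = {
    "mean": mean,
    "min": min,
    "max": max,
    "prod": math.prod,
}


def collapse_pair(pair, aggregation="mean"):
    """Collapse a (prediction_confidence, duration_confidence) pair into one float."""
    prediction_confidence, duration_confidence = pair
    return AGGREGATIONS[aggregation]([prediction_confidence, duration_confidence])


# Example pair for one frame: confident token prediction, less confident duration.
frame_element = (0.9, 0.5)
worst_case = collapse_pair(frame_element, "min")
```

Choosing `min` treats a frame as only as reliable as its weaker component, while `prod` treats the two scores as independent probabilities; which one suits a given model is an empirical question.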

diff --git a/nemo/collections/asr/parts/submodules/rnnt_decoding.py b/nemo/collections/asr/parts/submodules/rnnt_decoding.py
@@ -96,6 +96,9 @@ class AbstractRNNTDecoding(ConfidenceMixin):
                     from the `token_confidence`.
                 aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.
                     Valid options are `mean`, `min`, `max`, `prod`.
+                tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
+                    attached to the regular frame confidence,
+                    making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
                 method_cfg: A dict-like object which contains the method name and settings to compute per-frame
                     confidence scores.
 
@@ -209,7 +212,8 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
         self.compute_timestamps = self.cfg.get('compute_timestamps', None)
         self.word_seperator = self.cfg.get('word_seperator', ' ')
 
-        if self.durations is not None and self.durations != []:  # this means it's a TDT model.
+        self._is_tdt = self.durations is not None and self.durations != []  # this means it's a TDT model.
+        if self._is_tdt:
             if blank_id == 0:
                 raise ValueError("blank_id must equal len(non_blank_vocabs) for TDT models")
             if self.big_blank_durations is not None and self.big_blank_durations != []:
@@ -254,6 +258,12 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
         # initialize confidence-related fields
         self._init_confidence(self.cfg.get('confidence_cfg', None))
 
+        if self._is_tdt:
+            if self.preserve_frame_confidence is True and self.preserve_alignments is False:
+                raise ValueError(
+                    "If `preserve_frame_confidence` flag is set, then `preserve_alignments` flag must also be set."
+                )
+
         # Confidence estimation is not implemented for these strategies
         if (
             not self.preserve_frame_confidence
@@ -264,7 +274,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
 
         if self.cfg.strategy == 'greedy':
             if self.big_blank_durations is None or self.big_blank_durations == []:
-                if self.durations is None or self.durations == []:
+                if not self._is_tdt:
                     self.decoding = rnnt_greedy_decoding.GreedyRNNTInfer(
                         decoder_model=decoder,
                         joint_model=joint,
@@ -289,6 +299,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
                         ),
                         preserve_alignments=self.preserve_alignments,
                         preserve_frame_confidence=self.preserve_frame_confidence,
+                        include_duration_confidence=self.tdt_include_duration_confidence,
                         confidence_method_cfg=self.confidence_method_cfg,
                     )
             else:
@@ -307,7 +318,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
 
         elif self.cfg.strategy == 'greedy_batch':
             if self.big_blank_durations is None or self.big_blank_durations == []:
-                if self.durations is None or self.durations == []:
+                if not self._is_tdt:
                     self.decoding = rnnt_greedy_decoding.GreedyBatchedRNNTInfer(
                         decoder_model=decoder,
                         joint_model=joint,
@@ -334,6 +345,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
                         ),
                         preserve_alignments=self.preserve_alignments,
                         preserve_frame_confidence=self.preserve_frame_confidence,
+                        include_duration_confidence=self.tdt_include_duration_confidence,
                         confidence_method_cfg=self.confidence_method_cfg,
                         use_cuda_graph_decoder=self.cfg.greedy.get('use_cuda_graph_decoder', False),
                     )
@@ -530,7 +542,7 @@ def decode_hypothesis(self, hypotheses_list: List[Hypothesis]) -> List[Union[Hyp
             if self.big_blank_durations is not None and self.big_blank_durations != []:  # multi-blank RNNT
                 num_extra_outputs = len(self.big_blank_durations)
                 prediction = [p for p in prediction if p < self.blank_id - num_extra_outputs]
-            elif self.durations is not None and self.durations != []:  # TDT model.
+            elif self._is_tdt:  # TDT model.
                 prediction = [p for p in prediction if p < self.blank_id]
             else:  # standard RNN-T
                 prediction = [p for p in prediction if p != self.blank_id]
@@ -569,28 +581,69 @@ def compute_confidence(self, hypotheses_list: List[Hypothesis]) -> List[Hypothes
         Returns:
             A list of hypotheses with high-level confidence scores.
         """
-        if self.exclude_blank_from_confidence:
-            for hyp in hypotheses_list:
-                hyp.token_confidence = hyp.non_blank_frame_confidence
-        else:
+        if self._is_tdt:
+            # if self.tdt_include_duration_confidence is True then frame_confidence elements consist of two numbers
+            maybe_pre_aggregate = (
+                (lambda x: self._aggregate_confidence(x)) if self.tdt_include_duration_confidence else (lambda x: x)
+            )
             for hyp in hypotheses_list:
-                offset = 0
                 token_confidence = []
-                if len(hyp.timestep) > 0:
-                    for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):
-                        if ts != te:
-                            # <blank> tokens are considered to belong to the last non-blank token, if any.
-                            token_confidence.append(
-                                self._aggregate_confidence(
-                                    [hyp.frame_confidence[ts][offset]]
-                                    + [fc[0] for fc in hyp.frame_confidence[ts + 1 : te]]
+                # trying to recover frame_confidence according to alignments
+                subsequent_blank_confidence = []
+                # going backwards since <blank> tokens are considered belonging to the last non-blank token.
+                for fc, fa in zip(hyp.frame_confidence[::-1], hyp.alignments[::-1]):
+                    # there is only one score per frame most of the time
+                    if len(fa) > 1:
+                        for i, a in reversed(list(enumerate(fa))):
+                            if a[-1] == self.blank_id:
+                                if not self.exclude_blank_from_confidence:
+                                    subsequent_blank_confidence.append(maybe_pre_aggregate(fc[i]))
+                            elif not subsequent_blank_confidence:
+                                token_confidence.append(maybe_pre_aggregate(fc[i]))
+                            else:
+                                token_confidence.append(
+                                    self._aggregate_confidence(
+                                        [maybe_pre_aggregate(fc[i])] + subsequent_blank_confidence
+                                    )
                                 )
-                            )
-                            offset = 0
+                                subsequent_blank_confidence = []
+                    else:
+                        i, a = 0, fa[0]
+                        if a[-1] == self.blank_id:
+                            if not self.exclude_blank_from_confidence:
+                                subsequent_blank_confidence.append(maybe_pre_aggregate(fc[i]))
+                        elif not subsequent_blank_confidence:
+                            token_confidence.append(maybe_pre_aggregate(fc[i]))
                         else:
-                            token_confidence.append(hyp.frame_confidence[ts][offset])
-                            offset += 1
+                            token_confidence.append(
+                                self._aggregate_confidence([maybe_pre_aggregate(fc[i])] + subsequent_blank_confidence)
+                            )
+                            subsequent_blank_confidence = []
+                token_confidence = token_confidence[::-1]
                 hyp.token_confidence = token_confidence
+        else:
+            if self.exclude_blank_from_confidence:
+                for hyp in hypotheses_list:
+                    hyp.token_confidence = hyp.non_blank_frame_confidence
+            else:
+                for hyp in hypotheses_list:
+                    offset = 0
+                    token_confidence = []
+                    if len(hyp.timestep) > 0:
+                        for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):
+                            if ts != te:
+                                # <blank> tokens are considered to belong to the last non-blank token, if any.
+                                token_confidence.append(
+                                    self._aggregate_confidence(
+                                        [hyp.frame_confidence[ts][offset]]
+                                        + [fc[0] for fc in hyp.frame_confidence[ts + 1 : te]]
+                                    )
+                                )
+                                offset = 0
+                            else:
+                                token_confidence.append(hyp.frame_confidence[ts][offset])
+                                offset += 1
+                    hyp.token_confidence = token_confidence
         if self.preserve_word_confidence:
             for hyp in hypotheses_list:
                 hyp.word_confidence = self._aggregate_token_confidence(hyp)
@@ -1010,6 +1063,9 @@ class RNNTDecoding(AbstractRNNTDecoding):
                     from the `token_confidence`.
                 aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.
                     Valid options are `mean`, `min`, `max`, `prod`.
+                tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
+                    attached to the regular frame confidence,
+                    making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
                 method_cfg: A dict-like object which contains the method name and settings to compute per-frame
                     confidence scores.
 
@@ -1276,6 +1332,9 @@ class RNNTBPEDecoding(AbstractRNNTDecoding):
                     from the `token_confidence`.
                 aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.
                     Valid options are `mean`, `min`, `max`, `prod`.
+                tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
+                    attached to the regular frame confidence,
+                    making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
                 method_cfg: A dict-like object which contains the method name and settings to compute per-frame
                     confidence scores.
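
The TDT branch added to `compute_confidence` walks frames backwards so that `<blank>` scores are folded into the preceding non-blank token. Below is a simplified, standalone sketch of that idea, not the NeMo implementation. It assumes each alignment entry is already a bare integer label (the diff reads the label from `a[-1]`), that each confidence entry is a single float (that is, any duration pair has already been pre-aggregated), and it merges the diff's two branches into one loop. The function name `tdt_token_confidence` and its parameters are hypothetical.

```python
from statistics import mean


def tdt_token_confidence(frame_confidence, alignments, blank_id,
                         exclude_blank=False, aggregate=mean):
    """Derive per-token confidence from per-frame scores.

    frame_confidence: list (one item per frame) of lists of floats.
    alignments: list (one item per frame) of lists of int labels,
        parallel to frame_confidence.
    """
    token_confidence = []
    pending_blanks = []  # blank scores seen since the last non-blank token
    # Walk backwards: blanks are attributed to the last non-blank token before them.
    for fc, fa in zip(reversed(frame_confidence), reversed(alignments)):
        for score, label in zip(reversed(fc), reversed(fa)):
            if label == blank_id:
                if not exclude_blank:
                    pending_blanks.append(score)
            elif not pending_blanks:
                token_confidence.append(score)
            else:
                token_confidence.append(aggregate([score] + pending_blanks))
                pending_blanks = []
    # Tokens were collected last-to-first, so restore the original order.
    return token_confidence[::-1]


# Four frames; label 10 is blank. Frame 2 emits token 7 followed by a blank.
labels = [[5], [10], [7, 10], [10]]
scores = [[0.8], [0.6], [0.9, 0.5], [0.7]]
per_token = tdt_token_confidence(scores, labels, blank_id=10, aggregate=min)
```

In this sketch, blanks that occur before the first non-blank token have no token to attach to and are dropped, which matches the behavior of the backward loop in the diff.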