Commit

Merge branch 'main' into fix_weight_only_ckpt_save
pablo-garay authored Apr 21, 2024
2 parents 32180d3 + 9bafd37 commit 0be76d8
Showing 16 changed files with 335 additions and 90 deletions.
42 changes: 34 additions & 8 deletions README.rst
@@ -41,17 +41,43 @@
Latest News
-----------

- 2023/12/06 `New NVIDIA NeMo Framework Features and NVIDIA H200 <https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility/>`_
.. raw:: html

.. image:: https://github.com/sbhavani/TransformerEngine/blob/main/docs/examples/H200-NeMo-performance.png
:target: https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility
:alt: H200-NeMo-performance
:width: 600
<details open>
<summary><b>Large Language Models and Multimodal</b></summary>
<details>
<summary><a href="https://cloud.google.com/blog/products/compute/gke-and-nvidia-nemo-framework-to-train-generative-ai-models">Accelerate your generative AI journey with NVIDIA NeMo framework on GKE</a> (2024/03/16) </summary>

NeMo Framework has been updated with state-of-the-art features,
such as FSDP, Mixture-of-Experts, and RLHF with TensorRT-LLM to provide speedups up to 4.2x for Llama-2 pre-training on H200.
**All of these features will be available in an upcoming release.**
An end-to-end walkthrough to train generative AI models on the Google Kubernetes Engine (GKE) using the NVIDIA NeMo Framework is available at https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke. The walkthrough includes detailed instructions on how to set up a Google Cloud Project and pre-train a GPT model using the NeMo Framework.
<br><br>
</details>

<details>
<summary><a href="https://blogs.nvidia.com/blog/bria-builds-responsible-generative-ai-using-nemo-picasso/">Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso</a> (2024/03/06) </summary>

Bria, a Tel Aviv startup at the forefront of visual generative AI for enterprises, now leverages the NVIDIA NeMo Framework. The Bria.ai platform uses reference implementations from the NeMo Multimodal collection, trained on NVIDIA Tensor Core GPUs, to enable high-throughput and low-latency image generation. Bria has also adopted NVIDIA Picasso, a foundry for visual generative AI models, to run inference.
<br><br>
</details>

<details>
<summary><a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility/">New NVIDIA NeMo Framework Features and NVIDIA H200</a> (2023/12/06) </summary>

NVIDIA NeMo Framework now includes several optimizations and enhancements, including: 1) Fully Sharded Data Parallelism (FSDP) to improve the efficiency of training large-scale AI models, 2) Mixture of Experts (MoE)-based LLM architectures with expert parallelism for efficient LLM training at scale, 3) Reinforcement Learning from Human Feedback (RLHF) with TensorRT-LLM for inference stage acceleration, and 4) up to 4.2x speedups for Llama 2 pre-training on NVIDIA H200 Tensor Core GPUs.
<br><br>
<a href="https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility"><img src="https://github.com/sbhavani/TransformerEngine/blob/main/docs/examples/H200-NeMo-performance.png" alt="H200-NeMo-performance" style="width: 600px;"></a>
<br><br>
</details>

<details>
<summary><a href="https://blogs.nvidia.com/blog/nemo-amazon-titan/">NVIDIA now powers training for Amazon Titan Foundation models</a> (2023/11/28) </summary>

NVIDIA NeMo framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs). The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock. The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.
<br><br>
</details>

</details>




Introduction
13 changes: 13 additions & 0 deletions nemo/collections/asr/parts/submodules/ctc_decoding.py
@@ -98,6 +98,10 @@ class AbstractCTCDecoding(ConfidenceMixin):
Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg:
A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
@@ -911,10 +915,15 @@ class CTCDecoding(AbstractCTCDecoding):
exclude_blank:
Bool flag indicating that blank token confidence scores are to be excluded
from the `token_confidence`.
aggregation:
Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg:
A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
@@ -1122,6 +1131,10 @@ class CTCBPEDecoding(AbstractCTCDecoding):
Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg:
A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
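
The `tdt_include_duration` docstring above says that each TDT frame confidence element becomes a pair (`prediction_confidence`, `duration_confidence`). A minimal sketch of what this could mean for code consuming those values, assuming plain Python floats and tuples; the helper name `collapse_frame_confidence` and the default `min` aggregation are hypothetical illustrations, not part of the NeMo API:

```python
from typing import Callable, Sequence, Tuple, Union

# Assumption: a frame confidence element is either a single score or a
# (prediction_confidence, duration_confidence) pair.
FrameConfidence = Union[float, Tuple[float, float]]


def collapse_frame_confidence(
    element: FrameConfidence,
    aggregate: Callable[[Sequence[float]], float] = min,
) -> float:
    """Return one score per frame element, collapsing a pair if present."""
    if isinstance(element, (tuple, list)):
        # Pair case: combine prediction and duration confidence into one number.
        return aggregate(element)
    # Single-score case: nothing to collapse.
    return element
```

With `min` as the aggregation, a pair such as `(0.9, 0.4)` collapses to `0.4`, while a plain score passes through unchanged.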
101 changes: 80 additions & 21 deletions nemo/collections/asr/parts/submodules/rnnt_decoding.py
@@ -96,6 +96,9 @@ class AbstractRNNTDecoding(ConfidenceMixin):
from the `token_confidence`.
aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg: A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
@@ -209,7 +212,8 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
self.compute_timestamps = self.cfg.get('compute_timestamps', None)
self.word_seperator = self.cfg.get('word_seperator', ' ')

if self.durations is not None and self.durations != []: # this means it's a TDT model.
self._is_tdt = self.durations is not None and self.durations != [] # this means it's a TDT model.
if self._is_tdt:
if blank_id == 0:
raise ValueError("blank_id must equal len(non_blank_vocabs) for TDT models")
if self.big_blank_durations is not None and self.big_blank_durations != []:
@@ -254,6 +258,12 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
# initialize confidence-related fields
self._init_confidence(self.cfg.get('confidence_cfg', None))

if self._is_tdt:
if self.preserve_frame_confidence is True and self.preserve_alignments is False:
raise ValueError(
"If `preserve_frame_confidence` flag is set, then `preserve_alignments` flag must also be set."
)

# Confidence estimation is not implemented for these strategies
if (
not self.preserve_frame_confidence
@@ -264,7 +274,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):

if self.cfg.strategy == 'greedy':
if self.big_blank_durations is None or self.big_blank_durations == []:
if self.durations is None or self.durations == []:
if not self._is_tdt:
self.decoding = rnnt_greedy_decoding.GreedyRNNTInfer(
decoder_model=decoder,
joint_model=joint,
@@ -289,6 +299,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
),
preserve_alignments=self.preserve_alignments,
preserve_frame_confidence=self.preserve_frame_confidence,
include_duration_confidence=self.tdt_include_duration_confidence,
confidence_method_cfg=self.confidence_method_cfg,
)
else:
@@ -307,7 +318,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):

elif self.cfg.strategy == 'greedy_batch':
if self.big_blank_durations is None or self.big_blank_durations == []:
if self.durations is None or self.durations == []:
if not self._is_tdt:
self.decoding = rnnt_greedy_decoding.GreedyBatchedRNNTInfer(
decoder_model=decoder,
joint_model=joint,
@@ -334,6 +345,7 @@ def __init__(self, decoding_cfg, decoder, joint, blank_id: int):
),
preserve_alignments=self.preserve_alignments,
preserve_frame_confidence=self.preserve_frame_confidence,
include_duration_confidence=self.tdt_include_duration_confidence,
confidence_method_cfg=self.confidence_method_cfg,
use_cuda_graph_decoder=self.cfg.greedy.get('use_cuda_graph_decoder', False),
)
@@ -530,7 +542,7 @@ def decode_hypothesis(self, hypotheses_list: List[Hypothesis]) -> List[Union[Hyp
if self.big_blank_durations is not None and self.big_blank_durations != []: # multi-blank RNNT
num_extra_outputs = len(self.big_blank_durations)
prediction = [p for p in prediction if p < self.blank_id - num_extra_outputs]
elif self.durations is not None and self.durations != []: # TDT model.
elif self._is_tdt: # TDT model.
prediction = [p for p in prediction if p < self.blank_id]
else: # standard RNN-T
prediction = [p for p in prediction if p != self.blank_id]
@@ -569,28 +581,69 @@ def compute_confidence(self, hypotheses_list: List[Hypothesis]) -> List[Hypothes
Returns:
A list of hypotheses with high-level confidence scores.
"""
if self.exclude_blank_from_confidence:
for hyp in hypotheses_list:
hyp.token_confidence = hyp.non_blank_frame_confidence
else:
if self._is_tdt:
# if self.tdt_include_duration_confidence is True then frame_confidence elements consist of two numbers
maybe_pre_aggregate = (
(lambda x: self._aggregate_confidence(x)) if self.tdt_include_duration_confidence else (lambda x: x)
)
for hyp in hypotheses_list:
offset = 0
token_confidence = []
if len(hyp.timestep) > 0:
for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):
if ts != te:
# <blank> tokens are considered to belong to the last non-blank token, if any.
token_confidence.append(
self._aggregate_confidence(
[hyp.frame_confidence[ts][offset]]
+ [fc[0] for fc in hyp.frame_confidence[ts + 1 : te]]
# trying to recover frame_confidence according to alignments
subsequent_blank_confidence = []
# going backwards since <blank> tokens are considered belonging to the last non-blank token.
for fc, fa in zip(hyp.frame_confidence[::-1], hyp.alignments[::-1]):
# there is only one score per frame most of the time
if len(fa) > 1:
for i, a in reversed(list(enumerate(fa))):
if a[-1] == self.blank_id:
if not self.exclude_blank_from_confidence:
subsequent_blank_confidence.append(maybe_pre_aggregate(fc[i]))
elif not subsequent_blank_confidence:
token_confidence.append(maybe_pre_aggregate(fc[i]))
else:
token_confidence.append(
self._aggregate_confidence(
[maybe_pre_aggregate(fc[i])] + subsequent_blank_confidence
)
)
)
offset = 0
subsequent_blank_confidence = []
else:
i, a = 0, fa[0]
if a[-1] == self.blank_id:
if not self.exclude_blank_from_confidence:
subsequent_blank_confidence.append(maybe_pre_aggregate(fc[i]))
elif not subsequent_blank_confidence:
token_confidence.append(maybe_pre_aggregate(fc[i]))
else:
token_confidence.append(hyp.frame_confidence[ts][offset])
offset += 1
token_confidence.append(
self._aggregate_confidence([maybe_pre_aggregate(fc[i])] + subsequent_blank_confidence)
)
subsequent_blank_confidence = []
token_confidence = token_confidence[::-1]
hyp.token_confidence = token_confidence
else:
if self.exclude_blank_from_confidence:
for hyp in hypotheses_list:
hyp.token_confidence = hyp.non_blank_frame_confidence
else:
for hyp in hypotheses_list:
offset = 0
token_confidence = []
if len(hyp.timestep) > 0:
for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):
if ts != te:
# <blank> tokens are considered to belong to the last non-blank token, if any.
token_confidence.append(
self._aggregate_confidence(
[hyp.frame_confidence[ts][offset]]
+ [fc[0] for fc in hyp.frame_confidence[ts + 1 : te]]
)
)
offset = 0
else:
token_confidence.append(hyp.frame_confidence[ts][offset])
offset += 1
hyp.token_confidence = token_confidence
if self.preserve_word_confidence:
for hyp in hypotheses_list:
hyp.word_confidence = self._aggregate_token_confidence(hyp)
@@ -1010,6 +1063,9 @@ class RNNTDecoding(AbstractRNNTDecoding):
from the `token_confidence`.
aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg: A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
@@ -1276,6 +1332,9 @@ class RNNTBPEDecoding(AbstractRNNTDecoding):
from the `token_confidence`.
aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.
Valid options are `mean`, `min`, `max`, `prod`.
tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence,
making TDT frame confidence element a pair: (`prediction_confidence`, `duration_confidence`).
method_cfg: A dict-like object which contains the method name and settings to compute per-frame
confidence scores.
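
The TDT branch added to `compute_confidence` in this file walks frames backwards so that `<blank>` confidences are folded into the last preceding non-blank token. The sketch below is a simplified, standalone illustration of that idea and not the code from this commit. It assumes labels are given as a separate per-frame list rather than read from `hyp.alignments`, that each frame confidence element is a single float (the duration pair is already collapsed), and that `min` stands in for the configured aggregation. The function name `tdt_token_confidence` is hypothetical.

```python
from typing import Callable, List, Sequence


def tdt_token_confidence(
    frame_confidence: List[List[float]],
    frame_labels: List[List[int]],
    blank_id: int,
    exclude_blank: bool = False,
    aggregate: Callable[[Sequence[float]], float] = min,
) -> List[float]:
    """Derive per-token confidence from per-frame confidence (simplified sketch).

    frame_confidence[t][i] is the score of the i-th emission at frame t and
    frame_labels[t][i] is its label id (both layouts are assumptions).
    """
    token_confidence: List[float] = []
    trailing_blanks: List[float] = []
    # Go backwards: blanks belong to the last non-blank token before them.
    for confs, labels in zip(reversed(frame_confidence), reversed(frame_labels)):
        for conf, label in zip(reversed(confs), reversed(labels)):
            if label == blank_id:
                if not exclude_blank:
                    trailing_blanks.append(conf)
            elif not trailing_blanks:
                # No blanks followed this token: keep its own score.
                token_confidence.append(conf)
            else:
                # Fold the blanks that followed this token into its score.
                token_confidence.append(aggregate([conf] + trailing_blanks))
                trailing_blanks = []
    # Scores were collected last-to-first, so restore the original order.
    return token_confidence[::-1]


# Usage: four frames, the third emits two tokens, blank_id is 10.
conf = [[0.9], [0.5], [0.8, 0.7], [0.6]]
labels = [[3], [10], [4, 5], [10]]
print(tdt_token_confidence(conf, labels, blank_id=10))  # [0.5, 0.8, 0.6]
```

As in the diff above, blanks that precede the first non-blank token have no token to attach to and are dropped in this sketch.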