Bug when generating confidence scores with timestamps for a buffered rnnt model #11456

aanchan · 2024-12-03T09:47:54Z

Describe the bug

When generating confidence scores with timestamps from an RNN transducer model (stt_en_conformer_transducer_large) with buffered inference, decoding fails with a type error on this line:

for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)])

Sample:: 100%|██████████| 1/1 [00:00<00:00, 11618.57it/s]
[NeMo W 2024-12-03 14:55:42 rnnt_decoding:1184] Specified segment seperators are not in supported punctuation {"'"}. If the seperators are not punctuation marks, ignore this warning. Otherwise, specify 'segment_gap_threshold' parameter in decoding config to form segments.
<class 'dict'>
Backend macosx is interactive backend. Turning interactive mode on.
Error executing job with overrides: ['model_path=null', 'pretrained_name=stt_en_conformer_transducer_large', 'audio_dir=/Users/aanchan/work/podcast_transcription_using_nemo/test', 'output_filename=/Users/aanchan/work/podcast_transcription_using_nemo/test_rnn_t_f1.json', 'total_buffer_in_secs=4.0', 'chunk_len_in_secs=1.6', 'model_stride=4', 'batch_size=32', 'merge_algo=lcs', 'lcs_alignment_dir=$PWD/lcs']
Traceback (most recent call last):
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/Users/aanchan/work/podcast_transcription_using_nemo/rnnt_timestamps.py", line 301, in main
    hyps = get_buffered_pred_feat_rnnt(
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/nemo/collections/asr/parts/utils/transcribe_utils.py", line 95, in get_buffered_pred_feat_rnnt
    hyp_list = asr.transcribe(tokens_per_chunk, delay)
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/nemo/collections/asr/parts/utils/streaming_utils.py", line 1309, in transcribe
    self.infer_logits()
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/nemo/collections/asr/parts/utils/streaming_utils.py", line 1081, in infer_logits
    self._get_batch_preds()
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/nemo/collections/asr/parts/utils/streaming_utils.py", line 1148, in _get_batch_preds
    best_hyp, _ = self.asr_model.decoding.rnnt_decoder_predictions_tensor(
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 569, in rnnt_decoder_predictions_tensor
    hypotheses = self.compute_confidence(hypotheses)
  File "/Users/aanchan/work/podcast_transcription_using_nemo/env_nemo_1/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 688, in compute_confidence
    for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):
TypeError: unhashable type: 'slice'

A clear and concise description of what the bug is.

From the debugger, it looks like hyp.timestamp is a dict, and the zip should really iterate over hyp.timestamp['timestamp'], which happens to be a PyTorch tensor. Slicing a dict, e.g. hyp.timestamp[1:], is not valid, and that is what raises the TypeError in the traceback above.
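The failure can be reproduced without NeMo. The following is a minimal sketch, not the library's fix: `frame_spans` is a hypothetical helper, the `"timestamp"` key is taken from the debugger observation above and may differ between NeMo versions, and plain lists stand in for the tensor. Note that on Python 3.12+ slices are hashable, so the same dict slice raises KeyError instead of TypeError.

```python
def frame_spans(timestep, num_frames, key="timestamp"):
    """Pair each emission frame with the next one (num_frames closes the last span).

    `timestep` may be a plain sequence, a tensor-like object with .tolist(),
    or a dict holding the frame indices under `key` (assumed key name).
    """
    if isinstance(timestep, dict):
        timestep = timestep[key]
    if hasattr(timestep, "tolist"):
        # tensor[1:] + [n] would add element-wise rather than concatenate,
        # so convert tensor-like inputs to a Python list first.
        timestep = timestep.tolist()
    starts = list(timestep)
    return list(zip(starts, starts[1:] + [num_frames]))


# Stand-in for hyp.timestep as seen in the debugger: a dict, not a list.
hyp_timestep = {"timestamp": [0, 3, 7]}

try:
    hyp_timestep[1:]  # same operation as on the failing line
except (TypeError, KeyError) as err:
    error_name = type(err).__name__  # TypeError on Python 3.10

spans = frame_spans(hyp_timestep, num_frames=10)
```

With the dict unwrapped first, the zip yields one (start, end) frame pair per emitted token, which is what compute_confidence appears to expect.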

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Start with the code example speech_to_text_buffered_infer_rnnt.py
Add a ConfidenceConfig as in the ASR with confidence estimation tutorial

from nemo.collections.asr.parts.utils.asr_confidence_utils import (
    ConfidenceConfig,
    ConfidenceConstants,
    ConfidenceMethodConfig,
    ConfidenceMethodConstants,
)

confidence_cfg = ConfidenceConfig(
    preserve_frame_confidence=True, # Internally set to true if preserve_token_confidence == True
    # or preserve_word_confidence == True
    preserve_token_confidence=True, # Internally set to true if preserve_word_confidence == True
    preserve_word_confidence=True,
    aggregation="prod", # How to aggregate frame scores to token scores and token scores to word scores
    exclude_blank=False, # If true, only non-blank emissions contribute to confidence scores
    tdt_include_duration=False, # If true, calculate duration confidence for the TDT models
    method_cfg=ConfidenceMethodConfig( # Config for per-frame scores calculation (before aggregation)
        name="max_prob", # Or "entropy" (default), which usually works better
        entropy_type="gibbs", # Used only for name == "entropy". Recommended: "tsallis" (default) or "renyi"
        alpha=0.5, # Low values (<1) increase sensitivity, high values decrease sensitivity
        entropy_norm="lin" # How to normalize (map to [0,1]) entropy. Default: "exp"
    )
)

Change the decoding strategy and attach the confidence config to the RNNTDecodingConfig being used

from nemo.collections.asr.parts.submodules.rnnt_decoding import RNNTDecodingConfig

asr_model.change_decoding_strategy(
    RNNTDecodingConfig(
        compute_timestamps=True,
        preserve_alignments=True,
        confidence_cfg=confidence_cfg,
    )
)

Run the script with the following arguments

model_path=null
pretrained_name=stt_en_conformer_transducer_large
audio_dir=/Users/aanchan/work/podcast_transcription_using_nemo/test
output_filename=/Users/aanchan/work/podcast_transcription_using_nemo/test_rnn_t_f1.json
total_buffer_in_secs=4.0
chunk_len_in_secs=1.6
model_stride=4
batch_size=32
merge_algo="lcs"
lcs_alignment_dir=$PWD/lcs
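These are Hydra-style key=value overrides. As a rough sketch, the full command line can be assembled as follows, assuming the modified example is saved as rnnt_timestamps.py (the file name that appears in the traceback) and is run from the project directory; the current working directory stands in for $PWD.

```python
import os
import shlex
import sys

# Overrides copied from the list above; adjust the paths to your machine.
overrides = {
    "model_path": "null",
    "pretrained_name": "stt_en_conformer_transducer_large",
    "audio_dir": "/Users/aanchan/work/podcast_transcription_using_nemo/test",
    "output_filename": "/Users/aanchan/work/podcast_transcription_using_nemo/test_rnn_t_f1.json",
    "total_buffer_in_secs": "4.0",
    "chunk_len_in_secs": "1.6",
    "model_stride": "4",
    "batch_size": "32",
    "merge_algo": "lcs",
    "lcs_alignment_dir": os.path.join(os.getcwd(), "lcs"),
}

# rnnt_timestamps.py is the assumed script name (taken from the traceback).
command = [sys.executable, "rnnt_timestamps.py"]
command += [f"{key}={value}" for key, value in overrides.items()]

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(command))
```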

An example code file and input is in this Google Drive folder

Expected behavior

A clear and concise description of what you expected to happen.

The expected output was a JSON file with the timestamps written out.

Environment overview (please complete the following information)

Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]
The environment is a local laptop installation
Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.

pip install Cython packaging
pip install --upgrade pip
export BRANCH="main"
pip install torch torchvision torchaudio 
pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]

If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

OS version: macOS Monterey 12.5
PyTorch version: 2.5.1
Python version: 3.10.4

Additional context

Add any other context about the problem here.
Example: GPU model

This was run on a CPU, not a GPU.

aanchan added the "bug" (Something isn't working) label on Dec 3, 2024
