-
I see. I don't think we have ever tried to make RNNT char models work in buffered inference mode. There's not much of a reason to, actually, because character encoding can be simulated with SentencePiece using the "char" SPE type. Anyway, now that you have a char model, here are a few options -
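On the SentencePiece point, a char-level tokenizer can be trained directly with the sentencepiece library. This is only a minimal sketch (the file path and vocab_size are placeholders); the resulting .model file would then back a tokenizer-based NeMo model, which does expose model.tokenizer:

```python
# Sketch: train a character-level SentencePiece model so that "char" encoding
# is reproduced by a tokenizer. The path and vocab_size are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",  # one transcript per line (placeholder path)
    model_prefix="char_tokenizer",  # writes char_tokenizer.model / char_tokenizer.vocab
    model_type="char",              # character-level pieces instead of BPE
    vocab_size=60,                  # must cover all characters in the corpus plus specials
)

# Quick check that the pieces really are single characters.
sp = spm.SentencePieceProcessor(model_file="char_tokenizer.model")
print(sp.encode("hello world", out_type=str))  # list of character-level pieces
```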
-
Thanks so much for the reply and the help @titu1994. Still working on this, but I thought I'd let you know that I tried substituting model.decoding for self.tokenizer as you suggested. It turns out the tokenizer is passed into streaming_utils.py, where it is used quite a bit in various places, so it wasn't just two places. But, just to check whether this works, I made the following wrapper..
And then did this in a couple of places
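Roughly, the idea is a shim like the one below. This is a sketch, not the exact code; the method names that streaming_utils actually calls should be checked against your NeMo version:

```python
# Sketch: a tokenizer-like shim over model.decoding for a char EncDecRNNTModel.
# Assumption: the buffered-inference code only needs ids -> tokens / ids -> text
# (check streaming_utils.py in your NeMo version for the exact methods it uses).
class DecodingAsTokenizer:
    def __init__(self, decoding):
        # decoding is asr_model.decoding (RNNTDecoding for char vocabularies)
        self.decoding = decoding

    def ids_to_tokens(self, ids):
        # decode_ids_to_tokens maps label ids to their characters
        return self.decoding.decode_ids_to_tokens(list(ids))

    def ids_to_text(self, ids):
        # decode_tokens_to_str joins the ids back into a plain string
        return self.decoding.decode_tokens_to_str(list(ids))


# Then, wherever streaming_utils reaches for asr_model.tokenizer, hand it the shim:
# tokenizer = DecodingAsTokenizer(asr_model.decoding)
```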
Unfortunately I am getting very poor performance, in particular almost entirely blank or empty transcriptions. So even with a 19-minute-long source audio (which transcribes OK otherwise) I only end up with a tiny bit of text.
It's so 'wrong' that I wonder if I am just not doing the transcribe step correctly. I have other code that transcribes with EncDecRNNTModel just fine, using the transcribe() method. This model was trained with 'streaming' in mind (in terms of context buffers etc.). Maybe it's to do with mismatching stride lengths or context buffer sizes or something. To help me debug this, is there any simpler version of this streaming code? That is, one that does true one-at-a-time 'streaming' (as opposed to batch based), and that doesn't calculate the buffers up front but instead processes the audio as it comes in?
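For what it's worth, this is roughly the stride/delay bookkeeping the buffered-inference scripts do, which is where a mismatch like that would show up. It's only a sketch; the subsampling factor and window stride below are assumptions and should be read from the model config:

```python
# Sketch of the chunk/buffer arithmetic used by NeMo's buffered inference scripts.
# The concrete numbers are assumptions; read the real ones from asr_model.cfg.
import math

chunk_len_in_secs = 8.0      # new audio consumed per step
total_buffer_in_secs = 16.0  # chunk plus the left/right context kept in the buffer

window_stride = 0.01         # asr_model.cfg.preprocessor.window_stride (often 10 ms)
subsampling_factor = 4       # Conformer encoders typically subsample 4x (check your config)
model_stride_in_secs = window_stride * subsampling_factor

# How many encoder frames one chunk corresponds to, and how far into the buffer
# the decoder should look when merging chunk outputs.
tokens_per_chunk = math.ceil(chunk_len_in_secs / model_stride_in_secs)
mid_delay = math.ceil(
    (chunk_len_in_secs + (total_buffer_in_secs - chunk_len_in_secs) / 2)
    / model_stride_in_secs
)

# If these don't match what the merging code assumes, the stitched transcript
# can come out mostly empty.
print(tokens_per_chunk, mid_delay)
```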
-
Hi!
We have a custom-trained NeMo model: a Conformer with an RNNT decoder using char encoding.
At the risk of repeating myself, it's a Conformer model using char encoding (instead of BPE) and an RNNT/Transducer decoder (instead of CTC). The model class is EncDecRNNTModel.
I'm trying to get this working in streaming aka buffered inference mode.
There are some excellent notebooks with explanations and example code for how to do streaming with NeMo: this, here and here.
(Yes, I do realize that these notebooks are in the NeMo GitHub, not on Google per se.)
I'm getting problems that might be because the examples have not been updated to the latest versions? Or maybe it's something else. Anyway, I would really appreciate any help.
The short version of the problem I'm having is that I get this error when I try to use it.
Specifically, the LongestCommonSubsequenceBatchedFrameASRRNNT class (from nemo/collections/asr/parts/utils/streaming_utils.py) makes a reference to the model.tokenizer object.
It does that on line 715.
The problem is that the asr_model I'm using, aka the EncDecRNNTModel from NeMo 1.20, doesn't have a tokenizer. Methods like decode_ids_to_tokens are on the model.decoding object.
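A quick way to see the mismatch (a sketch; the checkpoint name is a placeholder for the custom .nemo file):

```python
# Sketch: confirm where the id-to-text methods live on a char RNNT model.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.restore_from("my_char_rnnt_model.nemo")

print(hasattr(asr_model, "tokenizer"))                      # expected: False for char models
print(hasattr(asr_model.decoding, "decode_ids_to_tokens"))  # expected: True
# e.g. asr_model.decoding.decode_ids_to_tokens([1, 2, 3]) returns a list of characters
```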
Any help very much appreciated! Thanks in advance.
BTW I'm using ..