ASR for rapid speech #6360

OllieBroadhurst · 2023-04-04T07:27:25Z

Hi there!

I'm currently training a conformer transducer model on a dataset with very fast conversational speech. Transducer models can emit multiple tokens per time step, which happens fairly frequently on this data, but omitted words are still very common. In my mind I have two options:

1. Slow the speech down with a tool such as rubberband.
2. Decrease the n_window_stride (hop_length) of the preprocessor to allow for higher-resolution time steps.

I'm leaning towards the second option as the first will require additional processing and will also slow down any speech spoken at a normal rate. I want to know how the second option will affect training - will the model behave as normal but with the audio appearing artificially longer due to the additional timesteps? How easily will the model be able to adapt to the additional number of features?
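As a rough way to reason about the second option, the sketch below estimates how many feature frames a clip produces at a given window stride, and how the cost of full self-attention would scale relative to a baseline. It assumes one frame per stride with no padding or centering effects (a real preprocessor may differ by a frame or two) and full, unrestricted self-attention. The function names `num_feature_frames` and `relative_attention_cost` are made up for illustration and are not part of any library.

```python
def num_feature_frames(duration_s: float, window_stride_s: float) -> int:
    """Approximate number of spectrogram frames for a clip.

    Assumes one frame per stride and ignores padding/centering details.
    """
    return round(duration_s / window_stride_s)


def relative_attention_cost(frames: int, baseline_frames: int) -> float:
    """Cost of full self-attention relative to a baseline sequence length.

    Full self-attention scales quadratically with sequence length.
    """
    return (frames / baseline_frames) ** 2


clip_s = 10.0  # hypothetical clip length in seconds
baseline = num_feature_frames(clip_s, 0.01)

for stride_s in (0.01, 0.005):
    frames = num_feature_frames(clip_s, stride_s)
    cost = relative_attention_cost(frames, baseline)
    print(f"stride={stride_s * 1000:.0f} ms -> {frames} frames, "
          f"{cost:.1f}x attention cost vs 10 ms")
```

Under these assumptions, halving the stride doubles the number of frames the model sees for the same audio, so the clip does look "longer" to the model, and memory and compute grow accordingly.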

I'm using a BPE vocabulary of 1024 tokens.

Thanks in advance!

Answered by titu1994

titu1994 · 2023-04-04T08:05:25Z

Maintainer

Note that Conformer is a 4x stride model, and we use a window stride of 0.01 s, so each Conformer output step effectively covers 0.04 s of audio. If you make the window stride larger, remember that you will have to deal with longer and longer delays between token emissions. There is also no guarantee that a different stride will do better: RNNT can predict multiple tokens per time step, but it will not predict overlapping tokens in the same time frame without being trained specifically for that task. So be careful about arbitrarily changing the window stride.
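The arithmetic in this reply can be written out as a small sketch. It only multiplies the window stride by the encoder's subsampling factor (4x, as stated above); the function name `encoder_step_duration` and the alternative stride values are illustrative assumptions, not settings taken from any particular config.

```python
def encoder_step_duration(window_stride_s: float, subsampling_factor: int = 4) -> float:
    """Seconds of audio covered by one encoder output step."""
    return window_stride_s * subsampling_factor


# 0.01 s is the stride mentioned in the reply; the others are hypothetical.
for stride_s in (0.005, 0.01, 0.02):
    step_s = encoder_step_duration(stride_s)
    print(f"window stride {stride_s:.3f} s -> encoder step {step_s:.3f} s "
          f"({1 / step_s:.1f} steps per second)")
```

A larger stride means fewer, coarser steps per second, which is where the longer gap between token emissions comes from.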

OllieBroadhurst · 2023-04-04T08:15:55Z

Author

0.04 s is very short, even for rapid speech, so perhaps the time-step resolution isn't the issue here. I'll keep experimenting. Thank you for the quick reply!
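One way to sanity-check this intuition is to estimate the average number of tokens that fall into a single 0.04 s encoder step. The sketch below is back-of-the-envelope only: the speaking rate and tokens-per-word values are hypothetical placeholders, not measurements from this dataset, and `tokens_per_encoder_step` is an illustrative name.

```python
def tokens_per_encoder_step(words_per_second: float,
                            tokens_per_word: float,
                            step_s: float = 0.04) -> float:
    """Average number of tokens spoken during one encoder output step."""
    return words_per_second * tokens_per_word * step_s


# Hypothetical values: substitute rates measured on your own transcripts.
fast_speech_wps = 6.0      # assumed words per second for rapid speech
bpe_tokens_per_word = 1.5  # assumed average for a 1024-token BPE vocabulary

average = tokens_per_encoder_step(fast_speech_wps, bpe_tokens_per_word)
print(f"{average:.2f} tokens per 0.04 s step on average")
```

If the average comes out well below one token per step, the step resolution is unlikely to be the bottleneck on average, although an average says nothing about short bursts of very dense speech.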
