ASR for rapid speech #6360
-
Hi there! I'm currently training a conformer transducer model on a dataset with very fast conversational speech. Transformer transducers allow for multiple words per time step, which happens fairly frequently, but omissions are still very common. In my mind I have two options:
I'm leaning towards the second option, as the first will require additional processing and will also slow down any speech spoken at a normal rate. I want to know how the second option will affect training: will the model behave as normal, but with the audio appearing artificially longer due to the additional timesteps? How easily will the model be able to adapt to the additional number of features? I'm using a BPE vocabulary of 1024 tokens. Thanks in advance!
Replies: 2 comments
-
Note that conformer is a 4x stride model, and we use a window stride of 0.01 s, so each conformer output step effectively covers 0.04 s of audio. If you increase the window stride, remember that you will have to deal with a longer and longer delay between token emissions. There is also no guarantee that it will do better: RNNT can predict multiple tokens per timestep, but it will not predict overlapping tokens in the same time frame without being specifically trained for such a task. So be careful about arbitrarily changing the window stride.
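To make the arithmetic concrete, here is a minimal sketch of how window stride and subsampling combine into the duration of one encoder output step, and roughly how many tokens would fall into each step. The helper names and the speech rate of 10 BPE tokens per second are hypothetical and chosen only for illustration; none of this is NeMo API.

```python
def encoder_frame_ms(window_stride_ms: int, subsampling: int = 4) -> int:
    """Audio duration (ms) covered by one encoder output step."""
    return window_stride_ms * subsampling


def tokens_per_frame(tokens_per_second: float,
                     window_stride_ms: int,
                     subsampling: int = 4) -> float:
    """Average number of tokens that fall into one encoder output step."""
    frame_s = encoder_frame_ms(window_stride_ms, subsampling) / 1000.0
    return tokens_per_second * frame_s


if __name__ == "__main__":
    # Hypothetical speech rate, assumed only for this illustration.
    rate = 10.0  # BPE tokens per second
    for stride_ms in (5, 10, 20):
        frame = encoder_frame_ms(stride_ms)
        avg = tokens_per_frame(rate, stride_ms)
        print(f"window stride {stride_ms} ms -> {frame} ms per encoder step, "
              f"about {avg:.2f} tokens per step")
```

Under these assumptions, the default 10 ms stride gives 40 ms per encoder step, which is well under one token per step on average. Doubling the stride doubles both the step duration and the average number of tokens each step has to emit.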
-
0.04 s is very small, even for rapid speech, so perhaps timesteps aren't the issue here. I'll keep experimenting. Thank you for the quick reply!