
- Contents of projects of Speech Recognition / Speech Synthesis (Tổng hợp giọng nói - CS535): pre-trained two models - Tacotron2 and WaveGlow.


ndtuan10/SpeechRecognition-Tacotron2


SpeechRecognition-Tacotron2


Introduction to data

Training the model

Evaluating the model

| Score | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Label | Very bad | Bad | Medium | Good | Excellent |

Introduction to testing data

| No. | Content |
| --- | --- |
| 1 | My love for you is like the raging sea. So powerful and deep it will forever be. Through storm, wind, and heavy rain. It will withstand every pain. |
| 2 | Amazing good job, you! |
| 3 | Manchester City kept their hopes of winning a fourth consecutive Carabao Cup alive after overcoming Manchester United 2-0 in their semi-final at Old Trafford. |
| 4 | Roger Federer led 4-1, and 30-0 in the second set. |
| 5 | Hanoi capital continue to get cold air which flows from the North, the temperature may drop under 15 degrees, so citizens should keep the bodies warm. |
| ... | ... |
| ... | ... |
| 23 | And the memories bring back, memories bring back you. |
| 24 | I woke up early this morning. So I went to school on time. |
| 25 | And all my love, I’m holding on forever. Reaching for the love that seems so far. |

Text synthesis

  • We enter English text in any “TEXT” field, for example: “My love for you is like the raging sea. So powerful and deep it will forever be. Through storm, wind, and heavy rain, it will withstand every pain”. A rough Vietnamese translation is: "Tình yêu anh dành cho em giống như biển cả đang điên cuồng. Quá mạnh mẽ và sâu sắc, nó sẽ luôn mãi mãi như vậy. Băng qua bão, gió và mưa lớn. Nó sẽ chịu đựng được tất cả mọi nỗi đau".

  • This text reads like a verse, so we expect the Tacotron2 and WaveGlow models to deliver a soothing, soulful reading.

  • We then convert the text into a mel spectrogram and plot it with the matplotlib library.

  • We then convert the generated mel spectrogram to audio. WaveGlow runs in inference mode on the mel spectrogram output by the Tacotron2 post-net, with sigma = 0.666 (the standard deviation of the Gaussian noise that WaveGlow samples from at inference) and a sampling rate of 22.05 kHz (22,050 Hz).

  • The result is an 11-second audio clip of a female voice reading the English text entered above. The audio can be downloaded from https://github.com/ndtuan10/SpeechRecognition-Tacotron2/blob/main/result.wav
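The steps above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: it assumes the pre-trained checkpoints published on PyTorch Hub under `NVIDIA/DeepLearningExamples:torchhub`, a CUDA-capable GPU, network access, and `torch` and `scipy` installed. The names `synthesize`, `duration_seconds`, and `HUB_REPO` are hypothetical helpers introduced for illustration.

```python
SAMPLING_RATE = 22050  # Hz (22.05 kHz), as stated above
SIGMA = 0.666          # WaveGlow sampling parameter, as stated above
HUB_REPO = "NVIDIA/DeepLearningExamples:torchhub"  # assumption: PyTorch Hub checkpoints


def duration_seconds(num_samples, rate=SAMPLING_RATE):
    """Length of a mono clip in seconds, given its number of samples."""
    return num_samples / rate


def synthesize(text, out_path="result.wav"):
    """Text -> mel spectrogram (Tacotron2) -> waveform (WaveGlow) -> WAV file."""
    # Heavy dependencies are imported here so the helpers above work without them.
    import torch
    from scipy.io.wavfile import write

    # Load both pre-trained models in half precision (requires a GPU).
    tacotron2 = torch.hub.load(HUB_REPO, "nvidia_tacotron2", model_math="fp16")
    tacotron2 = tacotron2.to("cuda").eval()
    waveglow = torch.hub.load(HUB_REPO, "nvidia_waveglow", model_math="fp16")
    waveglow = waveglow.remove_weightnorm(waveglow)
    waveglow = waveglow.to("cuda").eval()

    # Encode the text as a padded tensor of symbol IDs.
    utils = torch.hub.load(HUB_REPO, "nvidia_tts_utils")
    sequences, lengths = utils.prepare_input_sequence([text])

    with torch.no_grad():
        # infer() returns the post-net mel spectrogram first.
        mel, _, _ = tacotron2.infer(sequences, lengths)
        audio = waveglow.infer(mel, sigma=SIGMA)

    # Cast to float32 before saving so the WAV file uses a common sample format.
    samples = audio[0].float().cpu().numpy()
    write(out_path, SAMPLING_RATE, samples)
    return mel, samples
```

Calling `synthesize("My love for you is like the raging sea.")` would write `result.wav` and return the mel spectrogram, which can then be plotted with matplotlib. At 22,050 Hz, an 11-second clip corresponds to 242,550 samples, which `duration_seconds` can confirm.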

Introduction to evaluation criteria and experimental results table

| System | MOS |
| --- | --- |
| Tacotron2 + WaveGlow (1st person) | 3.82 |
| Tacotron2 + WaveGlow (2nd person) | 3.65 |
| Tacotron2 + WaveGlow (average) | 3.735 |
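The arithmetic behind the table can be sketched as below, assuming each listener's MOS (mean opinion score) is the plain average of their 1 to 5 ratings over the test sentences, and the last row is the average of the two listeners' scores. The individual ratings are not published, so `example_ratings` is purely hypothetical; only the two per-listener values come from the table.

```python
from statistics import mean

# Rating scale used in the evaluation.
SCALE = {1: "Very bad", 2: "Bad", 3: "Medium", 4: "Good", 5: "Excellent"}


def mos(scores):
    """Mean opinion score: the average of ratings on the 1-5 scale."""
    for score in scores:
        if score not in SCALE:
            raise ValueError(f"rating {score!r} is outside the 1-5 scale")
    return mean(scores)


# Hypothetical ratings from one listener for five sentences (not real data).
example_ratings = [4, 4, 3, 5, 4]
print(mos(example_ratings))

# Per-listener MOS values reported in the table above.
reported = {"1st person": 3.82, "2nd person": 3.65}
overall = round(mean(list(reported.values())), 3)
print(overall)  # average of 3.82 and 3.65
```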

In conclusion

  • In our assessment, after training the two pre-trained Tacotron2 and WaveGlow models, the system handles different types of text (weather forecasts, stories, news, ...) with a voice that sounds natural and fluent, much like a human voice. In particular, for exclamations and questions the voice tends to raise its intonation at the end of the sentence, and the contraction "you're" is still pronounced correctly.
  • The Tacotron 2 and WaveGlow models form a TTS (text-to-speech) system that synthesizes natural-sounding speech from raw text without any additional information. WaveGlow produces high-quality speech from mel spectrograms; by combining ideas from Glow and WaveNet, it enables fast, efficient speech synthesis with a simple model that is easy to train.
  • Although the voice is clear and of high quality, it does not yet meet our requirements for reading poetry: the reading is not as inspiring and gentle as we expected.
  • In addition, when reading sports news, the model does not read specialized sports terms correctly, in particular the number “0” in scores. For example, in football, 2-0 is pronounced “two zero” instead of “two nil”; in tennis, 30-0 is pronounced “thirty zero” instead of “thirty love”.
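One possible workaround for the score problem, which the project itself does not implement, is to spell out scores before the text reaches Tacotron2. The sketch below is an assumption-laden illustration: `number_to_words` and `spell_scores` are hypothetical helpers, only numbers from 0 to 99 are covered, and the caller has to supply the sport-specific word for zero.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Matches scores such as "2-0" or "30-0" (one or two digits on each side).
SCORE = re.compile(r"\b(\d{1,2})-(\d{1,2})\b")


def number_to_words(n):
    """Spell out an integer between 0 and 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_scores(text, zero_word):
    """Replace digit scores with words, reading 0 as the given sports term."""
    def replace(match):
        values = [int(group) for group in match.groups()]
        words = [zero_word if v == 0 else number_to_words(v) for v in values]
        return " ".join(words)
    return SCORE.sub(replace, text)


print(spell_scores("overcoming Manchester United 2-0 in their semi-final", "nil"))
print(spell_scores("Roger Federer led 4-1, and 30-0 in the second set.", "love"))
```

The normalized string would then be passed to the synthesizer in place of the raw sentence. Choosing `zero_word` per sentence is left to the caller, since the text alone does not say which sport is being reported.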
