This repository contains code to train an end-to-end speech synthesis system, based on the Tacotron2 model with modifications as described in *Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis*.
The system consists of two parts:
- A Tacotron2 model with Dynamic Convolutional Attention, which modifies the hybrid location-sensitive attention mechanism to be purely location-based, resulting in better generalization to long utterances. This model takes text (as a character sequence) as input and predicts a sequence of mel-spectrogram frames as output (the seq2seq model).
- A WaveRNN-based vocoder, which takes the mel-spectrogram predicted in the previous step as input and generates a waveform as output (the vocoder model).
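The key property of dynamic convolutional attention is that each decoder step computes its alignment purely from the previous step's alignment, by convolving it with a bank of filters (static plus dynamically predicted ones in the real model) and normalizing with a softmax; no content term from the encoder outputs enters the energies, which is what helps generalization to long utterances. A minimal NumPy sketch, with random filters and a fixed averaging projection standing in for the learned components:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dca_step(prev_alignment, filters):
    """One (simplified) step of dynamic convolutional attention.

    prev_alignment: (T,) alignment over encoder steps from the previous
        decoder step. filters: (K, W) convolution filters; in the real
        model some are static and some are predicted from the decoder
        state each step. Returns a new (T,) alignment.
    """
    features = np.stack([
        np.convolve(prev_alignment, f, mode="same") for f in filters
    ])                                  # (K, T) location features
    # Collapse the K feature channels to one energy per encoder step;
    # this fixed average stands in for a learned projection layer.
    energies = features.mean(axis=0)    # (T,)
    return softmax(energies)

# Usage: start from an alignment peaked at encoder position 0.
rng = np.random.default_rng(0)
prev = np.zeros(50)
prev[0] = 1.0
filters = rng.standard_normal((8, 11)) * 0.1
align = dca_step(prev, filters)
```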
All audio processing parameters, model hyperparameters, training configuration, etc. are specified in `config.py`.
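For orientation, such a config module typically groups these settings as module-level constants. The names and values below are illustrative assumptions only, not the actual contents of `config.py`:

```python
# Illustrative sketch only -- see config.py for the real parameter names and values.

# Audio processing (assumed LJSpeech-style values)
sampling_rate = 22050   # Hz
n_fft = 1024            # STFT window size in samples
hop_length = 256        # STFT hop in samples
num_mels = 80           # mel-spectrogram channels

# Training configuration (illustrative)
batch_size = 32
learning_rate = 1e-3
checkpoint_interval = 10000  # steps between saved checkpoints
```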
- Download and extract the dataset (the dataset used for training is assumed to be in the same format as the LJSpeech dataset).
- Preprocess the downloaded dataset: perform feature extraction on the wav files and create train/val/eval splits.

  ```shell
  python preprocess.py \
      --dataset_dir <Path to the root of the downloaded dataset> \
      --out_dir <Output path to write the processed dataset>
  ```
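The split step of preprocessing can be sketched as follows. This is a hedged illustration assuming an LJSpeech-style `metadata.csv` with one `id|transcript` line per utterance; the actual split sizes and logic used by `preprocess.py` may differ:

```python
import random

def make_splits(metadata_lines, num_val=100, num_eval=100, seed=1234):
    """Shuffle utterances deterministically and carve off val/eval sets."""
    lines = list(metadata_lines)
    random.Random(seed).shuffle(lines)          # fixed seed => reproducible splits
    val = lines[:num_val]
    evaluation = lines[num_val:num_val + num_eval]
    train = lines[num_val + num_eval:]
    return train, val, evaluation

# Usage: split 1000 dummy utterances into 800/100/100.
lines = [f"LJ{i:04d}|some transcript" for i in range(1000)]
train, val, evaluation = make_splits(lines)
```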
- Train the Tacotron2 model.

  ```shell
  python train_Tacotron2.py \
      --data_dir <Path to the processed dataset to be used to train the model> \
      --checkpoint_dir <Path to location where training checkpoints will be saved> \
      --alignments_dir <Path to the location where training alignments will be saved> \
      --resume_checkpoint_path <If specified load checkpoint and resume training>
  ```
- Train the WaveRNN model.

  ```shell
  python train_WaveRNN.py \
      --data_dir <Path to the processed dataset to be used to train the model> \
      --checkpoint_dir <Path to location where training checkpoints will be saved> \
      --resume_checkpoint_path <If specified load checkpoint and resume training>
  ```
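WaveRNN vocoders commonly train on µ-law-quantized audio rather than raw floating-point samples; whether this repository does so is determined by `config.py`, but the companding itself can be sketched as:

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Map waveform samples in [-1, 1] to integer classes in [0, mu]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, mu=255):
    """Invert the companding: integer classes back to samples in [-1, 1]."""
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Round trip: quantization error stays small for a test signal.
x = 0.8 * np.sin(np.linspace(0, 4 * np.pi, 1000))
q = mulaw_encode(x)
x_hat = mulaw_decode(q)
```

The logarithmic companding allocates more quantization levels to small amplitudes, which matches the distribution of speech samples and keeps an 8-bit (256-class) output layer perceptually adequate.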
- Prepare the text to be synthesized.

  The text to be synthesized should be placed in the `synthesis.txt` file, which has the following format:

  ```
  <TEXT_ID_1> TEXT_1
  <TEXT_ID_2> TEXT_2
  .
  .
  .
  ```
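For example, a `synthesis.txt` with two utterances (the ids and sentences below are placeholders):

```
utt_001 The quick brown fox jumps over the lazy dog.
utt_002 Speech synthesis converts written text into audio.
```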
- Text-to-speech synthesis.

  ```shell
  python tts_Synthesis.py \
      --synthesis_file <Path to the synthesis.txt file (created in Step 1)> \
      --Tacotron2_checkpoint <Path to the trained Tacotron2 model to use for synthesis> \
      --WaveRNN_checkpoint <Path to the trained WaveRNN model to use for synthesis> \
      --out_dir <Path to where the synthesized waveforms will be written to disk>
  ```
This code is based on code from the following repositories: