This is an implementation of the FastSpeech paper.
The following script downloads the necessary resources:
./setup.sh
The model checkpoint can be downloaded with the following command:
gdown --id 1wk7amOThMnfZtoMEKELT399rANOy1ZnN -O best_checkpoint.pth
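Once downloaded, the checkpoint can be loaded the usual PyTorch way. The key layout inside the file is specific to this repo, so the snippet below only inspects it:

```python
import torch

# Load on CPU; move tensors to GPU after constructing the model.
checkpoint = torch.load("best_checkpoint.pth", map_location="cpu")
print(checkpoint.keys())  # inspect what the file actually contains
```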
IMPORTANT: I USE PRECOMPUTED ALIGNMENTS
Here are samples of the synthesized speech:
A defibrillator is a device that gives a high energy electric shock to the heart of someone who is in cardiac arrest
Massachusetts Institute of Technology may be best known for its math, science and engineering education
Wasserstein distance or Kantorovich Rubinstein metric is a distance function defined between probability distributions on a given metric space
The logs can be found in W&B.
The final model was trained for 40 epochs with this config. Training took 11 hours and 51 minutes.
The original aligner used durations extracted from wav2vec. Since this aligner computed durations in the waveform domain, converting them to mel-frame durations introduced rounding problems during duration modeling.
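A sketch of the issue, assuming a hop size of 256 samples (the hop size and function names are illustrative): rounding each token's duration independently lets the error accumulate, while rounding the cumulative sum keeps the total frame count consistent.

```python
import numpy as np

HOP_LENGTH = 256  # assumed STFT hop size; adjust to your mel config

def naive_frame_durations(sample_durations):
    # Rounding each token independently lets errors accumulate,
    # so the sum may not match the true spectrogram length.
    return np.round(np.asarray(sample_durations) / HOP_LENGTH).astype(int)

def cumulative_frame_durations(sample_durations):
    # Round the cumulative sum instead, then take differences:
    # the total always matches the rounded overall length.
    edges = np.round(np.cumsum(sample_durations) / HOP_LENGTH).astype(int)
    return np.diff(np.concatenate([[0], edges]))

durs = [1000, 900, 1100] * 10  # durations in waveform samples
print(naive_frame_durations(durs).sum())       # 120 frames
print(cumulative_frame_durations(durs).sum())  # 117 frames (the true total)
```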
I trained the model with this aligner for 17 hours in total. The results were not great, but the sentences are mostly intelligible.
Example:
The sentence is "The Chronicles of Newgate, Volume two. By Arthur Griffiths. Section four: Newgate down to eighteen eighteen."
The spectrograms are oversmoothed, but you can see some faint patterns in the low frequencies, which is not bad (upper: original, lower: predicted).
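For reference, a comparison figure like this can be produced with a few lines of matplotlib; the function name and the (n_mels, frames) layout are my assumptions:

```python
import matplotlib.pyplot as plt

def plot_pair(original, predicted):
    # original, predicted: (n_mels, frames) arrays
    fig, (top, bottom) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
    top.imshow(original, origin="lower", aspect="auto")
    top.set_title("original")
    bottom.imshow(predicted, origin="lower", aspect="auto")
    bottom.set_title("predicted")
    fig.tight_layout()
    return fig
```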
Another set of alignments was taken from an open-source implementation. These alignments have exactly the same shape as our encoded texts, and they were extracted from Tacotron. There were no rounding problems because the alignments were computed in the mel-spectrogram domain.
Using precomputed alignments greatly boosted the quality, and convergence improved too: my model reached the same quality as the previous one in under an hour.
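For context, the component that consumes these per-token durations is the length regulator. A minimal sketch of such an expansion (not this repo's exact code; names are mine):

```python
import torch

def length_regulate(encoder_out, durations):
    """Expand encoder states along time according to integer durations.

    encoder_out: (batch, tokens, channels)
    durations:   (batch, tokens) integer mel-frame counts per token
    Returns a zero-padded (batch, max_frames, channels) tensor.
    """
    expanded = [
        torch.repeat_interleave(seq, dur, dim=0)
        for seq, dur in zip(encoder_out, durations)
    ]
    return torch.nn.utils.rnn.pad_sequence(expanded, batch_first=True)

enc = torch.randn(2, 3, 8)                     # batch=2, tokens=3, channels=8
dur = torch.tensor([[2, 1, 3], [1, 1, 1]])
out = length_regulate(enc, dur)                # (2, 6, 8); second item zero-padded
```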
Examples of the synthesized audio are provided in the Results section.
I was surprised to learn that torchaudio's pipelines do not perform abbreviation expansion. The dataset contains a lot of abbreviations, so the lack of expansion can noticeably affect the loss.
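So abbreviations have to be expanded before tokenization. A minimal sketch of such a cleaner, roughly in the spirit of the LJSpeech text cleaners (the table is a tiny illustrative subset and the names are hypothetical):

```python
import re

# Illustrative subset of an abbreviation table, not the full list.
_ABBREVIATIONS = {
    "mr.": "mister",
    "mrs.": "misess",
    "dr.": "doctor",
    "st.": "saint",
    "co.": "company",
}

def expand_abbreviations(text: str) -> str:
    # Replace known abbreviations before tokenization so the
    # tokenizer never sees the unexpanded forms.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in _ABBREVIATIONS) + r")",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: _ABBREVIATIONS[m.group(0).lower()], text)

print(expand_abbreviations("Dr. Smith met Mr. Jones"))
# doctor Smith met mister Jones
```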
Another challenge was the non-ASCII symbols in the dataset. The dataset's webpage mentions non-ASCII symbols, but it does not say which ones. It turns out there is a variety of diacritics and umlauts, because the dataset contains quotes from languages other than English. I replace such symbols with their ASCII equivalents.
There are also some other symbols that the tokenizer does not support; I simply delete those characters.
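A sketch of both cleanup steps, assuming unicodedata-based normalization (the `allowed` alphabet stands in for whatever the tokenizer actually supports):

```python
import unicodedata

def to_ascii(text: str, allowed: str) -> str:
    # Decompose accented characters (e.g. é -> e + combining accent),
    # drop the combining marks via the ASCII round-trip, then drop
    # anything the tokenizer does not support.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_text if ch in allowed)

print(to_ascii("Müller's café", allowed="abcdefghijklmnopqrstuvwxyzMC' "))
# Muller's cafe
```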
I took the model hyperparameters from the FastSpeech paper and did not change them at all. I mainly experimented with the learning rate and various schedulers.
The model achieved good results both with OneCycleLR and with the warmup scheduler from the original Transformer paper. With the warmup scheduler the model trained faster, so I picked it for the final model.
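For reference, the Transformer warmup schedule can be expressed as a LambdaLR; a sketch, where d_model=384 and warmup_steps=4000 are illustrative values (384 is the FastSpeech hidden size) rather than this repo's config:

```python
import torch

def noam_lambda(d_model: int, warmup_steps: int):
    # Schedule from "Attention Is All You Need": linear warmup,
    # then inverse-square-root decay.
    def schedule(step: int) -> float:
        step = max(step, 1)  # LambdaLR starts at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return schedule

model = torch.nn.Linear(4, 4)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_lambda(384, 4000))
# Call scheduler.step() once per training step, not per epoch.
```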
The spectrograms look significantly better than in the previous example (upper: original, lower: predicted).