A speech processing toolkit focused on easy configuration of complex speech data preparation pipelines and rapid prototyping of text-to-speech (TTS) models.
The goal of this project is to provide a comprehensive toolkit to solve the TTS problem, including a multilingual frontend for text preparation, forced alignment models, and a framework for assembling various TTS systems from unified blocks.
- Clone the repository
git clone https://github.com/just-ai/speechflow
cd speechflow && git submodule update --init --recursive -f
On Ubuntu:
- Install Singularity (or run env/singularity.sh)
- Run install.sh
- Run the Singularity container
singularity shell --nv --writable --no-home -B /run/user/:/run/user/,.:/src --pwd /src torch_*.sif
- Activate the conda environment
source /ext3/miniconda3/etc/profile.d/conda.sh && conda activate py38
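As a quick sanity check inside the container (assuming the py38 environment ships with PyTorch, as the torch_*.sif image name suggests), you can verify that CUDA is visible:
# optional: confirm that PyTorch is importable and sees the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"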
On Windows:
- Install Python 3.8
- Install additional dependencies: .NET 5.0, C++ Build Tools or Visual Studio, eSpeak, FFmpeg
- Install additional packages
pip install -r requirements.txt
- Install submodules
libs/install.sh
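After installation, you can check that the external tools are reachable from the command line (the eSpeak binary may be named espeak or espeak-ng depending on the package you installed):
ffmpeg -version
espeak-ng --version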
For other systems, see env/Singularityfile.
To work with TTS models, you need to convert your dataset to a special TextGrid markup format. To automate this procedure, a data annotation module is provided. It is designed to transform a list of audio files or a whole audiobook into a dataset containing single utterances aligned with the corresponding audio chunks.
1) Data preparation
Structure your dataset in the following format:
dataset_root:
    - languages.yml
    - language_code_1
        - speakers.yml
        - dataset_1
            - speaker_1
                - file_1.wav
                - file_1.txt
                ...
                - file_n.wav
                - file_n.txt
            ...
            - speaker_n
                - file_1.wav
                - file_1.txt
                ...
                - file_n.wav
                - file_n.txt
        ...
        - dataset_n
    - language_code_n
        - speakers.yml
        - dataset_1
        ...
Supported languages: RU, EN, IT, ES, FR-FR, DE, PT, PT-BR, KK.
Support for other languages can be explored here (https://github.com/espeak-ng/espeak-ng/tree/master).
We recommend using normalized transcriptions that contain no numerals or abbreviations (for example, writing "10" as "ten"). For supported languages, however, text normalization is applied automatically.
Transcription files are not required. If you only have audio files, transcriptions will be generated automatically by the whisper-large-v2 ASR model.
It is recommended to split large audio files into 20-30 minute parts.
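For example, FFmpeg's segment muxer can cut a long recording into 25-minute chunks without re-encoding (the file names below are illustrative):
# split long_recording.wav into sequential 25-minute (1500 s) parts
ffmpeg -i long_recording.wav -f segment -segment_time 1500 -c copy long_recording_%03d.wav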
Annotation of both single-speaker and multi-speaker datasets is supported. In this example, you can study the structure of the source data directories in more detail, as well as the format of the languages.yml and speakers.yml configuration files.
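Before running the annotation, a quick shell check like the one below (an illustrative helper, not part of the toolkit) can list audio files that lack a matching transcription; missing .txt files are acceptable, since transcriptions can be generated automatically:
# list audio files without a matching transcription (transcriptions are optional)
find dataset_root -name '*.wav' | while read -r wav; do
    [ -f "${wav%.wav}.txt" ] || echo "no transcription for: $wav"
done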
2) Run annotation processing
The annotation process includes segmentation of the audio file into single utterances, text normalization, generation of a phonetic transcription, forced alignment of the phonetic transcription with the audio chunk, silence detection, audio sample rate conversion, and volume equalization.
We provide pre-trained multilingual forced alignment models at the phoneme level. These models were trained on 1500 hours of audio (more than 8K speakers in 9 languages), including LibriTTS, Hi-Fi TTS, VCTK, LJSpeech, and other datasets.
Run this script to get segmentations:
# single GPU (the minimum requirement is 128GB RAM and 24GB VRAM)
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=1 -nproc=16 -bs=16 --pretrained_models mfa_stage1_epoch=19-step=208340.pt mfa_stage2_epoch=29-step=312510.pt
# multi GPU (the minimum requirement is 256GB RAM and 24GB VRAM per GPU)
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=4 -nproc=32 -nw=8 --pretrained_models mfa_stage1_epoch=19-step=208340.pt mfa_stage2_epoch=29-step=312510.pt
To improve the alignment of your data, use the --finetune_model flag:
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=1 -nproc=16 -bs=16 --finetune_model mfa_stage1_epoch=19-step=208340.pt
To process single audio files, use this interface.
The resulting segmentations can be opened in Praat. See more examples here.
The alignment model is based on the Glow-TTS code. Our implementation can be studied here.
coming soon ...