
SpeechFlow

A speech processing toolkit focused on easy configuration of complex speech data preparation pipelines and rapid prototyping of speech synthesis models (TTS).

The goal of this project is to provide a comprehensive toolkit to solve the TTS problem, including a multilingual frontend for text preparation, forced alignment models, and a framework for assembling various TTS systems from unified blocks.

Installation

  1. Clone the repository
git clone https://github.com/just-ai/speechflow
cd speechflow && git submodule update --init --recursive -f

On Ubuntu:

  1. Install Singularity (or run env/singularity.sh)
  2. Run install.sh
  3. Run the Singularity container: singularity shell --nv --writable --no-home -B /run/user/:/run/user/,.:/src --pwd /src torch_*.sif
  4. Activate the conda environment: source /ext3/miniconda3/etc/profile.d/conda.sh && conda activate py38

On Windows:

  1. Install Python 3.8
  2. Install additional dependencies: .NET 5.0, C++ Build Tools or Visual Studio, eSpeak, FFmpeg
  3. Install Python packages: pip install -r requirements.txt
  4. Install submodules: libs/install.sh

For other systems, see env/Singularityfile.
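
As an optional sanity check (not part of the toolkit itself), you can verify inside the prepared environment that the expected Python version is active and that PyTorch can see the GPU:

```python
# Quick environment check (illustrative only, not part of SpeechFlow).
import sys

import torch

print("Python:", sys.version.split()[0])           # expected: 3.8.x
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```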

Data Annotation

To work with TTS models, you need to convert your dataset to a special TextGrid markup format. To automate this procedure, a data annotation module is provided. It is designed to transform a list of audio files or a whole audiobook into a dataset containing single utterances aligned with the corresponding audio chunks.

Steps to obtain segmentations:

1) Data preparation

Structure your dataset in the following format:

    dataset_root:
    - languages.yml
    - language_code_1
      - speakers.yml
      - dataset_1
        - speaker_1
          - file_1.wav
          - file_1.txt
          ...
          - file_n.wav
          - file_n.txt
        ...
        - speaker_n
          - file_1.wav
          - file_1.txt
          ...
          - file_n.wav
          - file_n.txt
      ...
      - dataset_n
    - language_code_n
      - speakers.yml
      - dataset_1
      ...

Supported languages: RU, EN, IT, ES, FR-FR, DE, PT, PT-BR, KK.

Support for other languages can be explored via eSpeak NG (https://github.com/espeak-ng/espeak-ng/tree/master).

We recommend providing normalized transcriptions that contain no numerals or abbreviations. However, for supported languages, this package is applied automatically for text normalization.
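
To illustrate what a normalized transcription means here, numerals (and ideally abbreviations) should be written out as words. The sketch below handles the numeral case with the third-party num2words package, which is not part of SpeechFlow:

```python
# Illustration of numeral expansion for a normalized transcript
# (uses the third-party num2words package, not part of SpeechFlow).
import re

from num2words import num2words

def expand_numbers(text: str, lang: str = "en") -> str:
    # Replace every run of digits with its spelled-out form.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

print(expand_numbers("Chapter 3 begins on page 42."))
# -> "Chapter three begins on page forty-two."
```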

Transcription files are not required. If you only have audio files, transcriptions are generated automatically with the whisper-large-v2 ASR model.
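
For reference only, an equivalent stand-alone transcription can be produced with the openai-whisper package; this is an illustration, not the toolkit's internal code path, and the file names are placeholders:

```python
# Stand-alone illustration of ASR transcription with openai-whisper
# (pip install openai-whisper); not the toolkit's internal pipeline.
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("speaker_1/file_1.wav")   # placeholder path

with open("speaker_1/file_1.txt", "w", encoding="utf-8") as f:
    f.write(result["text"].strip())
```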

It is recommended to split large audio files into 20-30 minute parts.
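
One possible way to do this splitting outside the toolkit is with pydub (requires FFmpeg); the 25-minute chunk length and file names below are only example values:

```python
# Illustrative splitting of a long recording into ~25-minute parts with pydub
# (pip install pydub, requires FFmpeg); not part of SpeechFlow.
from pydub import AudioSegment

CHUNK_MS = 25 * 60 * 1000  # 25 minutes, within the recommended 20-30 minute range

audio = AudioSegment.from_file("audiobook.wav")     # placeholder file name
for i in range(0, len(audio), CHUNK_MS):
    part = audio[i:i + CHUNK_MS]
    part.export(f"audiobook_part{i // CHUNK_MS:03d}.wav", format="wav")
```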

Annotation of both single-speaker and multi-speaker datasets is supported. This example shows the structure of the source data directories in more detail, as well as the format of the languages.yml and speakers.yml configuration files.
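
Purely as an illustration (not something the toolkit provides), a short script along these lines can verify that a dataset root matches the layout above before annotation is started; the required contents of languages.yml and speakers.yml are defined by the example configs in the repository:

```python
# Illustrative layout check for the dataset structure described above
# (not part of SpeechFlow).
from pathlib import Path

def check_dataset_root(root_dir: str) -> None:
    root = Path(root_dir)
    assert (root / "languages.yml").is_file(), "missing languages.yml in dataset root"
    for lang_dir in (p for p in root.iterdir() if p.is_dir()):
        assert (lang_dir / "speakers.yml").is_file(), f"missing speakers.yml in {lang_dir}"
        for dataset_dir in (p for p in lang_dir.iterdir() if p.is_dir()):
            for speaker_dir in (p for p in dataset_dir.iterdir() if p.is_dir()):
                wavs = sorted(speaker_dir.glob("*.wav"))
                assert wavs, f"no .wav files in {speaker_dir}"
                for wav in wavs:
                    if not wav.with_suffix(".txt").is_file():
                        # Transcripts are optional: missing ones are produced by ASR.
                        print(f"note: no transcript for {wav.name}, ASR will be used")

check_dataset_root("dataset_root")
```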

2) Run annotation processing

The annotation process includes segmentation of the audio file into single utterances, text normalization, generation of a phonetic transcription, forced alignment of the phonetic transcription with the corresponding audio chunk, silence detection, sample rate conversion, and volume equalization.
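
Two of these steps, silence detection and sample rate conversion, can be sketched stand-alone with librosa and soundfile. The runner performs the equivalent operations internally, so this is only an illustration, and the 22050 Hz target rate and file names are assumed values:

```python
# Illustrative silence detection and resampling with librosa/soundfile;
# annotator.runner performs equivalent steps internally.
import librosa
import soundfile as sf

TARGET_SR = 22050  # assumed target sample rate, for illustration only

y, sr = librosa.load("speaker_1/file_1.wav", sr=TARGET_SR)  # load + resample
intervals = librosa.effects.split(y, top_db=40)             # non-silent regions
for k, (start, end) in enumerate(intervals):
    sf.write(f"chunk_{k:03d}.wav", y[start:end], TARGET_SR)
```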

We provide pre-trained multilingual forced alignment models at the phoneme level. These models were trained on 1500 hours of audio (more than 8K speakers in 9 languages), including LibriTTS, Hi-Fi TTS, VCTK, LJSpeech, and other datasets.

Run this script to get segmentations:

# single GPU (the minimum requirement is 128GB RAM and 24GB VRAM)
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=1 -nproc=16 -bs=16 --pretrained_models mfa_stage1_epoch=19-step=208340.pt mfa_stage2_epoch=29-step=312510.pt

# multi GPU (the minimum requirement is 256GB RAM and 24GB VRAM per GPU)
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=4 -nproc=32 -nw=8 --pretrained_models mfa_stage1_epoch=19-step=208340.pt mfa_stage2_epoch=29-step=312510.pt

To improve the alignment of your data, use the flag --finetune_model:

python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=1 -nproc=16 -bs=16 --finetune_model mfa_stage1_epoch=19-step=208340.pt

To process single audio files, use this interface.

The resulting segmentations can be opened in Praat. See more examples here.

[Screenshot: segmentation example]
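
Besides Praat, the resulting TextGrid files can also be inspected programmatically. The sketch below uses the third-party textgrid package, assumes interval tiers, and uses a placeholder file name:

```python
# Illustrative inspection of a resulting TextGrid file with the third-party
# `textgrid` package (pip install textgrid); tier contents depend on the dataset.
import textgrid

tg = textgrid.TextGrid.fromFile("file_1.TextGrid")  # placeholder path
for tier in tg:
    print(f"tier '{tier.name}': {len(tier)} intervals")
    for interval in tier:
        if interval.mark:  # skip empty (silence) labels
            print(f"  {interval.minTime:.2f}-{interval.maxTime:.2f} s  {interval.mark}")
```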

The alignment model is based on the Glow-TTS code. Our implementation can be studied here.

Training acoustic models

coming soon ...

Training a vocoder

coming soon ...

Inference

coming soon ...