A speech processing toolkit focused on easy configuration of complex speech data preparation pipelines and rapid prototyping of text-to-speech (TTS) models.
The goal of this project is to provide a comprehensive toolkit to solve the TTS problem, including a multilingual frontend for text preparation, forced alignment models, and a framework for assembling various TTS systems from unified blocks.
- Clone the repository
git clone https://github.com/just-ai/speechflow
cd speechflow && git submodule update --init --recursive -f
On Ubuntu:
- Install Singularity (or run env/singularity.sh)
- Run install.sh
- Run the Singularity container
singularity shell --nv --writable --no-home -B /run/user/:/run/user/,.:/src --pwd /src torch_*.sif
- Activate the conda environment
source /ext3/miniconda3/etc/profile.d/conda.sh && conda activate py38
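As a quick sanity check inside the container (assuming the py38 environment ships with PyTorch, as the torch_*.sif image name suggests), you can verify that CUDA is visible:
# optional: confirm that PyTorch is importable and sees the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"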
On Windows:
- Install Python 3.8
- Install additional dependencies: .NET 5.0, C++ Build Tools or Visual Studio, eSpeak, FFmpeg
- Install additional packages
pip install -r requirements.txt
- Install submodules
libs/install.sh
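After installation, you can check that the external tools are reachable from the command line (the eSpeak binary may be named espeak or espeak-ng depending on the package you installed):
ffmpeg -version
espeak-ng --version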
For other systems, see env/Singularityfile.
To work with TTS models, you need to convert your dataset to a special TextGrid markup format. To automate this procedure, a data annotation module is provided. It is designed to transform a list of audio files or a whole audiobook into a dataset containing single utterances aligned with the corresponding audio chunks.
1) Data preparation
Structure your dataset in the following format:
dataset_root:
    - languages.yml
    - language_code_1
        - speakers.yml
        - dataset_1
            - speaker_1
                - file_1.wav
                - file_1.txt
                ...
                - file_n.wav
                - file_n.txt
            ...
            - speaker_n
                - file_1.wav
                - file_1.txt
                ...
                - file_n.wav
                - file_n.txt
        ...
        - dataset_n
    - language_code_n
        - speakers.yml
        - dataset_1
        ...
Supported languages: RU, EN, IT, ES, FR-FR, DE, PT, PT-BR, KK.
Support for other languages can be explored here (https://github.com/espeak-ng/espeak-ng/tree/master).
We recommend using normalized transcriptions that contain no numerals or abbreviations (for example, writing "10" as "ten"). For supported languages, however, text normalization is applied automatically.
Transcription files are not required. If you only have audio files, transcriptions will be generated automatically by the whisper-large-v2 ASR model.
It is recommended to split large audio files into 20-30 minute parts.
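For example, FFmpeg's segment muxer can cut a long recording into 25-minute chunks without re-encoding (the file names below are illustrative):
# split long_recording.wav into sequential 25-minute (1500 s) parts
ffmpeg -i long_recording.wav -f segment -segment_time 1500 -c copy long_recording_%03d.wav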
Annotation of both single-speaker and multi-speaker datasets is supported. In this example, you can study the structure of the source data directories in more detail, as well as the format of the languages.yml and speakers.yml configuration files.
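Before running the annotation, a quick shell check like the one below (an illustrative helper, not part of the toolkit) can list audio files that lack a matching transcription; missing .txt files are acceptable, since transcriptions can be generated automatically:
# list audio files without a matching transcription (transcriptions are optional)
find dataset_root -name '*.wav' | while read -r wav; do
    [ -f "${wav%.wav}.txt" ] || echo "no transcription for: $wav"
done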
2) Run annotation processing
The annotation process includes segmentation of the audio file into single utterances, text normalization, generation of a phonetic transcription, forced alignment of the phonetic transcription with the audio chunk, silence detection, audio sample rate conversion, and volume equalization.
We provide pre-trained multilingual forced alignment models at the phoneme level. These models were trained on 1500 hours of audio (more than 8K speakers in 9 languages), including LibriTTS, Hi-Fi TTS, VCTK, LJSpeech, and other datasets.
Run this script to get segmentations:
# single GPU (the minimum requirement is 128GB RAM and 24GB VRAM)
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=1 -nproc=16 -bs=16 --pretrained_models mfa_stage1_epoch=19-step=208340.pt mfa_stage2_epoch=29-step=312510.pt
# multi GPU (the minimum requirement is 256GB RAM and 24GB VRAM per GPU)
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=4 -nproc=32 -nw=8 --pretrained_models mfa_stage1_epoch=19-step=208340.pt mfa_stage2_epoch=29-step=312510.pt
To improve the alignment of your data, use the --finetune_model flag:
python -m annotator.runner -d source_data_root -o segmentation_dataset_name -l=MULTILANG -ngpu=1 -nproc=16 -bs=16 --finetune_model mfa_stage1_epoch=19-step=208340.pt
To process single audio files, use this interface.
The resulting segmentations can be opened in Praat. See more examples here.
The alignment model is based on the Glow-TTS code. Our implementation can be studied here.
coming soon ...