FragmentVC-Japanese

This repository provides a voice conversion model based on the FragmentVC architecture, adapted for Japanese rather than English, the language of the original repository. The model converts the voice of a source speaker into that of a target speaker while preserving the linguistic content. Voice conversion has applications in entertainment (dubbing, character voice synthesis), in assistive technology for people with speech impairments, and in virtual assistants or chatbots that need natural-sounding speech. The following are the overall model architecture and the conceptual illustration.

And the architecture of smoother blocks and extractor blocks.

Dataset

JVS (Japanese versatile speech) corpus: this corpus consists of Japanese text (transcripts) and multi-speaker voice data. Its specification is as follows.

- 100 professional speakers
- Each speaker utters:
    - "parallel100" ... 100 reading-style utterances that are common among speakers
    - "nonpara30" ... 30 reading-style utterances that are completely different among speakers
    - "whisper10" ... 10 whispered utterances
    - "falsetto10" ... 10 falsetto utterances
- High-quality (studio recording), high-sampling-rate (24 kHz), and large-sized (30 hours) audio files
- Useful tags included (e.g., gender, F0 range, speaker similarity, duration, and automatically generated phoneme alignment)

However, only parallel100 is used for the voice conversion task.
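
Selecting only the parallel100 utterances can be sketched as follows. This is a minimal example, not code from this repository: `collect_parallel100` is a hypothetical helper, and the layout `<root>/<speaker>/parallel100/wav24kHz16bit/*.wav` is an assumption about how your copy of JVS is unpacked, so adjust the path if it differs.

```python
from pathlib import Path
from typing import Dict, List


def collect_parallel100(root: str) -> Dict[str, List[Path]]:
    """Map each speaker directory name to its sorted parallel100 wav files.

    Assumed layout (not confirmed by this repository):
    <root>/<speaker>/parallel100/wav24kHz16bit/*.wav
    """
    utterances: Dict[str, List[Path]] = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        wav_dir = speaker_dir / "parallel100" / "wav24kHz16bit"
        if not wav_dir.is_dir():
            continue  # skip plain files and speakers without this subset
        wavs = sorted(wav_dir.glob("*.wav"))
        if wavs:
            utterances[speaker_dir.name] = wavs
    return utterances
```

The other subsets (nonpara30, whisper10, falsetto10) are simply never visited, so they do not need to be deleted from disk.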

Usage

You can download the pretrained model and the vocoder from the link Fragment, then unzip them into the FragmentVC-Japanese folder.

The whole project was developed with Python 3.8 and torch 1.10.1. The pretrained model and the vocoder were converted to TorchScript, so they are not guaranteed to be backward compatible. You can install the dependencies with

pip install -r requirements.txt

If you encounter any problems while installing fairseq, please refer to pytorch/fairseq for the installation instructions.

Wav2Vec

This implementation uses Wav2Vec 2.0 Base without fine-tuning, which was trained on LibriSpeech. You can download the checkpoint wav2vec_small.pt from pytorch/fairseq.

Vocoder

The WaveRNN-based neural vocoder comes from yistLin/universal-vocoder, which is based on the paper "Towards achieving robust universal neural vocoding".

Voice conversion with pretrained models

You can convert an utterance from a source speaker using one or more utterances from a target speaker, e.g.

# Positional arguments: the source utterance, then the target utterances
# (three in this example), then the output path.
python convert.py \
    -w <WAV2VEC_PATH> \
    -v <VOCODER_PATH> \
    -c <CHECKPOINT_PATH> \
    ./test/source/TRAVEL1000_0023.wav \
    ./test/target/female/FKN_SN_003.AD.wav \
    ./test/target/female/FKN_SN_004.AD.wav \
    ./test/target/female/FKN_SN_005.AD.wav \
    output.wav
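
If you drive the conversion from Python rather than a shell, the same invocation can be assembled as an argument list. The sketch below only mirrors the flags and argument order of the command above; `build_convert_command` is a hypothetical helper and not part of this repository.

```python
from typing import List, Sequence


def build_convert_command(
    wav2vec_path: str,
    vocoder_path: str,
    checkpoint_path: str,
    source: str,
    targets: Sequence[str],
    output: str,
) -> List[str]:
    """Build the argv list for convert.py: flags, source, targets, output."""
    if not targets:
        raise ValueError("at least one target utterance is required")
    return [
        "python", "convert.py",
        "-w", wav2vec_path,
        "-v", vocoder_path,
        "-c", checkpoint_path,
        source,
        *targets,  # any number of target utterances
        output,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing a list instead of a single string avoids shell quoting problems when a path contains spaces.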

Alternatively, you can describe the conversion pairs in a YAML file, such as

# pairs_info.yaml
pair1:
    source: ./test/source/TRAVEL1000_0023.wav
    target:
        - ./test/target/female/FKN_SN_004.AD.wav
pair2:
    source: ./test/source/TRAVEL1000_0023.wav
    target:
        - ./test/target/female/FKN_SN_003.AD.wav
        - ./test/target/female/FKN_SN_004.AD.wav
        - ./test/target/female/FKN_SN_005.AD.wav
        - ./test/target/female/FKN_SN_006.AD.wav
        - ./test/target/female/FKN_SN_007.AD.wav
        - ./test/target/female/FKN_SN_008.AD.wav
        - ./test/target/female/FKN_SN_009.AD.wav
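
For many pairs it may be easier to generate this file than to write it by hand. The following is a minimal sketch that renders the layout shown above using plain string formatting; `render_pairs_info` is a hypothetical helper, and it assumes simple paths that need no YAML quoting (no `#`, no `: `, no leading special characters).

```python
from typing import List, Mapping, Sequence, Tuple


def render_pairs_info(pairs: Mapping[str, Tuple[str, Sequence[str]]]) -> str:
    """Render {pair_name: (source, [targets...])} in the pairs_info.yaml layout."""
    lines: List[str] = ["# pairs_info.yaml"]
    for name, (source, targets) in pairs.items():
        if not targets:
            raise ValueError(f"{name}: at least one target is required")
        lines.append(f"{name}:")
        lines.append(f"    source: {source}")
        lines.append("    target:")
        lines.extend(f"        - {target}" for target in targets)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_pairs_info({
        "pair1": (
            "./test/source/TRAVEL1000_0023.wav",
            ["./test/target/female/FKN_SN_004.AD.wav"],
        ),
    })
    with open("pairs_info.yaml", "w", encoding="utf-8") as f:
        f.write(text)
```

If your paths may contain characters that are special in YAML, a YAML library is a safer choice than string formatting.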

Then convert multiple pairs at once, e.g.

python convert_batch.py \
    -w <WAV2VEC_PATH> \
    -v <VOCODER_PATH> \
    -c <CHECKPOINT_PATH> \
    pairs_info.yaml \
    outputs # the output directory of conversion results

After the conversion, the output directory, outputs, will contain

pair1.wav
pair1.mel.png
pair1.attn.png
pair2.wav
pair2.mel.png
pair2.attn.png

where *.wav are the converted utterances, *.mel.png are plots of their mel-spectrograms, and *.attn.png are the attention maps between Conv1d 1 and Extractor 3 (please refer to the model architecture above).
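
After a batch run, you may want to confirm that every pair produced all three files. The sketch below checks for the file names listed above; `missing_outputs` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# File suffixes written per pair, as listed in this README.
OUTPUT_SUFFIXES = (".wav", ".mel.png", ".attn.png")


def missing_outputs(output_dir: str, pair_names: Iterable[str]) -> Dict[str, List[str]]:
    """Return {pair_name: [missing file names]} for pairs with incomplete outputs."""
    missing: Dict[str, List[str]] = {}
    for name in pair_names:
        absent = [
            name + suffix
            for suffix in OUTPUT_SUFFIXES
            if not (Path(output_dir) / (name + suffix)).is_file()
        ]
        if absent:
            missing[name] = absent
    return missing
```

An empty result means every listed pair has its waveform, mel-spectrogram plot, and attention map.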

Train from scratch

Preprocessing

You can preprocess multiple corpora by passing multiple paths. Each path should be the directory that directly contains the speaker directories, e.g.

python preprocess.py \
    datasetVC/ \
    basic5000/ \
    <WAV2VEC_PATH> \
    features  # the output directory of preprocessed features

After preprocessing, the output directory will contain:

metadata.json
utterance-000x7gsj.tar
utterance-00wq7b0f.tar
utterance-01lpqlnr.tar
...

Training

python train.py features --save_dir ./ckpts

You can also pass --preload to load all training data into RAM and speed up training. If --comment <COMMENT> is specified, e.g. --comment jp, the training logs are written to a newly created directory such as logs/2020-02-02_12:34:56_jp; otherwise nothing is logged. For more details, see the usage printed by python train.py -h.
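
The log directory name in the example follows a timestamp-plus-comment pattern. How train.py builds it is not shown here, so the following is only a sketch of how such a name could be reproduced, for instance to locate the logs of a given run; `log_dir_name` is a hypothetical helper.

```python
from datetime import datetime
from typing import Optional


def log_dir_name(comment: str, now: Optional[datetime] = None) -> str:
    """Return a name like logs/2020-02-02_12:34:56_jp (assumed format)."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y-%m-%d_%H:%M:%S")
    return f"logs/{timestamp}_{comment}"
```

Note that colons are not allowed in file names on Windows, so a name in this format only works on file systems that accept them.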

Demo

To try the demo built with Gradio, run:

python app.py
