This repository has been archived by the owner on Sep 11, 2022. It is now read-only.

Commit

Permalink
Merge pull request #191 from yt605155624/refactor
refactor of examples
zh794390558 authored Oct 12, 2021
2 parents 49caf86 + a603924 commit 4f97ff2
Showing 103 changed files with 540 additions and 1,064 deletions.
10 changes: 7 additions & 3 deletions README.md
@@ -7,12 +7,14 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee

## News <img src="./docs/images/news_icon.png" width="40"/>

- Oct-12-2021, Parallel WaveGAN with LJSpeech. Check [examples/GANVocoder/parallelwave_gan/ljspeech](./examples/GANVocoder/parallelwave_gan/ljspeech).
- Oct-12-2021, FastSpeech2/FastPitch with LJSpeech. Check [examples/fastspeech2/ljspeech](./examples/fastspeech2/ljspeech).
- Sep-14-2021, Reconstruction of TransformerTTS. Check [examples/transformer_tts/ljspeech](./examples/transformer_tts/ljspeech).
- Aug-31-2021, Chinese Text Frontend. Check [examples/text_frontend](./examples/text_frontend).
- Aug-23-2021, FastSpeech2/FastPitch with AISHELL-3. Check [examples/fastspeech2/aishell3](./examples/fastspeech2/aishell3).
- Aug-03-2021, FastSpeech2/FastPitch with CSMSC. Check [examples/fastspeech2/baker](./examples/fastspeech2/baker).
- Jul-19-2021, SpeedySpeech with CSMSC. Check [examples/speedyspeech/baker](./examples/speedyspeech/baker).
- Jul-01-2021, Parallel WaveGAN with CSMSC. Check [examples/parallelwave_gan/baker](./examples/parallelwave_gan/baker).
- Jul-01-2021, Parallel WaveGAN with CSMSC. Check [examples/GANVocoder/parallelwave_gan/baker](./examples/GANVocoder/parallelwave_gan/baker).
- Jul-01-2021, Montreal-Forced-Aligner. Check [examples/use_mfa](./examples/use_mfa).
- May-07-2021, Voice Cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).

@@ -68,7 +70,7 @@ Entries to the introduction, and the launch of training and synthesis for differe
- [>>> Chinese Text Frontend](./examples/text_frontend)
- [>>> FastSpeech2/FastPitch](./examples/fastspeech2)
- [>>> Montreal-Forced-Aligner](./examples/use_mfa)
- [>>> Parallel WaveGAN](./examples/parallelwave_gan)
- [>>> Parallel WaveGAN](./examples/GANVocoder/parallelwave_gan)
- [>>> SpeedySpeech](./examples/speedyspeech)
- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3)
- [>>> GE2E](./examples/ge2e)
@@ -87,9 +89,10 @@ Check our [website](https://paddleparakeet.readthedocs.io/en/latest/demo.html) f
#### FastSpeech2/FastPitch
1. [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip)
2. [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
3. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)

#### SpeedySpeech
1. [speedyspeech_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_baker_ckpt_0.4.zip)
1. [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_nosil_baker_ckpt_0.5.zip)

#### TransformerTTS

@@ -109,6 +112,7 @@ Check our [website](https://paddleparakeet.readthedocs.io/en/latest/demo.html) f
#### Parallel WaveGAN

1. [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip)
2. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip)

### Voice Cloning

1 change: 1 addition & 0 deletions examples/GANVocoder/README.md
@@ -0,0 +1 @@
Different GAN vocoders share the same `preprocess.py` and `normalize.py`.
File renamed without changes.
@@ -37,19 +37,19 @@ Also there is a `metadata.jsonl` in each subfolder. It is a table-like file whic

## Train the model

`./run.sh` calls `Parakeet/utils/pwg_train.py`.
`./run.sh` calls `../train.py`.
```bash
./run.sh
```
Here's the complete help message.

```text
usage: pwg_train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
[--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
[--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
@@ -102,14 +102,14 @@ pwg_baker_ckpt_0.4

## Synthesize

`synthesize.sh` calls `Parakeet/utils/pwg_syn.py `, which can synthesize waveform from `metadata.jsonl`.
`synthesize.sh` calls `../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
./synthesize.sh
```
```text
usage: pwg_syn.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--verbose VERBOSE]
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--verbose VERBOSE]
Synthesize with parallel wavegan.
@@ -15,10 +15,7 @@ window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
trim_silence: false # Whether to trim the start and end of silence.
top_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_length: 2048 # Frame size in trimming. (in samples)
trim_hop_length: 512 # Hop size in trimming. (in samples)


###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
@@ -3,24 +3,20 @@
stage=0
stop_stage=100

fs=24000
n_shift=300

export MAIN_ROOT=`realpath ${PWD}/../../../`
export MAIN_ROOT=`realpath ${PWD}/../../../../`

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./baker_alignment_tone \
--output=durations.txt \
--sample-rate=${fs} \
--n-shift=${n_shift}
--config=conf/default.yaml
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Extract features ..."
python3 ${MAIN_ROOT}/utils/vocoder_preprocess.py \
python3 ../../preprocess.py \
--rootdir=~/datasets/BZNSYP/ \
--dataset=baker \
--dumpdir=dump \
@@ -42,16 +38,16 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."

python3 ${MAIN_ROOT}/utils/vocoder_normalize.py \
python3 ../../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${MAIN_ROOT}/utils/vocoder_normalize.py \
python3 ../../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy

python3 ${MAIN_ROOT}/utils/vocoder_normalize.py \
python3 ../../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
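
The normalize stage above passes the same `dump/train/feats_stats.npy` to the train, dev and test calls, so every split is scaled with statistics estimated on the training set only. The idea can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in `normalize.py`: it assumes per-dimension mean/standard-deviation standardization, and the helper names `fit_stats` and `apply_stats` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frames = List[List[float]]  # shape (num_frames, num_dims), e.g. log-mel frames


def fit_stats(train: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Estimate per-dimension mean and standard deviation on the training set."""
    n = len(train)
    dims = len(train[0])
    mean = [sum(frame[d] for frame in train) / n for d in range(dims)]
    std = [
        math.sqrt(sum((frame[d] - mean[d]) ** 2 for frame in train) / n)
        for d in range(dims)
    ]
    return mean, std


def apply_stats(frames: Sequence[Sequence[float]],
                mean: Sequence[float],
                std: Sequence[float],
                eps: float = 1e-8) -> Frames:
    """Standardize frames with precomputed statistics (eps avoids division by zero)."""
    return [[(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
            for frame in frames]


# Toy data standing in for extracted features (assumed values).
train = [[0.0, 10.0], [2.0, 14.0]]
dev = [[1.0, 12.0], [3.0, 16.0]]

mean, std = fit_stats(train)               # statistics come from train only
train_norm = apply_stats(train, mean, std)
dev_norm = apply_stats(dev, mean, std)     # dev reuses the train statistics
```

Reusing the training statistics keeps dev and test features on the same scale the model saw during training, and avoids leaking information from the held-out splits into preprocessing.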
@@ -1,10 +1,8 @@
#!/bin/bash

export MAIN_ROOT=`realpath ${PWD}/../../../`

FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${MAIN_ROOT}/utils/pwg_train.py \
python ../train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=conf/default.yaml \
@@ -1,8 +1,6 @@
#!/bin/bash

export MAIN_ROOT=`realpath ${PWD}/../../../`

python3 ${MAIN_ROOT}/utils/pwg_syn.py \
python3 ../synthesize.py \
--config=conf/default.yaml \
--checkpoint=exp/default/checkpoints/snapshot_iter_400000.pdz \
--test-metadata=dump/test/norm/metadata.jsonl \
@@ -76,7 +76,8 @@ def evaluate(args, config):
# extract mel feats
mel = mel_extractor.get_log_mel_fbank(wav)
mel = paddle.to_tensor(mel)
gen_wav = pwg_inference(mel)
with paddle.no_grad():
gen_wav = pwg_inference(mel)
sf.write(
str(output_dir / ("gen_" + utt_name)),
gen_wav.numpy(),
@@ -39,19 +39,19 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains the id and the path to the spectrogram of each utterance.
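
As a rough illustration of the table-like layout, a JSON Lines file holds one JSON object per line, so it can be read with the standard library alone. The sketch below is an assumption about the file's shape rather than a description of the real one: the `feats` key appears in the synthesis code of this commit, while `utt_id` and the example paths are hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_metadata(path: Path) -> List[Dict]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a tiny stand-in file; field names and paths are made up for the demo.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "metadata.jsonl"
    rows = [
        {"utt_id": "utt_0001", "feats": "dump/train/raw/utt_0001_feats.npy"},
        {"utt_id": "utt_0002", "feats": "dump/train/raw/utt_0002_feats.npy"},
    ]
    meta_path.write_text(
        "\n".join(json.dumps(row) for row in rows) + "\n", encoding="utf-8")
    records = load_metadata(meta_path)

# Each record behaves like one row of the table.
for record in records:
    print(record["utt_id"], record["feats"])
```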

## Train the model
`./run.sh` calls `Parakeet/utils/pwg_train.py`.
`./run.sh` calls `../train.py`.
```bash
./run.sh
```
Here's the complete help message.

```text
usage: pwg_train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
[--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
[--batch-size BATCH_SIZE] [--max-iter MAX_ITER]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model.
@@ -102,14 +102,14 @@ pwg_ljspeech_ckpt_0.5
```

## Synthesize
`synthesize.sh` calls `Parakeet/utils/pwg_syn.py `, which can synthesize waveform from `metadata.jsonl`.
`synthesize.sh` calls `../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
./synthesize.sh
```
```text
usage: pwg_syn.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--verbose VERBOSE]
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--verbose VERBOSE]
Synthesize with parallel wavegan.
@@ -15,10 +15,6 @@ window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
trim_silence: false # Whether to trim the start and end of silence.
top_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_length: 2048 # Frame size in trimming. (in samples)
trim_hop_length: 512 # Hop size in trimming. (in samples)

###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
@@ -3,25 +3,21 @@
stage=0
stop_stage=100

fs=22050
n_shift=256

export MAIN_ROOT=`realpath ${PWD}/../../../`
export MAIN_ROOT=`realpath ${PWD}/../../../../`

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./ljspeech_alignment \
--output=durations.txt \
--sample-rate=${fs} \
--n-shift=${n_shift}
--config=conf/default.yaml
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${MAIN_ROOT}/utils/vocoder_preprocess.py \
python3 ../../preprocess.py \
--rootdir=~/datasets/LJSpeech-1.1/ \
--dataset=ljspeech \
--dumpdir=dump \
@@ -43,16 +39,16 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."

python3 ${MAIN_ROOT}/utils/vocoder_normalize.py \
python3 ../../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${MAIN_ROOT}/utils/vocoder_normalize.py \
python3 ../../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy

python3 ${MAIN_ROOT}/utils/vocoder_normalize.py \
python3 ../../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
@@ -1,10 +1,8 @@
#!/bin/bash

export MAIN_ROOT=`realpath ${PWD}/../../../`

FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${MAIN_ROOT}/utils/pwg_train.py \
python ../train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=conf/default.yaml \
@@ -1,8 +1,6 @@
#!/bin/bash

export MAIN_ROOT=`realpath ${PWD}/../../../`

python3 ${MAIN_ROOT}/utils/pwg_syn.py \
python3 ../synthesize.py \
--config=conf/default.yaml \
--checkpoint=exp/default/checkpoints/snapshot_iter_400000.pdz \
--test-metadata=dump/test/norm/metadata.jsonl \
@@ -80,7 +80,8 @@ def main():
mel = example['feats']
mel = paddle.to_tensor(mel) # (T, C)
with timer() as t:
wav = generator.inference(c=mel)
with paddle.no_grad():
wav = generator.inference(c=mel)
wav = wav.numpy()
N += wav.size
T += t.elapse
@@ -32,11 +32,11 @@
from parakeet.datasets.vocoder_batch_fn import Clip
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGDiscriminator
from parakeet.models.parallel_wavegan import PWGUpdater
from parakeet.models.parallel_wavegan import PWGEvaluator
from parakeet.modules.stft_loss import MultiResolutionSTFTLoss
from parakeet.training.extensions.snapshot import Snapshot
from parakeet.training.extensions.visualizer import VisualDL
from parakeet.training.pwg_updater import PWGUpdater
from parakeet.training.pwg_updater import PWGEvaluator
from parakeet.training.seeding import seed_everything
from parakeet.training.trainer import Trainer
from pathlib import Path
@@ -132,14 +132,6 @@ def process_sentence(config: Dict[str, Any],
start, end = librosa.time_to_samples([start, end], sr=config.fs)
y = y[start:end]

# energy based silence trimming
if config.trim_silence:
y, _ = librosa.effects.trim(
y,
top_db=config.top_db,
frame_length=config.trim_frame_length,
hop_length=config.trim_hop_length)

# extract mel feats
logmel = mel_extractor.get_log_mel_fbank(y)

19 changes: 8 additions & 11 deletions examples/fastspeech2/aishell3/README.md
@@ -48,21 +48,18 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance.

## Train the model
`./run.sh` calls `Parakeet/utils/multi_spk_fs2_train.py`.
`./run.sh` calls `../train.py`.
```bash
./run.sh
```
Here's the complete help message.
```text
usage: multi_spk_fs2_train.py [-h] [--config CONFIG]
[--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA]
[--output-dir OUTPUT_DIR] [--device DEVICE]
[--nprocs NPROCS] [--verbose VERBOSE]
[--phones-dict PHONES_DICT]
[--speaker-dict SPEAKER_DICT]
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
[--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT]
Train a FastSpeech2 model with multiple speaker dataset.
Train a FastSpeech2 model.
optional arguments:
-h, --help show this help message and exit
@@ -79,7 +76,7 @@ optional arguments:
--phones-dict PHONES_DICT
phone vocabulary file.
--speaker-dict SPEAKER_DICT
speaker id map file.
speaker id map file for multiple speaker model.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
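
To make the flag descriptions above concrete, here is a minimal `argparse` sketch that accepts the same option names as the help message. It is not the parser from `train.py`: the types and defaults are assumptions, and only the two help strings shown in the help text are copied from it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags in the help message (types/defaults assumed)."""
    parser = argparse.ArgumentParser(
        prog="train.py", description="Train a FastSpeech2 model.")
    parser.add_argument("--config", type=str)
    parser.add_argument("--train-metadata", type=str)
    parser.add_argument("--dev-metadata", type=str)
    parser.add_argument("--output-dir", type=str)
    parser.add_argument("--device", type=str, default="gpu")  # assumed default
    parser.add_argument("--nprocs", type=int, default=1)      # assumed default
    parser.add_argument("--verbose", type=int, default=1)     # assumed default
    parser.add_argument(
        "--phones-dict", type=str, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict",
        type=str,
        default=None,
        help="speaker id map file for multiple speaker model.")
    return parser


# Metadata paths point at the normalized subfolders of the dump folder.
args = build_parser().parse_args([
    "--config=conf/default.yaml",
    "--train-metadata=dump/train/norm/metadata.jsonl",
    "--dev-metadata=dump/dev/norm/metadata.jsonl",
])
print(args.config, args.train_metadata, args.dev_metadata)
```

In this sketch, leaving `--speaker-dict` unset yields `None`, which a single-speaker run could use to skip speaker embeddings; whether the real script behaves this way is not shown in the diff.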
@@ -148,7 +145,7 @@ optional arguments:
--phones-dict PHONES_DICT
phone vocabulary file.
--speaker-dict SPEAKER_DICT
speaker id map file.
speaker id map file for multiple speaker model.
--test-metadata TEST_METADATA
test metadata.
--output-dir OUTPUT_DIR