This repository has been archived by the owner on Sep 11, 2022. It is now read-only.
Merge pull request #199 from yt605155624/add_pwg_vctk
add vctk fastspeech2 and pwg
Showing 18 changed files with 1,069 additions and 11 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
# Parallel WaveGAN with VCTK | ||
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443). | ||
## Preprocess the dataset | ||
### Download and Extract the dataset | ||
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. The dataset is then located in `~/datasets/VCTK-Corpus-0.92`. | ||
|
||
### Get MFA results for silence trim | ||
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to trim silence at the edges of the audio. | ||
You can download the alignments from [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) in our repo. | ||
Note: we removed three speakers from VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa/local/reorganize_vctk.py)): | ||
1. `p315`, because it has no txt files. | ||
2. `p280` and `p362`, because they have no `*_mic2.flac` files (which are better than `*_mic1.flac`). | ||
|
||
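The silence trimming described above can be sketched roughly as follows. This is only an illustration, not the code used by `preprocess.py`: the interval format `(label, start_sec, end_sec)`, the set of silence labels, and the function name `trim_edges` are all assumptions made for the example.

```python
# Hypothetical interval format: (label, start_sec, end_sec), as one might
# read from the phone tier of an MFA TextGrid. The silence labels below are
# assumptions, not taken from the repository.
SILENCE_LABELS = {"", "sil", "sp"}


def trim_edges(intervals, sample_rate=24000):
    """Return (start_sample, end_sample) with leading/trailing silence removed."""
    start, end = 0, len(intervals)
    # Skip silent intervals at the beginning.
    while start < end and intervals[start][0] in SILENCE_LABELS:
        start += 1
    # Skip silent intervals at the end.
    while end > start and intervals[end - 1][0] in SILENCE_LABELS:
        end -= 1
    if start == end:  # the whole utterance is silence
        return 0, 0
    first_sample = int(round(intervals[start][1] * sample_rate))
    last_sample = int(round(intervals[end - 1][2] * sample_rate))
    return first_sample, last_sample


intervals = [
    ("sil", 0.00, 0.25),
    ("HH", 0.25, 0.31),
    ("AY", 0.31, 0.50),
    ("sp", 0.50, 0.80),
]
start, end = trim_edges(intervals)
print(start, end)
```

Only silence at the two edges is dropped; pauses inside the utterance are left untouched.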
### Preprocess the dataset | ||
Assume the path to the dataset is `~/datasets/VCTK-Corpus-0.92`. | ||
Assume the path to the MFA result of VCTK is `./vctk_alignment`. | ||
Run the command below to preprocess the dataset. | ||
```bash | ||
./preprocess.sh | ||
``` | ||
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below. | ||
|
||
```text | ||
dump | ||
├── dev | ||
│   ├── norm | ||
│   └── raw | ||
├── test | ||
│   ├── norm | ||
│   └── raw | ||
└── train | ||
    ├── norm | ||
    ├── raw | ||
    └── feats_stats.npy | ||
``` | ||
|
||
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log-magnitude mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrograms. The statistics used for normalization are computed from the training set and stored in `dump/train/feats_stats.npy`. | ||
|
||
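To make the normalization step concrete, here is a minimal pure-Python sketch, assuming the statistics are a per-dimension mean and standard deviation. The function names and the use of the population standard deviation are assumptions for illustration; the real scripts operate on arrays, and the exact layout of `feats_stats.npy` is not shown here.

```python
import math


def compute_stats(frames):
    """Per-dimension mean and (population) standard deviation of the frames."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return mean, std


def normalize(frames, mean, std):
    """Apply (x - mean) / std to every dimension of every frame."""
    return [[(x - m) / s for x, m, s in zip(f, mean, std)] for f in frames]


# Toy 2-dimensional "spectrogram" frames standing in for real features.
train_frames = [[0.0, 2.0], [2.0, 6.0]]
dev_frames = [[1.0, 4.0]]

# Statistics come from the training set only and are reused for dev and test.
mean, std = compute_stats(train_frames)
train_norm = normalize(train_frames, mean, std)
dev_norm = normalize(dev_frames, mean, std)
print(mean, std, dev_norm)
```

Reusing the training statistics for `dev` and `test` keeps all three splits on the same scale.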
Each subfolder also contains a `metadata.jsonl` file. It is a table-like file that lists the id and the path to the spectrogram of each utterance. | ||
|
||
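A JSON Lines file such as `metadata.jsonl` holds one JSON object per line, so it can be read with the standard library alone. The sketch below is an illustration under assumptions: the field name `utt_id` and the example paths are placeholders, and `feats` is borrowed from the `--field-name="feats"` option in `preprocess.sh`; the actual records may carry different or additional fields.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo with a temporary file; the field names and paths are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "metadata.jsonl"
    demo.write_text(
        '{"utt_id": "p225_001", "feats": "p225_001_feats.npy"}\n'
        '{"utt_id": "p225_002", "feats": "p225_002_feats.npy"}\n',
        encoding="utf-8",
    )
    records = read_jsonl(demo)

print(len(records), records[0]["utt_id"])
```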
## Train the model | ||
|
||
`./run.sh` calls `../train.py`. | ||
```bash | ||
./run.sh | ||
``` | ||
Here's the complete help message. | ||
|
||
```text | ||
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA] | ||
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR] | ||
                [--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE] | ||
                [--batch-size BATCH_SIZE] [--max-iter MAX_ITER] | ||
                [--run-benchmark RUN_BENCHMARK] | ||
                [--profiler_options PROFILER_OPTIONS] | ||
Train a ParallelWaveGAN model. | ||
optional arguments: | ||
  -h, --help            show this help message and exit | ||
  --config CONFIG       config file to overwrite default config. | ||
  --train-metadata TRAIN_METADATA | ||
                        training data. | ||
  --dev-metadata DEV_METADATA | ||
                        dev data. | ||
  --output-dir OUTPUT_DIR | ||
                        output dir. | ||
  --device DEVICE       device type to use. | ||
  --nprocs NPROCS       number of processes. | ||
  --verbose VERBOSE     verbose. | ||
benchmark: | ||
  arguments related to benchmark. | ||
  --batch-size BATCH_SIZE | ||
                        batch size. | ||
  --max-iter MAX_ITER   train max steps. | ||
  --run-benchmark RUN_BENCHMARK | ||
                        runing benchmark or not, if True, use the --batch-size | ||
                        and --max-iter. | ||
  --profiler_options PROFILER_OPTIONS | ||
                        The option of profiler, which should be in format | ||
                        "key1=value1;key2=value2;key3=value3". | ||
``` | ||
|
||
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. | ||
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. | ||
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. | ||
4. `--device` is the type of device to run the experiment on; 'cpu' and 'gpu' are supported. | ||
5. `--nprocs` is the number of processes to run in parallel. Note that nprocs > 1 is only supported when `--device` is 'gpu'. | ||
|
||
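The constraint between `--device` and `--nprocs` described in the list above can be expressed as a small check. This is a hypothetical helper written for illustration; `validate_run_args` is not a function from `train.py`, and the script may handle invalid combinations differently.

```python
def validate_run_args(device: str, nprocs: int) -> None:
    """Raise ValueError for combinations described above as unsupported."""
    if device not in ("cpu", "gpu"):
        raise ValueError(f"unsupported device: {device!r}")
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    if nprocs > 1 and device != "gpu":
        raise ValueError("nprocs > 1 is only supported when device is 'gpu'")


validate_run_args("gpu", 4)  # accepted: multi-process training on GPU
validate_run_args("cpu", 1)  # accepted: single process on CPU
try:
    validate_run_args("cpu", 2)  # rejected: multi-process on CPU
except ValueError as err:
    print(err)
```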
## Pretrained Models | ||
|
||
|
||
## Synthesize | ||
|
||
`synthesize.sh` calls `../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`. | ||
```bash | ||
./synthesize.sh | ||
``` | ||
```text | ||
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] | ||
                     [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] | ||
                     [--device DEVICE] [--verbose VERBOSE] | ||
Synthesize with parallel wavegan. | ||
optional arguments: | ||
  -h, --help            show this help message and exit | ||
  --config CONFIG       parallel wavegan config file. | ||
  --checkpoint CHECKPOINT | ||
                        snapshot to load. | ||
  --test-metadata TEST_METADATA | ||
                        dev data. | ||
  --output-dir OUTPUT_DIR | ||
                        output dir. | ||
  --device DEVICE       device to run. | ||
  --verbose VERBOSE     verbose. | ||
``` | ||
|
||
1. `--config` is the parallel wavegan config file. Use the same config with which the model was trained. | ||
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory. If you use the pretrained model, use `pwg_snapshot_iter_400000.pdz`. | ||
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `test/norm` subfolder from the processed directory. | ||
4. `--output-dir` is the directory to save the synthesized audio files. | ||
5. `--device` is the type of device to run synthesis on; 'cpu' and 'gpu' are supported. | ||
|
||
## Acknowledgement | ||
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN. |
115 changes: 115 additions & 0 deletions
examples/GANVocoder/parallelwave_gan/vctk/conf/default.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# This is the hyperparameter configuration file for Parallel WaveGAN. | ||
# Please make sure this is adjusted for the VCTK corpus. If you want to | ||
# apply to the other dataset, you might need to carefully change some parameters. | ||
# This configuration requires 12 GB GPU memory and takes ~3 days on RTX TITAN. | ||
|
||
########################################################### | ||
# FEATURE EXTRACTION SETTING # | ||
########################################################### | ||
fs: 24000 # Sampling rate. | ||
n_fft: 2048 # FFT size. (in samples) | ||
n_shift: 300 # Hop size. (in samples) | ||
win_length: 1200 # Window length. (in samples) | ||
                   # If set to null, it will be the same as n_fft. | ||
window: "hann" # Window function. | ||
n_mels: 80 # Number of mel basis. | ||
fmin: 80 # Minimum freq in mel basis calculation. (Hz) | ||
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz) | ||
|
||
########################################################### | ||
# GENERATOR NETWORK ARCHITECTURE SETTING # | ||
########################################################### | ||
generator_params: | ||
    in_channels: 1                 # Number of input channels. | ||
    out_channels: 1                # Number of output channels. | ||
    kernel_size: 3                 # Kernel size of dilated convolution. | ||
    layers: 30                     # Number of residual block layers. | ||
    stacks: 3                      # Number of stacks i.e., dilation cycles. | ||
    residual_channels: 64          # Number of channels in residual conv. | ||
    gate_channels: 128             # Number of channels in gated conv. | ||
    skip_channels: 64              # Number of channels in skip conv. | ||
    aux_channels: 80               # Number of channels for auxiliary feature conv. | ||
                                   # Must be the same as n_mels. | ||
    aux_context_window: 2          # Context window size for auxiliary feature. | ||
                                   # If set to 2, previous 2 and future 2 frames will be considered. | ||
    dropout: 0.0                   # Dropout rate. 0.0 means no dropout applied. | ||
    use_weight_norm: true          # Whether to use weight norm. | ||
                                   # If set to true, it will be applied to all of the conv layers. | ||
    upsample_scales: [4, 5, 3, 5]  # Upsampling scales. Product of these must be the same as hop size. | ||
|
||
########################################################### | ||
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING # | ||
########################################################### | ||
discriminator_params: | ||
    in_channels: 1                     # Number of input channels. | ||
    out_channels: 1                    # Number of output channels. | ||
    kernel_size: 3                     # Kernel size of conv layers. | ||
    layers: 10                         # Number of conv layers. | ||
    conv_channels: 64                  # Number of channels in conv layers. | ||
    bias: true                         # Whether to use bias parameter in conv. | ||
    use_weight_norm: true              # Whether to use weight norm. | ||
                                       # If set to true, it will be applied to all of the conv layers. | ||
    nonlinear_activation: "LeakyReLU"  # Nonlinear function after each conv. | ||
    nonlinear_activation_params:       # Nonlinear function parameters | ||
        negative_slope: 0.2            # Alpha in LeakyReLU. | ||
|
||
########################################################### | ||
# STFT LOSS SETTING # | ||
########################################################### | ||
stft_loss_params: | ||
    fft_sizes: [1024, 2048, 512]   # List of FFT size for STFT-based loss. | ||
    hop_sizes: [120, 240, 50]      # List of hop size for STFT-based loss | ||
    win_lengths: [600, 1200, 240]  # List of window length for STFT-based loss. | ||
    window: "hann"                 # Window function for STFT-based loss | ||
|
||
########################################################### | ||
# ADVERSARIAL LOSS SETTING # | ||
########################################################### | ||
lambda_adv: 4.0 # Loss balancing coefficient. | ||
|
||
########################################################### | ||
# DATA LOADER SETTING # | ||
########################################################### | ||
batch_size: 8 # Batch size. | ||
batch_max_steps: 24000 # Length of each audio in batch. Make sure it is divisible by hop size. | ||
pin_memory: true # Whether to pin memory in the DataLoader. | ||
num_workers: 4 # Number of workers in the DataLoader. | ||
remove_short_samples: true # Whether to remove samples shorter than batch_max_steps. | ||
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory. | ||
|
||
########################################################### | ||
# OPTIMIZER & SCHEDULER SETTING # | ||
########################################################### | ||
generator_optimizer_params: | ||
    epsilon: 1.0e-6            # Generator's epsilon. | ||
    weight_decay: 0.0          # Generator's weight decay coefficient. | ||
generator_scheduler_params: | ||
    learning_rate: 0.0001      # Generator's learning rate. | ||
    step_size: 200000          # Generator's scheduler step size. | ||
    gamma: 0.5                 # Generator's scheduler gamma. | ||
                               # At each step size, lr will be multiplied by this parameter. | ||
generator_grad_norm: 10        # Generator's gradient norm. | ||
discriminator_optimizer_params: | ||
    epsilon: 1.0e-6            # Discriminator's epsilon. | ||
    weight_decay: 0.0          # Discriminator's weight decay coefficient. | ||
discriminator_scheduler_params: | ||
    learning_rate: 0.00005     # Discriminator's learning rate. | ||
    step_size: 200000          # Discriminator's scheduler step size. | ||
    gamma: 0.5                 # Discriminator's scheduler gamma. | ||
                               # At each step size, lr will be multiplied by this parameter. | ||
discriminator_grad_norm: 1     # Discriminator's gradient norm. | ||
|
||
########################################################### | ||
# INTERVAL SETTING # | ||
########################################################### | ||
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator. | ||
train_max_steps: 1000000 # Number of training steps. | ||
save_interval_steps: 5000 # Interval steps to save checkpoint. | ||
eval_interval_steps: 1000 # Interval steps to evaluate the network. | ||
|
||
########################################################### | ||
# OTHER SETTING # | ||
########################################################### | ||
num_save_intermediate_results: 4 # Number of results to be saved as intermediate results. | ||
num_snapshots: 10 # max number of snapshots to keep while training | ||
seed: 42 # random seed for paddle, random, and np.random |
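
Several comments in this configuration state constraints between parameters (upsampling product equals hop size, `aux_channels` equals the number of mel bins, `batch_max_steps` divisible by hop size) and describe a step-wise learning rate decay. The following sketch checks those constraints and evaluates the schedule. It is an illustration only: the dictionary is a hand-copied subset of the values above, and `check_config` and `step_decay` are hypothetical helpers, not part of the training code.

```python
import math

# Subset of the values from the configuration above, copied by hand.
config = {
    "n_shift": 300,
    "n_mels": 80,
    "batch_max_steps": 24000,
    "generator_params": {"aux_channels": 80, "upsample_scales": [4, 5, 3, 5]},
    "generator_scheduler_params": {
        "learning_rate": 0.0001,
        "step_size": 200000,
        "gamma": 0.5,
    },
}


def check_config(cfg):
    """Return a list of violated constraints (empty if the config is consistent)."""
    gen = cfg["generator_params"]
    problems = []
    if math.prod(gen["upsample_scales"]) != cfg["n_shift"]:
        problems.append("product of upsample_scales must equal n_shift")
    if gen["aux_channels"] != cfg["n_mels"]:
        problems.append("aux_channels must equal n_mels")
    if cfg["batch_max_steps"] % cfg["n_shift"] != 0:
        problems.append("batch_max_steps must be divisible by n_shift")
    return problems


def step_decay(step, learning_rate, step_size, gamma):
    """Learning rate after `step` iterations: multiplied by gamma every step_size."""
    return learning_rate * gamma ** (step // step_size)


print(check_config(config))
sched = config["generator_scheduler_params"]
for step in (0, 199999, 200000, 400000):
    print(step, step_decay(step, **sched))
```

With the values above, 4 × 5 × 3 × 5 = 300 matches `n_shift`, and each 24000-sample training clip covers exactly 80 frames.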
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
#!/bin/bash | ||
|
||
stage=0 | ||
stop_stage=100 | ||
|
||
export MAIN_ROOT=`realpath ${PWD}/../../../../` | ||
|
||
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then | ||
    # get durations from MFA's result | ||
    echo "Generate durations.txt from MFA results ..." | ||
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ | ||
        --inputdir=./vctk_alignment \ | ||
        --output=durations.txt \ | ||
        --config=conf/default.yaml | ||
fi | ||
|
||
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then | ||
    echo "Extract features ..." | ||
    python3 ../../preprocess.py \ | ||
        --rootdir=~/datasets/VCTK-Corpus-0.92/ \ | ||
        --dataset=vctk \ | ||
        --dumpdir=dump \ | ||
        --dur-file=durations.txt \ | ||
        --config=conf/default.yaml \ | ||
        --cut-sil=True \ | ||
        --num-cpu=20 | ||
fi | ||
|
||
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then | ||
    # get features' stats(mean and std) | ||
    echo "Get features' stats ..." | ||
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \ | ||
        --metadata=dump/train/raw/metadata.jsonl \ | ||
        --field-name="feats" | ||
fi | ||
|
||
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then | ||
    # normalize, dev and test should use train's stats | ||
    echo "Normalize ..." | ||
|
||
    python3 ../../normalize.py \ | ||
        --metadata=dump/train/raw/metadata.jsonl \ | ||
        --dumpdir=dump/train/norm \ | ||
        --stats=dump/train/feats_stats.npy | ||
    python3 ../../normalize.py \ | ||
        --metadata=dump/dev/raw/metadata.jsonl \ | ||
        --dumpdir=dump/dev/norm \ | ||
        --stats=dump/train/feats_stats.npy | ||
|
||
    python3 ../../normalize.py \ | ||
        --metadata=dump/test/raw/metadata.jsonl \ | ||
        --dumpdir=dump/test/norm \ | ||
        --stats=dump/train/feats_stats.npy | ||
fi |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
#!/bin/bash | ||
|
||
FLAGS_cudnn_exhaustive_search=true \ | ||
FLAGS_conv_workspace_size_limit=4000 \ | ||
python ../train.py \ | ||
    --train-metadata=dump/train/norm/metadata.jsonl \ | ||
    --dev-metadata=dump/dev/norm/metadata.jsonl \ | ||
    --config=conf/default.yaml \ | ||
    --output-dir=exp/default \ | ||
    --nprocs=1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
#!/bin/bash | ||
|
||
python3 ../synthesize.py \ | ||
    --config=conf/default.yaml \ | ||
    --checkpoint=exp/default/checkpoints/snapshot_iter_35000.pdz_bak \ | ||
    --test-metadata=dump/test/norm/metadata.jsonl \ | ||
    --output-dir=exp/default/test |