Releases: openvpi/DiffSinger
v2.4.0: Rectified Flow algorithm and new feature extractor based on harmonic-noise separation model
New generative model algorithm: Rectified Flow (#184)
Rectified Flow is a new ODE-based generative algorithm introduced in this paper and used in Stable Diffusion 3. Experimental results show that Rectified Flow outperforms the former DDPM in all modules of DiffSinger. To our knowledge, this is the first publicly known use of Rectified Flow in SVS systems.
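For reference, here is a minimal NumPy sketch of the general rectified-flow idea (an illustration only, not the implementation in this repository): training regresses a velocity field toward the difference between the data and noise samples along straight interpolation paths, and sampling integrates the learned ODE with a configurable number of Euler steps, which is what makes the continuous step-count acceleration mentioned below possible.

```python
# Minimal sketch of the rectified-flow idea (illustration only, not this repository's code).
import numpy as np

def training_pair(x0, x1, t):
    """x0: noise sample, x1: data sample, t in [0, 1)."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path between noise and data
    v_target = x1 - x0              # constant velocity along that path (regression target)
    return x_t, v_target

def sample(velocity_fn, x0, num_steps=20):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + velocity_fn(x, i * dt) * dt
    return x

# Toy check: with the exact velocity field of a known target x1, Euler integration
# lands on x1 regardless of the number of steps.
x1 = np.ones(8)
out = sample(lambda x, t: (x1 - x) / (1.0 - t), np.random.randn(8), num_steps=20)
assert np.allclose(out, x1)
```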
Rectified Flow is now the default algorithm for training new DiffSinger models. No action is required if you are using the template configuration file. Though not recommended, you can switch back to DDPM with the following line in your configuration:
diffusion_type: 'ddpm' # default value is 'reflow'
Feature extractor based on harmonic-noise separation model (#196)
Harmonic-noise separation is a fundamental step in extracting breathiness, voicing and tension from the singing voice. The old WORLD-based method cannot separate the harmonic and noise parts cleanly, making the extracted features less accurate than expected. We introduced a new NN-based algorithm (Vocal Remover) for this separation process. With the new method, the performance of most variance parameters (especially tension) should improve.
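As a rough illustration of why separation quality matters: the voicing and breathiness parameters defined in the release notes below are frame-wise RMS curves (in dB) of the harmonic and aperiodic parts respectively, so any leakage between the two parts directly distorts them. Below is a NumPy sketch of that extraction step, assuming the separator has already produced aligned harmonic and noise waveforms (the function name, frame sizes and stand-in signals are illustrative, not the repository's preprocessing code).

```python
# Illustrative sketch: deriving voicing/breathiness curves from a harmonic-noise
# separation.  Frame sizes and names are assumptions, not the actual preprocessing code.
import numpy as np

def rms_db(signal, frame_length=2048, hop_length=512, eps=1e-6):
    """Frame-wise RMS of a waveform, converted to dB."""
    n_frames = 1 + max(0, (len(signal) - frame_length) // hop_length)
    curve = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_length: i * hop_length + frame_length]
        curve[i] = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + eps)
    return curve

# harmonic, noise = hnsep_model(waveform)   # e.g. the new VR-based separator
harmonic = 0.1 * np.random.randn(44100)     # stand-in signals for demonstration
noise = 0.01 * np.random.randn(44100)

voicing = rms_db(harmonic)      # RMS of the harmonic part, in dB
breathiness = rms_db(noise)     # RMS of the aperiodic (noise) part, in dB
```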
The new harmonic-noise separator is now the default choice for preprocessing new datasets. Please read the guidance in GettingStarted.md and download the model file. Though not recommended, you can still use WORLD with the following line in your configuration:
hnsep: world # default value is 'vr'
Other improvements, changes and bug fixes
- The --speedup option in infer.py is replaced by --steps for continuous acceleration of Rectified Flow
- All exported models are adapted to the new continuous acceleration API
- Mel log base migration: the log10 setting is banned in preprocessing
- Mel log base migration: all exported models are converted to accept log e mel spectrograms
- The trainer now shows an error message when the user sets all predict_* options to false in variance model training
- The binarizer now shows an error message when negative values are found in ph_dur or note_dur
- Package versions in requirements.txt are updated; ONNX exporting requirements are listed in requirements-onnx.txt
- Bugfix: the extracted tension could be incorrect if the recording and label are not aligned
Some changes may not be listed above. See full change log: v2.3.0...v2.4.0
v2.3.0: New voicing and tension parameters, log base number migration plan and removal of old features
New variance parameters: voicing and tension (#169, #170)
Voicing: controlling power of the harmonic part
Voicing is defined as the RMS curve of the harmonic part of the singing, in dB, which can control the power of the harmonics in vowels and voiced consonants in the voice.
Unlike other singing synthesizers, which only allow decreasing the voicing, DiffSinger allows both increasing and decreasing this parameter. Voicing is intended as a successor to energy, and energy is no longer recommended.
Tension: controlling timbre and strength
Tension is mostly related to the ratio of the base harmonic to the full harmonics. The detailed calculation process is described in the documentation.
Usage and notice
Before enabling the parameters above, please carefully read the docs about choosing proper variance parameters at https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters. Please note that enabling all parameters does not guarantee that your model will get good results.
To train an acoustic model that accepts voicing or tension control, edit the configuration file:
use_voicing_embed: true
use_tension_embed: true
To train a variance model that predicts voicing or tension, edit the configuration file:
predict_voicing: true
predict_tension: true
Migration of log base of the mel spectrograms
For historical reasons, DiffSinger uses log 10 mel spectrograms rather than log e mel spectrograms. All acoustic models and vocoder ONNX models produce or accept log 10 mel spectrograms, and they are not compatible with log e mel spectrograms unless we manually multiply their outputs or inputs by a coefficient. To address this problem, we plan to gradually migrate the configurations and models to log e mel spectrograms.
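Since ln(x) = log10(x) · ln(10), the two conventions differ only by a constant factor on the spectrogram values; that factor is the coefficient mentioned above. A small sketch of the conversion (for illustration; the actual conversion of exported models is handled by the exporter in the later stages):

```python
# log-10 and natural-log mel spectrograms differ by a constant factor,
# since ln(x) = log10(x) * ln(10).
import numpy as np

LN10 = np.log(10.0)  # ~2.302585

def log10_mel_to_ln(mel_log10):
    return mel_log10 * LN10

def ln_mel_to_log10(mel_ln):
    return mel_ln / LN10

# Example: the old log-10 spec range of roughly [-5, 0] corresponds to about
# [-11.5, 0] in log e, which matches the new [-12, 0] bounds shown below.
print(log10_mel_to_ln(np.array([-5.0, 0.0])))   # [-11.51292546  0.]
```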
1st stage: support training log e models (#175)
In the first stage, both 10 and e are now supported as the base of the mel spectrogram. Users can now train acoustic models that output true log e mel spectrograms with the following changes in the configuration file:
Old:
spec_min: [-5]
spec_max: [0]
mel_vmin: -6.
mel_vmax: 1.5
mel_base: '10' # <- this is the default

New:
spec_min: [-12]
spec_max: [0]
mel_vmin: -14.
mel_vmax: 4.
mel_base: 'e' # <- recommended in the future
The mel_base configuration will also be stored in dsconfig.yaml and vocoder.yaml for compatibility checks in downstream applications like OpenUTAU. Once the downstream applications have adapted these checks, we will consider changing the default configuration to log e mel spectrograms.
2nd stage: force new models to be in log e
In the second stage (ETA next minor release), we will prevent users from training new acoustic models under the log 10 settings. Meanwhile, all log 10 models will be converted to log e models when they are exported to ONNX format.
3rd stage: eliminate log 10 from this repository
In the third stage (ETA some day before 2025), we will remove the mel_base configuration key and regard all models in this repository as log e models. There will be a formal warning before this change, because log 10 models will produce wrong results afterwards. Models already exported to ONNX will still work in downstream applications.
Further plans
There is currently no decision on whether log 10 models will be completely eliminated from the whole production workflow. If that decision is eventually made, there will at least be a tool for converting pre-existing log 10 ONNX models to log e ones so that they keep working as expected.
Dropping support for some old features and behaviors (#172)
Removed features and behaviors
- Discrete F0 embedding type (temporarily reserved in ONNX exporter)
- Code backup on training start
- Random seeding during training
- Linear domain of random time stretching augmentation
- Migration script and guidance for transcriptions and checkpoints from version 1.X.
Changes in configuration file
- The interp_uv configuration is removed and forced to True.
- train_set_name and valid_set_name are removed and forced to train and valid.
- num_pad_tokens is removed and forced to 1.
- ffn_padding is removed and forced to SAME.
- The g2p_dictionary configuration is removed in favor of dictionary.
- The pndm_speedup configuration is renamed to diff_speedup.
- Some configuration keys are now directly accessed, so the configuration must contain them: dictionary, diff_accelerator, pe, use_key_shift_embed, use_speed_embed.
Other improvements, changes and bug fixes
- Custom kwargs are now available even if the PL strategy is set to auto (#159)
- The default vocoder is set to the new NSF-HiFiGAN release 2024.02
- Configuration files for OpenUTAU (namely dsconfig.yaml and vocoder.yaml) are now automatically generated when exporting ONNX
- The two old dictionaries opencpop.txt and opencpop-strict.txt are removed
- Fix a clamping issue in DDPM causing abnormal loudness when speedup = 1 (ONNX models should be re-exported)
- Fix overlapped initializer names when exporting variance models
- Fix unexpected access to NoneType when not using the melody encoder
- Fix a regression causing inference of the ground truth to run repeatedly during acoustic model training
Some changes may not be listed above. See full change log: v2.2.1...v2.3.0
v2.2.1: Important notice and minor patch release
Vocoder fine-tuning is available
Everything about vocoder training, fine-tuning and research now has its own place: https://github.com/openvpi/SingingVocoders
Users can now fine-tune the shared NSF-HiFiGAN vocoder model on their own datasets without much computing resources. In most cases, vocoder fine-tuning can reduce the noise caused by mismatches between the predicted mel spectrograms and the ground truth on unseen datasets, improving the final audio quality. See the documentation in this repository on how to use custom vocoder models and deploy them to ONNX format.
Mutual influence between variance modules
Recent research by the developer team found mutual influence between the duration predictor, the pitch predictor and the variance predictor of a variance model. The findings have been written into the documentation as formal suggestions. Following these suggestions when training your variance models can improve accuracy and avoid unstable loudness.
Changes and bug fixes
This patch release contains the following changes:
- The pitch expressiveness factor is now exposed by default but can be disabled by --freeze_expr
- Note glide type can now be frozen by --freeze_glide for compatibility with OpenUTAU
- Shallow diffusion and FP16 AMP are now enabled by default
- The default f0_max configuration value is changed from 800 to 1100
- The model path can be specified by --ckpt when exporting a custom vocoder model to ONNX
- Documentation about preparing and deploying custom vocoders is added and re-organized
- Melody encoder is added to the new variance model architecture graph
The following bugs are fixed:
- A relative path bug caused by custom checkpoint saving directory
- An interpolation error raised during variance model inference when all notes are rest
- The breathiness unexpectedly becoming NaN in some rare edge cases
Known issues
When training with DDP, TensorBoard sometimes raises an error and no longer updates after a validation. The temporary solution is to add the option --reload_multifile=true when launching TensorBoard.
Full change log: v2.2.0...v2.2.1
v2.2.0: Shallow diffusion returns, melody encoder and ornament support, multi-node batched validation, minor improvements and bug fixes
Shallow diffusion returns (#128)
Shallow diffusion is a mechanism, first introduced in the original DiffSinger paper, that can improve quality and save inference time for diffusion models. Instead of starting the diffusion process from pure Gaussian noise as classic diffusion does, shallow diffusion adds shallow Gaussian noise to a low-quality result generated by a simple network (called the auxiliary decoder), skipping many unnecessary steps at the beginning. By combining shallow diffusion with sampling acceleration algorithms, we can get better results at the same inference speed as before, or achieve higher inference speed without quality deterioration.
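In DDPM terms, the idea can be sketched as follows: forward-noise the auxiliary decoder's rough mel up to step K, then run only the last K reverse steps instead of all T steps from pure noise. This is a schematic illustration with assumed names and noise schedule, not the implementation in this repository:

```python
# Schematic illustration of shallow diffusion in DDPM terms (assumed names and
# noise schedule, not this repository's implementation).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def shallow_diffusion_sample(denoise_step, aux_mel, K=400):
    """denoise_step(x, t) performs one reverse-diffusion step x_t -> x_{t-1}."""
    # Forward-noise the auxiliary decoder output to step K, i.e. sample q(x_K | x_0).
    noise = np.random.randn(*aux_mel.shape)
    x = (np.sqrt(alphas_cumprod[K - 1]) * aux_mel
         + np.sqrt(1.0 - alphas_cumprod[K - 1]) * noise)
    # Run only the last K reverse steps instead of all T.
    for t in range(K - 1, -1, -1):
        x = denoise_step(x, t)
    return x

# Toy call with a no-op "denoiser", just to show the call shape.
rough_mel = np.zeros((128, 80))                       # frames x mel bins
out = shallow_diffusion_sample(lambda x, t: x, rough_mel, K=400)
```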
Quick start in configuration file
use_shallow_diffusion: true
K_step: 400 # adjust according to your needs
K_step_infer: 400 # should be <= K_step
See other advanced settings and usages in the BestPractices.md.
Inference and deployment
The diffusion depth (K_step_infer) can be adjusted at inference time with the --depth option of infer.py.
Acoustic models with shallow diffusion enabled will get an additional input called depth after exporting to ONNX format.
The above depth arguments are guaranteed to be safe, as they are clipped by the maximum trained number of diffusion steps (K_step).
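In other words, the effective depth is simply clamped to the trained maximum; conceptually (an illustrative helper, not an actual function in the codebase):

```python
# The requested diffusion depth is clamped by the trained K_step (illustrative only).
def effective_depth(requested_depth, k_step):
    return min(requested_depth, k_step)

print(effective_depth(600, 400))   # a request of 600 is clamped to 400
```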
Melody encoder and ornaments support (#143)
The melody encoder directly calculates attention on the note sequence besides the linguistic features. With this new method of melody modeling, the pitch predictor becomes more sensitive to the pitch trend in the music score, thus improving accuracy and stability on short slurs, long vibratos and out-of-range notes. In addition, this note-level encoder can also accept ornament tags, such as glides, as input.
Melody encoder for pitch prediction
The results showed that the melody encoder is more suitable than the base pitch for carrying music score information, especially on expressive datasets. Significant improvements on short slurs and long vibratos were also observed on TensorBoard. In our internal tests, pitch predictors with the melody encoder also outperformed the old method on out-of-range notes, and remained responsive even when the music score goes far above the normal range (e.g. over C7 for a male singer). [Demo]
Before using melody encoder, we recommend you label your phoneme timings and MIDI sequence accurately. To enable melody encoder, simply introduce the following line in your configuration file:
use_melody_encoder: true
Pitch predictors with melody encoder enabled will get an additional input called note_rest after exporting to ONNX format.
Natural glide support
The melody encoder currently supports glides, where the pitch smoothly rises at the beginning of a note or drops at its end. With enough properly labeled glide samples in the dataset, the pitch predictor can produce accurate and natural glides from simple glide flags, without having to draw manual pitch curves as before. [Demo]
To enable glide input, ensure that melody encoder is enabled, and introduce the following line in your configuration file:
use_glide_embed: true
In your transcriptions.csv, you should add a new column called note_glide containing glide type names, where none means no glide and the other names are glide types defined in the glide_types configuration key. By default, there are two types of glide notes: up and down.
Glide labeling has already been supported by MakeDiffSinger and SlurCutter.
Pitch predictors with glide embedding will get an additional input called note_glide after exporting to ONNX format.
Multi-node batched validation and improved strategy selection (#148)
Validation during training can now run on all nodes and devices when DDP is enabled. Additionally, the validation batch size is no longer limited to 1. To configure this, override the following keys in your configuration file:
# adjust according to your needs
max_val_batch_frames: 10000
max_val_batch_size: 4
The PyTorch Lightning trainer strategy can now be configured more dynamically. Configuration example:
pl_trainer_strategy:
name: ddp
# keyword arguments of the strategy class can be configured below
process_group_backend: nccl
See more available strategies in the official documentation.
Besides, a new configuration key called nccl_p2p is introduced to control the P2P option of NCCL in case it gets stuck.
Other improvements and changes
- Handling of TensorBoard plots and audio samples is improved (#148)
- Binarizers now also print the data duration of each speaker (#148)
- Harvest pitch extractor and F0 range configurations are supported (#149)
- Data augmentation is now enabled by default and the ONNX exporter no longer needs --expose_* options
- Formatting of configuration attributes in the configuration schema has been improved (#153)
- Documentation and links are updated (#156)
Major bug fixes
- The ONNX exporter of acoustic models now loads the state dict in strict mode to prevent loading incorrect checkpoints
- SciPy version is constrained to >= 1.10.0 to avoid interpolation raising ValueError in some cases
- Potential alignment issues of the parselmouth pitch extractor are fixed
Known issues
For performance reasons, the find_unused_parameters option of the DDP strategy is disabled by default. However, the DDP strategy requires all parameters to be included in the computing graph, otherwise it raises a RuntimeError.
In some cases, for example when you turn off train_aux_decoder or train_diffusion in shallow diffusion configurations, part of the model is expected to stay outside of the computing graph. If you are using DDP in such cases, you can enable the option manually to avoid the error:
pl_trainer_strategy:
name: ddp
find_unused_parameters: true # <- enable this option
Some changes may not be listed above. See full change log: v2.1.0...v2.2.0
v2.1.0: Fine-tuning and parameter freezing, pitch expressiveness control, DS files training, minor feature improvements and bug fixes
Fine-tuning and parameter freezing (#108, #120)
If you already have some pre-trained checkpoints, and you need to adapt them to other datasets with their functionalities unchanged, fine-tuning may save training steps and time. Configuration example:
finetune_enabled: true # the main switch to enable fine-tuning
finetune_ckpt_path: checkpoints/pretrained/model_ckpt_steps_320000.ckpt # path to your pre-trained checkpoint
finetune_ignored_params: # prefix rules to exclude specific parameters when loading the checkpoints
- model.fs2.encoder.embed_tokens # in case when the phoneme set is changed
- model.fs2.txt_embed # same as above
- model.fs2.spk_embed # in case when the speaker set is changed
finetune_strict_shapes: true # whether to raise an error when parameter shapes mismatch
Freezing part of the model parameters during training and fine-tuning may save GPU memory, accelerate the training process or avoid catastrophic forgetting. Configuration example:
freezing_enabled: true # main switch to enable parameter freezing
frozen_params: # prefix rules to freeze specific parameters during training
- model.fs2.encoder
- model.fs2.pitch_embed
Please see the documentation for detailed usages of these two features.
Pitch expressiveness controlling mechanism (#97)
Expressiveness controls how freely the variance model generates pitch curves. By default, the variance model predicts pitch at 100% expressiveness, which means completely following the style of the voice provider. Correspondingly, 0% expressiveness produces a pitch curve that closely follows the smoothened music score. Expressiveness can be freely adjusted from 0% to 100%, statically or even dynamically at the frame level.
Pitch expressiveness control is compatible with all variance models that have a pitch predictor, without re-training anything.
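Conceptually, this behaves like a (possibly frame-wise) interpolation between the smoothed score pitch and the fully expressive prediction. Below is a toy sketch of the concept only (names are illustrative, and this is not how the model implements it internally):

```python
# Toy illustration of the expressiveness concept (not the internal implementation):
# expr = 1.0 follows the model's expressive prediction, expr = 0.0 stays close to
# the smoothed score pitch, and expr may also be a per-frame curve.
import numpy as np

def apply_expressiveness(smoothed_score_pitch, predicted_pitch, expr):
    expr = np.clip(np.asarray(expr, dtype=float), 0.0, 1.0)   # scalar or per-frame
    return (1.0 - expr) * smoothed_score_pitch + expr * predicted_pitch

frames = 200
score = np.full(frames, 69.0)                                  # flat A4 in MIDI note numbers
pred = score + 0.5 * np.sin(np.linspace(0.0, 20.0, frames))    # vibrato-like prediction
static = apply_expressiveness(score, pred, 0.8)                # like --expr 0.8
dynamic = apply_expressiveness(score, pred, np.linspace(0.0, 1.0, frames))
```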
Control pitch expressiveness in CLI
python scripts/infer.py variance my_project.ds --exp my_pitch_exp --predict pitch --expr 0.8 # a value between 0 and 1
Control pitch expressiveness in DS files
{
"expr": 0.8 // static control
}
or
{
"expr": "0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0", // dynamic control
"expr_timestep": "0.005"
}
Expose pitch expressiveness control in ONNX models
python scripts/export.py variance --exp my_pitch_exp --expose_expr
This will add an additional input named expr in my_pitch_exp.pitch.onnx.
DS files training (#132)
Using DS files to train variance models is now supported - this means users of voicebanks can tune projects in their own styles without recording any real singing voice. The only things you need to do are: copy the DS files into the ds/ folder of the raw dataset directory, write a single-column transcriptions.csv to declare them, and turn on the main switch of DS file binarization in the configuration file:
binarization_args:
prefer_ds: true # prefer loading from DS files
Please see the documentation for more detailed usage and information about DS file binarization.
Other minor feature improvements
- Support the state-of-the-art RMVPE pitch extractor (#118, #122)
- Show objective evaluation metrics on TensorBoard (#123, #127)
- Support composite LR schedulers (#125)
- Perform graceful exit on keyboard interrupt during binarization and inference (#119)
- Improve logging format of learning rate (#115)
- Add more documentation for old and new features
Major bug fixes
- Fixed wrong speaker ID assignment in fixed pitch shifting augmentation
- Fixed illegal access to None when training the dur predictor
- Fixed slur mistakes in a sample DS file
- Fixed wrong model loading logic when using --mel
- Fixed noisy output of ONNX models on DirectML
- Fixed missing spk_embed input of multi-speaker duration predictor ONNX models
Some changes may not be listed above. See full change log: v2.0.0...v2.1.0
v2.0.0: Complete refactor, brand-new variance models, universal dictionary compatibility, AMP/DDP support and much more improvements
Backwards Incompatible Changes
Dataset making pipelines
The dataset making pipelines (based on MFA) have been moved to their own repository, MakeDiffSinger. The original Jupyter Notebook has been removed and replaced with command-line scripts.
Old functionality removal
The following functionalities have been removed and are no longer supported:
- MIDI-A/B training and inference
- PitchExtractor (xiaoma_pe) training and inference
- Old 24 kHz vocoder (HiFi-GAN & PWG) training and inference
Environment & dependencies
Dependencies have been refactored and require re-installing. The ONNX exporting dependency has been updated to PyTorch 1.13 from PyTorch 1.8.
Model loading
Old acoustic model checkpoints should be migrated via the following script before loading:
python scripts/migrate.py ckpt <INPUT_CKPT> <OUTPUT_CKPT>
Before resuming training from old checkpoints, the following line should be added to the configuration file:
num_pad_tokens: 3
Datasets
Old datasets should be re-binarized before training.
Old data labels (transcriptions.txt) should be migrated to new transcriptions.csv via the following script before loading:
python scripts/migrate.py txt <INPUT_TXT>
Configuration files
The following configuration keys have been renamed:
- g2p_dictionary => dictionary
- max_tokens => max_batch_frames
- max_sentences => max_batch_size
- max_eval_tokens => max_val_batch_frames
- max_eval_sentences => max_val_batch_size
- lr => optimizer_args.lr
- optimizer_adam_beta1 => optimizer_args.beta1
- optimizer_adam_beta2 => optimizer_args.beta2
- weight_decay => optimizer_args.weight_decay
- warmup_updates => lr_scheduler_args.warmup_steps
- decay_steps => lr_scheduler_args.step_size
- gamma => lr_scheduler_args.gamma
DS files
DS files in v1.x format are no longer supported. Please export them again with the latest version of OpenUTAU for DiffSinger before inference.
The new variance models, parameters and mechanisms
Variance models
Training, inference and deployment of the new variance models are supported.
Functionalities included:
- Automatically predicts phoneme durations (Duration Predictor)
- Automatically predicts the pitch curve (Pitch Diffusion)
- Automatically predicts other variance parameters jointly (Multi-Variance Diffusion)
Before training variance models, the current data transcriptions should be migrated. Required operations may vary according to functionalities chosen and dictionaries used (sometimes requires manual labeling). See details at: variance-temp-solution.
Phoneme durations
Acoustic models require the user to input a duration for every phoneme, so they rely on phoneme duration predictors. The phoneme duration prediction module in the variance model can predict a duration for every phoneme given the phoneme sequence, word division, word durations and an approximate MIDI sequence.
Pitch curve
Acoustic models require an explicit pitch input from outside. The pitch prediction module in the variance model can predict the pitch curve given phoneme information and a smoothened MIDI curve. The specially designed labeling system can correct bad data with many out-of-tune errors and still produce accurate models.
Variance parameters
Variance parameters bring higher expressiveness and controllability beyond phoneme durations and pitch. They are predicted by the variance model given phoneme information and the pitch curve, then fed to the acoustic model for control.
NOTE: Variance parameters are represented by absolute values instead of relative values (offsets), thus no default value curves are defined. For this reason, new acoustic models that accept these parameters as input should be trained alongside their corresponding variance models.
Energy
Energy is defined as the RMS curve of the singing, in dB, which can control the strength of voice to a certain extent.
In DS files, energy and energy_timestep are used to control energy.
Breathiness
Breathiness is defined as the RMS curve of the aperiodic part of the singing, in dB, which can control the power of the air and unvoiced consonants in the voice.
In DS files, breathiness and breathiness_timestep are used to control breathiness.
Style fusion mechanism
All parameters that variance models support can be dynamically style-mixed. Among them, phoneme durations are mixed at the phoneme level, while the others are mixed at the frame level. Style fusion of different parameters, as well as style fusion and timbre fusion, are independent of each other.
Style fusion controls are similar to the timbre fusion of acoustic models:
- ph_spk_mix is used to control the fusion of phoneme durations.
- spk_mix and spk_mix_timestep are used to control the fusion of other parameters.
Local retaking mechanism
Pitch and all other variance parameters support local retaking, i.e. re-generating the curve on a continuous sub-region based on the given curve segments. Meanwhile, this mechanism ensures that the retaken curve connects smoothly to the given curve.
To retake pitch, the complete phoneme information, the position of the region to be retaken and the pitch curve in the non-retaken regions should be given.
To retake other variance parameters, the following should be given: the complete phoneme information, the complete pitch curve, the names of the variance parameters to retake (retaking multiple parameters in one go is supported), the positions of the regions to be retaken (retaking different parameters on different regions is supported) and the parameter curves in the non-retaken regions.
Parameter cascading mechanism
Overall cascading logic
The cascading order of the variance model in general is: music scores => phoneme durations => pitch => other variance parameters.
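In pseudo-Python, the default cascade looks roughly like this (the function names are illustrative, not the actual API of this repository):

```python
# Rough illustration of the default cascading order: music score -> phoneme
# durations -> pitch -> other variance parameters (names are illustrative).
def run_variance_cascade(music_score, variance_model):
    ph_dur = variance_model.predict_durations(music_score)                     # duration predictor
    pitch = variance_model.predict_pitch(music_score, ph_dur)                  # pitch prediction
    variances = variance_model.predict_variances(music_score, ph_dur, pitch)   # e.g. energy, breathiness
    return ph_dur, pitch, variances

def run_full_pipeline(music_score, phonemes, variance_model, acoustic_model, vocoder):
    ph_dur, pitch, variances = run_variance_cascade(music_score, variance_model)
    mel = acoustic_model(phonemes, ph_dur, pitch, variances)
    return vocoder(mel, pitch)
```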
Customized variance cascading
By utilizing the local retaking mechanism, the cascading order of all variance parameters except pitch can be customized at inference time. The following shows some example use cases:
- Jointly predict parameters A, B and C in one go without coupling them, so that changing one parameter will not cause the others to change.
- Make parameter C come after parameters A and B, i.e. jointly predict A and B in one go, then use A and B to predict C. Parameter C will change when A or B is modified, but A and B do not influence each other.
- Freeze parameter A and couple parameters B and C: use A to predict B and C, so that modifying either B or C will cause the other to change.
Universal dictionary and phoneme system support
The brand-new variance models and phoneme labeling system support any dictionary and any phoneme system. See the variance label migration guidelines (variance-temp-solution) and the custom dictionary guidelines for more details.
Automatic mixed precision, multi-GPU and gradient accumulation
This project has been adapted to the latest version of Lightning and supports automatic mixed precision (FP16/BF16 AMP), multi-GPU training (DDP) and gradient accumulation for accelerating training and saving GPU memory. See performance tuning guidelines for more details.
Other new contents and changes
- Documentation of this project has been refactored. The README lists all important documents and links.
- Code structure and dependencies are significantly refactored. Some dependencies are updated.
- Scripts for preprocessing, training, inference and deployment have been refactored and moved under scripts/.
- A new script for deleting specific speaker embedding from model checkpoints is added.
- PYTHONPATH and CUDA_VISIBLE_DEVICES no longer need to be exported when running preprocessing and training.
- The speaker IDs assigned to each dataset can now be customized via the spk_ids configuration key. Giving the same ID to multiple datasets is also supported.
- Multiprocessing binarization is now supported. The number of workers can be customized.
- The dataset binary format has been changed to HDF5. Redundant contents are removed.
- The learning rate and the optimizer can now be customized more freely via lr_scheduler_args and optimizer_args.
- DDIM, DPM-Solver++ (a replacement of DPM-Solver) and UniPC algorithms are supported for diffusion sampling acceleration.
- The diffusion accelerator integrated in ONNX models has been changed to DDIM.
- When exporting multi-speaker models, all speakers will be exported by default if the --export_spk option is unset.
- The operator set version of exported ONNX models has been upgraded to 15.
Some changes may not be listed above. See the repository README for more details.
Bug fixes
- Fixed a bug causing the epoch count in the terminal logging to be only 1/1000 of the actual epoch.
- Fixed potential file handle racing when reading/seeking dataset.
- Fixed a bug causing inconsistency between joint augmentation formula and implementation.
- Fixed hyper-parameters failing to render colors in some terminals and some Python versions.
- Fixed messed up code backup directory structure.
License
The license of this project has been changed from the MIT License to the Apache License 2.0.
NOTICE: No more backward compatibility for MIDI-A/B, 24kHz vocoder and PE
This is a backup release for our further developments.
We will re-arrange the codebase so that MIDI-A/B SVS modes and the 24 kHz vocoder will no longer be supported.
In the next version of DiffSinger in this forked repository, inference or training with MIDI-A/B SVS, PitchPredictor, the old 24 kHz vocoder and PitchExtractor (which MIDI-B relies on) will raise errors.
With these clean-ups, we will be able to focus on MIDI-less acoustic model preparation/preprocessing/training/inference and further development for more functionalities.
Time stretching, velocity control and optimized joint data augmentation
Overview
In this release:
- We introduced time stretching augmentation, which allows you to control the frame-level velocity of any part of the singing (similar to, but much more flexible than, the VEL parameter in VOCALOID). We are glad to announce that our velocity parameter is a brand-new curve parameter that has probably never been introduced to modern singing voice synthesis architectures and products before. The velocity parameter gives you the freedom to control the texture of consonants and the transitions within vowels.
- We implemented a scaling algorithm for multiple types of augmentation that are enabled together. See the dataset making pipeline for more details.
- A custom learning rate decay ratio (gamma) is supported, so you can control the LR schedule more freely to adapt to more complex datasets.
Random time stretching
Randomly changes the speed of your training data. This will probably improve the stability of long utterances (especially for speech data) and allows you to control the brand-new velocity parameter described above. This augmentation can be enabled together with either random or fixed pitch shifting augmentation.
To enable random time stretching augmentation for your former dataset, add the following configuration in the config file:
augmentation_args:
random_time_stretching:
range: [0.5, 2.0]
scale: 2.0
use_speed_embed: true
Control velocity curve in *.ds files
{
"velocity_timestep": "0.005", // timestep in seconds, like f0_timestep
"velocity": "0.5 0.6 0.7 ... 1.8 1.9 2.0", // sequence of float values, like f0_seq
... // other attributes
}
Export to ONNX format
python onnx/export/export_acoustic.py --exp YOUR_EXP_NAME --expose_velocity
Configure lr decay ratio
Add the following configuration in the config file:
gamma: 0.5 # This is the default value. You may use any positive value that is less than 1.
Pretrained models
0218_opencpop_ds1000_velocity
Pretrained model with time stretching augmentation and velocity control.
0223_opencpop_ds1000_joint_aug
Pretrained model with joint augmentation of random pitch shifting and random time stretching.
Data augmentation and gender control (usage and pretrained model)
Overview
In this release, we introduced data augmentation to DiffSinger in this forked repository.
See the dataset making pipeline for more details.
Random pitch shifting
Randomly shifts the pitch of the training data and embeds the number of semitones shifted into the neural network. This broadens the pitch range and allows you to control the gender (like the GEN parameter in VOCALOID) at the frame level.
To enable random pitch shifting for your former dataset, add the following configuration in the config file:
augmentation_args:
random_pitch_shifting:
range: [-5., 5.]
scale: 2.0
use_key_shift_embed: true
Fixed pitch shifting
Shifts the pitch of the training data by several semitones. All pitch-shifted data is regarded as coming from speakers other than the original one. Speaker embedding is enabled, the number of speakers is increased, and the pitch range is also broadened.
To enable fixed pitch shifting for your former dataset, add the following configuration in the config file:
augmentation_args:
fixed_pitch_shifting:
targets: [-5., 5.]
scale: 0.75
use_key_shift_embed: false
use_spk_id: true
num_spk: X # Set this value to at least (1 + T) * N, where T is the number of targets and N is the number of speakers before augmentation.
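As a worked example of this lower bound (hypothetical numbers for illustration only): with the two targets shown above (T = 2) and one original speaker (N = 1), num_spk should be at least 3; with three original speakers, at least 9.

```python
# Worked example of the num_spk lower bound (1 + T) * N for fixed pitch shifting.
def min_num_spk(num_targets, num_speakers_before_aug):
    return (1 + num_targets) * num_speakers_before_aug

print(min_num_spk(2, 1))   # two targets, one original speaker     -> 3
print(min_num_spk(2, 3))   # two targets, three original speakers  -> 9
```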
0211_opencpop_ds1000_keyshift
A pretrained model on the Opencpop dataset with random pitch shifting applied.
Control gender value with CLI args of main.py:
python main.py xxx.ds --exp 0211_opencpop_ds1000_keyshift --gender GEN
where GEN is a float value between -1 and 1 (negative = male, positive = female).
Control gender curve in *.ds files:
{
"gender_timestep": "0.005", // timestep in seconds, like f0_timestep
"gender": "-1.0 -0.9 -0.8 ... 0.8 0.9 1.0", // sequence of float values, like f0_seq
... // other attributes
}
Export to ONNX format
python onnx/export/export_acoustic.py --exp 0211_opencpop_ds1000_keyshift --expose_gender
or
python onnx/export/export_acoustic.py --exp 0211_opencpop_ds1000_keyshift [--freeze_gender GEN]
where GEN is the gender value that you would like to freeze into the model (defaults to 0).
Pretrained multi-speaker model
This is a pretrained model with multiple speakers embedded, enabled for the newest speaker mix features of DiffSinger in this forked repository.
Demo: https://www.bilibili.com/video/BV1Yy4y1d7Cg
0116_female_triplet_ds1000
There are 3 female singers in this model:
- Opencpop, which we used to train models before (set as default)
- Qixuan (绮萱), a 12-year-old girl (anyone who uses or mixes her voice should credit her by name and link to https://space.bilibili.com/498285939 and https://y.qq.com/n/ryqq/singer/003HjD6H4aZn1K)
- XiaYeZi (夏叶子), a female virtual singer from 韶和Project (anyone who uses or mixes her voice should credit her by name and link to https://space.bilibili.com/13303439 and https://space.bilibili.com/787619)
Any commercial usage of this model is prohibited. This notice should be attached to all types of redistribution of this model.
If you use speaker mix, you must follow the rules of each speaker that you add with a proportion larger than zero.