v2.3.0: New voicing and tension parameters, log base number migration plan and removal of old features
New variance parameters: voicing and tension (#169, #170)
Voicing: controlling power of the harmonic part
Voicing is defined as the RMS curve of the harmonic part of the singing, in dB. It controls the power of the harmonics in vowels and voiced consonants.
Unlike other singing synthesizers, which only allow decreasing the voicing, DiffSinger allows both increasing and decreasing this parameter. Voicing is intended as the successor to energy, and energy is no longer recommended.
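For intuition, here is a minimal sketch of a voicing-like curve, assuming HPSS from librosa as a stand-in for the harmonic extraction actually used by DiffSinger; the hop size and dB floor are arbitrary illustration values.

```python
# Illustrative sketch only: approximates a voicing-like curve as the frame-wise
# RMS (in dB) of a harmonic component obtained via HPSS. DiffSinger's actual
# extraction method and constants may differ; see the official docs.
import librosa
import numpy as np

def voicing_curve(wav_path, hop_length=512, floor_db=-96.0):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    harmonic = librosa.effects.harmonic(y)  # harmonic part of the signal
    rms = librosa.feature.rms(y=harmonic, hop_length=hop_length)[0]
    return np.maximum(20.0 * np.log10(rms + 1e-12), floor_db)  # RMS curve in dB
```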
Tension: controlling timbre and strength
Tension is mostly related to the ratio of the base harmonic to the full harmonics. The detailed calculation process is described in the documentation.
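As a purely illustrative aid, the sketch below computes a base-to-full harmonic energy ratio for a single frame; this is not DiffSinger's actual tension formula, which is defined in the documentation.

```python
# Illustrative only: a base-to-full harmonic energy ratio, NOT DiffSinger's
# exact tension calculation (refer to the documentation for that).
import numpy as np

def base_to_full_ratio(harmonic_amplitudes):
    """harmonic_amplitudes: per-harmonic amplitudes [a1, a2, ..., aN] of one frame."""
    energies = np.square(np.asarray(harmonic_amplitudes, dtype=float))
    return energies[0] / (energies.sum() + 1e-12)  # 1.0 means all energy is in the base harmonic
```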
Usage and notice
Before enabling the parameters above, please carefully read the docs on choosing proper variance parameters at https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters. Please note that enabling all parameters does not guarantee that your model will get good results.
To train an acoustic model that accepts voicing or tension control, edit the configuration file:
use_voicing_embed: true
use_tension_embed: true
To train a variance model that predicts voicing or tension, edit the configuration file:
predict_voicing: true
predict_tension: true
Migration of log base of the mel spectrograms
Due to historical reasons, DiffSinger uses log 10 mel spectrograms rather than log e mel spectrograms. All acoustic models and vocoder ONNX models produce or accept log 10 mel spectrograms, and they are not compatible with log e mel spectrograms unless their outputs or inputs are manually multiplied by a conversion coefficient. To address this problem, we plan to gradually migrate the configurations and models to log e mel spectrograms.
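For reference, the two bases differ only by a constant factor, since log_e(x) = log_10(x) · ln(10) ≈ 2.3026 · log_10(x). A minimal numpy sketch of this relationship (the function names here are ours, not part of the codebase):

```python
import numpy as np

LOG10_TO_LN = float(np.log(10.0))  # ln(10) ≈ 2.302585

def log10_mel_to_ln(mel_log10):
    """Convert a log 10 mel spectrogram to a log e (natural log) one."""
    return mel_log10 * LOG10_TO_LN

def ln_mel_to_log10(mel_ln):
    """Convert a log e mel spectrogram back to log 10."""
    return mel_ln / LOG10_TO_LN
```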
1st stage: support training log e models (#175)
In the first stage, both 10 and e are now supported as base numbers of the mel spectrogram. Users can now train acoustic models that output true log e mel spectrograms with the following changes in the configuration file:
| Old | New |
| --- | --- |
| `spec_min: [-5]`<br>`spec_max: [0]`<br>`mel_vmin: -6.`<br>`mel_vmax: 1.5`<br>`mel_base: '10'  # <- this is the default` | `spec_min: [-12]`<br>`spec_max: [0]`<br>`mel_vmin: -14.`<br>`mel_vmax: 4.`<br>`mel_base: 'e'  # <- recommended in the future` |
The `mel_base` configuration will also be stored in dsconfig.yaml and vocoder.yaml for compatibility checks in downstream applications like OpenUTAU. Once the downstream applications have adapted the checks, we will consider changing the default configuration to log e mel spectrograms.
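As an illustration of the kind of check a downstream application could perform (a hypothetical Python sketch, not OpenUTAU's actual implementation; only the `mel_base` key and file names come from this release):

```python
import yaml

def check_mel_base(dsconfig_path, vocoder_config_path):
    """Raise if the acoustic model and the vocoder disagree on the mel log base."""
    with open(dsconfig_path, encoding="utf-8") as f:
        acoustic_base = str(yaml.safe_load(f).get("mel_base", "10"))  # '10' is the current default
    with open(vocoder_config_path, encoding="utf-8") as f:
        vocoder_base = str(yaml.safe_load(f).get("mel_base", "10"))
    if acoustic_base != vocoder_base:
        raise ValueError(
            f"mel_base mismatch: acoustic model uses log {acoustic_base}, "
            f"vocoder expects log {vocoder_base}"
        )
```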
2nd stage: force new models to be in log e
In the second stage (ETA next minor release), we will prevent users from training new acoustic models under the log 10 settings. Meanwhile, all log 10 models will be converted to log e models when they are exported to ONNX format.
3rd stage: eliminate log 10 from this repository
In the third stage (ETA some day before 2025), we will remove the `mel_base` configuration key and regard all models in this repository as log e models. There will be a formal warning before this change, because log 10 models will produce wrong results afterwards. Models already exported to ONNX will still work in downstream applications.
Further plans
There is currently no decision on whether log 10 models will be completely eliminated from the whole production workflow. If that decision is eventually made, there will be at least a tool for converting pre-existing log 10 ONNX models to log e ones so that they can still work as expected.
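To sketch the idea behind such a conversion (this is not the planned tool, just a rough illustration using the onnx Python package; the output name "mel" and the float32 dtype are assumptions that may not match a given model):

```python
# Rough sketch: append a Mul node that scales the mel output of a log 10
# acoustic model by ln(10), turning it into a log e output.
import numpy as np
import onnx
from onnx import helper, numpy_helper

def convert_log10_onnx_to_ln(src_path, dst_path, mel_output_name="mel"):
    model = onnx.load(src_path)
    graph = model.graph

    # Constant scale factor ln(10) as a rank-0 initializer (float32 assumed).
    scale = numpy_helper.from_array(
        np.array(np.log(10.0), dtype=np.float32), name="log10_to_ln_scale"
    )
    graph.initializer.append(scale)

    # Redirect the original mel output to an internal tensor name...
    internal_name = mel_output_name + "_log10"
    for node in graph.node:
        node.output[:] = [
            internal_name if name == mel_output_name else name for name in node.output
        ]

    # ...and multiply it by ln(10) to produce the graph output again.
    graph.node.append(
        helper.make_node(
            "Mul",
            inputs=[internal_name, "log10_to_ln_scale"],
            outputs=[mel_output_name],
            name="log10_to_ln",
        )
    )
    onnx.save(model, dst_path)
```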
Dropping support for some old features and behaviors (#172)
Removed features and behaviors
- Discrete F0 embedding type (temporarily reserved in ONNX exporter)
- Code backup on training start
- Random seeding during training
- Linear domain of random time stretching augmentation
- Migration script and guidance for transcriptions and checkpoints from version 1.X.
Changes in configuration file
- `interp_uv` configuration is removed and forced to `True`.
- `train_set_name` and `valid_set_name` are removed and forced to `train` and `valid`.
- `num_pad_tokens` is removed and forced to 1.
- `ffn_padding` is removed and forced to `SAME`.
- `g2p_dictionary` configuration is removed in favor of `dictionary`.
- `pndm_speedup` configuration is renamed to `diff_speedup`.
- Some configuration keys are now directly accessed, so the configuration must contain them: `dictionary`, `diff_accelerator`, `pe`, `use_key_shift_embed`, `use_speed_embed` (see the example below).
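For reference, a configuration that satisfies the directly-accessed keys could look like the following; all values are placeholders for illustration only, not recommendations:

```yaml
# Placeholder values for illustration only; pick what fits your own setup.
dictionary: dictionaries/opencpop-extension.txt
diff_accelerator: ddim
pe: parselmouth
use_key_shift_embed: true
use_speed_embed: true
diff_speedup: 10  # renamed from pndm_speedup
```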
Other improvements, changes and bug fixes
- Custom `kwargs` are now available even if the PL strategy is set to `auto` (#159)
- The default vocoder is set to the new NSF-HiFiGAN release 2024.02
- Configuration files for OpenUTAU (namely dsconfig.yaml and vocoder.yaml) are now automatically generated when exporting ONNX
- The two old dictionaries opencpop.txt and opencpop-strict.txt are removed
- Fix a clamping issue in DDPM causing abnormal loudness when speedup = 1 (ONNX models should be re-exported)
- Fix overlapped initializer names when exporting variance models
- Fix unexpected access to NoneType when not using melody encoder
- Fix a regression where inference of the ground truth happened repeatedly during acoustic model training
Some changes may not be listed above. See full change log: v2.2.1...v2.3.0