v2.3.0: New voicing and tension parameters, log base number migration plan and removal of old features
New variance parameters: voicing and tension (#169, #170)
Voicing: controlling power of the harmonic part
Voicing is defined as the RMS curve of the harmonic part of the singing, in dB. It controls the power of the harmonics in vowels and voiced consonants.
Unlike other singing synthesizers, which only allow decreasing the voicing, DiffSinger allows both increasing and decreasing this parameter. Voicing is intended as the successor to energy, and energy is no longer recommended.
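For intuition, here is a minimal sketch of a voicing-like curve, assuming HPSS from librosa as a stand-in for the harmonic extraction actually used by DiffSinger; the hop size and dB floor are arbitrary illustration values.

```python
# Illustrative sketch only: approximates a voicing-like curve as the frame-wise
# RMS (in dB) of a harmonic component obtained via HPSS. DiffSinger's actual
# extraction method and constants may differ; see the official docs.
import librosa
import numpy as np

def voicing_curve(wav_path, hop_length=512, floor_db=-96.0):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    harmonic = librosa.effects.harmonic(y)  # harmonic part of the signal
    rms = librosa.feature.rms(y=harmonic, hop_length=hop_length)[0]
    return np.maximum(20.0 * np.log10(rms + 1e-12), floor_db)  # RMS curve in dB
```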
Tension: controlling timbre and strength
Tension is mostly related to the ratio of the base harmonic to the full harmonics. The detailed calculation process is described in the documentation.
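As a purely illustrative aid, the sketch below computes a base-to-full harmonic energy ratio for a single frame; this is not DiffSinger's actual tension formula, which is defined in the documentation.

```python
# Illustrative only: a base-to-full harmonic energy ratio, NOT DiffSinger's
# exact tension calculation (refer to the documentation for that).
import numpy as np

def base_to_full_ratio(harmonic_amplitudes):
    """harmonic_amplitudes: per-harmonic amplitudes [a1, a2, ..., aN] of one frame."""
    energies = np.square(np.asarray(harmonic_amplitudes, dtype=float))
    return energies[0] / (energies.sum() + 1e-12)  # 1.0 means all energy is in the base harmonic
```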
Usage and notice
Before enabling the parameters above, please carefully read the docs on choosing proper variance parameters at https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters. Please note that enabling all parameters does not guarantee that your model will get good results.
To train an acoustic model that accepts voicing or tension control, edit the configuration file:
use_voicing_embed: true
use_tension_embed: true
To train a variance model that predicts voicing or tension, edit the configuration file:
predict_voicing: true
predict_tension: true
Migration of log base of the mel spectrograms
Due to historical reasons, DiffSinger uses log 10 mel spectrograms rather than log e mel spectrograms. All acoustic models and vocoder ONNX models produce or accept log 10 mel spectrograms, and they are not compatible with log e mel spectrograms unless their outputs or inputs are manually multiplied by a conversion coefficient. To address this problem, we plan to gradually migrate the configurations and models to log e mel spectrograms.
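For reference, the two bases differ only by a constant factor, since log_e(x) = log_10(x) · ln(10) ≈ 2.3026 · log_10(x). A minimal numpy sketch of this relationship (the function names here are ours, not part of the codebase):

```python
import numpy as np

LOG10_TO_LN = float(np.log(10.0))  # ln(10) ≈ 2.302585

def log10_mel_to_ln(mel_log10):
    """Convert a log 10 mel spectrogram to a log e (natural log) one."""
    return mel_log10 * LOG10_TO_LN

def ln_mel_to_log10(mel_ln):
    """Convert a log e mel spectrogram back to log 10."""
    return mel_ln / LOG10_TO_LN
```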
1st stage: support training log e models (#175)
In the first stage, both 10 and e are now supported as base numbers of the mel spectrogram. Users can now train acoustic models that output true log e mel spectrograms with the following changes in the configuration file:
| Old | New |
| --- | --- |
| `spec_min: [-5]`<br>`spec_max: [0]`<br>`mel_vmin: -6.`<br>`mel_vmax: 1.5`<br>`mel_base: '10'  # <- this is the default` | `spec_min: [-12]`<br>`spec_max: [0]`<br>`mel_vmin: -14.`<br>`mel_vmax: 4.`<br>`mel_base: 'e'  # <- recommended in the future` |
The `mel_base` configuration will also be stored in dsconfig.yaml and vocoder.yaml for compatibility checks in downstream applications like OpenUTAU. Once the downstream applications have adapted the checks, we will consider changing the default configuration to log e mel spectrograms.
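As an illustration of the kind of check a downstream application could perform (a hypothetical Python sketch, not OpenUTAU's actual implementation; only the `mel_base` key and file names come from this release):

```python
import yaml

def check_mel_base(dsconfig_path, vocoder_config_path):
    """Raise if the acoustic model and the vocoder disagree on the mel log base."""
    with open(dsconfig_path, encoding="utf-8") as f:
        acoustic_base = str(yaml.safe_load(f).get("mel_base", "10"))  # '10' is the current default
    with open(vocoder_config_path, encoding="utf-8") as f:
        vocoder_base = str(yaml.safe_load(f).get("mel_base", "10"))
    if acoustic_base != vocoder_base:
        raise ValueError(
            f"mel_base mismatch: acoustic model uses log {acoustic_base}, "
            f"vocoder expects log {vocoder_base}"
        )
```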
2nd stage: force new models to be in log e
In the second stage (ETA next minor release), we will prevent users from training new acoustic models under the log 10 settings. Meanwhile, all log 10 models will be converted to log e models when they are exported to ONNX format.
3rd stage: eliminate log 10 from this repository
In the third stage (ETA some day before 2025), we will remove the `mel_base` configuration key and regard all models in this repository as log e models. There will be a formal warning before this change, because log 10 models will produce wrong results afterwards. Models already exported to ONNX will still work in downstream applications.
Further plans
There is currently no decision on whether log 10 models will be completely eliminated from the whole production workflow. If that decision is eventually made, there will be at least a tool for converting pre-existing log 10 ONNX models to log e ones so that they can still work as expected.
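To sketch the idea behind such a conversion (this is not the planned tool, just a rough illustration using the onnx Python package; the output name "mel" and the float32 dtype are assumptions that may not match a given model):

```python
# Rough sketch: append a Mul node that scales the mel output of a log 10
# acoustic model by ln(10), turning it into a log e output.
import numpy as np
import onnx
from onnx import helper, numpy_helper

def convert_log10_onnx_to_ln(src_path, dst_path, mel_output_name="mel"):
    model = onnx.load(src_path)
    graph = model.graph

    # Constant scale factor ln(10) as a rank-0 initializer (float32 assumed).
    scale = numpy_helper.from_array(
        np.array(np.log(10.0), dtype=np.float32), name="log10_to_ln_scale"
    )
    graph.initializer.append(scale)

    # Redirect the original mel output to an internal tensor name...
    internal_name = mel_output_name + "_log10"
    for node in graph.node:
        node.output[:] = [
            internal_name if name == mel_output_name else name for name in node.output
        ]

    # ...and multiply it by ln(10) to produce the graph output again.
    graph.node.append(
        helper.make_node(
            "Mul",
            inputs=[internal_name, "log10_to_ln_scale"],
            outputs=[mel_output_name],
            name="log10_to_ln",
        )
    )
    onnx.save(model, dst_path)
```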
Dropping support for some old features and behaviors (#172)
Removed features and behaviors
- Discrete F0 embedding type (temporarily reserved in ONNX exporter)
- Code backup on training start
- Random seeding during training
- Linear domain of random time stretching augmentation
- Migration script and guidance for transcriptions and checkpoints from version 1.X.
Changes in configuration file
- `interp_uv` configuration is removed and forced to `True`.
- `train_set_name` and `valid_set_name` are removed and forced to `train` and `valid`.
- `num_pad_tokens` is removed and forced to 1.
- `ffn_padding` is removed and forced to `SAME`.
- `g2p_dictionary` configuration is removed in favor of `dictionary`.
- `pndm_speedup` configuration is renamed to `diff_speedup`.
- Some configuration keys are now directly accessed, so the configuration must contain them: `dictionary`, `diff_accelerator`, `pe`, `use_key_shift_embed`, `use_speed_embed` (see the example below).
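For reference, a configuration that satisfies the directly-accessed keys could look like the following; all values are placeholders for illustration only, not recommendations:

```yaml
# Placeholder values for illustration only; pick what fits your own setup.
dictionary: dictionaries/opencpop-extension.txt
diff_accelerator: ddim
pe: parselmouth
use_key_shift_embed: true
use_speed_embed: true
diff_speedup: 10  # renamed from pndm_speedup
```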
Other improvements, changes and bug fixes
- Custom `kwargs` are now available even if the PL strategy is set to `auto` (#159)
- The default vocoder is set to the new NSF-HiFiGAN release 2024.02
- Configuration files for OpenUTAU (namely dsconfig.yaml and vocoder.yaml) are now automatically generated when exporting ONNX
- The two old dictionaries opencpop.txt and opencpop-strict.txt are removed
- Fix a clamping issue in DDPM causing abnormal loudness when speedup = 1 (ONNX models should be re-exported)
- Fix overlapped initializer names when exporting variance models
- Fix unexpected access to NoneType when not using melody encoder
- Fix a regression where inference of the ground truth happened repeatedly during acoustic model training
Some changes may not be listed above. See full change log: v2.2.1...v2.3.0