LibriVoc-Dataset

LibriVoc is a new open-source, large-scale dataset for vocoder artifact detection. LibriVoc is derived from the LibriTTS speech corpus, which is widely used in text-to-speech research. The LibriTTS corpus is in turn derived from the LibriSpeech dataset, in which each sample is extracted from LibriVox audiobooks.

The dataset can be viewed and downloaded here: https://drive.google.com/file/d/1JwxyWK52zSu96S1PEqmh59bttu3uHWrW/view?usp=share_link
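If you prefer to fetch the file from a script rather than through the browser, the `gdown` package can download Google Drive files by URL. This is only a convenience sketch; the output filename below is an assumption, so rename it to match the actual archive.

```python
# Hedged example: download the Google Drive file linked above (pip install gdown).
import gdown

url = "https://drive.google.com/uc?id=1JwxyWK52zSu96S1PEqmh59bttu3uHWrW"
gdown.download(url, "LibriVoc.zip", quiet=False)  # output name is assumed; adjust as needed
```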

We use six state-of-the-art neural vocoders to generate the synthesized speech samples in LibriVoc: WaveNet and WaveRNN (autoregressive vocoders), MelGAN and Parallel WaveGAN (GAN-based vocoders), and WaveGrad and DiffWave (diffusion-based vocoders). The training set contains 126.41 hours of real samples and 118.08 hours of synthesized, self-vocoded samples. Table 1 shows the details of the LibriVoc dataset.

Table 1. The number of hours of audio synthesized by each neural vocoder in the LibriVoc dataset.

| Model | train-clean-100 | train-clean-360 | dev-clean | test-clean |
| --- | --- | --- | --- | --- |
| WaveNet | 4.28 | 15.49 | 0.75 | 0.76 |
| WaveRNN | 4.33 | 14.92 | 0.67 | 0.72 |
| MelGAN | 4.36 | 15.26 | 0.71 | 0.76 |
| Parallel WaveGAN | 4.37 | 15.54 | 0.68 | 0.75 |
| WaveGrad | 4.19 | 15.81 | 0.76 | 0.74 |
| Total | 25.69 | 92.39 | 4.19 | 4.39 |

Each vocoder synthesizes waveform samples from a given mel spectrogram extracted from an original sample; we refer to this process as “self-vocoding.” By providing each vocoder with the same mel spectrogram, we ensure that any unique artifacts present in the synthesized samples are attributable to the specific vocoder used to reconstruct the audio signal. We withhold a set of real samples to use as a validation set in the training process. Specifically, we design the LibriVoc dataset as follows:

  1. For 25% of the speakers, the dataset contains only real (original) samples.
  2. For another 25% of the speakers, the dataset contains only synthesized samples.
  3. For each speaker in the remaining 50%, half of that speaker's samples are real and the other half are synthesized. This prevents a classifier from over-fitting to speaker identity during training (a minimal sketch of this split appears after the list).

We further split the whole dataset into three non-overlapping sets for training (33,236 samples), development (5,736 samples), and testing (4,837 samples).
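For readers who want to inspect or reproduce this design, below is a minimal Python sketch of the two steps described above: self-vocoding an utterance from its mel spectrogram and grouping speakers into the 25% / 25% / 50% scheme. The mel parameters, the `vocoder.infer` interface, and the grouping code are illustrative assumptions, not the exact scripts used to build LibriVoc.

```python
import random

import librosa
import numpy as np


def extract_mel(wav_path, sr=24000, n_mels=80):
    """Mel spectrogram of an original utterance; this is what the vocoder is conditioned on.

    LibriTTS audio is 24 kHz; the mel parameters here are common defaults, not
    necessarily the ones used to build LibriVoc.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-9)  # log-mel


def self_vocode(mel, vocoder):
    """Re-synthesize a waveform from the mel spectrogram ("self-vocoding").

    `vocoder` stands in for any of the six models (WaveNet, WaveRNN, MelGAN,
    Parallel WaveGAN, WaveGrad, DiffWave); its `infer` interface is assumed.
    """
    return vocoder.infer(mel)


def split_speakers(speaker_ids, seed=0):
    """Group speakers as in the list above: 25% real-only, 25% synthesized-only, 50% mixed."""
    rng = random.Random(seed)
    ids = sorted(speaker_ids)
    rng.shuffle(ids)
    quarter = len(ids) // 4
    return {
        "real_only": set(ids[:quarter]),
        "synth_only": set(ids[quarter:2 * quarter]),
        "mixed": set(ids[2 * quarter:]),
    }
```

Combining the two pieces: for a speaker in the `mixed` group, half of that speaker's utterances would be passed through `self_vocode` with one of the six vocoders, and the rest would be kept as real.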
