PyTorch implementation of speech embedding net and loss described here: https://arxiv.org/pdf/1710.10467.pdf.
Also contains code to create embeddings suitable as input for the speaker diarization model found at https://github.com/google/uis-rnn.
The TIMIT speech corpus was used to train the model, found here: https://catalog.ldc.upenn.edu/LDC93S1, or here, https://github.com/philipperemy/timit
- PyTorch 0.4.1
- Python 3.5+
- numpy 1.15.4
- librosa 0.6.1
The Python WebRTC VAD found at https://github.com/wiseman/py-webrtcvad is required to run dvector_create.py, but not to train the neural network.
Change the following config.yaml key to a glob pattern matching all .WAV files in your downloaded TIMIT dataset. The TIMIT .WAV files must be converted to the standard RIFF format for the dvector_create.py script, but not for training the neural network.
unprocessed_data: './TIMIT/*/*/*/*.wav'
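Before preprocessing, it can help to confirm that the pattern actually matches your files. The following is a minimal sketch, assuming the value is expanded with Python's glob module; the directory layout and file names in the demonstration are hypothetical and built in a temporary folder.

```python
import glob
import os
import tempfile


def count_matches(pattern):
    """Return the number of paths matched by a glob pattern."""
    return len(glob.glob(pattern))


# Build a throwaway tree shaped like TIMIT/<split>/<dialect>/<speaker>/<file>.wav
# (hypothetical names, for illustration only).
with tempfile.TemporaryDirectory() as root:
    speaker_dir = os.path.join(root, "TIMIT", "TRAIN", "DR1", "SPK0")
    os.makedirs(speaker_dir)
    for name in ("SA1.wav", "SA2.wav"):
        open(os.path.join(speaker_dir, name), "wb").close()

    pattern = os.path.join(root, "TIMIT", "*", "*", "*", "*.wav")
    matched = count_matches(pattern)

print(matched)
```

On Linux, glob matching is case-sensitive, so a pattern ending in `*.wav` will not match files named `*.WAV`. If the count is zero for your dataset, check the extension case and the directory depth first.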
Run the preprocessing script:
./data_preprocess.py
Two folders will be created, train_tisv and test_tisv, holding .npy files with numpy ndarrays of speaker utterances, split 90%/10% between training and testing.
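A 90%/10% split of this kind can be sketched as follows. This is an illustration only: it assumes the split is made per speaker and in list order, which may differ from what data_preprocess.py does, and `split_speakers` and the speaker IDs are hypothetical names.

```python
def split_speakers(speakers):
    """Split a list 90%/10% using integer arithmetic (assumed per-speaker split)."""
    cut = len(speakers) * 9 // 10
    return speakers[:cut], speakers[cut:]


# Hypothetical speaker IDs for illustration.
speakers = ["spk{:03d}".format(i) for i in range(20)]
train, test = split_speakers(speakers)
print(len(train), len(test))
```

Splitting by speaker rather than by utterance keeps test speakers unseen during training, which is what a speaker verification evaluation usually needs.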
To train the speaker verification model, run:
./train_speech_embedder.py
with the following config.yaml key set to true:
training: !!bool "true"
For testing, set the key value to:
training: !!bool "false"
The log file and checkpoint save locations are controlled by the following values:
log_file: './speech_id_checkpoint/Stats'
checkpoint_dir: './speech_id_checkpoint'
Only TI-SV is implemented.
EER across 10 epochs: 0.0377
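For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one way to approximate it from similarity scores; it is not the computation used in this repository, and the function name and the scores are made up for illustration.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(genuine) | set(impostor)):
        # Impostor trials accepted at this threshold.
        far = sum(s >= thr for s in impostor) / len(impostor)
        # Genuine trials rejected at this threshold.
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up similarity scores for illustration.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)
print(eer)
```

With these scores, the threshold 0.6 accepts one of four impostors and rejects one of four genuine trials, so the sketch returns 0.25.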
After training and testing the model, run dvector.py to create data.pkl.
The file can be loaded and used to train the triplet-loss model.
After creating the d-vectors, we use the triplet loss described here: https://arxiv.org/pdf/1705.02304.pdf to train a model. To do so, run train.py.
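The idea behind a triplet loss can be sketched in a few lines. This is a simplified, pure-Python illustration, assuming cosine similarity and a margin of 0.1; the formulation, margin, and batching in train.py may differ, and the function names and vectors here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Penalize triplets where the negative is not at least `margin`
    less similar to the anchor than the positive is."""
    sim_ap = cosine_similarity(anchor, positive)
    sim_an = cosine_similarity(anchor, negative)
    return max(0.0, sim_an - sim_ap + margin)


# Toy embeddings: same speaker for anchor/positive, a different one for negative.
anchor, positive, negative = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
satisfied = triplet_loss(anchor, positive, negative)  # well separated
violated = triplet_loss(anchor, negative, positive)   # roles swapped
print(satisfied, violated)
```

The loss is zero once same-speaker embeddings are more similar than different-speaker embeddings by at least the margin, so training effort concentrates on the triplets that still violate it.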
To run inference on speakers, run cli.py.