
Speaker verification part

Introduction

The speaker verification part of this repository is based on voxceleb_trainer

Usage

Replace the <name> placeholders in the following code blocks with appropriate values. The following steps are almost the same as those in the file ./local/run_sv.sh.

Note: You can download our trained model and start from step 5 if you only want to use our speaker verification system together with your KWS system to compute the final scores.

1. Environment Setup

$ pip install -r ./sv_part/requirements.txt

2. Generate train lists

$ cd ./sv_part
$ python ./dataprep.py --train_set <path_pvtc_train> --dev_path <path_pvtc_dev>'/task1/wav_data/' --pvtc_trials_path <path_pvtc_dev>'/task1/trials' --utt2label <path_pvtc_dev>'/task1/trials_for_wake' --make_sv_trials --make_list || exit 1
$ cd ../

The train lists and the trial file for the speaker verification model are generated in this step.

3. Pre-train the model

$ cd ./sv_part
$ CUDA_VISIBLE_DEVICES=<gpu> python ./trainSpeakerNet.py --model ResNetSE34v2 --log_input True --encoder_type ASP --trainfunc amsoftmax --save_path <path_to_save> --nClasses 3091 --augment True --n_mels 80 --lr_decay 0.2 --test_interval 15 --lr 0.01 --batch_size 512 --scale 32 --margin 0.2 --train_list <list_pretrain> --test_list <pvtc_trials> --train_path <train_path> --test_path <test_path> --musan_path <musan_path> --rir_path <rir_path> --optimizer sgd 
$ cd ../

Considering the small scale of the competition data, we choose some Mandarin utterances from OpenSLR to pre-train the model. MUSAN (SLR17) and RIRs (SLR28) are also used for dynamic data augmentation to improve the robustness of the system. --augment can be set to False if you do not want to use dynamic data augmentation on the training data.
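
For reference, this kind of on-the-fly augmentation usually mixes in a randomly chosen MUSAN noise segment at a random SNR and/or convolves the utterance with a room impulse response. The sketch below only illustrates the idea and is not the augmentation code used in voxceleb_trainer; the file paths and the 5-20 dB SNR range are assumptions.

# Conceptual sketch of dynamic augmentation (not the voxceleb_trainer implementation).
import random
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix a noise segment into the speech at the requested SNR (in dB)."""
    noise = np.resize(noise, speech.shape)                 # loop/trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve the speech with a room impulse response, keeping the original length."""
    rir = rir / (np.max(np.abs(rir)) + 1e-12)
    return fftconvolve(speech, rir)[: len(speech)]

speech, sr = sf.read("utt.wav")                            # hypothetical input paths
noise, _ = sf.read("musan/noise/noise-sample.wav")
rir, _ = sf.read("RIRS_NOISES/simulated_rirs/rir-sample.wav")

augmented = add_reverb(add_noise(speech, noise, snr_db=random.uniform(5, 20)), rir)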

The pre-train list <list_pretrain> should contain the identity and the file path, one line per utterance, as follows:

aidatatangG1427 aidatatang_200zh/corpus/G1427/T0055G1427S0445.wav
aishell_089 AISHELL-wakeup/AISHELL-B-33/AISHELL-2019B-EVAL/SPEECHDATA/wave/089/089_7_1_062_fast.wav
P00262A ST-CMDS-20170001_1-OS_file/ST-CMDS-20170001_1-OS/20170001P00229A0049.wav
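
If you need to assemble such a list for additional data, a minimal sketch (assuming a hypothetical corpus root with one sub-directory per speaker; adjust it to your own directory layout) is:

# Sketch: write "<speaker_id> <relative wav path>" lines, one line per utterance.
import os

corpus_root = "/path/to/corpus"          # hypothetical corpus root (matches --train_path)
prefix = "my_corpus"                     # optional prefix to keep speaker IDs unique

with open("list_pretrain.txt", "w") as f:
    for speaker in sorted(os.listdir(corpus_root)):
        speaker_dir = os.path.join(corpus_root, speaker)
        if not os.path.isdir(speaker_dir):
            continue
        for root, _, files in os.walk(speaker_dir):
            for name in sorted(files):
                if name.endswith(".wav"):
                    rel_path = os.path.relpath(os.path.join(root, name), corpus_root)
                    f.write(f"{prefix}{speaker} {rel_path}\n")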

In our baseline system, we use SLR38, SLR47, SLR62, SLR82 and part of SLR33 (only the close-talk microphone and one 1 m channel are used) as the pre-train data. The pre-train list we use can be found here (access code: 6jpe).

The trial file <pvtc_trials> for the SV part was generated from the development set of PVTC. Note that this file is only used for evaluating the speaker verification part.

More detailed information about the parameter settings can be found in voxceleb_trainer.

4. Fine-tuning the model

$ cd ./sv_part
$ CUDA_VISIBLE_DEVICES=<gpu> python ./finetune.py --model ResNetSE34v2 --log_input True --encoder_type ASP --trainfunc amsoftmax --save_path <path_to_save_finetuned_model> --nClasses 3091 --nClasses_ft 300 --initial_model <path_pretrained_model> --augment True --n_mels 80 --lr_decay 0.9 --test_interval 10 --lr 0.001 --batch_size 256 --scale 32 --margin 0.2 --train_list <list_pvtc> --test_list <pvtc_trials> --train_path <train_path> --test_path <test_path> --musan_path <musan_path> --rir_path <rir_path> --optimizer sgd 
$ cd ../

The train list <list_pvtc> should contain the speaker identity and the file path, one line per utterance, as follows:

PVTC0103 /Netdata/AudioData/PVTC/official_data/train/PVTC0103/xiaole/3/0353.wav
PVTC0103 /Netdata/AudioData/PVTC/official_data/train/PVTC0103/xiaole/3/0255.wav
PVTC0103 /Netdata/AudioData/PVTC/official_data/train/PVTC0103/xiaole/3/0060.wav
PVTC0103 /Netdata/AudioData/PVTC/official_data/train/PVTC0103/xiaole/3/0446.wav

You can use all the training data in the challenge to train a text-independent speaker verification system, or use only the positive part of the data (samples that contain the wake word '小乐小乐') to train a text-dependent system.
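
For example, a text-dependent list can be derived from the full train list by keeping only the wake-word utterances. In the PVTC layout shown above, the wake-word recordings sit under a xiaole folder, so a simple filter (a sketch based on that assumption) is:

# Sketch: keep only positive (wake-word) utterances for a text-dependent train list.
with open("list_pvtc_all.txt") as fin, open("list_pvtc_td.txt", "w") as fout:
    for line in fin:
        speaker, path = line.strip().split(maxsplit=1)
        if "/xiaole/" in path:                 # wake-word folder, as in the paths above
            fout.write(line)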

We provide the text-dependent fine-tuned model obtained after 35 epochs of pre-training and 10 epochs of fine-tuning.

Model            Task   EER (%)   Download Link
text-dependent   Task1  1.4817    BaiduCloud
text-dependent   Task2  2.5650    BaiduCloud

The access code is ayez

5. Determine system threshold and compute final score

  1. Once the speaker verification system is obtained, the threshold of the system is determined based on the given task.
$ cd ./sv_part
$ CUDA_VISIBLE_DEVICES=<gpu> python ./inference.py --inference  --model ResNetSE34v2 --log_input True --encoder_type ASP --trainfunc amsoftmax --save_path <path_task> --nClasses 300 --augment True --n_mels 80 --lr_decay 0.2 --lr 0.01  --initial_model <finetuned_model> --scale 32 --margin 0.2  --optimizer sgd --devdatapath <path of dev utt> --trials_list <pvtc_trial> --uttpath <path of cut utt> --utt2label <utt2label_template>  --save_dic True
$ cd ../

By running the above command, the threshold of the system and the speaker embeddings of the enrollment data listed in the trial file are calculated and saved under <path_task>. Note that <path_task> is the path for saving the threshold (not the model). We suggest setting the devdatapath parameter to PVTC/official_data/dev/taskX/wav_data/, i.e. the raw audio of the task, so that the embeddings of the enrollment utterances in the trial file can be extracted. The uttpath parameter needs to be set to the positive audio segments cut out by your KWS system.
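
Conceptually, the saved threshold is the score at which the false-acceptance and false-rejection rates on the SV trials are equal (the EER operating point). The sketch below illustrates that computation under the assumption that you already have arrays of trial scores and 0/1 target labels; it is not the exact code in ./inference.py, and the input file names are hypothetical.

# Sketch: find the EER threshold from trial scores and 0/1 labels (1 = target trial).
import numpy as np
from sklearn.metrics import roc_curve

scores = np.load("trial_scores.npy")     # hypothetical file of trial scores
labels = np.load("trial_labels.npy")     # hypothetical 0/1 target labels

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))    # operating point where FAR is closest to FRR
eer = (fpr[idx] + fnr[idx]) / 2
np.save("eer_threshold.npy", thresholds[idx])
print(f"EER = {eer * 100:.4f}%, threshold = {thresholds[idx]:.4f}")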

Note: in order to calculate the threshold of the speaker verification system, we assume the wake-up system is perfect. Therefore, in this step we use the standard utt2label file, whose labels are guaranteed to be correct; it can be found at PVTC/official_data/dev/task1/trials_for_wake. It contains the utterance index and its ground-truth label, as follows:

PVTC_task1_0004.wav positive
PVTC_task1_0008.wav positive
PVTC_task1_0012.wav positive
PVTC_task1_26529.wav negative
PVTC_task1_23626.wav negative
PVTC_task1_22238.wav negative
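
If you want to inspect such a file, it can be read as a simple whitespace-separated mapping; the small helper below is only an illustration and is not part of the repository scripts.

# Sketch: load an utt2label file into a dict {utterance: label} and count the labels.
from collections import Counter

def read_utt2label(path):
    with open(path) as f:
        return dict(line.strip().split(maxsplit=1) for line in f if line.strip())

utt2label = read_utt2label("PVTC/official_data/dev/task1/trials_for_wake")
print(Counter(utt2label.values()))       # e.g. counts of 'positive' and 'negative'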
  2. Now the final score S can be calculated, provided you have the result from your keyword spotting system.
$ cd ./sv_part
$ CUDA_VISIBLE_DEVICES=<gpu> python ./compute_score.py --inference  --model ResNetSE34v2 --log_input True --encoder_type ASP --trainfunc amsoftmax --save_path <path_task> --nClasses 300 --augment True --n_mels 80 --lr_decay 0.2 --lr 0.01  --initial_model <finetuned_model> --scale 32 --margin 0.2  --optimizer sgd  --devdatapath <path of dev utt> --trials_list <pvtc_trial> --uttpath <path of cut utt>  --utt2label <utt2label>  --save_dic False --parameter_savepath "eer_threshold.npy"  --alpha 1,10,15,19,20
$ cd ../

The uttpath parameter needs to be set to the positive audio segments cut out by your KWS system.

The utt2label file here is the classification result of the wake-up system. It should contain the utterance index and its classification result, one line per utterance, as follows:

PVTC_task1_20031.wav non-trigger
PVTC_task1_19779.wav non-trigger
PVTC_task1_14568.wav trigger
PVTC_task1_2348.wav trigger
PVTC_task1_3055.wav trigger
PVTC_task1_1630.wav trigger 

The classification result of the utterance should be 'trigger' or 'non-trigger'.
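
As a reference for producing this file from your own KWS output, here is a minimal sketch; the kws_scores dict and the 0.5 decision threshold are hypothetical placeholders, not values used by the baseline.

# Sketch: write KWS decisions in the utt2label format expected by compute_score.py.
kws_scores = {"PVTC_task1_14568.wav": 0.93, "PVTC_task1_20031.wav": 0.12}  # hypothetical scores
kws_threshold = 0.5                       # hypothetical KWS decision threshold

with open("utt2label_kws", "w") as f:
    for utt, score in sorted(kws_scores.items()):
        label = "trigger" if score >= kws_threshold else "non-trigger"
        f.write(f"{utt} {label}\n")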

Note: the real output of your wake-up system must be used as utt2label in this step!