VERSA (Versatile Evaluation of Speech and Audio) is a toolkit dedicated to collecting evaluation metrics for speech and audio quality. Our goal is to provide a comprehensive connection to cutting-edge techniques developed for evaluation. The toolkit is also tightly integrated with ESPnet.
Colab demonstration at the Interspeech 2024 tutorial
The base installation is as simple as:
git clone https://github.com/shinjiwlab/versa.git
cd versa
pip install .
For collection purposes, rather than re-distributing models, VERSA aligns as closely as possible with the original APIs provided by the algorithm developers. As a result, the toolkit has many dependencies. We include as many as possible by default, but some metrics have specific installation requirements. Please refer to the list-of-metrics section below for details on whether a metric is installed automatically. If it is not, we provide an installation guide or installers in tools.
# test general metrics included in the default installation
python versa/test/test_general.py
# test metrics with additional installation
python versa/test/test_{metric}.py
Simple usage for a few samples:
# direct usage
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1 \
--pred test/test_samples/test2 \
--output_file test_result
# with scp-style input
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result
# with kaldi-ark style
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--io kaldi
# with text information
python versa/bin/scorer.py \
--score_config egs/separate_metrics/wer.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--text test/test_samples/text
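The .scp inputs above follow the Kaldi script-file convention: each line maps an utterance ID to an audio path. As a small illustrative sketch (the function name and directory layout are hypothetical, not part of VERSA), such a file could be generated from a directory of wav files like this:

```python
from pathlib import Path

def write_scp(wav_dir: str, scp_path: str) -> int:
    """Write a Kaldi-style scp file: one "utt_id /abs/path.wav" per line.

    Uses the file stem as the utterance ID; returns the number of entries.
    """
    wavs = sorted(Path(wav_dir).glob("*.wav"))
    with open(scp_path, "w") as f:
        for wav in wavs:
            f.write(f"{wav.stem} {wav.resolve()}\n")
    return len(wavs)
```

For example, `write_scp("test/test_samples/test2", "pred.scp")` would produce a file usable as the `--pred` argument above.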
Use the launcher with Slurm job submission:
# use the launcher
# Option 1: with gt speech
./launch.sh \
<pred_speech_scp> \
<gt_speech_scp> \
<score_dir> \
<split_job_num>
# Option 2: without gt speech
./launch.sh \
<pred_speech_scp> \
None \
<score_dir> \
<split_job_num>
# aggregate the results
cat <score_dir>/result/*.result.cpu.txt > <score_dir>/utt_result.cpu.txt
cat <score_dir>/result/*.result.gpu.txt > <score_dir>/utt_result.gpu.txt
# show result
python scripts/show_result.py <score_dir>/utt_result.cpu.txt
python scripts/show_result.py <score_dir>/utt_result.gpu.txt
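scripts/show_result.py is the supported way to summarize results. Purely as an illustration of the aggregation idea, here is a hypothetical sketch that assumes each line of utt_result.cpu.txt pairs an utterance ID with a Python-dict-style record of scores; inspect your actual result files and adapt the parsing, since the real layout may differ:

```python
import ast
from collections import defaultdict

def average_scores(result_path: str) -> dict:
    """Average numeric metrics over lines like: "utt1 {'pesq': 3.1, ...}".

    NOTE: this line layout is an assumption -- check your result files
    and adjust the parsing accordingly.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    with open(result_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            _, _, record = line.partition(" ")
            scores = ast.literal_eval(record)  # dict-literal payload
            for key, value in scores.items():
                if isinstance(value, (int, float)):
                    sums[key] += value
                    counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```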
See egs/*.yaml for example configurations for different setups.
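As a hypothetical illustration (the metric names are real "Key in config" values from the tables below, but the exact schema and per-metric options should be verified against egs/speech.yaml), a minimal configuration selecting a few metrics might look like:

```yaml
# Hypothetical minimal score config -- metric names come from the
# "Key in config" column below; verify the schema against egs/speech.yaml.
- name: pesq
- name: stoi
- name: signal_metric
- name: pseudo_mos
```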
An "x" in the Auto-Install column indicates that the metric is installed automatically with VERSA.
Number | Auto-Install | Metric Name | Key in config | Key in report | Code Source | References |
---|---|---|---|---|---|---|
1 | x | Deep Noise Suppression MOS Score of P.835 (DNSMOS) | pseudo_mos | dnsmos_overall | speechmos (MS) | paper |
2 | x | Deep Noise Suppression MOS Score of P.808 (DNSMOS) | pseudo_mos | dnsmos_p808 | speechmos (MS) | paper |
3 | | Non-intrusive Speech Quality and Naturalness Assessment (NISQA) | | | NISQA | paper |
4 | x | UTokyo-SaruLab System for VoiceMOS Challenge 2022 (UTMOS) | pseudo_mos | utmos | speechmos | paper |
5 | x | Packet Loss Concealment-related MOS Score (PLCMOS) | pseudo_mos | plcmos | speechmos (MS) | paper |
6 | x | PESQ in TorchAudio-Squim | squim_no_ref | torch_squim_pesq | torch_squim | paper |
7 | x | STOI in TorchAudio-Squim | squim_no_ref | torch_squim_stoi | torch_squim | paper |
8 | x | SI-SDR in TorchAudio-Squim | squim_no_ref | torch_squim_si_sdr | torch_squim | paper |
9 | x | Singing voice MOS | singmos | singmos | singmos | paper |
10 | x | Sheet SSQA MOS Models | sheet_ssqa | sheet_ssqa | Sheet | paper |
11 | | UTMOSv2: UTokyo-SaruLab MOS Prediction System | utmosv2 | utmosv2 | UTMOSv2 | paper |
12 | | Speech Contrastive Regression for Quality Assessment without reference (ScoreQ) | scoreq_nr | scoreq_nr | ScoreQ | paper |
13 | x | Speech enhancement-based SI-SNR | se_snr | se_si_snr | ESPnet | |
14 | x | Speech enhancement-based CI-SDR | se_snr | se_ci_sdr | ESPnet | |
15 | x | Speech enhancement-based SAR | se_snr | se_sar | ESPnet | |
16 | x | Speech enhancement-based SDR | se_snr | se_sdr | ESPnet | |
17 | x | PAM: Prompting Audio-Language Models for Audio Quality Assessment | pam | pam | PAM | Paper |
18 | | Speech-to-Reverberation Modulation energy Ratio (SRMR) | srmr | srmr | SRMRpy | Paper |
19 | x | Voice Activity Detection (VAD) | vad | vad_info | SileroVAD | |
20 | | Speaker Turn Taking (SPK-TT) | | | | |
21 | x | Speaker Word Rate (SWR) | | | | |
22 | x | Anti-spoofing Score (SpoofS) with AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks | asvspoof_score | asvspoof_score | AASIST | Paper |
Number | Auto-Install | Metric Name | Key in config | Key in report | Code Source | References |
---|---|---|---|---|---|---|
1 | x | Mel Cepstral Distortion (MCD) | mcd_f0 | mcd | espnet and s3prl-vc | paper |
2 | x | F0 Correlation | mcd_f0 | f0_corr | espnet and s3prl-vc | paper |
3 | x | F0 Root Mean Square Error | mcd_f0 | f0_rmse | espnet and s3prl-vc | paper |
4 | x | Signal-to-interference Ratio (SIR) | signal_metric | sir | espnet | - |
5 | x | Signal-to-artifact Ratio (SAR) | signal_metric | sar | espnet | - |
6 | x | Signal-to-distortion Ratio (SDR) | signal_metric | sdr | espnet | - |
7 | x | Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) | signal_metric | ci-sdr | ci_sdr | paper |
8 | x | Scale-invariant signal-to-noise ratio (SI-SNR) | signal_metric | si-snr | espnet | paper |
9 | x | Perceptual Evaluation of Speech Quality (PESQ) | pesq | pesq | pesq | paper |
10 | x | Short-Time Objective Intelligibility (STOI) | stoi | stoi | pystoi | paper |
11 | x | Speech BERT Score | discrete_speech | speech_bert | discrete speech metric | paper |
12 | x | Discrete Speech BLEU Score | discrete_speech | speech_belu | discrete speech metric | paper |
13 | x | Discrete Speech Token Edit Distance | discrete_speech | speech_token_distance | discrete speech metric | paper |
14 | | Dynamic Time Warping Cost Metric | warpq | warpq | WARP-Q | paper |
15 | | Speech Contrastive Regression for Quality Assessment with reference (ScoreQ) | scoreq_ref | scoreq_ref | ScoreQ | paper |
16 | | 2f-Model | | | | |
17 | x | Log-Weighted Mean Square Error | log_wmse | log_wmse | log_wmse | |
18 | x | ASR-oriented Mismatch Error Rate (ASR-Mismatch) | ||||
19 | | Virtual Speech Quality Objective Listener (VISQOL) | visqol | visqol | google-visqol | paper |
20 | | Frequency-Weighted SEGmental SNR (FWSEGSNR) | pysepm | pysepm_fwsegsnr | pysepm | Paper |
21 | | Weighted Spectral Slope (WSS) | pysepm | pysepm_wss | pysepm | Paper |
22 | | Cepstrum Distance Objective Speech Quality Measure (CD) | pysepm | pysepm_cd | pysepm | Paper |
23 | | Composite Objective Speech Quality (composite) | pysepm | pysepm_Csig, pysepm_Cbak, pysepm_Covl | pysepm | Paper |
24 | | Coherence and speech intelligibility index (CSII) | pysepm | pysepm_csii_high, pysepm_csii_mid, pysepm_csii_low | pysepm | Paper |
25 | | Normalized-covariance measure (NCM) | pysepm | pysepm_ncm | pysepm | Paper |
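To make the signal-level metrics above concrete, here is a from-scratch NumPy sketch of SI-SNR (row 8): project the estimate onto the reference to obtain the scaled target, then take the energy ratio between the target and the residual in decibels. This illustrates the standard definition only; it is not VERSA's implementation.

```python
import numpy as np

def si_snr(ref: np.ndarray, est: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (dB), per the usual definition."""
    ref = ref - ref.mean()  # remove DC offset, as commonly done
    est = est - est.mean()
    # Scaled projection of the estimate onto the reference signal.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    )
```

Because the estimate is projected onto the reference, rescaling the estimate leaves the score unchanged, which is exactly the "scale-invariant" property.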
Number | Auto-Install | Metric Name | Key in config | Key in report | Code Source | References |
---|---|---|---|---|---|---|
1 | | NORESQA: A Framework for Speech Quality Assessment using Non-Matching References | noresqa | noresqa | Noresqa | Paper |
2 | x | MOS in TorchAudio-Squim | squim_ref | torch_squim_mos | torch_squim | paper |
3 | x | ESPnet Speech Recognition-based Error Rate | espnet_wer | espnet_wer | ESPnet | paper |
4 | x | ESPnet-OWSM Speech Recognition-based Error Rate | owsm_wer | owsm_wer | ESPnet | paper |
5 | x | OpenAI-Whisper Speech Recognition-based Error Rate | whisper_wer | whisper_wer | Whisper | paper |
6 | | Emotion2vec similarity (emo2vec) | emo2vec_similarity | emotion_similarity | emo2vec | paper |
7 | x | Speaker Embedding Similarity | speaker | spk_similarity | espnet | paper |
8 | | NOMAD: Unsupervised Learning of Perceptual Embeddings For Speech Enhancement and Non-Matching Reference Audio Quality Assessment | nomad | nomad | Nomad | paper |
9 | | Contrastive Language-Audio Pretraining Score (CLAP Score) | clap_score | clap_score | fadtk | paper |
10 | | Accompaniment Prompt Adherence (APA) | apa | apa | Sony-audio-metrics | paper |
11 | | Log Likelihood Ratio (LLR) | pysepm | pysepm_llr | pysepm | Paper |
Number | Auto-Install | Metric Name | Key in config | Key in report | Code Source | References |
---|---|---|---|---|---|---|
1 | | Frechet Audio Distance (FAD) | fad | fad | fadtk | paper |
2 | | Kullback-Leibler Divergence on Embedding Distribution | kl_embedding | kl_embedding | Stability-AI | |
3 | | Audio Density Score | audio_density_coverage | audio_density | Sony-audio-metrics | paper |
4 | | Audio Coverage Score | audio_density_coverage | audio_coverage | Sony-audio-metrics | paper |
5 | | KID: Kernel Distance Metric for Audio/Music Quality | | | KID | Paper |
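The Fréchet Audio Distance above fits Gaussians to two sets of audio embeddings and compares them with the Fréchet (2-Wasserstein) distance between Gaussians. As a sketch of that final step only, assuming you already have two (n_samples, dim) embedding arrays (the fadtk toolkit handles embedding extraction and the exact procedure):

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (n_samples, dim) sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Matrix square root of the covariance product; numerical noise can
    # introduce tiny imaginary parts, which we discard.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Identical embedding sets give a distance near zero, while mean shifts and covariance mismatches both increase it.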
If you find this repo useful, please cite the following papers:
@misc{shi2024versaversatileevaluationtoolkit,
title={VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music},
author={Jiatong Shi and Hye-jin Shim and Jinchuan Tian and Siddhant Arora and Haibin Wu and Darius Petermann and Jia Qi Yip and You Zhang and Yuxun Tang and Wangyou Zhang and Dareen Safar Alharthi and Yichen Huang and Koichi Saito and Jionghao Han and Yiwen Zhao and Chris Donahue and Shinji Watanabe},
year={2024},
eprint={2412.17667},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2412.17667},
}
@misc{shi2024espnetcodeccomprehensivetrainingevaluation,
title={ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech},
author={Jiatong Shi and Jinchuan Tian and Yihan Wu and Jee-weon Jung and Jia Qi Yip and Yoshiki Masuyama and William Chen and Yuning Wu and Yuxun Tang and Massa Baali and Dareen Alharhi and Dong Zhang and Ruifan Deng and Tejes Srivastava and Haibin Wu and Alexander H. Liu and Bhiksha Raj and Qin Jin and Ruihua Song and Shinji Watanabe},
year={2024},
eprint={2409.15897},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2409.15897},
}
We sincerely thank the developers of all the open-source implementations listed at https://github.com/shinjiwlab/versa/tree/main#list-of-metrics