Quality Benchmarks
For your convenience, we provide a set of benchmarks on publicly available datasets. We chose Google's STT as a decent approximation of a high quality enterprise solution available commercially and in many languages.
Our approach is described in this article.
Unlike many off-the-shelf solutions, our models (especially the Enterprise Edition models) generalize across the following domains:
- Video;
- Lectures;
- Narration;
- Phone calls;
- Various noises, codecs, recording methods and conditions;
Any "in-the-wild" speech with sufficient SNR and recording quality should work reasonably well by design. The main caveat is that our models work poorly with far-field audio and extremely noisy audio.
Though our models work fine with 8 kHz audio (phone calls), for simplicity we always resample to 16 kHz. Robustness is built into the models themselves.
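To illustrate what bringing 8 kHz audio to 16 kHz involves, here is a minimal pure-Python sketch based on linear interpolation. It is an assumption for illustration only: the resampler actually used is not described on this page, and a real pipeline would normally rely on a band-limited resampler from an audio library. The function name `resample_linear` is hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (crude, illustrative only)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * step
        i = int(pos)
        if i >= last:
            # Past the final input sample: hold the last value.
            out.append(float(samples[last]))
            continue
        frac = pos - i
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
    return out


# Upsample four 8 kHz samples to 16 kHz (twice as many samples).
upsampled = resample_linear([0.0, 1.0, 0.0, -1.0], 8000, 16000)
print(len(upsampled), upsampled)
```

Linear interpolation is only reasonable for upsampling; going from a higher to a lower rate requires low-pass filtering first to avoid aliasing.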
Be prepared that the CE model sometimes has a hard time producing visually pleasing transcriptions, though the results are phonetically similar.
This is usually solved in one of the following ways:
- Limiting the model to a very narrow domain (e.g. speech commands);
- Adding an external traditional (n-gram) or more modern (DL-based) language model and performing some sort of fusion / re-scoring;
- Using a much larger (hence slower) model;
The first and third options contradict our design philosophy and in general limit the real-life applicability of models. We are firm believers that technology should be embarrassingly simple to use (i.e. one line of code). Naturally, we have solved these challenges in the EE models, but at this stage we are still not ready to publish embarrassingly simple EE models that fulfill the same criterion (i.e. the whole compute graph triggered by one line of code).
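The language-model re-scoring option mentioned above can be sketched as follows. This is a toy illustration under stated assumptions, not the method used in the EE models: the n-best list, the acoustic scores, the unigram probabilities, and the names `rescore`, `toy_lm_logprob` and `lm_weight` are all hypothetical.

```python
import math

# Hypothetical unigram probabilities standing in for a real language model.
TOY_UNIGRAMS = {
    "recognize": 0.02, "speech": 0.03,
    "wreck": 0.001, "a": 0.05, "nice": 0.01, "beach": 0.005,
}


def toy_lm_logprob(text):
    """Log-probability of a word sequence under the toy unigram model."""
    return sum(math.log(TOY_UNIGRAMS.get(word, 1e-6)) for word in text.split())


def rescore(nbest, lm_logprob, lm_weight=0.5):
    """Return the hypothesis maximizing acoustic score + lm_weight * LM score."""
    best_text, _ = max(nbest, key=lambda item: item[1] + lm_weight * lm_logprob(item[0]))
    return best_text


# Two phonetically similar hypotheses with made-up acoustic log-scores.
nbest = [("wreck a nice beach", -4.0), ("recognize speech", -4.5)]

without_lm = rescore(nbest, toy_lm_logprob, lm_weight=0.0)
with_lm = rescore(nbest, toy_lm_logprob, lm_weight=0.5)
print(without_lm, "->", with_lm)
```

With the weight set to zero the acoustically best hypothesis wins; with a positive weight the language model pushes the choice towards the more plausible word sequence.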
- Google was used as a main reference in terms of quality;
- CE = Community Edition;
- EE = Enterprise Edition;
All of the metrics below are WER (word error rate), in percent.
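For reference, WER is the word-level edit distance (substitutions, deletions and insertions) between the reference and the hypothesis, divided by the number of reference words. Below is a minimal sketch of that definition. It assumes plain whitespace tokenization; the text normalization applied before scoring in these benchmarks is not described here, and the function name `wer` is just illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example = wer("the cat sat on the mat", "the cat sit on mat")
print(f"{example * 100:.1f}")
```

The tables report this value multiplied by 100.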
Simple WER Version Comparison Table
Dataset | V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|---|
AudioBooks | | | | | |
lj | | | 5.4 | 5.6 | 5.1 |
librispeech_test_clean | 6.9 | 6.9 | 5.9 | 6.1 | 5.5 |
librispeech_val | 11.5 | 11.7 | 9.7 | 10 | 8.8 |
librispeech_test_other | 17.1 | 17.4 | 15.1 | 15.2 | 13.5 |
mls_test | | | 17.9 | 17.1 | 14.8 |
mls_dev | | | 15.8 | 15.3 | 13.3 |
Lecture / speech | | | | | |
multi_ted_test_he | 12 | 11.5 | 12.1 | 10.2 | 8.4 |
multi_ted_test_common | 17.6 | 17.3 | 18.3 | 15.5 | 14 |
multi_ted_val | 20 | 19.9 | 21 | 18.4 | 16.9 |
voxpopuli_dev | | | 20.8 | 19.4 | 16 |
voxpopuli_test | | | 21.4 | 20.5 | 16.4 |
Finance | | | | | |
kensho | | | 8.1 | 5.9 | 4.3 |
In the wild | | | | | |
common_voice_val | 20.6 | 20.3 | 18.5 | 15.8 | 15 |
common_voice_test | 25.5 | 25.3 | 23.3 | 20.3 | 19.2 |
gigaspeech | | | | | 20.7 |
VOIP / calls | | | | | |
voip_test | | | 21 | 18.7 | 18.3 |
Dialects | | | | | |
UK dialects mean | 14.6 | 14.6 | 12.6 | 10.9 | 10.4 |
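To read the version comparison in relative terms, the small sketch below computes the relative WER reduction between the V1 and V5 columns for three rows that have values in both. The helper name `relative_wer_reduction` is illustrative, and the numbers are copied from the table above.

```python
def relative_wer_reduction(old_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent (positive means the new model is better)."""
    if old_wer <= 0:
        raise ValueError("old_wer must be positive")
    return (old_wer - new_wer) / old_wer * 100.0


# (V1, V5) WER pairs taken from the comparison table.
pairs = {
    "librispeech_test_clean": (6.9, 5.5),
    "common_voice_test": (25.5, 19.2),
    "UK dialects mean": (14.6, 10.4),
}

reductions = {name: relative_wer_reduction(v1, v5) for name, (v1, v5) in pairs.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative WER reduction from V1 to V5")
```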
All of these tests were run in early September 2020.
Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|
AudioBooks | ||||
en_v001_librispeech_test_clean | 8.6 | 6.9 | 7.8 | 8.7 |
en_librispeech_val | 14.4 | 11.5 | 11.3 | 13.1 |
en_librispeech_test_other | 20.6 | 17.1 | 16.2 | 19.1 |
Lecture / speech | ||||
en_multi_ted_test_he | 16.6 | 12.0 | 15.3 | 14.1 |
en_multi_ted_test_common | 21.2 | 17.6 | 16.9 | 16 |
en_multi_ted_val | 23.5 | 20 | 22.7 | 20.8 |
In the wild | ||||
en_common_voice_val | 27.5 | 20.6 | 20.8 | 20.8 |
en_common_voice_test | 32.6 | 25.5 | 22.2 | 24 |
VOIP / calls | ||||
en_voip_test | 9 | 8.6 | 19.7 | 18.3 |
British Dialects | ||||
en_uk_dialects_midlands_english_female | 16.7 | 10.8 | 9.6 | 8.4 |
en_uk_dialects_southern_english_female | 16.7 | 11.4 | 10.8 | 9.3 |
en_uk_dialects_welsh_english_female | 17.1 | 12.1 | 20.5 | 10.5 |
en_uk_dialects_southern_english_male | 17.9 | 12.7 | 11.5 | 10.6 |
en_uk_dialects_welsh_english_male | 18.6 | 13.2 | 12.1 | |
en_uk_dialects_northern_english_male | 20 | 13.9 | 15.5 | 11.7 |
en_uk_dialects_scottish_english_male | 21.3 | 15.1 | 10 | 11.3 |
en_uk_dialects_midlands_english_male | 21.7 | 15.1 | 11.8 | 10.3 |
en_uk_dialects_northern_english_female | 22 | 15.2 | 15 | 12.7 |
en_uk_dialects_scottish_english_female | 22.2 | 15.7 | 13.5 | 12.6 |
en_uk_dialects_irish_english_male | 32.7 | 25.5 | 25.5 | 21.9 |
Far-field / very noisy | ||||
en_voices_rm2_clo_none_stu_manifest | 17.3 | 13.7 | 21.5 | 27 |
en_voices_rm2_far_none_lav_manifest | 31.4 | 26.5 | 27.5 | 42.3 |
en_voices_rm4_far_none_stu_manifest | 33.5 | 28.7 | 43.2 | 43.2 |
en_voices_rm3_clo_none_stu_manifest | 34.5 | 29.9 | 28.6 | 40.8 |
en_voices_rm2_far_musi_stu_manifest | 35.4 | 30.9 | 30.6 | 42.4 |
en_voices_rm2_far_babb_stu_manifest | 39.3 | 35.0 | 38.5 | 48.2 |
en_voices_rm3_clo_musi_stu_manifest | 46.9 | 43 | 38.1 | 51.8 |
en_voices_rm4_ceo_none_lav_manifest | 50.3 | 46.4 | 42.9 | 52.5 |
en_voices_rm3_far_none_stu_manifest | 78.9 | 78.3 | 68.8 | 81.6 |
en_nsc_val_manifest_part1 | 31.7 | 24.4 | NA | NA |
en_nsc_val_manifest_part2 | 67.0 | 60.9 | NA | NA |
Google tests were run in early September 2020.
EN V2 metrics updated in early November 2020.
Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|
AudioBooks | ||||
en_v001_librispeech_test_clean | 8.7 | 6.9 | 7.8 | 8.7 |
en_librispeech_val | 14.5 | 11.7 | 11.3 | 13.1 |
en_librispeech_test_other | 20.6 | 17.4 | 16.2 | 19.1 |
Lecture / speech | ||||
en_multi_ted_test_he | 15.0 | 11.5 | 15.3 | 14.1 |
en_multi_ted_test_common | 20.7 | 17.3 | 16.9 | 16 |
en_multi_ted_val | 22.9 | 19.9 | 22.7 | 20.8 |
In the wild | ||||
en_common_voice_val | 27.1 | 20.3 | 20.8 | 20.8 |
en_common_voice_test | 32.1 | 25.3 | 22.2 | 24 |
VOIP / calls | ||||
en_voip_test | 11.4 | 10.8 | 19.7 | 18.3 |
British Dialects | ||||
en_uk_dialects_midlands_english_female | 15.7 | 10.4 | 9.6 | 8.4 |
en_uk_dialects_southern_english_female | 16.6 | 11.6 | 10.8 | 9.3 |
en_uk_dialects_welsh_english_female | 16.9 | 11.9 | 20.5 | 10.5 |
en_uk_dialects_southern_english_male | 17.4 | 12.6 | 11.5 | 10.6 |
en_uk_dialects_welsh_english_male | 17.8 | 13.1 | 12.1 | |
en_uk_dialects_northern_english_male | 19.7 | 13.7 | 15.5 | 11.7 |
en_uk_dialects_scottish_english_male | 20.5 | 14.6 | 10 | 11.3 |
en_uk_dialects_midlands_english_male | 21.4 | 16.1 | 11.8 | 10.3 |
en_uk_dialects_northern_english_female | 21.3 | 15.5 | 15 | 12.7 |
en_uk_dialects_scottish_english_female | 21.8 | 15.4 | 13.5 | 12.6 |
en_uk_dialects_irish_english_male | 32.5 | 25.7 | 25.5 | 21.9 |
Far-field / very noisy | ||||
en_voices_rm2_clo_none_stu_manifest | 17.5 | 14.1 | 21.5 | 27 |
en_voices_rm2_far_none_lav_manifest | 31.6 | 27.0 | 27.5 | 42.3 |
en_voices_rm4_far_none_stu_manifest | 33.7 | 29.3 | 43.2 | 43.2 |
en_voices_rm3_clo_none_stu_manifest | 34.7 | 30.4 | 28.6 | 40.8 |
en_voices_rm2_far_musi_stu_manifest | 35.9 | 31.5 | 30.6 | 42.4 |
en_voices_rm2_far_babb_stu_manifest | 39.8 | 35.7 | 38.5 | 48.2 |
en_voices_rm3_clo_musi_stu_manifest | 47.2 | 43.5 | 38.1 | 51.8 |
en_voices_rm4_ceo_none_lav_manifest | 50.0 | 46.3 | 42.9 | 52.5 |
en_voices_rm3_far_none_stu_manifest | 78.3 | 78.0 | 68.8 | 81.6 |
en_nsc_val_manifest_part1 | 18.3 | 13.9 | NA | NA |
en_nsc_val_manifest_part2 | 31.7 | 28.5 | NA | NA |
Google tests were run in early September 2020.
EN V3 metrics updated in April 2021.
Dataset | Silero xsmall_q CE | Silero xsmall CE | Silero small_q CE | Silero small CE | Silero large CE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|---|---|---|
AudioBooks / narration | |||||||
lj | 11.5 | 10.2 | 8.6 | 7.9 | 6.6 | ||
librispeech_test_clean | 14.3 | 12.1 | 11.1 | 9.7 | 7.4 | 7.8 | 8.7 |
librispeech_val | 21.0 | 18.4 | 16.9 | 15.2 | 11.9 | 11.3 | 13.1 |
librispeech_test_other | 29.0 | 25.7 | 23.8 | 21.6 | 17.9 | 16.2 | 19.1 |
aru | 21.3 | 18.5 | 16.9 | 14.4 | 11.1 | 16.2 | 19.1 |
mls_test | 32.0 | 29.2 | 27.3 | 25.2 | 22.0 | ||
mls_dev | 29.6 | 26.7 | 24.6 | 22.7 | 19.7 | ||
Lecture / speech | |||||||
multi_ted_test_he | 25.9 | 23.1 | 20.6 | 19.0 | 15.8 | 15.3 | 14.1 |
multi_ted_test_common | 34.3 | 30.9 | 28.1 | 25.8 | 21.5 | 16.9 | 16.0 |
multi_ted_val | 34.6 | 31.5 | 29.4 | 27.7 | 23.9 | 22.7 | 20.8 |
voxpopuli_dev | 35.2 | 32.6 | 30.6 | 28.7 | 25.0 | ||
voxpopuli_test | 36.3 | 34.1 | 31.7 | 30.1 | 26.4 | ||
Finance | |||||||
kensho | 21.3 | 18.8 | 15.3 | 13.8 | 10.0 | ||
In the wild | |||||||
common_voice_val | 37.8 | 35.1 | 31.2 | 28.8 | 25.3 | 20.8 | 20.8 |
common_voice_test | 42.2 | 39.5 | 35.9 | 33.5 | 30.1 | 22.2 | 24 |
VOIP / calls | |||||||
voip_test | 32.7 | 31.7 | 23.7 | 23.7 | 21.2 | 19.7 | 18.3 |
Dialects | |||||||
uk_dialects_midlands_english_female | 26.0 | 23.1 | 21.3 | 19.6 | 13.6 | 9.6 | 8.4 |
uk_dialects_southern_english_female | 26.7 | 23.6 | 20.9 | 18.9 | 14.2 | 10.8 | 9.3 |
uk_dialects_welsh_english_female | 25.6 | 22.6 | 19.8 | 18.3 | 14.2 | 20.5 | 10.5 |
uk_dialects_southern_english_male | 27.7 | 24.7 | 22.2 | 20.0 | 15.0 | 11.5 | 10.6 |
uk_dialects_welsh_english_male | 27.8 | 25.3 | 22.6 | 20.5 | 16.6 | 12.1 | |
uk_dialects_northern_english_male | 31.3 | 28.2 | 24.8 | 23.0 | 17.2 | 15.5 | 11.7 |
uk_dialects_scottish_english_male | 32.0 | 28.8 | 25.1 | 23.2 | 17.8 | 10 | 11.3 |
uk_dialects_midlands_english_male | 33.1 | 30.2 | 26.5 | 24.3 | 18.0 | 11.8 | 10.3 |
uk_dialects_northern_english_female | 33.2 | 30.1 | 26.6 | 24.3 | 19.3 | 15 | 12.7 |
uk_dialects_scottish_english_female | 31.3 | 28.6 | 25.4 | 23.5 | 18.6 | 13.5 | 12.6 |
uk_dialects_irish_english_male | 42.7 | 40.2 | 36.8 | 34.1 | 29.3 | 25.5 | 21.9 |
nsc_val_manifest_part1 | |||||||
Far-field / very noisy | |||||||
voices_rm2_clo_none_stu | 25.6 | 22.4 | 19.7 | 17.5 | 14.2 | 21.5 | 27 |
voices_rm2_far_none_lav | 41.5 | 37.2 | 32.1 | 29.0 | 25.7 | 27.5 | 42.3 |
voices_rm4_far_none_stu | 46.1 | 41.4 | 36.5 | 33.1 | 30.1 | 43.2 | 43.2 |
voices_rm3_clo_none_stu | 43.2 | 38.9 | 35.0 | 32.1 | 28.9 | 28.6 | 40.8 |
voices_rm2_far_musi_stu | 46.0 | 41.6 | 37.0 | 33.6 | 30.3 | 30.6 | 42.4 |
voices_rm2_far_babb_stu | 50.6 | 46.3 | 41.0 | 37.9 | 34.7 | 38.5 | 48.2 |
voices_rm3_clo_musi_stu | 55.1 | 51.0 | 47.6 | 44.7 | 41.7 | 38.1 | 51.8 |
voices_rm4_ceo_none_lav | 60.7 | 56.2 | 52.4 | 49.0 | 45.3 | 42.9 | 52.5 |
voices_rm3_far_none_stu | 82.0 | 79.5 | 76.4 | 73.8 | 71.8 | 68.8 | 81.6 |
Dataset | Silero xsmall_q EE | Silero xsmall EE | Silero small_q EE | Silero small EE | Silero large EE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|---|---|---|
AudioBooks / narration | |||||||
lj | 6.8 | 6.3 | 5.9 | 5.6 | 5.4 | ||
librispeech_test_clean | 9.6 | 8.3 | 7.7 | 7.0 | 5.9 | 7.8 | 8.7 |
librispeech_val | 15.0 | 13.2 | 12.4 | 11.2 | 9.7 | 11.3 | 13.1 |
librispeech_test_other | 21.7 | 19.2 | 17.9 | 16.5 | 15.1 | 16.2 | 19.1 |
aru | 13.7 | 11.7 | 11.0 | 9.7 | 8.2 | 16.2 | 19.1 |
mls_test | 24.4 | 22.1 | 20.9 | 19.3 | 17.9 | ||
mls_dev | 22.0 | 19.8 | 18.5 | 17.2 | 15.8 | ||
Lecture / speech | |||||||
multi_ted_test_he | 19.0 | 16.6 | 14.8 | 14.1 | 12.1 | 15.3 | 14.1 |
multi_ted_test_common | 28.1 | 24.9 | 22.7 | 21.1 | 18.3 | 16.9 | 16.0 |
multi_ted_val | 29.3 | 26.2 | 24.8 | 23.2 | 21.0 | 22.7 | 20.8 |
voxpopuli_dev | 25.7 | 24.4 | 23.5 | 22.4 | 20.8 | ||
voxpopuli_test | 26.1 | 25.0 | 24.1 | 23.0 | 21.4 | ||
Finance | |||||||
kensho | 14.0 | 12.3 | 10.6 | 9.7 | 8.1 | ||
In the wild | |||||||
common_voice_val | 25.7 | 24.0 | 21.4 | 20.1 | 18.5 | 20.8 | 20.8 |
common_voice_test | 30.9 | 29.0 | 26.4 | 24.9 | 23.3 | 22.2 | 24 |
VOIP / calls | |||||||
voip_test | 29.1 | 29.0 | 24.0 | 23.6 | 21.0 | 19.7 | 18.3 |
Dialects | |||||||
uk_dialects_midlands_english_female | 15.5 | 13.8 | 12.5 | 10.8 | 8.8 | 9.6 | 8.4 |
uk_dialects_southern_english_female | 16.4 | 14.7 | 13.1 | 11.9 | 9.9 | 10.8 | 9.3 |
uk_dialects_welsh_english_female | 15.8 | 14.3 | 12.0 | 12.8 | 10.7 | 20.5 | 10.5 |
uk_dialects_southern_english_male | 17.6 | 15.7 | 14.1 | 12.9 | 10.6 | 11.5 | 10.6 |
uk_dialects_welsh_english_male | 17.9 | 16.4 | 14.5 | 13.9 | 12.1 | 12.1 | |
uk_dialects_northern_english_male | 19.8 | 17.9 | 15.7 | 14.6 | 12.0 | 15.5 | 11.7 |
uk_dialects_scottish_english_male | 20.5 | 18.4 | 15.9 | 14.9 | 12.7 | 10 | 11.3 |
uk_dialects_midlands_english_male | 22.6 | 20.2 | 17.6 | 16.0 | 12.2 | 11.8 | 10.3 |
uk_dialects_northern_english_female | 21.1 | 18.9 | 16.3 | 15.8 | 13.4 | 15 | 12.7 |
uk_dialects_scottish_english_female | 20.1 | 18.2 | 16.5 | 15.2 | 12.8 | 13.5 | 12.6 |
uk_dialects_irish_english_male | 31.4 | 29.6 | 28.1 | 26.3 | 23.7 | 25.5 | 21.9 |
nsc_val_manifest_part1 | 10.0 | 9.3 | 8.3 | ||||
Far-field / very noisy | |||||||
voices_rm2_clo_none_stu_manifest | 18.5 | 15.9 | 14.2 | 12.6 | 11.2 | 21.5 | 27 |
voices_rm2_far_none_lav_manifest | 34.3 | 29.7 | 25.4 | 22.8 | 21.5 | 27.5 | 42.3 |
voices_rm4_far_none_stu_manifest | 39.5 | 34.4 | 28.6 | 26.2 | 24.7 | 43.2 | 43.2 |
voices_rm3_clo_none_stu_manifest | 36.8 | 32.1 | 41.9 | 39.2 | 37.9 | 28.6 | 40.8 |
voices_rm2_far_musi_stu_manifest | 39.1 | 34.3 | 30.2 | 27.5 | 26.1 | 30.6 | 42.4 |
voices_rm2_far_babb_stu_manifest | 44.8 | 39.3 | 34.6 | 31.8 | 30.9 | 38.5 | 48.2 |
voices_rm3_clo_musi_stu_manifest | 49.8 | 45.1 | 29.8 | 27.0 | 26.0 | 38.1 | 51.8 |
voices_rm4_ceo_none_lav_manifest | 56.3 | 50.7 | 46.9 | 43.7 | 41.3 | 42.9 | 52.5 |
voices_rm3_far_none_stu_manifest | 80.9 | 78.0 | 74.2 | 71.5 | 70.0 | 68.8 | 81.6 |
Google tests were run in early September 2020. EN V4 metrics updated in June 2021.
Dataset | Silero xsmall_q CE | Silero xsmall CE | Silero small_q CE | Silero small CE | Silero large CE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|---|---|---|
AudioBooks / narration | | | | | | | |
lj | | | | | 6.6 | | |
librispeech_test_clean | | | | | 6.8 | 7.8 | 8.7 |
librispeech_val | | | | | 11.7 | 11.3 | 13.1 |
librispeech_test_other | | | | | 17.5 | 16.2 | 19.1 |
aru | | | | | 10.6 | 16.2 | 19.1 |
mls_test | | | | | 20.6 | | |
mls_dev | | | | | 18.7 | | |
Lecture / speech | | | | | | | |
multi_ted_test_he | | | | | 12.2 | 15.3 | 14.1 |
multi_ted_test_common | | | | | 17.4 | 16.9 | 16 |
multi_ted_val | | | | | 20.4 | 22.7 | 20.8 |
voxpopuli_dev | | | | | 21.2 | | |
voxpopuli_test | | | | | 22.6 | | |
Finance | | | | | | | |
kensho | | | | | 6.5 | | |
In the wild | | | | | | | |
common_voice_val | | | | | 21.6 | 20.8 | 20.8 |
common_voice_test | | | | | 26.4 | 22.2 | 24 |
VOIP / calls | | | | | | | |
voip_test | | | | | 21.2 | 19.7 | 18.3 |
Dialects | | | | | | | |
uk_dialects_midlands_english_female | | | | | 10.8 | 9.6 | 8.4 |
uk_dialects_southern_english_female | | | | | 11.8 | 10.8 | 9.3 |
uk_dialects_welsh_english_female | | | | | 12.2 | 20.5 | 10.5 |
uk_dialects_southern_english_male | | | | | 12.6 | 11.5 | 10.6 |
uk_dialects_welsh_english_male | | | | | 14.1 | 12.1 | |
uk_dialects_northern_english_male | | | | | 14.0 | 15.5 | 11.7 |
uk_dialects_scottish_english_male | | | | | 15.1 | 10 | 11.3 |
uk_dialects_midlands_english_male | | | | | 13.7 | 11.8 | 10.3 |
uk_dialects_northern_english_female | | | | | 16.0 | 15 | 12.7 |
uk_dialects_scottish_english_female | | | | | 15.8 | 13.5 | 12.6 |
uk_dialects_irish_english_male | | | | | 25.8 | 25.5 | 21.9 |
Far-field / very noisy | | | | | | | |
voices_rm2_clo_none_stu | | | | | 13.7 | 21.5 | 27 |
voices_rm2_far_none_lav | | | | | 25.0 | 27.5 | 42.3 |
voices_rm4_far_none_stu | | | | | 30.0 | 43.2 | 43.2 |
voices_rm3_clo_none_stu | | | | | 28.0 | 28.6 | 40.8 |
voices_rm2_far_musi_stu | | | | | 29.7 | 30.6 | 42.4 |
voices_rm2_far_babb_stu | | | | | 34.7 | 38.5 | 48.2 |
voices_rm3_clo_musi_stu | | | | | 41.3 | 38.1 | 51.8 |
voices_rm4_ceo_none_lav | | | | | 44.5 | 42.9 | 52.5 |
voices_rm3_far_none_stu | | | | | 70.7 | 68.8 | 81.6 |
Dataset | Silero xsmall_q EE | Silero xsmall EE | Silero small_q EE | Silero small EE | Silero large EE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|---|---|---|
AudioBooks / narration | | | | | | | |
lj | | | | | 5.6 | | |
librispeech_test_clean | | | | | 6.1 | 7.8 | 8.7 |
librispeech_val | | | | | 10.0 | 11.3 | 13.1 |
librispeech_test_other | | | | | 15.2 | 16.2 | 19.1 |
aru | | | | | 8.0 | 16.2 | 19.1 |
mls_test | | | | | 17.1 | | |
mls_dev | | | | | 15.3 | | |
Lecture / speech | | | | | | | |
multi_ted_test_he | | | | | 10.2 | 15.3 | 14.1 |
multi_ted_test_common | | | | | 15.5 | 16.9 | 16 |
multi_ted_val | | | | | 18.4 | 22.7 | 20.8 |
voxpopuli_dev | | | | | 19.4 | | |
voxpopuli_test | | | | | 20.5 | | |
Finance | | | | | | | |
kensho | | | | | 5.9 | | |
In the wild | | | | | | | |
common_voice_val | | | | | 15.8 | 20.8 | 20.8 |
common_voice_test | | | | | 20.3 | 22.2 | 24 |
VOIP / calls | | | | | | | |
voip_test | | | | | 18.7 | 19.7 | 18.3 |
Dialects | | | | | | | |
uk_dialects_midlands_english_female | | | | | 7.8 | 9.6 | 8.4 |
uk_dialects_southern_english_female | | | | | 8.3 | 10.8 | 9.3 |
uk_dialects_welsh_english_female | | | | | 8.9 | 20.5 | 10.5 |
uk_dialects_southern_english_male | | | | | 9.2 | 11.5 | 10.6 |
uk_dialects_welsh_english_male | | | | | 10.9 | 12.1 | |
uk_dialects_northern_english_male | | | | | 10.0 | 15.5 | 11.7 |
uk_dialects_scottish_english_male | | | | | 11.1 | 10 | 11.3 |
uk_dialects_midlands_english_male | | | | | 9.8 | 11.8 | 10.3 |
uk_dialects_northern_english_female | | | | | 11.3 | 15 | 12.7 |
uk_dialects_scottish_english_female | | | | | 11.7 | 13.5 | 12.6 |
uk_dialects_irish_english_male | | | | | 21.2 | 25.5 | 21.9 |
Far-field / very noisy | | | | | | | |
voices_rm2_clo_none_stu | | | | | 10.8 | 21.5 | 27 |
voices_rm2_far_none_lav | | | | | 21.1 | 27.5 | 42.3 |
voices_rm4_far_none_stu | | | | | 26.0 | 43.2 | 43.2 |
voices_rm3_clo_none_stu | | | | | 24.1 | 28.6 | 40.8 |
voices_rm2_far_musi_stu | | | | | 26.1 | 30.6 | 42.4 |
voices_rm2_far_babb_stu | | | | | 31.5 | 38.5 | 48.2 |
voices_rm3_clo_musi_stu | | | | | 38 | 38.1 | 51.8 |
voices_rm4_ceo_none_lav | | | | | 41.3 | 42.9 | 52.5 |
voices_rm3_far_none_stu | | | | | 69.3 | 68.8 | 81.6 |
Google tests were run in early September 2020. EN V5 metrics updated in September 2021.
Dataset | Silero xsmall_q CE | Silero xsmall CE | Silero small_q CE | Silero small CE | Silero xlarge CE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|---|---|---|
AudioBooks / narration | |||||||
lj | 9.2 | 8.4 | 5.9 | ||||
librispeech_test_clean | 11.6 | 10.2 | 6.1 | 7.8 | 8.7 | ||
librispeech_val | 17.7 | 15.9 | 10.3 | 11.3 | 13.1 | ||
librispeech_test_other | 24 | 22.2 | 15.7 | 16.2 | 19.1 | ||
aru | 17.8 | 15.4 | 9.3 | 16.2 | 19.1 | ||
mls_test | 26.2 | 23.9 | 17.9 | ||||
mls_dev | 23.9 | 21.8 | 16.1 | ||||
Lecture / speech | |||||||
multi_ted_test_he | 18.3 | 16.7 | 10.3 | 15.3 | 14.1 | ||
multi_ted_test_common | 25.4 | 23.2 | 16.1 | 16.9 | 16 | ||
multi_ted_val | 27.2 | 25.5 | 18.8 | 22.7 | 20.8 | ||
voxpopuli_dev | 22.8 | 21.4 | 17.2 | ||||
voxpopuli_test | 23.3 | 22.3 | 17.9 | ||||
Finance | |||||||
kensho | 10.5 | 9.3 | 4.7 | ||||
In the wild | |||||||
common_voice_val | 28.5 | 26.3 | 20.2 | 20.8 | 20.8 | ||
common_voice_test | 33.2 | 30.9 | 24.6 | 22.2 | 24 | ||
gigaspeech_test | 30.5 | 28.6 | 22.4 | ||||
VOIP / calls | |||||||
voip_test | 19.4 | 19.5 | 18.3 | 19.7 | 18.3 | ||
Dialects | |||||||
uk_dialects_midlands_english_female | 19.3 | 17.2 | 9.1 | 9.6 | 8.4 | ||
uk_dialects_southern_english_female | 19.6 | 17.5 | 11.2 | 10.8 | 9.3 | ||
uk_dialects_welsh_english_female | 18.7 | 16.6 | 11.9 | 20.5 | 10.5 | ||
uk_dialects_southern_english_male | 20.2 | 18.5 | 11.7 | 11.5 | 10.6 | ||
uk_dialects_welsh_english_male | 20.6 | 18.9 | 13.4 | 12.1 | |||
uk_dialects_northern_english_male | 23.7 | 21.1 | 12.9 | 15.5 | 11.7 | ||
uk_dialects_scottish_english_male | 23 | 21 | 14.5 | 10 | 11.3 | ||
uk_dialects_midlands_english_male | 24.6 | 23.2 | 13.1 | 11.8 | 10.3 | ||
uk_dialects_northern_english_female | 24.4 | 22.3 | 15.7 | 15 | 12.7 | ||
uk_dialects_scottish_english_female | 23.5 | 21.7 | 15.2 | 13.5 | 12.6 | ||
uk_dialects_irish_english_male | 35.6 | 33.6 | 25.3 | 25.5 | 21.9 | ||
Far-field / very noisy | |||||||
voices_rm2_clo_none_stu | 20.7 | 18.2 | 11.5 | 21.5 | 27 | ||
voices_rm2_far_none_lav | 34 | 30.7 | 22 | 27.5 | 42.3 | ||
voices_rm4_far_none_stu | 37.9 | 34.3 | 26 | 43.2 | 43.2 | ||
voices_rm3_clo_none_stu | 36.5 | 33.4 | 25.1 | 28.6 | 40.8 | ||
voices_rm2_far_musi_stu | 39 | 35.8 | 26.5 | 30.6 | 42.4 | ||
voices_rm2_far_babb_stu | 44.5 | 41.2 | 30.8 | 38.5 | 48.2 | ||
voices_rm3_clo_musi_stu | 49.5 | 46.6 | 38.5 | 38.1 | 51.8 | ||
voices_rm4_ceo_none_lav | 54.3 | 50.9 | 40.6 | 42.9 | 52.5 | ||
voices_rm3_far_none_stu | 76 | 74.5 | 69.9 | 68.8 | 81.6 |
Dataset | Silero xsmall_q EE | Silero xsmall EE | Silero small_q EE | Silero small EE | Silero xlarge EE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|---|---|---|
AudioBooks / narration | |||||||
lj | 6.1 | 5.8 | 5.1 | ||||
librispeech_test_clean | 8.3 | 7.5 | 5.5 | 7.8 | 8.7 | ||
librispeech_val | 12.8 | 11.9 | 8.8 | 11.3 | 13.1 | ||
librispeech_test_other | 18.6 | 17.3 | 13.5 | 16.2 | 19.1 | ||
aru | 11.6 | 10.3 | 7 | 16.2 | 19.1 | ||
mls_test | 20.1 | 18.5 | 14.8 | ||||
mls_dev | 18 | 16.6 | 13.3 | ||||
Lecture / speech | |||||||
multi_ted_test_he | 12.9 | 11.9 | 8.4 | 15.3 | 14.1 | ||
multi_ted_test_common | 20.2 | 18.6 | 14 | 16.9 | 16 | ||
multi_ted_val | 22.4 | 21.1 | 16.9 | 22.7 | 20.8 | ||
voxpopuli_dev | 18.6 | 17.9 | 16 | ||||
voxpopuli_test | 18.9 | 18.3 | 16.4 | ||||
Finance | |||||||
kensho | 7.5 | 6.7 | 4.3 | ||||
In the wild | |||||||
common_voice_val | 19.9 | 18.5 | 15 | 20.8 | 20.8 | ||
common_voice_test | 24.4 | 22.9 | 19.2 | 22.2 | 24 | ||
gigaspeech_test | 26.2 | 24.5 | 20.7 | ||||
VOIP / calls | |||||||
voip_test | 18.7 | 20.2 | 18.3 | 19.7 | 18.3 | ||
Dialects | |||||||
uk_dialects_midlands_english_female | 11.9 | 10.8 | 6.8 | 9.6 | 8.4 | ||
uk_dialects_southern_english_female | 12.4 | 11.2 | 8 | 10.8 | 9.3 | ||
uk_dialects_welsh_english_female | 12.5 | 11.3 | 8.7 | 20.5 | 10.5 | ||
uk_dialects_southern_english_male | 13.3 | 12.2 | 8.6 | 11.5 | 10.6 | ||
uk_dialects_welsh_english_male | 13.8 | 12.9 | 10.1 | 12.1 | |||
uk_dialects_northern_english_male | 14.9 | 13.8 | 9.6 | 15.5 | 11.7 | ||
uk_dialects_scottish_english_male | 14.9 | 13.8 | 10.6 | 10 | 11.3 | ||
uk_dialects_midlands_english_male | 15.5 | 14.5 | 9.2 | 11.8 | 10.3 | ||
uk_dialects_northern_english_female | 15.9 | 14.9 | 11.2 | 15 | 12.7 | ||
uk_dialects_scottish_english_female | 15.5 | 14.5 | 11.3 | 13.5 | 12.6 | ||
uk_dialects_irish_english_male | 26.5 | 25.1 | 20.3 | 25.5 | 21.9 | ||
Far-field / very noisy | |||||||
voices_rm2_clo_none_stu | 15 | 13.4 | 9.3 | 21.5 | 27 | ||
voices_rm2_far_none_lav | 27.4 | 24.7 | 18.6 | 27.5 | 42.3 | ||
voices_rm4_far_none_stu | 31.3 | 28.4 | 22.5 | 43.2 | 43.2 | ||
voices_rm3_clo_none_stu | 30 | 27.7 | 21.7 | 28.6 | 40.8 | ||
voices_rm2_far_musi_stu | 32.5 | 29.7 | 23.1 | 30.6 | 42.4 | ||
voices_rm2_far_babb_stu | 38.1 | 35.1 | 27.5 | 38.5 | 48.2 | ||
voices_rm3_clo_musi_stu | 44 | 41.5 | 35.3 | 38.1 | 51.8 | ||
voices_rm4_ceo_none_lav | 48.9 | 45.8 | 37.4 | 42.9 | 52.5 | ||
voices_rm3_far_none_stu | 73.7 | 72.2 | 68.3 | 68.8 | 81.6 |
Google tests were run in early September 2020. EN V6 metrics updated in February 2022.
Dataset | Silero small CE | Silero xlarge CE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|
AudioBooks / narration | ||||
lj | 7.7 | 5.8 | ||
librispeech_test_clean | 10.0 | 6.1 | 7.8 | 8.7 |
librispeech_val | 15.5 | 10.4 | 11.3 | 13.1 |
librispeech_test_other | 21.9 | 15.7 | 16.2 | 19.1 |
aru | 16.1 | 9.6 | 16.2 | 19.1 |
mls_test | 23.1 | 17.6 | ||
mls_dev | 21.1 | 15.9 | ||
Lecture / speech | ||||
multi_ted_test_he | 15.7 | 9.9 | 15.3 | 14.1 |
multi_ted_test_common | 22.5 | 16.0 | 16.9 | 16 |
multi_ted_val | 23.9 | 18.5 | 22.7 | 20.8 |
voxpopuli_dev | 21.0 | 16.8 | ||
voxpopuli_test | 21.9 | 17.4 | ||
Finance | ||||
kensho | 8.4 | 4.6 | ||
In the wild | ||||
common_voice_val | 25.9 | 19.9 | 20.8 | 20.8 |
common_voice_test | 30.4 | 24.4 | 22.2 | 24 |
gigaspeech_test | 27.5 | 22.1 | ||
gigaspeech_2s_test | 26.1 | 20.5 | ||
fluent_ai_speech_commands | 23.6 | 18.6 | ||
speech_commands | 17.0 | 15.1 | ||
VOIP / calls | ||||
voip_test | 19.7 | 17.5 | 19.7 | 18.3 |
voip_val | 19.3 | 17.8 | ||
vystadial_dev | 9.3 | 6.1 | ||
vystadial_test | 9.1 | 5.6 | ||
vystadial_train | 9.1 | 6.1 | ||
Dialects | ||||
uk_dialects | 19.3 | 13.0 | ||
uk_dialects_midlands_english_female | 16.7 | 8.7 | 9.6 | 8.4 |
uk_dialects_southern_english_female | 17.4 | 11.3 | 10.8 | 9.3 |
uk_dialects_welsh_english_female | 16.5 | 11.8 | 20.5 | 10.5 |
uk_dialects_southern_english_male | 18.2 | 11.9 | 11.5 | 10.6 |
uk_dialects_welsh_english_male | 18.7 | 13.3 | 12.1 | |
uk_dialects_northern_english_male | 20.6 | 13.1 | 15.5 | 11.7 |
uk_dialects_scottish_english_male | 20.8 | 14.7 | 10 | 11.3 |
uk_dialects_midlands_english_male | 22.4 | 13.4 | 11.8 | 10.3 |
uk_dialects_northern_english_female | 21.8 | 15.7 | 15 | 12.7 |
uk_dialects_scottish_english_female | 21.5 | 15.5 | 13.5 | 12.6 |
uk_dialects_irish_english_male | 33.4 | 25.9 | 25.5 | 21.9 |
cmu_arctic_val | 10.5 | 6.2 | ||
l2arctic_arabic | 30.1 | 24.2 | ||
l2arctic_chinese | 34.1 | 27.5 | ||
l2arctic_hindi | 19.1 | 14.0 | ||
l2arctic_korean | 23.9 | 17.6 | ||
l2arctic_spanish | 28.7 | 22.6 | ||
l2arctic_vietnamese | 39.4 | 33.8 | ||
Far-field / very noisy | ||||
voices_rm2_clo_none_stu | 17.2 | 11.1 | 21.5 | 27 |
voices_rm2_far_none_lav | 30.5 | 21.4 | 27.5 | 42.3 |
voices_rm4_far_none_stu | 34.3 | 25.5 | 43.2 | 43.2 |
voices_rm3_clo_none_stu | 32.3 | 24.1 | 28.6 | 40.8 |
voices_rm2_far_musi_stu | 35.3 | 25.7 | 30.6 | 42.4 |
voices_rm2_far_babb_stu | 42.1 | 31.0 | 38.5 | 48.2 |
voices_rm3_clo_musi_stu | 45.2 | 36.7 | 38.1 | 51.8 |
voices_rm4_ceo_none_lav | 48.9 | 38.9 | 42.9 | 52.5 |
voices_rm3_far_none_stu | 73.8 | 65.5 | 68.8 | 81.6 |
Dataset | Silero small EE | Silero xlarge EE | Google Video Premium | Google Phone Premium |
---|---|---|---|---|
AudioBooks / narration | ||||
lj | 5.7 | 5.0 | ||
librispeech_test_clean | 7.5 | 5.4 | 7.8 | 8.7 |
librispeech_val | 11.6 | 8.8 | 11.3 | 13.1 |
librispeech_test_other | 17.3 | 13.6 | 16.2 | 19.1 |
aru | 10.6 | 7.2 | 16.2 | 19.1 |
mls_test | 18.3 | 14.8 | ||
mls_dev | 16.6 | 13.4 | ||
Lecture / speech | ||||
multi_ted_test_he | 11.3 | 8.4 | 15.3 | 14.1 |
multi_ted_test_common | 17.7 | 13.9 | 16.9 | 16 |
multi_ted_val | 20.6 | 16.8 | 22.7 | 20.8 |
voxpopuli_dev | 17.8 | 15.8 | ||
voxpopuli_test | 18.3 | 16.2 | ||
Finance | ||||
kensho | 6.3 | 4.3 | ||
In the wild | ||||
common_voice_val | 18.3 | 14.9 | 20.8 | 20.8 |
common_voice_test | 22.6 | 19.1 | 22.2 | 24 |
gigaspeech_test | 23.6 | 20.6 | ||
gigaspeech_2s_test | 22.1 | 19.1 | ||
fluent_ai_speech_commands | 17.2 | 15.3 | ||
speech_commands | 16.6 | 12.0 | ||
VOIP / calls | ||||
voip_test | 19.6 | 18.2 | 19.7 | 18.3 |
voip_val | 18.4 | 18.3 | ||
vystadial_dev | 8.2 | 6.1 | ||
vystadial_test | 8.3 | 5.8 | ||
vystadial_train | 8.7 | 6.0 | ||
Dialects | ||||
uk_dialects | 13.1 | 9.7 | ||
uk_dialects_midlands_english_female | 10.1 | 6.3 | 9.6 | 8.4 |
uk_dialects_southern_english_female | 11.6 | 8.2 | 10.8 | 9.3 |
uk_dialects_welsh_english_female | 11.4 | 8.8 | 20.5 | 10.5 |
uk_dialects_southern_english_male | 12.2 | 8.8 | 11.5 | 10.6 |
uk_dialects_welsh_english_male | 13.1 | 10.2 | 12.1 | |
uk_dialects_northern_english_male | 13.7 | 9.8 | 15.5 | 11.7 |
uk_dialects_scottish_english_male | 14.2 | 10.9 | 10 | 11.3 |
uk_dialects_midlands_english_male | 14.6 | 9.2 | 11.8 | 10.3 |
uk_dialects_northern_english_female | 15.2 | 11.3 | 15 | 12.7 |
uk_dialects_scottish_english_female | 14.8 | 11.5 | 13.5 | 12.6 |
uk_dialects_irish_english_male | 25.1 | 21.2 | 25.5 | 21.9 |
cmu_arctic_val | 7.6 | 5.1 | ||
l2arctic_arabic | 23.1 | 19.4 | ||
l2arctic_chinese | 26.8 | 22.4 | ||
l2arctic_hindi | 14.1 | 11.3 | ||
l2arctic_korean | 17.5 | 13.9 | ||
l2arctic_spanish | 22.1 | 18.5 | ||
l2arctic_vietnamese | 32.1 | 28.3 | ||
Far-field / very noisy | ||||
voices_rm2_clo_none_stu | 13.1 | 9.3 | 21.5 | 27 |
voices_rm2_far_none_lav | 25.2 | 18.5 | 27.5 | 42.3 |
voices_rm4_far_none_stu | 29.0 | 22.4 | 43.2 | 43.2 |
voices_rm3_clo_none_stu | 27.2 | 21.1 | 28.6 | 40.8 |
voices_rm2_far_musi_stu | 30.0 | 22.6 | 30.6 | 42.4 |
voices_rm2_far_babb_stu | 37.1 | 28.0 | 38.5 | 48.2 |
voices_rm3_clo_musi_stu | 40.7 | 33.7 | 38.1 | 51.8 |
voices_rm4_ceo_none_lav | 44.5 | 36.0 | 42.9 | 52.5 |
voices_rm3_far_none_stu | 71.8 | 63.9 | 68.8 | 81.6 |
All of these tests were run in early September 2020.
At the time of this test, Google offered no premium model for German. Several models were available for several regions, but since they differed only slightly, we chose the default German model.
Dataset | CE | EE | Google |
---|---|---|---|
AudioBooks | |||
de_caito_manifest_val | 12.5 | 8.7 | 19.5 |
Narration | |||
de_voxforge_manifest_val | 3.8 | 2.3 | 5.9 |
In the wild | |||
de_common_voice_test_manifest | 28.0 | 17.6 | 16.1 |
de_common_voice_val_manifest | 24.9 | 15.0 | 14.0 |
de_telekinect_dev_manifest | 28.1 | 18.6 | 13.5 |
de_telekinect_test_manifest | 28.3 | 19.4 | 15.7 |
All of these tests were run in early September 2020.
At the time of this test, Google offered no premium model for German. Several models were available for several regions, but since they differed only slightly, we chose the default German model.
Dataset | CE | EE | Google |
---|---|---|---|
Books | |||
de_mls_test | 19.5 | 15.0 | N/A |
de_mls_val | 16.6 | 12.7 | N/A |
Narration | |||
de_voxforge_manifest_val | 7.4 | 5.2 | 5.9 |
Public speech | |||
de_voxpopuli_dev | 27.0 | 24.6 | N/A |
de_voxpopuli_test | 25.0 | 22.8 | N/A |
In the wild | |||
de_common_voice_test_manifest | 21.0 | 14.3 | 16.1 |
de_common_voice_val_manifest | 18.8 | 12.5 | 14.0 |
de_telekinect_dev_manifest | 16.6 | 11.6 | 13.5 |
de_telekinect_test_manifest | 17.3 | 12.1 | 15.7 |
Google tests were run in early September 2020.
At the time of this test, Google offered no premium model for German. Several models were available for several regions, but since they differed only slightly, we chose the default German model.
Dataset | CE | EE | Google |
---|---|---|---|
Books | |||
de_mls_test | 16.3 | 12.8 | N/A |
de_mls_val | 13.3 | 10.5 | N/A |
Narration | |||
de_voxforge_val | 5.8 | 4.4 | 5.9 |
Public speech | |||
de_voxpopuli_dev | 26.3 | 23.8 | N/A |
de_voxpopuli_test | 24 | 21.6 | N/A |
In the wild | |||
de_common_voice_test | 20.6 | 14.1 | 16.1 |
de_common_voice_val | 18.4 | 12.3 | 14 |
de_telekinect_dev | 16.2 | 11.3 | 13.5 |
de_telekinect_test | 16.4 | 12 | 15.7 |
All of these tests were run in early September 2020.
For Spanish, we chose the region (US) where a Premium model was available. Judging by the benchmark results, Google relies heavily on the data it sources from Android, most likely due to the large population and less regulation. Note that most "dialect" recordings are quite clean, but pronunciation varies.
Dataset | CE | EE | Google Phone Premium | |
---|---|---|---|---|
AudioBooks | ||||
es_caito_val | 7.7 | 5.7 | 20.3 | 22.3 |
Narration | ||||
es_voxforge_val | 1.4 | 1.1 | 18.1 | 19.4 |
In the wild | ||||
es_common_voice_test | 22.0 | 14.4 | 27.2 | 23.1 |
es_common_voice_val | 20.1 | 13.0 | 24.5 | 19.6 |
Dialects | ||||
es_dialects_argentinian_val | 19.0 | 12.9 | 11.8 | 6.7 |
es_dialects_chilean_val | 19.8 | 13.7 | 8.9 | 6.6 |
es_dialects_columbian_val | 18.4 | 11.9 | 7.8 | 5.4 |
es_dialects_peruvian_val | 14.4 | 9.1 | 6.2 | 4.7 |
es_dialects_puerto_rico_val | 21.1 | 14.5 | 7.9 | 6.0 |
es_dialects_venezuela_val | 19.2 | 13.2 | 8.2 | 6.4 |
We decided to keep the quality assessment really simple: we generated audio from the validation subsets of our data (~200 files per speaker), shuffled it with the original recordings of the same speakers, and gave it to a group of 24 assessors to evaluate the sound quality on a five-point scale. For 8 kHz and 16 kHz the scores were collected separately (both for synthesized and original speech). For simplicity we used the following grades: [1, 2, 3, 4-, 4, 4+, 5-, 5]; the higher the quality, the more detailed the scale. Then, for each speaker, we simply calculated the mean.
In total, people scored audios 37,403 times. 12 people annotated the whole dataset, and 12 others annotated from 10% to 75% of the audios. For each speaker we calculated the mean (standard deviation is shown in brackets). We also tried first calculating median scores for each audio and then averaging them, but this just increases the mean values without affecting the ratios, so we used plain averages in the end. The key metric here is, of course, the ratio between the mean score for synthesis and the mean score for the original audio. Some users gave much lower scores overall (hence the high dispersion), but we decided to keep all scores as is without cleaning outliers.
Speaker | Original | Synthesis | Ratio | Examples |
---|---|---|---|---|
aidar_8khz | 4.67 (.45) | 4.52 (.55) | 96.8% | link |
baya_8khz | 4.52 (.57) | 4.25 (.76) | 94.0% | link |
kseniya_8khz | 4.80 (.40) | 4.54 (.60) | 94.5% | link |
aidar_16khz | 4.72 (.43) | 4.53 (.55) | 95.9% | link |
baya_16khz | 4.59 (.55) | 4.18 (.76) | 91.1% | link |
kseniya_16khz | 4.84 (.37) | 4.54 (.59) | 93.9% | link |
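The Ratio column is the mean synthesis score divided by the mean original score. A minimal sketch that recomputes it from the rounded means in the table above; the names `mean_scores` and `synthesis_ratio` are illustrative, and because the inputs are already rounded, the last digit can differ slightly from the published ratios (which were presumably computed from unrounded means):

```python
# Rounded mean scores copied from the table above: (original, synthesis).
mean_scores = {
    "aidar_8khz": (4.67, 4.52),
    "baya_8khz": (4.52, 4.25),
    "kseniya_8khz": (4.80, 4.54),
    "aidar_16khz": (4.72, 4.53),
    "baya_16khz": (4.59, 4.18),
    "kseniya_16khz": (4.84, 4.54),
}


def synthesis_ratio(original: float, synthesis: float) -> float:
    """Return the synthesis-to-original ratio as a percentage."""
    return 100.0 * synthesis / original


ratios = {name: synthesis_ratio(*pair) for name, pair in mean_scores.items()}

for speaker, ratio in ratios.items():
    print(f"{speaker}: {ratio:.1f}%")
```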
We asked our assessors to rate the "naturalness of the speech" (not the audio quality). Nevertheless, we were surprised that, anecdotally, people cannot tell 8 kHz from 16 kHz on their everyday devices (which is also confirmed by the metrics). Baya has the lowest absolute and relative scores. Kseniya has the highest absolute scores, and Aidar has the highest relative scores. Baya also has higher score dispersion than the other speakers.
Manually inspecting audios with high score dispersion reveals several patterns. Speaker errors, Tacotron errors (pauses), proper names and hard-to-read words are the most common causes. As expected, 75% of such cases are in synthesized audios, and the sampling rate does not seem to affect this.
We tried to rate "naturalness", but it is only natural to try estimating "unnaturalness" or "robotness" as well. It can be measured by asking people to choose between two audios. We went one step further and essentially applied a double-blind test: we asked our assessors to rate the same audio 4 times in random order, as the original and as synthesis at different sampling rates. For the assessors who annotated the whole dataset we calculated the following table:
Comparison | Worse | Same | Better |
---|---|---|---|
16k vs 8k, original | 957 | 4811 | 1512 |
16k vs 8k, synthesis | 1668 | 4061 | 1551 |
Original vs synthesis, 8k | 816 | 3697 | 2767 |
Original vs synthesis, 16k | 674 | 3462 | 3144 |
Several conclusions can be drawn:
- In 66% of cases people cannot hear the difference between 8k and 16k;
- In synthesis, 8k helps to hide some errors;
- In about 60% of cases synthesis is the same as or better than the original;
- The last two conclusions hold regardless of the sampling rate, with 8k having a slight advantage;
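The shares behind these conclusions can be recomputed from the counts in the comparison table. A minimal sketch (the names `comparisons` and `shares` are illustrative); for example, the "Same" share of the "16k vs 8k, original" row gives the 66% figure:

```python
# Counts copied from the comparison table above: (worse, same, better).
comparisons = {
    "16k vs 8k, original": (957, 4811, 1512),
    "16k vs 8k, synthesis": (1668, 4061, 1551),
    "Original vs synthesis, 8k": (816, 3697, 2767),
    "Original vs synthesis, 16k": (674, 3462, 3144),
}


def shares(counts):
    """Convert the raw counts of one row into percentages of that row's total."""
    total = sum(counts)
    return tuple(100.0 * count / total for count in counts)


percentages = {name: shares(counts) for name, counts in comparisons.items()}

for name, (worse, same, better) in percentages.items():
    print(f"{name}: worse {worse:.1f}%, same {same:.1f}%, better {better:.1f}%")
```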
You can hear for yourself how it sounds, both for our unique voices and for speakers from external sources (more audio for each speaker can be synthesized in the Colab notebook in our repo).
Contrary to popular trends, we aim to provide metrics that are as detailed, informative and honest as possible. In this particular case, we used the following datasets for validation:
- Validation subsets of our private text corpora (5,000 sentences per language);
- Audiobooks: we use the caito dataset, which has texts in all the languages the model was trained on (20,000 random sentences for each language);
We use the following metrics:
- WER (word error rate) as a percentage, calculated separately for repunctuation, WER_p (both sentences are transformed to lowercase), and for recapitalization, WER_c (here we throw out all punctuation marks);
- Precision / recall / F1 to check the quality of classification (i) between the space and the following punctuation marks: . , - ! ? —, and (ii) for the restoration of capital letters, between the classes: a lowercase token / a token that starts with a capital / an all-caps token. We also provide confusion matrices for visualization;
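To make the two WER variants concrete, here is a minimal sketch under stated assumptions: WER is taken as the word-level edit distance divided by the reference length, words are split on whitespace, and punctuation stays attached to the neighbouring token. The function names and the `PUNCTUATION` constant are illustrative and do not come from our evaluation code.

```python
import re

PUNCTUATION = ".,-!?—"  # the marks listed above (assumed set)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / max(len(ref), 1)


def wer_p(reference: str, hypothesis: str) -> float:
    """Repunctuation WER: case is ignored, punctuation is kept."""
    return wer(reference.lower(), hypothesis.lower())


def wer_c(reference: str, hypothesis: str) -> float:
    """Recapitalization WER: punctuation is thrown out, case is kept."""
    pattern = f"[{re.escape(PUNCTUATION)}]"
    return wer(re.sub(pattern, "", reference), re.sub(pattern, "", hypothesis))


reference = "Hello, world. How are you?"
hypothesis = "Hello world how are you?"
print(wer_p(reference, hypothesis))  # two punctuation errors out of five words
print(wer_c(reference, hypothesis))  # one capitalization error out of five words
```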
For correct and informative metric calculation, the following transformations were applied to the texts beforehand:
- Punctuation characters other than . , - ! ? — were removed;
- Punctuation at the beginning of a sentence was removed;
- In case of multiple consecutive punctuation marks, only the first one was kept;
- For Spanish, ¿ and ¡ were discarded from the model predictions, because they are absent from the texts of the books, although in general the model places them as well;
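The transformations above can be sketched with a few regular expressions. This is an illustration, not our preprocessing code: `normalize` and `KEPT_MARKS` are hypothetical names, and "punctuation" is assumed to mean any character that is not a letter, digit, underscore, whitespace, or one of the kept marks (so apostrophes, quotes and brackets are dropped, and so are ¿ and ¡).

```python
import re

KEPT_MARKS = ".,-!?—"  # marks the metrics are computed for (assumed set)
_MARKS = re.escape(KEPT_MARKS)


def normalize(text: str) -> str:
    """Apply the three generic clean-up rules to a single sentence."""
    # 1. Remove every character that is neither a word character,
    #    whitespace, nor one of the kept marks (this also drops ¿ and ¡).
    text = re.sub(rf"[^\w\s{_MARKS}]", "", text)
    # 2. Of several consecutive punctuation marks keep only the first one.
    text = re.sub(rf"([{_MARKS}])[{_MARKS}]+", r"\1", text)
    # 3. Remove punctuation (and stray spaces) at the beginning of the sentence.
    text = re.sub(rf"^[\s{_MARKS}]+", "", text)
    return text


print(normalize('— "Wait... what?!" (she said)'))
print(normalize("¿Cómo estás?"))
```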
WER_p / WER_c are specified in the cells below. The baseline metrics are calculated for a naive algorithm that starts the text with a capital letter and ends it with a full stop.
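For reference, the naive baseline fits in a few lines. A minimal sketch, assuming the input is lowercase text without punctuation (`naive_baseline` is an illustrative name):

```python
def naive_baseline(text: str) -> str:
    """Start the text with a capital letter and end it with a full stop."""
    text = text.strip()
    if not text:
        return text
    restored = text[0].upper() + text[1:]
    if not restored.endswith("."):
        restored += "."
    return restored


print(naive_baseline("hello world how are you"))
```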
Domain - validation data:
Languages | en | de | ru | es |
---|---|---|---|---|
baseline | 14 / 19 | 13 / 41 | 17 / 20 | 10 / 16 |
model | 6 / 6 | 5 / 5 | 7 / 7 | 5 / 5 |
Domain - books:
Languages | en | de | ru | es |
---|---|---|---|---|
baseline | 14 / 13 | 15 / 26 | 23 / 14 | 13 / 8 |
model | 12 / 7 | 11 / 8 | 18 / 10 | 12 / 6 |
Domain - validation data:
Languages | en | de | ru | es |
---|---|---|---|---|
baseline | 12 / 18 | 10 / 33 | 13 / 12 | 8 / 11 |
model | 5 / 4 | 5 / 4 | 7 / 4 | 5 / 4 |
Domain - books:
Languages | en | de | ru | es |
---|---|---|---|---|
baseline | 12 / 10 | 12 / 22 | 19 / 9 | 15 / 7 |
model | 12 / 6 | 10 / 6 | 17 / 7 | 13 / 5 |
WER_p / WER_c are specified in the cells below. The baseline metrics are calculated for a naive algorithm that starts the sentence with a capital letter and ends it with a full stop.
Domain - validation data:
Languages | en | de | ru | es |
---|---|---|---|---|
baseline | 20 / 26 | 13 / 36 | 18 / 17 | 8 / 13 |
model | 8 / 8 | 7 / 7 | 13 / 6 | 6 / 5 |
Domain - books:
Languages | en | de | ru | es |
---|---|---|---|---|
baseline | 14 / 13 | 13 / 22 | 20 / 11 | 14 / 7 |
model | 14 / 8 | 11 / 6 | 21 / 7 | 13 / 6 |
Domain - validation data:
Metric | ' ' | . | , | - | ! | ? | — |
---|---|---|---|---|---|---|---|
en | |||||||
precision | 0.98 | 0.97 | 0.78 | 0.91 | 0.80 | 0.89 | nan |
recall | 0.99 | 0.98 | 0.64 | 0.75 | 0.67 | 0.78 | nan |
f1 | 0.98 | 0.98 | 0.71 | 0.82 | 0.73 | 0.84 | nan |
de | |||||||
precision | 0.98 | 0.98 | 0.86 | 0.81 | 0.74 | 0.90 | nan |
recall | 0.99 | 0.99 | 0.68 | 0.60 | 0.70 | 0.71 | nan |
f1 | 0.99 | 0.98 | 0.76 | 0.69 | 0.72 | 0.79 | nan |
ru | |||||||
precision | 0.98 | 0.97 | 0.80 | 0.90 | 0.80 | 0.84 | 0 |
recall | 0.98 | 0.99 | 0.74 | 0.70 | 0.58 | 0.78 | nan |
f1 | 0.98 | 0.98 | 0.77 | 0.78 | 0.67 | 0.81 | nan |
es | |||||||
precision | 0.98 | 0.96 | 0.70 | 0.74 | 0.85 | 0.83 | 0 |
recall | 0.99 | 0.98 | 0.60 | 0.29 | 0.60 | 0.70 | nan |
f1 | 0.98 | 0.98 | 0.64 | 0.42 | 0.70 | 0.76 | nan |
Metric | a | A | AAA |
---|---|---|---|
en | |||
precision | 0.98 | 0.94 | 0.97 |
recall | 0.99 | 0.91 | 0.70 |
f1 | 0.98 | 0.92 | 0.81 |
de | |||
precision | 0.99 | 0.98 | 0.89 |
recall | 0.99 | 0.98 | 0.53 |
f1 | 0.99 | 0.98 | 0.66 |
ru | |||
precision | 0.99 | 0.96 | 0.99 |
recall | 0.99 | 0.92 | 0.99 |
f1 | 0.99 | 0.94 | 0.99 |
es | |||
precision | 0.99 | 0.95 | 0.98 |
recall | 0.99 | 0.90 | 0.82 |
f1 | 0.99 | 0.92 | 0.89 |
Domain - books:
Metric | ' ' | . | , | - | ! | ? | — |
---|---|---|---|---|---|---|---|
en | |||||||
precision | 0.96 | 0.80 | 0.59 | 0.82 | 0.23 | 0.39 | nan |
recall | 0.99 | 0.73 | 0.23 | 0.13 | 0.58 | 0.85 | 0 |
f1 | 0.97 | 0.77 | 0.33 | 0.22 | 0.33 | 0.53 | nan |
de | |||||||
precision | 0.97 | 0.75 | 0.80 | 0.55 | 0.21 | 0.41 | nan |
recall | 0.99 | 0.71 | 0.49 | 0.35 | 0.58 | 0.67 | 0 |
f1 | 0.98 | 0.73 | 0.61 | 0.43 | 0.30 | 0.51 | nan |
ru | |||||||
precision | 0.97 | 0.77 | 0.69 | 0.90 | 0.17 | 0.49 | 0 |
recall | 0.98 | 0.60 | 0.55 | 0.61 | 0.68 | 0.75 | nan |
f1 | 0.98 | 0.68 | 0.61 | 0.72 | 0.28 | 0.60 | nan |
es | |||||||
precision | 0.96 | 0.57 | 0.59 | 0.96 | 0.30 | 0.24 | nan |
recall | 0.98 | 0.70 | 0.29 | 0.02 | 0.40 | 0.68 | 0 |
f1 | 0.97 | 0.63 | 0.38 | 0.04 | 0.34 | 0.36 | nan |
Metric | a | A | AAA |
---|---|---|---|
en | |||
precision | 0.99 | 0.80 | 0.94 |
recall | 0.98 | 0.89 | 0.95 |
f1 | 0.98 | 0.85 | 0.94 |
de | |||
precision | 0.99 | 0.90 | 0.77 |
recall | 0.98 | 0.94 | 0.62 |
f1 | 0.98 | 0.92 | 0.70 |
ru | |||
precision | 0.99 | 0.81 | 0.82 |
recall | 0.99 | 0.87 | 0.96 |
f1 | 0.99 | 0.84 | 0.89 |
es | |||
precision | 0.99 | 0.71 | 0.45 |
recall | 0.98 | 0.82 | 0.91 |
f1 | 0.98 | 0.76 | 0.60 |
As the tables show, the metrics for the dash (the last column) remained empty even for Russian: on the data used for calculating metrics, the model either did not place it at all or replaced it with some other symbol. It seems to be placed better in sentences phrased as definitions.
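One plausible reading of the `nan` and `0` cells in the tables above: precision is undefined when a class is never predicted, recall is undefined when a class never occurs in the reference, and F1 is undefined whenever either of them is. A minimal sketch under that assumption (`classification_metrics` is an illustrative name, and the label sequences are made up for the example):

```python
import math


def classification_metrics(true_labels, predicted_labels, label):
    """Return (precision, recall, f1) for one class; nan where undefined."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(t == label and p == label for t, p in pairs)
    predicted = sum(p == label for _, p in pairs)
    actual = sum(t == label for t, _ in pairs)
    precision = true_positives / predicted if predicted else math.nan
    recall = true_positives / actual if actual else math.nan
    if math.isnan(precision) or math.isnan(recall) or precision + recall == 0:
        f1 = math.nan
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One label per token: the mark that follows it (a space means "no mark").
true_labels = [" ", ",", " ", "—", "."]
predicted_labels = [" ", " ", " ", " ", "."]

# The dash occurs in the reference but is never predicted:
# precision is nan, recall is 0, F1 is nan.
dash_metrics = classification_metrics(true_labels, predicted_labels, "—")
space_metrics = classification_metrics(true_labels, predicted_labels, " ")
print(dash_metrics)
print(space_metrics)
```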