XLS-R is a set of large-scale models for self-supervised cross-lingual speech representation learning based on wav2vec 2.0. The models were pretrained on 128 languages and approximately 436K hours of unlabeled speech data. With finetuning, they achieve state-of-the-art performance in speech translation, speech recognition and language identification. We evaluate the models across multiple benchmarks: CoVoST-2 for speech translation, BABEL / MLS / CommonVoice / VoxPopuli for automatic speech recognition, VoxLingua107 for language identification, and VoxCeleb1 for speaker identification. More details about this work can be found in our paper, and download links can be found below.
Model | Link |
---|---|
XLS-R 300M | download |
XLS-R 1B | download |
XLS-R 2B | download |
You can also download these models here and read more about them in the blog post from Hugging Face.
We multilingually finetune XLS-R models on CoVoST 2, which has 21 into-English and 15 out-of-English directions.
Model | Directions | Link |
---|---|---|
XLS-R 300M | 21 langs → En | download |
XLS-R 300M | En → 15 langs | download |
XLS-R 1B | 21 langs → En | download |
XLS-R 1B | En → 15 langs | download |
XLS-R 2B | 21 langs → En | download |
XLS-R 2B | En → 15 langs | download |
XLS-R 2B | 21 langs → En + En → 15 langs | download |
You can refer to the original wav2vec documentation here for detailed instructions on how to finetune a pretrained model with CTC. Below is an example command, followed by the hyperparameter values needed to reproduce the results in our paper.
$ fairseq-hydra-train \
distributed_training.distributed_port=$PORT \
task.data=/path/to/data \
model.w2v_path=/path/to/model.pt \
--config-dir /path/to/fairseq-py/examples/wav2vec/xlsr/config \
--config-name finetune
For finetuning the 300M and 1B models, we use the hyperparameter settings defined in finetune.yaml. We vary optimization.max_update as described in the table below, and optimization.lr is picked from the interval [2e-5, 3e-4] based on dev word error rate.
Benchmark | Total Number of Updates |
---|---|
Babel | 26000 |
Common Voice | 13000 |
VoxPopuli | 50000 |
MLS 10h | 20000 |
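As a rough illustration of how the per-benchmark settings might be passed on the command line, the Python sketch below assembles the example command with two extra Hydra overrides. The helper names (`lr_candidates`, `finetune_command`), the log-spaced grid, the default port, and the placeholder paths are assumptions made for this example; the text only states the interval the learning rate was picked from, not the candidate values that were tried.

```python
import shlex

# Total number of updates per benchmark, copied from the table above.
MAX_UPDATES = {
    "Babel": 26000,
    "Common Voice": 13000,
    "VoxPopuli": 50000,
    "MLS 10h": 20000,
}

# Learning-rate interval stated in the text for the 300M and 1B models.
LR_MIN, LR_MAX = 2e-5, 3e-4


def lr_candidates(n=5):
    """Return n log-spaced learning rates spanning [LR_MIN, LR_MAX].

    The grid itself is an assumption; only the interval comes from the text.
    """
    ratio = (LR_MAX / LR_MIN) ** (1 / (n - 1))
    return [LR_MIN * ratio**i for i in range(n)]


def finetune_command(benchmark, lr, port=29500,
                     data="/path/to/data", model="/path/to/model.pt"):
    """Build the example fairseq-hydra-train call as an argument list."""
    return [
        "fairseq-hydra-train",
        f"distributed_training.distributed_port={port}",
        f"task.data={data}",
        f"model.w2v_path={model}",
        # optimization.lr is written as a one-element list here (assumed syntax).
        f"optimization.max_update={MAX_UPDATES[benchmark]}",
        f"optimization.lr=[{lr:.3g}]",
        "--config-dir", "/path/to/fairseq-py/examples/wav2vec/xlsr/config",
        "--config-name", "finetune",
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per candidate learning rate.
    for lr in lr_candidates():
        print(shlex.join(finetune_command("Common Voice", lr)))
```

Each printed line is one candidate run; the run with the lowest dev word error rate would then be kept.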
For finetuning the 2B model, we make some additional changes to finetune.yaml. We set distributed_training.ddp_backend to fully_sharded, which is provided by the fairscale library, and set model.activation_checkpoint to true. We also increase dataset.max_tokens to 2560000 and use a total effective batch size of 2560000*24. We sweep for the best optimization.lr within the interval [3e-6, 3e-5] using dev error rate. For the Common Voice dataset, we pick model.mask_prob for each language from {0.30, 0.40} based on the best dev error rate.
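To make the 2B settings concrete, the sketch below lists the overrides named above and works out what the effective batch size means in audio duration. It assumes that dataset.max_tokens counts raw audio samples and that the input is 16 kHz audio, as in wav2vec 2.0; neither is stated in this section. How the factor of 24 is split between GPUs and gradient accumulation is not specified here, so it is kept as a single multiplier.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: 16 kHz input audio
MAX_TOKENS = 2_560_000    # dataset.max_tokens from the text
BATCH_MULTIPLIER = 24     # factor from the text (2560000*24)

# Hydra overrides corresponding to the changes described for the 2B model.
OVERRIDES_2B = [
    "distributed_training.ddp_backend=fully_sharded",
    "model.activation_checkpoint=true",
    f"dataset.max_tokens={MAX_TOKENS}",
]

# Effective batch size in samples, then converted to audio duration.
effective_samples = MAX_TOKENS * BATCH_MULTIPLIER
seconds_per_batch = MAX_TOKENS / SAMPLE_RATE_HZ
effective_minutes = effective_samples / SAMPLE_RATE_HZ / 60

if __name__ == "__main__":
    print(" ".join(OVERRIDES_2B))
    print(f"effective batch: {effective_samples} samples")
    print(f"per-batch audio: {seconds_per_batch:.0f} s")
    print(f"effective audio per update: {effective_minutes:.0f} min")
```

Under these assumptions, each update sees roughly an hour of audio, which helps explain why fully sharded training and activation checkpointing are used for the 2B model.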
Please cite as:
@article{babu2021xlsr,
title={XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale},
author={Arun Babu and Changhan Wang and Andros Tjandra and Kushal Lakhotia and Qiantong Xu and Naman Goyal and Kritika Singh and Patrick von Platen and Yatharth Saraf and Juan Pino and Alexei Baevski and Alexis Conneau and Michael Auli},
year={2021},
volume={abs/2111.09296},
journal={arXiv},
}