Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot and Ethan Fetaya
The demo page includes many sample videos and comparisons to other baselines.
Official implementation of LipVoicer, a lip-to-speech method. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and incorporate the extracted text through a classifier-guidance mechanism in which a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, which are in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary.
The lip reading network used in LipVoicer is taken from the Visual Speech Recognition for Multiple Languages repository. The ASR system is adapted from Audio-Visual Efficient Conformer for Robust Speech Recognition.
- Clone the repository:
git clone https://github.com/yochaiye/LipVoicer.git
cd LipVoicer
- Install the required packages and ffmpeg
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
cd ..
- Install ibug.face_detection
git clone https://github.com/hhj1897/face_detection.git
cd face_detection
git lfs pull
pip install -e .
cd ..
- Install ibug.face_alignment
git clone https://github.com/hhj1897/face_alignment.git
cd face_alignment
pip install -e .
cd ..
- Install RetinaFace or MediaPipe face tracker
- Install ctcdecode for the ASR beam search
git clone --recursive https://github.com/WayenVan/ctcdecode.git
cd ctcdecode
pip install .
cd ..
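After the installation steps above, a quick sanity check can confirm that the pieces are visible to Python. The following is a minimal sketch and not part of the repository: the helper names `module_available` and `check_environment` are hypothetical, and the import name `ctcdecode` is assumed to match the package installed above.

```python
import importlib.util
import shutil


def module_available(name: str) -> bool:
    """Return True if `name` can be located.

    For dotted names, find_spec imports the parent package (e.g. `ibug`)
    but not the submodule itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is itself missing or malformed.
        return False


def check_environment(modules, executables):
    """Map each requirement to True/False depending on whether it was found."""
    status = {f"module:{m}": module_available(m) for m in modules}
    status.update({f"exe:{e}": shutil.which(e) is not None for e in executables})
    return status


if __name__ == "__main__":
    report = check_environment(
        modules=["ibug.face_detection", "ibug.face_alignment", "ctcdecode"],
        executables=["ffmpeg"],
    )
    for name, ok in report.items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

Any entry reported as MISSING points back to the corresponding installation step.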
We provide the audio generated by LipVoicer for the test videos of LRS2 and LRS3. These files were used to compute the metrics in the paper, so they should facilitate future comparisons.
The links are given below:
We provide pretrained checkpoints for LipVoicer so you can kick-start generating speech for silent videos. You can download checkpoints for the following models:
- MelGen trained on LRS2/LRS3
- ASR finetuned for LipVoicer on LRS2/LRS3
- Language model for the ASR (provided by Audio-Visual Efficient Conformer for Robust Speech Recognition)
- Lip-reading network and its language model (provided by Visual Speech Recognition for Multiple Languages)
- HiFi-GAN trained on 16 kHz audio signals. In the paper we used DiffWave as the vocoder, but HiFi-GAN is used here because it is faster.
The simplest and fastest way to download the models is to run
python download_checkpoints.py
which will download all the pretrained checkpoints and put them in the right place in the repository.
Alternatively, you can download individual checkpoints from Google Drive.
To generate a speech signal for your video, first edit the following arguments in the Hydra config file:
generate.ckpt_path
generate.video_path
generate.save_dir
You can also play with the values of w_video, w_asr, ast_start. Then run the following command
python inference_real_video.py
which also takes care of converting the video to 25 fps if necessary, cropping the mouth region and lip-reading.
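The script performs the frame-rate conversion itself, so nothing below is required. As an illustration of what such a conversion amounts to, here is a minimal sketch that is not taken from the repository. The helper names are hypothetical, and the frame rate is assumed to arrive as a ratio string in the style reported by ffprobe (for example `30000/1001`).

```python
import shlex
from fractions import Fraction


def needs_resampling(frame_rate: str, target_fps: int = 25) -> bool:
    """`frame_rate` is a ratio string such as '30000/1001' or '25/1'."""
    return Fraction(frame_rate) != target_fps


def ffmpeg_fps_command(src: str, dst: str, target_fps: int = 25) -> list:
    """Build an ffmpeg call that resamples the video stream to `target_fps`."""
    return ["ffmpeg", "-y", "-i", src, "-filter:v", f"fps={target_fps}", dst]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command line.
    if needs_resampling("30000/1001"):
        print(shlex.join(ffmpeg_fps_command("input.mp4", "input_25fps.mp4")))
```

Comparing exact fractions rather than floats avoids treating rates such as 24.975 as "close enough" to 25.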
If you wish to generate audio files for all of the test videos of LRS2/LRS3, first download the predicted lip-readings (LRS2, LRS3), and then use the following
python inference_full_test_split.py generate.ckpt_path=<path_to_MelGen_ckpt> \
generate.save_dir=<save_dir> \
generate.lipread_text_dir=<lipread_text_dir> \
dataset.videos_dir=<videos_dir> \
dataset.audios_dir=<audio_dir> \
dataset.mouthrois_dir=<mouthrois_dir>
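Because the command above takes several `key=value` overrides, it can be convenient to assemble it programmatically. The following is a minimal sketch under stated assumptions: `hydra_command` is a hypothetical helper, not part of the repository, and every path in the usage section is a placeholder to replace with your own.

```python
import shlex
import sys


def hydra_command(script: str, overrides: dict) -> list:
    """Turn {'generate.save_dir': 'out'} into [<python>, script, 'generate.save_dir=out']."""
    missing = [key for key, value in overrides.items() if not value]
    if missing:
        raise ValueError(f"empty value for: {', '.join(missing)}")
    return [sys.executable, script] + [f"{k}={v}" for k, v in overrides.items()]


if __name__ == "__main__":
    # Placeholder paths (assumptions), keyed by the override names listed above.
    cmd = hydra_command(
        "inference_full_test_split.py",
        {
            "generate.ckpt_path": "/path/to/melgen_ckpt",
            "generate.save_dir": "/path/to/save_dir",
            "generate.lipread_text_dir": "/path/to/lipread_text",
            "dataset.videos_dir": "/path/to/videos",
            "dataset.audios_dir": "/path/to/audios",
            "dataset.mouthrois_dir": "/path/to/mouth_rois",
        },
    )
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run to actually launch it
```

Building the argument list this way sidesteps shell line-continuation mistakes and fails early when an override is left empty.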
For training LipVoicer on the benchmark datasets, please download LRS2 or LRS3.
The purpose of the data preparation step is to compute the ground-truth mel-spectrograms of the benchmark videos and extract the lip-region videos. At the end of the process, you should have the following directory trees for LRS2 and LRS3:
├── LRS2
│   └── [videos] (contain the video in .mp4)
│       └── [main]
│       └── [pretrain]
│   └── [audios] (contain the audio files in .wav and .wav.spec)
│       └── [main]
│       └── [pretrain]
│   └── [mouth_rois] (contain the mouth ROIs in .npz)
│       └── [main]
│       └── [pretrain]
├── LRS3
│   └── [videos] (contain the video in .mp4)
│       └── [pretrain]
│       └── [trainval]
│       └── [test]
│   └── [audios] (contain the audio files in .wav and .wav.spec)
│       └── [pretrain]
│       └── [trainval]
│       └── [test]
│   └── [mouth_rois] (contain the mouth ROIs in .npz)
│       └── [pretrain]
│       └── [trainval]
│       └── [test]
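The layout above can be checked with a short script once the data preparation is done. This is a minimal sketch, not part of the repository. It assumes that the bracketed names in the tree are literal directory names and that the split folders sit directly under `videos`, `audios` and `mouth_rois`; `missing_dirs` is a hypothetical helper.

```python
from pathlib import Path

# Expected split folders under each top-level directory (assumed from the tree above).
EXPECTED_SPLITS = {
    "LRS2": ["main", "pretrain"],
    "LRS3": ["pretrain", "trainval", "test"],
}
TOP_LEVEL = ["videos", "audios", "mouth_rois"]


def missing_dirs(root, dataset: str) -> list:
    """Return the expected directories that do not exist under root/dataset."""
    base = Path(root) / dataset
    expected = [
        base / top / split
        for top in TOP_LEVEL
        for split in EXPECTED_SPLITS[dataset]
    ]
    return [p.relative_to(base).as_posix() for p in expected if not p.is_dir()]


if __name__ == "__main__":
    # "." is a placeholder for the directory that holds LRS2/ and LRS3/.
    for name in EXPECTED_SPLITS:
        print(name, "missing:", missing_dirs(".", name))
```

An empty list for a dataset means every expected folder is present; it does not verify the files inside them.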
To this end, perform the following steps inside the LipVoicer directory:
- Extract the audio files from the videos (audio files will be saved in WAV format)
python dataloaders/extract_audio_from_video.py --ds_dir <path_to_video_dir> \
--split <trainval/test/...> \
--out_dir <output_directory>
The wav files will be saved to output_directory/split
- Compute the log mel-spectrograms and save them
python dataloaders/wav2mel.py dataset.audios_dir=<path_to_directory_with_extracted_wav_files>
It will save the mel-spectrograms with extension .wav.spec.
- Crop the mouth regions of the videos, convert to greyscale and save to <mouthrois_dir>. The easiest way is to:
  - Clone Visual Speech Recognition for Multiple Languages.
  - Download the landmarks for LRS2/LRS3.
  - Copy dataloader/extract_mouthcrops.py from the LipVoicer repository and run it from the command line. It saves the greyscale mouthcrops as numpy arrays in .npz files.
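For the audio-extraction step above, the mapping from videos to WAV files under output_directory/split can be sketched as follows. This is an illustration only and not the code of dataloaders/extract_audio_from_video.py. The helper names are hypothetical, and the mono 16 kHz output is an assumption based on the 16 kHz vocoder mentioned earlier.

```python
from pathlib import Path


def wav_path_for(video: Path, ds_dir: Path, out_dir: Path) -> Path:
    """Mirror the video's location relative to ds_dir under out_dir, with a .wav suffix."""
    return (out_dir / video.relative_to(ds_dir)).with_suffix(".wav")


def ffmpeg_audio_command(video: Path, wav: Path, sample_rate: int = 16000) -> list:
    """Build an ffmpeg call that drops the video stream and writes mono audio."""
    return ["ffmpeg", "-y", "-i", str(video), "-vn",
            "-ac", "1", "-ar", str(sample_rate), str(wav)]


def plan_extraction(ds_dir, split: str, out_dir):
    """Yield (video, wav, command) for every .mp4 found under ds_dir/split."""
    ds_dir, out_dir = Path(ds_dir), Path(out_dir)
    for video in sorted((ds_dir / split).rglob("*.mp4")):
        wav = wav_path_for(video, ds_dir, out_dir)
        yield video, wav, ffmpeg_audio_command(video, wav)


if __name__ == "__main__":
    # Placeholder directories; this only prints the planned commands.
    for _, wav, cmd in plan_extraction("videos", "test", "audios"):
        print(wav, "<-", " ".join(cmd))
```

To run the plan, create `wav.parent` with `mkdir(parents=True, exist_ok=True)` and pass each command to `subprocess.run`.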
- Train MelGen
CUDA_VISIBLE_DEVICES=0,1 python train_melgen.py train.save_dir=<save_dir> \
dataset.videos_dir=<videos_path> \
dataset.audios_dir=<audios_dir> \
dataset.mouthrois_dir=<mouthrois_dir>
The progress of the training stage is monitored with TensorBoard.
- Finetune the modified ASR, which now includes the diffusion time-step embedding. For further details on how to carry out this step, please refer to Audio-Visual Efficient Conformer for Robust Speech Recognition.
@inproceedings{
yemini2024lipvoicer,
title={LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading},
author={Yochai Yemini and Aviv Shamsian and Lior Bracha and Sharon Gannot and Ethan Fetaya},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
}