speech-utcluj/deep_lip_reading

Deep Lip Reading

This repository contains code for evaluating the best performing lip reading model described in the paper Deep Lip Reading: A comparison of models and an online application. The model is based on the Transformer architecture.

Pipeline figures: Input → Crop → Enc-Dec Attention → Prediction
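The enc-dec attention stage in the pipeline above is the Transformer's standard scaled dot-product attention. As a reminder of the core operation, here is a minimal single-head NumPy sketch (illustrative only, not the repository's TensorFlow implementation):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d)) v -- the core Transformer operation."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (tgt_len, src_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v, weights

q = np.random.rand(3, 4)        # 3 decoder positions, feature dim 4
k = v = np.random.rand(5, 4)    # 5 encoder positions
ctx, attn = scaled_dot_product_attention(q, k, v)      # attn: (3, 5)
```

The `attn` matrix is the kind of per-step attention map that the visualization option renders.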

Dependencies

System

  • ffmpeg

Python

  • TensorFlow
  • NumPy
  • PyAV

Optional, for visualization:
  • MoviePy
  • Imageio-ffmpeg
  • OpenCV
  • TensorBoard

The recommended way to install the Python dependencies is to create a new virtual environment and then run

pip install -r requirements.txt

Demo

To verify that everything works:

  1. Run ./download_models.sh to get the pretrained models
  2. Run a simple demo:

python main.py --lip_model_path models/lrs2_lip_model

expected output:

(wer=0.0) IT'S-THAT-SIMPLE --> IT'S-THAT-SIMPLE
 1/1 [================] - ETA: 0:00:00 - cer: 0.00 - wer: 0.00
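The cer and wer figures are character and word error rates: the Levenshtein edit distance between prediction and ground truth, normalized by the reference length. A minimal sketch of the standard computation (splitting words on `-` as in the log lines above; not necessarily the repository's exact implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split("-")
    return edit_distance(ref_words, hyp.split("-")) / len(ref_words)
```

For example, `wer("AND-WE-WERE-RIGHT", "AND-WE-WERE-READ")` is 0.25 (one substitution in four words), reported as wer=25.0 in the log format.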

Visualization

To visualize the input, attention matrix, and predictions, set the --tb_eval flag to 1 (not supported with beam search):

python main.py --lip_model_path models/lrs2_lip_model --tb_eval 1 --img_channels 3

Then point TensorBoard at the resulting log directory:

tensorboard --logdir=eval_tb_logs

Datasets

The models have been trained and evaluated on the LRW and LRS datasets as well as the non-public MVLRS dataset. More details can be found in the paper.

To evaluate on LRS2, download the dataset and extract it into e.g. data/lrs2.

For a quick evaluation on the test set without beam search, run:

python main.py --gpu_id 0 --lip_model_path models/lrs2_lip_model --data_path data/lrs2/main --data_list media/lrs2_test_samples.txt 

This should take a few minutes on a GPU and result in a WER of approximately 58%.

expected output:

(wer=116.7) AND-FOR-ME-THE-SURPRISE-WAS --> I-FOUND-FOR-ME-THAT-IT-IS-A-SURPRISE-RATE
   1/1243 [..............................] - ETA: 54:59 - cer: 0.6667 - wer: 1.1667
(wer=100.0) THEY'RE-MOVING-AROUND --> THEY-MOVED-IT-AROUND
   2/1243 [..............................] - ETA: 33:51 - cer: 0.5238 - wer: 1.0833
(wer=25.0) AND-WE-WERE-RIGHT --> AND-WE-WERE-READ
   3/1243 [..............................] - ETA: 26:30 - cer: 0.4276 - wer: 0.8056
(wer=100.0) AND-THE-NEXT-DAY --> IT'S-NOT-ACTUALLY
   4/1243 [..............................] - ETA: 22:40 - cer: 0.5395 - wer: 0.8542
(wer=62.5) WHEN-THERE-ISN'T-MUCH-ELSE-IN-THE-GARDEN --> WHETHER-IT'S-MUCH-HOLDING-THE-GARDEN
   5/1243 [..............................] - ETA: 21:46 - cer: 0.4916 - wer: 0.8083
                                         .
                                         .
                                         .
(wer=40.0) THESE-LAWS-WOULD-REMAIN-IN-PLACE-FOR-OVER-200-YEARS --> THESE-COURSE-WOULD-HAVE-REPLACED-FOR-OVER-200-YEARS
1239/1243 [============================>.] - ETA: 3s - cer: 0.3828 - wer: 0.5845
(wer=28.6) AS-A-RESULT-OF-THE-GUNPOWDER-PLOT --> AS-A-RESULT-OF-THE-COMPOUND-APPROACH
1240/1243 [============================>.] - ETA: 2s - cer: 0.3828 - wer: 0.5843
(wer=0.0) IT-MAY-TAKE-SOME-TIME --> IT-MAY-TAKE-SOME-TIME
1241/1243 [============================>.] - ETA: 1s - cer: 0.3824 - wer: 0.5838
(wer=0.0) YOU-KNOW-MOST-OF-IT --> YOU-KNOW-MOST-OF-IT
1242/1243 [============================>.] - ETA: 0s - cer: 0.3821 - wer: 0.5834
(wer=100.0) SO-I'LL-ASK-YOU-AGAIN --> WHEN-I-SAW-HIM
1243/1243 [==============================] - 951s 765ms/step - cer: 0.3825 - wer: 0.5837
lm=None, beam=0, bs=1, test_aug:0, horflip True: CER 0.3825, WER 0.583690

For the best results, run a full beam search, using the language model and performing simple test-time augmentation in the form of horizontal flips:
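Horizontal-flip test-time augmentation runs the model on both the clip and its mirror image and fuses the two outputs. A rough NumPy sketch of the idea (the array layout and fusion-by-averaging are assumptions, not the repository's exact code):

```python
import numpy as np

def flip_augment(frames):
    """Stack a clip with its horizontal mirror: (T, H, W, C) -> (2, T, H, W, C)."""
    flipped = frames[:, :, ::-1, :]      # mirror along the width axis
    return np.stack([frames, flipped])

def fuse_probs(per_view_probs):
    """Average output distributions across the augmented views."""
    return per_view_probs.mean(axis=0)

clip = np.random.rand(5, 112, 112, 1)    # dummy 5-frame grayscale clip
batch = flip_augment(clip)               # shape (2, 5, 112, 112, 1)
```

Mirroring is a safe augmentation for lip reading because a horizontally flipped mouth produces the same transcription.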

python main.py --gpu_id 0 --lip_model_path models/lrs2_lip_model --lm_path models/lrs2_language_model --data_path data/lrs2/main --data_list media/lrs2_test_samples.txt --graph_type infer --test_aug_times 2 --beam_size 35

This will take a few hours to complete on a GPU and give a WER of approximately 49%.
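For reference, beam search keeps the --beam_size highest-scoring partial transcriptions at each decoding step instead of committing greedily to a single one. A toy sketch of the general algorithm (the actual decoder, adapted from Tensor2Tensor, additionally handles batching, length normalization, and language-model fusion):

```python
import math

def beam_search(log_prob_fn, beam_size, max_len, eos="</s>"):
    """Generic beam search; log_prob_fn(prefix) -> {token: log-probability}."""
    beams = [((), 0.0)]                  # (token tuple, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in log_prob_fn(prefix).items():
                if tok == eos:
                    finished.append((prefix, score + lp))
                else:
                    candidates.append((prefix + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# toy scorer: prefers "a" first, then end-of-sequence
def toy_scores(prefix):
    if not prefix:
        return {"a": math.log(0.9), "b": math.log(0.1)}
    return {"</s>": math.log(0.9), "a": math.log(0.1)}

best, score = beam_search(toy_scores, beam_size=2, max_len=3)  # best == ("a",)
```

A larger beam explores more hypotheses and usually lowers the WER, at a roughly proportional cost in decoding time.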

Citation

If you use this code, please cite:

@InProceedings{Afouras18b,
  author       = "Afouras, T. and Chung, J.~S. and Zisserman, A.",
  title        = "Deep Lip Reading: a comparison of models and an online application",
  booktitle    = "INTERSPEECH",
  year         = "2018",
}

Acknowledgments

The Transformer model is based on the implementation of Kyubyong.

The beam search was adapted from Tensor2Tensor.

The char-RNN language model uses code from sherjilozair.
