- Python >= 3.6, PyTorch >= 1.8, and ffmpeg
- Set up OpenFace
  - We use the OpenFace tools to extract the initial pose of the reference image.
  - Make sure you have installed this tool, and set OPENFACE_POSE_EXTRACTOR_PATH in config.py. For example, on Windows it should be the absolute path of "FeatureExtraction.exe".
- Other requirements are listed in 'requirements.txt'.
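
A wrong OpenFace path is easy to miss until the script fails, so it can help to check the setting up front. The following is a minimal sketch, not part of this codebase: `check_extractor_path` is a hypothetical helper, and the commented-out import assumes that config.py exposes OPENFACE_POSE_EXTRACTOR_PATH as a module-level string.

```python
import os
from pathlib import Path


def check_extractor_path(path_str):
    """Return a list of problems found with the configured extractor path."""
    problems = []
    path = Path(path_str)
    if not path.is_absolute():
        problems.append("path is not absolute")
    if not path.is_file():
        problems.append("path does not point to an existing file")
    elif not os.access(path, os.X_OK):
        problems.append("file is not executable")
    return problems


# Assumed usage (config.py layout is an assumption):
# from config import OPENFACE_POSE_EXTRACTOR_PATH
# for problem in check_extractor_path(OPENFACE_POSE_EXTRACTOR_PATH):
#     print("OPENFACE_POSE_EXTRACTOR_PATH:", problem)
```

An empty list means the path is absolute, exists, and is executable. The sketch does not verify that the file is actually the OpenFace feature extractor.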
Please download the pretrained checkpoint from Google Drive and unzip it into the directory (/checkpoints). Alternatively, manually modify the GENERATOR_CKPT and AUDIO2POSE_CKPT settings in config.py.
We employ the CMU phoneset to represent phonemes; the extra 'SIL' symbol denotes silence. All phonemes are listed in 'phindex.json'.
We have extracted the phonemes for the audios in the 'sample/audio' directory. For other audios, you can extract the phonemes with other ASR tools and then map them to the CMU phoneset, or email wangsuzhen@corp.netease.com for help.
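
The mapping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: it assumes your ASR or alignment tool emits ARPAbet-style symbols (possibly lowercase, with stress digits such as 'AH0'), and the silence aliases are hypothetical and should be adapted to your tool. It only normalizes symbols; the layout of the phoneme JSON file that test_script.py expects is not covered here.

```python
import re

# The 39 CMU phonemes plus the extra 'SIL' symbol described above.
CMU_PHONESET = frozenset(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH SIL".split()
)

# Hypothetical aliases some tools emit for pauses; adjust for your tool.
SILENCE_ALIASES = {"", "SIL", "SP"}


def to_cmu(symbol):
    """Map one ARPAbet-style symbol (e.g. 'ah0', 'sp') to a CMU phoneset symbol."""
    cleaned = symbol.strip().upper()
    if cleaned in SILENCE_ALIASES:
        return "SIL"
    cleaned = re.sub(r"\d+$", "", cleaned)  # drop stress digits: 'AH0' -> 'AH'
    if cleaned not in CMU_PHONESET:
        raise ValueError(f"cannot map {symbol!r} to the CMU phoneset")
    return cleaned


def map_sequence(symbols):
    """Map a whole sequence of symbols, failing loudly on unknown ones."""
    return [to_cmu(s) for s in symbols]
```

Raising on unknown symbols is a deliberate choice in this sketch: a silently dropped phoneme would shift the alignment with the audio.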
python test_script.py --img_path xxx.jpg --audio_path xxx.wav --phoneme_path xxx.json --save_dir "YOUR_DIR"
Note that each input image must have equal height and width, and the face should be appropriately cropped, as in samples/imgs. You can also preprocess your images with image_preprocess.py.
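
To run the script over several audio clips with one reference image, the command above can be assembled programmatically. The sketch below is an illustration only: the helper names are hypothetical, and pairing each .wav with a same-named .json phoneme file is an assumed naming convention, not something this codebase requires. Only the four flags shown in the command above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_command(img_path, audio_path, phoneme_path, save_dir):
    """Build the argument list for one test_script.py run."""
    return [
        sys.executable, "test_script.py",
        "--img_path", str(img_path),
        "--audio_path", str(audio_path),
        "--phoneme_path", str(phoneme_path),
        "--save_dir", str(save_dir),
    ]


def build_batch(img_path, audio_dir, phoneme_dir, save_dir):
    """Pair each .wav in audio_dir with the same-named .json in phoneme_dir."""
    commands = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        phoneme = Path(phoneme_dir) / (wav.stem + ".json")
        if phoneme.is_file():  # skip audio that has no phoneme file
            commands.append(build_command(img_path, wav, phoneme, save_dir))
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.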
@InProceedings{wang2021one,
author = {Suzhen Wang and Lincheng Li and Yu Ding and Xin Yu},
title = {One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning},
booktitle = {AAAI 2022},
year = {2022},
}
This codebase is based on First Order Motion Model and imaginaire. We thank the authors for their contributions.