The following features are currently supported:
- Scene Detection
- Face Cropping
- Landmark Extraction
- Face Angle Extraction
- Facial Action Unit (FAU) Extraction
- Audio Feature Extraction
If your dataset is downloaded from the web, it may consist of full-frame images (where the face occupies only a small portion of the frame) and may include non-continuous frames.
First, use scene detection to split the videos so that no clip contains an abrupt scene change.
You can refer to the following command:
python extract_scenes.py \
--from_directory '/path/to/before_scene_detected/' \
--output_directory '/path/to/after_scene_detected/'
Subsequent filtering can follow some rough rules, such as removing files that are shorter than two seconds, have an incorrect file format, or are empty.
python filter_videos_rough.py \
--before_filtering_dir '/path/to/before_filtering/' \
--after_filtering_dir '/path/to/after_filtering/' \
--min_duration 2 \
--min_size 10
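The rough rules above can be sketched as a single predicate. This is a minimal illustration, not the logic of filter_videos_rough.py: the function name, the allowed extension set, and the way duration and size are passed in are all assumptions, and the unit of `min_size` is not stated in the text, so the caller must pass `size` in the same unit.

```python
from pathlib import Path

# Assumption: only .mp4 counts as a correct format; adjust to your dataset.
ALLOWED_EXTENSIONS = {".mp4"}


def passes_rough_filter(filename, duration_s, size,
                        min_duration=2.0, min_size=10,
                        allowed_extensions=ALLOWED_EXTENSIONS):
    """Return True if a clip survives the three rough rules.

    Hypothetical helper: defaults mirror --min_duration 2 and --min_size 10.
    `size` and `min_size` must share the same (unspecified) unit.
    """
    if Path(filename).suffix.lower() not in allowed_extensions:
        return False  # incorrect file format
    if size < min_size:
        return False  # empty or near-empty file
    if duration_s < min_duration:
        return False  # shorter than the minimum duration
    return True
```

In practice the duration and size would come from a probe of each file (for example with ffprobe and os.path.getsize); keeping the rule itself free of I/O makes it easy to check in isolation.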
Other filtering algorithms are also available, such as hand detection (hand_detection.py) and blur detection (blur_detection.py). We have not thoroughly tested these scripts, but you can use them as a reference for building your own filters, especially if you aim to train models with diffusion-based methods like EMO.
Upon completion of this step, you will have obtained the raw video data. An example of such data can be found at data_processing/raw_data/FAzSK8PLmGI.mp4. The goal of this step is to build a dataset similar to the HDTF dataset.
The purpose of this step is to crop the face from the frame. Note that our strategy differs from some methods: we do not align the face but instead fix the camera position. Please ensure that the person's face does not move significantly.
The cropping strategy needs to be adjusted to the actual scenario. In videos similar to the HDTF dataset, the face occupies the majority of the frame. In contrast, in videos like those used in EMO, the face may occupy only a small portion of the frame.
Data flow: from data_processing/raw_data/ to data_processing/cropped_faces/
python extract_cropped_faces.py \
--from_dir_prefix "data_processing/raw_data/" \
--output_dir_prefix "data_processing/cropped_faces/" \
--expanded_ratio 0.6
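One way to picture a fixed-camera crop is to compute a single box for the whole clip instead of one box per frame. The sketch below is an assumption about how such a strategy could work, not the code of extract_cropped_faces.py: the helper names are hypothetical, and interpreting the ratio as growth of the box size split evenly across both sides is only one plausible reading of `--expanded_ratio`.

```python
def union_box(boxes):
    """Smallest box (x1, y1, x2, y2) that covers every per-frame face box."""
    x1s, y1s, x2s, y2s = zip(*boxes)
    return min(x1s), min(y1s), max(x2s), max(y2s)


def expand_box(box, ratio, frame_w, frame_h):
    """Grow a box by `ratio` of its size (half on each side), clamped to the frame.

    Assumption: this is one plausible meaning of an expansion ratio.
    """
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * ratio / 2
    dy = (y2 - y1) * ratio / 2
    return (max(0, round(x1 - dx)), max(0, round(y1 - dy)),
            min(frame_w, round(x2 + dx)), min(frame_h, round(y2 + dy)))


def fixed_crop(boxes, ratio, frame_w, frame_h):
    """One crop window shared by all frames, so the 'camera' stays fixed."""
    return expand_box(union_box(boxes), ratio, frame_w, frame_h)
```

Because the same window is applied to every frame, large head movement either inflates the crop or pushes the face out of it, which is why clips with significant face movement should be avoided.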
We strongly recommend using the ffmpeg command for this step (our version is 5.0.1). Using other modules such as opencv may result in dropped frames, which can leave the audio and video frames misaligned.
- All video files are converted to 25 frames per second (fps), from data_processing/cropped_faces/ to data_processing/specified_formats/videos/videos_25fps/.
- The corresponding audio tracks are converted to a fixed sampling rate of 16k, from data_processing/raw_data to data_processing/specified_formats/audios/audios_16k/.
- Video frames are extracted in png format, from data_processing/specified_formats/videos/videos_25fps/ to data_processing/specified_formats/videos/video_frames/. By default, frame filenames start from 000001.png.
Run:
python extract_raw_video_data.py \
--source_folder 'data_processing/cropped_faces/' \
--video_target_folder 'data_processing/specified_formats/videos/videos_25fps/' \
--audio_target_folder 'data_processing/specified_formats/audios/audios_16k/' \
--frames_target_folder 'data_processing/specified_formats/videos/video_frames/' \
--convert_video True \
--convert_audio True \
--extract_frames True
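For readers who want to see what the three conversions look like as ffmpeg calls, here is a minimal sketch. It is an assumption about equivalent commands, not the content of extract_raw_video_data.py: the helper names are hypothetical, and only standard ffmpeg options (`-r`, `-vn`, `-ar`, `-start_number`) are used.

```python
import os
import subprocess


def to_25fps_cmd(src, dst, fps=25):
    # -r placed after the input sets the output frame rate.
    return ["ffmpeg", "-y", "-i", src, "-r", str(fps), dst]


def to_16k_audio_cmd(src, dst, sample_rate=16000):
    # -vn drops the video stream; -ar sets the audio sampling rate.
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ar", str(sample_rate), dst]


def to_frames_cmd(src, frames_dir):
    # %06d.png yields 000001.png, 000002.png, ...
    # -start_number 1 makes the first index explicit.
    pattern = os.path.join(frames_dir, "%06d.png")
    return ["ffmpeg", "-y", "-i", src, "-start_number", "1", pattern]


def run(cmd):
    """Execute one command and raise if ffmpeg exits with an error."""
    subprocess.run(cmd, check=True)
```

A call such as `run(to_25fps_cmd("in.mp4", "out.mp4"))` would then perform one conversion; building the argument list separately from running it makes the commands easy to log and inspect.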
The purpose of this step is to obtain 68 2D facial landmarks, as illustrated in the figure below.
Data flow: from data_processing/specified_formats/videos/video_frames/ to data_processing/specified_formats/videos/landmarks/
Run:
python extract_frame_landmarks.py \
--from_dir './data_processing/specified_formats/videos/video_frames/' \
--lmd_output_dir './data_processing/specified_formats/videos/landmarks/' \
--skip_existing
Landmarks are written to a text file with the same name as the video. Each line holds the coordinates of the 68 landmarks for one frame. For the meaning of each landmark, please refer to the figure above. Each file has the following format:
000001.png 509_230 511_269 515_305 520_339 531_367 550_393 574_416 602_432 630_436 656_428 679_411 698_388 712_363 718_335 720_305 723_275 725_243 544_232 560_222 579_217 599_219 616_225 650_222 668_213 687_209 705_214 715_231 633_245 634_263 635_279 637_297 614_316 623_319 633_322 643_319 651_315 568_251 581_251 593_249 604_250 593_253 581_253 656_250 669_246 682_247 693_251 682_253 669_252 588_356 606_347 622_343 632_345 643_342 656_346 670_354 656_369 643_376 632_378 621_378 605_372 595_356 621_352 632_353 643_351 663_354 642_361 632_363 621_363
...
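The line format above can be read back with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes each token is an `x_y` pair of integers, which matches the sample line but is not stated explicitly.

```python
def parse_landmark_line(line):
    """Split one line into (frame_name, [(x, y), ...]).

    Assumption: tokens after the frame name are 'x_y' integer pairs.
    """
    frame_name, *tokens = line.split()
    points = [tuple(int(v) for v in token.split("_")) for token in tokens]
    return frame_name, points


def parse_landmark_lines(lines, expected_points=68):
    """Parse an iterable of lines into {frame_name: points}, skipping blanks."""
    frames = {}
    for line in lines:
        if not line.strip():
            continue
        name, points = parse_landmark_line(line)
        if len(points) != expected_points:
            raise ValueError(f"{name}: expected {expected_points} points, got {len(points)}")
        frames[name] = points
    return frames
```

Passing an open file object works as well, since iterating a file yields its lines: `parse_landmark_lines(open(path, encoding="utf-8"))`.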
Here is an example of a successfully detected case:
Facial orientation involves three-dimensional poses: pitch (tilt up and down), yaw (turn left and right), and roll (tilt side to side), as shown in the figure below.
Follow the instructions at 3DDFA_V2 to build the environment. Copy the path link to .3ddfav2_path and run the following code to obtain pose angles.
python extract_face_orientation.py \
--video_frames_dir 'data_processing/specified_formats/videos/video_frames/' \
--visualization_dir 'data_processing/specified_formats/videos/pose_orientations/visualization/' \
--pose_data_dir 'data_processing/specified_formats/videos/pose_orientations/pose_data/'
By default, the code produces a visualization for each video. Below is a visualization example that clearly represents the facial orientation information.
Yaw, pitch, and roll each range from -180 to +180 degrees. In practice, however, facial orientations fall predominantly within -90 to +90 degrees. Below is an example that shows the actual data for a single image.
Other tools (which we have not tested) can also be used to extract facial orientation: OpenFace. Additionally, GAIA has mentioned that they use 3DDFA. EMO utilizes mediapipe to obtain pose speed. DAE-talker utilizes this tool.
You can skip this stage if you do not need it. AU definition docs
This part is based on OpenFace. We recommend running the code in Docker and following the commands from the OpenFace wiki.
After launching the Docker instance, run:
python extract_action_units.py \
--from_dir_path 'data_processing/specified_formats/videos/video_frames/' \
--to_dir_path 'data_processing/specified_formats/videos/facial_action_units/'
MFCC stands for Mel-frequency cepstral coefficients. It lets us test code quickly without installing many environments. The output shape of audio_feature will be (T, 39). This feature is not robust and is only suitable for early code testing. For detailed usage, please refer to mfcc_feature_example.py.
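A 39-dimensional MFCC feature is commonly built from 13 static coefficients plus their first-order and second-order deltas. Whether mfcc_feature_example.py arrives at 39 this way is an assumption; the sketch below only illustrates how the stacking produces a (T, 39) output, using the standard regression formula for deltas and hypothetical function names.

```python
def delta(frames, window=2):
    """Regression deltas over +/- `window` frames; edge frames are repeated."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(num_frames - 1, t + n)][k]
                prv = frames[max(0, t - n)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_with_deltas(mfcc):
    """Concatenate static, delta, and delta-delta features per frame.

    Assumption: 13 static coefficients in, so 13 * 3 = 39 values out.
    """
    d1 = delta(mfcc)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(mfcc, d1, d2)]
```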
Before extraction, please make sure that all audio files have a sampling rate of 16k Hz, then download the weights from URL and put them into the weights dir. Although this model was pre-trained without supervision on 10,000 hours of Chinese data, we have found that it also generalizes to other languages.
python extract_audio_features.py \
--model_path "weights/chinese-hubert-large" \
--audio_dir_path "./data_processing/specified_formats/audios/audios_16k/" \
--audio_feature_saved_path "./data_processing/specified_formats/audios/hubert_features/" \
--computed_device "cuda" \
--padding_to_align_audio True
- The purpose of padding_to_align_audio is to pad the end of the audio so that the feature length stays consistent with the video frames, which is convenient for training.
- The result shape is (25, T, 1024), where 25 counts the output of the audio feature extraction layer plus the 24 hidden layers. You can change the code to get specific layers, such as the last layer, for training.
- We extract all layers because we trained with weighted-sum strategies in diffdub and anitalker.
- Currently, we have only tested feature extraction with the hubert model.
- If your audio is long (> 120 seconds), please change computed_device from cuda to cpu to avoid GPU out-of-memory errors.
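Two of the points above can be made concrete with a small sketch: how much end padding yields two feature frames per 25 fps video frame, and how a weighted sum collapses the 25 layers. Both helpers are hypothetical and rest on assumptions: the standard HuBERT convolutional front end (one frame per 320 samples with a 400-sample receptive field at 16 kHz) and softmax-normalized layer weights. The actual padding in extract_audio_features.py and the weighting in diffdub and anitalker may differ.

```python
import math

HUBERT_STRIDE = 320   # samples between consecutive feature frames (20 ms at 16 kHz)
HUBERT_WINDOW = 400   # samples covered by one feature frame
FEATS_PER_VIDEO_FRAME = 2  # 50 feature frames/s divided by 25 video frames/s


def hubert_frames(num_samples):
    """Number of feature frames for an unpadded waveform (assumed front end)."""
    if num_samples < HUBERT_WINDOW:
        return 0
    return (num_samples - HUBERT_WINDOW) // HUBERT_STRIDE + 1


def samples_to_pad(num_samples, num_video_frames):
    """Zero samples to append so T equals 2 * num_video_frames."""
    target = FEATS_PER_VIDEO_FRAME * num_video_frames
    needed = (target - 1) * HUBERT_STRIDE + HUBERT_WINDOW
    return max(0, needed - num_samples)


def weighted_layer_sum(features, logits):
    """Collapse (L, T, D) features to (T, D) with softmax weights over L layers."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    num_frames, dim = len(features[0]), len(features[0][0])
    return [[sum(w * layer[t][d] for w, layer in zip(weights, features))
             for d in range(dim)] for t in range(num_frames)]
```

For example, one second of 16 kHz audio gives 49 frames unpadded, so 80 samples of padding are needed to reach the 50 frames that match 25 video frames. In a training setup the `logits` would typically be learnable parameters, and selecting only the last layer corresponds to `features[-1]`.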
python extract_audio_features_whisper
- The detailed code can be found here
- Some issues here still need to be confirmed. Features are extracted for a fixed 30 seconds, and it remains to be checked whether this is effective for shorter durations. (Update 20240731: I found that it has a minor impact on training digital talking heads.)
- The vocabulary comprises 23,632 tokens, formed by merging two sets of 320 vector quantization (VQ) groups. The original mapping, extracted from feats_ctxt2v.zip, can be found in the label2vqidx file.
extract_asr_features.py
- Add detailed Environment Config
- Visualized Jupyter Code
Since I still have a lot of other work to do, I encourage anyone to polish and contribute code to this repository.
The examples provided herein are based on the HDTF or VoxCeleb datasets and are intended solely for educational and academic research purposes. Please do not use them for any other purposes.
- https://github.com/MRzzm/HDTF
- https://github.com/TencentGameMate/chinese_speech_pretrain
- https://github.com/DefTruth/torchlm
- https://github.com/TadasBaltrusaitis/OpenFace
- https://github.com/cleardusk/3DDFA_V2
- https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- https://github.com/cpdu/unicats
- https://github.com/X-LANCE/SLAM-LLM