Skip to content

Latest commit

 

History

History
139 lines (106 loc) · 8.33 KB

README.md

File metadata and controls

139 lines (106 loc) · 8.33 KB

DVGFormer: Learning Camera Movement Control from Real-World Drone Videos

Paper | Project Page | Github | Data

Official implementation of our paper:
Learning Camera Movement Control from Real-World Drone Videos
Yunzhong Hou, Liang Zheng, Philip Torr

"To record as is, not to create from scratch." Video preview

Abstract: This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting tower and buildings, which are very useful for recording high-quality videos.

Project Updates

  • 🔥🔥 News: 2024/12/22: We have released model checkpoints and dataset on huggingface.
  • 🔥 News: 2024/12/13: We have released the DroneMotion-99k dataset. Check out the README file for the steps needed after downloading the HDF5 archieve.
  • 🔥 News: 2024/12/13: We have released the the code for the DVGFormer!
  • 🔥 News: 2024/12/13: Our paper is now online!

Todo List

  • Initialize repo
  • Code release
  • Blender scene files for evaluation
  • DroneMotion-99k dataset
    • HDF5 archieve of filtered 3D camera trajectories
    • scripts for downloading the corresponding YouTube videos
  • Release model checkpoints

Model Checkpoint

Please refer to huggingface for checkpoint download link.

Installation

  1. Create and activate a Conda environment:

    conda create -n dvgformer python=3.10
    conda activate dvgformer
    conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
    conda install -c conda-forge ffmpeg
    pip install -r requirements.txt
  2. Download evaluation data

    For real city 3D scans from Google Earth, please download from this link.

    For synthetic natural scenes, you can either generate your own version from the official git repo princeton-vl/infinigen or directly download from this link. Note that our version has very basic graphic settings and you might need to generate your own version if you need higher graphics.

    After downloading the evaluation environments, your folder should look like this

    dvgformer/
    ├── infinigen/
    │   ├── arctic/
    │   ...
    │   └── snowy_mountain/
    ├── blosm/
    │   ├── himeji/
    │   ...
    │   └── sydney/
    ├── src/
    ├── README.md
    ...
    
  3. Download training data

    We provide the Colmap 3D reconstruction results and the filtered camera movement sequences in our DroneMotion-99k dataset. You can download either a minimal dataset with 10 videos and 129 sequences link or the full dataset with 13,653 videos and 99,003 camera trajectories link.

    After downloading the training data, your folder should look like this

    dvgformer/
    ├── youtube_drone_videos/
    │   ├── dataset_full.h5
    │   └── dataset_mini.h5
    ├── src/
    ├── README.md
    ...
    

    Due to the YouTube policy, we cannot share the video MP4s or the frames. As an alternative, we include a python script download_videos.py that can help you automatically download the videos and extract the frames.

    python download_videos.py --hdf5_fpath youtube_drone_videos/dataset_mini.h5
    python download_videos.py --hdf5_fpath youtube_drone_videos/dataset_full.h5

    This should update your downloaded HDF5 dataset file with the video frames.

    You can also adjust the number of workers for the download process or the frame extraction process in download_videos.py by specifying --num_download_workers or --num_extract_workers.

Running DVGFormer Model

  1. Inference: You can download the model checkpoint from huggingface. You can also directly load the pretrained model from this code

    import torch
    from src.models import DVGFormerModel
    
    model = DVGFormerModel.from_pretrained(
        'yunzhong-hou/DVGFormer'
        ).to('cuda').to(torch.bfloat16)

    For blender evaluation, you can run the following script.

    python blender_eval.py 
  2. Train your own model: We use two RTX 3090 in our experiments. Please run the following script for training your own model.

    bash run_gpu01.sh

Citation

If you find this project useful, please consider citing:

@article{hou2024dvgformer,
  author    = {Hou, Yunzhong and Zheng, Liang and Torr, Philip},
  title     = {Learning Camera Movement Control from Real-World Drone Videos},
  journal   = {arXiv preprint},
  year      = {2024},
}