GitHub - alibaba/VideoMV: VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model.

Qi Zuo*, Xiaodong Gu*, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang


Project page | Paper | YouTube | 3D Rendering Dataset

TODO 🚩

Release GS, NeuS, and NeRF reconstruction code.
News: Released the text-to-MV (G-Objaverse + LAION) training code and pretrained model (2024.04.22). See the Inference and Training guidelines.

[Video: multi-view images generated using prompts from DreamFusion420]

Release the training code.
Release multi-view inference code and pretrained weights (G-Objaverse).

Architecture

Install

System requirements: Ubuntu 20.04
Tested GPUs: A100

Install the requirements with the following commands:

git clone https://github.com/alibaba/VideoMV.git
conda create -n VideoMV python=3.8
conda activate VideoMV
cd VideoMV && bash install.sh

Inference

# Download our pretrained models
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/pretrained_models.zip
unzip pretrained_models.zip
# text-to-mv sampling
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg ./configs/t2v_infer.yaml
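# text-to-mv sampling using the pretrained model trained on LAION + G-Objaverse
# (wget cannot fetch oss:// URLs; this is the same object via the bucket's public HTTPS endpoint)
wget https://virutalbuy-public.oss-cn-hangzhou.aliyuncs.com/share/aigc3d/videomv_laion/non_ema_00365000.pth
# Set test_model in ./configs/t2v_infer.yaml to the path of non_ema_00365000.pth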
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg ./configs/t2v_infer.yaml


# image-to-mv sampling
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg ./configs/i2vgen_xl_infer.yaml

# To test your own prompts: write them in ./data/test_prompts.txt

# To test your own images: remove their backgrounds with Background-Remover (https://www.remove.bg/)
# and place all the images in /path/to/your_dir
# Then run
python -m utils.recenter_i2v /path/to/your_dir
# The recentered images will be saved in ./data/images
# Add the test image paths to ./data/test_images.txt
# Then run
CUDA_VISIBLE_DEVICES=0 python inference.py --cfg ./configs/i2vgen_xl_infer.yaml
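Both input lists used above are plain text files. The following is a minimal sketch for filling them from the shell. It assumes (the README does not say) that `inference.py` reads one prompt or one image path per line and accepts paths relative to the repository root; `write_test_prompts` and `write_test_images` are hypothetical helpers, not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for filling VideoMV's inference input lists.
# ASSUMPTION: inference.py reads one prompt / one image path per line.
# Meant to be sourced into a shell, so there is no global `set -e`.

# Overwrite $1 with the remaining arguments, one prompt per line.
write_test_prompts() {
  local out_file="$1"
  shift
  mkdir -p "$(dirname "$out_file")" || return 1
  printf '%s\n' "$@" > "$out_file"
}

# Overwrite $2 with the sorted .png/.jpg/.jpeg files found directly in $1.
write_test_images() {
  local img_dir="$1" out_file="$2"
  if [ ! -d "$img_dir" ]; then
    echo "image directory not found: $img_dir" >&2
    return 1
  fi
  mkdir -p "$(dirname "$out_file")" || return 1
  find "$img_dir" -maxdepth 1 -type f \
    \( -iname '*.png' -o -iname '*.jpg' -o -iname '*.jpeg' \) | sort > "$out_file"
}
```

After sourcing the script, `write_test_prompts ./data/test_prompts.txt "a DSLR photo of a corgi" "a wooden chair"` writes two prompts, and, once `utils.recenter_i2v` has run, `write_test_images ./data/images ./data/test_images.txt` lists the recentered images. Both helpers overwrite rather than append, so stale entries from an earlier run do not leak into the next one.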

Training

# Download our dataset (G-Objaverse) following the instructions at
# https://github.com/modelscope/richdreamer/tree/main/dataset/gobjaverse
# Set vid_dataset.data_dir_list to your download data_root
# in ./configs/t2v_train.yaml and ./configs/i2vgen_xl_train.yaml

# Text-to-mv finetuning
CUDA_VISIBLE_DEVICES=0 python train_net.py --cfg ./configs/t2v_train.yaml
# Text-to-mv finetuning using both LAION and G-Objaverse.
# (Note: we used 24 A100 GPUs to train on both datasets. Do not attempt this without comparable compute.)
CUDA_VISIBLE_DEVICES=0 python train_net.py --cfg ./configs/t2v_train_laion.yaml

# Text-to-mv Feed-forward reconstruction finetuning.
# Set UNet.use_lgm_refine to True in ./configs/t2v_train.yaml, then run
CUDA_VISIBLE_DEVICES=0 python train_net.py --cfg ./configs/t2v_train.yaml


# Image-to-mv finetuning
CUDA_VISIBLE_DEVICES=0 python train_net.py --cfg ./configs/i2vgen_xl_train.yaml
# Image-to-mv Feed-forward reconstruction finetuning.
# Set UNet.use_lgm_refine to True in ./configs/i2vgen_xl_train.yaml, then run
CUDA_VISIBLE_DEVICES=0 python train_net.py --cfg ./configs/i2vgen_xl_train.yaml
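The feed-forward reconstruction steps above (and sampling with the LAION + G-Objaverse checkpoint) each require editing a single YAML key by hand. A minimal sketch of doing this with GNU sed, which is what Ubuntu 20.04 ships, assuming each key sits on its own `key: value` line. `set_yaml_value` is a hypothetical helper; it matches only the last component of a dotted key (e.g. `use_lgm_refine` for `UNet.use_lgm_refine`), so review the diff if the same key name appears under several sections.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set `key: value` in a YAML config in place.
# ASSUMPTIONS: GNU sed (Ubuntu 20.04), the key sits on its own
# `key: value` line, and value contains no '|', '&' or backslash
# (they are special in the sed replacement below).
# Every line whose key matches is rewritten, in any section, and any
# trailing comment on that line is dropped.

set_yaml_value() {
  local file="$1" key="$2" value="$3"
  # Fail loudly if the key is absent instead of silently doing nothing.
  if ! grep -Eq "^[[:space:]]*${key}:" "$file"; then
    echo "key '${key}' not found in ${file}" >&2
    return 1
  fi
  # Keep the original indentation and key; replace only the value.
  sed -i -E "s|^([[:space:]]*${key}:).*|\1 ${value}|" "$file"
}
```

For example, `set_yaml_value ./configs/t2v_train.yaml use_lgm_refine True` before feed-forward reconstruction finetuning, or `set_yaml_value ./configs/t2v_infer.yaml test_model /path/to/non_ema_00365000.pth` before sampling, assuming the inference key is literally named `test_model`. Matching on the indented key keeps the nesting intact without needing a YAML parser in the environment.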

Tips

You will observe a sudden convergence in Text-to-MV finetuning (~5 min).
Image-to-MV finetuning does not converge suddenly; initial convergence usually takes about half a day.
For better results, remove the background of test images with Background-Remover instead of rembg. Artifacts in the segmentation mask degrade the quality of the generated multi-view images.

Future Works

Dense View Large Reconstruction Model.
More general, higher-quality Text-to-MV using a better video diffusion model (e.g., HiGen) and novel finetuning techniques.

Acknowledgement

This work is built on many amazing research works and open-source projects:

We thank their authors for their excellent work and great contributions to 3D generation.

We would like to express our special gratitude to Jiaxiang Tang and Yuan Liu for valuable discussions on LGM and SyncDreamer.

Citation

@misc{zuo2024videomv,
      title={VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model}, 
      author={Qi Zuo and Xiaodong Gu and Lingteng Qiu and Yuan Dong and Zhengyi Zhao and Weihao Yuan and Rui Peng and Siyu Zhu and Zilong Dong and Liefeng Bo and Qixing Huang},
      year={2024},
      eprint={2403.12010},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

