PyTorch implementation of FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing (ICLR 2024)

yrcong/flatten

FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

PyTorch implementation of "FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing".

🎊🎊🎊 We are proud to announce that our paper has been accepted at ICLR 2024! If you are interested in FLATTEN, please give us a star😬

Thanks to @logtd for integrating FLATTEN into ComfyUI and the great sampled videos! Here is the Link!

📖Abstract

🚩Text-to-Video 🚩Training-free 🚩Plug-and-Play

Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. In this work, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency.
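The core idea, patches on the same flow path attending to each other, can be illustrated with a toy sketch. This is not the repository's implementation: it assumes the flow paths are already given as lists of `(frame, patch)` indices (rather than computed from optical flow), uses the raw features as queries, keys, and values without learned projections, and runs in pure Python. The name `flow_guided_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flow_guided_attention(features, trajectories):
    """Toy flow-guided attention (illustrative assumption, not the paper's code).

    features[f][p] is the feature vector of patch p in frame f.
    trajectories is a list of flow paths, each a list of (frame, patch) pairs.
    Patches on the same path attend only to each other; all other patches
    are returned unchanged.
    """
    out = [[list(vec) for vec in frame] for frame in features]
    for path in trajectories:
        tokens = [features[f][p] for f, p in path]
        dim = len(tokens[0])
        for (f, p), query in zip(path, tokens):
            # Scaled dot-product scores against every patch on this path.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in tokens
            ]
            weights = softmax(scores)
            # Weighted average of the features along the path.
            out[f][p] = [
                sum(w * tok[d] for w, tok in zip(weights, tokens))
                for d in range(dim)
            ]
    return out
```

Because each output is a weighted average over one path only, patches that follow the same motion across frames are pulled toward a shared appearance, while unrelated patches never mix.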

Requirements

First, download the Stable Diffusion 2.1 (base) checkpoint.

Install the following packages:

  • PyTorch == 2.1
  • accelerate == 0.24.1
  • diffusers == 0.19.0
  • transformers == 4.35.0
  • xformers == 0.0.23
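To confirm that an environment matches the pinned versions above, a small check like the following may help. It is a sketch using only the standard library; the helper names are hypothetical, and `torch` is assumed to be the installed distribution name for PyTorch.

```python
from importlib import metadata

# Pins copied from the list above; "torch" is the PyPI name for PyTorch.
PINNED = {
    "torch": "2.1",
    "accelerate": "0.24.1",
    "diffusers": "0.19.0",
    "transformers": "4.35.0",
    "xformers": "0.0.23",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True if `installed` equals the pin or extends it (2.1.0+cu121 matches 2.1)."""
    if installed is None:
        return False
    base = installed.split("+")[0]  # drop local build tags such as +cu121
    return base == pinned or base.startswith(pinned + ".")


def check_environment(pins=PINNED, lookup=installed_version):
    """Map each package name to (installed_version, ok)."""
    report = {}
    for name, pin in pins.items():
        found = lookup(name)
        report[name] = (found, matches_pin(found, pin))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:<14} pinned={PINNED[name]:<8} installed={found} {status}")
```

Running the file prints one line per package, so a mismatch is visible before starting a long inference run.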

Usage

For text-to-video editing, a source video and a textual prompt must be given. You can run the provided script to reproduce the teaser video:

sh cat.sh

or with the command:

python inference.py \
--prompt "A Tiger, high quality" \
--neg_prompt "a cat with big eyes, deformed" \
--guidance_scale 20 \
--video_path "data/puff.mp4" \
--output_path "outputs/" \
--video_length 32 \
--width 512 \
--height 512 \
--old_qk 0 \
--frame_rate 2
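When running several edits in a row, it can be convenient to build the same command from Python. The sketch below only assembles the argument list from the flags shown above; `build_inference_command` is a hypothetical helper, not part of the repository, and it assumes `inference.py` is in the current working directory.

```python
import shlex
import sys


def build_inference_command(options, script="inference.py"):
    """Turn a dict of options into an argv list: {"width": 512} -> ["--width", "512"]."""
    argv = [sys.executable, script]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


# Values copied from the command above.
options = {
    "prompt": "A Tiger, high quality",
    "neg_prompt": "a cat with big eyes, deformed",
    "guidance_scale": 20,
    "video_path": "data/puff.mp4",
    "output_path": "outputs/",
    "video_length": 32,
    "width": 512,
    "height": 512,
    "old_qk": 0,
    "frame_rate": 2,
}

command = build_inference_command(options)
# Show the shell-quoted arguments for inspection.
print(shlex.join(command[1:]))
# To launch from the repository root: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or commas.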

Editing tricks

  • You can use a negative prompt (NP) when there is a big gap between the edit target and the source (1st row).
  • You can increase the scale of classifier-free guidance to enhance the semantic alignment (2nd row).

Example videos, 1st row: Source video | NP: " " | NP: "A cat with big eyes, deformed."
Example videos, 2nd row: Classifier-free guidance: 10 | Classifier-free guidance: 17.5 | Classifier-free guidance: 25
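Both tricks act on the same quantity. In standard classifier-free guidance, as commonly used in Stable Diffusion pipelines, the noise predicted for the negative prompt takes the place of the unconditional prediction, and the guidance scale controls how far the result is pushed toward the edit prompt. The sketch below shows that formula on plain lists; it assumes this repository follows the standard formulation, and `guided_noise` is a hypothetical name.

```python
def guided_noise(eps_negative, eps_prompt, guidance_scale):
    """Standard classifier-free guidance on flat lists of floats.

    eps_negative: noise predicted for the negative (or empty) prompt
    eps_prompt:   noise predicted for the edit prompt
    Returns eps_negative + guidance_scale * (eps_prompt - eps_negative).
    """
    return [
        n + guidance_scale * (p - n)
        for n, p in zip(eps_negative, eps_prompt)
    ]


# A scale of 1 reproduces the prompt prediction; larger scales extrapolate
# away from the negative prompt and toward the edit prompt.
print(guided_noise([0.0, 1.0], [1.0, 3.0], 1))
print(guided_noise([0.0, 1.0], [1.0, 3.0], 20))
```

Under this formulation, a negative prompt that describes the source content moves the starting point of the extrapolation, which is one way to read why it helps when the edit target is far from the source.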

BibTex

@article{cong2023flatten,
  title={FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing},
  author={Cong, Yuren and Xu, Mengmeng and Simon, Christian and Chen, Shoufa and Ren, Jiawei and Xie, Yanping and Perez-Rua, Juan-Manuel and Rosenhahn, Bodo and Xiang, Tao and He, Sen},
  journal={arXiv preprint arXiv:2310.05922},
  year={2023}
}
