MAKE-A-VIDEO

TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA

Main components

Make-A-Video consists of three main components:

A base T2I (text-to-image) model trained on text-image pairs
Spatiotemporal convolution and attention layers that extend the networks’ building blocks to the temporal dimension
Spatiotemporal networks that combine these spatiotemporal layers with a frame interpolation network, another crucial element of T2V (text-to-video) generation, which enables high-frame-rate output

Spatiotemporal layers

1. Pseudo-3D convolutional layers

A pseudo-3D convolutional layer stacks a 1D convolution after each 2D convolutional (conv) layer. This facilitates information sharing between the spatial and temporal axes at a lower computational cost than full 3D conv layers.
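The factorization can be illustrated with a minimal NumPy sketch. This is not the repository's implementation. The function names (`conv2d_per_frame`, `conv1d_over_time`, `pseudo3d_conv`, `weight_counts`), the channels-last `(T, H, W, C)` layout, the stride of 1, and the zero "same" padding are all assumptions made for illustration. Biases are ignored.

```python
import numpy as np


def conv2d_per_frame(x, w):
    """Spatial step: 'same'-padded 2D conv applied to each frame independently.

    x: (T, H, W, C_in) video, w: (k, k, C_in, C_out) kernel with odd k.
    """
    k = w.shape[0]
    p = k // 2
    T, H, W, _ = x.shape
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (0, 0)))
    out = np.zeros((T, H, W, w.shape[-1]), dtype=np.result_type(x, w))
    for i in range(k):
        for j in range(k):
            # Shifted spatial window times the (C_in, C_out) kernel slice.
            out += xp[:, i:i + H, j:j + W, :] @ w[i, j]
    return out


def conv1d_over_time(x, w):
    """Temporal step: 'same'-padded 1D conv along the frame axis.

    x: (T, H, W, C) features, w: (kt, C, C_out) kernel with odd kt.
    """
    kt = w.shape[0]
    p = kt // 2
    T = x.shape[0]
    xp = np.pad(x, ((p, p), (0, 0), (0, 0), (0, 0)))
    out = np.zeros(x.shape[:3] + (w.shape[-1],), dtype=np.result_type(x, w))
    for t in range(kt):
        # Shifted temporal window times the (C, C_out) kernel slice.
        out += xp[t:t + T] @ w[t]
    return out


def pseudo3d_conv(x, w2d, w1d):
    """Pseudo-3D conv: a 2D spatial conv followed by a 1D temporal conv."""
    return conv1d_over_time(conv2d_per_frame(x, w2d), w1d)


def weight_counts(k, c_in, c_out):
    """Weights (no bias) of a full k*k*k 3D conv vs. the pseudo-3D pair."""
    full_3d = k * k * k * c_in * c_out
    pseudo_3d = k * k * c_in * c_out + k * c_out * c_out
    return full_3d, pseudo_3d


rng = np.random.default_rng(0)
video = rng.standard_normal((4, 8, 8, 3))      # 4 frames of 8x8 RGB
w2d = rng.standard_normal((3, 3, 3, 5))        # spatial 3x3 kernel, 3 -> 5 channels
w1d = rng.standard_normal((3, 5, 5))           # temporal kernel of size 3
features = pseudo3d_conv(video, w2d, w1d)      # shape (4, 8, 8, 5)
```

Under these assumptions, `weight_counts(3, 64, 64)` gives 110,592 weights for the full 3D kernel against 49,152 for the factorized pair, which is where the saving comes from. A temporal kernel that is the identity matrix at its centre tap and zero elsewhere leaves the spatial output unchanged, which makes a convenient sanity check for the temporal step.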

2. Pseudo-3D attention layers

3. Frame interpolation network
