MAKE-A-VIDEO

TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA

Main components

Make-A-Video consists of three main components:

A base text-to-image (T2I) model trained on text-image pairs
Spatiotemporal convolution and attention layers that extend the network's building blocks to the temporal dimension
Spatiotemporal networks that combine these spatiotemporal layers with another element crucial for text-to-video (T2V) generation: a frame interpolation network for high-frame-rate generation

Spatiotemporal layers

1. Pseudo-3D convolutional layers

A 1D convolution is stacked after each 2D convolutional (conv) layer. This lets information be shared between the spatial and temporal axes at a lower computational cost than with full 3D conv layers.
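To make the cost saving concrete, the sketch below counts the weights in a full 3D convolution and in the factored 2D + 1D version. It is an illustration under stated assumptions, not code from this repository: the function names are hypothetical, biases are ignored, the kernel has the same size k along every axis, and the 1D temporal conv is assumed to map the 2D conv's output channels to the same number of channels. The layer size in the example is also made up.

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a full 3D convolution with a k x k x k kernel (bias ignored)."""
    return c_in * c_out * k ** 3


def pseudo3d_params(c_in, c_out, k):
    """Weights in a k x k 2D conv followed by a length-k 1D conv over time.

    Assumption: the 1D conv keeps the channel count at c_out.
    """
    spatial = c_in * c_out * k * k  # 2D conv, applied to each frame
    temporal = c_out * c_out * k    # 1D conv, applied along the frame axis
    return spatial + temporal


if __name__ == "__main__":
    c_in, c_out, k = 320, 320, 3  # hypothetical layer size
    full = conv3d_params(c_in, c_out, k)
    factored = pseudo3d_params(c_in, c_out, k)
    print(f"3D conv weights:        {full}")
    print(f"pseudo-3D conv weights: {factored}")
    print(f"ratio:                  {factored / full:.3f}")
```

With equal input and output channels and k = 3, the factored layer needs 12/27 of the weights of the full 3D layer. For the same output size, the number of multiply-accumulate operations shrinks by the same ratio.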

2. Pseudo-3D attention layers

3. Frame interpolation network
