It consists of three main components:
- A base T2I model trained on text-image pairs
- Spatiotemporal convolution and attention layers that extend the network's building blocks to the temporal dimension
- Spatiotemporal networks that combine these spatiotemporal layers with another element crucial for T2V generation: a frame interpolation network for high-frame-rate generation
It stacks a 1D convolution after each 2D convolutional (conv) layer, which lets information flow between the spatial and temporal axes at a lower computational cost than full 3D conv layers.
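
To get a rough feel for the saving, the sketch below compares the weight count of a full 3D conv layer with that of a 2D conv followed by a 1D temporal conv. The channel width (320) and the kernel size (3) are illustrative assumptions, not values taken from the model, and the helper names are hypothetical. Biases are ignored. With stride 1 and "same" padding, the multiply-accumulate count per output position scales with these weight counts, so the ratio is a reasonable proxy for compute as well.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in one full 3D conv layer (biases ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def factorized_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in a 2D spatial conv followed by a 1D temporal conv.

    Assumption: the temporal conv maps c_out -> c_out channels.
    """
    spatial = c_in * c_out * k_h * k_w   # 2D conv applied to each frame
    temporal = c_out * c_out * k_t       # 1D conv applied along the frame axis
    return spatial + temporal


# Hypothetical layer size, chosen only for illustration.
channels, k = 320, 3

full = conv3d_params(channels, channels, k, k, k)
factored = factorized_params(channels, channels, k, k, k)

print(full)                        # 2764800
print(factored)                    # 1228800
print(f"{full / factored:.2f}x")   # 2.25x
```

For this layer size, the factorized form needs a little under half the weights of the 3D layer, and the gap widens as the kernel size grows.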