A curated list of awesome video prediction papers with brief summaries.
...
- ★ A Review on Deep Learning Techniques for Video Prediction | TPAMI 2020
- Deep Learning for Vision-based Prediction: A Survey | Arxiv 2020
-
Baseline Video Language Modeling (BVLM) | Video (language) modeling: a baseline for generative models of natural videos | Arxiv 2014 FAIR NYU
- first video prediction | patch-level language model, CNN+RNN | no inductive bias, raw pixels
-
LSTM Encoder-Decoder (LSTM-ED) | Unsupervised Learning of Video Representations using LSTMs | ICML 2015
- unsupervised representation learning | LSTM encoder maps frames into a representation and LSTM decoder reconstructs them, FC-LSTM | no inductive bias, raw pixels
-
★ Convolutional LSTM (ConvLSTM) | Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting | NeurIPS 2015 HKUST
- model spatial correlations well | LSTM-ED with FC-LSTM replaced by convLSTM, convLSTM | no inductive bias, raw pixels
-
Predictive Generative Network (PGN) | Unsupervised learning of visual structure using predictive generative networks | Arxiv 2015 Harvard
- unsupervised representation learning | CNN-LSTM-deCNN with MSE + adversarial loss, CNN+LSTM+GAN | no inductive bias, raw pixels
-
Predictive Coding Network (PredNet) | Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning | Arxiv 2016 Harvard
- unsupervised representation learning | stacked multi-level variant of encoding a representation and decoding a reconstruction, convLSTM | no inductive bias, raw pixels
-
Predictive Recurrent Neural Network (PredRNN) | PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning | NeurIPS 2017 TPAMI 2022 Tsinghua (Yunbo Wang)
- address several design problems of convLSTM for spatiotemporal predictive learning | spatiotemporal memory flow + spatiotemporal LSTM + reverse scheduled sampling curriculum learning, convLSTM | no inductive bias, raw pixels
-
Improved Predictive Recurrent Neural Network (PredRNN++) | PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning | ICML 2018 Tsinghua (Yunbo Wang)
- go deeper in time while alleviating vanishing gradients in deep-in-time RNNs | causal LSTM + gradient highway unit, convLSTM | no inductive bias, raw pixels
-
★ Convolutional Dynamic Neural Advection (CDNA) | Unsupervised Learning for Physical Interaction through Video Prediction | NeurIPS 2016 UCBerkeley (Chelsea Finn, Ian Goodfellow, Sergey Levine)
- first long-range prediction on real-world video | explicitly model pixel motion, then merge with the previous frame, convLSTM | kernel-based transformation
-
Object-centric Transformation (ObjectTransformation) | Learning Object-Centric Transformation for Video Prediction | ACM-MM 2017 PKU
- handle the distinct motions of different objects | attend to object patches and predict transformation kernels, CNN+RNN | kernel-based transformation
-
Spatially-Displaced Convolution Network (SDC-Net) | SDC-Net: Video prediction using spatially-displaced convolution | ECCV 2018 Nvidia
- high-resolution video prediction | combine vector-based and kernel-based transformation, 3D CNN | vector-based transformation + kernel-based transformation
-
★ Motion-Content Network (MCnet) | Decomposing Motion and Content for Natural Video Sequence Prediction | ICLR 2017
- first to decompose motion and content | motion encoder + content encoder + combination decoder, CNN+convLSTM | motion and content separation
-
Decompositional Disentangled Predictive Auto-Encoder (DDPAE) | Learning to Decompose and Disentangle Representations for Video Prediction | NeurIPS 2018 Stanford (Li Fei-Fei)
- deal with high dimensionality | decompose the whole frame into components and disentangle each component into time-invariant content and low-dimensional pose, CNN+RNN+VAE | vector-based transformation + motion and content separation
-
★ Spatial-Temporal Multi-Frequency Analysis Network (STMFANet) | Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity and Temporal-Consistency Video Prediction | CVPR 2020 CAS
- deal with image distortion and temporal inconsistency | merge multi-level spatial and temporal wavelet analysis into prediction, CNN+LSTM+wavelet | add traditional CV, raw pixels
-
★ Stochastic Variational Video Prediction (SV2P) | Stochastic Variational Video Prediction | ICLR 2018 UIUC (Chelsea Finn, Sergey Levine)
- first to introduce stochasticity | VAE noise as a stochastic condition for CDNA, 3D CNN+convLSTM+VAE | kernel-based transformation + VAE stochastic
-
Stochastic Video Generation with a Learned Prior (SVG-LP) | Stochastic Video Generation with a Learned Prior | ICML 2018 NYU
- "learned prior as uncertainty predictive model" | learned prior for VAE, convLSTM+VAE | VAE stochastic
-
Stochastic Adversarial Video Prediction (SAVP) | Stochastic Adversarial Video Prediction | ICLR 2019 UCBerkeley (Chelsea Finn, Sergey Levine)
- bring together stochastic and realistic prediction | VAE-GAN for SV2P, 3D CNN+convLSTM+VAE+GAN | kernel-based transformation + VAE stochastic
-
Hierarchical VRNN (Hierarchical-VRNN) | Improved Conditional VRNNs for Video Prediction | ICCV 2019
- "still blurry and due to underfitting" | hierarchical levels of latents to increase expressiveness, CNN+RNN+VAE | VAE hierarchical stochastic
-
Greedy Hierarchical Variational Auto-Encoders (GHVAE) | Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction | CVPR 2021 Stanford (Li Fei-Fei, Chelsea Finn)
- deal with memory constraints and optimization instability problems for hierarchical VAE | greedy and modular optimization, CNN+RNN+VAE | VAE hierarchical stochastic
-
Beyond Mean Square Error (BeyondMSE) | Deep multi-scale video prediction beyond mean square error | ICLR 2016 FAIR NYU (Yann LeCun)
- deal with blur | adversarial loss + gradient difference loss, CNN+GAN | no inductive bias, raw pixels
-
Eidetic 3D LSTM (E3D-LSTM) | Eidetic 3D LSTM: A Model for Video Prediction and Beyond | ICLR 2019 Tsinghua (Yunbo Wang, Li Fei-Fei)
- learn representations that work for both short-term and long-term prediction | 3D CNN for local dynamics and recurrent modeling for temporal dependencies, 3D CNN+LSTM | no inductive bias, raw pixels
-
★ Simple Video Prediction (SimVP) | SimVP: Simpler yet Better Video Prediction | CVPR 2022
- investigate simple techniques for CNN in video prediction | pure 2D CNN and only MSE loss, CNN | no inductive bias, raw pixels
-
Video Diffusion Models (VDM) | Video Diffusion Models | NeurIPS 2022 Google (Jonathan Ho)
- first video diffusion model for primarily unconditional video generation | diffusion model with 3D U-Net, 3D CNN+diffusion | no inductive bias, raw pixels
-
★ Masked Conditional Video Diffusion (MCVD) | MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS 2022
- general-purpose model for prediction/generation/interpolation | U-Net conditioned on masked past or future frames, CNN+diffusion | no inductive bias, raw pixels
-
Residual Video Diffusion (RVD) | Diffusion Probabilistic Modeling for Video Generation | Arxiv 2022
- "residual errors are easier to model than future observations" | MAF for average + diffusion for residual, CNN+RNN+diffusion | no inductive bias, raw pixels
-
Flexible Diffusion Model (FDM) | Flexible Diffusion Modeling of Long Videos | Arxiv 2022
- deal with coherent long-duration prediction | training with random frame sampling, 3D CNN+diffusion | no inductive bias, raw pixels
-
Video Transformer (VideoTransformer) | Scaling Autoregressive Video Models | ICLR 2020 Google
- first Transformer in video prediction | block-local self-attention and spatiotemporal subscaling for reducing memory, Transformer | no inductive bias, raw pixels
-
★ Latent Video Transformer (LVT) | Latent Video Transformer | Arxiv 2020
- reduce computational requirements | VQ-VAE encodes pixels into discrete latent space and VideoTransformer operates in the discrete latent space, Transformer | discrete latent space
-
Convolutional Transformer (ConvTransformer) | ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis | Arxiv 2021
- combine CNN and Transformer in video prediction | multi-head convolutional self-attention, Transformer+CNN | no inductive bias, raw pixels
-
Video Generative Pre-Training (VideoGPT) | VideoGPT: Video Generation using VQ-VAE and Transformers | Arxiv 2021 UCBerkeley (Pieter Abbeel)
- combine VQ-VAE and GPT-style Transformer in video prediction | VQ-VAE encodes pixels into discrete latent space and a GPT-like Transformer operates in the discrete latent space, Transformer | discrete latent space
-
Video Prediction Transformer (VPTR) | Video Prediction by Efficient Transformers | ICPR 2022 IVC 2022
- reduce computational requirements and run extensive experiments on Transformer autoregressive formats | Pix2Pix autoencoder and VidHRFormer attention, Transformer | latent space
-
Masked Video Transformer (MaskViT) | MaskViT: Masked Visual Pre-Training for Video Prediction | ICLR 2023 Stanford (Jiajun Wu, Li Fei-Fei)
- masked visual modeling pre-training for video | VQ-GAN frame quantization and masked visual modeling training, Transformer | discrete latent space
-
MAsked Generative VIdeo Transformer (MAGVIT) | MAGVIT: Masked Generative Video Transformer | CVPR 2023 CMU Google
- single model for multiple video synthesis tasks | 3D-VQ video quantization and multi-task masked token modeling training, Transformer | discrete latent space
-
MOtion Scene and Object (MOSO) | MOSO: Decomposing MOtion, Scene and Object for Video Prediction | CVPR 2023 CAS
- decompose motion, scene and object | separate VQ-VAE quantization and Transformer prediction, Transformer | discrete latent space + motion and content separation
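
Several entries above are tagged "kernel-based transformation" (for example CDNA, SV2P and SAVP). As a rough illustration of that tag only, and not the implementation of any listed paper, the sketch below predicts a frame by applying a single normalized kernel to the previous frame. The names `apply_kernel` and `shift_right`, the zero padding at the borders, and the hand-written kernel are all assumptions made for this example.

```python
from typing import List

Frame = List[List[float]]


def apply_kernel(prev: Frame, kernel: Frame) -> Frame:
    """Predict a frame by applying one normalized kernel to the previous frame.

    Assumptions of this sketch: the kernel is square with odd size, its
    weights are non-negative with a positive sum, and pixels outside the
    frame count as zero.
    """
    h, w = len(prev), len(prev[0])
    k = len(kernel)
    r = k // 2  # kernel radius
    total = sum(sum(row) for row in kernel)  # used to normalize the weights
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    sy, sx = y + dy - r, x + dx - r
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += kernel[dy][dx] * prev[sy][sx]
            out[y][x] = acc / total
    return out


# A one-hot kernel with its weight left of centre copies every pixel from
# its left neighbour, so image content moves one pixel to the right.
shift_right = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]

prev_frame = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

pred_frame = apply_kernel(prev_frame, shift_right)
```

In the kernel-based entries above, the kernels are predicted by a network rather than written by hand; the sketch only shows how a kernel, once given, moves pixels from the previous frame into the predicted one.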