A curated list of awesome work (currently 257 papers) on video generation and video representation learning, and related topics (such as RL). Feel free to contribute or email me if I've missed your paper off the list : ]
They are ordered by year (new to old). I provide a link to the paper as well as to the GitHub repo where available.
Disentangling multiple features in video sequences using Gaussian processes in variational autoencoders. Bhagat, Uppal, Yin, Lim https://arxiv.org/abs/2001.02408
Generative adversarial networks for spatio-temporal data: a survey. Gao, Xue, Shao, Zhao, Qin, Prabowo, Rahaman, Salim https://arxiv.org/pdf/2008.08903.pdf
Deep state-space generative model for correlated time-to-event predictions. Xue, Zhou, Du, Dai, Xu, Zhang, Cui https://dl.acm.org/doi/abs/10.1145/3394486.3403206
Toward discriminating and synthesizing motion traces using deep probabilistic generative models. Zhou, Liu, Zhang, Trajcevski https://ieeexplore.ieee.org/abstract/document/9165954/
Sample-efficient robot motion learning using Gaussian process latent variable models. Delgado-Guerrero, Colome, Torras http://www.iri.upc.edu/files/scidoc/2320-Sample-efficient-robot-motion-learning-using-Gaussian-process-latent-variable-models.pdf
Sequence prediction using spectral RNNs. Wolter, Gall, Yao https://www.researchgate.net/profile/Moritz_Wolter2/publication/329705630_Sequence_Prediction_using_Spectral_RNNs/links/5f36b9d892851cd302f44a57/Sequence-Prediction-using-Spectral-RNNs.pdf
Self-supervised video representation learning by pace prediction. Wang, Jiao, Liu https://arxiv.org/pdf/2008.05861.pdf
RhyRNN: Rhythmic RNN for recognizing events in long and complex videos. Yu, Li, Li http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123550137.pdf
4D forecasting: sequential forecasting of 100,000 points. Weng, Wang, Levine, Kitani, Rhinehart http://www.xinshuoweng.com/papers/SPF2_eccvw/camera_ready.pdf
Multimodal deep generative models for trajectory prediction: a conditional variational autoencoder approach. Ivanovic, Leung, Schmerling, Pavone https://arxiv.org/pdf/2008.03880.pdf
Memory-augmented dense predictive coding for video representation learning. Han, Xie, Zisserman https://arxiv.org/pdf/2008.01065.pdf
SeCo: exploring sequence supervision for unsupervised representation learning. Yao, Zhang, Qiu, Pan, Mei https://arxiv.org/pdf/2008.00975.pdf
PDE-driven spatiotemporal disentanglement. Dona, Franceschi, Lamprier, Gallinari https://arxiv.org/pdf/2008.01352.pdf
Dynamics generalization via information bottleneck in deep reinforcement learning. Lu, Lee, Abbeel, Tiomkin https://arxiv.org/pdf/2008.00614.pdf
Latent space roadmap for visual action planning. Lippi, Poklukar, Welle, Varava, Yin, Marino, Kragic https://rss2020vlrrm.github.io/papers/3_CameraReadySubmission_RSS_workshop_latent_space_roadmap.pdf
Weakly-supervised learning of human dynamics. Zell, Rosenhahn, Wandt https://arxiv.org/pdf/2007.08969.pdf
Deep variational Luenberger-type observer for stochastic video prediction. Wang, Zhou, Yan, Yao, Liu, Ma, Lu https://arxiv.org/pdf/2003.00835.pdf
NewtonianVAE: proportional control and goal identification from pixels via physical latent spaces. Jaques, Burke, Hospedales https://arxiv.org/pdf/2006.01959.pdf
Constrained variational autoencoder for improving EEG based speech recognition systems. Krishna, Tran, Carnahan, Tewfik https://arxiv.org/pdf/2006.02902.pdf
Latent video transformer. Rakhimov, Volkhonskiy https://arxiv.org/pdf/2006.10704.pdf
Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. Ribeiro, Tiels, Aguirre, Schon http://proceedings.mlr.press/v108/ribeiro20a/ribeiro20a.pdf
Towards recurrent autoregressive flow models. Mern, Morales, Kochenderfer https://arxiv.org/pdf/2006.10096.pdf
Learning to combine top-down and bottom-up signals in recurrent neural networks with attention over modules. Mittal, Lamb, Goyal, Voleti et al. https://www.cs.colorado.edu/~mozer/Research/Selected%20Publications/reprints/Mittaletal2020.pdf
Unmasking the inductive biases of unsupervised object representations for video sequences. Weis, Chitta, Sharma et al. https://arxiv.org/pdf/2006.07034.pdf
G3AN: disentangling appearance and motion for video generation. Wang, Bilinski, Bremond, Dantcheva http://openaccess.thecvf.com/content_CVPR_2020/papers/Wang_G3AN_Disentangling_Appearance_and_Motion_for_Video_Generation_CVPR_2020_paper.pdf
Learning dynamic relationships for 3D human motion prediction. Cui, Sun, Yang http://openaccess.thecvf.com/content_CVPR_2020/papers/Cui_Learning_Dynamic_Relationships_for_3D_Human_Motion_Prediction_CVPR_2020_paper.pdf
Joint training of variational auto-encoder and latent energy-based model. Han, Nijkamp, Zhou, Pang, Zhu, Wu http://openaccess.thecvf.com/content_CVPR_2020/papers/Han_Joint_Training_of_Variational_Auto-Encoder_and_Latent_Energy-Based_Model_CVPR_2020_paper.pdf
Learning invariant representations for reinforcement learning without reconstruction. Zhang, McAllister, Calandra, Gal, Levine https://arxiv.org/pdf/2006.10742.pdf
Variational inference for sequential data with future likelihood estimates. Kim, Jang, Yang, Kim http://ailab.kaist.ac.kr/papers/pdfs/KJYK2020.pdf
Video prediction via example guidance. Xu, Xu, Ni, Yang, Darrell https://arxiv.org/pdf/2007.01738.pdf
Hierarchical patch VAE-GAN: generating diverse videos from a single sample. Gur, Benaim, Wolf https://arxiv.org/pdf/2006.12226.pdf
Dynamic facial expression generation on Hilbert Hypersphere with conditional Wasserstein Generative adversarial nets. Otberdout, Daoudi, Kacem, Ballihi, Berretti https://arxiv.org/abs/1907.10087
HAF-SVG: hierarchical stochastic video generation with aligned features. Lin, Yuan, Li https://www.ijcai.org/Proceedings/2020/0138.pdf
Improving generative imagination in object-centric world models. Lin, Wu, Peri, Fu, Jiang, Ahn https://proceedings.icml.cc/static/paper_files/icml/2020/4995-Paper.pdf
Deep generative video compression with temporal autoregressive transforms. Yang, Yang, Marino, Yang, Mandt https://joelouismarino.github.io/files/papers/2020/seq_flows_compression/seq_flows_compression.pdf
Spatially structured recurrent modules. Rahaman, Goyal, Gondal, Wuthrich, Bauer, Sharma, Bengio, Scholkopf https://arxiv.org/pdf/2007.06533.pdf
Unsupervised object-centric video generation and decomposition in 3D. Henderson, Lampert https://arxiv.org/pdf/2007.06705.pdf
Planning from images with deep latent gaussian process dynamics. Bosch, Achterhold, Leal-Taixe, Stuckler https://arxiv.org/pdf/2005.03770.pdf
Planning to explore via self-supervised world models. Sekar, Rybkin, Daniilidis, Abbeel, Hafner, Pathak https://arxiv.org/pdf/2005.05960.pdf
Mutual information maximization for robust plannable representations. Ding, Clavera, Abbeel https://arxiv.org/pdf/2005.08114.pdf
Supervised contrastive learning. Khosla, Teterwak, Wang, Sarna https://arxiv.org/pdf/2004.11362.pdf
Blind source extraction based on multi-channel variational autoencoder and x-vector-based speaker selection trained with data augmentation. Gu, Liao, Lu https://arxiv.org/pdf/2005.07976.pdf
BiERU: bidirectional emotional recurrent unit for conversational sentiment analysis. Li, Shao, Ji, Cambria https://arxiv.org/pdf/2006.00492.pdf
S3VAE: self-supervised sequential VAE for representation disentanglement and data generation. Zhu, Min, Kadav, Graf https://arxiv.org/pdf/2005.11437.pdf
Probably approximately correct vision-based planning using motion primitives. Veer, Majumdar https://arxiv.org/abs/2002.12852
MoVi: a large multipurpose motion and video dataset. Ghorbani, Mahdaviani, Thaler, Kording, Cook, Blohm, Troje https://arxiv.org/abs/2003.01888
Temporal convolutional attention-based network for sequence modeling. Hao, Wang, Xia, Shen, Zhao https://arxiv.org/abs/2002.12530
Neuroevolution of self-interpretable agents. Tang, Nguyen, Ha https://arxiv.org/abs/2003.08165
Attentional adversarial variational video generation via decomposing motion and content. Talafha, Rekabdar, Ekenna, Mousas https://ieeexplore.ieee.org/document/9031476
Imputer: sequence modelling via imputation and dynamic programming. Chan, Saharia, Hinton, Norouzi, Jaitly https://arxiv.org/abs/2002.08926
Variational conditioning of deep recurrent networks for modeling complex motion dynamics. Buckchash, Raman https://ieeexplore.ieee.org/document/9055015?denied=
Training of deep neural networks for the generation of dynamic movement primitives. Pahic, Ridge, Gams, Morimoto, Ude https://www.sciencedirect.com/science/article/pii/S0893608020301301
PreCNet: next frame video prediction based on predictive coding. Straka, Svoboda, Hoffmann https://arxiv.org/pdf/2004.14878.pdf
Dimensionality reduction of movement primitives in parameter space. Tosatto, Stadtmuller, Peters https://arxiv.org/abs/2003.02634
Disentangling physical dynamics from unknown factors for unsupervised video prediction. Le Guen, Thome https://arxiv.org/abs/2003.01460
A real-robot dataset for assessing transferability of learned dynamics models. Agudelo-Espana, Zadaianchuk, Wenk, Garg, Akpo et al. https://www.is.mpg.de/uploads_file/attachment/attachment/589/ICRA20_1157_FI.pdf
Hierarchical decomposition of nonlinear dynamics and control for system identification and policy distillation. Abdulsamad, Peters https://arxiv.org/pdf/2005.01432.pdf
Occlusion resistant learning of intuitive physics from videos. Riochet, Sivic, Laptev, Dupoux https://arxiv.org/pdf/2005.00069.pdf
Scalable learning in latent state sequence models. Aicher https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/45550/Aicher_washington_0250E_21152.pdf?sequence=1
How useful is self-supervised pretraining for visual tasks? Newell, Deng https://arxiv.org/pdf/2003.14323.pdf
q-VAE for disentangled representation learning and latent dynamical systems. Kobayashi https://arxiv.org/pdf/2003.01852.pdf
Variational recurrent models for solving partially observable control tasks. Han, Doya, Tani https://openreview.net/forum?id=r1lL4a4tDB
Stochastic latent residual video prediction. Franceschi, Delasalles, Chen, Lamprier, Gallinari https://arxiv.org/pdf/2002.09219.pdf https://sites.google.com/view/srvp
Disentangled speech embeddings using cross-modal self-supervision. Nagrani, Chung, Albanie, Zisserman https://arxiv.org/abs/2002.08742
TwoStreamVAM: improving motion modeling in video generation. Sun, Xu, Saenko https://arxiv.org/abs/1812.01037
Variational hyper RNN for sequence modeling. Deng, Cao, Chang, Sigal, Mori, Brubaker https://arxiv.org/abs/2002.10501
Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. Jin, Hu, Tang, Niu, Shi, Han, Li https://arxiv.org/abs/2002.09905
Representing closed transformation paths in encoded network latent space. Connor, Rozell https://arxiv.org/pdf/1912.02644.pdf
Animating arbitrary objects via deep motion transfer. Siarohin, Lathuiliere, Tulyakov, Ricci, Sebe https://arxiv.org/abs/1812.08861
Feedback recurrent autoencoder. Yang, Sautiere, Ryu, Cohen https://arxiv.org/abs/1911.04018
First order motion model for image animation. Siarohin, Lathuiliere, Tulyakov, Ricci, Sebe https://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation
Point-to-point video generation. Wang, Cheng, Lin, Chen, Sun https://arxiv.org/pdf/1904.02912.pdf
Learning deep controllable and structured representations for image synthesis, structured prediction and beyond. Yan https://deepblue.lib.umich.edu/handle/2027.42/153334
Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. Raffin, Hill, Traore, Lesort, Diaz-Rodriguez, Filliat https://arxiv.org/abs/1901.08651
Task-Conditioned variational autoencoders for learning movement primitives. Noseworthy, Paul, Roy, Park, Roy https://groups.csail.mit.edu/rrg/papers/noseworthy_corl_19.pdf
Spatio-temporal alignments: optimal transport through space and time. Janati, Cuturi, Gramfort https://arxiv.org/pdf/1910.03860.pdf
Action Genome: actions as composition of spatio-temporal scene graphs. Ji, Krishna, Fei-Fei, Niebles https://arxiv.org/pdf/1912.06992.pdf
Video-to-video translation for visual speech synthesis. Doukas, Sharmanska, Zafeiriou https://arxiv.org/pdf/1905.12043.pdf
Predictive coding, variational autoencoders, and biological connections. Marino https://openreview.net/pdf?id=SyeumQYUUH
Single Headed Attention RNN: stop thinking with your head. Merity https://arxiv.org/pdf/1911.11423.pdf
Hamiltonian neural networks. Greydanus, Dzamba, Yosinski https://arxiv.org/pdf/1906.01563.pdf https://github.com/greydanus/hamiltonian-nn
Learning what you can do before doing anything. Rybkin, Pertsch, Derpanis, Daniilidis, Jaegle https://openreview.net/pdf?id=SylPMnR9Ym https://daniilidis-group.github.io/learned_action_spaces
Deep Lagrangian networks: using physics as model prior for deep learning. Lutter, Ritter, Peters https://arxiv.org/pdf/1907.04490.pdf
A general framework for structured learning of mechanical systems. Gupta, Menda, Manchester, Kochenderfer https://arxiv.org/pdf/1902.08705.pdf https://github.com/sisl/machamodlearn
Learning predictive models from observation and interaction. Schmeckpeper, Xie, Rybkin, Tian, Daniilidis, Levine, Finn https://arxiv.org/pdf/1912.12773.pdf
A multigrid method for efficiently training video models. Wu, Girshick, He, Feichtenhofer, Krahenbuhl https://arxiv.org/pdf/1912.00998.pdf
Deep variational Koopman models: inferring Koopman observations for uncertainty-aware dynamics modeling and control. Morton, Witherden, Kochenderfer https://arxiv.org/pdf/1902.09742.pdf
Symplectic ODE-NET: learning hamiltonian dynamics with control. Zhong, Dey, Chakraborty https://arxiv.org/pdf/1909.12077.pdf
Hamiltonian graph networks with ODE integrators. Sanchez-Gonzalez, Bapst, Cranmer, Battaglia https://arxiv.org/pdf/1909.12790.pdf
Neural ordinary differential equations. Chen, Rubanova, Bettencourt, Duvenaud https://arxiv.org/pdf/1806.07366.pdf https://github.com/rtqichen/torchdiffeq
Variational autoencoder trajectory primitives and discrete latent codes. Osa, Ikemoto https://arxiv.org/pdf/1912.04063.pdf
Newton vs the machine: solving the chaotic three-body problem using deep neural networks. Breen, Foley, Boekholt, Zwart https://arxiv.org/pdf/1910.07291.pdf
Learning dynamical systems from partial observations. Ayed, de Bezenac, Pajot, Brajard, Gallinari https://arxiv.org/pdf/1902.11136.pdf
GP-VAE: deep probabilistic time series imputation. Fortuin, Baranchuk, Ratsch, Mandt https://arxiv.org/pdf/1907.04155.pdf https://github.com/ratschlab/GP-VAE
Ghost hunting in the nonlinear dynamic machine. Butner, Munion, Baucom, Wong https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226572
Faster attend-infer-repeat with tractable probabilistic models. Stelzner, Peharz, Kersting http://proceedings.mlr.press/v97/stelzner19a/stelzner19a.pdf https://github.com/stelzner/supair
Tree-structured recurrent switching linear dynamical systems for multi-scale modeling. Nassar, Linderman, Bugallo, Park https://arxiv.org/pdf/1811.12386.pdf
DynaNet: neural Kalman dynamical model for motion estimation and prediction. Chen, Lu, Wang, Trigoni, Markham https://arxiv.org/pdf/1908.03918.pdf
Disentangled behavioral representations. Dezfouli, Ashtiani, Ghattas, Nock, Dayan, Ong https://papers.nips.cc/paper/8497-disentangled-behavioural-representations.pdf
Structured object-aware physics prediction for video modeling and planning. Kossen, Stelzner, Hussing, Voelcker, Kersting https://arxiv.org/pdf/1910.02425.pdf https://github.com/jlko/STOVE
Recurrent attentive neural process for sequential data. Qin, Zhu, Qin, Wang, Zhao https://arxiv.org/pdf/1910.09323.pdf https://kasparmartens.rbind.io/post/np/
DeepMDP: learning continuous latent space models for representation learning. Gelada, Kumar, Buckman, Nachum, Bellemare https://arxiv.org/pdf/1906.02736.pdf
Genesis: generative scene inference and sampling with object-centric latent representations. Engelcke, Kosiorek, Jones, Posner https://arxiv.org/pdf/1907.13052.pdf https://github.com/applied-ai-lab/genesis
Deep conservation: a latent dynamics model for exact satisfaction of physical conservation laws. Lee, Carlberg https://arxiv.org/pdf/1909.09754.pdf
Switching linear dynamics for variational bayes filtering. Becker-Ehmck, Peters, van der Smagt https://arxiv.org/pdf/1905.12434.pdf
Approximate Bayesian inference in spatial environments. Mirchev, Kayalibay, Soelch, van der Smagt, Bayer https://arxiv.org/pdf/1805.07206.pdf
beta-DVBF: learning state-space models for control from high dimensional observations. Das, Karl, Becker-Ehmck, van der Smagt https://arxiv.org/pdf/1911.00756.pdf
SSA-GAN: End-to-end time-lapse video generation with spatial self-attention. Horita, Yanai http://img.cs.uec.ac.jp/pub/conf19/191126horita_0.pdf
Learning energy-based spatial-temporal generative convnets for dynamic patterns. Xie, Zhu, Wu https://arxiv.org/pdf/1909.11975.pdf http://www.stat.ucla.edu/~jxie/STGConvNet/STGConvNet.html
Multiplicative interactions and where to find them. Anon https://openreview.net/pdf?id=rylnK6VtDH
Time-series generative adversarial networks. Yoon, Jarrett, van der Schaar https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf
Explaining and interpreting LSTMs. Arras, Arjona-Medina, Widrich, Montavon, Gillhofer, Muller, Hochreiter, Samek https://arxiv.org/pdf/1909.12114.pdf
Gating revisited: deep multi-layer RNNs that can be trained. Turkoglu, D'Aronco, Wegner, Schindler https://arxiv.org/pdf/1911.11033.pdf
Re-examination of the role of latent variables in sequence modeling. Lai, Dai, Yang, Yoo https://arxiv.org/pdf/1902.01388.pdf
Improving sequential latent variable models with autoregressive flows. Marino, Chen, He, Mandt https://openreview.net/pdf?id=HklvmlrKPB
Learning stable and predictive structures in kinetic systems: benefits of a causal approach. Pfister, Bauer, Peters https://arxiv.org/pdf/1810.11776.pdf
Learning to disentangle latent physical factors for video prediction. Zhu, Munderloh, Rosenhahn, Stuckler https://link.springer.com/chapter/10.1007/978-3-030-33676-9_42
Adversarial video generation on complex datasets. Clark, Donahue, Simonyan https://arxiv.org/pdf/1907.06571.pdf
Learning to predict without looking ahead: world models without forward prediction. Freeman, Metz, Ha https://arxiv.org/pdf/1910.13038.pdf
Learning video representations using contrastive bidirectional transformer. Sun, Baradel, Murphy, Schmid https://arxiv.org/pdf/1906.05743.pdf
STCN: stochastic temporal convolution networks. Aksan, Hilliges https://arxiv.org/pdf/1902.06568.pdf http://jacobcwalker.com/DTP/DTP.html https://ait.ethz.ch/projects/2019/stcn/
Zero-shot generation of human-object interaction videos. Nawhal, Zhai, Lehrmann, Sigal https://arxiv.org/pdf/1912.02401.pdf http://www.sfu.ca/~mnawhal/projects/zs_hoi_generation.html
Learning a generative model for multi-step human-object interactions from videos. Wang, Pirk, Yumer, Kim, Sener, Sridhar, Guibas http://www.pirk.info/papers/Wang.etal-2019-LearningInteractions.pdf http://www.pirk.info/projects/learning_interactions/index.html
Dream to control: learning behaviors by latent imagination. Hafner, Lillicrap, Ba, Norouzi https://arxiv.org/pdf/1912.01603.pdf
Multistage attention network for multivariate time series prediction. Hu, Zheng https://www.sciencedirect.com/science/article/abs/pii/S0925231219316625
Predicting video-frames using encoder-convLSTM combination. Mukherjee, Ghosh, Ghosh, Kumar, Roy https://ieeexplore.ieee.org/document/8682158
A variational auto-encoder model for stochastic point processes. Mehrasa, Jyothi, Durand, He, Sigal, Mori https://arxiv.org/pdf/1904.03273.pdf
Unsupervised speech representation learning using WaveNet encoders. Chorowski, Weiss, Bengio, van den Oord https://arxiv.org/pdf/1901.08810.pdf
Local aggregation for unsupervised learning of visual embeddings. Zhuang, Zhai, Yamins http://openaccess.thecvf.com/content_ICCV_2019/papers/Zhuang_Local_Aggregation_for_Unsupervised_Learning_of_Visual_Embeddings_ICCV_2019_paper.pdf
Hamiltonian generative networks. Toth, Rezende, Jaegle, Racaniere, Botev, Higgins https://arxiv.org/pdf/1909.13789.pdf
VideoBERT: a joint model for video and language representation learning. Sun, Myers, Vondrick, Murphy, Schmid https://arxiv.org/pdf/1904.01766.pdf
Video representation learning by dense predictive coding. Han, Xie, Zisserman http://openaccess.thecvf.com/content_ICCVW_2019/papers/HVU/Han_Video_Representation_Learning_by_Dense_Predictive_Coding_ICCVW_2019_paper.pdf https://github.com/TengdaHan/DPC
Unsupervised state representation learning in Atari. Anand, Racah, Ozair, Bengio, Cote, Hjelm https://arxiv.org/pdf/1906.08226.pdf
Temporal cycle-consistency learning. Dwibedi, Aytar, Tompson, Sermanet, Zisserman http://openaccess.thecvf.com/content_CVPR_2019/papers/Dwibedi_Temporal_Cycle-Consistency_Learning_CVPR_2019_paper.pdf
Self-supervised learning by cross-modal audio-video clustering. Alwassel, Mahajan, Torresani, Ghanem, Tran https://arxiv.org/pdf/1911.12667.pdf
Human action recognition with deep temporal pyramids. Mazari, Sahbi https://arxiv.org/pdf/1905.00745.pdf
Evolving losses for unlabeled video representation learning. Piergiovanni, Angelova, Ryoo https://arxiv.org/pdf/1906.03248.pdf
MoGlow: probabilistic and controllable motion synthesis using normalizing flows. Henter, Alexanderson, Beskow https://arxiv.org/pdf/1905.06598.pdf https://www.youtube.com/watch?v=lYhJnDBWyeo
High fidelity video prediction with large stochastic recurrent neural networks. Villegas, Pathak, Kannan, Erhan, Le, Lee https://arxiv.org/pdf/1911.01655.pdf https://sites.google.com/view/videopredictioncapacity
Spatiotemporal pyramid network for video action recognition. Wang, Long, Wan, Yu https://arxiv.org/pdf/1903.01038.pdf
Attentive temporal pyramid network for dynamic scene classification. Huang, Cao, Zhen, Han https://www.aaai.org/ojs/index.php/AAAI/article/view/5184
Disentangling video with independent prediction. Whitney, Fergus https://arxiv.org/pdf/1901.05590.pdf
Disentangling state space representations. Miladinovic, Gondal, Scholkopf, Buhmann, Bauer https://arxiv.org/pdf/1906.03255.pdf
Cycle-SUM: cycle-consistent adversarial LSTM networks for unsupervised video summarization. Yuan, Tay, Li, Zhou, Feng https://arxiv.org/pdf/1904.08265.pdf
Unsupervised learning from video with deep neural embeddings. Zhuang, Andonian, Yamins https://arxiv.org/pdf/1905.11954.pdf
Scaling and benchmarking self-supervised visual representation learning. Goyal, Mahajan, Gupta, Misra https://arxiv.org/pdf/1905.01235.pdf
Self-supervised visual feature learning with deep neural networks: a survey. Jing, Tian https://arxiv.org/pdf/1902.06162.pdf
Unsupervised learning of object structure and dynamics from videos. Minderer, Sun, Villegas, Cole, Murphy, Lee https://arxiv.org/pdf/1906.07889.pdf
Learning correspondence from the cycle-consistency of time. Wang, Jabri, Efros https://arxiv.org/pdf/1903.07593.pdf https://ajabri.github.io/timecycle/
DistInit: learning video representations without a single labeled video. Girdhar, Tran, Torresani, Ramanan https://arxiv.org/pdf/1901.09244.pdf
VideoFlow: a flow-based generative model for video. Kumar, Babaeizadeh, Erhan, Finn, Levine, Dinh, Kingma https://arxiv.org/pdf/1903.01434.pdf (code in the tensor2tensor library)
Learning latent dynamics for planning from pixels. Hafner, Lillicrap, Fischer, Villegas, Ha, Lee, Davidson https://arxiv.org/pdf/1811.04551.pdf https://github.com/google-research/planet
View-LSTM: Novel-view video synthesis through view decomposition. Lakhal, Lanz, Cavallaro http://openaccess.thecvf.com/content_ICCV_2019/papers/Lakhal_View-LSTM_Novel-View_Video_Synthesis_Through_View_Decomposition_ICCV_2019_paper.pdf
Likelihood contribution based multi-scale architecture for generative flows. Das, Abbeel, Spanos https://arxiv.org/pdf/1908.01686.pdf
Adaptive online planning for continual lifelong learning. Lu, Mordatch, Abbeel https://arxiv.org/pdf/1912.01188.pdf
Exploiting video sequences for unsupervised disentangling in generative adversarial networks. Tuesca, Uzal https://arxiv.org/pdf/1910.11104.pdf
Memory in memory: a predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. Wang, Zhang, Zhu, Long, Wang, Yu https://arxiv.org/pdf/1811.07490.pdf
Improved conditional VRNNs for video prediction. Castrejon, Ballas, Courville https://arxiv.org/pdf/1904.12165.pdf
Temporal difference variational auto-encoder. Gregor, Papamakarios, Besse, Buesing, Weber https://arxiv.org/pdf/1806.03107.pdf
Time-agnostic prediction: predicting predictable video frames. Jayaraman, Ebert, Efros, Levine https://arxiv.org/pdf/1808.07784.pdf https://sites.google.com/view/ta-pred
Variational tracking and prediction with generative disentangled state-space models. Akhundov, Soelch, Bayer, van der Smagt https://arxiv.org/pdf/1910.06205.pdf
Self-supervised spatiotemporal learning via video clip order prediction. Xu, Xiao, Zhao, Shao, Xie, Zhuang https://pdfs.semanticscholar.org/558a/eb7aa38cfcf8dd9951bfd24cf77972bd09aa.pdf https://github.com/xudejing/VCOP
Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. Wang, Jiao, Bao, He, Liu, Liu http://openaccess.thecvf.com/content_CVPR_2019/papers/Wang_Self-Supervised_Spatio-Temporal_Representation_Learning_for_Videos_by_Predicting_Motion_and_CVPR_2019_paper.pdf
Spatio-temporal associative representation for video person re-identification. Wu, Zhu, Gong http://www.eecs.qmul.ac.uk/~sgg/papers/WuEtAl_BMVC2019.pdf
Object segmentation using pixel-wise adversarial loss. Durall, Pfreundt, Kothe, Keuper https://arxiv.org/pdf/1909.10341.pdf
The dreaming variational autoencoder for reinforcement learning environments. Andersen, Goodwin, Granmo https://arxiv.org/pdf/1810.01112v1.pdf
MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics. Yan, Rastogi, Villegas, Sunkavalli, Shechtman, Hadap, Yumer, Lee http://openaccess.thecvf.com/content_ECCV_2018/html/Xinchen_Yan_Generating_Multimodal_Human_ECCV_2018_paper.html
Deep learning for universal linear embeddings of nonlinear dynamics. Lusch, Kutz, Brunton https://www.nature.com/articles/s41467-018-07210-0
Variational attention for sequence-to-sequence models. Bahuleyan, Mou, Vechtomova, Poupart https://arxiv.org/pdf/1712.08207.pdf https://github.com/variational-attention/tf-var-attention
Understanding image motion with group representations. Jaegle, Phillips, Ippolito, Daniilidis https://openreview.net/forum?id=SJLlmG-AZ
Relational neural expectation maximization: unsupervised discovery of objects and their interactions. van Steenkiste, Chang, Greff, Schmidhuber https://arxiv.org/pdf/1802.10353.pdf https://sites.google.com/view/r-nem-gifs https://github.com/sjoerdvansteenkiste/Relational-NEM
A general method for amortizing variational filtering. Marino, Cvitkovic, Yue https://arxiv.org/pdf/1811.05090.pdf https://github.com/joelouismarino/amortized-variational-filtering
Deep learning for physical processes: incorporating prior scientific knowledge. de Bezenac, Pajot, Gallinari https://arxiv.org/pdf/1711.07970.pdf https://github.com/emited/flow
Probabilistic recurrent state-space models. Doerr, Daniel, Schiegg, Nguyen-Tuong, Schaal, Toussaint, Trimpe https://arxiv.org/pdf/1801.10395.pdf https://github.com/boschresearch/PR-SSM
TGANv2: efficient training of large models for video generation with multiple subsampling layers. Saito, Saito https://arxiv.org/abs/1811.09245
Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. Acharya, Huang, Paudel, Gool https://arxiv.org/abs/1810.02419
Representation learning with contrastive predictive coding. van den Oord, Li, Vinyals https://arxiv.org/pdf/1807.03748.pdf
Deconfounding reinforcement learning in observational settings. Lu, Scholkopf, Hernandez-Lobato https://arxiv.org/pdf/1812.10576.pdf
Flow-grounded spatial-temporal video prediction from still images. Li, Fang, Yang, Wang, Lu, Yang https://arxiv.org/pdf/1807.09755.pdf
Adaptive skip intervals: temporal abstractions for recurrent dynamical models. Neitz, Parascandolo, Bauer, Scholkopf https://arxiv.org/pdf/1808.04768.pdf
Disentangled sequential autoencoder. Li, Mandt https://arxiv.org/abs/1803.02991 https://github.com/yatindandi/Disentangled-Sequential-Autoencoder
Video jigsaw: unsupervised learning of spatiotemporal context for video action recognition. Ahsan, Madhok, Essa https://arxiv.org/pdf/1808.07507.pdf
Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. Wei, Xie, Ren, Xia, Su, Liu, Tian, Yuille https://arxiv.org/pdf/1812.00329.pdf
Stochastic adversarial video prediction. Lee, Zhang, Ebert, Abbeel, Finn, Levine https://arxiv.org/pdf/1804.01523.pdf https://alexlee-gk.github.io/video_prediction/
Stochastic variational video prediction. Babaeizadeh, Finn, Erhan, Campbell, Levine https://arxiv.org/pdf/1710.11252.pdf https://github.com/alexlee-gk/video_prediction
Folded recurrent neural networks for future video prediction. Oliu, Selva, Escalera https://arxiv.org/pdf/1712.00311.pdf
PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. Wang, Gao, Long, Wang, Yu https://arxiv.org/pdf/1804.06300.pdf https://github.com/Yunbo426/predrnn-pp
Stochastic video generation with a learned prior. Denton, Fergus https://arxiv.org/pdf/1802.07687.pdf https://sites.google.com/view/svglp
Unsupervised learning from videos using temporal coherency deep networks. Redondo-Cabrera, Lopez-Sastre https://arxiv.org/pdf/1801.08100.pdf
Time-contrastive networks: self-supervised learning from video. Sermanet, Lynch, Chebotar, Hsu, Jang, Schaal, Levine https://arxiv.org/pdf/1704.06888.pdf
Learning to decompose and disentangle representations for video prediction. Hsieh, Liu, Huang, Fei-Fei, Niebles https://arxiv.org/pdf/1806.04166.pdf https://github.com/jthsieh/DDPAE-video-prediction
Probabilistic video generation using holistic attribute control. He, Lehrmann, Marino, Mori, Sigal https://arxiv.org/pdf/1803.08085.pdf
Interpretable intuitive physics model. Ye, Wang, Davidson, Gupta https://arxiv.org/pdf/1808.10002.pdf https://github.com/tianye95/interpretable-intuitive-physics-model
Video synthesis from a single image and motion stroke. Hu, Walchli, Portenier, Zwicker, Favaro https://arxiv.org/pdf/1812.01874.pdf
Graph networks as learnable physics engines for inference and control. Sanchez-Gonzalez, Heess, Springenberg, Merel, Riedmiller, Hadsell, Battaglia https://arxiv.org/pdf/1806.01242.pdf https://drive.google.com/file/d/14eYTWoH15T53a7qejvCkDLItOOE9Ve7S/view
Deep dynamical modeling and control of unsteady fluid flows. Morton, Witherden, Jameson, Kochenderfer https://arxiv.org/pdf/1805.07472.pdf https://github.com/sisl/deep_flow_control
Sequential attend, infer, repeat: generative modelling of moving objects. Kosiorek, Kim, Posner, Teh https://arxiv.org/pdf/1806.01794.pdf https://github.com/akosiorek/sqair https://www.youtube.com/watch?v=-IUNQgSLE0c&feature=youtu.be
Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. Xiong, Luo, Ma, Liu, Luo https://arxiv.org/pdf/1709.07592.pdf
Integrated accounts of behavioral and neuroimaging data using flexible recurrent neural network models. Dezfouli, Morris, Ramos, Dayan, Balleine https://papers.nips.cc/paper/7677-integrated-accounts-of-behavioral-and-neuroimaging-data-using-flexible-recurrent-neural-network-models.pdf
Autoregressive attention for parallel sequence modeling. Laird, Irvin https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2755456.pdf
Physics informed deep learning: data-driven solutions of nonlinear partial differential equations. Raissi, Perdikaris, Karniadakis https://arxiv.org/pdf/1711.10561.pdf https://github.com/maziarraissi/PINNs
Unsupervised real-time control through variational empowerment. Karl, Soelch, Becker-Ehmck, Benbouzid, van der Smagt, Bayer https://arxiv.org/pdf/1710.05101.pdf https://github.com/tessavdheiden/Empowerment
Z-forcing: training stochastic recurrent networks. Goyal, Sordoni, Cote, Ke, Bengio https://arxiv.org/abs/1711.05411 https://github.com/ujjax/z-forcing
View synthesis by appearance flow. Zhou, Tulsiani, Sun, Malik, Efros https://arxiv.org/pdf/1605.03557.pdf
Learning to see physics via visual de-animation. Wu, Lu, Kohli, Freeman, Tenenbaum https://jiajunwu.com/papers/vda_nips.pdf https://github.com/pulkitag/pyphy-engine
Deep predictive coding networks for video prediction and unsupervised learning. Lotter, Kreiman, Cox https://arxiv.org/pdf/1605.08104.pdf
The predictron: end-to-end learning and planning. Silver, van Hasselt, Hessel, Schaul, Guez, Harley, Dulac-Arnold, Reichert, Rabinowitz, Barreto, Degris https://arxiv.org/pdf/1612.08810.pdf
Recurrent ladder networks. Premont-Schwarz, Ilin, Hao, Rasmus, Boney, Valpola https://arxiv.org/pdf/1707.09219.pdf
A disentangled recognition and nonlinear dynamics model for unsupervised learning. Fraccaro, Kamronn, Paquet, Winther https://arxiv.org/pdf/1710.05741.pdf
MoCoGAN: decomposing motion and content for video generation. Tulyakov, Liu, Yang, Kautz https://arxiv.org/pdf/1707.04993.pdf
Temporal generative adversarial nets with singular value clipping. Saito, Matsumoto, Saito https://arxiv.org/pdf/1611.06624.pdf
Multi-task self-supervised visual learning. Doersch, Zisserman https://arxiv.org/pdf/1708.07860.pdf
Prediction under uncertainty with error-encoding networks. Henaff, Zhao, LeCun https://arxiv.org/pdf/1711.04994.pdf https://github.com/mbhenaff/EEN
Unsupervised learning of disentangled representations from video. Denton, Birodkar https://papers.nips.cc/paper/7028-unsupervised-learning-of-disentangled-representations-from-video.pdf https://github.com/ap229997/DRNET
Self-supervised visual planning with temporal skip connections. Ebert, Finn, Lee, Levine https://arxiv.org/pdf/1710.05268.pdf
Unsupervised learning of disentangled and interpretable representations from sequential data. Hsu, Zhang, Glass https://papers.nips.cc/paper/6784-unsupervised-learning-of-disentangled-and-interpretable-representations-from-sequential-data.pdf https://github.com/wnhsu/FactorizedHierarchicalVAE https://github.com/wnhsu/ScalableFHVAE
Decomposing motion and content for natural video sequence prediction. Villegas, Yang, Hong, Lin, Lee https://arxiv.org/pdf/1706.08033.pdf
Unsupervised video summarization with adversarial LSTM networks. Mahasseni, Lam, Todorovic http://web.engr.oregonstate.edu/~sinisa/research/publications/cvpr17_summarization.pdf
Deep variational Bayes filters: unsupervised learning of state space models from raw data. Karl, Soelch, Bayer, van der Smagt https://arxiv.org/pdf/1605.06432.pdf https://github.com/sisl/deep_flow_control
A compositional object-based approach to learning physical dynamics. Chang, Ullman, Torralba, Tenenbaum https://arxiv.org/pdf/1612.00341.pdf https://github.com/mbchang/dynamics
Bayesian learning and inference in recurrent switching linear dynamical systems. Linderman, Johnson, Miller, Adams, Blei, Paninski http://proceedings.mlr.press/v54/linderman17a/linderman17a.pdf https://github.com/slinderman/recurrent-slds
SE3-Nets: learning rigid body motion using deep neural networks. Byravan, Fox https://arxiv.org/pdf/1606.02378.pdf
Beyond temporal pooling: recurrence and temporal convolutions for gesture recognition in video. Pigou, van den Oord, Dieleman, Van Herreweghe, Dambre https://arxiv.org/abs/1506.01911
Dynamic filter networks. De Brabandere, Jia, Tuytelaars, Van Gool https://arxiv.org/pdf/1605.09673.pdf
Dynamic movement primitives in latent space of time-dependent variational autoencoders. Chen, Karl, van der Smagt https://ieeexplore.ieee.org/document/7803340
Learning physical intuition of block towers by example. Lerer, Gross, Fergus https://arxiv.org/pdf/1603.01312.pdf
Structured inference networks for nonlinear state space models. Krishnan, Shalit, Sontag https://arxiv.org/pdf/1609.09869.pdf https://github.com/clinicalml/structuredinference
A recurrent latent variable model for sequential data. Chung, Kastner, Dinh, Goel, Courville, Bengio https://arxiv.org/pdf/1506.02216.pdf https://github.com/jych/nips2015_vrnn
Recognizing micro-actions and reactions from paired egocentric videos. Yonetani, Kitani, Sato http://www.cs.cmu.edu/~kkitani/pdf/YKS-CVPR16.pdf
Anticipating visual representations from unlabeled video. Vondrick, Pirsiavash, Torralba https://www.zpascal.net/cvpr2016/Vondrick_Anticipating_Visual_Representations_CVPR_2016_paper.pdf https://github.com/chiawen/activity-anticipation
Deep multi-scale video prediction beyond mean square error. Mathieu, Couprie, LeCun https://arxiv.org/pdf/1511.05440.pdf
Generating videos with scene dynamics. Vondrick, Pirsiavash, Torralba https://papers.nips.cc/paper/6194-generating-videos-with-scene-dynamics.pdf
Disentangling space and time in video with hierarchical variational auto-encoders. Grathwohl, Wilson https://arxiv.org/pdf/1612.04440.pdf
Understanding visual concepts with continuation learning. Whitney, Chang, Kulkarni, Tenenbaum https://arxiv.org/pdf/1602.06822.pdf
Contextual RNN-GANs for abstract reasoning diagram generation. Ghosh, Kulharia, Mukerjee, Namboodiri, Bansal https://arxiv.org/pdf/1609.09444.pdf
Interaction networks for learning about objects, relations and physics. Battaglia, Pascanu, Lai, Rezende, Kavukcuoglu https://arxiv.org/pdf/1612.00222.pdf https://github.com/jsikyoon/Interaction-networks_tensorflow https://github.com/higgsfield/interaction_network_pytorch https://github.com/ToruOwO/InteractionNetwork-pytorch
An uncertain future: forecasting from static images using variational autoencoders. Walker, Doersch, Gupta, Hebert https://arxiv.org/pdf/1606.07873.pdf
Unsupervised learning for physical interaction through video prediction. Finn, Goodfellow, Levine https://arxiv.org/pdf/1605.07157.pdf
Sequential neural models with stochastic layers. Fraccaro, Sonderby, Paquet, Winther https://arxiv.org/pdf/1605.07571.pdf https://github.com/marcofraccaro/srnn
Learning visual predictive models of physics for playing billiards. Fragkiadaki, Agrawal, Levine, Malik https://arxiv.org/pdf/1511.07404.pdf
Attend, infer, repeat: fast scene understanding with generative models. Eslami, Heess, Weber, Tassa, Szepesvari, Kavukcuoglu, Hinton https://arxiv.org/pdf/1603.08575.pdf http://akosiorek.github.io/ml/2017/09/03/implementing-air.html https://github.com/akosiorek/attend_infer_repeat
Synthesizing robotic handwriting motion by learning from human demonstrations. Yin, Alves-Oliveira, Melo, Billard, Paiva https://pdfs.semanticscholar.org/951e/14dbef0036fddbecb51f1577dd77c9cd2cf3.pdf
Learning stochastic recurrent networks. Bayer, Osendorfer https://arxiv.org/pdf/1411.7610.pdf https://github.com/durner/STORN-keras
Deep Kalman filters. Krishnan, Shalit, Sontag https://arxiv.org/pdf/1511.05121.pdf https://github.com/k920049/Deep-Kalman-Filter
Unsupervised learning of visual representations using videos. Wang, Gupta https://arxiv.org/pdf/1505.00687.pdf
Embed to control: a locally linear latent dynamics model for control from raw images. Watter, Springenberg, Riedmiller, Boedecker https://arxiv.org/pdf/1506.07365.pdf https://github.com/ericjang/e2c
Seeing the arrow of time. Pickup, Pan, Wei, Shih, Zhang, Zisserman, Scholkopf, Freeman https://www.robots.ox.ac.uk/~vgg/publications/2014/Pickup14/pickup14.pdf
Activity forecasting. Kitani, Ziebart, Bagnell, Hebert http://www.cs.cmu.edu/~kkitani/pdf/KZBH-ECCV12.pdf
Information flows in causal networks. Ay, Polani https://sfi-edu.s3.amazonaws.com/sfi-edu/production/uploads/sfi-com/dev/uploads/filer/45/5f/455fd460-b6b0-4008-9de1-825a5e2b9523/06-05-014.pdf
Slow feature analysis. Wiskott, Sejnowski http://www.cnbc.cmu.edu/~tai/readings/learning/wiskott_sejnowski_2002.pdf
Learning variational latent dynamics: towards model-based imitation and control. Yin, Melo, Billard, Paiva https://pdfs.semanticscholar.org/40af/a07f86a6f7c3ec2e4e02665073b1e19652bc.pdf