R2Plus1D-C3D

A PyTorch implementation of R2Plus1D and C3D based on the CVPR 2018 paper A Closer Look at Spatiotemporal Convolutions for Action Recognition and the ICCV 2015 paper Learning Spatiotemporal Features with 3D Convolutional Networks.
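The key idea of R(2+1)D is to factorize each full 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. The following is a minimal, hypothetical sketch of such a (2+1)D block (the intermediate width follows the parameter-matching rule from the paper); it is illustrative only and may differ from the exact layers used in this repository.

import torch
import torch.nn as nn

class SpatioTemporalConv(nn.Module):
    """Hypothetical (2+1)D block: a 2D spatial conv followed by a 1D temporal conv.

    The intermediate channel count is chosen so the parameter count roughly
    matches a full 3D convolution, as described in the paper.
    """
    def __init__(self, in_channels, out_channels, kernel_size=(3, 3, 3),
                 stride=(1, 1, 1), padding=(1, 1, 1)):
        super().__init__()
        t, h, w = kernel_size
        # Parameter-matching intermediate width from the paper.
        mid = (t * h * w * in_channels * out_channels) // (
            h * w * in_channels + t * out_channels)
        self.spatial = nn.Conv3d(in_channels, mid, (1, h, w),
                                 stride=(1, stride[1], stride[2]),
                                 padding=(0, padding[1], padding[2]))
        self.bn = nn.BatchNorm3d(mid)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid, out_channels, (t, 1, 1),
                                  stride=(stride[0], 1, 1),
                                  padding=(padding[0], 0, 0))

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Example: a clip of 32 RGB frames at 112x112.
clip = torch.randn(1, 3, 32, 112, 112)
out = SpatioTemporalConv(3, 64)(clip)
print(out.shape)  # torch.Size([1, 64, 32, 112, 112])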

Requirements

  • pytorch
conda install pytorch torchvision -c pytorch
  • opencv
conda install opencv
  • rarfile
pip install rarfile
  • rar
sudo apt install rar
  • unrar
sudo apt install unrar
  • ffmpeg
sudo apt install build-essential openssl libssl-dev autoconf automake cmake git-core libass-dev libfreetype6-dev libsdl2-dev libtool libva-dev libvdpau-dev libvorbis-dev libxcb1-dev libxcb-shm0-dev libxcb-xfixes0-dev pkg-config texinfo wget zlib1g-dev nasm yasm libx264-dev libx265-dev libnuma-dev libvpx-dev libfdk-aac-dev libmp3lame-dev libopus-dev
wget https://ffmpeg.org/releases/ffmpeg-4.1.3.tar.bz2
tar -jxvf ffmpeg-4.1.3.tar.bz2
cd ffmpeg-4.1.3/
./configure --prefix="../build" --enable-static --enable-gpl --enable-libass --enable-libfdk-aac --enable-libfreetype --enable-libmp3lame --enable-libopus --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-nonfree --enable-openssl
make -j4
make install
sudo cp ../build/bin/ffmpeg /usr/local/bin/ 
rm -rf ../ffmpeg-4.1.3/ ../ffmpeg-4.1.3.tar.bz2 ../build/
  • youtube-dl
pip install youtube-dl
  • joblib
pip install joblib
  • PyTorchNet
pip install git+https://github.com/pytorch/tnt.git@master
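After installing the dependencies above, a quick sanity check such as the following hypothetical snippet (not part of this repository) can confirm that the Python packages and command-line tools are visible:

import shutil

import cv2
import rarfile
import torch
import torchvision

print('torch', torch.__version__, 'CUDA available:', torch.cuda.is_available())
print('torchvision', torchvision.__version__)
print('opencv', cv2.__version__)
print('rarfile unrar tool:', rarfile.UNRAR_TOOL)
# External tools installed via apt/pip should be on PATH.
for tool in ('ffmpeg', 'rar', 'unrar', 'youtube-dl'):
    print(tool, '->', shutil.which(tool))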

Datasets

The datasets come from UCF101, HMDB51, and Kinetics600. Download the UCF101 and HMDB51 datasets together with their train/val/test split files into the data directory. We use split1 of the split files. Run misc.py to preprocess these datasets.

For the Kinetics600 dataset, first download the train/val/test split files into the data directory, then run download.py to download and preprocess the dataset.
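For reference, the official UCF101 split1 files use a simple text format: classInd.txt maps 1-based class indices to class names, and trainlist01.txt lists relative video paths with a class index. A hypothetical parsing sketch, assuming the split files sit under data/ucf101/ucfTrainTestlist, could look like this; misc.py performs its own preprocessing and may organize things differently:

from pathlib import Path

split_dir = Path('data/ucf101/ucfTrainTestlist')  # assumed location of the split1 files

# classInd.txt lines look like "1 ApplyEyeMakeup".
classes = {}
for line in (split_dir / 'classInd.txt').read_text().splitlines():
    if line.strip():
        idx, name = line.split()
        classes[name] = int(idx) - 1  # convert to 0-based labels

# trainlist01.txt lines look like "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1".
train = []
for line in (split_dir / 'trainlist01.txt').read_text().splitlines():
    if line.strip():
        video = line.split()[0]
        train.append((video, classes[video.split('/')[0]]))

print(len(classes), 'classes,', len(train), 'training videos')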

Usage

Train Model

visdom -logging_level WARNING & python train.py --num_epochs 20 --pre_train kinetics600_r2plus1d.pth
optional arguments:
--data_type                   dataset type [default value is 'ucf101'](choices=['ucf101', 'hmdb51', 'kinetics600'])
--gpu_ids                     selected gpu [default value is '0,1']
--model_type                  model type [default value is 'r2plus1d'](choices=['r2plus1d', 'c3d'])
--batch_size                  training batch size [default value is 8]
--num_epochs                  training epochs number [default value is 100]
--pre_train                   used pre-trained model epoch name [default value is None]

Visdom can then be accessed at 127.0.0.1:8097 in your browser.

Inference Video

python inference.py --video_name data/ucf101/ApplyLipstick/v_ApplyLipstick_g04_c02.avi
optional arguments:
--data_type                   dataset type [default value is 'ucf101'](choices=['ucf101', 'hmdb51', 'kinetics600'])
--model_type                  model type [default value is 'r2plus1d'](choices=['r2plus1d', 'c3d'])
--video_name                  test video name
--model_name                  model epoch name [default value is 'ucf101_r2plus1d.pth']

The inference result will be shown in a pop-up window.
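A hypothetical sketch of how a prediction could be overlaid on the video frames in such a pop-up window is shown below (inference.py has its own display logic, which may differ); the label and probability here are placeholder values:

import cv2

def show_prediction(video_path, label, prob):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Draw the predicted class and its probability on every frame.
        cv2.putText(frame, '%s: %.2f' % (label, prob), (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow('result', frame)
        if cv2.waitKey(30) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()

show_prediction('data/ucf101/ApplyLipstick/v_ApplyLipstick_g04_c02.avi',
                'ApplyLipstick', 0.9)  # placeholder prediction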

Benchmarks

The Adam optimizer (lr=0.0001) is used with learning rate scheduling.
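A hedged sketch of this optimizer setup is shown below; the specific scheduler used by train.py is not detailed here, so ReduceLROnPlateau is shown only as one plausible choice:

from torch import nn, optim

model = nn.Conv3d(3, 64, kernel_size=3, padding=1)  # stand-in for R2Plus1D/C3D
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                 factor=0.1, patience=10)

for epoch in range(100):
    # ... train and validate for one epoch ...
    val_loss = 1.0 / (epoch + 1)  # placeholder validation loss
    scheduler.step(val_loss)      # the scheduler reacts to the validation loss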

For the UCF101 and HMDB51 datasets, the models are trained for 100 epochs with a batch size of 8 on one NVIDIA Tesla V100 (32GB) GPU.

For the Kinetics600 dataset, the models are trained for 100 epochs with a batch size of 32 on two NVIDIA Tesla V100 (32GB) GPUs. Because the training time is too long, this experiment has not been finished.

The videos are preprocessed into 32 frames of 128x128 and then cropped to 112x112.
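A hypothetical sketch of this clip preprocessing (uniformly sample 32 frames, resize to 128x128, then take a 112x112 crop; a center crop is shown here, while the actual training code may crop randomly):

import cv2
import numpy as np

def load_clip(video_path, num_frames=32, resize=128, crop=112):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (resize, resize)))
    cap.release()
    # Uniformly sample num_frames indices over the whole video.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    clip = np.stack([frames[i] for i in idx])           # (T, H, W, C)
    off = (resize - crop) // 2
    clip = clip[:, off:off + crop, off:off + crop, :]   # center 112x112 crop
    return clip.transpose(3, 0, 1, 2)                   # (C, T, H, W) for PyTorch

clip = load_clip('data/ucf101/ApplyLipstick/v_ApplyLipstick_g04_c02.avi')
print(clip.shape)  # (3, 32, 112, 112)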

Dataset                        | UCF101     | HMDB51     | Kinetics600
Num. of Train Videos           | 9,537      | 3,570      | 375,008
Num. of Val Videos             | 756        | 1,666      | 28,638
Num. of Test Videos            | 3,783      | 1,530      | 56,982
Num. of Classes                | 101        | 51         | 600
Accuracy (R2Plus1D)            | 63.60%     | 24.97%     | \
Accuracy (C3D)                 | 51.63%     | 25.10%     | \
Num. of Parameters (R2Plus1D)  | 33,220,990 | 33,195,340 | 33,476,977
Num. of Parameters (C3D)       | 78,409,573 | 78,204,723 | 80,453,976
Training Time (R2Plus1D)       | 19.3h      | 7.3h       | 350h
Training Time (C3D)            | 10.9h      | 4.1h       | 190h

Results

The train/val/test loss, accuracy, and confusion matrix are shown on visdom. The pretrained models can be downloaded from BaiduYun (access code: ducr).

UCF101

R2Plus1D result C3D result

HMDB51

R2Plus1D result C3D result
