A PyTorch implementation of R2Plus1D and C3D based on the CVPR 2018 paper A Closer Look at Spatiotemporal Convolutions for Action Recognition and the ICCV 2015 paper Learning Spatiotemporal Features with 3D Convolutional Networks.
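For orientation, the key idea behind R2Plus1D is to factorize each full 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. Below is a minimal sketch of one such (2+1)D block; the channel sizes are illustrative, not the exact ones this repo uses.

```python
import torch
import torch.nn as nn

class SpatioTemporalConv(nn.Module):
    # (2+1)D block: 2D spatial conv, then 1D temporal conv, with BN + ReLU between
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        # spatial convolution: kernel (1, 3, 3) touches H and W only
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        # temporal convolution: kernel (3, 1, 1) touches T only
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        return self.temporal(self.relu(self.bn(self.spatial(x))))

clip = torch.randn(1, 3, 32, 112, 112)      # (batch, C, T, H, W)
out = SpatioTemporalConv(3, 45, 64)(clip)   # -> (1, 64, 32, 112, 112)
```

Compared with a full 3x3x3 convolution, the factorization adds an extra nonlinearity for a similar parameter budget, which the paper credits for part of the accuracy gain.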
- PyTorch
conda install pytorch torchvision -c pytorch
- opencv
conda install opencv
- rarfile
pip install rarfile
- rar
sudo apt install rar
- unrar
sudo apt install unrar
- ffmpeg
sudo apt install build-essential openssl libssl-dev autoconf automake cmake git-core libass-dev libfreetype6-dev libsdl2-dev libtool libva-dev libvdpau-dev libvorbis-dev libxcb1-dev libxcb-shm0-dev libxcb-xfixes0-dev pkg-config texinfo wget zlib1g-dev nasm yasm libx264-dev libx265-dev libnuma-dev libvpx-dev libfdk-aac-dev libmp3lame-dev libopus-dev
wget https://ffmpeg.org/releases/ffmpeg-4.1.3.tar.bz2
tar -jxvf ffmpeg-4.1.3.tar.bz2
cd ffmpeg-4.1.3/
./configure --prefix="../build" --enable-static --enable-gpl --enable-libass --enable-libfdk-aac --enable-libfreetype --enable-libmp3lame --enable-libopus --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-nonfree --enable-openssl
make -j4
make install
sudo cp ../build/bin/ffmpeg /usr/local/bin/
rm -rf ../ffmpeg-4.1.3/ ../ffmpeg-4.1.3.tar.bz2 ../build/
- youtube-dl
pip install youtube-dl
- joblib
pip install joblib
- PyTorchNet
pip install git+https://github.com/pytorch/tnt.git@master
The datasets come from UCF101, HMDB51 and Kinetics600.
Download the UCF101 and HMDB51 datasets along with their train/val/test split files into the data directory. We use split1 to divide the datasets. Run misc.py to preprocess these datasets.
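misc.py performs the actual preprocessing; as a rough sketch of what extracting a fixed-length clip with OpenCV can look like (the uniform-sampling policy below is an assumption, not necessarily what misc.py does):

```python
import cv2
import numpy as np

def extract_clip(video_path, num_frames=32, size=(128, 128)):
    # decode every frame and resize it to the working resolution
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))
    capture.release()
    # uniformly sample num_frames indices across the whole video
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return np.stack([frames[i] for i in indices])   # (32, 128, 128, 3)
```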
For the Kinetics600 dataset, first download the train/val/test split files into the data directory, then run download.py to download and preprocess the dataset.
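download.py does the real work; the underlying pattern is parallel youtube-dl calls driven by joblib, roughly like the sketch below (the output layout is an assumption, the ids are hypothetical, and the ffmpeg trimming of the labeled segment is omitted):

```python
import subprocess
from joblib import Parallel, delayed

def download(youtube_id, out_dir='data/kinetics600'):
    # fetch one clip with youtube-dl; the real script also trims the
    # labeled segment with ffmpeg, which is omitted here
    url = 'https://www.youtube.com/watch?v=' + youtube_id
    subprocess.call(['youtube-dl', '-f', 'mp4',
                     '-o', '{}/{}.mp4'.format(out_dir, youtube_id), url])

# hypothetical ids as they would appear in the Kinetics split files
ids = ['---QUuC4vJs', '--GkrdYZ9Tc']
Parallel(n_jobs=8)(delayed(download)(i) for i in ids)
```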
visdom -logging_level WARNING & python train.py --num_epochs 20 --pre_train kinetics600_r2plus1d.pth
optional arguments:
--data_type dataset type [default value is 'ucf101'] (choices=['ucf101', 'hmdb51', 'kinetics600'])
--gpu_ids selected gpu [default value is '0,1']
--model_type model type [default value is 'r2plus1d'] (choices=['r2plus1d', 'c3d'])
--batch_size training batch size [default value is 8]
--num_epochs training epochs number [default value is 100]
--pre_train used pre-trained model epoch name [default value is None]
Visdom can now be accessed by going to 127.0.0.1:8097 in your browser.
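The curves are drawn through PyTorchNet's Visdom loggers; the logging pattern looks roughly like this (the window title and values are illustrative):

```python
from torchnet.logger import VisdomPlotLogger

# one line plot per metric; the visdom server must already be running
train_loss_logger = VisdomPlotLogger('line', opts={'title': 'Train Loss'})
for epoch, loss in enumerate([1.9, 1.2, 0.8], start=1):
    train_loss_logger.log(epoch, loss)
```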
python inference.py --video_name data/ucf101/ApplyLipstick/v_ApplyLipstick_g04_c02.avi
optional arguments:
--data_type dataset type [default value is 'ucf101'] (choices=['ucf101', 'hmdb51', 'kinetics600'])
--model_type model type [default value is 'r2plus1d'] (choices=['r2plus1d', 'c3d'])
--video_name test video name
--model_name model epoch name [default value is 'ucf101_r2plus1d.pth']
The inference result will be shown in a pop-up window.
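Under the hood, inference amounts to restoring a checkpoint, shaping the video into a clip tensor, and taking the argmax over class scores; a self-contained sketch with a stand-in network (the real model is built by the repo's model code):

```python
import torch
import torch.nn as nn

# stand-in for the repo's network, just to make the sketch runnable
model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 101))
# inference.py presumably restores weights along these lines:
# model.load_state_dict(torch.load('ucf101_r2plus1d.pth', map_location='cpu'))
model.eval()

clip = torch.randn(1, 3, 32, 112, 112)    # (batch, C, T, H, W), as preprocessed
with torch.no_grad():
    scores = model(clip)                  # (1, 101) class scores for UCF101
    prediction = scores.argmax(dim=1)     # predicted action class index
```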
The Adam optimizer (lr=0.0001) is used with learning rate scheduling; a minimal scheduler sketch follows the next paragraph.
For the ucf101 and hmdb51 datasets, the models are trained for 100 epochs with a batch size of 8 on one NVIDIA Tesla V100 (32G) GPU. For the kinetics600 dataset, the models are trained for 100 epochs with a batch size of 32 on two NVIDIA Tesla V100 (32G) GPUs. Because the training time is too long, this experiment has not been finished.
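The specific learning-rate scheduler is not named here; a minimal sketch assuming PyTorch's ReduceLROnPlateau keyed to validation accuracy:

```python
import torch
import torch.optim as optim

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = optim.Adam(params, lr=1e-4)
# the scheduler type is an assumption; the README only says scheduling is used
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max',
                                                 factor=0.1, patience=5)

for epoch in range(100):
    val_acc = 0.5           # placeholder for the real validation accuracy
    scheduler.step(val_acc) # shrink lr when validation accuracy plateaus
```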
The videos are preprocessed into 32 frames of size 128x128, then cropped to 112x112.
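In tensor terms, each sample ends up as a (C, T, H, W) clip; a minimal sketch of the crop step (random cropping for training is an assumption):

```python
import torch

clip = torch.randn(3, 32, 128, 128)   # (C, T, H, W) after preprocessing
top, left = torch.randint(0, 128 - 112 + 1, (2,)).tolist()
cropped = clip[:, :, top:top + 112, left:left + 112]   # -> (3, 32, 112, 112)
```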
Dataset | UCF101 | HMDB51 | Kinetics600 |
---|---|---|---|
Num. of Train Videos | 9,537 | 3,570 | 375,008 |
Num. of Val Videos | 756 | 1,666 | 28,638 |
Num. of Test Videos | 3,783 | 1,530 | 56,982 |
Num. of Classes | 101 | 51 | 600 |
Accuracy (R2Plus1D) | 63.60% | 24.97% | \ |
Accuracy (C3D) | 51.63% | 25.10% | \ |
Num. of Parameters (R2Plus1D) | 33,220,990 | 33,195,340 | 33,476,977 |
Num. of Parameters (C3D) | 78,409,573 | 78,204,723 | 80,453,976 |
Training Time (R2Plus1D) | 19.3h | 7.3h | 350h |
Training Time (C3D) | 10.9h | 4.1h | 190h |
The train/val/test loss, accuracy and confusion matrix are shown on Visdom. The pretrained models can be downloaded from BaiduYun (access code: ducr).