RGBPoseConv3D is a framework that jointly uses 2D human skeletons and RGB appearance for human action recognition. It is a two-stream 3D CNN whose architecture is borrowed from SlowFast. In RGBPoseConv3D:

  • The RGB stream corresponds to the slow stream in SlowFast; the Skeleton stream corresponds to the fast stream.
  • The input resolution of the RGB frames is 4x that of the pseudo heatmaps.
  • Bilateral connections are used for early feature fusion between the two modalities.
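
As a rough illustration of the first two points, the sketch below computes the per-sample input shape of each stream. It reuses values that appear in the test pipeline later in this document (8 RGB frames and 32 pose frames per clip, 256x256 RGB frames, heatmap scaling of 0.25). The keypoint count of 17 and the function name `stream_input_shapes` are assumptions made for this example only.

```python
def stream_input_shapes(rgb_len=8, pose_len=32, rgb_size=256,
                        scaling=0.25, num_keypoints=17):
    """Return (C, T, H, W) input shapes for the RGB and Pose streams.

    num_keypoints=17 is an assumption (one heatmap channel per keypoint).
    """
    heatmap_size = int(rgb_size * scaling)  # 256 * 0.25 = 64
    rgb_shape = (3, rgb_len, rgb_size, rgb_size)
    pose_shape = (num_keypoints, pose_len, heatmap_size, heatmap_size)
    return rgb_shape, pose_shape


rgb_shape, pose_shape = stream_input_shapes()
print(rgb_shape)   # (3, 8, 256, 256)
print(pose_shape)  # (17, 32, 64, 64)
```

The RGB stream sees fewer frames at higher spatial resolution and the Pose stream sees more frames at lower resolution, which mirrors the slow/fast split described above.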


  title={Revisiting skeleton-based action recognition},
  author={Duan, Haodong and Zhao, Yue and Chen, Kai and Lin, Dahua and Dai, Bo},
  journal={arXiv preprint arXiv:2104.13586},

How to train RGBPoseConv3D (on NTURGB+D, for example)?

Step 0. Data Preparation

Besides the skeleton annotations, you also need RGB videos to train RGBPoseConv3D. Download them from the official website of NTURGB+D and put them in $PYSKL/data/nturgbd_raw. After that, use the provided script to compress the raw videos (from 1920x1080 to 960x540) and change the suffix to .mp4:

# This step is mandatory, unless you know how to modify the code & config to make it work for raw videos!
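
To show what this step amounts to, here is a minimal sketch; it is not the provided script. It only builds an `ffmpeg` command that rescales one raw video to 960x540 and writes it with the `.mp4` suffix. It assumes that `ffmpeg` is installed and that raw files end in `_rgb.avi`; the helper name `build_compress_cmd` is hypothetical.

```python
import os.path as osp


def build_compress_cmd(src, dst_dir):
    """Build an ffmpeg command that rescales `src` to 960x540 as an .mp4 file."""
    name = osp.basename(src)
    # Assumption: raw videos are named like S001C001P001R001A001_rgb.avi
    if name.endswith('_rgb.avi'):
        stem = name[:-len('_rgb.avi')]
    else:
        stem = osp.splitext(name)[0]
    dst = osp.join(dst_dir, stem + '.mp4')
    return ['ffmpeg', '-y', '-i', src, '-vf', 'scale=960:540', dst]


cmd = build_compress_cmd(
    'data/nturgbd_raw/S001C001P001R001A001_rgb.avi', 'data/nturgbd_videos')
print(' '.join(cmd))
# To actually run it (requires ffmpeg on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```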

After that, you will find processed videos in $PYSKL/data/nturgbd_videos, named like S001C001P001R001A001.mp4.
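
The file names encode the recording metadata. The sketch below parses one, assuming the usual NTURGB+D naming convention (S = setup, C = camera, P = performer, R = replication, A = action class); the helper `parse_ntu_name` is hypothetical and not part of the codebase.

```python
import re

# One 3-digit field per letter, e.g. S001C001P001R001A001
NTU_NAME = re.compile(r'S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})')


def parse_ntu_name(filename):
    """Split an NTURGB+D-style file name into its integer fields."""
    match = NTU_NAME.search(filename)
    if match is None:
        raise ValueError(f'not an NTURGB+D-style name: {filename}')
    keys = ('setup', 'camera', 'performer', 'replication', 'action')
    return dict(zip(keys, map(int, match.groups())))


info = parse_ntu_name('S001C001P001R001A001.mp4')
print(info['performer'], info['action'])  # 1 1
```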

Step 1. Pretraining

You first need to train the RGB-only and Pose-only models on the target dataset; the pretrained checkpoints will be used to initialize the RGBPoseConv3D model.

You can either train these two models from scratch with the provided config files:

# We train each model for 180 epochs. By default, we use 8 GPUs.
# Train the RGB-only model
bash tools/ configs/rgbpose_conv3d/ 8 --validate --test-last --test-best
# Train the Pose-only model
bash tools/ configs/rgbpose_conv3d/ 8 --validate --test-last --test-best

or directly download and use the provided pretrained models:

| Dataset | Config | Checkpoint | Top-1 (1 clip testing) | Top-1 (10 clip testing) |
| --- | --- | --- | --- | --- |
| NTURGB+D XSub | rgb_config | rgb_ckpt | 94.4 | 95.1 |
| NTURGB+D XSub | pose_config | pose_ckpt | 92.9 | 93.2 |

Step 2. Generate the initializing weight for RGBPoseConv3D

You can use the provided IPython notebook to merge the two pretrained models into a single rgbpose_conv3d_init.pth.

You can do it on your own or directly download and use the provided rgbpose_conv3d_init.pth.
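
Conceptually, merging means copying the weights of both single-modality models into one state dict under stream-specific key names. The sketch below illustrates the idea with plain dictionaries. It is not the provided notebook: the key prefixes (`backbone.rgb_path.`, `backbone.pose_path.`) and the helper `merge_state_dicts` are hypothetical, and the classification heads are simply skipped here. With real checkpoints you would load and save the dictionaries with `torch.load` and `torch.save`.

```python
def merge_state_dicts(rgb_sd, pose_sd,
                      rgb_prefix='backbone.rgb_path.',
                      pose_prefix='backbone.pose_path.',
                      src_prefix='backbone.'):
    """Merge two single-stream state dicts into one two-stream state dict.

    All prefixes are hypothetical; adapt them to the real parameter names.
    Keys that do not start with `src_prefix` (e.g. heads) are skipped.
    """
    merged = {}
    for state_dict, dst_prefix in ((rgb_sd, rgb_prefix), (pose_sd, pose_prefix)):
        for key, value in state_dict.items():
            if key.startswith(src_prefix):
                merged[dst_prefix + key[len(src_prefix):]] = value
    return merged


# Toy stand-ins for real checkpoints (values would be tensors in practice).
rgb_sd = {'backbone.conv1.weight': 'rgb_w', 'cls_head.fc.weight': 'rgb_fc'}
pose_sd = {'backbone.conv1.weight': 'pose_w', 'cls_head.fc.weight': 'pose_fc'}
merged = merge_state_dicts(rgb_sd, pose_sd)
print(sorted(merged))
```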

Step 3. Finetune RGBPoseConv3D

You can use our provided config files to finetune RGBPoseConv3D, jointly with two modalities (RGB & Pose):

# We finetune RGBPoseConv3D for 20 epochs on NTURGB+D XSub (8 GPUs)
bash tools/ configs/rgbpose_conv3d/ 8 --validate --test-last --test-best
# After finetuning, you can test the model with the following command (8 GPUs)
bash tools/ configs/rgbpose_conv3d/ $CKPT 8 --eval top_k_accuracy --out result.pkl


  1. We use the linear scaling rule for the learning rate (Initial LR ∝ Batch Size). If you change the training batch size, remember to change the initial LR proportionally.

  2. Though optimized, multi-clip testing may consume a large amount of time. For faster inference, you may change the test_pipeline to disable multi-clip testing, which may lead to a small drop in recognition performance. Below is the guide:

    test_pipeline = [
        dict(type='MMUniformSampleFrames', clip_len=dict(RGB=8, Pose=32), num_clips=10), # change `num_clips=10` to `num_clips=1`
        dict(type='MMCompact', hw_ratio=1., allow_imgpad=True),
        dict(type='Resize', scale=(256, 256), keep_ratio=False),
        dict(type='GeneratePoseTarget', sigma=0.7, use_score=True, with_kp=True, with_limb=False, scaling=0.25),
        dict(type='Normalize', **img_norm_cfg),
        dict(type='FormatShape', input_format='NCTHW'),
        dict(type='Collect', keys=['imgs', 'heatmap_imgs', 'label'], meta_keys=[]),
        dict(type='ToTensor', keys=['imgs', 'heatmap_imgs', 'label'])
    ]
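
The two notes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the repository: the helper names `scale_lr` and `set_num_clips` are hypothetical, and the learning rate and batch sizes in the example are made-up numbers rather than the values used in the provided configs.

```python
import copy


def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: the initial LR is proportional to the batch size."""
    return base_lr * new_batch_size / base_batch_size


def set_num_clips(pipeline, num_clips):
    """Return a copy of `pipeline` with `num_clips` overridden where present."""
    new_pipeline = copy.deepcopy(pipeline)
    for step in new_pipeline:
        if 'num_clips' in step:
            step['num_clips'] = num_clips
    return new_pipeline


# Hypothetical numbers: halving the batch size halves the initial LR.
print(scale_lr(0.1, 64, 32))  # 0.05

# Shortened pipeline, just enough to show the override.
pipeline = [
    dict(type='MMUniformSampleFrames', clip_len=dict(RGB=8, Pose=32), num_clips=10),
    dict(type='FormatShape', input_format='NCTHW'),
]
single_clip = set_num_clips(pipeline, 1)
print(single_clip[0]['num_clips'])  # 1
```

Working on a deep copy leaves the original 10-clip pipeline untouched, so both variants can be kept side by side.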


On action recognition with multiple modalities (RGB & Pose), RGBPoseConv3D can achieve better recognition performance than the late fusion baseline.

| Dataset | Fusion | Config | Checkpoint | RGB Stream Top-1 (1-clip / 10-clip) | Pose Stream Top-1 (1-clip / 10-clip) | 2 Stream Top-1 (1:1) (1-clip / 10-clip) |
| --- | --- | --- | --- | --- | --- | --- |
| NTURGB+D XSub | Late Fusion | rgb_config | | 94.4 / 95.1 | 92.9 / 93.2 | 95.8 / 96.1 |
| NTURGB+D XSub | Early Fusion + Late Fusion | config | ckpt | 96.2 / 96.4 | 95.9 / 96.1 | 96.6 / 96.9 |


For both Late Fusion and Early Fusion + Late Fusion, we combine the action scores from the two modalities with a 1:1 ratio to get the final prediction.
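
A minimal sketch of this 1:1 score combination for a single sample is given below. The function name `late_fusion` and the toy scores are made up for illustration; in practice the scores would be the per-class outputs of the RGB and Pose streams.

```python
def late_fusion(rgb_scores, pose_scores, rgb_weight=1.0, pose_weight=1.0):
    """Combine per-class scores of two streams and return (prediction, fused)."""
    if len(rgb_scores) != len(pose_scores):
        raise ValueError('both streams must score the same number of classes')
    fused = [rgb_weight * r + pose_weight * p
             for r, p in zip(rgb_scores, pose_scores)]
    # Index of the highest fused score is the predicted class.
    prediction = max(range(len(fused)), key=fused.__getitem__)
    return prediction, fused


# Toy scores over 3 classes: RGB alone prefers class 0, Pose prefers class 1.
rgb = [0.6, 0.3, 0.1]
pose = [0.2, 0.7, 0.1]
pred, fused = late_fusion(rgb, pose)
print(pred)  # 1
```

With equal weights, the class with the highest summed score wins, so a confident stream can override a less confident one.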