Sample code for few-shot action recognition on UCF101.
The sampler in UCF101.py optionally supports AutoAugment [1] to generate padding frames when a video has too few frames.
- torch>=1.6.0
- torchvision>=0.7.0
- tensorboard>=2.3.0
Download and extract frames from the UCF101 videos (see UCF101 Frame Extractor).
Split the dataset for few-shot learning (if you already have the csv files, you can skip this step):
python splitter.py --frames-path /path/to/frames --labels-path /path/to/labels --save-path /path/to/save
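The split step above writes train/test csv files that the sampler later reads. A minimal sketch of the idea, assuming a class-level split into disjoint train and test sets (the function names, file layout, and split ratio here are illustrative, not the repository's actual splitter.py logic):

```python
import csv
import random

def split_classes(class_names, num_train, seed=0):
    """Randomly split class names into disjoint train/test class sets."""
    rng = random.Random(seed)
    shuffled = class_names[:]
    rng.shuffle(shuffled)
    return shuffled[:num_train], shuffled[num_train:]

def write_split_csv(path, class_names):
    """Write one class name per row, to be read back by the sampler."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for name in class_names:
            writer.writerow([name])

# toy example: in the real dataset this would be all 101 UCF101 classes
classes = ["ApplyEyeMakeup", "Basketball", "Diving", "Fencing", "JumpRope"]
train, test = split_classes(classes, num_train=3)
```

Because the train and test *classes* are disjoint, the model is evaluated on categories it never saw during training, which is what makes the setting few-shot.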
train(resnet18)
python train.py --frames-path /path/to/frames --save-path /path/to/save --tensorboard-path /path/to/tensorboard --model resnet --uniform-frame-sample --learning-rate 5e-4 --frame-size 168 --way 5 --shot 1 --query 5
train(r2plus1d18)
python train.py --frames-path /path/to/frames --save-path /path/to/save --tensorboard-path /path/to/tensorboard --model r2plus1d --uniform-frame-sample --metric cosine --way 5 --shot 1 --query 5
test(resnet18)
python test.py --frames-path /path/to/frames --load-path /path/to/load --use-best --model resnet --frame-size 168 --way 5 --shot 1 --query 5
test(r2plus1d18)
python test.py --frames-path /path/to/frames --load-path /path/to/load --use-best --model r2plus1d --metric cosine --way 5 --shot 1 --query 5
device information: GPU: RTX 2080 Ti (11GB)
data settings: train classes: 71 (9473 videos), test (val) classes: 30 (3847 videos)
option settings
frame size: 112 (r2plus1d), 168 (resnet)
num epochs: 30
train iter size: 100
val iter size: 200
metric: cosine
random pad sample: False
pad option: default
uniform frame sample: True
random start position: False
max interval: 7
random interval: False
sequence length: 35
num_layers:1 (resnet)
hidden_size: 512 (resnet)
learning rate: 1e-4 (r2plus1d), 5e-4 (resnet)
scheduler step: 10
scheduler gamma: 0.9
way: 5
shot: 1
query: 5
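The scheduler settings above (step 10, gamma 0.9) describe a StepLR-style decay: the learning rate is multiplied by gamma every `scheduler step` epochs. A minimal sketch of the resulting schedule for the resnet settings (pure Python here for clarity; the repository would use `torch.optim.lr_scheduler.StepLR`):

```python
def stepped_lr(base_lr, epoch, step_size=10, gamma=0.9):
    """StepLR-style decay: multiply the base lr by gamma every step_size epochs."""
    return base_lr * (gamma ** (epoch // step_size))

# resnet settings from the table above: base lr 5e-4, step 10, gamma 0.9
for epoch in (0, 10, 20):
    print(epoch, stepped_lr(5e-4, epoch))
```

With 30 epochs the learning rate therefore decays twice, ending at 0.81x its initial value.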
required video memory: resnet: about 7538 MB, r2plus1d: about 10042 MB
All accuracy results are averaged over 6000 test episodes and reported with 95% confidence intervals.
| model | Accuracy |
| --- | --- |
| resnet18 | 70.08 ±0.32 |
| r2plus1d18 | 94.29 ±0.67 |
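The "mean ± interval" numbers above can be computed from the per-episode accuracies. A sketch using the standard normal approximation (1.96 standard errors); whether the repository's test.py uses exactly this formula is an assumption:

```python
import math

def mean_ci95(values):
    """Mean and 95% confidence half-width (normal approximation, sample std)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, 1.96 * math.sqrt(var / n)

# per-episode accuracies (in %) for the 6000 test episodes would go here
episode_accs = [70.0, 80.0, 60.0, 75.0, 65.0]
mean, half = mean_ci95(episode_accs)
print(f"{mean:.2f} ±{half:.2f}")
```

Averaging over 6000 episodes keeps the interval narrow even though individual 5-way 1-shot episodes are noisy.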
- model: selects the normalization values specific to each model
- frames_path: path to the extracted frames
- labels_path: path to the labels
- frame_size: frame size (width and height must be the same)
- sequence_length: number of frames to sample per video
- setname: sampling mode; if set to 'train', the sampler reads 'train.csv' to load the training dataset [default: 'train', other: 'test']
- random_pad_sample: when a video has too few frames, randomly re-sample from the existing frames to create the padding; if False, only the first frame is repeated [default: True, other: False]
- pad_option: if set to 'autoaugment', padding frames are augmented with AutoAugment policies [default: 'default', other: 'autoaugment']
- uniform_frame_sample: sample frames at a fixed interval; if False, frames are sampled without regard to the interval [default: True, other: False]
- random_start_position: choose the starting position randomly, taking the interval into account; if False, the starting position is always 0 [default: True, other: False]
- max_interval: maximum frame interval; the higher this value, the more likely parts of the video sequence are missed [default: 7]
- random_interval: choose the interval randomly; if False, the maximum interval is always used [default: True, other: False]
- labels: accepts only the classes from the csv files, so this value must be UCF101.classes
- iter_size: number of episodes per epoch (total episodes = epochs * iter_size)
- way: number of ways (classes per episode)
- shot: number of shots (support examples per class)
- query: number of query examples per class
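How the frame-sampling options fit together can be sketched as follows. This is a simplified illustration of the behavior described above (fixed interval capped by max_interval, optional random start and interval, padding by re-sampling); the actual UCF101.py implementation may differ in details:

```python
import random

def sample_frame_indices(num_frames, sequence_length=35, max_interval=7,
                         random_start_position=True, random_interval=True,
                         random_pad_sample=True, rng=random):
    """Pick sequence_length frame indices from a video with num_frames frames."""
    if num_frames >= sequence_length:
        # largest uniform interval that still fits, capped by max_interval
        interval = min(max_interval, num_frames // sequence_length)
        if random_interval and interval > 1:
            interval = rng.randint(1, interval)  # else: always use the maximum
        span = (sequence_length - 1) * interval + 1
        start = rng.randint(0, num_frames - span) if random_start_position else 0
        return [start + i * interval for i in range(sequence_length)]
    # too few frames: pad by re-sampling existing frames
    indices = list(range(num_frames))
    need = sequence_length - num_frames
    if random_pad_sample:
        pads = [rng.randrange(num_frames) for _ in range(need)]
    else:
        pads = [0] * need  # repeat only the first frame
    return sorted(indices + pads)
```

A larger max_interval spreads the sampled frames over more of the video, at the cost of skipping more frames in between.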
*way, shot, query => we follow the episodic training strategy [2]
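Under the episodic strategy [2], each episode samples `way` classes and, for each, `shot` support and `query` query examples. A self-contained sketch of that sampling step (the data layout is a toy stand-in, not the repository's dataset class):

```python
import random

def sample_episode(labels_to_items, way=5, shot=1, query=5, rng=random):
    """Sample one few-shot episode: way classes, shot support + query queries per class."""
    classes = rng.sample(sorted(labels_to_items), way)
    support, queries = [], []
    for cls in classes:
        items = rng.sample(labels_to_items[cls], shot + query)
        support += [(cls, it) for it in items[:shot]]   # labeled examples
        queries += [(cls, it) for it in items[shot:]]   # to be classified
    return support, queries

# toy data: 8 classes with 10 videos each
data = {f"class{c}": [f"video{c}_{i}" for i in range(10)] for c in range(8)}
sup, qry = sample_episode(data)
```

With the settings above (5-way 1-shot, 5 queries), each episode has 5 support videos and 25 query videos.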
[1] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le, "AutoAugment: Learning Augmentation Strategies From Data", Computer Vision and Pattern Recognition (CVPR), 2019, pp. 113-123.
[2] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, "Matching Networks for One Shot Learning", Neural Information Processing Systems (NIPS), 2016, pp. 3630-3638.