- Model: Several transformer-based milestone models are reimplemented from scratch in PyTorch
- Vision Transformer (CV): An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Vanilla Transformer (NLP): Attention Is All You Need
- Experiments: Conduct experiments on CV/NLP benchmarks, respectively
- Image Classification
- Neural Machine Translation
- Pipeline: End-to-end pipeline
- Convenient to play with: data processing and model training/validation are integrated into a one-stop-shop pipeline
- Efficient training: training and evaluation are accelerated via DistributedDataParallel (DDP) and mixed precision (fp16); see the sketch after this list
- Neat to read: a neat file structure that is easy to read yet non-trivial
- ./script → run train/eval
- ./model → model implementation
- ./data → data processing
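A minimal sketch of the DDP setup this pipeline relies on; the `setup_ddp_model` helper and the `LOCAL_RANK` handling are illustrative, not the repo's exact code:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical helper: wrap a model for multi-GPU training the way the
# scripts do. LOCAL_RANK is set by torch.distributed.launch (with
# --use_env); some launchers pass --local_rank as an argument instead.
def setup_ddp_model(model: torch.nn.Module) -> DDP:
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")   # NCCL backend for GPUs
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(), device_ids=[local_rank])
```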
# Conda Env
python 3.6.10
torch 1.4.0+cu100
torchvision 0.5.0+cu100
torchtext 0.5.0
spacy 3.4.1
tqdm 4.63.0
# Apex (for mixed precision training)
## run `gcc --version`
gcc (GCC) 5.4.0
## apex installation
git clone https://github.com/NVIDIA/apex
cd apex
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
rm -rf ./build
python setup.py install --cuda_ext --cpp_ext
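Once Apex is installed, mixed precision training follows the `apex.amp` pattern below; this is a hedged sketch with a toy model, not the repo's actual training loop:

```python
import torch
from apex import amp  # requires the Apex build above

# Toy model/optimizer standing in for the real ones.
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# O1 = patch ops to run in fp16 where safe (Apex's recommended default).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 16).cuda()
target = torch.randint(0, 4, (8,)).cuda()
loss = torch.nn.functional.cross_entropy(model(x), target)

# Backward on the amp-scaled loss to avoid fp16 gradient underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```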
# System Env
## run `nvcc --version`
Cuda compilation tools, release 10.0, V10.0.130
## run `nvidia-smi`
Check your own GPU device status
- multi30k and cifar10 can be downloaded automatically by the pipeline
- imagenet1k (ILSVRC2012) needs a manual download (see "Guide for downloading imagenet1k" below)
- Wait until all three files finish downloading.
- ILSVRC2012_devkit_t12.tar.gz (2.5M)
- ILSVRC2012_img_train.tar (138G)
- ILSVRC2012_img_val.tar (6.3G)
- Run the imagenet1k pipeline; ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar will be automatically extracted and arranged into the directories 'data/ILSVRC2012/train' and 'data/ILSVRC2012/val'.
- Note that extraction takes several hours; you can also do it faster yourself in the shell (see the sketch after the download guide below).
# Guide for downloading imagenet1k
mkdir -p data/ILSVRC2012
cd data/ILSVRC2012
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz --no-check-certificate
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar --no-check-certificate
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar --no-check-certificate
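If you prefer not to wait for the pipeline's extraction, below is a minimal Python sketch of unpacking the train archive, whose members are 1,000 per-class tars; paths mirror the 'data/ILSVRC2012/train' layout above. The val split additionally needs the devkit's label mapping, which is omitted here:

```python
import os
import tarfile

# Extract ILSVRC2012_img_train.tar: each member is itself a tar named
# after a WordNet id (e.g. n01440764.tar) holding that class's images.
root = "data/ILSVRC2012"
train_dir = os.path.join(root, "train")
os.makedirs(train_dir, exist_ok=True)

with tarfile.open(os.path.join(root, "ILSVRC2012_img_train.tar")) as outer:
    for member in outer:
        wnid = os.path.splitext(member.name)[0]      # e.g. "n01440764"
        class_dir = os.path.join(train_dir, wnid)
        os.makedirs(class_dir, exist_ok=True)
        with tarfile.open(fileobj=outer.extractfile(member)) as inner:
            inner.extractall(class_dir)              # images for this class
```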
- Download the pretrained ViT_B_16 model parameters from the official storage.
cd data
curl -o ViT_B_16.npz https://storage.googleapis.com/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_224.npz
curl -o ViT_B_16_384.npz https://storage.googleapis.com/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
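The downloaded checkpoints are flat `.npz` archives with JAX/Flax-style parameter names; mapping them onto this repo's PyTorch modules is repo-specific, so the sketch below only shows how to inspect them:

```python
import numpy as np

# Peek at the pretrained ViT_B_16 checkpoint downloaded above.
weights = np.load("data/ViT_B_16.npz")
for name in sorted(weights.files)[:5]:
    print(name, weights[name].shape)   # parameter name and array shape
```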
- Before running experiments
- Set the CUDA env in script/run_image_cls_task.py/__main__ according to your GPU devices (a hedged example follows this list)
- Adjust the train/eval settings in script/run_image_cls_task.py/get_args() and launch the experiment
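A hedged example of the "CUDA env" adjustment in `__main__` (the exact lines in the script may differ):

```python
import os

# Expose only the GPUs you want the experiment to use, e.g. four P40s.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
```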
cd script
# run experiments on cifar10
# (4 mins/epoch, 3.5 hours in total | GPU device: P40×4)
python ./run_image_cls_task.py cifar10
# run experiments on imagenet1k
# (less than 5 hours/epoch, more than 10 hours in total | GPU device: P40×4)
python ./run_image_cls_task.py ILSVRC2012
# Tips:
# 1. Both DDP and fp16 mixed precision training are adopted for acceleration
# 2. The actual speedup depends on your specific GPU devices
- Before running experiments
- Set the CUDA env in script/run_nmt_task.py/__main__ according to your GPU devices
- Adjust the train/eval settings in script/run_nmt_task.py/get_args() (sketched below) and launch the experiment
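For reference, `get_args()` is typically an argparse entry point along these lines; the flag names here are hypothetical, so check the real file for the actual options before launching:

```python
import argparse

# Hypothetical shape of script/run_nmt_task.py's get_args(); the repo's
# real flags may differ.
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=512)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--beam_size", type=int, default=4)
    return parser.parse_args()
```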
# run experiments on multi30k
# (small dataset, 3 mins in total | GPU device: P40×4 | you can also fork
# and adjust the pipeline to run this experiment on a smaller GPU)
cd script
python ./run_nmt_task.py multi30k
# Tips:
# 1. DDP is adopted for acceleration
# 2. For inference, both "greedy search" and "beam search" are included in the nmt task pipeline (a greedy-decoding sketch follows below)
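As mentioned in the tips, inference supports greedy and beam search. Below is a minimal greedy-decoding sketch; the `encode`/`decode` methods and the token ids are assumptions about the model interface, not the pipeline's exact API:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len=64):
    # Assumed interface: model.encode(src) -> memory, and
    # model.decode(tgt, memory) -> (1, T, vocab) logits.
    memory = model.encode(src)
    ys = torch.full((1, 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_len - 1):
        logits = model.decode(ys, memory)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)     # append the greedy token
        if next_tok.item() == eos_id:             # stop at end-of-sentence
            break
    return ys.squeeze(0)
```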
- This repo
- Imagenet1k: ACC 84.9% (result on the 50,000-image val set | resolution 384 | extra label smoothing with confidence 0.9 | batch size 160, nearly 15,000 training steps)
- Cifar10: ACC 99.04% (resolution 224 | batch size 640, nearly 5,500 training steps)
- Comparison to the official results of the ViT implementation by Google
- This repo
- Multi30k: BLEU 38.6 (en→de | nearly 17M params | batch size 512, nearly 1,200 training steps)
- Comparison to results in Dynamic Context-guided Capsule Network for Multimodal Machine Translation
- Transformer Survey
- Vanilla Transformer Component Structures (a self-attention sketch follows at the end of this section)
- self-attention
- multi-head attention
- feed-forward network
- residual connection & layer norm
- label smoothing
- Recent Transformer Milestone Work in CV
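To make the first component above concrete, here is a minimal scaled dot-product self-attention sketch (shapes `(batch, seq, d_model)` assumed; illustrative, not the repo's exact implementation):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # scores: (B, T, T) similarity of every query with every key,
    # scaled by sqrt(d_k) as in "Attention Is All You Need".
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = scores.softmax(dim=-1)    # attention weights over keys
    return attn @ v                  # weighted sum of values
```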