Official implementation of the paper "SNP: Structured Neuron-level Pruning to Preserve Attention Scores", accepted at the European Conference on Computer Vision (ECCV) 2024.
Structured Neuron-level Pruning (SNP) prunes neurons with less informative attention scores and eliminates redundancy among heads, effectively accelerating Transformer-based models on both edge devices and server processors. With head pruning, SNP compresses DeiT-Base by 80% in both parameters and computational cost, achieving a 4.93× speedup on Jetson Nano and 3.85× on an RTX 3090.
SNP prunes graphically connected query and key layers that have the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. For more details, please refer to the main paper.
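To make the query/key coupling concrete, here is a toy sketch (illustrative only; the row-norm importance score below is a stand-in assumption, not SNP's attention-score criterion). Because the attention logits are Q·Kᵀ, a query neuron and its matching key neuron must be pruned as a pair:

```python
import torch

def prune_qk_pair(w_q: torch.Tensor, w_k: torch.Tensor, keep: int):
    # Attention logits are Q @ K^T, so query/key neurons are removed in
    # pairs: dropping row i of w_q only makes sense if row i of w_k goes too.
    importance = w_q.norm(dim=1) * w_k.norm(dim=1)     # one score per neuron (placeholder criterion)
    idx = importance.topk(keep).indices.sort().values  # keep the most informative rows
    return w_q[idx], w_k[idx]

w_q = torch.randn(64, 192)  # per-head query projection (d_head x d_model)
w_k = torch.randn(64, 192)  # per-head key projection
w_q_small, w_k_small = prune_qk_pair(w_q, w_k, keep=32)
print(w_q_small.shape, w_k_small.shape)  # torch.Size([32, 192]) twice
```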
Inference speed and Top-1 accuracy of the compressed models across different devices. Latency is benchmarked with 200 warmup runs and averaged over 1,000 timed runs; all latency values in the table are in milliseconds. The batch size is 1, except on the RTX 3090, where a batch size of 64 is used.
Model | Top-1 (%) | GFLOPs | Raspberry Pi 4B (.onnx) | Jetson Nano (.trt) | Xeon Silver 4210R (.pt) | RTX 3090 (.pt) |
---|---|---|---|---|---|---|
DeiT-Tiny | 72.2 | 1.3 | 139.1 | 41.0 | 34.7 | 18.7 |
+ SNP (Ours) | 70.2 | 0.6 | 81.6 (1.70×) | 26.7 (1.54×) | 25.3 (1.38×) | 17.8 (1.05×) |
DeiT-Small | 79.8 | 4.6 | 401.3 | 99.3 | 53.4 | 46.1 |
+ SNP (Ours) | 78.5 | 2.0 | 199.2 (2.01×) | 45.5 (2.18×) | 38.6 (1.38×) | 32.9 (1.40×) |
+ SNP (Ours) | 73.3 | 1.3 | 136.7 (2.94×) | 32.0 (3.10×) | 33.5 (1.60×) | 27.0 (1.71×) |
DeiT-Base | 81.8 | 17.6 | 1377.7 | 293.3 | 122.0 | 151.4 |
+ SNP (Ours) | 79.6 | 6.4 | 565.7 (2.44×) | 132.6 (2.21×) | 64.7 (1.89×) | 73.0 (2.07×) |
+ SNP (Ours) + Head | 79.1 | 3.5 | 307.0 (4.48×) | 59.5 (4.93×) | 46.1 (2.65×) | 39.3 (3.85×) |
EfficientFormer-L1 | 79.2 | 1.3 | 169.1 | 31.0 | 43.8 | 26.2 |
+ SNP (Ours) | 75.5 | 0.6 | 95.1 (1.78×) | 19.8 (1.56×) | 38.3 (1.14×) | 17.2 (1.52×) |
+ SNP (Ours) | 74.5 | 0.5 | 82.6 (2.05×) | 17.8 (1.74×) | 35.2 (1.24×) | 16.0 (1.64×) |
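For reference, a minimal sketch of the latency protocol above in PyTorch (the checkpoint path is illustrative, and loading the `.pt` file as a full model object is an assumption; on GPU, add `torch.cuda.synchronize()` around the timed region):

```python
import time
import torch

model = torch.load("./reported_models/compressed_models/DeiT-T.pt").eval()  # illustrative path
x = torch.randn(1, 3, 224, 224)  # batch size 1; use 64 on the RTX 3090

with torch.no_grad():
    for _ in range(200):   # warmup runs
        model(x)
    start = time.perf_counter()
    for _ in range(1000):  # timed runs
        model(x)
    elapsed = time.perf_counter() - start

print(f"mean latency: {elapsed / 1000 * 1000:.1f} ms")  # seconds per run -> ms
```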
```bash
conda create -n snp python=3.8
conda activate snp
git clone https://github.com/Nota-NetsPresso/SNP.git
cd SNP
pip install -r requirements.txt
```
Sign Up for NetsPresso
To compress the DeiT model using SNP, you need a NetsPresso account. You can create one on the NetsPresso Sign Up page.
The following steps compress the DeiT-T model using SNP and train it for 20 epochs:
- Run the main script:

  ```bash
  bash main.sh
  ```
- When prompted, enter your NetsPresso account information:

  ```
  Please enter your NetsPresso Email:
  Please enter your NetsPresso Password:
  ```
- Enter the path to your ImageNet-1K dataset:

  ```
  Please enter the path to your ImageNet dataset:
  ```
Compressed DeiT-T (0.6 GFLOPs, 70.29% Top-1 accuracy):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 --master_addr="127.0.0.1" --master_port=12345 \
    train.py --model "./reported_models/compressed_models/DeiT-T.pt" \
    --lr 0.00025 \
    --batch-size 256 \
    --epochs 300 \
    --output_dir ${OUTPUT_DIR} \
    --data-path ${IMAGENET_PATH} \
    > ./txt_logs/training_deit_t.txt 2>&1 &
```
Compressed DeiT-S (2.0 GFLOPs, 78.52% Top-1 accuracy):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 --master_addr="127.0.0.1" --master_port=12345 \
    train.py --model "./reported_models/compressed_models/DeiT-S_2GFLOPs.pt" \
    --lr 0.00025 \
    --batch-size 256 \
    --epochs 300 \
    --output_dir ${OUTPUT_DIR} \
    --data-path ${IMAGENET_PATH} \
    > ./txt_logs/training_deit_s_2GFLOPs.txt 2>&1 &
```
Compressed DeiT-S (1.3 GFLOPs, 73.32% Top-1 accuracy):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 --master_addr="127.0.0.1" --master_port=12345 \
    train.py --model "./reported_models/compressed_models/DeiT-S_1_3GFLOPs.pt" \
    --lr 0.00025 \
    --batch-size 256 \
    --epochs 300 \
    --output_dir ${OUTPUT_DIR} \
    --data-path ${IMAGENET_PATH} \
    > ./txt_logs/training_deit_s_1_3GFLOPs.txt 2>&1 &
```
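To double-check the GFLOPs quoted above for a compressed checkpoint, a quick count with fvcore can be used (a sketch under two assumptions: fvcore is installed separately, and the `.pt` file loads as a full model object; fvcore's counts are approximate):

```python
import torch
from fvcore.nn import FlopCountAnalysis  # extra dependency: pip install fvcore

model = torch.load("./reported_models/compressed_models/DeiT-T.pt").eval()
flops = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224))
print(f"{flops.total() / 1e9:.2f} GFLOPs")  # expect roughly 0.6 for compressed DeiT-T
```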
- To compress the DeiT model, use the following command:

  ```bash
  python compress.py --NetsPresso-Email ${USER_NAME} \
      --NetsPresso-Pwd ${USER_PWD} \
      --model deit_tiny_patch16_224 \
      --data-path ${IMAGENET_PATH} \
      --output_dir ${OUTPUT_DIR} \
      --num-imgs-snp-calculation 64
  ```
- To train the compressed model (saved in the `compressed` directory within `output_dir`), use the following command:

  ```bash
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  python -m torch.distributed.launch --nproc_per_node 8 --master_addr="127.0.0.1" --master_port=12345 \
      train.py --model "${OUTPUT_DIR}/compressed/compressed.pt" \
      --batch-size 256 \
      --epochs 300 \
      --output_dir ${OUTPUT_DIR} \
      --data-path ${IMAGENET_PATH} \
      > ./txt_logs/training_test.txt 2>&1 &
  ```
YOLO Fastest | YOLOX | YOLOv8 | YOLOv7 | YOLOv5 | PIDNet | PyTorch-CIFAR-Models
```python
import torch

from netspresso import NetsPresso
from netspresso.enums import CompressionMethod, GroupPolicy, LayerNorm, Policy
from netspresso.clients.compressor.v2.schemas import Options

# Step 0: Log in to NetsPresso
netspresso = NetsPresso(email=args.NetsPresso_Email, password=args.NetsPresso_Pwd)

# Step 1: Declare the compressor
compressor = netspresso.compressor_v2()

# Step 2: Upload the model
# Provide the path to your model and specify the input shape
model = compressor.upload_model(
    input_model_path=${MODEL_PATH},
    input_shapes=[{"batch": 1, "channel": 3, "dimension": [224, 224]}],
)

# Step 3: Select the compression method
# Specify the compression method and options
compression_info = compressor.select_compression_method(
    model_id=model.ai_model_id,
    compression_method=CompressionMethod.PR_SNP,
    options=Options(
        policy=Policy.AVERAGE,
        layer_norm=LayerNorm.TSS_NORM,
        group_policy=GroupPolicy.NONE,
        reshape_channel_axis=-1,
    ),
)

# Step 4: Load the compression ratio for each layer
# Assign the compression ratio for each available layer
for available_layer in compression_info.available_layers:
    available_layer.values = [${COMPRESS_RATIO}[available_layer.name]]

# Step 5: Compress the model
# Perform the compression and save the compressed model
compressed_model_info = compressor.compress_model(
    compression=compression_info,
    output_dir=${SAVE_DIR},
)

# Load the compressed model
compressed_model = torch.load(compressed_model_info.compressed_model_path)
```
After compressing the model, train the compressed model to compensate for the performance loss (see the training commands above).
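As a quick sanity check before fine-tuning, you can compare parameter counts against the uncompressed baseline. The sketch below continues from the snippet above and assumes `timm` is available to build the baseline model:

```python
import timm
import torch

def num_params(model):
    return sum(p.numel() for p in model.parameters())

baseline = timm.create_model("deit_tiny_patch16_224", pretrained=False)  # assumed baseline
compressed = torch.load(compressed_model_info.compressed_model_path)
print(f"params: {num_params(baseline) / 1e6:.1f}M -> {num_params(compressed) / 1e6:.1f}M")
```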
- All rights related to this repository and the compressed models are reserved by Nota Inc.
- The intended use is strictly limited to research and non-commercial projects.
```bibtex
@article{shim2024snp,
  title={SNP: Structured Neuron-level Pruning to Preserve Attention Scores},
  author={Shim, Kyunghwan and Yun, Jaewoong and Choi, Shinkook},
  journal={arXiv preprint arXiv:2404.11630},
  year={2024},
  url={https://arxiv.org/abs/2404.11630}
}
```