This is a repository for Agent-Attention models based on the PyTorch framework. It can be used to train the models on your own image datasets. The code is mainly adapted from the official source code.
├── datasets: Load datasets
├── my_dataset.py: Custom dataset class and transform-based data augmentation methods (a sketch is given after this list)
├── split_data.py: Function that reads the image dataset and splits it into a training set and a test set
├── threeaugment.py: Additional data augmentation methods
├── models: Agent-Attention models
├── agent_cswin.py: Construct "Agent-cswin" models
├── agent_deit.py: Construct "Agent-deit" models
├── agent_pvt.py: Construct "Agent-pvt" models
├── agent_swin.py: Construct "Agent-swin" models
├── deit.py: Construct "deit" models
├── util:
├── engine.py: Training and validation loop code
├── losses.py: Knowledge distillation loss, combined with a teacher model (if any)
├── optimizer.py: Define Sophia optimizer
├── samplers.py: Define the sampler used by the DataLoader
├── utils.py: Record various metrics, handle logging/output, and set up the distributed environment
├── estimate_model.py: Visualize evaluation metrics such as the ROC curve, confusion matrix and classification report
└── train_gpu.py: Entry point for training the model
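The exact class and function names inside my_dataset.py are not shown here, so the following is only a minimal sketch of what a custom image dataset with transform-based augmentation typically looks like in this kind of pipeline; adjust names and arguments to the actual file:

# Minimal sketch of a custom image-classification dataset (illustrative only;
# the real my_dataset.py may use different names and arguments).
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class MyDataSet(Dataset):
    def __init__(self, images_path, images_class, transform=None):
        self.images_path = images_path      # list of image file paths
        self.images_class = images_class    # list of integer class labels
        self.transform = transform

    def __len__(self):
        return len(self.images_path)

    def __getitem__(self, idx):
        img = Image.open(self.images_path[idx]).convert('RGB')
        label = self.images_class[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# A typical training-time augmentation pipeline
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])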
Before using this code to train your own dataset, open train_gpu.py and set the data_root, batch_size, num_workers and nb_classes parameters. If you want to draw the confusion matrix and ROC curve, you only need to set the predict parameter to True.
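The argument names above come straight from the description; the exact defaults and help strings in train_gpu.py may differ, but the parameters you need to touch look roughly like this:

# Illustrative sketch only; check train_gpu.py for the real defaults.
import argparse

parser = argparse.ArgumentParser('Agent-Attention training script')
parser.add_argument('--data_root', type=str, default='/path/to/your/dataset')  # root folder of your images
parser.add_argument('--batch_size', type=int, default=4)                       # per-GPU batch size
parser.add_argument('--num_workers', type=int, default=8)                      # DataLoader worker processes
parser.add_argument('--nb_classes', type=int, default=5)                       # number of classes in your dataset
parser.add_argument('--predict', default=True)                                 # draw ROC curve and confusion matrix
args = parser.parse_args()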
You can also use the Sophia optimizer: just change the optimizer in train_gpu.py. For this training example it can achieve better results:
from util.optimizer import SophiaG  # assuming SophiaG is exposed by util/optimizer.py described above
# optimizer = create_optimizer(args, model_without_ddp)
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99), rho=0.01, weight_decay=args.weight_decay)
1. nproc_per_node: <number of GPUs to use on each node (machine/server)>
2. CUDA_VISIBLE_DEVICES: <indices of the GPUs to use on a single node (machine/server), starting from 0>
3. nnodes: <number of nodes (machines/servers)>
4. node_rank: <rank of the current node (machine/server)>
5. master_addr: <IP address of the master node (machine/server)>
6. master_port: <port number of the master node (machine/server)>
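Putting these together, launching the first of two nodes that each use 4 GPUs would look like this (the address and port are placeholders):
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=<Master node IP address> --master_port=<Master node port number> train_gpu.py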
Whether you use a single machine with multiple GPUs or multiple machines with multiple GPUs, the batch_size set in train_gpu.py is the batch size used on each GPU. For example, with batch_size=4 in my train_gpu.py and 2 GPUs, each GPU trains with a batch of 4. Do not let batch_size=1 on any GPU, otherwise the BN layers may report an error. If you receive an error like "unrecognized arguments: --local-rank=1" when using distributed multi-GPU training, just replace "torch.distributed.launch" with "torch.distributed.run" in the command (an example is given after the single-machine multi-GPU command below).
python train_gpu.py
python -m torch.distributed.launch --nproc_per_node=8 train_gpu.py
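If you hit the "--local-rank" error mentioned above, the same launch via torch.distributed.run is:
python -m torch.distributed.run --nproc_per_node=8 train_gpu.py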
(Using only some of the GPUs: for example, to use the second and fourth GPUs)
CUDA_VISIBLE_DEVICES=1,3 python -m torch.distributed.launch --nproc_per_node=2 train_gpu.py
(Set --nproc_per_node to the number of GPUs on each machine. To use specific GPUs, prepend CUDA_VISIBLE_DEVICES= with the desired GPU indices to each command, exactly as in single-machine multi-GPU training.)
On the first machine: python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=<Master node IP address> --master_port=<Master node port number> train_gpu.py
On the second machine: python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=<Master node IP address> --master_port=<Master node port number> train_gpu.py
@article{han2023agent,
title={Agent Attention: On the Integration of Softmax and Linear Attention},
author={Han, Dongchen and Ye, Tianzhu and Han, Yizeng and Xia, Zhuofan and Song, Shiji and Huang, Gao},
journal={arXiv preprint arXiv:2312.08874},
year={2023}
}