Finding the Effective Batchsize #390

m-parchami · 2023-08-21T18:04:22Z

m-parchami
Aug 21, 2023

Hi,

I was wondering how we should interpret the written batch_size for train_loader in the yml config. I suppose for the effective batch size we should multiply it by number of GPUs right? And where can we find that? Currently, I'm checking if DistributedDataParallel is mentioned in the .yml config, if so, I assume the effective batch size is 3x what's in the config, and if not, then just exactly what's in the config.

Could you please clarify this?
Sorry if I missed it in the documentation.
All the best.

yoshitomo-matsubara · 2023-08-21T19:12:20Z

yoshitomo-matsubara
Aug 21, 2023
Maintainer

Hi @m-parchami

Yes, the effective batch size is the number of GPUs you use in distributed training mode. If you do not run the script in distributed training mode, DistributedDataParallel in the yaml file will be considered as DataParallel.

You can also find the description in official directories.
e.g., https://github.com/yoshitomo-matsubara/torchdistill/tree/main/configs/legacy/official/ilsvrc2012/yoshitomo-matsubara/rrpr2020

2 replies

m-parchami Aug 22, 2023
Author

Thanks for the reply.

But it's still not clear to me why here The batch_size is so low and also learning rate here is set to 0.025. Even if we multiply them by 3, we would get batch_size=96 and lr=0.075, which is different than of e.g. AT with batch_size=256 and lr=1 if I read this correctly.

Thanks for answer questions so patiently.
All the best.

yoshitomo-matsubara Aug 22, 2023
Maintainer

Not all the configs are tuned for distributed training.
For those in official directory, I basically reused hyperparameters specified in the original papers/repos or those the authors provided via email. More details can be found in the paper https://github.com/yoshitomo-matsubara/torchdistill#citation

I believe that AT paper did not use distributed training (from their implementation), so if you find small batch size like 32 and learning rate is relatively low (say 0.025), it may be because the paper uses 8 GPUs for distributed training. i.e. batch_size: 32 * 8 = 256.

Note that AT's initial lr in the config is 0.1, not 1. Also, if you want to multiply lr by number of GPUs for distributed training, you need to add -adjust_lr when executing my example scripts.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finding the Effective Batchsize #390

{{title}}

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Finding the Effective Batchsize #390

m-parchami Aug 21, 2023

Replies: 1 comment · 2 replies

yoshitomo-matsubara Aug 21, 2023 Maintainer

m-parchami Aug 22, 2023 Author

yoshitomo-matsubara Aug 22, 2023 Maintainer

m-parchami
Aug 21, 2023

Replies: 1 comment 2 replies

yoshitomo-matsubara
Aug 21, 2023
Maintainer

m-parchami Aug 22, 2023
Author

yoshitomo-matsubara Aug 22, 2023
Maintainer