Hi! I saw that the DINO paper uses a cosine schedule for weight decay (paper: p. 5, "Implementation details"; code), and it seems useful. Are there any plans to provide such functionality in ...
@detkov I'm not sure if the DINO authors are aware, but AdamW in PyTorch is not fully decoupled as in the paper: the weight decay step is multiplied by the learning rate, so the applied WD already decays with the LR schedule (https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L248).
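To make the coupling concrete, here is a rough sketch in plain Python. It assumes the decay step has the form `p <- p * (1 - lr * weight_decay)`, as at the linked line; `cosine_lr` and `effective_decay` are hypothetical helpers written for this illustration, not PyTorch functions.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate (hypothetical helper, no warmup)."""
    progress = step / max(1, total_steps - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def effective_decay(lr, weight_decay):
    """Fraction of each weight removed per step, assuming the decay step is
    p <- p * (1 - lr * weight_decay)."""
    return lr * weight_decay

total_steps = 5
for step in range(total_steps):
    lr = cosine_lr(step, total_steps, base_lr=1e-3)
    # weight_decay stays constant, yet the applied shrinkage follows the LR.
    print(step, lr, effective_decay(lr, weight_decay=0.05))
```

With a constant `weight_decay`, the applied shrinkage starts at `base_lr * weight_decay` and falls toward zero as the LR is annealed.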
In general though, if anyone can show that a WD schedule is beneficial for other pretraining schemes besides DINO, I'm open to adding...
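For anyone who wants to experiment in the meantime, a minimal sketch of a cosine weight decay schedule follows. `cosine_schedule` and `set_weight_decay` are hypothetical helpers, and the 0.04 to 0.4 range is what I believe the DINO paper reports, so treat it as an assumption. The list of dicts stands in for `optimizer.param_groups`, which PyTorch optimizers expose with a `"weight_decay"` key.

```python
import math

def cosine_schedule(start, end, total_steps):
    """Values moving from `start` to `end` along a half cosine."""
    values = []
    for step in range(total_steps):
        progress = step / max(1, total_steps - 1)
        values.append(end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress)))
    return values

def set_weight_decay(param_groups, value):
    """Overwrite weight_decay on every group (hypothetical helper).
    A real setup may want to skip bias and norm groups."""
    for group in param_groups:
        group["weight_decay"] = value

# Stand-in for optimizer.param_groups.
param_groups = [{"lr": 1e-3, "weight_decay": 0.04}]
wd_schedule = cosine_schedule(0.04, 0.4, total_steps=5)

for step, wd in enumerate(wd_schedule):
    set_weight_decay(param_groups, wd)
    # optimizer.step() would run here in a real training loop.
    print(step, param_groups[0]["weight_decay"])
```

Since the applied decay is still multiplied by the LR, the schedule above changes the coefficient only; the effective shrinkage per step is the product of both schedules.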