Weight decay cosine schedule #1243

Answered by rwightman
detkov asked this question in Ideas
@detkov I'm not sure if the DINO authors are aware, but AdamW in PyTorch is not fully decoupled as in the paper: the weight decay term is multiplied by the current learning rate, so the effective WD already decays along with the LR schedule (https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L248).
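To make the coupling concrete, here is a minimal sketch in plain Python (no PyTorch needed). It assumes the decay step has the form `param.mul_(1 - lr * weight_decay)`, which is what the linked line does, and pairs it with a simple cosine LR schedule. The helper names (`cosine_lr`, `adamw_decay_factor`) and the hyperparameter values are hypothetical, chosen only for illustration.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate without warmup (illustrative helper)."""
    progress = step / max(1, total_steps - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def adamw_decay_factor(lr, weight_decay):
    """Multiplicative shrink applied to a weight before the Adam update,
    mirroring the `param.mul_(1 - lr * weight_decay)` step in torch.optim.AdamW."""
    return 1.0 - lr * weight_decay


# Hypothetical hyperparameters, not taken from DINO or timm.
base_lr, weight_decay, total_steps = 1e-3, 0.05, 5

# The effective per-step decay strength is lr_t * weight_decay, so it follows
# the LR schedule even though weight_decay itself is held constant.
effective_wd = [cosine_lr(s, total_steps, base_lr) * weight_decay
                for s in range(total_steps)]

for s, wd in enumerate(effective_wd):
    factor = adamw_decay_factor(cosine_lr(s, total_steps, base_lr), weight_decay)
    print(f"step {s}: lr * wd = {wd:.2e}, shrink factor = {factor:.8f}")
```

With a constant `weight_decay`, the product `lr * weight_decay` shrinks toward zero as the cosine schedule anneals the LR, which is the sense in which WD "decays with the LR schedule". A separate WD schedule would change `weight_decay` itself on top of this.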

In general, though, if anyone can show that a weight decay schedule is beneficial for pretraining schemes other than DINO, I'm open to adding...

Replies: 1 comment

Answer selected by detkov
2 participants