
L2 Decay for Optimizers #269

Open: wants to merge 2 commits into master

Conversation

cangumeli
Collaborator

I'm adding L2 weight decay to the Knet optimizers. To discuss the interface, I started with Adam.
L2 decay is used in the model I'm currently replicating, so I'll be using it.

@denizyuret
Owner

denizyuret commented Feb 13, 2018 via email

@cangumeli
Collaborator Author

L2 decay is part of the optimizers in many frameworks such as MXNet and PyTorch (they simply call it weight decay).

Adding an L2 penalty to the objective function amounts to incrementing each gradient by decay_rate * w. Doing this addition inside the optimizers saves us the overhead of the squaring and reduction needed to evaluate the penalty term itself.
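
As a minimal sketch of that point (plain Julia; the function and argument names are illustrative, not Knet's actual optimizer code), the two routes contribute the same decay_rate * w term, but the second avoids the extra square-and-reduce pass over every parameter array:

```julia
# Hypothetical sketch, not Knet's implementation.
#
# (a) Put the penalty in the objective:
#         loss(w, x, y) + (decay/2) * sum(abs2, w)
#     Differentiation then yields grad(loss) + decay*w, but evaluating the
#     penalty costs an elementwise square plus a reduction per parameter.
#
# (b) Leave the loss alone and add decay*w inside the optimizer step:
function sgd_step!(w, g; lr=0.1, decay=1e-4)
    g = g .+ decay .* w      # same decay*w term, no squaring or reduction
    w .-= lr .* g            # plain SGD update, just for illustration
    return w
end
```

For plain SGD the two are equivalent; whether they stay equivalent under Adam's adaptive scaling is the question raised later in this thread.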

@denizyuret
Owner

denizyuret commented Feb 14, 2018 via email

@Evizero

Evizero commented Feb 14, 2018

One side note worth mentioning: this looks like it applies weight decay to all learned parameters equally, including the bias terms.

@denizyuret
Owner

denizyuret commented Feb 14, 2018 via email

@cangumeli
Collaborator Author

Technically, there is a separate optimizer object for each parameter, so the user will be able to set the weight decay per parameter. On the other hand, when optimizers is called with a decay option, we will be decaying all parameters, including biases. There are studies that decay all parameters, but making this the default behaviour might be misleading.

Also, we will be reporting losses with the weight decay penalty excluded; I'm not sure whether this is a feature or a bug.

I think we can consider weight decay in the optimizers a performance trick for people who know what they are doing.
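
For illustration, a per-parameter setup along these lines might look like the following sketch (the struct and function names are invented for this example and are not Knet's API):

```julia
# Sketch only: each parameter carries its own optimizer object, so the
# decay coefficient can differ per parameter (e.g. zero for biases).
struct DecayedSGD
    lr::Float64
    decay::Float64                  # per-parameter weight-decay coefficient
end

decay_update!(w, g, o::DecayedSGD) = (w .-= o.lr .* (g .+ o.decay .* w))

W, b = randn(10, 5), zeros(10)
gW, gb = randn(10, 5), randn(10)    # stand-ins for computed gradients
decay_update!(W, gW, DecayedSGD(0.1, 1e-4))   # decay the weight matrix
decay_update!(b, gb, DecayedSGD(0.1, 0.0))    # leave the bias undecayed
```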

@Evizero

Evizero commented Jul 14, 2018

Very relevant paper on this topic: https://arxiv.org/abs/1711.05101

In particular, it argues that L2 regularization and weight decay are not identical for Adam, and furthermore that L2 regularization is not effective for Adam.
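
For reference, the distinction the paper draws can be written as a short plain-Julia sketch (names are illustrative, not Knet's code): the L2 route feeds lambda * w through Adam's adaptive scaling, while decoupled weight decay shrinks the weights directly.

```julia
# Adam step with either L2-in-the-gradient or decoupled weight decay
# (AdamW, arXiv:1711.05101). Hypothetical sketch, not Knet's implementation.
mutable struct AdamState
    m; v; t::Int
end

function adam_step!(w, g, s::AdamState; lr=1e-3, beta1=0.9, beta2=0.999,
                    epsilon=1e-8, lambda=1e-2, decoupled=true)
    # L2 regularization: lambda*w enters the gradient and gets rescaled by
    # the per-coordinate adaptive factor, which the paper argues weakens it.
    decoupled || (g = g .+ lambda .* w)
    s.t += 1
    s.m .= beta1 .* s.m .+ (1 - beta1) .* g
    s.v .= beta2 .* s.v .+ (1 - beta2) .* g .^ 2
    mhat = s.m ./ (1 - beta1 ^ s.t)
    vhat = s.v ./ (1 - beta2 ^ s.t)
    w .-= lr .* mhat ./ (sqrt.(vhat) .+ epsilon)
    # Decoupled weight decay: applied to w directly, outside the adaptive step.
    decoupled && (w .-= lr .* lambda .* w)
    return w
end
```

With decoupled=false this reduces to the L2-as-gradient behaviour discussed above.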

denizyuret self-assigned this on Jan 7, 2019