L2 Decay for Optimizers #269
base: master
Conversation
Isn't weight decay part of the objective function and not the optimizer?
See the overfitting notebook for an example.
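In plain Julia, that objective-side formulation looks roughly like this (a minimal sketch; `predict`, `regloss`, and the `l2` coefficient are made-up names for illustration, not taken from the overfitting notebook):

```julia
# Minimal sketch of L2 regularization expressed in the objective function.
# (`predict`, `regloss`, and the `l2` keyword are illustrative names only.)
predict(w, x) = w[1] * x .+ w[2]                     # toy linear model

function regloss(w, x, y; l2 = 0.01)
    J = sum(abs2, predict(w, x) .- y) / size(y, 2)   # data term (mean squared error)
    R = l2 * sum(sum(abs2, p) for p in w)            # L2 penalty over all parameter arrays
    return J + R
end

# Example call: w = Any[randn(1, 3), zeros(1)]; x = randn(3, 10); y = randn(1, 10)
# regloss(w, x, y)
```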
|
Mathematically you are changing the optimization objective, not the algorithm. If you are claiming there is a computational advantage, let's discuss face to face.
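Concretely, for plain SGD the two views give the same update (a quick derivation; λ is the decay coefficient and η the learning rate, both introduced here for illustration):

```latex
% L2 penalty added to the objective:
J_\lambda(w) = J(w) + \tfrac{\lambda}{2}\,\lVert w \rVert^2
% gradient of the penalized objective:
\nabla J_\lambda(w) = \nabla J(w) + \lambda w
% resulting SGD step, identical to applying "weight decay" inside the optimizer:
w \leftarrow w - \eta \left( \nabla J(w) + \lambda w \right)
```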
On Tue, Feb 13, 2018 at 10:00 PM cangumeli wrote:

L2 decay is part of the optimizers in many frameworks like MXNet (https://mxnet.incubator.apache.org/api/python/optimization.html#api-reference) and PyTorch (http://pytorch.org/docs/master/optim.html); they simply call it weight decay. Adding an L2 penalty to the objective function means incrementing each gradient by decay_rate * w. Doing this addition in the optimizers will save us the overhead of squaring and reductions.
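A minimal sketch of that optimizer-side version in plain Julia (`decayed_update!` and its `lr`/`decay` keywords are hypothetical names, not the actual src/update.jl interface):

```julia
# Sketch: fold the L2 term into the gradient inside the update step, so no
# norm/reduction over the weights is ever computed.
# (`decayed_update!`, `lr`, and `decay` are illustrative names only.)
function decayed_update!(w, g; lr = 0.001, decay = 1e-4)
    g = g .+ decay .* w   # gradient of (decay/2)*|w|^2, added elementwise
    w .-= lr .* g         # plain gradient step; an Adam update would consume g the same way
    return w
end
```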
|
One side note worth mentioning: this looks like it applies weight decay to all learned parameters equally, including the bias terms. |
Good point; people usually do not want to apply weight decay to biases.
There is also the popular L1 regularization.
|
Technically, there is an optimizer for each parameter, so the user will be able to adjust the weight decay for each parameter. Also, we will be reporting losses with the weight decay penalty excluded; I'm not sure whether this is a feature or a bug. I think we may consider weight decay in optimizers as a performance trick to be used by people who know what they are doing. |
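A hedged sketch of that per-parameter setup in plain Julia (`SimpleOpt`, `step!`, and the `ndims` heuristic for spotting bias vectors are illustrative stand-ins, not Knet's actual optimizer types from src/update.jl):

```julia
# Sketch: one optimizer state per parameter array, with zero decay for
# 1-D bias vectors. (`SimpleOpt` and `step!` are made-up names.)
mutable struct SimpleOpt
    lr::Float64
    decay::Float64
end

w    = Any[randn(10, 5), zeros(10), randn(1, 10), zeros(1)]        # [W1, b1, W2, b2]
g    = [zero(p) for p in w]                                        # gradients (placeholder)
opts = [SimpleOpt(0.001, ndims(p) > 1 ? 1e-4 : 0.0) for p in w]    # biases get decay = 0

function step!(w, g, opts)
    for i in eachindex(w)
        o = opts[i]
        w[i] .-= o.lr .* (g[i] .+ o.decay .* w[i])   # decay is a no-op where it is 0
    end
    return w
end

step!(w, g, opts)
```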
Very relevant paper on this topic: https://arxiv.org/abs/1711.05101. In particular, it argues that L2 regularization and weight decay are not identical for Adam, and furthermore that L2 regularization is not effective in Adam. |
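For reference, a rough plain-Julia sketch of the distinction the paper draws (illustrative only; bias correction is omitted): with coupled L2 the decay term passes through Adam's moment estimates, while decoupled weight decay is subtracted from the weights outside the adaptive scaling.

```julia
# Rough Adam-style step contrasting the two variants (illustrative only;
# bias correction omitted for brevity). m, v are moment arrays carried between calls.
function adam_step!(w, g, m, v; lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                    decay=1e-4, decoupled=false)
    if !decoupled
        g = g .+ decay .* w             # coupled "L2": decay enters the moment estimates
    end
    m .= b1 .* m .+ (1 - b1) .* g       # first moment
    v .= b2 .* v .+ (1 - b2) .* g .^ 2  # second moment
    w .-= lr .* m ./ (sqrt.(v) .+ eps)  # adaptive step
    if decoupled
        w .-= lr .* decay .* w          # decoupled weight decay (AdamW-style)
    end
    return w
end
```

With `decoupled=false` this corresponds to the L2-in-the-gradient approach discussed above; with `decoupled=true` it corresponds to what the paper calls decoupled weight decay.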
I'm adding L2 weight decay to Knet optimizers. To discuss the interface, I started with Adam. L2 decay is used in the model I'm currently replicating, so I'll be using it.