why separate apply_updates from update? #155
From the flax documentation, a simple GD algorithm runs as follows:

```python
tx = optax.sgd(learning_rate=alpha)
opt_state = tx.init(params)
loss_grad_fn = jax.value_and_grad(loss)
for i in range(101):
    loss_val, grads = loss_grad_fn(params)
    updates, opt_state = tx.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
```

Why is `update` separated from `apply_updates`? From reading the documentation it seems that the separation is there to be able to use optimisers with extra steps, like gradient clipping, which aren't in the default optimisers (sgd, adam, etc.). Thanks!
Replies: 1 comment 1 reply
Hello,
Separating the transformation of updates from their application to the params has several advantages
One is `chain`: you can create custom optimisers by chaining together different existing gradient transformations, without having to rewrite the entire thing as a single monolithic optimiser. For a very trivial example, you might want to first clip gradients and then rescale them using Adam, or vice versa, but you may also do more sophisticated combinations.
If you take a look at alias.py you can see that many popular optimisers are actually built from a relatively small set of primitives; by freely combining these you can experiment …