How learning rate decay works
It is like driving: if you keep heading in the same direction, you can drive faster (take bigger steps); if you change direction frequently, you should drive slower (take smaller steps).
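One common way to implement this is an exponential decay schedule, where the learning rate shrinks by a constant factor every epoch. A minimal sketch (the initial rate and decay factor below are illustrative values, not from these notes):

```python
def decayed_lr(initial_lr, decay_rate, epoch):
    """Exponential learning rate decay: shrink the rate by a constant factor each epoch."""
    return initial_lr * (decay_rate ** epoch)

# Illustrative schedule: start at 0.1 and multiply by 0.95 each epoch.
for epoch in range(5):
    print(epoch, decayed_lr(initial_lr=0.1, decay_rate=0.95, epoch=epoch))
```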
- Also called exponentially weighted moving average.
- An exponential moving average, which uses an exponential smoothing technique, is in contrast to a simple moving average.
When training a model, it is often beneficial to maintain moving averages of the trained parameters. Evaluations that use averaged parameters sometimes produce significantly better results than the final trained values.
The decay is used to make the weights of older samples decay exponentially.
Reasonable values for decay are close to 1.0, typically in the multiple-nines range: 0.999, 0.9999, etc.
ema = tf.train.ExponentialMovingAverage(decay=0.998)
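A minimal sketch of how this object is typically used (assuming TF 2.x eager execution; the toy variable and update loop below are made-up placeholders):

```python
import tensorflow as tf

var = tf.Variable(0.0)                 # stand-in for a trained parameter
ema = tf.train.ExponentialMovingAverage(decay=0.998)

for step in range(5):
    var.assign_add(1.0)                # stand-in for a training update
    ema.apply([var])                   # shadow <- decay * shadow + (1 - decay) * var

print(ema.average(var).numpy())        # smoothed value to use at evaluation time
```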
Formula:

$$ v_t = \beta \, v_{t-1} + (1 - \beta)\, \theta_t $$

- $v_t$: forecast value at time $t$ (the exponential smoothing result)
- $\beta$: decay
- $\theta_t$: actual data value at time $t$ (maybe with some bias)
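A tiny NumPy sketch of this recurrence (the noisy input series is a made-up example):

```python
import numpy as np

def ema(data, beta):
    """Exponential smoothing: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v, out = 0.0, []
    for theta in data:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

# Made-up noisy series, just to see the smoothing effect.
data = np.sin(np.linspace(0, 6, 200)) + 0.3 * np.random.randn(200)
smoothed = ema(data, beta=0.98)
```

Note that starting from $v_0 = 0$ biases the early values toward zero, which is why a bias-correction term $v_t / (1 - \beta^t)$ is often applied.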
Meaning of Decay

- e.g. decay ($\beta$) = 0.98 $\rightarrow$ $\displaystyle\frac{1}{1-0.98} = 50$
- $\displaystyle 0.98^{50} \simeq \frac{1}{e} \simeq 0.37$
- That means a sample's weight decays to about 37% of its original value after roughly 50 rounds, so with $\beta = 0.98$ we can loosely say that we took EMA(50).
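A quick numeric check of that claim:

```python
beta = 0.98
n = round(1 / (1 - beta))      # 50
print(n, beta ** n)            # 50 0.364... (roughly 1/e ≈ 0.37)
```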
TBD
AdaGrad stands for Adaptive Gradient Algorithm
RMSProp stands for Root Mean Square Propagation
- Adam stands for Adaptive Moment Estimation
- Taking momentum and RMSProp and putting them together
Adam optimization uses a more sophisticated update rule than Stochastic Gradient Descent, with two additional steps.
- First, momentum: keep track of $\mathbf{m}$, a rolling average of the gradients:

$$
\begin{aligned}
\mathbf{m} &\leftarrow \beta_{1} \mathbf{m} + \left(1-\beta_{1}\right) \nabla_{\boldsymbol{\theta}} J_{\text{minibatch}}(\boldsymbol{\theta}) \\
\boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} - \alpha \mathbf{m}
\end{aligned}
$$

  where $\beta_1$ is a hyperparameter between 0 and 1 (often set to 0.9).
- Second, adaptive learning rates: keep track of $\mathbf{v}$, a rolling average of the magnitudes of the gradients:

$$
\begin{aligned}
\mathbf{m} &\leftarrow \beta_{1} \mathbf{m} + \left(1-\beta_{1}\right) \nabla_{\boldsymbol{\theta}} J_{\text{minibatch}}(\boldsymbol{\theta}) \\
\mathbf{v} &\leftarrow \beta_{2} \mathbf{v} + \left(1-\beta_{2}\right) \left(\nabla_{\boldsymbol{\theta}} J_{\text{minibatch}}(\boldsymbol{\theta}) \odot \nabla_{\boldsymbol{\theta}} J_{\text{minibatch}}(\boldsymbol{\theta})\right) \\
\boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} - \alpha \odot \mathbf{m} / \sqrt{\mathbf{v}}
\end{aligned}
$$

  where $\odot$ and $/$ denote elementwise multiplication and division (so $z \odot z$ is elementwise squaring) and $\beta_2$ is a hyperparameter between 0 and 1 (often set to 0.99). Adam divides the update by $\sqrt{\mathbf{v}}$, so parameters whose gradients have had consistently large magnitudes receive smaller effective updates.
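Putting the two steps together, a minimal NumPy sketch of an Adam-style update (without the bias-correction terms of the original paper; the small `eps` in the denominator is added for numerical stability, as in common implementations):

```python
import numpy as np

def adam_step(theta, m, v, grad, alpha=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam-style update: momentum m plus per-parameter scaling by sqrt(v)."""
    m = beta1 * m + (1 - beta1) * grad              # rolling average of gradients
    v = beta2 * v + (1 - beta2) * (grad * grad)     # rolling average of squared gradients
    theta = theta - alpha * m / (np.sqrt(v) + eps)  # elementwise adaptive step
    return theta, m, v
```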
- TensorFlow Guide: Exponential Moving Average for Improved Classification
- An overview of gradient descent optimization algorithms
- [1412.6980] Adam: A Method for Stochastic Optimization
- [PDF] ADADELTA: An Adaptive Learning Rate Method - Semantic Scholar
- Dive Into Deep Learning
    - Ch7.2 Gradient Descent
    - Ch7.4 Momentum
    - Ch7.5 AdaGrad
    - Ch7.6 RMSProp
    - Ch7.7 AdaDelta
    - Ch7.8 Adam