\chapter{Artificial Neural Networks}
Neural networks provide a robust method to learn
vector-valued target functions.
They have been used effectively in a multitude of machine learning domains,
including recognition of handwritten characters
\parencite{LeCun1989}
and
face recognition
\parencite{Cottreil1991},
but also in reinforcement learning
(\cite{anderson1989}; \cite{lin1993}). % inverted pendulum
Both the popularity and the effectiveness of neural networks can be attributed to
their ability to cope with noisy, real-world data,
as well as their ability to approximate any function,
albeit possibly at the cost of very large complexity
% TODO ref up
(\cite{mitchell1997}, chapter 4).
\paragraph{}
In part, they are man's attempt
to model the biology of the brain,
or at least that is where the inspiration is drawn from.
The artificial neural networks we construct
consist of interconnected units,
each taking in real-valued inputs
and outputting a real value,
possibly connecting to multiple other units.
This can be in a directed, feed-forward manner,
but it does not have to be.
Interesting architectures have been discovered
that make use of recurrent connections.
Such architectures will be considered in
section \ref{sec:rnn}.
To complete the analogy,
the human brain consists of a large number of densely packed neurons,
approximately $10^{11}$ in total.
Compared to artificial neurons they fire much more slowly,
yet humans are capable of
making incredibly complicated decisions
(or recognitions)
in less than a second.
This hints at great parallelism,
an attribute that is still important
in the artificial neural networks we construct,
especially in real-world reinforcement learning settings
where decisions must be made swiftly and we cannot wait
for a slower model to finish computing.
While our general-purpose machines
are largely sequential in nature,
parallel hardware has been built specially
for neural network applications.
Such hardware is not generally available,
as each domain and even each problem
still requires its own specific architecture.
Still, nowadays many implementations make use
of the parallel capabilities of
Graphics Processing Units (GPUs),
making artificial neural networks even more practical.
The analogy is limited, however.
Our use of neural networks draws inspiration from nature
but does not aim to mimic it perfectly.
Whereas our neural networks output real values,
biological neurons produce a series of spikes
where both timing and intensity impact the represented information
\parencite{bialek1991reading}.
The reader interested in spiking neural networks
is referred to a good overview by
\citeauthor{paugam2012computing} (\citeyear{paugam2012computing}).
For now,
we shall content ourselves without the dimension of time in this regard.
\section{Building Blocks}
Artificial neural networks are made up of simple elements called units.
Units are connected through directed connections
that are associated with a weight.
These weights are what make up the parameters of the model.
Units are usually divided into layers depending on their function
and location within the network.
A non-input unit receives as its input a weighted combination
of the values on its incoming connections,
and transforms this input with a function
we call an \textit{activation function}.
Activation functions are usually non-linear and play an important role
in how the input value is expressed, or \textit{activated}.
\begin{figure}[h]
\center
\includegraphics[width=0.6\linewidth]{net.pdf}
\caption[Multi-layer neural network]{
Multi-layer neural network with a single hidden layer.
}
\label{fig.neuralnet}
\end{figure}
\section{Activation Functions}
\label{sec:activation}
Activation functions transform the weighted input
of a unit in a neural network into its output.
They have a crucial impact on the shape of the output space,
so some care should be taken when deciding on one.
Formally, an activation function $\phi$ will compute
\begin{equation}
o = \phi (\overrightarrow{w} \cdot \overrightarrow{x})
\end{equation}
where $\overrightarrow{x}$
is the vector of incoming values from other units
and $\overrightarrow{w}$
is the vector of associated weights.
\subsection{Perceptron}
Activation functions can be linear or nonlinear.
A special kind of unit is the \textit{perceptron}
\parencite{rosenblatt1958perceptron},
which outputs either a $-1$ or $1$,
depending on the value of a linear combination of the input:
\begin{equation}
\label{eq.perceptron}
o(\overrightarrow{x}) = \begin{cases}
1 & \text{if } \overrightarrow{w}\cdot\overrightarrow{x} > 0 \\
-1 & \text{otherwise}
\end{cases}
\end{equation}
\begin{figure}
\center
\includegraphics[width=.8\linewidth]{perceptron.pdf}
\caption{Basic perceptron unit.}
\label{fig.ml.perceptron}
\end{figure}
where $\overrightarrow{w}$ denotes the weights of the incoming connection
and $\overrightarrow{x}$ the corresponding incoming values.
This function effectively creates a hyperplane decision surface;
it separates the input space into two half-spaces.
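To make this concrete, the decision rule in equation \ref{eq.perceptron}
can be written down in a few lines of Python.
The following is a minimal sketch;
the weights and input are made up purely for illustration.
\begin{verbatim}
import numpy as np

def perceptron_output(w, x):
    # Perceptron rule: +1 if the weighted sum is positive, -1 otherwise.
    return 1.0 if np.dot(w, x) > 0 else -1.0

# Illustrative weights and input (arbitrary values).
w = np.array([0.5, -1.0, 0.25])
x = np.array([1.0, 0.2, 2.0])
print(perceptron_output(w, x))  # prints 1.0, since w . x = 0.8 > 0
\end{verbatim}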
\subsection{Sigmoid}
\label{sec:sigmoid}
A problem with perceptrons
is that they are not differentiable
in their whole domain.
As will become clear in section \ref{sec:backprop},
differentiability is a useful and often required attribute.
As a way to circumvent this difficulty of non-differentiability,
the \textit{sigmoid} (denoted $\sigma(x)$) is often used:
\begin{equation}
\label{eq.ml.sigmoid}
\sigma(x) = \frac{1}{1 + e^{-x}}
\end{equation}
\begin{figure}[h]
\center
\includegraphics[width=.7\textwidth]{sigmoid.pdf}
\caption{The Sigmoid function}
\label{fig.ml.sigmoid}
\end{figure}
As can be seen in Figure \ref{fig.ml.sigmoid},
it has the same tendency as the perceptron activation function
to separate the input space in two,
but there is a smooth transition area between the two extremes,
allowing the function to be differentiable in its whole domain.
Its output lies in the open interval $(0, 1)$;
if desired, it can be rescaled to range between -1 and 1,
just like the perceptron.
\subsection{Hyperbolic Tangent}
As a cousin to the sigmoid function, the hyperbolic tangent
deserves mentioning as well.
It bears the same shape as the sigmoid function,
since it is really only a stretched
and shifted version of it,
with outputs ranging between -1 and 1:
\begin{equation}
\label{eq.ml.tanh}
\tanh(x) = \frac{2}{1 + e^{-2x}} - 1
\end{equation}
\begin{figure}[h]
\center
\includegraphics[width=.7\textwidth]{tanh.pdf}
\caption{The hyperbolic tangent function}
\label{fig.ml.htan}
\end{figure}
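A small Python sketch makes the relationship between the two functions explicit:
the hyperbolic tangent is a sigmoid that has been stretched to range between -1 and 1
and squeezed horizontally.
The check against the library implementation of $\tanh$ is only there
to illustrate that the two formulations agree.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    # The logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_sigmoid(x):
    # The hyperbolic tangent as a stretched and shifted sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

xs = np.linspace(-3.0, 3.0, 7)
print(np.allclose(tanh_via_sigmoid(xs), np.tanh(xs)))  # True
\end{verbatim}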
\subsection{Rectified Linear Unit}
\label{sec:relu}
A fairly recent activation function is the
Rectified Linear Unit, or ReLU for short
\parencite{Nair2010}.
\begin{equation}
\label{eq.relu}
f(x) =
\begin{cases}
0 & \text{for } x < 0 \\
x & \text{for } x \geq 0
\end{cases}
\end{equation}
\begin{figure}[h]
\center
\includegraphics[width=.7\textwidth]{relu.pdf}
\caption{Rectified Linear Unit}
\label{fig.ml.relu}
\end{figure}
While it is very simple in essence,
it has some important properties
that can be leveraged.
Among its advantages is that anything
below zero is mapped to zero,
which promotes sparse
network activations.
With weights initialized
symmetrically (e.g. uniformly) around 0,
only about half the neurons will activate initially.
This sparsity aids discrimination
and effectively makes training
neural networks faster,
making the unit especially interesting for
deep learning
\parencite{Y.2015a}.
Linear rectifier functions also help mitigate
the \textit{vanishing gradient} problem
since for positive inputs the gradient
is simply passed through unchanged
\parencite{Glorot2011}.
A more detailed discussion of the vanishing gradient problem
can be found in section \ref{sec:rnn}
pertaining to recurrent neural networks.
\paragraph{}
The rectified unit family has gained quite some attention
ever since the first promising results.
Leaky rectified units comprise a subfamily
of functions that are designed to have a non-zero
gradient over their entire domain.
The basic version simply
has a fixed slope below zero
that is kept constant throughout training.
One such variant is the
\textit{Parametric rectified linear unit}
\parencite{He2015a},
a unit with a parametric slope below zero
that can be trained on data.
Another variant is the
\textit{Randomized rectified linear unit}
which has a randomized slope below zero.
The supposition is that the noise
will help prevent overfitting during training,
so the slopes are fixed outside of training.
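The whole family is easily summarized in code.
Below is a minimal sketch of the basic rectifier and its leaky variant;
the slope of 0.01 below zero is an arbitrary illustrative choice,
and the parametric and randomized variants differ only in how that slope is obtained
(learned or sampled, respectively).
\begin{verbatim}
import numpy as np

def relu(x):
    # Rectified linear unit: zero below zero, identity above.
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Leaky variant: a small fixed slope keeps the gradient non-zero below zero.
    return np.where(x >= 0.0, x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.    0.    0.    1.5 ]
print(leaky_relu(x))  # [-0.02  -0.005  0.     1.5  ]
\end{verbatim}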
\section{Gradient Descent and Backpropagation}
\label{sec:backprop}
Earlier we talked about a learning algorithm's error
and even glossed over ways to minimize this error.
In this section we will discuss gradient descent
as a general approach
followed by backpropagation,
a specific gradient descent algorithm
designed to minimize errors
for artificial neural networks.
\paragraph{}
The approaches discussed here
make certain assumptions about
what the algorithm's error looks like
as a function of the parameters of the algorithm.
In other words, they make assumptions
about the error surface.
Differentiability is in fact a requirement:
the derivative of the error
must exist at every point,
that is, for every combination of algorithm parameters.
This gives rise to smooth surfaces
(though possibly steep)
that do not contain gaps,
such as the one in
Figure \ref{fig:error_surface}.
Despite the requirement,
the methods described here still perform badly
on irregular surfaces:
surfaces where the gradient changes sign often, for example,
and thus surfaces with many local minima.
\begin{figure}[ht]
\centering
\includegraphics[width=0.7\textwidth]{error_surface.pdf}
\caption[Error surface]{Error surface as a function of two parameters of the model
that are to be learned.}
\label{fig:error_surface}
\end{figure}
\subsection{Gradient Descent}
\label{sub:gradient_descent}
Gradient descent or steepest descent
can be described as the general approach
of moving across an error surface
by following the path of the steepest slope,
i.e. the direction of the greatest gradient,
downward.
This makes it a greedy algorithm.
Following such a slope
means adjusting the parameters (or weights)
in the direction of that slope.
\paragraph{}
As stated before,
gradient descent requires the derivative of the error E
to exist in the whole domain.
Formally, as a function of the weight vector
$\overrightarrow{w}$ of length $n$,
we write the derivative or gradient of E as:
\begin{equation}
\label{eq:gradient_descent_derivative}
\nabla E(\overrightarrow{w}) = \left(
\frac{\partial E}{\partial w_0},
\frac{\partial E}{\partial w_1},
\dots,
\frac{\partial E}{\partial w_n}
\right)
\end{equation}
$\nabla E(\overrightarrow{w})$
is itself a vector and corresponds to the steepest slope
described earlier,
with both a magnitude and a direction.
Since this gradient describes the steepest
\textit{increase}
of E with respect to $\overrightarrow{w}$,
we need to use the negated gradient
$-\nabla E(\overrightarrow{w})$
to follow the slope downward.
We can now use this for our weight update:
\begin{equation}
\Delta \overrightarrow{w} = - \eta \nabla E(\overrightarrow{w})
\label{eq:gradient_descent}
\end{equation}
We call $\eta$ the learning rate,
useful for tuning the step size of the algorithm.
Smaller values will converge more slowly
while larger ones have the danger of overstepping local minima,
making convergence even slower.
Gradient descent is actually known to ``zigzag''
across narrow valleys.
Because of this,
the learning rate is sometimes reduced over time
so the learner is less prone to overstep
and instead settle into minima.
% The method in equation \ref{eq:gradient_descent}
% is also called
% \textit{batch} gradient descent
% because all
\paragraph{}
In order to actually compute the gradient iteratively,
we still need a way to calculate the partial derivatives
$\frac{\partial E}{\partial w_i}$.
We still have not defined the error E,
so let us start with that.
For a training data set $D$
and learning algorithm $\hat{f}$,
we define the error $E$ w.r.t. to
$\overrightarrow{w}$ as:
\begin{equation}
E(\overrightarrow{w}) \equiv \frac{1}{2}
\sum_{(x_i, y_i) \in D}(y_i - \hat{f}(x_i; \overrightarrow{w}))^2
\end{equation}
The reason why it is chosen so specifically
will become clear in a moment.
\paragraph{}
In order to algebraically differentiate $E$ w.r.t. $\overrightarrow{w}$,
we also need to define the learning algorithm $\hat{f}$,
which of course plays a key role
in what the error surface looks like.
% TODO
It is possible to numerically approximate
the gradient near a certain point $x$
in case the algebraic derivative is too hard to obtain
or the calculations are too complex.
However, we will content ourselves with a simpler example
that allows us to differentiate algebraically
and calculate the gradient without such methods.
For $\hat{f}$, we will pick
a simple linear unit that will suffice as an example:
$$
\hat{f}(\overrightarrow{x};\overrightarrow{w}) = \overrightarrow{x} \cdot \overrightarrow{w}
$$
Finally then, the differentiation of $E$:
\begin{equation}
\begin{split}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2}
\sum_{(x_j, y_j) \in D} (y_j - \hat{y}_j)^2 \\
&= \sum_{(x_j, y_j) \in D}(y_j - \hat{y}_j)
\frac{\partial}{\partial w_i}(y_j - \overrightarrow{w} \cdot \overrightarrow{x_j}) \\
&= \sum_{(x_j, y_j) \in D}(y_j - \hat{y}_j)(-x_{ij})
\end{split}
\end{equation}
with $x_{ij}$ the $i$th component of $x_j$.
As you can see, the differentiation is comfortably easy
because of our choice of the linear unit
and the way the error function is defined.
Other error functions, but most of all other learning
algorithms, can have far less obvious derivatives.
Combining this with \ref{eq:gradient_descent}
gives us the update rule for a specific weight component:
\begin{equation}
\Delta w_i = \eta \sum_{(x_j, y_j) \in D}
(y_j - \hat{y}_j) x_{ij}
\label{eq:gradient_descent_update}
\end{equation}
Equation \ref{eq:gradient_descent_update}
describes the gradient descent algorithm for the linear unit.
Each iteration, the weight update is calculated
over all the training examples by computing $\hat{y}_j$
and following \ref{eq:gradient_descent_update},
summing the per-example contributions
to get each $\Delta w_i$.
Only once all weight deltas $\Delta w_i$
are calculated should the actual weight vector be updated
(with $\overrightarrow{\Delta w}$ comprised of the individual $\Delta w_i$ of course):
\begin{equation}
\overrightarrow{w} = \overrightarrow{w} + \overrightarrow{\Delta w}
\end{equation}
This then gets repeated until some termination condition is met,
like a sufficiently small error or a desired number of passes
over the training set.
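To make the procedure concrete, the following Python sketch
implements batch gradient descent for the linear unit,
following equation \ref{eq:gradient_descent_update};
the training data and hyperparameters are invented purely for illustration.
\begin{verbatim}
import numpy as np

def batch_gradient_descent(X, y, eta=0.05, epochs=500):
    # Batch gradient descent for a linear unit f(x; w) = w . x.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = X @ w                      # predictions for the whole training set
        delta_w = eta * X.T @ (y - y_hat)  # summed update over all samples
        w += delta_w                       # update only after the full pass
    return w

# Illustrative data: targets generated by a known weight vector.
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, -1.0], [1.5, 0.0]])
y = X @ np.array([2.0, -3.0])
print(batch_gradient_descent(X, y))  # approaches [ 2. -3.]
\end{verbatim}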
\paragraph{}
Because first the whole weight update is calculated
with respect to the whole training data set
and only then the weight vector is adjusted,
this vanilla version of gradient descent is also
called \textit{batch gradient descent}.
It can be quite inefficient and downright
impractical for large training sets that do not fit into memory.
For this reason,
the stochastic variant is often used.
\subsection{Stochastic Gradient Descent}
A variation on regular gradient descent
that alleviates its slow
convergence and its tendency to fall
into local minimum pitfalls
(\cite{mitchell1997}, chapter 4)
is \textit{stochastic gradient descent}.
Instead of performing the parameter updates in batch fashion,
parameters are updated incrementally.
This means that for every training sample $(x_j, y_j)$,
$\overrightarrow{w}$
gets updated immediately
and therefore affects the calculation of the next
$\Delta \overrightarrow{w}$.
Stochastic gradient descent basically descends along the
gradient of the error with respect to a single example.
This makes it much faster than the regular batch version
and allows it to be used for online learning.
It also means that stochastic gradient descent
does not use the true gradient to adjust parameters;
instead it uses a noisy per-sample estimate of it.
Each step is computationally cheaper
because it is computed from a single sample
instead of the whole training set,
% same ref as before really, mitchell
but the method can also sometimes avoid falling into local minima
because it follows a different gradient than regular gradient descent does.
\paragraph{}
By adjusting $\eta$ to be smaller as compensation,
stochastic gradient descent can be made to approximate
true gradient descent arbitrarily closely.
Even better, it has been shown that slowly decreasing
the learning rate, as proposed before,
will allow stochastic gradient descent
to match the convergence behavior
of its batch counterpart.
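In code, the difference with the batch version above is small:
the weight vector is updated inside the loop over samples rather than after it.
The sketch below also shrinks the learning rate over time;
the multiplicative decay factor used here is merely one simple, illustrative way to do so.
\begin{verbatim}
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.05, epochs=50, decay=0.99):
    # Per-sample gradient descent for the linear unit f(x; w) = w . x.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            y_hat = x_j @ w
            w += eta * (y_j - y_hat) * x_j  # update immediately after each sample
        eta *= decay                        # slowly reduce the step size
    return w
\end{verbatim}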
\subsection{Backpropagation}
We have now laid out the basis for the \textit{backpropagation} algorithm,
a learning algorithm designed for multilayer networks
that are capable of representing highly nonlinear functions.
Backpropagation \parencite{rumelhart1988learning}
will allow us to learn the weights of a multilayer network
by minimizing a measure of difference between network output and target output.
This minimization will be done with gradient descent
and the measure of error we will employ is
\begin{equation}
E(\overrightarrow{w}) \equiv \frac{1}{2} \sum_{(x_j, y_j) \in D}
\sum_{k \in outputs}(y_{kj} - \hat{y}_{kj})^2
\end{equation}
where $y_{kj}$ is the $k$th element of the target output
that corresponds to input sample $x_{j}$,
and $\hat{y}_{j}$ is the network output for $x_{j}$,
so $\hat{y}_{kj}$ corresponds to its $k$th element.
Having the error defined like this allows gradient descent
to minimize the error across all training samples.
\paragraph{}
The network should have an activation function
as described in Section \ref{sec:activation};
we will refrain from picking a specific example for now
and call this activation function $\phi$.
However, since backpropagation is based on gradient descent,
it has the same requirement of differentiability
of the function in question.
Since the network is made up partly of these activation functions,
we require these to be differentiable.
Because of this,
backpropagation cannot readily be used with a unit such as the perceptron,
because its thresholding activation is not differentiable.
\paragraph{}
In the algorithm, we will need a delta
that signifies how far off the network's output is.
The $\delta_n$ described here is actually the negative derivative of our error
w.r.t. the net input $net_n$ of node $n$,
i.e. the weighted sum of that node's inputs,
but we will not go into a full explanation of how this result is obtained.
We only mention this because this
$-\frac{\partial E}{\partial net_n}$
corresponds closely to the negative gradient we used in gradient descent.
\paragraph{}
First off, create a neural network
and initialize all network weights.
Initialization can be done in different ways,
according to different distributions,
but a popular standard choice
is to draw them all from a narrow uniform distribution
centered around zero.
Once created, feed the given training sample
through the network, giving $o_k$ as output
for every unit $k$.
The outputs $o_k$ of the units in the output layer together
make up $\hat{y}$.
Computing the error term $\delta_k$
for an output unit $k$
and a given training sample
is straightforward:
\begin{equation}
\delta_k = \psi(net_k) (y_k - \hat{y}_k)
\label{eq:delta_k}
\end{equation}
with $(y_k - \hat{y}_k)$
simply the difference between the target output and the network output
and $\psi$ the derivative of the activation function $\phi$,
evaluated at the unit's net input $net_k$.
For example, if we were to use
the sigmoid function $\sigma$ (Section \ref{sec:sigmoid}), its derivative is
\begin{equation}
\psi(x) = \sigma(x) (1 - \sigma(x))
\end{equation}
and since $o_k = \sigma(net_k)$,
this derivative can be expressed in terms of the unit's output:
\begin{equation}
\psi(net_k) = o_k (1 - o_k)
\end{equation}
The $\delta_k$ in \ref{eq:delta_k}
describes the error for an output unit,
but how do we compute the error for a hidden unit,
since a hidden unit has no provided target value?
It turns out it should depend on the error terms
of the units it connects to,
weighted by the weights associated to those connections.
For a hidden unit $h$, write its error term $\delta_h$ as:
\begin{equation}
\delta_h = \psi(net_h) \sum_{k \in outputs} w_{kh}\delta_k
\end{equation}
where $w_{kh}$ denotes the weight associated to
the connection from unit $h$ to unit $k$.
These are of course the weights that have to be tuned,
so intuitively backpropagation tunes
how much each node contributes to the nodes it feeds into,
effectively learning hidden representations of the hypothesis in question.
\paragraph{}
Now that we have defined the error terms for all weights,
we can calculate the weight update
for the weight associated with the connection
from unit $i$ to unit $j$:
\begin{equation}
\Delta w_{ji} = \eta \delta_j x_{ji}
\end{equation}
where $x_{ji}$ denotes the value flowing from unit $i$ into unit $j$.
Just as with gradient descent,
this can be done until some termination criterion is reached,
be it convergence or a fixed number of iterations.
In the stochastic gradient descent setting of this algorithm,
$\Delta w_{ji}$ would be used immediately to update
$w_{ji}$.
This is how backpropagation is commonly encountered,
but one could also choose to employ the batch version,
just as was the case with regular gradient descent.
In that scenario,
sum all the $\Delta w_{ji}$
obtained for all examples of the training set
and only then perform the weight updates.
In practice this is rarely done because the approach scales poorly
with training set size.
In fact, backpropagation is usually run over the training set multiple times
in order to converge to a local minimum,
seeing every sample more than once.
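The stochastic version of the algorithm can be condensed into a short sketch
for a network with a single hidden layer of sigmoid units.
Biases are left out to keep the example minimal,
and the function and variable names are our own, not those of any particular library.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_hidden, W_out, x, y, eta=0.1):
    # One stochastic backpropagation step for a single-hidden-layer network
    # with sigmoid activations everywhere (biases omitted for brevity).

    # Forward pass.
    o_hidden = sigmoid(W_hidden @ x)   # hidden layer outputs
    o_out = sigmoid(W_out @ o_hidden)  # network output, i.e. y_hat

    # Error terms: output units first, then hidden units.
    delta_out = o_out * (1 - o_out) * (y - o_out)
    delta_hidden = o_hidden * (1 - o_hidden) * (W_out.T @ delta_out)

    # Weight updates, one per connection.
    W_out += eta * np.outer(delta_out, o_hidden)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_out
\end{verbatim}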
\subsubsection{Momentum}
A common extension to backpropagation is momentum.
Every weight update carries over part of the weight update from the previous iteration:
\begin{equation}
\Delta w_{ji}(t+1) = \eta \delta_j x_{ji} +
\alpha \Delta w_{ji}(t)
\label{eq:momentum}
\end{equation}
We call $\alpha$ in \ref{eq:momentum} the \textit{momentum}.
It is aptly named after the physical property:
the larger an object's momentum,
the larger (or longer) the force that must be applied
to bring it to a standstill
(and analogously for getting up to speed,
though we conveniently ignore that here).
As a result,
a high momentum will keep the weight vector
rolling in the same direction through flat regions
and even roll it up small slopes thanks to momentum
accumulated previously,
which is great for escaping local minima
and overcoming flat plateaus.
In addition,
convergence is sped up
because regions with consistently negative gradients
will accumulate a lot of speed.
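Expressed in code, the extension amounts to remembering the previous update.
The sketch below mirrors equation \ref{eq:momentum};
the value of 0.9 for $\alpha$ is a common illustrative choice, not a prescribed one.
\begin{verbatim}
def momentum_update(delta_w_prev, delta_j, x_ji, eta=0.01, alpha=0.9):
    # Equation (momentum): the new update is the usual backpropagation step
    # plus a fraction alpha of the previous update.
    return eta * delta_j * x_ji + alpha * delta_w_prev
\end{verbatim}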
\subsection{Adagrad and RMSProp}
\label{sub:adagrad_and_rmsprop}
Some parameters get updated less frequently than others
even though they may play a key role;
they are the proverbial needles in the haystack of parameters.
Adagrad \parencite{Duchi2011},
its name derived from `adaptive gradient',
addresses exactly this need of allowing important
yet infrequently updated parameters
to play a bigger role in gradient updates.
It builds on stochastic gradient descent
but adds the notion
of parameter-specific learning rates
that are continuously updated to reflect
the scarcity of updates for a parameter.
Let us start by introducing a shorthand for
the gradient of the error at time step $t$
for a parameter vector $\theta$:
\begin{equation}
g_{t} \equiv \nabla_\theta E(\theta)
\end{equation}
Our stochastic gradient descent update rule then becomes:
\begin{equation}
\theta_{t+1} = \theta_{t} - \eta \cdot g_{t}
\end{equation}
Now we introduce a diagonal matrix $G_t$,
which we will discuss right after:
\begin{equation}
\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_{t}
\end{equation}
Note that $\odot$ denotes element-wise matrix-vector multiplication
and $\epsilon$ is simply a tiny non-zero value
to avoid division by zero.
The matrix $G_t$ is a diagonal matrix
whose element $(i, i)$
corresponds to the sum of squares of the gradients
corresponding to $\theta_i$
up to time step $t$.
The element $G_{t, ii}$ looks as follows:
\begin{equation}
G_{t, ii} = \sum^t_{\tau=1}g_{\tau, i}^2
\end{equation}
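Because $G_t$ is diagonal, an implementation only needs to store
one running sum of squared gradients per parameter.
The following is a minimal sketch of a single Adagrad step;
the default values for $\eta$ and $\epsilon$ are illustrative.
\begin{verbatim}
import numpy as np

def adagrad_update(theta, G_diag, grad, eta=0.01, eps=1e-8):
    # G_diag holds the per-parameter sum of squared gradients
    # (the diagonal of G_t).
    G_diag = G_diag + grad ** 2
    theta = theta - eta / np.sqrt(G_diag + eps) * grad
    return theta, G_diag
\end{verbatim}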
\subsubsection{RMSProp}
Adagrad effectively takes into account
the past updates to a parameter.
However, since it accumulates all past gradients,
the values of $G_t$ tend to blow up over long
series of samples,
and as a result the effective learning rate vanishes almost entirely,
preventing the algorithm from learning anything beyond that point.
Multiple approaches have been developed to tackle this problem.
One such approach is \textit{RMSProp},
introduced by \citeauthor{hinton2012neural}
in a lecture of his Coursera class (\citeyear{hinton2012neural}).
It solves the ever-growing $G_t$
and the resulting ever-diminishing learning rate
by keeping a moving average of the squared gradient for all parameters:
\begin{equation}
E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g^2_t
\end{equation}
The update rule then becomes
\begin{equation}
\label{eq:rmsprop}
\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t}
\end{equation}
This is what gives the method its name:
the denominator is the root mean square (RMS) of recent gradients,
propagated over time as a moving average.
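A single RMSProp step then differs from Adagrad only in how the squared gradients
are accumulated: a moving average replaces the ever-growing sum.
The sketch below uses $\rho = 0.9$, a commonly quoted but still illustrative choice.
\begin{verbatim}
import numpy as np

def rmsprop_update(theta, avg_sq_grad, grad, eta=0.001, rho=0.9, eps=1e-8):
    # Moving average of squared gradients: E[g^2]_t.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Update rule of equation (rmsprop).
    theta = theta - eta / np.sqrt(avg_sq_grad + eps) * grad
    return theta, avg_sq_grad
\end{verbatim}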