Backpropagation is a technique used to teach a neural network that has at least one hidden layer.
- part 1 - simplest network
- part 2 - backpropagation (you are here)
- part 3 - backpropagation-continued
A perceptron is a processing unit that takes an input , transforms it using an activation function and outputs the result .
Within a neural network, its input is the sum of the previous layer node outputs times their corresponding weight, plus the previous layer bias:
If we treat the bias as an additional node in a layer with a constant value of , then we can simplify the equation:
Why do we need an activation function? Without it the output of every node will be linear, making the neural network output a linear function of the inputs. Since the combination of two linear functions is also a linear function, you can't compute more interesting functions without non-linear ones. This means that the network will only be able to solve problems that can be solved with linear regression.
If then typical activation functions are:
The backpropagation algorithm is used to train artificial neural networks, more specifically those with more than two layers.
It's using a forward pass to compute the outputs of the network, calculates the error and then goes backwards towards the input layer to update each weight based on the error gradient.
- , are inputs to a node for layers respectively.
- , are the outputs from a node for layers respectively.
- is the expected output of a node of the output layer.
- are weights of node connections from layer to and from layer to respectively.
- is the current association out of associations.
We will assign the following activation functions to each layer nodes for all following examples:
- input layer -> identity function
- hidden layer -> sigmoid function
- output layer -> identity function
During the forward pass, we feed the inputs to the input layer and get the results in the output layer.
The input to each node in the hidden layer is the sum of the output from all nodes of the input layer times their corresponding weight:
Since the hidden layer's activation function for each node is the sigmoid, then their output will be:
In the same manner, the input to the output layer nodes are
and their output is the same since we assigned them the identity activation function.
Once the inputs have been propagated through the network, we can calculate the error. If we have multiple associations, we simply sum the error of each association.
Now that we have the error, we can use it to update each weight of the network by going backwards layer by layer.
We know from part 1 of this series that the change of a weight is the negative of that weight's component in the error gradient times the learning rate. For a weight between the last hidden layer and the output layer, we then have
We can find the error gradient by using the chain rule
Therefore the change in weight is
For multiple associations, then the change in weight is the sum of each association
Similarly, for a weight between hidden layers, in our case between the input layer and our first hidden layer, we have
Here the calculations are slightly more complex. Let's analyze the delta term and understand how we got there. We start by calculating the partial derivative of in respect to the error by using the chain rule
Remember that our activation function is the sigmoid function and that its derivative is
Again, the change in weight for all associations is the sum of each association
First, initialize network weights to a small random value.
Repeat the steps below until the error is about 0
- for each association, propagate the network forward and get the outputs
- update all weights and reset accumulated gradients ()
In this example, we'll use actual numbers to follow each step of the network. We'll feed our 2x2x1 network with inputs and we will expect an output of . To make matters simpler, we'll initialize all of our weights with the same value of . However, keep in mind that normally weights are initialized using random numbers. We will also design the network with a sigmoid activation function for the hidden layer and the identity function for the input and output layers and we'll use as our learning rate.
We start by setting all of the nodes of the input layer with the input values; .
Since the input layer nodes have the identity activation function, then .
We then propagate the network forward by setting the layer node inputs () with the sum of all of the previous layer node outputs times their corresponding weights:
We then activate the layer nodes by passing it's inputs to the sigmoid function
And we propagate those results to the final layer
Since the activation function of our output nodes is the identity, then
On the way back, we first calculate the term for the output node,
And using the term we calculate the gradient for each weight between and layer nodes:
We then do the same thing for each hidden layer (just the one in our case):
And calculate the gradient for each weight between and layer nodes:
The last step is to update all of our weights using the calculated gradients. Note that if we had more than one association, then we would first accumulate the gradients for each association and then update the weights.
As you can see the weights changed by a very little amount, but if we were run a forward pass again using the updated weights, we should normally get a smaller error than before. Let's check...
We had on our first iteration and we get after the weight changes.
We had and we get after the weight changes.
We successfully reduced the error! Although these numbers are very small, they are much more representative of a real scenario. Running the algorithm many times over would normally reduce the error down to almost 0 and we'd have completed training the network.
The example teaches a 2x2x1 network the XOR operator.
Where is the sigmoid function for the hidden layer nodes.
Note that the XOR operation could not be solved with the linear network used in part 1 because the dataset is distributed non-linearly. Meaning you could not pass a straight line between the four XOR inputs to divide them into the correct two categories. If we replaced the hidden layer node activation functions from sigmoid to identity, this network wouldn't be able to solve the XOR problem as well.
Feel free to try it out yourself and experiment with different activation functions, learning rates and network topologies.
- Artificial intelligence engines by James V Stone (2019)
- Complete guide on deep learning: http://neuralnetworksanddeeplearning.com/chap2.html
- Flow of backpropagation visualized: https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/
- Activation functions: https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0