Insight into the Backpropagation Algorithm

New to Artificial Neural Networks? If so, you might be wondering why ANNs are so powerful. What is the one thing that makes them powerful? Well, the answer to these questions is the Backpropagation Algorithm. Apart from this algorithm there are many other concepts that contribute to the correctness of ANNs, but at the center is the Backpropagation Algorithm.

Things you should know before reading this blog are calculus and, more specifically:

What are derivatives?

What are partial derivatives?

What is the chain rule? (very important)

If you know these concepts, you are good to go; otherwise, I suggest googling them and getting a good grasp of them before continuing.

We are going to start with a high-level intuition, then apply the idea in more mathematical terms to a perceptron (a one-layer NN), and then extend that idea all the way to Artificial Neural Networks.

Let’s start with our motivation for learning how this algorithm works.

Why?

Why should we invest our time in understanding how backpropagation works? We all know that in libraries like TensorFlow we don’t have to implement backpropagation ourselves, as these libraries take care of it. Then why? Well, to understand and enjoy what is really happening when we train our neural network, and to know what people mean when they say things like “train the model for a longer period of time”. So, just to understand the intuition and enjoy the idea of the Backpropagation Algorithm, and who knows, you might be the one to suggest an improvement to this algorithm!

Before moving to the backpropagation algorithm, let’s understand forward propagation, which is pretty straightforward.

Describing Neural Networks at a very basic level:

A Very Simple Neural Network

Here x is our input, and w1 and w2 are the corresponding weights. Those circles are called neurons (as in the human brain). During forward propagation, the input x is multiplied by the weight w1 and added to the bias (here we are assuming no bias for simplicity); this gives us z1. Now, to capture complex patterns in the data, some activation function is used (e.g. sigmoid, tanh, ReLU). When the activation function is applied to z1, we get the activation of that neuron. In a similar fashion, this activation is multiplied by the weight w2 and then passed through an activation function to get the activation of the next neuron.

Forward propagation is demonstrated here on a very simple neural network, but the same idea applies to powerful ANNs with many hidden layers, each with many neurons.
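As a quick illustration, here is roughly what that forward pass looks like in plain Python. This is a minimal sketch: the sigmoid activation, the absence of a bias, and the values of x, w1, and w2 are all assumptions made for the sake of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up input and weights (no bias, for simplicity)
x, w1, w2 = 0.5, 0.3, -0.8

z1 = w1 * x        # weighted input to the first neuron
a1 = sigmoid(z1)   # activation of the first neuron
z2 = w2 * a1       # weighted input to the output neuron
a2 = sigmoid(z2)   # activation of the output neuron (A2)

print(a2)
```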

Presenting the above in more mathematical terms:

Forward Propagation for the above Neural Network
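In case the equation image does not come through, the forward pass presumably reads as follows (writing σ for whichever activation function is used and ignoring the bias):

\[ Z_1 = w_1 x, \qquad A_1 = \sigma(Z_1), \qquad Z_2 = w_2 A_1, \qquad A_2 = \sigma(Z_2) \]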

Here, A2 is our neural network’s output. Now, to calculate its cost we take the square of the difference between A2 and the ground truth (say Y).

Calculating Cost
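Written out, the cost is simply the squared difference between the output and the ground truth:

\[ C = (A_2 - Y)^2 \]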

Since our weights (w1 and w2) were randomly initialized, our neural network’s output is utter rubbish, resulting in a high cost.

We somehow need to find a way to minimize this cost. This is where backpropagation comes to save our life!

The backpropagation algorithm is based on the idea of how sensitive the cost we calculated above is to the weights of the network: how much our cost changes when we change the weights. This is the derivative of the cost with respect to the weights of the network. We then subtract this derivative from the weights. To put it more clearly:

Figure 3. Cost-Weight Graph

If we plot the Cost-Weight graph, we get a parabola like the one in Figure 3. Our goal is to minimize the cost, which is done by approaching the global minimum.

In Figure 3, imagine the weights as a ball whose job is to roll down to the global minimum, the point where the derivative of the cost with respect to the weights is zero. Reaching it in turn minimizes our cost. Think of it this way: as the weights of our network approach the global minimum, the network becomes more accurate, meaning the difference between our predicted output and the ground truth becomes small, thus making our cost small.

This was the main intuition behind the backpropagation algorithm. In the next sections, let’s apply this idea in a more mathematical way to a perceptron.

Note: The next two sections go through the mathematics behind backpropagation.

Figure 4. A Perceptron

In Figure 4, assume X is a vector of inputs, W1 (the weights) is a matrix, and Z1 is the output neuron. At training time, there is a forward pass as described earlier. We then use the output of the forward pass to calculate the cost using the formula described earlier. Our final objective is to minimize this cost using backpropagation. The intuition described earlier is the same for a perceptron as well.

Important: Just remember that in backpropagation we calculate how sensitive the cost is to the weights of the network, and then subtract this derivative from the weights until we reach the global minimum.

Here’s a small forward pass and backward pass through the neural network in Figure 4:

Forward Propagation:

Figure 5. Forward Propagation
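In symbols, the forward pass for this perceptron is presumably (again with σ as the activation function and the bias ignored):

\[ Z_1 = W_1 \cdot X, \qquad A_1 = \sigma(Z_1), \qquad C = (A_1 - Y)^2 \]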

Backward Propagation:

Putting the idea of backpropagation from the previous section into more mathematical terms using the chain rule from calculus:

Let C be the cost and A1 our network’s output.

We need to draw a connection between this cost and the weights of the network using the chain rule from calculus.

Figure 6. Chain rule for calculating the partial derivative of the cost with respect to the weights.
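Written out with the chain rule, the equation in Figure 6 reads:

\[ \frac{\partial C}{\partial W_1} = \frac{\partial C}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} \cdot \frac{\partial Z_1}{\partial W_1} \]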

Let’s call this equation 1.

The chain rule goes like this: the cost C depends on A1, which is the activation of the output neuron. A1 in turn depends on Z1, and the derivative of A1 with respect to Z1 is the derivative of the activation function. Finally, Z1 depends on the weights (W1) of the neural network. This is the pathway that links the cost to the weights of the network.

I know you might be scared by this big equation, so let’s break it down and solve it piece by piece:

Recall that our cost is:
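\[ C = (A_1 - Y)^2 \]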

So the derivative of our cost with respect to A1 is:

Figure 8. Derivative of the cost with respect to the output.
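That is:

\[ \frac{\partial C}{\partial A_1} = 2\,(A_1 - Y) \]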

Now, let’s calculate the derivative of the output (A1) with respect to Z1:
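Since A1 = σ(Z1), we get:

\[ \frac{\partial A_1}{\partial Z_1} = \sigma'(Z_1) \]

(For the sigmoid, for instance, this would be σ(Z1)(1 − σ(Z1)).)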

As you can see, the derivative of A1 w.r.t. Z1 is just the derivative of the activation function used.

Now, let’s calculate the derivative of Z1 with respect to W1 (the weights):
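Since Z1 = W1 · X (with no bias), the derivative is just the input itself:

\[ \frac{\partial Z_1}{\partial W_1} = X \]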

Putting the above derivatives together in our equation 1, we get:

Figure 11. Derivative of the cost w.r.t. the weights in a perceptron.
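Substituting the three pieces, the derivative in Figure 11 is:

\[ \frac{\partial C}{\partial W_1} = 2\,(A_1 - Y) \cdot \sigma'(Z_1) \cdot X \]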

That’s it!

Now, the next step is to subtract the derivative of the cost with respect to the weights from the weights of the network. This way we slowly approach the global minimum and our model improves!
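Here is a minimal sketch of that update loop in Python for the perceptron from Figure 4. The sigmoid activation, the learning rate lr that scales each step, and the made-up data are all assumptions for the sake of illustration; the three derivative terms are exactly the ones computed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=3)        # made-up input vector
Y = 1.0                       # made-up ground truth
W1 = rng.normal(size=3)       # randomly initialized weights
lr = 0.1                      # learning rate (step size)

for step in range(100):
    # Forward pass
    Z1 = W1 @ X               # Z1 = W1 . X (no bias)
    A1 = sigmoid(Z1)          # network output
    cost = (A1 - Y) ** 2

    # Backward pass: dC/dW1 = dC/dA1 * dA1/dZ1 * dZ1/dW1
    dC_dA1 = 2 * (A1 - Y)
    dA1_dZ1 = A1 * (1 - A1)   # derivative of the sigmoid
    dZ1_dW1 = X
    dC_dW1 = dC_dA1 * dA1_dZ1 * dZ1_dW1

    # Subtract the (scaled) derivative from the weights
    W1 = W1 - lr * dC_dW1

print(cost)                   # the cost shrinks as training progresses
```

Note the learning rate: in practice the derivative is scaled by a small factor before it is subtracted, so the “ball” from Figure 3 rolls towards the minimum in small steps instead of overshooting it.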

Now, let’s extend this idea to Artificial Neural Networks. The only difference between the backpropagation algorithm for a perceptron and for ANNs is that we now need to apply the above concept to multiple layers, each with more than one neuron.

You might be wondering whether backpropagation becomes more complicated now that the network is bigger. Well, not much! It’s just a couple of indices we need to keep track of.

Let’s assume an ANN like this one:

Figure 12. Three-Layer Artificial Neural Network

The image looks messy, so pardon me for that.

Let’s do the same thing we did for our perceptron model: forward propagation, calculating the cost, and backpropagation.

Forward propagation through the neural network in Figure 12:

Figure 13. Forward Propagation
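If the image is hard to read, the vectorized forward pass presumably looks like this (with σ as the activation function, biases left out for brevity, and A[l] being the network output A3):

\[ Z^{[l-2]} = W^{[l-2]} X, \quad A^{[l-2]} = \sigma(Z^{[l-2]}), \quad Z^{[l-1]} = W^{[l-1]} A^{[l-2]}, \quad A^{[l-1]} = \sigma(Z^{[l-1]}), \quad Z^{[l]} = W^{[l]} A^{[l-1]}, \quad A^{[l]} = \sigma(Z^{[l]}) \]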

Note that this is a vectorized implementation. For example, A1 is a vector that contains the activations of all the neurons in layer l-2 (i.e. A[l-2]), and so on. Even when we calculate Z[l-2], we multiply the weight matrix of that layer with the input X (a matrix-vector product rather than a single scalar multiplication). The same applies to the other layers in the ANN.

Calculating the cost:

Figure 14. Cost

Our cost will be the same as before.

Let C be the cost and A3 our network’s output. Time to define a chain rule for calculating the derivative of the cost with respect to the weights connected to the output (i.e. W[l]):

Figure 15. Backpropagation chain rule
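In symbols, this chain rule presumably reads:

\[ \frac{\partial C}{\partial W^{[l]}} = \frac{\partial C}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}} \]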

Let’s call it equation 1. Now, you may have noticed that it’s the same as our perceptron chain rule; what is different is that we now have indices to keep track of. The derivatives are the same as in the perceptron model, so we are not going to derive them again. But that’s not all!

The above chain rule only calculates the derivative of the cost with respect to the weights connected to the output layer. What about the weights deeper in the network (i.e. those not directly connected to the output layer)? So, let’s look at the chain rule for calculating the derivatives of the cost w.r.t. the weights that are not connected to the output layer.

The chain rule is:

Figure 16. Chain rule for weights not directly connected to the output
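Written out, the chain rule in Figure 16 presumably reads:

\[ \frac{\partial C}{\partial W^{[l-1]}} = \frac{\partial C}{\partial A^{[l-1]}} \cdot \frac{\partial A^{[l-1]}}{\partial Z^{[l-1]}} \cdot \frac{\partial Z^{[l-1]}}{\partial W^{[l-1]}} \]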

Let this be equation 2. In the equation above, the second and third terms are the same as before; the only difference is the indices. We are going to discuss how to calculate the first term of the equation, since the cost and this activation are not directly connected: the cost is not a direct function of the activation in layer l-1.

Figure 17. Calculating the first term from equation 2.
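Expanding that first term with another application of the chain rule, Figure 17 presumably shows:

\[ \frac{\partial C}{\partial A^{[l-1]}} = \frac{\partial C}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial A^{[l-1]}} \]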

Let’s call this equation 3.

If we compare equation 3 with equation 1, we see that the first two terms are the same; only the third term is different. So we are not going to calculate those two terms again. Let’s calculate the third term of the above equation:

Recall that Z[l] = W[l]·A[l-1] + b. Therefore, differentiating Z[l] w.r.t. A[l-1] gives us just W[l]:

Figure 18. Derivative of Z[l] w.r.t. A[l-1].

Therefore, equation 3 gives us:

Figure 19. Solving equation 3.
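That is:

\[ \frac{\partial C}{\partial A^{[l-1]}} = \frac{\partial C}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot W^{[l]} \]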

Now we can calculate the derivative of the cost w.r.t. the weights that are not directly connected to the output neurons, in the same way we did for the weights that are directly connected to the output neurons.

Similarly, we can calculate the derivative of the cost w.r.t. the weights W[l-2].
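To tie everything together, here is a minimal NumPy sketch of one forward and one backward pass through a three-layer network like the one in Figure 12. The layer sizes, the sigmoid activation everywhere, the learning rate, and the absence of biases are all my own assumptions for the sake of the example; the gradient expressions are the equations derived above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 1))    # input vector (4 features)
Y = np.array([[1.0]])          # made-up ground truth

# Randomly initialized weight matrices, no biases
W1 = rng.normal(size=(5, 4))   # layer l-2
W2 = rng.normal(size=(3, 5))   # layer l-1
W3 = rng.normal(size=(1, 3))   # layer l (output)
lr = 0.1

# Forward propagation (vectorized matrix products, as in Figure 13)
Z1 = W1 @ X;  A1 = sigmoid(Z1)
Z2 = W2 @ A1; A2 = sigmoid(Z2)
Z3 = W3 @ A2; A3 = sigmoid(Z3)          # network output
cost = np.sum((A3 - Y) ** 2)

# Backward propagation
dC_dA3 = 2 * (A3 - Y)
dC_dZ3 = dC_dA3 * A3 * (1 - A3)         # dC/dA3 * dA3/dZ3
dC_dW3 = dC_dZ3 @ A2.T                  # weights connected to the output (equation 1)

dC_dA2 = W3.T @ dC_dZ3                  # pass the error back through W3 (equation 3)
dC_dZ2 = dC_dA2 * A2 * (1 - A2)
dC_dW2 = dC_dZ2 @ A1.T                  # weights one layer deeper (equation 2)

dC_dA1 = W2.T @ dC_dZ2                  # and once more for the first layer
dC_dZ1 = dC_dA1 * A1 * (1 - A1)
dC_dW1 = dC_dZ1 @ X.T

# Gradient step: subtract the scaled derivatives from the weights
W3 -= lr * dC_dW3
W2 -= lr * dC_dW2
W1 -= lr * dC_dW1
```

Repeating the forward and backward pass many times drives the cost down, which is exactly what “train the model for a longer period of time” means in practice.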

Don’t be demotivated if you did not understand it the first time; you are not alone. I would love to have a discussion in the comment section :D

I remember when I was trying to work through the math behind backpropagation, I did not understand a single equation because I knew nothing about differential calculus! That was about 7 months ago, I guess.

But today I am proud to say that I taught myself differentiation and then made my way through the algorithm. I hope I was able to add some value to your knowledge.

That’s it for this story. Congratulations on learning something new!
