Why subtract learning rate * gradient from the old weight to get the new weight, and not add?

Sep 29, 2020

I have been thinking about this question ever since I started poking around gradient descent and the way weights are updated. So here is my understanding, and hopefully it helps you answer the same question.

Let’s take the example of simple linear regression:

Y = m * X + c

For the sake of simplicity, we will ignore the intercept c and focus only on the weight m. The equation now becomes:

Y = m * X

Gradient descent repeats three primary steps:

1. Forward pass: the prediction is computed.
2. Backward pass: the gradients are computed.
3. Weight update: W_new = W_old - learning_rate * gradient
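The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming the model Y = m * X with an MSE loss; the function name `train`, the toy data, and the number of epochs are all made up for the example.

```python
def train(xs, ys, w=0.0, learning_rate=0.01, epochs=1000):
    """Fit y = w * x by gradient descent on the mean squared error."""
    n = len(xs)
    for _ in range(epochs):
        # 1. Forward pass: compute the predictions with the current weight.
        preds = [w * x for x in xs]
        # 2. Backward pass: derivative of the MSE with respect to w.
        grad = sum(2.0 * (p - y) * x for p, x, y in zip(preds, xs, ys)) / n
        # 3. Weight update: subtract learning_rate * gradient.
        w = w - learning_rate * grad
    return w


# Toy data generated from y = 2 * x, so the ideal weight is 2 (an assumption).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_fit = train(xs, ys)
print(w_fit)
```

On this toy data the weight settles very close to 2, the slope used to generate the points.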

The key objective here is to minimize the loss, which in the case of linear regression is the MSE (Mean Squared Error). In the backward pass we calculate the gradient, which is simply the derivative of the loss with respect to the weight, i.e. dLoss/dW (the weight W here is the slope m from the equation above).
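For reference, the gradient can be written out explicitly. This is a sketch assuming the loss is the MSE over n data points (x_i, y_i) with prediction w * x_i, where w plays the role of the slope m; the symbols n, x_i and y_i are notation introduced here for illustration.

```latex
\[
\text{Loss}(w) = \frac{1}{n}\sum_{i=1}^{n}\left(w x_i - y_i\right)^2
\qquad\Longrightarrow\qquad
\frac{d\,\text{Loss}}{dw} = \frac{2}{n}\sum_{i=1}^{n}\left(w x_i - y_i\right)x_i
\]
```

Because the loss is a quadratic in w, it is bowl-shaped: the derivative is positive to the right of the minimum and negative to the left of it.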

Now let’s quickly look at the following image:

In the diagram, the global cost minimum is the point where the loss is lowest. That is where we want to end up, which means we need to find the weight value at which the cost is minimum. Now let’s consider the weight update equation:

W_new = W_old - learning_rate * (dLoss/dW_old)   # e.g. learning_rate = 0.01

At the initial weight, the gradient (the slope of the curve) is positive. If we subtract learning_rate * gradient, the new weight is smaller than the initial weight, so we move toward the weight where the cost is minimum (the bottom of the curve).

If the initial weight is on the left side instead, the gradient is negative. Subtracting a negative number increases the weight, so again we move toward the weight where the cost is minimum. If we added instead of subtracting, both cases would flip: the weight would move away from the minimum and the loss would grow.
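Both cases can be checked with a few lines of Python. This is a sketch using a made-up one-weight loss, (w - 3)^2, whose minimum sits at w = 3; the loss function, the starting weights 5.0 and 1.0, and the helper names are assumptions chosen for illustration, not part of the regression example.

```python
# Hypothetical bowl-shaped loss with its minimum at w = 3 (an assumption).
def loss(w):
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


learning_rate = 0.01


def subtract_step(w):
    # The usual update: move against the gradient.
    return w - learning_rate * gradient(w)


def add_step(w):
    # The "wrong" update: move along the gradient.
    return w + learning_rate * gradient(w)


# 5.0 lies to the right of the minimum (positive gradient),
# 1.0 lies to the left of it (negative gradient).
for start in (5.0, 1.0):
    down, up = subtract_step(start), add_step(start)
    print(f"start={start}, gradient={gradient(start):+.1f}")
    print(f"  subtract: w={down:.2f}, loss {loss(start):.4f} -> {loss(down):.4f}")
    print(f"  add:      w={up:.2f}, loss {loss(start):.4f} -> {loss(up):.4f}")
```

From either side, subtracting lowers the loss and adding raises it.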

So the gradient gives us the direction (we step against its sign), and the learning rate sets the step size, meaning how fast or how slowly we approach the global minimum.
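The effect of the learning rate can be seen by counting update steps. This is a rough sketch on the same kind of made-up loss, (w - 3)^2; the helper `steps_to_minimum`, the tolerance, and the learning rates tried are hypothetical choices for illustration.

```python
def gradient(w):
    # Derivative of the hypothetical loss (w - 3)^2.
    return 2.0 * (w - 3.0)


def steps_to_minimum(learning_rate, w=5.0, tolerance=1e-3, max_steps=1000):
    """Count updates until w is within `tolerance` of the minimum at 3.

    Returns None if that does not happen within `max_steps` updates.
    """
    for step in range(1, max_steps + 1):
        w = w - learning_rate * gradient(w)
        if abs(w - 3.0) < tolerance:
            return step
    return None


for lr in (0.01, 0.1, 1.1):
    print(f"learning_rate={lr}: steps={steps_to_minimum(lr)}")
```

A small learning rate gets there in many small steps, a larger one in fewer steps, and one that is too large overshoots the minimum on every update and never settles.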

Takeaway: the goal is to minimize the loss and find the right weight, and the sign of the gradient tells us which direction to move in.

I hope this is helpful!

References:

Why do we subtract the slope * a in Gradient Descent?
https://medium.com/@faisalshahbaz/best-optimization-gradient-descent-algorithm-4ca5a3be3776

Written by Dipanwita Mallick