In the last section (Linear Regression - Theory), we saw how to find the best combination of slope and intercept (m, b) to fit a line to our data points. The method we used was called Ordinary Least Squares.

     I am just going to call (m, b) the weights for the sake of explanation.

 

[Figure: linear regression line fitted by ordinary least squares]

What is Gradient Descent?

     Let me take the same example from our Linear Regression section: fitting the best line to represent our data points. All we need to fit our line is the slope and intercept (m, b). So there are infinitely many combinations of m and b (an infinite number of lines) that could fit our data. Which one is best?

     The best line is the one with the least SSE (Sum of Squared Errors). Please refer to my previous section (Linear Regression - Theory) to see what SSE is and how to calculate it.

Let me explain Gradient Descent (as simply as I can)

     Gradient Descent is an algorithm that iterates through different combinations of weights (m and b in our case) in an efficient way, in order to find the combination of weights that gives the minimum error.

     I am now going to introduce a term called the cost function (also known as the loss function).

     For a data point, let \(y_{a}\) be the actual value and \(y_{p}\) the predicted value. The error is then given by:

\(error=y_{a}-y_{p}\)

\(error^2=(y_{a}-y_{p})^2\)

So for all the data points, we need to calculate \(error^2\) and add it all up. This can be represented as:

\(\sum(y_{a}-y_{p})^2\)

This is called the cost function or loss function:

\(J(m,b)=\sum(y_{a}-y_{p})^2\)

Now, our goal is to find the (m, b) that minimizes this cost function.
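To make this concrete, here is a minimal Python sketch of this cost function; the function name and the small data set are placeholders I am assuming purely for illustration.

```python
# Sketch of the cost function J(m, b) = sum of (y_actual - y_predicted)^2.
# The data below is made up purely for illustration.

def cost(m, b, xs, ys):
    """Sum of squared errors for the line y = m*x + b over all data points."""
    return sum((y_a - (m * x + b)) ** 2 for x, y_a in zip(xs, ys))

xs = [1, 2, 3, 4]          # hypothetical inputs
ys = [2, 4, 6, 8]          # hypothetical actual values (they lie on y = 2x)

print(cost(2, 0, xs, ys))  # 0  -> the perfect line has zero cost
print(cost(1, 0, xs, ys))  # 30 -> a worse line gives a larger cost
```

Different choices of (m, b) give different cost values, and gradient descent is the tool we will use to search for the pair with the smallest one.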

To explain how gradient descent works, I am going to consider a cost function with only one variable, x:

\(J(x)=x^2\)

So now our goal is to find x which minimizes this cost function.

Our cost function is just a parabola (\(y=x^2\)). Just by looking at the graph, we can see that the value of x where the cost function is minimum is x = 0. This was easy to find since we are considering a simple function. In reality, we have functions with many variables, which represent more dimensions.

 

 

[Figure: gradient descent on the parabola \(J(x)=x^2\)]

     Now, since we know our answer should be x = 0, let's find the same using the gradient descent algorithm:

1.) The first step is making a random initial guess for the value of x. Let \(x_{i}=5\) be our initial guess. The gradient descent algorithm takes this input and gives the next guess. Let's compute the value of our cost function for our initial guess:

\(J(x)=x^2=5^2=25\)

So the value is 25. We will now use the gradient descent algorithm to find the next guess of x, one that reduces this value.

2.) The equation for the next guess is given by:

\(x_{i+1}=x_{i} - \alpha \frac{\mathrm{d} }{\mathrm{d} x} J(x)\)

In this equation, \(\alpha\) is called the learning rate. It controls how far the next guess moves away from the current guess. For example, if our current guess is 10, our next guess might be 9.9 if we choose a low learning rate, or 8 if we choose a high learning rate. For our example, let's choose a learning rate of 0.2.

Thus, \(\alpha =0.2\)

\(\frac{\mathrm{d} }{\mathrm{d} x} J(x)\) is the derivative of our cost function. Let's compute it:

\(\frac{\mathrm{d} }{\mathrm{d} x} J(x)=\frac{\mathrm{d} }{\mathrm{d} x} x^{2}=2x\)

So the equation for our next guess can be written as:

\(x_{i+1}=x_{i} - 0.2*2* x_{i}\)

\(x_{i+1}=x_{i} - 0.4 x_{i}\)
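To make the update rule concrete, here is a small Python sketch of a single gradient descent step for this particular cost function (the function name is my own, not from any library):

```python
def next_guess(x, learning_rate=0.2):
    """One gradient descent step for J(x) = x**2, whose derivative is 2*x."""
    gradient = 2 * x
    return x - learning_rate * gradient   # same as x - 0.4*x in this example

print(next_guess(5))   # 3.0 -> matches the hand calculation in the next step
```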

3.) Our initial guess was \(x_{i}=5\), and the value of the cost function for it was 25.

Let's use the gradient descent update to calculate the next guess:

\(x_{i+1}=x_{i}-0.4 x_{i}\)

\(x_{i+1}=5-0.4*5=5-2=3\)

Thus our next guess is 3, and its cost function value is \(3^2=9\).

We have reduced the value of the cost function from 25 to 9.

Now that our guess is 3, we apply the same update again to get the next guess. This process is repeated again and again until we finally reach the value that minimizes our cost function.
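Here is a short Python sketch of that iteration for our toy cost function \(J(x)=x^2\), assuming the same starting guess of 5 and learning rate of 0.2 (the variable names and the fixed number of steps are my own choices):

```python
# Repeat the update x <- x - alpha * dJ/dx for J(x) = x**2.
x = 5.0        # initial guess
alpha = 0.2    # learning rate

for step in range(10):
    gradient = 2 * x             # derivative of x**2 at the current guess
    x = x - alpha * gradient     # gradient descent update
    print(step + 1, round(x, 4), round(x ** 2, 4))

# The guesses shrink 5 -> 3 -> 1.8 -> 1.08 -> ... towards the minimum at x = 0,
# and the cost shrinks 25 -> 9 -> 3.24 -> ... towards 0.
```

In practice, instead of running a fixed number of steps, you would usually stop once the change in x (or in the cost) becomes smaller than some small tolerance.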

This is how the Gradient Descent algorithm works.