### What is Linear Regression?

Linear regression is a statistical method for modelling the relationship between an independent variable and a dependent variable as a straight line.

To understand the theory, I am going to use a small data set. It records the number of hours each student studied and the marks they scored in the exam.

| Hours | Marks |
|-------|-------|
| 3     | 35    |
| 4     | 50    |
| 5     | 45    |
| 6     | 64    |
| 7     | 66    |
| 8     | 70    |

In this table, Hours is the independent variable and Marks is the dependent variable. Let's assume we only have the dependent variable (Marks) and, for some reason, the Hours data was not collected. We are asked to represent the given data so that we can predict the marks scored by other students. With nothing else to go on, the best we can do is fit a horizontal line at the average value of Marks.

The average of the marks is 55, so this is the constant line y=55. We will use it as a reference to check how much better a line fitted with linear regression does. To compare the lines we use a metric called the "Sum of Squared Errors" (SSE).

### How to Calculate SSE?

For each data point, we find the difference between the actual value (for example, 35) and the predicted value (our average, 55). This difference is called the "error" (for example, 35 - 55 = -20). We then square this error ((-20) × (-20) = 400). This process is repeated for every data point, and finally all the squared errors are added together. The result is the SSE (Sum of Squared Errors).

Please refer to the table below to follow the calculations.

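| Hours | Marks    | error = Marks - 55 | error²    |
|-------|----------|--------------------|-----------|
| 3     | 35       | -20                | 400       |
| 4     | 50       | -5                 | 25        |
| 5     | 45       | -10                | 100       |
| 6     | 64       | 9                  | 81        |
| 7     | 66       | 11                 | 121       |
| 8     | 70       | 15                 | 225       |
|       | Avg = 55 |                    | SSE = 952 |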

So the Sum of Squared Errors (SSE) for the simple constant line fitted to our data points is 952.

Once we fit the best line using the independent variable (Hours), the SSE should drop substantially. If the SSE did not go down, the new line would be doing a poor job of fitting our data.
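The baseline calculation above can be reproduced in a few lines of Python. This is only a minimal sketch; the variable names (`marks`, `average`, `baseline_sse`) are illustrative choices and not part of the original walkthrough.

```python
# Marks scored by the six students in the table above.
marks = [35, 50, 45, 64, 66, 70]

# The constant-line model predicts the average mark for every student.
average = sum(marks) / len(marks)

# error = actual - predicted; square each error and add them all up.
squared_errors = [(y - average) ** 2 for y in marks]
baseline_sse = sum(squared_errors)

print(f"average = {average}")       # average = 55.0
print(f"SSE     = {baseline_sse}")  # SSE     = 952.0
```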

Linear regression uses the slope-intercept form of a line. The equation of a line in slope-intercept form is given by:

y=mx+b

'x' is our independent variable (Hours) and 'y' is our dependent variable (Marks).

'm' is the slope. It measures how steep the line is.

'b' is the intercept. It tells us where the line crosses the y-axis.

The whole idea of linear regression is to find the combination of slope (m) and intercept (b) that minimizes the SSE. We are going to use a method called "Ordinary Least Squares" to calculate the slope and intercept.
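To make "finding the combination that minimizes the SSE" concrete, here is a small sketch that treats the SSE as a function of m and b and evaluates it for a few lines. The helper `sse` and the candidate lines are hypothetical, picked only for illustration; the first candidate (m=0, b=55) is the flat average line from earlier.

```python
hours = [3, 4, 5, 6, 7, 8]
marks = [35, 50, 45, 64, 66, 70]

def sse(m, b):
    """Sum of squared errors of the line y = m*x + b on the data above."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(hours, marks))

# Hypothetical candidate lines, chosen only for illustration.
candidates = [(0.0, 55.0), (10.0, 0.0), (5.0, 27.5)]
for m, b in candidates:
    print(f"m = {m:4.1f}, b = {b:4.1f} -> SSE = {sse(m, b):.1f}")
```

The three candidates give SSE values of 952.0, 282.0 and 179.5, so different choices of m and b really do fit the data better or worse. Linear regression looks for the pair with the smallest SSE of all.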

### Ordinary Least Squares Method

Let's consider the equation of a line:

y=mx+b

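In the Ordinary Least Squares method, the formula for the slope (m) is given by:

$$m=\frac{\sum (x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum (x_{i}-\bar{x})^2}$$

and the formula for the intercept (b) is given by:

$$b=\bar{y}-m*\bar{x}$$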

Here $$x_{i}$$ is our independent variable (Hours)

$$\bar{x}$$ is the average of the independent variable (Hours)

$$y_{i}$$ is our dependent variable (Marks)

$$\bar{y}$$ is the average of the dependent variable (Marks)

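| Hours (x) | Marks (y) | $$x_{i}-\bar{x}$$ (= $$x_{i}-5.5$$) | $$y_{i}-\bar{y}$$ (= $$y_{i}-55$$) | $$(x_{i}-\bar{x})(y_{i}-\bar{y})$$ | $$(x_{i}-\bar{x})^2$$ |
|-----------|-----------|------|-----|------|------|
| 3 | 35 | -2.5 | -20 | 50   | 6.25 |
| 4 | 50 | -1.5 | -5  | 7.5  | 2.25 |
| 5 | 45 | -0.5 | -10 | 5    | 0.25 |
| 6 | 64 | 0.5  | 9   | 4.5  | 0.25 |
| 7 | 66 | 1.5  | 11  | 16.5 | 2.25 |
| 8 | 70 | 2.5  | 15  | 37.5 | 6.25 |
| $$\bar{x}=5.5$$ | $$\bar{y}=55$$ | | | $$\sum=121$$ | $$\sum=17.5$$ |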

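Substituting the two sums from the table gives the slope:

$$m=\frac{121}{17.5}\approx 6.914$$

and, using this rounded slope, the intercept:

$$b=55-6.914*5.5\approx 16.973$$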

Thus the equation of our line can be written as:

$$y=6.914x+16.973$$
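The slope and intercept can also be computed directly from the data. The sketch below follows the Ordinary Least Squares steps from the table; the names `s_xy` and `s_xx` are illustrative labels for the two column sums. One thing to be aware of: without rounding the slope first, the intercept comes out as roughly 16.971 rather than 16.973, because the figure in the text was computed from the slope already rounded to 6.914.

```python
hours = [3, 4, 5, 6, 7, 8]
marks = [35, 50, 45, 64, 66, 70]

# Averages of the independent and dependent variables.
x_bar = sum(hours) / len(hours)   # 5.5
y_bar = sum(marks) / len(marks)   # 55.0

# Column sums from the table: sum of products and sum of squared x deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))  # 121.0
s_xx = sum((x - x_bar) ** 2 for x in hours)                          # 17.5

m = s_xy / s_xx          # slope
b = y_bar - m * x_bar    # intercept (uses the unrounded slope)

print(f"m = {m:.3f}")  # m = 6.914
print(f"b = {b:.3f}")  # b = 16.971
```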

You can plot this line together with the data points in Excel or any other graphing tool such as GeoGebra to see how closely it follows them.

Now let's calculate our SSE again:

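| Hours (x) | Marks (y) | From equation: yp = 6.914x + 16.973 | error = y - yp | error² |
|-----------|-----------|--------|--------|--------|
| 3 | 35 | 37.715 | -2.715 | 7.371  |
| 4 | 50 | 44.629 | 5.371  | 28.848 |
| 5 | 45 | 51.543 | -6.543 | 42.811 |
| 6 | 64 | 58.457 | 5.543  | 30.725 |
| 7 | 66 | 65.371 | 0.629  | 0.396  |
| 8 | 70 | 72.285 | -2.285 | 5.221  |
|   |    |        |        | SSE ≈ 115.37 |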
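The same check can be scripted. The sketch below plugs the rounded coefficients from the fitted line (6.914 and 16.973) into the equation, recomputes the SSE, and compares it with the baseline from the constant line y=55. The variable names are illustrative, and the result should come out at roughly 115.

```python
hours = [3, 4, 5, 6, 7, 8]
marks = [35, 50, 45, 64, 66, 70]

# Rounded slope and intercept of the fitted line y = 6.914x + 16.973.
m, b = 6.914, 16.973

# Predicted marks from the line, then squared errors against the actual marks.
predicted = [m * x + b for x in hours]
fitted_sse = sum((y - yp) ** 2 for y, yp in zip(marks, predicted))

# Baseline: every prediction is the average mark, 55.
baseline_sse = sum((y - 55) ** 2 for y in marks)

print(f"fitted SSE   = {fitted_sse:.2f}")
print(f"baseline SSE = {baseline_sse}")
```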

The SSE has dropped from 952 to about 115, so linear regression has done a pretty good job of fitting the line to our data points. 🙂