What is Linear Regression?

     Linear regression is a statistical method for modeling the relationship between an independent variable and a dependent variable by fitting a straight line to the data.

     To understand the theory, I am going to use some data: the number of hours students studied and the marks they scored in the exam.

| Hours | Marks |
|-------|-------|
| 3 | 35 |
| 4 | 50 |
| 5 | 45 |
| 6 | 64 |
| 7 | 66 |
| 8 | 70 |

     In this table, Hours is the independent variable and Marks is the dependent variable. Let's assume for a moment that we only have the dependent variable (Marks) and that, for some reason, the Hours data was never collected. We are asked to represent the given data so that we can predict the marks scored by other students. Without an independent variable, the best we can do is fit a horizontal line at the average of the marks, which is 55.

[Figure: Linear Regression Theory, showing the marks with the constant line y=55]

     We have just plotted the constant line y=55. We will use this line as a reference to check how much better a line fitted with linear regression can do. To compare the lines we use a metric called the "Sum of Squared Errors" (SSE).

How to Calculate SSE?

     For each data point, we find the difference between the actual value (for example, 35) and the predicted value (here, our average value of 55). This difference is called the "error" (for example, 35-55=-20). We then square this error ((-20)*(-20)=400). This process is repeated for every data point. Finally, we add up all the squared errors, which gives the SSE (sum of squared errors).
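     The same steps can be written as a single formula. Here \(\hat{y}_{i}\) stands for the predicted value of data point i and n for the number of data points; both symbols are introduced only for this illustration.

```latex
SSE = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^2
```

     For the constant line every \(\hat{y}_{i}\) equals 55, so the first term of the sum is \((35-55)^2 = 400\).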

     Please refer to the table below to follow the calculations.

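| Hours | Marks | error = Marks - 55 | error² |
|-------|-------|--------------------|--------|
| 3 | 35 | -20 | 400 |
| 4 | 50 | -5 | 25 |
| 5 | 45 | -10 | 100 |
| 6 | 64 | 9 | 81 |
| 7 | 66 | 11 | 121 |
| 8 | 70 | 15 | 225 |
| | Avg = 55 | | SSE = 952 |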

 

     So the Sum of Squared Errors (SSE) for a simple constant line fitted to our data points is 952.
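     The baseline calculation can be reproduced with a few lines of Python. This is only a minimal sketch; the variable names (`marks`, `mean_marks`, `sse_baseline`) are chosen here for illustration.

```python
# Only the marks are needed for the baseline; Hours is ignored on purpose.
marks = [35, 50, 45, 64, 66, 70]

# The baseline prediction is the average of the marks.
mean_marks = sum(marks) / len(marks)

# Error for each student = actual marks - predicted (average) marks.
errors = [y - mean_marks for y in marks]

# Square every error and add the squares up to get the SSE.
sse_baseline = sum(e ** 2 for e in errors)

print(mean_marks)     # 55.0
print(errors)         # [-20.0, -5.0, -10.0, 9.0, 11.0, 15.0]
print(sse_baseline)   # 952.0
```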

     After we fit the best line by doing Linear Regression with the independent variable (Hours), the SSE should drop drastically. If the SSE were to increase instead, we would be doing a very bad job of fitting the line to our data.

     Linear Regression uses the slope-intercept form of a line. The equation of a line in slope-intercept form is given by:

y=mx+b

'y' is our dependent variable.

'x' is our independent variable.

'm' is the slope. It measures how steep the line is.

'b' is the intercept. It tells us where the line crosses the y-axis.

     The whole idea of Linear Regression is to find the combination of slope (m) and intercept (b) that minimizes the SSE. We are going to use a method called 'Ordinary Least Squares' to calculate the slope and intercept.
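     To see what "minimizing the SSE" means in practice, the SSE can be treated as a function of m and b. The sketch below assumes a helper named `sse` (a hypothetical name) and tries two lines: the flat average line, and a sloped line whose values m=7, b=16 are an arbitrary hand-picked guess, not a computed result.

```python
hours = [3, 4, 5, 6, 7, 8]
marks = [35, 50, 45, 64, 66, 70]

def sse(m, b):
    """Sum of squared errors of the line y = m*x + b on the data above."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(hours, marks))

# A flat line at the average (m = 0, b = 55) reproduces the baseline SSE.
print(sse(0, 55))   # 952

# A hand-picked sloped line (an arbitrary guess) already does far better.
print(sse(7, 16))   # 117
```

     Every choice of (m, b) gives a different SSE, and linear regression looks for the pair that makes this number as small as possible.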

Ordinary Least Squares Method

     Let's consider the equation of a line:

y=mx+b

In the Ordinary Least Squares method, the formulas for the slope (m) and the intercept (b) are given by:

\(m=\frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum(x_{i}-\bar{x})^2}\)

\(b=\bar{y}-m*\bar{x}\)

here \(x_{i}\) is our independent variable (hours)

\(\bar{x}\) is the average of the independent variable (hours)

\(y_{i}\) is our dependent variable (marks)

\(\bar{y}\) is the average of the dependent variable (marks)

 

 

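| Hours (x) | Marks (y) | \(x_{i}-\bar{x}\) = \(x_{i}-5.5\) | \(y_{i}-\bar{y}\) = \(y_{i}-55\) | \((x_{i}-\bar{x})(y_{i}-\bar{y})\) | \((x_{i}-\bar{x})^2\) |
|---|---|---|---|---|---|
| 3 | 35 | -2.5 | -20 | 50 | 6.25 |
| 4 | 50 | -1.5 | -5 | 7.5 | 2.25 |
| 5 | 45 | -0.5 | -10 | 5 | 0.25 |
| 6 | 64 | 0.5 | 9 | 4.5 | 0.25 |
| 7 | 66 | 1.5 | 11 | 16.5 | 2.25 |
| 8 | 70 | 2.5 | 15 | 37.5 | 6.25 |
| \(\bar{x}=\)5.5 | \(\bar{y}=\)55 | | | \(\sum=\)121 | \(\sum=\)17.5 |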

\(m=\frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum(x_{i}-\bar{x})^2}\)

=121/17.5

m=6.914

\(b=\bar{y}-m*\bar{x}\)

b=55 - 6.914*5.5

b=16.973

Thus our line equation can be written as :

y=6.914 \(x\) + 16.973
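The slope and intercept calculation above can be checked with a short Python sketch. The names `s_xy` and `s_xx` are shorthand chosen here for the numerator and denominator of the slope formula.

```python
hours = [3, 4, 5, 6, 7, 8]
marks = [35, 50, 45, 64, 66, 70]

n = len(hours)
x_bar = sum(hours) / n   # average of Hours: 5.5
y_bar = sum(marks) / n   # average of Marks: 55.0

# Numerator of the slope formula: sum of (x_i - x_bar) * (y_i - y_bar)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
# Denominator of the slope formula: sum of (x_i - x_bar)^2
s_xx = sum((x - x_bar) ** 2 for x in hours)

m = s_xy / s_xx          # slope
b = y_bar - m * x_bar    # intercept

print(s_xy, s_xx)                 # 121.0 17.5
print(round(m, 3), round(b, 3))   # 6.914 16.971
```

The intercept prints as 16.971 here because the code keeps the unrounded slope; the value 16.973 above comes from rounding m to 6.914 before computing b. The difference is too small to matter for this example.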

If we plot our line using this equation, it should look something like this (you can plot it in Excel or any other graphing tool such as GeoGebra):

 

[Figure: Linear Regression, ordinary least squares, showing the fitted line through the data points]

Now let's calculate our SSE again:

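| Hours (x) | Marks (y) | From equation: yp = 6.914x + 16.973 | error = y - yp | error² |
|---|---|---|---|---|
| 3 | 35 | 37.715 | -2.715 | 7.371 |
| 4 | 50 | 44.629 | 5.371 | 28.848 |
| 5 | 45 | 51.543 | -6.543 | 42.811 |
| 6 | 64 | 58.457 | 5.543 | 30.725 |
| 7 | 66 | 65.371 | 0.629 | 0.396 |
| 8 | 70 | 72.285 | -2.285 | 5.221 |
| | | | | SSE = 115.37 |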

 

 

     The SSE has dropped from 952 to about 115, so we have done a pretty good job of fitting the line to our data points using Linear Regression. 🙂
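     The comparison can be reproduced with the sketch below, which plugs the rounded coefficients from the line equation (6.914 and 16.973) into the SSE calculation. The variable names are illustrative.

```python
hours = [3, 4, 5, 6, 7, 8]
marks = [35, 50, 45, 64, 66, 70]

# Rounded slope and intercept taken from the fitted line equation.
m, b = 6.914, 16.973

# Predicted marks from the fitted line for each student.
predicted = [m * x + b for x in hours]

# SSE of the fitted line: sum of (actual - predicted)^2.
sse_fit = sum((y - yp) ** 2 for y, yp in zip(marks, predicted))

# SSE of the constant line at the average, for comparison.
mean_marks = sum(marks) / len(marks)
sse_baseline = sum((y - mean_marks) ** 2 for y in marks)

print(round(sse_fit, 2))   # 115.37
print(sse_baseline)        # 952.0
```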