Correlation and Linear Regression

The idea of correlation

Bob and Jim work for the same company. Bob drives a Porsche, costing £200,000, and Jim drives an Austin Allegro, costing £9,000. Which man has the greater salary?

In this case, we can reasonably assume that it must be Bob who earns more, as he drives the more expensive car. As he earns a larger salary, the chances are that he can afford a more expensive car. We can't be absolutely certain, of course. It could be that Bob's Porsche was a gift from a friend, or part of the divorce settlement from his wife - or he could have stolen it! However, most of the time, an expensive car means a larger salary.

In this case, we say that there is a correlation between someone's salary and the cost of the car that he/she drives. This means that as one figure changes, we can expect the other to change in a fairly regular way.

Both Bob and Jim are married. Which man has more children?

Ah! This is a different matter. How much a man earns does not influence how many children he has (as far as I know). In this case, we say that there is no correlation between a person's salary and how many children he/she has. As one figure changes, we cannot say how the other figure will change.

Scattergrams

Now let's extend the comparison so that we are comparing several items, not just two. In this case, we won't have to presume that there must be a correlation - we will be able to see whether there is one or not! Here is a table showing the results of two examinations set to students that I teach. I set them a maths exam and an English exam and recorded the scores that they got in both:

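|               | John | Betty | Sarah | Peter | Fiona | Charlie | Tim | Gerry | Martine | Rachel |
|---------------|------|-------|-------|-------|-------|---------|-----|-------|---------|--------|
| Maths score   | 72   | 65    | 80    | 36    | 50    | 21      | 79  | 64    | 44      | 55     |
| English score | 78   | 70    | 81    | 31    | 55    | 29      | 74  | 64    | 47      | 53     |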

We take a piece of graph paper and draw two axes. The horizontal axis will represent the score on the maths exam. The vertical axis will represent the score on the English exam. For each student, we then mark a small dot at the co-ordinates representing their two scores. Below, I have done this:

You can see that the points follow a fairly strong pattern. People who are good at maths tend to be good at English as well. The marks lie fairly close to an imaginary straight line that we can draw on the graph. In the diagram below, I have drawn in this straight line, and also included another point (in red) which I will explain later:

The fact that the points lie close to the straight line is called a strong correlation. The fact that this line slopes upwards to the right - indicating that the English mark tends to increase as the maths mark increases - is called a positive correlation.

Line of Best Fit (Regression Line)

The straight line that we draw through the points is called either the line of best fit or the regression line. It describes the relationship between the two variables (the quantities compared) mathematically. There is a standard way to draw this line to ensure that it fits as closely to the data points as possible. Later on, we will investigate exactly what that mathematical way is. For now, we only have to remember one thing:

The regression line goes through the point whose co-ordinates are the mean values of the variables

The arithmetic means are found by adding the relevant scores, and dividing by 10. This is because there are ten students in the table. We work out the arithmetic mean of the maths scores ...

mean maths score = (72 + 65 + 80 + 36 + 50 + 21 + 79 + 64 + 44 + 55) / 10 = 56.6

... then we work out the arithmetic mean of the English scores ...

mean English score = (78 + 70 + 81 + 31 + 55 + 29 + 74 + 64 + 47 + 53) / 10 = 58.2

and we can be sure that the line must go through the point (56.6, 58.2). This is the point marked in red on the graph above. You will notice that there are roughly the same number of data points lying above this line as there are below it.
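The two means can also be checked with a few lines of Python. This is only a sketch: the list names and the helper `arithmetic_mean` are my own choices for illustration, and the scores are copied from the table above.

```python
# Scores copied from the table above, in the order John ... Rachel.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]


def arithmetic_mean(values):
    """Add the values up and divide by how many there are."""
    return sum(values) / len(values)


mean_maths = arithmetic_mean(maths)
mean_english = arithmetic_mean(english)

# The regression line must pass through this point.
print(f"Mean point: ({mean_maths:.1f}, {mean_english:.1f})")
```

Dividing by `len(values)` rather than a fixed 10 means the same helper still works if students are added to or removed from the table.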

We can use the regression line to make predictions. For instance, what English mark would we expect someone to receive if they received a maths mark of 30? If we look at the straight line, we can see that when the maths mark is 30, the English mark is approximately 28. Similarly, we can assume that anyone who got an English mark of 40 would also get a maths mark of about 40. However, there are limits on the predictions that we can make, as you will see later on.

Negative Correlation

In the following table, I have duplicated the maths marks for the ten students and this time added the number of absences from maths lessons for each student:

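|             | John | Betty | Sarah | Peter | Fiona | Charlie | Tim | Gerry | Martine | Rachel |
|-------------|------|-------|-------|-------|-------|---------|-----|-------|---------|--------|
| Maths score | 72   | 65    | 80    | 36    | 50    | 21      | 79  | 64    | 44      | 55     |
| Absences    | 4    | 6     | 0     | 13    | 8     | 15      | 2   | 3     | 9       | 5      |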

In this case, the scattergram looks like this. I have added the regression line.

Again, there is a good correlation between the maths scores and the absences from maths lessons, except that as the number of absences increases, the maths score goes down. This is referred to as negative correlation. Again, we can use the line of best fit to make predictions. What score would a student have received if he had been absent 10 times? According to the graph, it would have been about 41. If a student received a mark of 30, how many times would you expect him to have been absent? From the graph, it seems to be about 13 times.

However, this graph shows well the limitations of making predictions. What score would someone have received if they had been absent for all 30 maths lessons? According to the graph, the score would be less than zero! Similarly, how many times would a student have had to be absent in order to gain a score of 90? Well, the line hits the horizontal axis when the score is just over 80, so in order to get a score of 90, a student would have to be absent a negative number of times. Clearly, these conclusions are stupid, and they lead us to another general principle:

You can only use linear regression to draw conclusions about values within the range of the data points themselves. You might just be able to get away with drawing conclusions about values just outside that range, but the further away from the data range you move, the less reliable the conclusions become!
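Here is a rough Python sketch of that principle. It assumes that the line on the graph is the least-squares line with the maths score on the horizontal axis and the absences on the vertical axis, and it borrows the gradient and intercept formulae given at the end of this page. The function names are hypothetical.

```python
# Data copied from the maths score / absences table above.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
absences = [4, 6, 0, 13, 8, 15, 2, 3, 9, 5]


def regression_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line y = m*x + c."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = sum_x ** 2 - n * sum_xx
    gradient = (sum_x * sum_y - n * sum_xy) / denominator
    intercept = (sum_x * sum_xy - sum_y * sum_xx) / denominator
    return gradient, intercept


m, c = regression_line(maths, absences)


def predicted_absences(score):
    """Return the predicted absences and whether the score is inside the data range."""
    inside_range = min(maths) <= score <= max(maths)
    return m * score + c, inside_range


for score in (30, 90):
    value, inside_range = predicted_absences(score)
    label = "inside data range" if inside_range else "OUTSIDE data range"
    print(f"score {score}: predicted absences {value:.1f} ({label})")
```

With these figures, a score of 30 gives roughly 13 absences, in line with the reading taken from the graph, while a score of 90 lies outside the data and gives a negative number of absences. Returning the range check alongside the prediction is one simple way to flag results that should not be trusted.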

No correlation

Finally, one more table, this time showing the English marks compared with the average length of time the students spend travelling to college each morning, recorded in minutes.

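|                | John | Betty | Sarah | Peter | Fiona | Charlie | Tim | Gerry | Martine | Rachel |
|----------------|------|-------|-------|-------|-------|---------|-----|-------|---------|--------|
| English score  | 78   | 70    | 81    | 31    | 55    | 29      | 74  | 64    | 47      | 53     |
| Time (minutes) | 12   | 32    | 19    | 31    | 30    | 15      | 22  | 10    | 17      | 16     |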

In this case, the scattergram shows no particular pattern. It is clear that we can't draw a straight line anywhere near the data points, and we say that there is no correlation between the length of time taken to travel to college and the final English mark that a student gets. We cannot predict the English mark of any student based on how long it takes him to get to college. Nor can we predict how long it takes a student to get to college given that student's English mark.

Non-linear correlations

A bus company wanted to discover if there was any relationship between the number of buses it ran and the number of complaints it received. It carried out a survey testing the average number of buses per hour for different days, and the number of complaints that it received on those days. Here are the results:

As you can see, there is a negative correlation between the number of buses per hour and the number of complaints, but in this case, a curved line fits the data better than a straight line. We are about to investigate the rule that lets you fit a straight line to the data points - it is enough to say at this point that similar rules exist which let you fit various curved lines to the data points as well.

Correlation and Cause

Just because two variables are correlated does not mean that one of the variables is the cause of the other. It could be the case, but it does not necessarily follow.

Although a correlation between two variables doesn't mean that one of them causes the other, it can suggest a way of finding out what the true cause might be. There may be some underlying variable that is causing both of them. For instance, if a survey found that there is a correlation between the time that people spend watching television and the amount of crime that people commit, it could be because unemployed people tend to sit around watching the television, and that unemployed people are more likely to commit crime. If that were the case, then unemployment would be the true cause!

Calculating the Correlation Coefficient

We can see by looking at the graph whether there is a strong or weak correlation between two variables, and whether that correlation is positive or negative. However, there is a mathematical way of working it out, and that is to calculate the correlation coefficient. This is also known as Pearson's Correlation Coefficient (Don't ask me who Pearson was!), represented by the letter r, and it is a single number which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). Correlation coefficients which are close to -1 or +1 indicate a strong correlation. Values close to 0 indicate a weak correlation, with 0 itself indicating no linear correlation at all.

Here is how we calculate the correlation coefficient:

In the table below, I have taken the maths and the English marks that you saw above and, for each student, worked out the square of each mark (X² and Y²) and the product of the two marks (XY). I have called the maths marks the "X" values, and the English marks the "Y" values.

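|       | Maths marks (X) | English marks (Y) | X²    | Y²    | XY    |
|-------|-----------------|-------------------|-------|-------|-------|
|       | 72              | 78                | 5184  | 6084  | 5616  |
|       | 65              | 70                | 4225  | 4900  | 4550  |
|       | 80              | 81                | 6400  | 6561  | 6480  |
|       | 36              | 31                | 1296  | 961   | 1116  |
|       | 50              | 55                | 2500  | 3025  | 2750  |
|       | 21              | 29                | 441   | 841   | 609   |
|       | 79              | 74                | 6241  | 5476  | 5846  |
|       | 64              | 64                | 4096  | 4096  | 4096  |
|       | 44              | 47                | 1936  | 2209  | 2068  |
|       | 55              | 53                | 3025  | 2809  | 2915  |
| Total | 566             | 582               | 35344 | 36962 | 36046 |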

The last line of the table shows the total values for the columns (the $\Sigma$ values).
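If you would rather not do the squaring and multiplying by hand, the columns and their totals can be built in Python. This is a minimal sketch; the variable names are mine, and the marks are copied from the table.

```python
# X = maths marks, Y = English marks, copied from the table above.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]

# One row per student: X, Y, X squared, Y squared, X times Y.
rows = [(x, y, x * x, y * y, x * y) for x, y in zip(maths, english)]

# zip(*rows) turns the rows into columns, so each column can be summed.
totals = [sum(column) for column in zip(*rows)]

for row in rows:
    print(*row, sep="\t")
print("Totals:", *totals, sep="\t")
```

The totals printed on the last line should agree with the last line of the table.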

We then calculate three values from these total values, which we will then use to calculate the correlation coefficient. The three values that we need are the variance of X (written as $S_X^2$), the variance of Y (written as $S_Y^2$) and the covariance of X with Y (written as $S_{XY}^2$). Here are the formulae:

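Variance of X:

$$S_X^2 = \frac{\sum X^2}{N} - \frac{(\sum X)^2}{N^2}$$

Variance of Y:

$$S_Y^2 = \frac{\sum Y^2}{N} - \frac{(\sum Y)^2}{N^2}$$

Covariance of X with Y:

$$S_{XY}^2 = \frac{\sum XY}{N} - \frac{\sum X \sum Y}{N^2}$$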

These formulae may look horrible, but they are all just variations on the same formula. In fact, if you took the formula for the covariance and substituted Xs for all the Ys, then you would get the variance for X. Similarly, if you substituted Ys for all the Xs in the covariance formula, then you get the variance of Y. Effectively, this means that you only have one formula to learn.
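The "one formula" idea carries over neatly into code. In the sketch below (the function names are my own), the variance is computed simply by calling the covariance function with the same list twice.

```python
# Marks copied from the table above (X = maths, Y = English).
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]


def covariance(xs, ys):
    """Covariance as defined above: sum(XY)/N - sum(X)*sum(Y)/N^2."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sum_xy / n - (sum(xs) * sum(ys)) / n ** 2


def variance(values):
    """Variance is the covariance of a variable with itself."""
    return covariance(values, values)


var_x = variance(maths)
var_y = variance(english)
cov_xy = covariance(maths, english)

print(f"variance of X   = {var_x:.2f}")
print(f"variance of Y   = {var_y:.2f}")
print(f"covariance X, Y = {cov_xy:.2f}")
```

For the marks in the table, this should give 330.84, 308.96 and 310.48 respectively.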

Having got these values, we are now one step away from calculating the correlation coefficient itself, r. Here it is:

Correlation coefficient:

$$r = \frac{S_{XY}^2}{\sqrt{S_X^2 \cdot S_Y^2}}$$

Providing you have done the calculations correctly, this is guaranteed to lie within the range -1 to 1.
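To see the coefficient at work, here is a sketch that applies the formula to all three tables from earlier on this page. The function names and the list name `travel_minutes` are hypothetical; the data are copied from the tables.

```python
import math

# Data copied from the three tables earlier on this page.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
absences = [4, 6, 0, 13, 8, 15, 2, 3, 9, 5]
travel_minutes = [12, 32, 19, 31, 30, 15, 22, 10, 17, 16]


def covariance(xs, ys):
    """sum(XY)/N - sum(X)*sum(Y)/N^2; gives the variance when xs is ys."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sum_xy / n - (sum(xs) * sum(ys)) / n ** 2


def correlation(xs, ys):
    """Correlation coefficient: covariance over the root of the two variances."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


r_maths_english = correlation(maths, english)
r_maths_absences = correlation(maths, absences)
r_english_travel = correlation(english, travel_minutes)

print(f"maths vs English:       r = {r_maths_english:+.3f}")
print(f"maths vs absences:      r = {r_maths_absences:+.3f}")
print(f"English vs travel time: r = {r_english_travel:+.3f}")
```

By my reckoning the three results are about +0.97, -0.96 and -0.15, which matches what the scattergrams suggested: a strong positive correlation, a strong negative correlation, and next to none.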

The Coefficient of Determination

Another figure that is useful is the coefficient of determination. This is written as $r^2$ and is found by squaring the correlation coefficient. Because the correlation coefficient must be in the range -1 to +1, and a square can never be negative, the coefficient of determination must be in the range 0 to +1.

You may well be asked what these coefficients mean. Well...

The correlation coefficient indicates whether there is a relationship between the two variables, how strong that relationship is, and whether it is positive or negative.

The coefficient of determination tells you what proportion of the variation between the data points is explained or accounted for by the best fit line fitted to the points. It indicates how close the points are to the line.
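As a rough worked example (my own arithmetic from the column totals in the table above, with N = 10, so do check it by hand), the maths and English marks give:

```latex
\begin{align*}
S_X^2 &= \frac{35344}{10} - \frac{566^2}{10^2} = 3534.4 - 3203.56 = 330.84 \\
S_Y^2 &= \frac{36962}{10} - \frac{582^2}{10^2} = 3696.2 - 3387.24 = 308.96 \\
S_{XY}^2 &= \frac{36046}{10} - \frac{566 \times 582}{10^2} = 3604.6 - 3294.12 = 310.48 \\
r &= \frac{310.48}{\sqrt{330.84 \times 308.96}} \approx 0.971 \\
r^2 &\approx 0.943
\end{align*}
```

On these figures, roughly 94% of the variation in the marks is accounted for by the best fit line, which agrees with how closely the points hug the line on the scattergram.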

Calculating the Equation of the Regression Line

The regression line is defined by two numbers - the gradient and the intercept on the vertical axis of the line that best fits those points. I always refer to the gradient of the line as m and the intercept as c, which gives the equation of the regression line as y = mx + c. However, in newer textbooks, you will see the gradient referred to as b and the intercept as a, which gives the equation y = a + bx.

Anyway, whatever you call them, here are the formulae that calculate the gradient and the intercept:

$$\text{gradient} = \frac{\sum X \sum Y - N \sum XY}{(\sum X)^2 - N \sum X^2}$$

$$\text{intercept} = \frac{\sum X \sum XY - \sum Y \sum X^2}{(\sum X)^2 - N \sum X^2}$$

You will notice that the denominator is the same in each case. The formulae use terms which we have already calculated in order to get the correlation coefficient. You'll notice, in fact, that we don't even need to calculate a $\sum Y^2$ value!
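Here is a sketch of the two formulae in Python, applied to the maths (X) and English (Y) marks. The function name `regression_line` is my own; the arithmetic follows the formulae above term by term, including the shared denominator.

```python
# Marks copied from the table above (X = maths, Y = English).
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
english = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]


def regression_line(xs, ys):
    """Return (gradient, intercept) using the two formulae above."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = sum_x ** 2 - n * sum_xx  # the same in both formulae
    gradient = (sum_x * sum_y - n * sum_xy) / denominator
    intercept = (sum_x * sum_xy - sum_y * sum_xx) / denominator
    return gradient, intercept


m, c = regression_line(maths, english)
print(f"y = {m:.3f}x + {c:.3f}")

# Check the earlier rule: the line passes through the point of the means.
mean_x = sum(maths) / len(maths)
mean_y = sum(english) / len(english)
print(f"At x = {mean_x:.1f}, the line gives y = {m * mean_x + c:.1f}")
```

For these marks I make the gradient about 0.938 and the intercept about 5.08. The final check confirms that the line goes through (56.6, 58.2), the point whose co-ordinates are the two means.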

