Bob and Jim work for the same company. Bob drives a Porsche, costing £200,000, and Jim drives an Austin Allegro, costing £9,000. Which man has the greater salary?
In this case, we can reasonably assume that it must be Bob who earns more, as he drives the more expensive car. As he earns a larger salary, the chances are that he can afford a more expensive car. We can't be absolutely certain, of course. It could be that Bob's Porsche was a gift from a friend, or part of the divorce settlement from his wife - or he could have stolen it! However, most of the time, an expensive car means a larger salary.
In this case, we say that there is a correlation between someone's salary and the cost of the car that he/she drives. This means that as one figure changes, we can expect the other to change in a fairly regular way.
Both Bob and Jim are married. Which man has more children?
Ah! This is a different matter. How much a man earns does not influence how many children he has (as far as I know). In this case, we say that there is no correlation between a person's salary and how many children he/she has. As one figure changes, we cannot say how the other figure will change.
We take a piece of graph paper and draw two axes. The horizontal axis will represent the score on the English exam. The vertical axis will represent the score on the Maths exam. For each student, we then mark a small dot at the co-ordinates representing their two scores. Below, I have done this:

You can see that the points follow a fairly strong pattern. People who are good at maths tend to be good at English as well. The marks lie fairly close to an imaginary straight line that we can draw on the graph. In the diagram below, I have drawn in this straight line, and also included another point (in red) which I will explain later:

The fact that the points lie close to the straight line is called a strong correlation. The fact that this line points upwards to the right - indicating that the English mark tends to increase as the maths mark increases - is called a positive correlation.
The arithmetic means are found by adding the relevant scores, and dividing by 10. This is because there are ten students in the table. We work out the arithmetic mean of the maths scores ...
... then we work out the arithmetic mean of the English scores ...
and we can be sure that the line must go through the point (56.6, 58.2). This is the point marked in red on the graph above. You will notice that there are roughly the same number of data points lying above this line as there are below it.
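As a minimal sketch of the arithmetic, the two means can be computed as below. The marks listed here are hypothetical, made up for illustration. They are not the ten students' scores from the table, so the result differs from (56.6, 58.2).

```python
# Hypothetical marks for ten students (NOT the data from this page).
maths = [20, 35, 42, 50, 55, 60, 68, 72, 80, 90]
english = [25, 30, 45, 48, 60, 58, 70, 69, 85, 88]


def mean(values):
    """Arithmetic mean: add the values, then divide by how many there are."""
    return sum(values) / len(values)


# The regression line must pass through the point made from the two means.
mean_point = (mean(maths), mean(english))
print(mean_point)  # (57.2, 57.8)
```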
We can use the regression line to make predictions. For instance, what English mark would we expect someone to receive if they received a maths mark of 30? If we look at the straight line, we can see that when the maths mark is 30, the English mark is approximately 28. Similarly, we can assume that anyone who got an English mark of 40 would also get a maths mark of about 40. However, there are limits on the predictions that we can make, as you will see later on.
In this case, the scattergram looks like this. I have added the regression line.
Again, there is a good correlation between the maths scores and the absences from maths lessons, except that as the number of absences increases, the maths score goes down. This is referred to as negative correlation. Again, we can use the line of best fit to make predictions. What score would a student have received if he had been absent 10 times? According to the graph, it would have been about 41. If a student received a mark of 30, how many times would you expect him to have been absent? From the graph, it seems to be about 13 times.
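Reading a prediction off the line is the same as putting a value into the line's equation, and reading it the other way round means rearranging that equation. The sketch below illustrates both directions. The gradient and intercept are assumed values, chosen only so that the results come out close to the readings quoted above. They are not figures taken from the graph.

```python
# Illustrative line only: both constants are assumptions, not values from the page.
GRADIENT = -4.0    # marks lost per absence (assumed)
INTERCEPT = 81.0   # predicted mark for a student with no absences (assumed)


def predict_score(absences):
    """Read the line 'forwards': number of absences -> expected maths score."""
    return INTERCEPT + GRADIENT * absences


def predict_absences(score):
    """Read the line 'backwards': maths score -> expected number of absences."""
    return (score - INTERCEPT) / GRADIENT


print(predict_score(10))     # 41.0
print(predict_absences(30))  # 12.75, i.e. about 13 absences
```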
However, this graph shows well the limitations of making predictions. What score would someone have received if they had been absent for all 30 maths lessons? According to the graph, the score would be less than zero! Similarly, how many times would a student have had to be absent in order to gain a score of 90? Well, the line hits the horizontal axis when the score is just over 80, so in order to get a score of 90, a student would have to be absent a negative number of times. Clearly, these conclusions are stupid, and they lead us to another general principle:
In this case, the scattergram shows no particular pattern. It is clear that we can't draw a straight line anywhere near the data points, and we say that there is no correlation between the length of time taken to travel to college and the final English mark that a student gets. We cannot predict the English mark of any student based on how long it takes him to get to college. Nor can we predict how long it takes a student to get to college given that student's English mark.

As you can see, there is a negative correlation between the number of buses per hour and the number of complaints, but in this case, a curved line fits the data better than a straight line. We are about to investigate the rule that lets you fit a straight line to the data points - it is enough to say at this point that similar rules exist which let you fit various curved lines to the data points as well.
Although a correlation between two variables doesn't mean that one of them causes the other, it can suggest a way of finding out what the true cause might be. There may be some underlying variable that is causing both of them. For instance, if a survey found that there is a correlation between the time that people spend watching television and the amount of crime that people commit, it could be because unemployed people tend to sit around watching the television, and that unemployed people are more likely to commit crime. If that were the case, then unemployment would be the true cause!
Here is how we calculate the correlation coefficient:
In the table below, I have done exactly that with the maths and the English marks that you saw above. I have called the maths marks the "X" values, and the English marks the "Y" values.
The last line of the table shows the total values for the columns (the Σ values).
We then calculate three values from these total values, which we will then use to calculate the correlation coefficient. The three values that we need are the variance of X (written as $S^2_X$), the variance of Y (written as $S^2_Y$) and the covariance of X with Y (written as $S^2_{XY}$). Here are the formulae:
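$$\text{Variance of } X,\quad S^2_X = \frac{N \sum X^2 - \left(\sum X\right)^2}{N^2}$$

$$\text{Variance of } Y,\quad S^2_Y = \frac{N \sum Y^2 - \left(\sum Y\right)^2}{N^2}$$

$$\text{Covariance of } X \text{ with } Y,\quad S^2_{XY} = \frac{N \sum XY - \left(\sum X\right)\left(\sum Y\right)}{N^2}$$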
These formulae may look horrible, but they are all just variations on the same formula. In fact, if you took the formula for the covariance and substituted Xs for all the Ys, then you would get the variance for X. Similarly, if you substituted Ys for all the Xs in the covariance formula, then you get the variance of Y. Effectively, this means that you only have one formula to learn.
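The "one formula to learn" idea can be sketched in code: write the covariance once, working from the column totals, and get each variance by passing the same list twice. The five data pairs are hypothetical, chosen to give round numbers. They are not the marks from the table.

```python
def covariance(xs, ys):
    """Covariance from column totals: (N * sum(XY) - sum(X) * sum(Y)) / N**2."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sum_xy - sum_x * sum_y) / n**2


def variance(values):
    """Variance is the covariance of a column with itself."""
    return covariance(values, values)


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(variance(xs))        # 2.0
print(variance(ys))        # 1.2
print(covariance(xs, ys))  # 1.2
```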
Having got these values, we are now one step away from calculating the correlation coefficient itself, r. Here it is:
$$\text{Correlation coefficient},\quad r = \frac{S^2_{XY}}{\sqrt{S^2_X \cdot S^2_Y}}$$
Providing you have done the calculations correctly, this is guaranteed to lie within the range -1 to 1.
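A short sketch of this last step, assuming r is the covariance divided by the square root of the product of the two variances. The data sets are hypothetical. The first two are perfectly straight lines, which land on the two ends of the range, and the third is scattered, which lands in between.

```python
import math


def covariance(xs, ys):
    """Covariance from column totals: (N * sum(XY) - sum(X) * sum(Y)) / N**2."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sum_xy - sum(xs) * sum(ys)) / n**2


def correlation(xs, ys):
    """Correlation coefficient r = cov(X, Y) / sqrt(var(X) * var(Y))."""
    var_x = covariance(xs, xs)
    var_y = covariance(ys, ys)
    return covariance(xs, ys) / math.sqrt(var_x * var_y)


# Hypothetical data, for illustration only.
r_up = correlation([1, 2, 3], [2, 4, 6])       # points exactly on a rising line
r_down = correlation([1, 2, 3], [6, 4, 2])     # points exactly on a falling line
r_scattered = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

print(round(r_up, 4), round(r_down, 4), round(r_scattered, 4))
```

The rising line gives a value of 1, the falling line gives -1, and the scattered data gives roughly 0.77.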
You may well be asked what these coefficients mean. Well...
The correlation coefficient indicates whether there is a relationship between the two variables, and whether that relationship is positive or negative.
The coefficient of determination tells you what proportion of the variation between the data points is explained or accounted for by the best fit line fitted to the points. It indicates how close the points are to the line.
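For a straight-line fit, the coefficient of determination is the square of the correlation coefficient. The sketch below checks that on hypothetical data: it fits the least-squares line, measures how much of the variation in Y is left over as vertical distances from the line, and compares the explained fraction with r squared. It uses the equivalent "deviations from the mean" form of the sums rather than the column totals, purely to keep the code short.

```python
# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squared deviations and cross-deviations from the means.
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares best fit line.
gradient = sxy / sxx
intercept = mean_y - gradient * mean_x

# Variation NOT explained by the line: squared vertical gaps between points and line.
unexplained = sum((y - (intercept + gradient * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - unexplained / syy

r = sxy / (sxx * syy) ** 0.5
print(round(explained_fraction, 4), round(r ** 2, 4))  # 0.6 0.6
```

Here the line accounts for about 60% of the variation in Y, which matches r squared.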
Anyway, whatever you call them, here are the formulae that calculate the gradient and the intercept:
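$$\text{Gradient} = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2}$$

$$\text{Intercept} = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N \sum X^2 - \left(\sum X\right)^2}$$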
You will notice that the denominator is the same in each case. The formulae use terms which we have already calculated in order to get the correlation coefficient. You'll notice, in fact, that we don't even need to calculate a $\sum Y^2$ value!
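A sketch of the calculation from the column totals, assuming the standard least-squares formulae: gradient = (NΣXY - ΣXΣY) / (NΣX² - (ΣX)²) and intercept = (ΣYΣX² - ΣXΣXY) / (NΣX² - (ΣX)²). The function name and the data are hypothetical. Note that ΣY² is never computed, and that the resulting line passes through the point made from the two means.

```python
def best_fit_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line from column totals."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_xx - sum_x ** 2   # shared by both formulae
    gradient = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y * sum_xx - sum_x * sum_xy) / denominator
    return gradient, intercept


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
gradient, intercept = best_fit_line(xs, ys)
print(gradient, intercept)  # 0.6 2.2

# The line goes through the point (mean of X, mean of Y).
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
print(abs(intercept + gradient * mean_x - mean_y) < 1e-9)  # True
```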