Hypothesis Testing

What is a hypothesis?

A hypothesis (plural "hypotheses") is a statement which may or may not be true. It is a statement made about the result of an experiment which we then test. For instance, we may want to test how effective a new drug is at curing illness, so we produce a hypothesis ("The new drug reduces the level of illness by at least 20%") and then carry out an experiment to see if it is true.

In formal hypothesis testing we actually produce two hypotheses, called H0 (known as the "null hypothesis") and H1 (known as the "alternative hypothesis"). In fact, these two are always given as opposites of each other. Using the drug example above, the two hypotheses might be stated as:

H0: "The new drug doesn't reduce illness by at least 20%."
H1: "The new drug reduces illness by at least 20%."

Since these exactly contradict each other, one of them must be true, whatever the result of the experiment. After we have carried out the experiment, we will either accept H1 (and reject H0) or accept H0 (and reject H1).

In general, we usually arrange the hypotheses so that H0 states that the accepted status quo is correct, and H1 states that the situation is really different from what we would expect. For example, if we are told that a machine usually packs 500 nails on average into each box, and we wanted to test whether this were true, we would probably write the hypotheses as follows:

H0: "The mean number of nails per box is 500."
H1: "The mean number of nails per box is not 500."

A word of warning

The two hypotheses must say the exact opposite of each other to cover all possibilities. Suppose we had the following:

H0: The people in the sample are significantly shorter than the population in general.
H1: The people in the sample are significantly taller than the population in general.

In this case we have left a loophole. What happens if the experimental results indicate that the people in the sample are generally the same height as the population (i.e. not significantly taller or shorter)? In this case we would rewrite the hypotheses as:

H0: The people in the sample are not significantly taller than the population in general.
H1: The people in the sample are significantly taller than the population in general.

In this way we know that one of the two hypotheses must be true.

Here are some questions for you to try. In each case, state whether the pair of hypotheses given is suitable or unsuitable - i.e. whether it covers all possibilities or not.

  1. H0: The machine packs more than 500 nails on average into each box.
     H1: The machine packs 500 nails or fewer on average into each box.

  2. H0: The machine packs 500 nails or fewer on average into each box.
     H1: The machine packs 501 nails or more on average into each box.

  3. H0: The machine packs fewer than 500 nails on average into each box.
     H1: The machine packs more than 500 nails on average into each box.

Significance Level

When we test the hypotheses, we can never be 100% certain of our conclusions. We can only be confident to a certain level - hopefully a high one. Typically we construct our test so that, if H0 is really true, there is only a 5% chance that our sample will mislead us into rejecting it. This is called a 5% significance level, or a 95% confidence level.

Other figures which are quite common are the 99% confidence level (1% significance level) or 90% confidence level (10% significance level). In each case, the significance level is the risk we are prepared to run of wrongly rejecting H0. The higher the confidence level (99% is higher than 95%), the stronger the evidence we demand before we accept H1, and so the less likely it is that our test data will pass the test!

It may turn out later that we have made a mistake (as the years roll past, and more data comes pouring in). Such is the nature of statistics! If it turns out that we wrongly accepted H1, when we should have accepted H0, then we call this a Type I error. On the other hand, if it turns out that we wrongly accepted H0, when in fact H1 was the correct statement, then we have made a Type II error.

Sampling

It may seem as though I am jumping around here quite a bit. After all, when am I going to get on to hypothesis testing? Well, in order to understand it, there are a few threads that need to be explained, and then brought together for a "Eureka" moment, so bear with me!

The art of sampling means taking a small number of a population of items and testing them, and then drawing a conclusion about the population as a whole. For instance, if you wanted to estimate how many hours of television people in Britain watched on average, you couldn't possibly ask them all, so you would ask a sample of people and then draw a conclusion based on what they said.

Clearly the larger the sample, the more representative the results that you get. Let's illustrate this with an example. I asked every person in my street (that's 20 people altogether - it's a very short street!) how many hours of television they watched per night, and recorded the results in the table below. Actually, most people said "Buzz off, sonny!" so I found out how much television they watched by spying on them through their letter boxes.

4  7  1  0  1  2  5  2  4  3
1  6  4  1  6  2  3  2  3  0

The mean value of all those figures is found, as you would expect, by adding them all together and dividing by 20. It comes to 2.85.
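If you would rather let a computer do the adding up, here is a minimal Python sketch of the same calculation. The variable names are my own; the figures are the twenty from the table.

```python
from statistics import mean

# Hours of television per night for the 20 people in the street
hours = [4, 7, 1, 0, 1, 2, 5, 2, 4, 3, 1, 6, 4, 1, 6, 2, 3, 2, 3, 0]

population_mean = mean(hours)   # total of 57 divided by 20
print(population_mean)          # 2.85
```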

But suppose I hadn't been able to find out viewing figures for all the people. Instead of asking everyone, I would take a sample. Below I have chosen three of the figures at random:

5  6  2

The mean of these figures comes to 4.33 (to 2 decimal places), which is quite a lot different from 2.85! All right then, that sample wasn't a great success, so let's choose three more numbers randomly from the list of 20:

0  6  3

The mean of these numbers is 3, which is quite a lot closer. We were lucky that time. Of course, if we carry on choosing groups of three numbers at random from the list (re-using numbers we have previously chosen if they happen to come up in the draw), then the means will vary quite considerably. I have persuaded a computer program to do just that, and recorded the results in the table below. In each case, the computer has chosen three numbers at random from the list above and calculated the mean to 2 decimal places:

1.33  3  1.33  2.33  4  3  4.33  2  5  3.67

These mean values are quite spread out - particularly when you consider that the numbers themselves on which the means are based only vary from 0 to 7. How spread out are they exactly? Well, the same computer program that did the choosing also calculated their standard deviation for me. It is 1.193.
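The figure of 1.193 can be reproduced with a short Python sketch. I am assuming two things here: that the program used the standard deviation formula that divides by the number of values (10) rather than by 9, and that each rounded mean came from a whole-number total divided by 3 (so 1.33 is really 4/3, and so on). Both assumptions are inferred from the figures, not taken from the program itself.

```python
from statistics import pstdev

# Totals of the ten samples of 3; dividing each by 3 gives the listed means
# (4/3 = 1.33, 9/3 = 3, ...). The totals are inferred from the rounded means.
sample_totals = [4, 9, 4, 7, 12, 9, 13, 6, 15, 11]
sample_means = [total / 3 for total in sample_totals]

# pstdev divides by the number of values (10); stdev would divide by 9
spread = pstdev(sample_means)
print(round(spread, 3))   # 1.193
```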

Would the accuracy be improved if we chose 5 items out of the 20 for each sample? Yes, I think it would, as 5 items is a larger percentage of the population (25%) than 3 items (15%). When I rerun the computer program to choose ten samples of 5 numbers, the results (and the standard deviation) are as follows:

2.8  2.6  2  2.8  2.8  2.6  3.6  3.6  1.4  2

Standard deviation = 0.654

Ah, that's a bit better. I'm sure that if we chose a higher sample size (6 numbers, 7 numbers, 8 numbers per sample) then the means would get closer to the true mean and the standard deviation would go down.
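You can try this for yourself. The sketch below is not the program I used, just a rough Python equivalent: it assumes no person is picked twice within one sample, uses far more than ten samples so that the pattern is clear, and fixes the random seed so that the run is repeatable. The function name is my own invention.

```python
import random
from statistics import mean, pstdev

hours = [4, 7, 1, 0, 1, 2, 5, 2, 4, 3, 1, 6, 4, 1, 6, 2, 3, 2, 3, 0]

def spread_of_sample_means(population, sample_size, repeats, rng):
    """Standard deviation of the means of `repeats` random samples."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(repeats)]
    return pstdev(means)

rng = random.Random(1)   # fixed seed so the run is repeatable
spreads = {n: spread_of_sample_means(hours, n, 2000, rng) for n in (3, 5, 10)}

for n, spread in spreads.items():
    print(n, round(spread, 3))   # the spread shrinks as the sample size grows
```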

Standard Error

When carrying out hypothesis testing on samples we use a measure called the Standard Error. This is based on the standard deviation of a population, but takes into account the size of the sample which we draw from the population and on which we base any conclusions. To get the standard error (S.E.) we divide the standard deviation by the square root of the number of items in the sample, n:

Standard Error (S.E.) = σ / √n

where σ is the standard deviation and n is the number of items in the sample.
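As a small Python sketch, the formula is a one-liner. The function name and the figures fed into it (a standard deviation of 12 with samples of 4, 16 and 64) are made up purely for illustration.

```python
import math

def standard_error(std_dev, n):
    """Standard deviation divided by the square root of the sample size."""
    return std_dev / math.sqrt(n)

# Made-up figures: the same standard deviation with ever larger samples
for n in (4, 16, 64):
    print(n, standard_error(12, n))   # 6.0, then 3.0, then 1.5
```

Notice that the sample has to be four times as big to halve the standard error.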

Critical region

The sort of hypotheses that we are going to test will involve comparing the mean of a sample of items against a true mean for a population. This true mean applies to a whole population (too many to count), although it may be only a claim (i.e. someone may tell us what the mean of the population is, and we may want to test it).

Either way, the symbol that we use for the true mean is μ (the Greek letter "mu", equivalent to our letter "m" - "m" for mean, geddit?) and the mean of the sample of items will be called x̄ ("x bar"). We define a critical region around the true mean, and then we see if the sample mean lies within that region.

Firstly, decide on a significance level. We normally choose a 5% significance level (or a 95% confidence level, if you prefer), which means that if H0 is really true, there is only a 5% chance that the sample will mislead us into rejecting it (even if we do the maths correctly). You must remember that nothing is 100% certain in statistics!

Look at the hypotheses carefully. Do they imply that something will be different from the mean value, or do they imply that it will be higher or lower? If the crucial word is "different" (or a word that means the same thing) then we call the test a "two-tailed test", i.e. any item which is substantially different from the mean in either direction counts as "different". However, if the hypotheses use words like "taller", "longer", "better" (or "shorter", "worse", "less efficient" for that matter) then it is a "one-tailed test". For instance, if we want to know whether the machine packs significantly more than 500 nails into each box or not, then a box containing 497 certainly wouldn't provide any evidence to support the hypothesis!

If the test is a two-tailed test, then the critical region has an upper limit and a lower limit, with the true mean exactly in the middle. The distance from the mean to each limit is the standard error (not the standard deviation in this case) multiplied by a certain number which will depend on what significance level we are using.

In the case of a 5% significance level (95% confidence level), the critical number is 1.96. This is the same number that marks out the middle 95% of a Normal Distribution (the mean plus or minus 1.96 standard deviations), although it is wrong to think of the critical region as a range containing 95% of the items in the population. How could it be, when it is based on the standard error, which in turn depends on the size of our sample? If we altered the number of items in the sample, then the size of the critical region would also change!

For instance, if the mean were 100 and the standard error were 8, then we would multiply 8 by 1.96 (to give 15.68). The lower limit of the critical region would then be 100 - 15.68 = 84.32, and the upper limit would be 100 + 15.68 = 115.68.
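Here is the same calculation as a Python sketch. Rather than typing in 1.96, it looks the critical number up from the standard Normal Distribution using the `NormalDist` class in Python's `statistics` module (Python 3.8 or later). The function name is my own.

```python
from statistics import NormalDist

def two_tailed_region(centre, std_error, significance=0.05):
    """Lower and upper limits: centre +/- critical number * S.E."""
    # Split the significance level equally between the two tails
    critical_number = NormalDist().inv_cdf(1 - significance / 2)
    margin = critical_number * std_error
    return centre - margin, centre + margin

lower, upper = two_tailed_region(100, 8)
print(round(lower, 2), round(upper, 2))   # 84.32 115.68
```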

Of course, it's a different matter if the test is a one tailed test. In this case, the critical region only has one limit:

Right-tailed test: critical region = up to μ + 1.64 S.E.
Left-tailed test: critical region = μ - 1.64 S.E. upwards
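Where do numbers like 1.96 and 1.64 come from? They are read off the standard Normal Distribution. The Python sketch below (the function name is my own) prints the critical numbers for the three common significance levels. For a two-tailed test the significance level is shared between the two tails, which is why the numbers differ.

```python
from statistics import NormalDist

def critical_number(significance, tails):
    """Critical number for a one-tailed (tails=1) or two-tailed (tails=2) test."""
    return NormalDist().inv_cdf(1 - significance / tails)

for significance in (0.10, 0.05, 0.01):
    one = critical_number(significance, 1)
    two = critical_number(significance, 2)
    print(f"{significance:.0%}: one-tailed {one:.2f}, two-tailed {two:.2f}")
```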

What does the critical region mean?

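The critical region marks the range of values in which we can be fairly certain that the true mean, from which our sample was taken, lies. For instance, if we have calculated that the critical region at a 95% confidence level is between 10 and 20, then we can be 95% confident that the true mean lies within that region. Similarly, if the critical region is one-tailed at the 1% significance level, with a lower limit at 25 and no upper limit, then we can be 99% confident that the true mean is greater than 25.

It does not matter whether we build the region around the claimed mean and check whether the sample mean falls inside it, or build it around the sample mean and check whether the claimed mean falls inside it. Either way we are asking whether the two means are more than the critical number of standard errors apart (in the direction of interest, for a one-tailed test), so the verdict is the same. One caution on vocabulary: many textbooks use "critical region" for the opposite set of values, the ones that lead us to reject H0. On this page it always means the region in which we accept H0.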
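To see what the 5% risk looks like in practice, here is a rough Python simulation. Everything in it is invented for illustration: a Normally distributed population with a true mean of 100 and a standard deviation of 8, samples of 25 items, and 10,000 repeats with a fixed random seed. In every repeat H0 is true, so each rejection is a Type I error.

```python
import math
import random

def rejects_h0(sample, claimed_mean, std_dev, critical_number=1.96):
    """Two-tailed test: True if the claimed mean lies outside the region
    sample mean +/- critical_number * S.E."""
    sample_mean = sum(sample) / len(sample)
    margin = critical_number * std_dev / math.sqrt(len(sample))
    return abs(sample_mean - claimed_mean) > margin

rng = random.Random(42)   # fixed seed so the run is repeatable
true_mean, std_dev, n, trials = 100, 8, 25, 10000   # made-up population

wrong = sum(
    rejects_h0([rng.gauss(true_mean, std_dev) for _ in range(n)], true_mean, std_dev)
    for _ in range(trials)
)
type_one_rate = wrong / trials
print(type_one_rate)   # close to 0.05; the exact figure depends on the seed
```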

Hypothesis Testing itself

Now we come to the actual testing where we pull the strands together. Let's consider a typical problem for hypothesis testing:

A company sells cotton reels and claims that the average (mean) length of the cotton on each reel is 250m with a standard deviation of 14m. To test this claim, I buy a sample of 30 cotton reels and measure the length of the cotton on them. The mean length of the cotton in my sample is 243m per reel. Can I accept the company's claim?

  1. What are the hypotheses?

    The company claims that the mean length of the cotton is 250m. We think it may be something different from that. The two hypotheses should therefore be:

    H0: "The mean length of the cotton is 250m per reel."
    H1: "The mean length of the cotton is something other than 250m per reel."

  2. Decide what type of test it is, and what significance level is required

    Well, clearly we want to see whether the mean length is 250m or something different from 250m, so it is a two-tailed test. We will be boringly conventional and choose a 5% significance level (95% confidence level).

  3. Calculate the standard error.

    This is fairly straightforward: simply divide the standard deviation by the square root of the number of items in the sample, i.e.

    Standard error = 14 / √30 = 14 / 5.48 = 2.56

  4. Calculate the critical region.

    Since the test is a two-tailed test, we will need a symmetrical critical region:
    Lower limit = 243 - 1.96 x 2.56 = 237.98 metres
    Upper limit = 243 + 1.96 x 2.56 = 248.02 metres

  5. Compare the company's claim for the true mean (μ) with the critical region. If it is inside the critical region, then accept H0 and reject H1. If it is outside the critical region, accept H1 and reject H0.

    In this case, the claimed true mean is 250m, which is outside the critical region (just!) This means that a sample mean as low as 243m would be very unlikely (less than a 5% chance) if H0 were true, so we reject H0 and accept H1 at the 5% significance level.

    This means that we can conclude that the company's claim (250 metres on average on every reel) is probably wrong, although there probably isn't enough evidence to go to court!
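The five steps for the cotton reel question can be sketched in Python as follows. The variable names are my own. The limits come out a hundredth of a metre different from the ones in the text, because the text rounds the standard error to 2.56 before multiplying.

```python
import math

claimed_mean = 250   # the company's claim, in metres
std_dev = 14         # standard deviation given in the question
n = 30               # number of reels in the sample
sample_mean = 243    # mean length found in the sample

std_error = std_dev / math.sqrt(n)
lower = sample_mean - 1.96 * std_error
upper = sample_mean + 1.96 * std_error
accept_h0 = lower <= claimed_mean <= upper

print(round(std_error, 2))                          # 2.56
print(round(lower, 2), round(upper, 2))             # 237.99 248.01
print("accept H0" if accept_h0 else "accept H1")    # accept H1
```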

These are the five basic steps in Hypothesis Testing. Basically you apply them in order. Here's another, slightly different example:

A company has a machine that manufactures lightbulbs with a mean lifetime of 5000 hours and a standard deviation of 160 hours. The company is considering buying a new machine which promises to make lightbulbs which last significantly longer than those produced by the old machine. A sample of 200 bulbs from the new machine is tested and found to have a mean lifetime of 5020 hours. Does the new machine produce longer-lasting bulbs?

This question is a little different, but works the same way. We are not being asked to test whether the true mean has a certain value - we are told the value of the true mean (5000 hours) and just have to accept that. Instead, we are being asked whether the sample is compatible with that true mean, or whether it is substantially larger.

However, the same method can be applied. In both this and the previous question, we are asked whether the sample is compatible with the true mean. In the previous question, it was the "true" mean that was slightly suspect (and that turned out to be wrong!) In this question, it is the sample itself which is suspect. Is it better than the true mean, or is it just a fluke?

  1. What are the hypotheses?

    H0: The mean lifetime of the bulbs from the new machine is not greater than 5000 hours.
    H1: The mean lifetime of the bulbs from the new machine is greater than 5000 hours.

  2. What sort of test is it? Well, we want to know whether the lifetime is greater than 5000 hours, so it is a one-tailed test, more specifically a right-tailed test.

    What significance level do we want? Well, normally we would pick a 95% confidence level, but let's be daring for once! Let's choose a 99% confidence level (ooOOooh!), i.e. a 1% significance level. At this level, a one-tailed test requires the special number 2.33.

  3. Calculate the standard error.

    Standard error = σ / √n = 160 / √200 = 160 / 14.14 = 11.31

  4. Calculate the critical region.

    We must ask ourselves carefully exactly what this critical region is to mean. We are basing our results on a sample of lightbulbs that come from some unknown population, whose true mean may be the same as for the current process or may be bigger (certainly we have no reason to believe that it is smaller based on our sample!) In this case, we calculate a critical region for the mean from our sample. There is no upper limit. The lower limit = x̄ - 2.33 S.E. = 5020 - 2.33 x 11.31 = 4993.65

    What does this critical region mean exactly? Well, we can be 99% confident that the true mean lifetime of the bulbs produced by the new machine lies within this region. There is only a 1% chance that it is less than 4993.65. If the established mean lies within this region, then it is compatible with the mean of the bulbs from the new machine - i.e. the mean lifetimes produced by both machines could well be the same (no significant difference).

  5. Compare the established value of the true mean with the critical region.

    5000 is not smaller than 4993.65, so the established mean is within the critical region. This means that we can accept H0 and reject H1. The mean lifetime of the bulbs in the sample is not significantly higher than 5000 at the 1% significance level, and so the machine does not produce significantly longer-lasting bulbs. Our advice to the company would be to stick with the machine that they've already got!
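Here is the lightbulb question as a Python sketch, again with variable names of my own. It also shows an equivalent way of reaching the same verdict: counting how many standard errors the sample mean lies above 5000 and comparing that with 2.33.

```python
import math

established_mean = 5000   # hours, from the old machine
std_dev = 160
n = 200
sample_mean = 5020        # hours, from the sample of new bulbs
critical_number = 2.33    # one-tailed, 1% significance level

std_error = std_dev / math.sqrt(n)
lower = sample_mean - critical_number * std_error
accept_h0 = established_mean >= lower

# Equivalent check: how many standard errors above 5000 is the sample mean?
z_score = (sample_mean - established_mean) / std_error

print(round(std_error, 2))   # 11.31
print(round(lower, 2))       # 4993.64 (the text gets 4993.65 by rounding S.E. first)
print(round(z_score, 2))     # 1.77, which is below 2.33
print("accept H0" if accept_h0 else "accept H1")   # accept H0
```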

Postscript on confidence intervals

My students often get confused between critical regions for 95% confidence levels and 95% confidence intervals, so I thought it would be a good idea if I summarised them. Here is a brief summary of the differences:

If you have a true population mean μ, and a true standard deviation σ (i.e. values that you know apply to the population as a whole), and the population follows the Normal Distribution, then the 95% interval for individual items is calculated as follows:

μ ± 1.96σ

This region will contain 95% of all the items in the population. (Strictly speaking, most textbooks reserve the name "confidence interval" for the second range below, and would simply call this first one the range containing 95% of the population.)

If, on the other hand, you only have a sample of n items from the population, with a sample mean of x̄ and a standard deviation of σ, then the 95% critical region is calculated as follows:

x̄ ± 1.96 S.E.

where the standard error, S.E., is σ / √n. In this case, we can be 95% confident that the true mean of the population from which this sample was taken lies within this region.

The same is true for other degrees of certainty (e.g. 99% or 90%) except that the critical number is not 1.96 (it is 2.58 for 99%, 1.64 for 90% etc.) Note that these numbers change again when you are considering a one-tailed test instead of a two-tailed test.
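A short Python sketch makes the difference between the two ranges concrete. The figures (a mean of 250, a standard deviation of 14 and a sample of 30) are borrowed from the cotton reel question purely as an illustration, and the population is assumed to follow the Normal Distribution.

```python
import math
from statistics import NormalDist

mu, sigma, n = 250, 14, 30   # illustrative figures only

# Share of individual items lying within mu +/- 1.96 sigma
items = NormalDist(mu, sigma)
share_inside = items.cdf(mu + 1.96 * sigma) - items.cdf(mu - 1.96 * sigma)
print(round(share_inside, 3))   # 0.95

# Half-widths of the two ranges
item_margin = 1.96 * sigma                  # range for individual items
mean_margin = 1.96 * sigma / math.sqrt(n)   # range based on the standard error
print(round(item_margin, 2), round(mean_margin, 2))   # 27.44 5.01
```

The range based on the standard error is much narrower, and it would narrow further if the sample were larger.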


