The best analogy is learning to count in base 10 - the sort of thing you did when you were a little child. You will remember columns marked "hundreds" "tens" and "units" etc. like this:
In the binary counting system, we have the same idea, except this time the columns contain powers of 2 rather than powers of 10:
I have stopped the columns at the 8s column, but there is no reason why you can't go as far as you like. The next column would be 16s, then 32s etc.
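If you would like to check the column idea on a computer, here is a minimal C++ sketch. The function name binaryToDecimal is only an illustrative choice, and the sketch assumes the text it is given contains nothing but the characters 0 and 1.

```cpp
#include <string>

// Convert a string of '0' and '1' characters to its base 10 value by
// walking the columns from left to right. Each step doubles the running
// total (moving every digit one column to the left) and adds the new digit.
// Assumption: the string holds only '0' and '1' and the value fits in an
// unsigned long.
unsigned long binaryToDecimal(const std::string& bits)
{
    unsigned long value = 0;
    for (char bit : bits) {
        value = value * 2 + (bit == '1' ? 1 : 0);
    }
    return value;
}
```

For instance, binaryToDecimal("1011") gives 8 + 2 + 1 = 11.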
| 5661 | ||||||||
| 5662 | ||||||||
| 5663 | ||||||||
| 5664 | ||||||||
| 5665 | ||||||||
| 5666 | ||||||||
| 5667 | ||||||||
| 5668 | ||||||||
| 5669 |
Each memory slot also has an address so that the computer knows where to find it. The bytes above show the contents of the memory from addresses 5661 to 5669.
Because each byte consists of 8 bits, it can be used to store 2^8 = 256 possible numbers. The lowest number is 00000000 (i.e. the equivalent of 0) and the highest is 11111111, which is the equivalent of 255. You will notice that although it can store 256 possible numbers, the highest number that can be stored is one less than this, as the counting starts from 0.
Of course, these binary patterns don't have to represent numbers. The standard way of representing letters and symbols such as punctuation, for example, is the ASCII code (pronounced "Askee") - the American Standard Code for Information Interchange - which represents the letter "A" as 01000001 (the equivalent of 65), "B" as 01000010 (which is 66) etc. The point is that all the memory holds is patterns of voltages, and it is up to the program how it interprets these patterns. The computer program itself is stored as binary patterns, arranged into bytes.
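The idea that one byte can be read in two ways can be sketched in a few lines of C++. The function name describeByte is made up for this illustration, and the sketch assumes a machine that uses ASCII, which practically every modern system does.

```cpp
#include <string>

// One byte, two readings: as a number and as an ASCII character.
// Assumption: the platform's character set is ASCII.
std::string describeByte(unsigned char byte)
{
    std::string text = std::to_string(static_cast<int>(byte));  // numeric reading
    text += " or '";
    text += static_cast<char>(byte);                            // character reading
    text += "'";
    return text;
}
```

Calling describeByte(65) returns the text 65 or 'A': the byte has not changed, only the way the program chooses to interpret it.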
Multiple byte variables
Most computer languages have at least one variable type which is stored as one byte. Turbo Pascal (the most common version of Pascal around) has a variable type byte which can hold numbers in the range 0 to 255, as can the C++ type unsigned char (although variables of type unsigned char are more usually used to represent ASCII codes).
However, computer languages would be useless if they couldn't store numbers larger than 255, so languages also implement variables which require more than one byte. With two bytes, you have 16 bits to play around with, which gives 2^16 = 65,536 numbers (actually in the range 0 to 65,535 - see above!) Pascal has a 2-byte variable type called word, for example.
Using even more bytes allows even higher numbers to be stored. Here are the data types from C++ which only allow positive numbers to be stored (I will come on to how negative numbers can be stored later):
You will notice that all these variables are preceded by the word "unsigned". This is because there is a "signed" version of each of these C++ variable types which can handle negative numbers. However, before we look at negative numbers, we must deal with binary addition:
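The exact sizes of these types depend on the compiler, so one way to see them on your own machine is to ask the compiler directly. This is only a sketch: the helper names printRange and printUnsignedTypes are invented here, while sizeof and std::numeric_limits are standard C++.

```cpp
#include <iostream>
#include <limits>

// Print the size in bytes and the largest value of one unsigned type.
// The C++ standard only fixes minimum sizes, so the figures printed
// depend on the compiler and platform.
template <typename T>
void printRange(const char* name)
{
    std::cout << name << ": " << sizeof(T) << " byte(s), 0 to "
              << static_cast<unsigned long long>(std::numeric_limits<T>::max())
              << '\n';
}

// Report every unsigned integer type in turn.
void printUnsignedTypes()
{
    printRange<unsigned char>("unsigned char");
    printRange<unsigned short>("unsigned short");
    printRange<unsigned int>("unsigned int");
    printRange<unsigned long>("unsigned long");
}
```

On most machines unsigned char is reported as 1 byte with a range of 0 to 255, matching the one-byte range described earlier.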
Binary addition
Let us consider one column from an addition sum which has been written using the binary notation. There are four possible values that the digits in this binary column could have:
The sum on the left shows that 0 + 0 gives you 0 with no carry to the next column. The next two sums show that 0 + 1 and 1 + 0 both give the same answer (1), again with no carry to the next column. 1 + 1, on the other hand, gives 2 - which we have to represent as 10 in binary (1 in the 2s column, remember?). This means that we put 0 in the answer slot and carry 1 into the next column (the equivalent of carrying 10 in a normal base 10 calculation).
What we have described is called a binary half adder. It can produce a carry-out to the next column but doesn't worry about any carry in from the previous column. A binary full adder would have a carry in from the previous column. This time there are 8 possibilities:
I have shown the carry in from the previous column in brackets. A binary adding device would consist of a series of circuits, each identical and each adding up one column of the sum according to the plan shown above.
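The half adder and full adder can be sketched in C++ using the bitwise operators ^ (exclusive or) and & (and). The names ColumnResult, halfAdd and fullAdd are illustrative only, and every input is assumed to be either 0 or 1.

```cpp
// Result of adding one binary column: the digit written in the answer
// slot and the carry passed on to the next column.
struct ColumnResult {
    int sum;
    int carry;
};

// Half adder: adds two bits, with no carry in from the previous column.
// The answer digit is 1 when exactly one input is 1; the carry is 1 only
// when both inputs are 1.
ColumnResult halfAdd(int a, int b)
{
    return { a ^ b, a & b };
}

// Full adder: adds two bits plus the carry in from the previous column.
// Built here from two half adders; a carry out is produced if either
// stage produced one.
ColumnResult fullAdd(int a, int b, int carryIn)
{
    ColumnResult first = halfAdd(a, b);
    ColumnResult second = halfAdd(first.sum, carryIn);
    return { second.sum, first.carry | second.carry };
}
```

For example, fullAdd(1, 1, 1) gives a sum digit of 1 and a carry of 1, which is 11 in binary, or 3.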
Why am I mentioning this? Well, binary addition coupled with the fact that variables have a limited range gives us a way of representing negative numbers in the computer. Consider, for example, this calculation, done using one-byte variables:
Check that each of the columns is correct binary addition. You will see that the answer produces a final carry value which cannot be added to a column as we have run out of columns! Instead, the computer stores this final carry value in a special place called the carry flag (a 1-bit value which is 1 if a final carry was produced and 0 if it wasn't).
Let us look at the values involved in the calculation: 01101101 translates into 109 in normal numbers, 11011011 translates into 219 and the answer, 01001000 translates into 72. The computer might conclude that 109 + 219 gives 72. The wrong answer is produced because the calculation overflowed off the end of the number and couldn't give the correct answer, which is 328. However, consider this similar calculation:
The top number is still 109, but the bottom number is 147, and they add up to give 256, which the computer is forced to store as 0. We have found two numbers that add to give 0. If the top number is 109, we may reasonably conclude that the bottom number can represent -109.
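Both calculations can be imitated in C++. This is a sketch rather than a picture of what the processor really does: the names ByteSum and addBytes are invented for the example, and it assumes the compiler provides the one-byte type std::uint8_t (nearly all do).

```cpp
#include <cstdint>

// Outcome of a one-byte addition: the 8 bits that fit in the byte and
// the final carry that dropped off the end.
struct ByteSum {
    std::uint8_t result;
    bool carryFlag;
};

// Add two bytes the way a one-byte adder would.
ByteSum addBytes(std::uint8_t a, std::uint8_t b)
{
    unsigned total = static_cast<unsigned>(a) + b;   // can reach 510, so needs 9 bits
    return { static_cast<std::uint8_t>(total & 0xFF),  // keep the low 8 bits
             total > 0xFF };                           // was there a 9th bit?
}
```

addBytes(109, 219) leaves 72 in the byte with the carry flag set, and addBytes(109, 147) leaves 0 with the carry flag set.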
Similarly, if you added 43 to the binary pattern for 213 using one-byte variables, with the final carry dropping off the end of the sum, you would find that it gives 0, so we could say that the pattern for 213 also represents the number -43.
This gives a way of interpreting binary patterns for negative numbers called two's complement. This is how you form the binary pattern for a negative number (I use the number 17 as an example):
The pattern 11101111 now represents -17, and if you add it to 17 using one-byte variables, it will give you the answer 0. You will notice, if you try a few of these, that all negative numbers have a 1 in the bit on the left of the number (what we call the most significant bit). This bit can be used to indicate the sign of the number.
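Here is a short C++ sketch of forming a two's complement pattern, assuming the usual recipe of flipping every bit and then adding 1. The function name twosComplement is illustrative, and std::uint8_t is assumed to be available.

```cpp
#include <cstdint>

// Form the one-byte two's complement pattern for -value:
// flip every bit, then add 1. Converting back to one byte throws away
// anything that does not fit in 8 bits.
std::uint8_t twosComplement(std::uint8_t value)
{
    return static_cast<std::uint8_t>(~value + 1);
}
```

twosComplement(17) gives 239, which is the pattern 11101111, and twosComplement(109) gives 147, the number that added to 109 to give 0 earlier.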
Jimmy: I'm hugely confused, Professor! Does 10010011 in binary represent 147 or -109 in normal numbers?

Professor: I'm glad you asked me that one, Jimmy! The pattern of bits can represent either of those numbers, depending on how you look at it. Remember, 10010011 is just a pattern of 1s and 0s. When the computer adds 10010011 to 01101101 it gets the answer 00000000 with a final carry of 1. This can be interpreted either as 256 (taking the carry into account) or as 0 (ignoring it)!
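The two readings of one pattern can be written down as a small C++ function. This is a sketch with an invented name, asSigned; it works out the signed reading by arithmetic rather than relying on how any particular compiler converts between types.

```cpp
#include <cstdint>

// Read a one-byte pattern as a signed two's complement number.
// Patterns with the most significant bit set (128 to 255 when read as
// unsigned) stand for the negative numbers -128 to -1.
int asSigned(std::uint8_t pattern)
{
    return pattern >= 128 ? static_cast<int>(pattern) - 256
                          : static_cast<int>(pattern);
}
```

The pattern 10010011 reads as 147 when treated as unsigned, and asSigned(147) gives -109 for the very same bits.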
The fact that variables can be interpreted in these two different ways, signed numbers (negative and positive numbers) and unsigned (positive numbers only) gives a whole new set of variable types. Here are the signed types for the C++ language:
You will notice that you sacrifice some of the numbers on the positive side of the scale in order to store the negative numbers. However, you can still store the same quantity of numbers, so -128 to +127 is still a range of 256 numbers, except the counting now starts from a negative number.
Turbo Pascal has signed types as well. For instance, it has a type called integer which is equivalent to the C++ short type, and a type called longint which is the equivalent of the int type in C++.
When you were very small, you learned decimals by considering columns of figures after the decimal point. The columns were called "tenths", "hundredths" etc. like this:
We do something similar in binary, except that this time the columns are powers of two, as you would expect. The first column after the "binial" point (we can't really call it a decimal point, as it is no longer base 10) is "halves", the next column is "quarters", the next "eighths" etc.:
Similarly, 0.00101 would be the binary equivalent of 1/8 plus 1/32 which is 5/32. You can see that fractions which have denominators like 32, 16, 64 etc. are easy to represent in binary. These are equivalent to decimals, so 1/2 = 0.5 = 0.1 (in binary). Similarly, 5/8 = 0.625, so 0.625 in binary becomes 0.101
The hard part is representing fractions such as 1/3 or 1/10. 1/3 in decimal gives a recurring decimal 0.333333... and the same thing happens when you try to represent 1/3 in binary - you get a never-ending string of 1s and 0s. The same happens when you try to represent 1/10. In decimal, this comes out as simply 0.1, but because 1/10 can't be represented as the sum of fractions with denominators like 8, 16 etc. it can't be represented exactly in binary and again it gives a recurring decimal (whoops, sorry, I meant "binial").
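You can watch the recurring pattern appear with a short C++ sketch. It uses repeated doubling, one common pencil-and-paper method: each doubling pushes the next binary digit in front of the point. The function name binaryFraction and its parameters are invented for this example, and the fraction is assumed to lie between 0 and 1.

```cpp
#include <string>

// Binary digits after the point for numerator/denominator, found by
// repeated doubling. Stops when nothing is left over or after maxDigits
// digits, since many fractions never end.
// Assumption: 0 <= numerator < denominator, and both are small enough
// that doubling the numerator cannot overflow.
std::string binaryFraction(unsigned numerator, unsigned denominator, int maxDigits)
{
    std::string digits;
    while (numerator != 0 && static_cast<int>(digits.size()) < maxDigits) {
        numerator *= 2;
        if (numerator >= denominator) {
            digits += '1';               // this column is needed
            numerator -= denominator;    // keep only what is left over
        } else {
            digits += '0';               // this column is not needed
        }
    }
    return digits;
}
```

binaryFraction(5, 8, 12) stops after three digits with 101, whereas binaryFraction(1, 10, 12) uses all twelve digits, giving 000110011001, and would carry on repeating 0011 for ever.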
This can lead to slight errors in calculations. If you type a program like the following:
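#include <iostream>
int main ()
{ float x = 2.47f;     // 2.47 cannot be stored exactly in binary
  float y = 6.19f;     // neither can 6.19
  float answer = x + y;
  std::cout << answer; // shows 6 significant digits by default
  return 0;
}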
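the program should give the answer to 2.47 + 6.19, which is 8.66, on the screen. However, neither 2.47 nor 6.19 can be stored exactly in binary, so the result held in memory is only very close to 8.66. Some versions of C++ have representations of floating point numbers that "lose" digits off the end of the numbers, and with those (or whenever more digits are displayed) you won't see exactly 8.66 when you run the program!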
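To see the digits that are normally hidden, you can ask for more of them to be shown. The helper below is a sketch with an invented name, showDigits; std::setprecision and std::ostringstream are standard C++.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a float with a chosen number of significant digits, so that
// digits normally hidden by the default output precision become visible.
std::string showDigits(float value, int digits)
{
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}
```

With 6 significant digits, showDigits(2.47f + 6.19f, 6) gives 8.66, but asking for 12 digits reveals that the stored answer is not exactly 8.66.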
Exponent and Mantissa form
The other thing that I wanted to remind you about is when you learned scientific notation at school (also called Standard Form). This was written as follows:
3.611 x 10^65
The decimal number at the front was called the mantissa (Don’t ask me ...) and it always had to be between 1 and 10. The power was always a power of 10 and the power itself was called the exponent. In the example above, the mantissa is 3.611 and the exponent is 65.
Numbers with a positive exponent were very big numbers. This following example shows a big number being converted to standard form:
7110 = 711 x 10^1
= 71.1 x 10^2
= 7.11 x 10^3
You can see that the number is continually divided by 10 until its mantissa is between 1 and 10. At the same time, the exponent is increased to compensate for the division ... after all, we want the number to stay the same size, even though the mantissa looks smaller. To double check, 7.11 x 10^3 is equivalent to 7.11 x 1000, which is 7110 ... the number we started with.
Numbers with a negative exponent were very small numbers (less than 1). This following example shows a small number being converted to standard form:
0.000053 = 0.00053 x 10^-1
= 0.0053 x 10^-2
= 0.053 x 10^-3
= 0.53 x 10^-4
= 5.3 x 10^-5
In this case, the mantissa has to be multiplied by 10 in order for it to be in the range 1 to 10, so the exponent has to be negative to compensate. 10^-5 is equivalent to 0.00001, so 5.3 x 10^-5 is the same as 5.3 x 0.00001 = 0.000053.
Why am I telling you this? Because a very similar format is used to represent floating point numbers in binary. In this case, the power is always a power of 2 (not 10) and the mantissa has to be divided or multiplied until it is in the range 1 to 2 (not 1 to 10). For example:
7.1 = 3.55 x 2^1
= 1.775 x 2^2
-0.0356 = -0.0712 x 2^-1
= -0.1424 x 2^-2
= -0.2848 x 2^-3
= -0.5696 x 2^-4
= -1.1392 x 2^-5
You’ll notice that the last example was a negative number. Standard form numbers can be positive or negative just as any other numbers can.
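The halving and doubling shown in those two examples can be sketched in C++ as follows. The names StandardForm and toBinaryStandardForm are invented for the illustration, and the sketch assumes an ordinary finite number.

```cpp
#include <cmath>

// A number written as mantissa x 2^exponent.
struct StandardForm {
    double mantissa;   // size between 1 and 2 (or 0 for the number zero)
    int exponent;      // power of 2
};

// Halve or double the mantissa until its size is in the range 1 to 2,
// adjusting the exponent each time so the overall value is unchanged.
// Assumption: value is finite (not infinity or NaN).
StandardForm toBinaryStandardForm(double value)
{
    StandardForm form = { value, 0 };
    if (value == 0.0) return form;            // zero cannot be standardized
    while (std::fabs(form.mantissa) >= 2.0) { form.mantissa /= 2.0; ++form.exponent; }
    while (std::fabs(form.mantissa) < 1.0)  { form.mantissa *= 2.0; --form.exponent; }
    return form;
}
```

toBinaryStandardForm(7.1) ends with an exponent of 2 and a mantissa of about 1.775, and toBinaryStandardForm(-0.0356) ends with an exponent of -5 and a mantissa of about -1.1392.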
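A floating point number has several bytes set aside for it. For example, the type float in C++ takes up 4 bytes. In the simplified layout used here, the bytes are divided up as follows (real compilers generally follow the IEEE 754 standard, which uses an 8-bit exponent and a 23-bit mantissa, but the idea is the same):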

The first byte has one bit reserved for the sign of the number. This is 0 for positive numbers, 1 for negative numbers. There is no equivalent of Two’s Complement for floating point numbers. Having said that, the next seven bits form the exponent, which is stored in Two’s Complement form, so an exponent of 0000101 would represent a power of +5, whereas an exponent of 1110001 would represent a power of -15. The last three bytes (24 bits) are reserved for the mantissa.
The range of the numbers that can be stored is really set by the exponent. The number of bits set aside for the exponent determines the maximum and minimum available powers. The precision of the numbers that can be stored is set by the mantissa. The number of bits set aside for the mantissa determines how many significant digits can be stored before a number "drops off the end".
In fact, it is possible to wangle an extra bit for the precision. You will notice that when a number is converted to standard form in binary, the first digit (the only one in front of the binial point) is always a 1. This is because the number must be between 1 and 2. If this is always a 1, then there is no point in storing it, and for this reason, the first bit of the mantissa (at the start of the second byte) always represents halves, with the next bit representing quarters etc.
Here are a couple of examples of floating point numbers:
| 0 1110001 | 1010 0000 | 0000 0000 | 0000 0000 |
The sign bit is 0, so this number is positive. The power is the Two’s Complement equivalent of -15. The mantissa consists of one half (the first bit) and one eighth (the third bit) with all the others zero = 0.5 + 0.125 = 0.625, to which 1 must be added (as that is always assumed - see the paragraph above), giving 1.625, so this pattern represents 1.625 x 2^-15.
| 1 0010001 | 1110 0000 | 0000 0000 | 0000 0000 |
The sign bit is 1, so this is a negative number. The exponent is 0010001 which is +17 and the mantissa has the first three bits set, meaning (1 +) ½ + ¼ + 1/8 = 1 and 7/8 which is 1.875, so this number is -1.875 x 2^17.
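The decoding done by hand in these two examples can be sketched in C++. Be aware that this follows the simplified 4-byte layout described on this page, not the IEEE 754 layout that real compilers use for float, and that the function name decodeSimpleFloat is invented for the illustration. std::ldexp is a standard function that multiplies a number by a power of 2.

```cpp
#include <cmath>
#include <cstdint>

// Decode the simplified 4-byte layout described above (NOT IEEE 754):
// 1 sign bit, a 7-bit two's complement exponent, then 24 mantissa bits
// with an assumed leading 1 in front of the point.
double decodeSimpleFloat(std::uint32_t bits)
{
    bool negative = ((bits >> 31) & 1u) != 0;             // top bit is the sign
    int exponent = static_cast<int>((bits >> 24) & 0x7Fu); // next 7 bits
    if (exponent >= 64) exponent -= 128;                  // 7-bit two's complement
    std::uint32_t fraction = bits & 0xFFFFFFu;            // the 24 stored bits
    double mantissa = 1.0 + fraction / 16777216.0;        // 16777216 is 2^24
    double value = std::ldexp(mantissa, exponent);        // mantissa x 2^exponent
    return negative ? -value : value;
}
```

Written in hexadecimal, the two patterns above are 0x71A00000 and 0x91E00000. decodeSimpleFloat(0x91E00000) gives -245760, which is -1.875 x 131072.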
Arithmetic using floating point numbers in binary
With normal whole numbers, it is addition and subtraction which are easy, and multiplication and division which are harder. In the case of floating point numbers, that is the other way round - multiplication and division are relatively easy and addition and subtraction are the hard things to do.
Back to base 10 again. To multiply two numbers in standard form, you multiply the mantissas and add the exponents. You may need to "restandardize" the mantissa of the answer to make sure that it is still between 1 and 10:
8.1 x 10^4 x 3.44 x 10^8 = 8.1 x 3.44 x 10^(4 + 8)
= 27.864 x 10^12
= 2.7864 x 10^13 (back in standard form)
To divide base 10 numbers in standard form, the mantissas are divided and the exponents are subtracted, and the result is restandardized to put it in standard form again.
A similar procedure applies in the case of binary floating point. The processor first has to fill in the missing 1s at the beginning of the mantissas that it had assumed, then it adds the exponents and multiplies the mantissas, reducing the mantissa of the answer back to 3 bytes.
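The base 10 version of this rule (multiply the mantissas, add the exponents, restandardize) can be sketched in C++. The names SciNumber, restandardize and multiply are invented for the illustration; a processor works in base 2 on the bit patterns themselves rather than like this.

```cpp
#include <cmath>

// A number written as mantissa x 10^exponent.
struct SciNumber {
    double mantissa;
    int exponent;
};

// Bring the mantissa back into the range 1 to 10, adjusting the
// exponent to compensate. Assumption: the mantissa is finite.
SciNumber restandardize(SciNumber n)
{
    if (n.mantissa == 0.0) return n;
    while (std::fabs(n.mantissa) >= 10.0) { n.mantissa /= 10.0; ++n.exponent; }
    while (std::fabs(n.mantissa) < 1.0)   { n.mantissa *= 10.0; --n.exponent; }
    return n;
}

// Multiply the mantissas, add the exponents, then restandardize.
SciNumber multiply(SciNumber a, SciNumber b)
{
    SciNumber product = { a.mantissa * b.mantissa, a.exponent + b.exponent };
    return restandardize(product);
}
```

Multiplying 8.1 x 10^4 by 3.44 x 10^8 this way gives a mantissa of about 2.7864 and an exponent of 13.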
The hard part is addition and subtraction. In this case, the processor needs to shift one of the numbers left or right, adjusting the exponent of that number to compensate, until the exponents of the two numbers are the same. This effectively takes the number out of standard form. When the two numbers have the same exponent, they can be added and then the answer is shifted left or right until it is back in standard form. This is the equivalent of doing the following:
3.1 x 10^5 + 2.6 x 10^3 = 31 x 10^4 + 2.6 x 10^3
= 310 x 10^3 + 2.6 x 10^3
= 312.6 x 10^3
= 31.26 x 10^4
= 3.126 x 10^5
You can see that the larger number was converted until it had the same exponent as the smaller one. Then and only then could the numbers be added, and the answer was converted back into standard form.
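The same base 10 addition can be sketched in C++. The names SciNumber and add are invented for the illustration, and the sketch shifts whichever number has the larger exponent, just as in the worked example; a real processor does the equivalent in base 2.

```cpp
#include <cmath>

// A number written as mantissa x 10^exponent.
struct SciNumber {
    double mantissa;
    int exponent;
};

// Add two numbers in standard form.
SciNumber add(SciNumber a, SciNumber b)
{
    // Step 1: shift the number with the larger exponent until both
    // exponents match (this takes it out of standard form).
    while (a.exponent > b.exponent) { a.mantissa *= 10.0; --a.exponent; }
    while (b.exponent > a.exponent) { b.mantissa *= 10.0; --b.exponent; }

    // Step 2: with equal exponents, the mantissas can simply be added.
    SciNumber sum = { a.mantissa + b.mantissa, a.exponent };
    if (sum.mantissa == 0.0) return sum;

    // Step 3: shift the answer back into standard form.
    while (std::fabs(sum.mantissa) >= 10.0) { sum.mantissa /= 10.0; ++sum.exponent; }
    while (std::fabs(sum.mantissa) < 1.0)   { sum.mantissa *= 10.0; --sum.exponent; }
    return sum;
}
```

Adding 3.1 x 10^5 and 2.6 x 10^3 this way gives a mantissa of about 3.126 and an exponent of 5.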