- Run this code and look at the output:
11-numlimits.ox
    #include "oxstd.h"
    #include "oxfloat.h"
    main() {
        println("Some predefined constants ",
            "\n +Infinity : ", M_INF_POS,
            "\n -Infinity : ", M_INF_NEG,
            "\n Decimal digits of precision : ", DBL_DIG,
            "\n Machine precision : ", DBL_EPSILON,
            "\n Number of bits in mantissa : ", DBL_MANT_DIG,
            "\n Maximum double value : ", DBL_MAX,
            "\n Minimum positive double value: ", DBL_MIN,
            "\n Minimum exp(.) argument : ", DBL_MIN_E_EXP,
            "\n Maximum exp(.) argument : ", DBL_MAX_E_EXP,
            "\n Maximum integer value : ", INT_MAX,
            "\n Minimum integer value : ", "%-15i", INT_MIN);
        println("\nWhat happens when 1 is added to ", "%-15i", INT_MAX, "?");
        println("Answer: ", "%15i", INT_MAX+1);

        println("\nAnd if 11 is subtracted from ", "%15i", INT_MIN, "?");
        println("Answer: ", "%15i", INT_MIN-11);

        println("\nCompute a big exponent:\n exp(DBL_MAX_E_EXP-1.0) = ", exp(DBL_MAX_E_EXP-1.0),
            "\nEven Bigger:\n exp(DBL_MAX_E_EXP+1.0) = ", exp(DBL_MAX_E_EXP+1.0));
        println("\nCut the minimum positive value in half:\n ", DBL_MIN, "/2 =",
            "%25.22f", DBL_MIN / 2.0);
    }
It is likely that most of the output means next to nothing to you at this point. The point of this chapter is to understand what these values mean for doing computational economics.
- Two Bits, Four Bits, Eight Bits, A Dollar
A computer word contains a fixed, finite number of bits, so it cannot represent every possible integer value. Instead, a word will be used to represent a finite number of different integers as different bit patterns. We will use four-bit words, but word sizes of 32, 64 and 128 bits are standard in hardware.
- Representing 5 and -5 in binary 2s Complement.
    +5    = 0101
    -5    = 1010 + 0001 = 1011
    -(-5) = 0100 + 0001 = 0101 = +5
If we carry out ordinary addition for +5 + (-5) we get the result we want; note that in binary, 1+1 equals 0 and "carry the 1".

- 2s Complement integers in four bits
      -8   -7   -6   -5   -4   -3   -2   -1    0   +1   +2   +3   +4   +5   +6   +7
    --------------------------------------------------------------------------------
    1000 1001 1010 1011 1100 1101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0111
5 + (-5) does equal 0.

        0 1 0 1
      + 1 0 1 1
      ---------
    [1] 0 0 0 0
In the four bit word we get 0000, but there is a leading 1 that would go in the fifth bit if it existed. Unlike our symbolic operations, there is a finite number of bits: four. The extra 1 simply disappears into the bit bucket. Of course, if the word had another bit then -5 would be a different pattern and adding +5 to it would still result in 0 in the word.
- Float On ♭ The Floaters 1977
Understanding the coding of integers, whether signed or unsigned, on a computer is fairly straightforward. Conceptually there is not much more to it than the discussion above. Electronically, semi-conductor components are designed to code arithmetic operations on integers, which goes beyond what an economist needs to know.
- 3194.0 as a FPR
    3194$_{10}$ = 1100 0111 1010$_2$
                = 1.100 0111 1010 × 2$^{0000\,1011}$   [the binary point moves 11$_{10}$ places, and 11$_{10}$ = 1011$_2$]
                = 0 100 0111 1010 × 2$^{0000\,1011}$   [the implicit first 1 is handled by hardware]
On our model computer Tim, the exponent and mantissa are stored in a single word. The floating point hardware on the CPU treats the two parts of the word differently. So some number of bits are reserved for the signed binary integer for the mantissa and some for the signed binary integer for the exponent. If 64 bits are in the word and, say, 53 bits are used to store the mantissa, that leaves 11 bits for the exponent. With this assumption we can complete a FPR coding for 3194.0:
- 3194.0 as a FPR in a 64 bit word
    Base 10: 3194.0
    Base 2:  0 0000 0000 0000 0000 … 0000 0000 0100 0111 1010   0 00 0000 1011
             |---------------------- mantissa ----------------|  |--exponent--|
    Base 16: 0000 0000 0023 D00B
- Making Flippy Floppy ♭ Talking Heads 1983
In computer lingo, a floating point operation (FLOP) is the multiplication or division of two FPRs. So $ab + (c/d)$ involves two FLOPs. It turns out that adding FPRs takes much less time than multiplying, so typically the number of additions and subtractions is ignored. The cost of carrying out an algorithm can be measured in how many FLOPs it requires. And sometimes the acronym FLOPS refers to floating point operations per second, which is a way to measure how quickly a computer system can perform numerical operations. Multiplication of FPRs is fairly easy to understand:
    3194 × .036 = (.3194 × 10$^4$) × (.36 × 10$^{-1}$)
                = (.3194 × .36) × 10$^{4+(-1)}$
                = .114984 × 10$^3$
In other words, FPR multiplication needs an integer multiplication and an integer addition. Adding FPRs of very different magnitudes is another matter:

    31,940,000 + 0.0003194 = 31,940,000.0003194 = .3194000000031940 × 10$^8$
The mantissa of the sum contains 15 significant decimal digits, which is right at the limit of a double's precision (DBL_DIG = 15).

- Overflow and Underflow
Overflow occurs when an operation results in a number larger in magnitude than the coding scheme can store in the available number of bits. For example, with signed binary integers in 4 bits the largest value that can be stored is +7 = 0111.
- Operations with $+\infty_N$ result in $+\infty_N$ (here $x$ is an ordinary positive FPR): $$+\infty_N + x\quad=\quad +\infty_N - x \quad=\quad +\infty_N * x \quad=\quad +\infty_N \div x \quad=\quad +\infty_N.\nonumber$$
- Operations with $-\infty_N$ result in $-\infty_N$: $$-\infty_N + x\quad=\quad -\infty_N - x \quad=\quad -\infty_N * x \quad=\quad -\infty_N \div x \quad=\quad -\infty_N.\nonumber$$
- Dividing by infinity or 0 gives the proper results: $$x \div 0 \quad=\quad +\infty_N\nonumber$$ $$x \div -\infty_N \quad=\quad x \div +\infty_N \quad=\quad 0\nonumber$$
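A minimal sketch (ours, not one of the chapter's listings) of these rules in Ox, producing a numerical infinity with 1/0 just as the chapter's exercise listing does:

    #include "oxstd.h"
    main() {
        decl inf = 1/0;                  // one way to produce numerical +infinity
        println("+Inf + 3 = ", inf + 3); // still +Inf
        println("+Inf * 3 = ", inf * 3); // still +Inf
        println("3 / +Inf = ", 3 / inf); // dividing by infinity gives 0
        println("3 / 0    = ", 3 / 0);   // dividing a positive number by 0 gives +Inf
    }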
- The puzzle solved. We can now explain the puzzle posed by this bit of code:
08-If6was9.ox
    #include "oxstd.h"
    main() {
        decl x;
        x = 0.25;
        print( "Is 2.5=10*(0.25)? ");
        // 0 1 2 3 4 5 6 7 8 9   check that there are 10 x's
        if (2.5 == x+x+x+x+x+x+x+x+x+x)
            println("Yes");
        else
            println("No");
        x = 0.1;
        print( "Is 1.0=10*(0.10)? ");
        if (1.0 == x+x+x+x+x+x+x+x+x+x)
            println("Yes");
        else
            println("No");
    }
As you may recall, the problem is that Ox seems to conclude that $10\times 0.1 \ne 1.0$. The first piece of the puzzle is now in place: Ox uses floating point arithmetic and so only some real numbers can be stored exactly. Perhaps there is "round off" error occurring. Yes, it is round-off and it is the kind of round-off you are already familiar with! From elementary school, we learn that
    1 / 3 = 0.3333…
That is, in base 10, the fraction 1/3 cannot be stored exactly in a finite number of places. It is a repeating fraction. Of course, other fractions can be stored exactly, including
    1 / 10 = 0.1
    1 / 4  = 0.25
- Fuzzy Logic
To Repeat: In general, calculations carried out on FPRs should also be expected to contain round-off error. So, instead of testing exact equality, we write:
    if ( fabs(y - 10*x) < 1E-15 )
        println("Yes");
    else
        println("No");
    Routine           Purpose
    --------------------------------------------------------------------
    isfeq(a,b)        Check if a is numerically equal to b
    isdotfeq(a,b)     Check numerical equality element-by-element
    fuzziness(eps)    Set the tolerance on numerical equality (default is 1E-13)
The first will compare the two arguments accounting for round-off error. It returns 1 (TRUE) if they are close enough and 0 (FALSE) otherwise. In this sense, these two code segments do the same thing:

    if ( fabs(y - 10*x) < 1E-13 )          if (isfeq(y,10*x))
        println("Yes");                        println("Yes");
    else                                   else
        println("No");                         println("No");
Further, there is a very useful one-line version:

    println( isfeq(y,10*x) ? "Yes" : "No" );
    fuzziness(epsilon);
        ⋮
    isfeq(x,z);
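Putting the pieces together, here is a small runnable sketch (ours, using only the routines listed in the table above) that contrasts an exact comparison with a fuzzy one:

    #include "oxstd.h"
    main() {
        decl x = 0.1;
        decl sum = x+x+x+x+x+x+x+x+x+x;                             // ten x's, as in the puzzle
        println("Exact test: ", (sum == 1.0) ? "Yes" : "No");       // No: round-off has crept in
        println("Fuzzy test: ", isfeq(sum, 1.0) ? "Yes" : "No");    // Yes: equal up to the tolerance
        fuzziness(1E-8);                  // loosen the default tolerance of 1E-13
        println("Fuzzy test: ", isfeq(sum, 1.0) ? "Yes" : "No");    // still Yes
    }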
Four bits has 2$^4$ or 16 different values: 0000, 0001, 0010, …, 1111. So at most we can code 16 different integers in a four bit word. By analogy, two decimal digits can store 10$^2$ = 100 different values, 0, 1, …, 99. (We apparently adopted the decimal system because we have ten fingers. But we can represent 11 different values with our hands if no fingers equals 0. So perhaps if computer geeks had been around in the Stone Age we would count using base 11!) We think of the number 35
as 3$\times 10^1$+5$\times 10^0$. With four binary digits a computer engineer could wire arithmetic circuitry to interpret the bits as powers of two. For example, a 4-bit word equal to 0101
would be interpreted (by hardware) as
$$0101_2 = 0\times 2^3 +1\times 2^2 +0\times 2^1 + 1\times 2^0 = 5_{10}\nonumber$$
This is an example of a four bit unsigned binary integer.
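To make the interpretation concrete, here is a short Ox sketch (ours; the bit vector and the loop are purely illustrative, not how the circuitry works) that accumulates the powers of two:

    #include "oxstd.h"
    main() {
        decl bits = <0,1,0,1>;          // the four-bit word 0101, most significant bit first
        decl i, val = 0;
        for (i = 0; i < columns(bits); ++i)
            val = 2*val + bits[i];      // shift what we have so far left and bring in the next bit
        println("0101 as an unsigned binary integer: ", val);       // 5
    }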
What about negative numbers? Let's first think about how humans do this and then compare it to the ways developed to do it digitally. Humans use a binary code, "+" and "-", to modify the interpretation of a number. We allow + to be implied or the default, which means there are two different symbols for one state: "+" and a blank. Humans place the sign bit in the leftmost, highest valued, digit, as in $-55$ and $+55$.
Naturally, a computer can store the binary sign information in a single bit: positive and negative coded as 0 and 1. And it matters where in the circuitry this "sign bit" shows up, since the hardware must treat it differently than a power of 2. For our purposes we can write the sign bit as the leftmost bit, just as with signed decimal numbers. So that means that in the example above, 0101
, the leftmost 0 would not be the 2$^3$ position. The 0 could mean either positive or negative. As it turns out, it will be treated as a plus sign. So now we reinterpret the four bits as $0101 = (+) 1\times 2^2 + 0 + 1\times 2^0 = +5_{10}$.
It is worth noting (and this will be repeated several times) that the bit pattern has not changed, only the interpretation of the bit pattern. The interpretation of the pattern is done by the circuitry, and the CPU can have circuitry for both signed and unsigned integer arithmetic. In fact, it does, because both interpretations of a fixed number of bits will be important.
At this point, the symbolic and digital representations of integers part company. If we were to continue with the human approach, computers would treat 1101
as -5$_{10}$. That is, the sign bit turns on and makes the absolute value stored in the other three bits negative. But, it turns out, this is not the best way to store negative numbers digitally. Instead, computers use 2s Complement Arithmetic. The complement of a bit is its switched value: 0 becomes 1 and 1 becomes 0. In 2s complement arithmetic, a number is negated by complementing each bit in the word and then adding 1 to the result.
Remember that in binary, 1+1 equals 0 and "carry the 1". So the 16 possible patterns of bits map into the signed integers in a wrap-around fashion. Start at 0 and go up to +7.
If we add 1 to 0111 we wrap around to 1000 = -8. Keep adding 1 to make your way back to 0. This means that in four bits $7+1 = -8$? This is an example of numeric overflow. The number we are trying to represent, +8, does not fit in the fixed number of bits using 2s complement arithmetic. So in the final value position, the extra bit overflows into the sign bit. This would be the same as if on paper we only had two places for decimal digits and we added +01 to +99. Without the ability to magically insert a new place for 10$^2$, the final 1 that we carry over spills into the sign column and turns + into -.
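To see the wrap-around without building any hardware, here is a small sketch that emulates a four-bit 2s complement word with ordinary arithmetic (the wrap() helper is ours, purely for illustration):

    #include "oxstd.h"

    // wrap(v): map any integer onto the 16 values a four-bit 2s complement
    // word can hold (-8..+7); bits carried past the fourth position are lost.
    wrap(const v) {
        decl r = v - 16*floor(v/16);    // keep only the low four bits (v mod 16)
        return r >= 8 ? r - 16 : r;     // patterns 1000..1111 are the negative values
    }

    main() {
        println(" 5 + (-5) in four bits: ", wrap(5 + wrap(-5)));    // 0, as the bit-level addition gives
        println(" 7 +   1  in four bits: ", wrap(7 + 1));           // overflows (wraps) to -8
        println("-(-5)     in four bits: ", wrap(-wrap(-5)));       // negating twice gets back +5
    }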
Things are a bit different when we ask about real numbers, elements of $\Re$, the real line. Obviously we cannot represent real numbers exactly on a computer because we only have a finite number of states; even if we used all of memory to represent a single number we would still be dealing with a finite system. The digits of an irrational number like $\pi$ go on forever. So a computer can easily represent a concept like 0 or -5 in an exact sense, but not the concept of an arbitrary point on the real number line.
Let's first rule out the irrational numbers like $\pi$ to just consider how to approximate the set of rational numbers. Recall that a rational number $x$ takes the form $x = a/b$ where $a$ and $b$ are integers ($b\neq 0$). There are an infinite number of rationals in any interval, such as $[-1,+1]$. So in the case of integers we can use 2s complement to represent all integers within a finite distance of 0 (the distance is determined by the number of bits in the word). But this is not true of the rationals, so there is no hope of being able to represent all the rationals in any interval of the real line.
Before we get specific about how rationals are represented digitally, consider the fact that we would not want our digital rational numbers to be evenly spaced on the real line like the signed binary integers are. We would want them to be closer together where we expect to do most of the computing. There is no reason to suppose that any particular calculation is going to require close approximations to numbers around, say, 10 million or -623.231529. Instead, it makes sense that the digital approximation is best near 0, meaning that we want more digital rationals around 0 than other points on the real line.
For a rational number, such as 31.94, we can always rewrite it as $31.94 = .3194 \times 10^2$. In this formulation the decimal place "floats" so that the first significant digit in the number, namely 3, is in the 10$^{-1}$ place. (Or you could write it as $3.194 \times 10^1$.) Now notice that the rational number is coded using three integers: 3194, 10, and 2. These three elements of scientific notation are called the mantissa, base, and exponent, respectively. If we always use the same base, then two signed integers can be used to represent any (non-repeating) decimal. Of course, the number of bits available to store the integers limits how many different rationals can be stored this way.
Thus, on computers, a real number is approximated by a floating point real number (FPR). In a FPR the base is implicit, and is always 2. So only two signed integers code a rational: the exponent and the mantissa. In base 10 the leading digit (the 3 in 31.94) can be anything between 1 and 9, but not 0. But in base 2 the leading bit is always 1, so it contains no information. Therefore the leading 1 does not need to be stored. Instead, the hardware can be wired to treat the FPR as if a 1 were there without actually needing a bit to store it.
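As a quick illustration (our own sketch, not how the hardware actually does it), we can recover the binary exponent and mantissa of 3194.0 and confirm that the mantissa always begins with a 1:

    #include "oxstd.h"
    main() {
        decl x = 3194.0;
        decl e = floor(log(x)/log(2));       // binary exponent: 11, since 2^11 <= 3194 < 2^12
        decl twoe = 1, i;
        for (i = 0; i < e; ++i) twoe = 2*twoe;   // 2^e by repeated doubling (2048)
        println("3194 = ", "%.10f", x/twoe, " x 2^", "%.0f", e);
        // The mantissa 1.5595703125 lies in [1,2), so its leading binary digit is the implicit 1.
    }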
The distribution of FPRs on the real line depends on the number of bits in the mantissa and exponent. Binary integers are distributed evenly; FPRs are not. So there can be many FPRs near 0 and some FPRs much larger than the largest binary integer. There exist smallest positive and negative floating point values different from zero, but it is pretty easy to see by analogy that these numbers are not completely useful.
Suppose I am working in decimal and can only store 6 digits, but the decimal point can float. The smallest positive number I can store is .000001. And the biggest is 999999. If I add these together I get 999999.000001. However, that takes 12 digits to store exactly and I only have 6. Keeping the most significant digits the result is simply 999999. In other words, I added a positive number to something and got back the same number. Just because I can store .000001 does not mean I can do precise arithmetic with it. Less extreme cases of round-off error can also occur. In 6 digits, 888 + 0.0012 = 888.0012, which would be rounded to 888.001. In this case the result of adding a positive number to 888 is more than 888 but the result is not exact.
Consider round-off error in scientific notation. Note that $2^{52}$ = 4,503,599,627,370,496, which is comparable in size to the mantissa shown above. When 53 bits are used to store the mantissa of a FPR, only 52 bits are available for the magnitude (because half of the bit patterns are used for negative values). That means that Ox or any other standard application using current hardware could not store the result exactly. Round-off error would occur. While the difference between the exact value and the stored value is not large (in percentage terms), the point is that 0.0003194 is not a very small number (its exponent in scientific notation is -4). A number with a few more leading zeros would result in no detectable change when added to the large number 31,940,000.
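Here is a small sketch (ours) that reproduces the problem with the numbers used above:

    #include "oxstd.h"
    main() {
        decl big = 31940000.0;
        println("big + 0.0003194 = ", "%22.12f", big + 0.0003194);  // trailing digits betray round-off
        println("big + 1E-10     = ", "%22.12f", big + 1E-10);      // too small to register at all
        println("unchanged?        ", (big + 1E-10) == big);        // 1 (TRUE)
    }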
The standard measure of precision is called machine precision. We will denote machine precision $\epsilon_m$ and define it as the smallest FPR such that $1.0-\epsilon_m < 1.0$.
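The chapter's constants let us check this definition directly; a brief sketch (assuming, as in the first listing, that oxfloat.h provides DBL_EPSILON and DBL_MIN):

    #include "oxstd.h"
    #include "oxfloat.h"
    main() {
        println("1 - DBL_EPSILON < 1 ? ", (1.0 - DBL_EPSILON) < 1.0);  // 1: epsilon is big enough to register
        println("1 - DBL_MIN     < 1 ? ", (1.0 - DBL_MIN) < 1.0);      // 0: DBL_MIN is far too small to register
    }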
Adding and subtracting FPRs of very different magnitudes can lead to large round-off errors, even if both of the original numbers can be stored exactly in the given number of bits. A good idea is to define the units of the variables that appear in your program so that the magnitudes of numbers are not wildly different. If so, the round-off error will stay small and the overall precision of the calculation remains within the range Ox reports: DBL_DIG = 15 decimal digits of precision.
However, any time a FPR operation is performed there is a potential for round-off error, not just when adding numbers of wildly different magnitudes.
So 7+1 in this case results in a number that overflows the storage: the extra bit spills out of the word and 0111 wraps around to 1000. The same is true for FPRs. Underflow occurs for a FPR when an operation results in a non-zero number closer to zero than the FPR can store in the space available.
In the early days of computing, these were major concerns because the results of these operations can be unpredictable. If not handled by the hardware or software, overflow can result in a value that looks perfectly reasonable. From then on, the calculations using the result are completely meaningless, but there might be no way to discover this. The extreme case of overflow is trying to compute and store infinity.
Symbolically, $3/0$ is not a real number, although we can associate it with $+\infty$ in the "extended reals". We can also say $\ln(0)= -\infty$. Note that $0\times 10^1 = 0 \times 10^2 = 0 \times 10^{-1} = \dots$. This means there are many equivalent bit patterns in a 64 bit FPR. If hardware is built to ensure that true 0s are only stored one way, the other bit patterns that equal 0 can be used by hardware for special purposes.
Numerically, two special bit patterns are designated numerical infinity: $+\infty_N$ and $-\infty_N$. Hardware treats them properly, in the sense of the rules listed above.
What about $0/0$? Even in the extended reals this expression is undefined. It is misleading to assign it any value. Numerically, it is assigned another special bit pattern called NaN (Not a Number). NaN is treated differently than numerical infinity: $1/\infty = 0$, but any operation involving NaN results in NaN.
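A minimal sketch of the difference (ours; isnan() is the same routine used in the exercise listing at the end of the chapter):

    #include "oxstd.h"
    main() {
        decl x = 0/0;                     // one way to produce a NaN
        println("0/0      = ", x);
        println("0/0 + 7  = ", x + 7);    // any operation involving NaN gives NaN back
        println("1/(1/0)  = ", 1/(1/0));  // by contrast, 1 divided by infinity is 0
        println("isnan(x) = ", isnan(x)); // 1: use isnan() to detect NaNs in your code
    }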
However, whether a fraction (a rational number) repeats or not depends on the base the number is represented in, and it turns out that $1/10$ is a repeating fraction in base 2. The number 1.0 (which equals 10*0.1) is of course stored exactly in base 2. But with 53 bits in the mantissa there will always be round-off error when operations involve 1/10.
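Printing a few extra digits makes the repeating-fraction problem visible; a short sketch (ours):

    #include "oxstd.h"
    main() {
        decl x = 0.1;
        println("0.1 as stored      = ", "%25.20f", x);                     // not exactly one tenth
        println("ten 0.1's added up = ", "%25.20f", x+x+x+x+x+x+x+x+x+x);   // not exactly 1
        println("1.0 exactly        = ", "%25.20f", 1.0);
    }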
NO operation on FPRs should be considered as exact unless it is guaranteed to involve only integer values.
y==10*x is not a safe way to check whether $y=10x$ or not. Exact comparisons of FPR values (using ==) should not be relied on in your code, as in the case that 10*0.1 ≠ 1.0.
For typical magnitudes (not enormous or infinitesimal), FPRs have no more than 15 decimal digits of accuracy. So a safer way to test equality of FPRs is the fabs() test shown earlier.
fabs() is Ox's absolute value function. And Ox understands scientific notation, so 1E-15 = $10^{-15}$. This says that if the difference between the two FPR expressions is small enough then call them equal.
Ox has built-in routines for doing this. isfeq() is a "fuzzy equality" tester. The default level of fuzziness is 1E-13, or $10^{-13}$.
Thus, the two code segments shown earlier (the fabs() test and the isfeq() test) are equivalent.
The ? … : … operator lets you test for a condition and return a value without using if() … else. So the one-line statement println( isfeq(y,10*x) ? "Yes" : "No" ); shown earlier is also equivalent.
Definition 1. Numerical Equality
Exercises
- Run 12-fpr-lims.ox below and explain which calculations produce .NaN or .Inf.

12-fpr-lims.ox
    #include "oxstd.h"
    #include "oxfloat.h"
    main() {
        decl x;
        println("Extreme Calculations in Ox\n");
        println("1. Ways to produce numerical Infinity and 0:",
            "\n log(0)  = ", log(0),
            "\n 1/0     = ", 1/0,
            "\n 1/log(0)= ", 1/log(0));
        x = 0/0;
        println("2. Ways to produce NaNs: x = 0/0",
            "\n x       =", x,
            "\n x+5     =", x+5,
            "\n 1/log(x)=", 1/log(x));
        println("Precision around 1.0:",
            "\n 1-DBL_MIN =", "%25.23f", 1-DBL_MIN,
            "\n 1-epsilon =", "%25.23f", 1-DBL_EPSILON);
        println("Checking for NaNs in your code:");
        if (isnan(x))
            oxrunerror("Big problems with my project: Check value of x!");
    }
- Run 11-numlims.ox and explain the output. Explain the relationship between these pairs of machine limits:

    DBL_MAX and DBL_MAX_E_EXP
    DBL_MIN and DBL_MIN_E_EXP
    DBL_DIG and DBL_EPSILON