Principle of Floating point

- March 13, 2012

Introduction:

Compared with fixed-point arithmetic, floating point lets us represent both much larger integers and much smaller fractions in the same expression, because the exponent is not fixed.

A floating-point value is stored in memory as three fields: a sign flag (S), a stored exponent (q, which gives the scale 2^e) and a mantissa (M). The formats are shown below.

64-bit floating point (double): 1 sign bit, 11 exponent bits, 52 mantissa bits

32-bit floating point (float): 1 sign bit, 8 exponent bits, 23 mantissa bits

We need the exponent to be signed, but the exponent field stores an unsigned number q. So the format defines the real exponent e as q minus a fixed bias (127 for float, 1023 for double). The calculation is shown below.

e=q-bias
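To make the bias concrete, here is a minimal sketch in C that reads q out of a float and computes e. It assumes that float is the 32-bit format described above (true on most platforms, but not guaranteed by the C language). The helper names float_bits, stored_exponent and real_exponent are made up for this post.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the raw 32 bits of a float into an integer.
   Assumes float is the 32-bit format: 1 sign, 8 exponent, 23 mantissa bits. */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Stored (biased) exponent q: the 8 bits right after the sign bit. */
static int stored_exponent(float f)
{
    return (int)((float_bits(f) >> 23) & 0xFF);
}

/* Real exponent e = q - bias, with bias = 127 for a 32-bit float. */
static int real_exponent(float f)
{
    return stored_exponent(f) - 127;
}
```

For 100.0f the stored exponent is 133 (binary 10000101), so real_exponent(100.0f) gives 6, which matches 100 = 1.5625 * 2^6.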

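The number the format above represents is:

(-1)^S × 1.M × 2^e

where S is the sign flag, M is the stored mantissa bits and e is the real exponent.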

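Here you must remember that the value is 1.M, not just M. This 1.M is what we call the "significant digits" (the significand), and its leading 1 is implicit: it is not stored in memory. So when you do arithmetic on the fields by hand, you have to shift the result and adjust the exponent until there is exactly one "1" to the left of the ".".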
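The decoding rule can be sketched in C as follows. This is only an illustration under the assumption that float uses the 32-bit layout above, and it handles normalized numbers only (no zero, subnormals, infinity or NaN). The function name decode_float is hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rebuild the value of a normalized float from its
   three fields, assuming 1 sign bit, 8 exponent bits, 23 mantissa bits. */
static double decode_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t s = bits >> 31;                  /* sign flag S */
    int      q = (int)((bits >> 23) & 0xFF);  /* stored exponent q */
    uint32_t m = bits & 0x7FFFFF;             /* 23 stored mantissa bits M */

    /* 1.M: put the hidden 1 in front of the stored bits, then scale by 2^-23. */
    double significand = (double)(m | 0x800000) / 8388608.0;

    /* (-1)^S * 1.M * 2^e with e = q - 127. */
    double value = ldexp(significand, q - 127);
    return s ? -value : value;
}
```

Calling decode_float(100.0f) rebuilds 1.5625 * 2^6 from the fields and returns 100.0 again.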

Precision of floating point:

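Because the significant digits have only sizeof(M) stored bits (here sizeof(M) means the number of mantissa bits) plus one virtual bit, the precision of floating point is only about 2^(-sizeof(M)-1) relative to the number itself, that is 2^-24 for float. Note that this is a relative limit. A small number on its own can still be represented, because the exponent takes care of the scale. But when it is added to a number that is about 2^(sizeof(M)+1) times larger or more, its contribution is rounded away.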
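A small C sketch can make this relative limit visible. It assumes float is the 32-bit format with the default round-to-nearest mode, and uses FLT_EPSILON from <float.h>, which is 2^-23 for that format. The helper name is_absorbed is made up for this post.

```c
#include <float.h>

/* Hypothetical helper: returns 1 if adding `small` to `big` leaves `big`
   unchanged in float arithmetic, 0 otherwise. */
static int is_absorbed(float big, float small)
{
    /* volatile forces the sum to be rounded to float before comparing. */
    volatile float sum = big + small;
    return sum == big;
}
```

Under these assumptions, is_absorbed(1.0f, FLT_EPSILON / 2) is 1 because 2^-24 is lost next to 1.0, while is_absorbed(1.0f, FLT_EPSILON) is 0. The value FLT_EPSILON / 2 itself is still a perfectly valid nonzero float.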

On the other hand, once an integer exceeds 2^(sizeof(M)+1), it is no longer guaranteed to be exact. This is again because the significant digits have a limited number of bits (sizeof(M) stored bits plus the virtual bit). If a number needs more bits than that, there is not enough room for them and the extra precision is rounded away. An example is shown below.

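     #include <stdio.h>

     int main(void)
     {
         int x;
         float y;

         x = 1;
         /* Meant to stop when x overflows and turns negative; it never does. */
         while (x > 0)
         {
             y = x + 1;   /* stored as float: only 24 significant bits */
             x = y;       /* converted back to int */
         }
         return 0;
     }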

This program never ends, because x never grows far enough to overflow. Once x needs more bits than the significant digits can hold, y cannot increase anymore: 16777217 is rounded back to 16777216 when it is stored in y. So this program gets stuck at x=16777216 (2^24).
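If you want to see the stuck value without waiting forever, the same loop can be given an exit condition. This is a sketch of that idea, not the original program. It assumes float is the 32-bit format, and the function name find_stuck_value and its limit parameter are made up for this post.

```c
/* Hypothetical helper: repeat y = x + 1; x = y; and return the first x
   that stops changing, or -1 if x reaches `limit` first. */
static int find_stuck_value(int limit)
{
    int x = 1;
    while (x < limit) {
        /* The int sum is rounded when it is stored as a float. */
        volatile float y = x + 1;
        int next = (int)y;
        if (next == x)
            return x;   /* x can no longer grow */
        x = next;
    }
    return -1;
}
```

With a limit such as 20000000, the function should return 16777216, the same 2^24 where the original loop hangs.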

The arithmetic (addition & subtraction) of floating point:

Here we talk about how addition and subtraction of floating point work. As we know, when we want to add or subtract numbers written with exponents, we have to align the exponents first. Floating-point calculation needs the same operation.

So here we shift the mantissa and increase the exponent to solve this problem. We start from the smaller number. Each time we shift its mantissa 1 bit to the right (with the hidden 1 shifted in as well), we increase its exponent by 1, until the exponent is equal to that of the bigger number.

For example, let us add 100.0 and 0.25, whose float formats are "0 10000101 10010000000000000000000" and "0 01111101 00000000000000000000000" respectively.

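Then we start to shift 0.25. The first shift moves the hidden 1 into the mantissa field, so from that row on the bit to the left of the "." is 0, not 1: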

0 01111101 00000000000000000000000 (original number)

0 01111110 10000000000000000000000 (right shift 1)

0 01111111 01000000000000000000000 (right shift 2)

0 10000000 00100000000000000000000 (right shift 3)

0 10000001 00010000000000000000000 (right shift 4)

0 10000010 00001000000000000000000 (right shift 5)

0 10000011 00000100000000000000000 (right shift 6)

0 10000100 00000010000000000000000 (right shift 7)

0 10000101 00000001000000000000000 (right shift 8)

Now both numbers have the same exponent, so we add the mantissa of 0 10000101 10010000000000000000000 to that of 0 10000101 00000001000000000000000.

0 10000101 10010000000000000000000

+ 0 10000101 00000001000000000000000

-----------------------------------------------------------

0 10000101 10010001000000000000000

100.25
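The align-then-add procedure above can be sketched in C with integer operations on the fields. This is a simplified illustration, not how any particular hardware or library does it. It assumes the 32-bit float layout and only handles positive, normalized inputs. Bits shifted out during alignment are simply dropped, whereas real hardware rounds them. The function name add_aligned is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add two positive, normalized floats by aligning
   exponents. No rounding, no signs, no zero/subnormal/infinity/NaN. */
static float add_aligned(float a, float b)
{
    uint32_t ba, bb;
    memcpy(&ba, &a, sizeof ba);
    memcpy(&bb, &b, sizeof bb);

    /* Make `ba` the operand with the larger exponent. */
    if ((ba >> 23) < (bb >> 23)) { uint32_t t = ba; ba = bb; bb = t; }

    uint32_t qa = ba >> 23, qb = bb >> 23;     /* stored exponents (sign is 0) */
    uint32_t ma = (ba & 0x7FFFFF) | 0x800000;  /* 1.M with the hidden 1 explicit */
    uint32_t mb = (bb & 0x7FFFFF) | 0x800000;

    /* Align: shift the smaller operand right once per exponent step. */
    uint32_t shift = qa - qb;
    mb = (shift < 32) ? (mb >> shift) : 0;     /* shifted-out bits are dropped */

    uint32_t sum = ma + mb;
    if (sum & 0x1000000) {                     /* carry past the hidden bit */
        sum >>= 1;                             /* renormalize back to 1.M */
        qa += 1;
    }

    /* Pack exponent and the 23 stored mantissa bits back into a float. */
    uint32_t result = (qa << 23) | (sum & 0x7FFFFF);
    float out;
    memcpy(&out, &result, sizeof out);
    return out;
}
```

For the example in this post, add_aligned(100.0f, 0.25f) shifts 0.25 right by 8 bits and returns 100.25f. The carry branch covers cases like 1.5 + 1.5, where the sum no longer fits in 1.M and has to be shifted right once.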

References:

[1] Chapter 7 -- floating point arithmetic, http://pages.cs.wisc.edu/~smoler/x86text/lect.notes/arith.flpt.html

Captain Mingdos
