Converting from IEEE-754 to Fixed Point with nearest rounding

Diego Ruiz

I am implementing a converter from IEEE-754 32-bit floating point to S15.16 fixed point on an FPGA. The IEEE-754 standard represents a number as:

x = (-1)**s * 2**(exp - 127) * 1.m

where s is the sign, exp is the stored (biased) exponent and m is the mantissa. Each of these values is represented separately in fixed point.

The simplest way is to take the IEEE-754 value and multiply it by 2**16, then round the result to the nearest integer so that the conversion error is smaller than with plain truncation.
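For reference, the multiply-and-round approach can be modelled in a few lines of Python. This is only a software sketch for checking results, not something intended for the FPGA. The name `float_to_fixed_ref` is made up for illustration, and the input is first passed through `struct` so that it really is a binary32 value. Note that Python's built-in `round` rounds ties to even.

```python
import struct

def float_to_fixed_ref(x):
    """Reference S15.16 conversion: scale by 2**16, then round to nearest."""
    # Force x to the nearest binary32 value, as the hardware would see it
    (x32,) = struct.unpack('!f', struct.pack('!f', x))
    # binary32 * 2**16 is exact in a Python float (binary64);
    # round() on a float rounds half to even and returns an int
    return round(x32 * 65536)

print(float_to_fixed_ref(1.0))   # 65536
print(float_to_fixed_ref(2.2))   # 144179
```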

Problem: I am doing this on an FPGA device, so I can't do it this way.

Solution: Use the binary representations of the values to perform the conversion via bitwise operations.

From the previous expression, and given that the exponent and mantissa are in fixed point, logic tells me that I can compute the result by scaling it by 2**16:

x_fixed = 2**16 * (-1)**s * 2**(exp - 127) * 1.m

Because multiplying by a power of two is a shift in fixed point, it is possible to rewrite the expression as (in Verilog notation):

x_fixed = ({1'b1, m[22:7]}) << (exp - 126)

OK, this works perfectly, but not always. The problem is: how can I apply round-to-nearest? I have run experiments to see what happens in different ranges, where each range lies between two consecutive powers of two. That is:

  • For values from 0 < x < 1
  • For values from 1 <= x < 2
  • For values from 2 <= x < 4

And so on for the following powers of two. For values from 1 to 2, I have been able to round without problems by looking at the two bits that follow the 16 mantissa bits I keep, i.e. the first two discarded bits. These bits indicate:

if 00: no rounding is necessary
if 01 or 10: add one to the shifted mantissa
if 11: add two to the shifted mantissa

To run the experiments I have implemented a minimal solution in Python using bitwise operations. The code is:

import struct

# Get the bits of sign, exponent and mantissa
def FLOAT_2_BIN(num):
    bits, = struct.unpack('!I', struct.pack('!f', num))
    N = "{:032b}".format(bits)
    a = N[0]        # sign
    b = N[1:9]      # exponent
    c = "1" + N[9:] # mantissa with hidden bit
    return {'sign': a, 'exp': b, 'mantissa': c}

# Convert the floating point value to fixed via
# bitwise operations
def FLOAT_2_FIXED(x):
    # Get the IEEE-754 bit representation
    IEEE754 = FLOAT_2_BIN(x)
    
    # Shift count: unbiased exponent (exp - 127) plus 1, because the
    # 16 MSBs carry 15 fraction bits while S15.16 needs 16
    shift = int(IEEE754['exp'],2) - 126
    
    # Get 16 MSB from mantissa
    MSB_mnts = IEEE754['mantissa'][0:16]
    
    # Convert value from binary to int
    value = int(MSB_mnts, 2)
    
    # Get the rounding bits (similar to guard bits?)
    rnd_bits = IEEE754['mantissa'][16:18]
            
    # Shifted value by exponent
    value_shift = value << shift
    
    # Round-to-nearest control
    # Only works for values from 1 to 2
    if rnd_bits == '00':
        rnd = 0
    elif rnd_bits == '01' or rnd_bits == '10':
        rnd = 1
    else:
        rnd = 2
    return value_shift + rnd

The test with values between 1 and 2 gives the following results:

Test for values from 1 <= x < 2

 FLOAT 32 VALUE    16 MSB MANTISSA   THEORETICAL FIXED   PRACTICAL FIXED    RND BITS    DIFF    4 LSB MANTISSA
----------------  -----------------  -----------------  -----------------  ----------  ------  ----------------
       1          1000000000000000         65536              65536            00        0           0000
      1.1         1000110011001100         72090              72090            11        0           1101
      1.2         1001100110011001         78643              78643            10        0           1010
      1.3         1010011001100110         85197              85197            01        0           0110
      1.4         1011001100110011         91750              91750            00        0           0011
      1.5         1100000000000000         98304              98304            00        0           0000
      1.6         1100110011001100        104858             104858            11        0           1101
      1.7         1101100110011001        111411             111411            10        0           1010
      1.8         1110011001100110        117965             117965            01        0           0110
      1.9         1111001100110011        124518             124518            00        0           0011

Obviously, if I take values whose fractional part is a multiple of a power of two, no rounding is needed:

In this case the values have an increment of 1/32.

 FLOAT 32 VALUE    16 MSB MANTISSA   THEORETICAL FIXED   PRACTICAL FIXED    RND BITS    DIFF    4 LSB MANTISSA
----------------  -----------------  -----------------  -----------------  ----------  ------  ----------------
       10         1010000000000000        655360             655360            00        0           0000
    10.0312       1010000010000000        657408             657408            00        0           0000
    10.0625       1010000100000000        659456             659456            00        0           0000
    10.0938       1010000110000000        661504             661504            00        0           0000
     10.125       1010001000000000        663552             663552            00        0           0000
    10.1562       1010001010000000        665600             665600            00        0           0000
    10.1875       1010001100000000        667648             667648            00        0           0000
    10.2188       1010001110000000        669696             669696            00        0           0000
     10.25        1010010000000000        671744             671744            00        0           0000
    10.2812       1010010010000000        673792             673792            00        0           0000
    10.3125       1010010100000000        675840             675840            00        0           0000
    10.3438       1010010110000000        677888             677888            00        0           0000
     10.375       1010011000000000        679936             679936            00        0           0000
    10.4062       1010011010000000        681984             681984            00        0           0000
    10.4375       1010011100000000        684032             684032            00        0           0000
    10.4688       1010011110000000        686080             686080            00        0           0000
      10.5        1010100000000000        688128             688128            00        0           0000
    10.5312       1010100010000000        690176             690176            00        0           0000
    10.5625       1010100100000000        692224             692224            00        0           0000
    10.5938       1010100110000000        694272             694272            00        0           0000
     10.625       1010101000000000        696320             696320            00        0           0000
    10.6562       1010101010000000        698368             698368            00        0           0000
    10.6875       1010101100000000        700416             700416            00        0           0000
    10.7188       1010101110000000        702464             702464            00        0           0000
     10.75        1010110000000000        704512             704512            00        0           0000
    10.7812       1010110010000000        706560             706560            00        0           0000
    10.8125       1010110100000000        708608             708608            00        0           0000
    10.8438       1010110110000000        710656             710656            00        0           0000
     10.875       1010111000000000        712704             712704            00        0           0000
    10.9062       1010111010000000        714752             714752            00        0           0000
    10.9375       1010111100000000        716800             716800            00        0           0000
    10.9688       1010111110000000        718848             718848            00        0           0000

But if 2 <= x < 4 and the increment is not a multiple of a power of two:

Test for values from 2 <= x < 4. Increment is 0.1.
Here, I am not applying the rounding, in order to show how the rounding error
increases with the exponent, e.g. up to 2**shift - 1, where shift is exponent - 126.

 FLOAT 32 VALUE    16 MSB MANTISSA   THEORETICAL FIXED   PRACTICAL FIXED    RND BITS    DIFF    4 LSB MANTISSA
----------------  -----------------  -----------------  -----------------  ----------  ------  ----------------
       2          1000000000000000        131072             131072            00        0           0000
      2.1         1000011001100110        137626             137624            01        -2          0110
      2.2         1000110011001100        144179             144176            11        -3          1101
      2.3         1001001100110011        150733             150732            00        -1          0011
      2.4         1001100110011001        157286             157284            10        -2          1010
      2.5         1010000000000000        163840             163840            00        0           0000
      2.6         1010011001100110        170394             170392            01        -2          0110
      2.7         1010110011001100        176947             176944            11        -3          1101
      2.8         1011001100110011        183501             183500            00        -1          0011
      2.9         1011100110011001        190054             190052            10        -2          1010
       3          1100000000000000        196608             196608            00        0           0000
      3.1         1100011001100110        203162             203160            01        -2          0110
      3.2         1100110011001100        209715             209712            11        -3          1101
      3.3         1101001100110011        216269             216268            00        -1          0011
      3.4         1101100110011001        222822             222820            10        -2          1010
      3.5         1110000000000000        229376             229376            00        0           0000
      3.6         1110011001100110        235930             235928            01        -2          0110
      3.7         1110110011001100        242483             242480            11        -3          1101
      3.8         1111001100110011        249037             249036            00        -1          0011
      3.9         1111100110011001        255590             255588            10        -2          1010

It is clear that the rounding is not correct, and I have also observed that the maximum rounding error in fixed point is always 2**shift - 1.
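The differences in the table can be reproduced with a short script. This is a sketch, not the original test harness: the helper names `truncating_fixed` and `reference_fixed` are made up for illustration, and the model only covers positive inputs of at least 0.5 so that the shift count stays non-negative.

```python
import struct

def truncating_fixed(x):
    """Model of ({1'b1, m[22:7]}) << (exp - 126) for positive x >= 0.5."""
    (bits,) = struct.unpack('!I', struct.pack('!f', x))
    exp = (bits >> 23) & 0xFF
    # Hidden bit plus the 15 most significant fraction bits
    mant16 = ((bits & 0x7FFFFF) | 0x800000) >> 8
    return mant16 << (exp - 126)

def reference_fixed(x):
    """Scale the binary32 value by 2**16 and round to nearest."""
    (x32,) = struct.unpack('!f', struct.pack('!f', x))
    return round(x32 * 65536)

for x in (2.1, 2.2, 2.3, 2.4):
    print(x, truncating_fixed(x) - reference_fixed(x))
```

The eight low mantissa bits are dropped before the left shift, so `shift` of them would have landed in integer positions of the result. That is why the difference grows with the exponent.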

Any ideas or suggestions? I have thought that the problem is that I am not taking the guard, round and sticky (GRS) bits into account. On the other hand, if that really were the problem, what happens when the necessary rounding is greater than one, e.g. 2, 3, 4...?

njuffa

The ISO-C99 code below demonstrates one possible way of doing the conversion. The significand (mantissa) bits of the binary32 argument form the bits of the s15.16 result. The exponent bits tell us whether we need to shift these bits right or left to move the least significant integer bit to bit 16. If a left shift is required, rounding is not needed. If a right shift is required, we need to capture any less significant bits discarded. The most significant discarded bit is the round bit, all others collectively represent the sticky bit. Using the literal definition of the rounding mode (round to nearest or even), we need to round up if (1) both the round bit and the sticky bit are set, or (2) the round bit is set and the sticky bit is clear (i.e., we have a tie case) and the least significant bit of the intermediate result is odd.
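For readers following along with the Python experiments from the question, the same round/sticky logic can be sketched as below. It is an illustrative model rather than a translation of the C code: the function name `fp32_to_s15_16` is made up, inputs are assumed to satisfy |x| < 2**15, and overflow, infinities and NaN are not handled.

```python
import struct

def fp32_to_s15_16(x):
    """Sketch of binary32 -> S15.16 with round to nearest or even."""
    (bits,) = struct.unpack('!I', struct.pack('!f', x))
    sign = bits >> 31
    expo = (bits >> 23) & 0xFF
    # Zero and subnormals contribute nothing at this resolution
    mant = ((bits & 0x7FFFFF) | 0x800000) if expo else 0
    # Shift that moves the units bit of the significand to bit 16
    shift = (expo - 127) - (23 - 16)
    if shift >= 0:
        res = mant << shift              # left shift: nothing is discarded
    else:
        n = -shift
        res = mant >> n
        discarded = mant & ((1 << n) - 1)
        rnd = (discarded >> (n - 1)) & 1                      # round bit
        sticky = int((discarded & ((1 << (n - 1)) - 1)) != 0)  # OR of the rest
        if rnd and (sticky or (res & 1)):
            res += 1
    return -res if sign else res

print(fp32_to_s15_16(2.2))          # 144179
print(fp32_to_s15_16(1 + 2**-17))   # tie, rounds to even: 65536
```

Python integers are unbounded, so no shift clamping is needed here. A hardware version would bound the shift count as the C code does.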

Note that real hardware implementations often deviate from such a literal application of the rounding-mode logic. One common scheme is to first increment the result when the round bit is set. Then, if such an increment occurred, clear the least significant bit of the result if the sticky bit is not set. It is easy to see that this achieves the same effect by enumerating all possible combinations of round bit, sticky bit, and result LSB.
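The enumeration mentioned above is small enough to run directly. The sketch below uses two made-up helper names, `rne_literal` and `rne_alternate`, and checks that both schemes give the same result for every combination of round bit, sticky bit, and a few intermediate results of either parity.

```python
def rne_literal(res, rnd, sticky):
    """Round up when the round bit is set and (sticky is set or res is odd)."""
    return res + 1 if rnd and (sticky or (res & 1)) else res

def rne_alternate(res, rnd, sticky):
    """Increment on the round bit, then clear the LSB in the tie case."""
    if rnd:
        res += 1
        if not sticky:
            res &= ~1
    return res

for res in range(8):
    for rnd in (0, 1):
        for sticky in (0, 1):
            assert rne_literal(res, rnd, sticky) == rne_alternate(res, rnd, sticky)
print("both schemes agree")
```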

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define USE_LITERAL_RND_DEF  (1)

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

#define FP32_MANT_FRAC_BITS  (23)
#define FP32_EXPO_BITS       (8)
#define FP32_EXPO_MASK       ((1u << FP32_EXPO_BITS) - 1)
#define FP32_MANT_MASK       ((1u << FP32_MANT_FRAC_BITS) - 1)
#define FP32_MANT_INT_BIT    (1u << FP32_MANT_FRAC_BITS)
#define FP32_SIGN_BIT        (1u << (FP32_MANT_FRAC_BITS + FP32_EXPO_BITS))
#define FP32_EXPO_BIAS       (127)
#define FX15P16_FRAC_BITS    (16)
#define FRAC_BITS_DIFF       (FP32_MANT_FRAC_BITS - FX15P16_FRAC_BITS)

int32_t fp32_to_fixed (float a)
{
    /* split binary32 operand into constituent parts */
    uint32_t ia = float_as_uint32 (a);
    uint32_t expo = (ia >> FP32_MANT_FRAC_BITS) & FP32_EXPO_MASK;
    uint32_t mant = expo ? ((ia & FP32_MANT_MASK) | FP32_MANT_INT_BIT) : 0;
    int32_t sign = ia & FP32_SIGN_BIT;
    /* compute and clamp shift count */
    int32_t shift = (expo - FP32_EXPO_BIAS) - FRAC_BITS_DIFF;
    shift = (shift < (-31)) ? (-31) : shift;
    shift = (shift > ( 31)) ? ( 31) : shift;
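    /* shift left or right so least significant integer bit becomes bit 16;
       each shift is guarded so its count is never negative or >= 32,
       which would be undefined behavior in C */
    uint32_t shifted_right = (shift < 0) ? (mant >> (-shift)) : 0;
    uint32_t shifted_left = (shift >= 0) ? (mant << shift) : 0;
    /* capture discarded bits if right shift */
    uint32_t discard = (shift < 0) ? (mant << (32 + shift)) : 0;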
    /* round to nearest or even if right shift */
    uint32_t round = (discard & 0x80000000) ? 1 : 0;
    uint32_t sticky = (discard & 0x7fffffff) ? 1 : 0;
#if USE_LITERAL_RND_DEF
    uint32_t odd = shifted_right & 1;
    shifted_right = (round & (sticky | odd)) ? (shifted_right + 1) : shifted_right;
#else // USE_LITERAL_RND_DEF
    shifted_right = (round) ? (shifted_right + 1) : shifted_right;
    shifted_right = (round & ~sticky) ? (shifted_right & ~1) : shifted_right;
#endif // USE_LITERAL_RND_DEF
    /* make final selection between left shifted and right shifted */
    int32_t res = (shift < 0) ? shifted_right : shifted_left;
    /* negate if negative */
    return (sign < 0) ? (-res) : res;
}

int main (void)
{
    int32_t res, ref;
    float x;

    printf ("IEEE-754 binary32 to S15.16 fixed-point conversion in RNE mode\n");
    printf ("use %s implementation of round to nearest or even\n",
            USE_LITERAL_RND_DEF ? "literal" : "alternate");

    /* test positive half-plane */
    x = 0.0f;
    while (x < 0x1.0p15f) {
        ref = (int32_t) rint ((double)x * 65536);
        res = fp32_to_fixed(x);
        if (res != ref) {
            printf ("error @ x = % 14.6a: res=%08x ref=%08x\n", x, res, ref);
            printf ("Test FAILED\n");
            return EXIT_FAILURE;
        }
        x = nextafterf (x, INFINITY);
    }
    
    /* test negative half-plane */
    x = -1.0f * 0.0f;
    while (x >= -0x1.0p15f) {
        ref = (int32_t) rint ((double)x * 65536);
        res = fp32_to_fixed(x);
        if (res != ref) {
            printf ("error @ x = % 14.6a: res=%08x ref=%08x\n", x, res, ref);
            printf ("Test FAILED\n");
            return EXIT_FAILURE;
        }
        x = nextafterf (x, -INFINITY);
    }
    printf ("Test PASSED\n");
    return EXIT_SUCCESS;
}
