Converting from decimal to IEEE 8-bit floating point representation

Helen Grey

My answers to the problem below differ from the answer key. The problem: We assume that IEEE decided to add a new 8-bit representation with its main characteristics consistent with the 32/64-bit representations. Consider the following four 8-bit numbers:

A: 11100101 B: 00111001 C: 00001100 D: 00011101

The decimal values represented by the above numbers are as follows, in no particular order: 3.125, -21, 29/32, 3/8.

Q1: Which 8-bit floating point number represents 29/32 (choose from A, B, C, D)?
A1: D

Given the above information, figure out the following:

Q2: Number of bits needed for exponent:
A2: 3

Q3: Number of bits needed for fraction:
A3: 4

I agree with the answer to Q1, but I got different answers for Q2 and Q3 (A2: 2 and A3: 5).

29/32 = 29 * 2^-5, which in binary is 11101 * 2^-5. If we shift the binary point to get the normalized binary form, we have 1.1101 * 2^-1. So the answer for Q1 should be the bit pattern that ends in 1101, hence D.

Answering Q2: if the answer is 3, the pattern splits as 0 001 1101, so frac = 1101, exp = 001 (normalized), and bias = 3. Then E = exp - bias = 1 - 3 = -2. Converting back to the normalized binary form (1.frac * 2^E) gives 1.1101 * 2^-2 = 11101 * 2^-6 = 29/64 (not 29/32 as stated initially).

But when I use the representation 0 00 11101, with 2 bits for exp (bias = 2^1 - 1 = 1) and 5 bits for frac, the results match. Here exp = 00, so the denormalized form is used (0.frac * 2^E, where E = (exp + 1) - bias): E = 0 + 1 - 1 = 0, so 0.11101 * 2^0 = 11101 * 2^-5 = 29/32.

What am I doing wrong? Thank you!
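The two readings of D compared above can be checked mechanically. The following is a minimal sketch, assuming the usual IEEE-754 conventions (bias = 2^(k-1) - 1 for a k-bit exponent field, subnormals when the exponent field is zero); the function name `decode` and its parameters are illustrative, not part of the original problem.

```python
from fractions import Fraction

def decode(bits: str, exp_bits: int) -> Fraction:
    """Decode an 8-bit pattern laid out as sign | exponent | fraction.

    Assumes the IEEE-754 pattern for the bias. Infinity/NaN encodings
    (exponent field all ones) are not handled in this sketch.
    """
    frac_bits = len(bits) - 1 - exp_bits
    sign = -1 if bits[0] == "1" else 1
    exp = int(bits[1:1 + exp_bits], 2)
    frac = Fraction(int(bits[1 + exp_bits:], 2), 2 ** frac_bits)
    bias = 2 ** (exp_bits - 1) - 1
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * frac * Fraction(2) ** (1 - bias)
    # Normal: implicit leading 1, exponent is exp - bias.
    return sign * (1 + frac) * Fraction(2) ** (exp - bias)

D = "00011101"
print(decode(D, exp_bits=3))  # 29/64
print(decode(D, exp_bits=2))  # 29/32
```

Exact rationals (`Fraction`) are used instead of floats so the comparison with 29/32 is not affected by rounding.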

Eric Postpischil

−21 must be represented by A, 11100101, since that is the only one with the sign bit set. With three bits for the exponent encoding and four for the main significand encoding, we have an exponent bias of 3, so 110₂ = 6 represents an exponent of 3, and 0101 in the significand field represents 1.0101₂ = 21/16, so the value represented is −1 • 2^3 • 21/16 = −10½, which is half what we expected, −21.

For B, we have 00111001 → 0 011 1001 → +1 • 2^(3−3) • 1.1001₂ → +1 • 1 • 25/16 = 1.5625, which is also half what we expected, 3.125.

For C, we have 00001100 → 0 000 1100 (which is subnormal) → +1 • 2^(1−3) • 0.1100₂ → +1 • 2^(−2) • 12/16 = 3/16, which is again half what we expected, 3/8.
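The three decodings above can be reproduced with a short script. This is only a sketch under the stated assumptions (1 sign bit, 3 exponent bits, 4 fraction bits, bias 3); `decode_3_4` and `cases` are hypothetical names introduced for illustration.

```python
from fractions import Fraction

BIAS = 3  # 2**(3 - 1) - 1, the IEEE-754 pattern for a 3-bit exponent field

def decode_3_4(bits: str) -> Fraction:
    """Decode sign(1) | exponent(3) | fraction(4); infinity/NaN not handled."""
    sign = -1 if bits[0] == "1" else 1
    exp = int(bits[1:4], 2)
    frac = Fraction(int(bits[4:], 2), 16)
    if exp == 0:
        # Subnormal: leading bit is 0 and the exponent is 1 - BIAS.
        return sign * frac * Fraction(2) ** (1 - BIAS)
    return sign * (1 + frac) * Fraction(2) ** (exp - BIAS)

# Bit patterns paired with the values the problem statement implies for them.
cases = {
    "A": ("11100101", Fraction(-21)),
    "B": ("00111001", Fraction(25, 8)),  # 3.125
    "C": ("00001100", Fraction(3, 8)),
}

for name, (bits, expected) in cases.items():
    decoded = decode_3_4(bits)
    print(name, decoded, "ratio to expected:", decoded / expected)
```

Each ratio comes out as 1/2, matching the hand calculations.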

So it is apparent an error was made in constructing the problem; the values use an exponent bias of 2 instead of the 3 that would follow from the IEEE-754 pattern. (Or an equivalent error was made, such as placing the radix point of the significand one position too far to the right.)
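One way to see which bias the problem's author effectively used is to try every candidate and keep those that reproduce all four listed values. The sketch below assumes the 3-bit exponent, 4-bit fraction layout and the convention E = exp − bias (so a bias is a non-negative number subtracted from the exponent field); `decode` and `patterns` are illustrative names, and the pairing of patterns to values follows the discussion above.

```python
from fractions import Fraction

def decode(bits: str, bias: int) -> Fraction:
    """Decode sign(1) | exponent(3) | fraction(4) with an explicit bias.

    Infinity/NaN encodings are not handled; none of the four patterns uses them.
    """
    sign = -1 if bits[0] == "1" else 1
    exp = int(bits[1:4], 2)
    frac = Fraction(int(bits[4:], 2), 16)
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * frac * Fraction(2) ** (1 - bias)
    return sign * (1 + frac) * Fraction(2) ** (exp - bias)

# The four patterns with the values the problem assigns to them.
patterns = {
    "A": ("11100101", Fraction(-21)),
    "B": ("00111001", Fraction(25, 8)),   # 3.125
    "C": ("00001100", Fraction(3, 8)),
    "D": ("00011101", Fraction(29, 32)),
}

matching = [
    bias
    for bias in range(8)
    if all(decode(bits, bias) == value for bits, value in patterns.values())
]
print(matching)  # [2]
```

Under these assumptions only a bias of 2 reproduces all four values, one less than the 3 given by 2^(3−1) − 1.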
