Compression by quasi-logarithmic scale

valdo

I need to compress a large set of (unsigned) integer values, where the goal is to preserve their relative accuracy. Simply speaking, the difference between 1, 2, 3 matters a lot, but the difference between 1001, 1002, 1003 is minor.

So I need a sort of lossy transformation. The natural choice is a logarithmic scale, but the drawback is that converting to and from it requires floating-point operations and log/exp calculations. OTOH I don't need a truly logarithmic scale; I just need something that resembles it.

I came up with the idea of encoding numbers in a floating-point manner. That is, I allocate N bits for each compressed number, some of which represent the mantissa and the remaining bits the order (exponent). The split between mantissa and order bits depends on the needed range and accuracy.

My question is: is this a good idea? Or is there a better encoding scheme with respect to the trade-off between computational complexity and quality (similarity to a logarithmic scale)?

What I implemented, in detail:

As I said, there are bits for the mantissa and bits for the order. The order bits are leading, so that the greater the encoded number, the greater the raw one.

The actual number is decoded by prepending an extra leading bit to the mantissa (the implicit bit) and left-shifting the result by the encoded order. The smallest decoded number is 1 << M, where M is the size of the mantissa. If the needed range should start from 0 (as in my case), this number can be subtracted.

Encoding a number is also simple. Add 1 << M, then find its order, i.e. how far it must be right-shifted until it fits into the mantissa with the implicit leading bit; after that the encoding is trivial. Finding the order is done via a binary search, which boils down to just a few ifs (for example, with 4 order bits the maximum order is 15, and it is found within 4 ifs).
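
Here is a rough sketch of what I mean (not my exact code; the 4+4 bit split is just an example, and the plain loop stands in for the if-based order search), in C++14:

    #include <cstdint>

    // Sketch of the quasi-logarithmic encoding.
    // Illustrative split: 4 order bits + 4 mantissa bits in one byte.
    constexpr unsigned MANT = 4;                       // mantissa bits (M)
    constexpr std::uint32_t IMPLICIT = 1u << MANT;     // the implicit leading bit
    constexpr std::uint32_t MANT_MASK = IMPLICIT - 1;  // low M bits

    // Decode: re-attach the implicit bit, shift left by the order,
    // then subtract IMPLICIT so the representable range starts at 0.
    constexpr std::uint32_t decode(std::uint32_t code)
    {
        std::uint32_t order    = code >> MANT;
        std::uint32_t mantissa = code & MANT_MASK;
        return ((IMPLICIT | mantissa) << order) - IMPLICIT;
    }

    // Encode: bias by IMPLICIT, right-shift until the value fits into M+1
    // bits, keep those bits as the mantissa, store the order in the high bits.
    constexpr std::uint32_t encode(std::uint32_t value)
    {
        std::uint32_t v = value + IMPLICIT;
        std::uint32_t order = 0;
        while (v >= (IMPLICIT << 1)) {   // the real code does this with a few ifs
            v >>= 1;
            ++order;
        }
        return (order << MANT) | (v & MANT_MASK);
    }

    static_assert(decode(encode(1000)) <= 1000, "truncates down to the group base");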

I call this a "quasi-logarithmic" scale. The absolute precision decreases as the number grows. But unlike a true logarithmic scale, where the granularity increases continuously, here it jumps by a factor of 2 after each fixed-size range.

The advantages of this encoding:

  • Fast encoding and decoding
  • No floating-point numbers, so no implicit precision loss when manipulating them, no boundary cases, etc.
  • Not dependent on standard libraries, complex math functions, etc.
  • Encoding and decoding can be implemented via C++ templates, so that the conversion can even be done at compile time. This is convenient for defining compile-time constants in a human-readable way (see the short example after this list).
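
For example, assuming the encode()/decode() sketch above is in scope, a compile-time constant could look like this (the name is made up):

    // Hypothetical compile-time use of the sketch above (C++14).
    constexpr std::uint32_t kBudgetCode = encode(25000);
    static_assert(decode(kBudgetCode) <= 25000, "encode() rounds down to the group base");
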
Paul

In your compression algorithm, every group of numbers that results in the same compressed output will be decompressed to the lowest number in that group. If you changed that to the number in the middle of the group, the average error would be reduced.

E.g. for an 8-bit mantissa and a 5-bit exponent, the numbers in the range [0x1340, 0x1350) will all be translated into 0x1340 by decompress(compress(x)). If the entire range were first compressed and then decompressed, the total error would be 120. If the output were 0x1348 instead, the total error would only be 64, which reduces the error by a solid 46.7%. So simply adding 1 << (exponent - 1) to the output will significantly reduce the error of the compression scheme.
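
As a sketch of that correction (using the 8-bit mantissa / 5-bit exponent layout of the example, exponent in the high bits, and no zero offset; adjust to your actual layout):

    #include <cstdint>

    constexpr unsigned MANT = 8;                   // mantissa bits of the example
    constexpr std::uint32_t IMPLICIT = 1u << MANT;

    // Decompress to the lowest raw value of the group (the original behaviour).
    constexpr std::uint32_t decode_low(std::uint32_t code)
    {
        std::uint32_t e = code >> MANT;
        std::uint32_t m = code & (IMPLICIT - 1);
        return (IMPLICIT | m) << e;
    }

    // Decompress to the middle of the group: the group is 1 << e wide,
    // so adding half of that halves the worst-case error.
    constexpr std::uint32_t decode_mid(std::uint32_t code)
    {
        std::uint32_t e = code >> MANT;
        return e == 0 ? decode_low(code) : decode_low(code) + (1u << (e - 1));
    }

With this layout, the values in [0x1340, 0x1350) compress to exponent 4 / mantissa 0x34, for which decode_low returns 0x1340 and decode_mid returns 0x1348.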

Apart from that I don't see much of an issue with this scheme. Just keep in mind that you'll need a specific encoding for 0. There are alternative encodings, but without knowing anything specific about the input, this one is about the best you can get.

EDIT:
While it is possible to move the correction of the result from the decompression step to the compression step, this comes at the expense of enlarging the exponent range by one. This is due to the fact that for the numbers with the MSB set, only half of them will use the corresponding exponent (the other half of that exponent will be populated by numbers with the second-most-significant bit set). The higher half of the numbers with the MSB set will be pushed into the next-higher order.

So, e.g., for 32-bit numbers encoded with a 15-bit mantissa, only numbers up to 0x8FFF FFFF will have order 15 (mantissa = 0x1FFF and exponent = 15). All higher values will have order 16 (mantissa = 0x?FFF and exponent = 16). While the increase of the exponent by 1 doesn't seem like much in itself, in this example it already costs an additional bit for the exponent.

In addition, the decompression step for the above example will produce an integer overflow, which may be problematic under certain circumstances (e.g. C# will throw an exception if the decompression is done in checked mode). The same applies to the compression step: unless handled properly, adding 2^(order(n) - 1) to the input n will cause an overflow, placing the number in order 0.
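
A small illustration of that overflow, assuming 32-bit unsigned inputs, the 15-bit mantissa from the example and rounding done at compression time (names and layout are made up):

    #include <cstdint>

    constexpr unsigned MANT = 15;  // 15-bit mantissa + implicit bit = 16 significant bits

    // Order = how far n must be shifted right to fit into MANT+1 bits.
    constexpr unsigned order_of(std::uint32_t n)
    {
        unsigned e = 0;
        while ((n >> e) >= (2u << MANT))
            ++e;
        return e;
    }

    std::uint32_t compress_rounding(std::uint32_t n)
    {
        unsigned e = order_of(n);
        // For n >= 0xFFFF8000 (order 16) this addition wraps around in
        // unsigned arithmetic (or throws in C# checked mode), so the
        // re-computed order below drops to 0 -- the problem described above.
        std::uint32_t rounded = n + (e ? (1u << (e - 1)) : 0u);
        unsigned e2 = order_of(rounded);
        return (static_cast<std::uint32_t>(e2) << MANT)
             | ((rounded >> e2) & ((1u << MANT) - 1));
    }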

I would recommend applying the correction in the decompression step (as shown above) to remove potential integer overflows as a source of bugs and to keep the number of exponents that need to be encoded minimal.

EDIT2:
Another issue with this approach is that half of the numbers (excluding the lowest order) wind up in a larger "group" when the correction is done during compression, thus reducing precision.
