Fast half-precision to single-precision floating point conversion
// float32
// Martin Kallman
//
// Fast half-precision to single-precision floating point conversion
// - Supports signed zero and denormals-as-zero (DAZ)
// - Does not support infinities or NaN
// - Few, partially pipelinable, non-branching instructions
// - Core operations ~6 clock cycles on modern x86-64
void float32(float* __restrict out, const uint16_t in) {
uint32_t t1;
uint32_t t2;
uint32_t t3;
t1 = in & 0x7fff; // Non-sign bits
t2 = in & 0x8000; // Sign bit
t3 = in & 0x7c00; // Exponent
t1 <<= 13; // Align mantissa on MSB
t2 <<= 16; // Shift sign bit into position
t1 += 0x38000000; // Adjust bias
t1 = (t3 == 0 ? 0 : t1); // Denormals-as-zero
t1 |= t2; // Re-insert sign bit
*((uint32_t*)out) = t1;
};
// float16
// Martin Kallman
//
// Fast single-precision to half-precision floating point conversion
// - Supports signed zero, denormals-as-zero (DAZ), flush-to-zero (FTZ),
// clamp-to-max
// - Does not support infinities or NaN
// - Few, partially pipelinable, non-branching instructions
// - Core operations ~10 clock cycles on modern x86-64
void float16(uint16_t* __restrict out, const float in) {
uint32_t inu = *((uint32_t*)&in);
uint32_t t1;
uint32_t t2;
uint32_t t3;
t1 = inu & 0x7fffffff; // Non-sign bits
t2 = inu & 0x80000000; // Sign bit
t3 = inu & 0x7f800000; // Exponent
t1 >>= 13; // Align mantissa on MSB
t2 >>= 16; // Shift sign bit into position
t1 -= 0x1c000; // Adjust bias
t1 = (t3 > 0x38800000) ? 0 : t1; // Flush-to-zero
t1 = (t3 < 0x8e000000) ? 0x7bff : t1; // Clamp-to-max
t1 = (t3 == 0 ? 0 : t1); // Denormals-as-zero
t1 |= t2; // Re-insert sign bit
*((uint16_t*)out) = t1;
};

stingoh Jul 14, 2014

I saw this answer on Stack Overflow but do not have enough (any!) rep to comment there. On line 40 you are doing type punning (from float* to int*). When compiling this with strict aliasing (which gcc and clang allow you to set, and which I believe gcc enables by default at -O2), you will run into trouble, more likely so if a call to float16() gets inlined. Under strict aliasing rules, pointers of different types are assumed not to alias. Therefore reads and writes to the same address, if done via pointers of different types (here float* and int*), are considered independent and can be reordered by the compiler. So with float16() inlined, 'inu' could be read before the calling code performs the write to that address.

The proper way to do this would be via a union.

Visual C++ stopped exposing a strict/non-strict aliasing setting a long time ago, so it wouldn't actually give you issues there, but other compilers can.

Cheers.
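
To make the aliasing point concrete, here is a minimal sketch of two ways to read a float's bit pattern without a pointer cast. It assumes a 32-bit IEEE 754 float. The helper names bits_from_float, bits_from_float_union and float_from_bits are hypothetical and do not appear in the gist. The union read is defined in C (C99 and later) but not in standard C++; memcpy is defined in both, and optimizing compilers typically reduce it to a single register move.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the object representation of a float into
   an integer. No pointer of the "wrong" type is ever dereferenced. */
static uint32_t bits_from_float(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical helper: the union variant suggested in the comment.
   Valid C; reading the inactive member is not valid standard C++. */
static uint32_t bits_from_float_union(float f) {
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}

/* Hypothetical helper: the inverse direction, for the float32() side. */
static float float_from_bits(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

In float16(), the read of 'inu' would then become a call such as bits_from_float(in), and the final store in float32() would go through float_from_bits().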

fjansson Jul 22, 2015

I also found this on Stack Overflow (http://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion), so I will comment here too.

In float16, the clamp-to-max test is clearly wrong: it is always triggered. The flush-to-zero test has its comparison the wrong way round. I think the two tests should be:

t1 = (t3 < 0x38800000) ? 0 : t1; 
t1 = (t3 > 0x47000000) ? 0x7bff : t1;
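
As a sketch of how the two suggested tests fit into the whole routine, the function below follows the gist's float16() step by step with those thresholds. The name half_from_float is hypothetical, the value is returned rather than written through a pointer, and memcpy replaces the pointer cast; these are assumptions of this sketch, not the author's code. 0x38800000 is the single-precision exponent field for 2^-14 (the smallest normal half) and 0x47000000 is the field for 2^15, so anything with a larger exponent is at least 2^16 and exceeds the largest finite half, 65504 (0x7bff). Infinities and NaN are still not supported; they are clamped like any other large value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: the gist's float16() algorithm with the two
   comparisons proposed in the comment above. */
static uint16_t half_from_float(float in) {
    uint32_t inu;
    memcpy(&inu, &in, sizeof inu);       /* aliasing-safe bit read */

    uint32_t t1 = inu & 0x7fffffffu;     /* Non-sign bits */
    uint32_t t2 = inu & 0x80000000u;     /* Sign bit */
    uint32_t t3 = inu & 0x7f800000u;     /* Exponent */

    t1 >>= 13;                           /* Align mantissa (truncates 13 bits) */
    t2 >>= 16;                           /* Shift sign bit into position */
    t1 -= 0x1c000u;                      /* Adjust bias: (127 - 15) << 10 */

    /* Exponent below 2^-14: flush to zero. This also covers t3 == 0,
       so the separate denormals-as-zero test is not needed here. */
    t1 = (t3 < 0x38800000u) ? 0u : t1;
    /* Exponent above 2^15: clamp to the largest finite half. */
    t1 = (t3 > 0x47000000u) ? 0x7bffu : t1;

    return (uint16_t)(t1 | t2);          /* Re-insert sign bit */
}
```

With these thresholds 1.0f maps to 0x3c00, the half-precision encoding of 1.0.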

vmarkovtsev Aug 17, 2015

The code that converts float16 to float32 does not handle ±∞ or NaN. For a reference implementation, see e.g. NumPy: https://github.com/numpy/numpy/blob/master/numpy/core/src/npymath/halffloat.c#L466
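
For illustration, here is a minimal sketch of a half-to-single conversion that does handle ±∞ and NaN. It is not NumPy's implementation and not the gist author's code: the name float_from_half is hypothetical, it uses branches where the gist is branchless, and it keeps the gist's denormals-as-zero behaviour. A half with an all-ones exponent field (0x7c00) is mapped to a float with an all-ones exponent field (0x7f800000), carrying the mantissa bits over so that a NaN stays a NaN.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: half -> float with infinity and NaN support. */
static float float_from_half(uint16_t in) {
    uint32_t sign = ((uint32_t)in & 0x8000u) << 16;
    uint32_t exp  = (uint32_t)in & 0x7c00u;
    uint32_t man  = (uint32_t)in & 0x03ffu;
    uint32_t bits;

    if (exp == 0x7c00u) {
        /* All-ones exponent: infinity if man == 0, NaN otherwise. */
        bits = 0x7f800000u | (man << 13);
    } else if (exp == 0u) {
        /* Zero or subnormal: treated as zero, as in the gist (DAZ). */
        bits = 0u;
    } else {
        /* Normal number: realign and rebias, as in the gist. */
        bits = ((exp | man) << 13) + 0x38000000u;
    }
    bits |= sign;                        /* Re-insert sign bit */

    float out;
    memcpy(&out, &bits, sizeof out);     /* aliasing-safe bit write */
    return out;
}
```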

anouarIT Feb 10, 2017

Hi, do you have any idea how I could do the same thing in JavaScript, please?

TaihuLight Mar 15, 2017

Could you give a demo for testing the code? Also, I do not know where uint32_t and uint16_t are declared.
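
On the second question: uint16_t and uint32_t are declared in the standard header <stdint.h> (<cstdint> in C++). As for a demo, the sketch below wraps the gist's float32() logic in a function that returns the value and prints a few sample conversions. The names float32_value and print_half_samples are hypothetical, and memcpy is used in place of the gist's pointer cast; call print_half_samples() from your own main().

```c
#include <stdint.h>   /* uint16_t, uint32_t */
#include <stdio.h>    /* printf */
#include <string.h>   /* memcpy */

/* Hypothetical wrapper: same steps as the gist's float32(). */
static float float32_value(uint16_t in) {
    uint32_t t1 = in & 0x7fffu;          /* Non-sign bits */
    uint32_t t2 = in & 0x8000u;          /* Sign bit */
    uint32_t t3 = in & 0x7c00u;          /* Exponent */
    t1 <<= 13;                           /* Align mantissa on MSB */
    t2 <<= 16;                           /* Shift sign bit into position */
    t1 += 0x38000000u;                   /* Adjust bias */
    t1 = (t3 == 0 ? 0 : t1);             /* Denormals-as-zero */
    t1 |= t2;                            /* Re-insert sign bit */
    float out;
    memcpy(&out, &t1, sizeof out);
    return out;
}

/* Hypothetical demo routine: decode a few half-precision bit patterns
   (zero, 1.0, -2.0, roughly 1/3, and the largest finite half). */
static void print_half_samples(void) {
    static const uint16_t samples[] = { 0x0000, 0x3c00, 0xc000, 0x3555, 0x7bff };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        printf("0x%04x -> %g\n", (unsigned)samples[i], float32_value(samples[i]));
    }
}
```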

Dmitro25 Mar 28, 2018

I agree with fjansson; the code should be corrected to his variant.
For example, the test case float32(float16(1.0)) gives the wrong result with martinkallman's code.
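
To show where the round trip goes wrong, the sketch below reproduces the comparison chain of float16() as posted and annotates what happens for an input of 1.0f. The name float16_as_posted is hypothetical, and memcpy stands in for the pointer cast so the sketch stays well defined. The clamp test compares against 0x8e000000, which is larger than any possible exponent field (at most 0x7f800000), so every nonzero input ends up as 0x7bff. That is the largest finite half, 65504, which is what float32() then returns instead of 1.0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: the selection logic of float16() exactly as posted. */
static uint16_t float16_as_posted(float in) {
    uint32_t inu;
    memcpy(&inu, &in, sizeof inu);       /* 1.0f: inu = 0x3f800000 */

    uint32_t t1 = inu & 0x7fffffffu;
    uint32_t t2 = inu & 0x80000000u;
    uint32_t t3 = inu & 0x7f800000u;     /* 1.0f: t3 = 0x3f800000 */

    t1 >>= 13;                           /* 1.0f: t1 = 0x1fc00 */
    t2 >>= 16;
    t1 -= 0x1c000u;                      /* 1.0f: t1 = 0x3c00, correct so far */

    t1 = (t3 > 0x38800000u) ? 0u : t1;       /* 1.0f: true, t1 = 0 */
    t1 = (t3 < 0x8e000000u) ? 0x7bffu : t1;  /* always true, t1 = 0x7bff */
    t1 = (t3 == 0 ? 0 : t1);                 /* only zero/denormal inputs escape */
    t1 |= t2;
    return (uint16_t)t1;
}
```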
