## Fast Multiplication of Normalized 16 bit Numbers with SSE2

If you are compositing pixels with 16 bits per component, you often need this computation:

uint16_t a, b, r; r = (a * b + 0x7fff) / 65535;

There is a well-known way to do this quickly without a division:

uint32_t t; t = a * b + 0x8000; r = (t + (t >> 16)) >> 16;

Since we are compositing pixels we want to do this with SSE2 instructions, but because the code above uses 32 bit arithmetic, we can only do four operations at a time, even though SSE registers have room for eight 16 bit values. Here is a direct translation into SSE2:

a = punpcklwd (a, 0); b = punpcklwd (b, 0); a = pmulld (a, b); a = paddd (a, 0x8000); b = psrld (a, 16); a = paddd (a, b); a = psrld (a, 16); a = packusdw (a, 0);

But there is another way that better matches SSE2:

uint16_t lo, hi, t, r; hi = (a * b) >> 16; lo = (a * b) & 0xffff; t = lo >> 15; hi += t; t = hi ^ 0x7fff; if ((int16_t)lo > (int16_t)t) lo = 0xffff; else lo = 0x0000; r = hi - lo;

This version is better because it avoids the unpacking to 32 bits. Here is the translation into SSE2:

t = pmulhuw (a, b); a = pmullw (a, b); b = psrlw (a, 15); t = paddw (t, b); b = pxor (t, 0x7fff); a = pcmpgtw (a, b); a = psubw (t, a);

This is not only shorter, it also makes use of the full width of the SSE registers, computing eight results at a time.

Unfortunately SSE2 doesn’t have 8-bit variants of `pmulhuw`

, `pmullw`

, and
`psrlw`

, so we can’t use this trick for the more common case where
pixels have 8 bits per component.

Exercise: Why does the second version work?