If you are compositing pixels with 16 bits per component, you often need this computation:
uint16_t a, b, r; r = (a * b + 0x7fff) / 65535;
There is a well-known way to do this quickly without a division:
uint32_t t; t = a * b + 0x8000; r = (t + (t >> 16)) >> 16;
Since we are compositing pixels we want to do this with SSE2 instructions, but because the code above uses 32 bit arithmetic, we can only do four operations at a time, even though SSE registers have room for eight 16 bit values. Here is a direct translation into SSE2:
a = punpcklwd (a, 0); b = punpcklwd (b, 0); a = pmulld (a, b); a = paddd (a, 0x8000); b = psrld (a, 16); a = paddd (a, b); a = psrld (a, 16); a = packusdw (a, 0);
But there is another way that better matches SSE2:
uint16_t lo, hi, t, r; hi = (a * b) >> 16; lo = (a * b) & 0xffff; t = lo >> 15; hi += t; t = hi ^ 0x7fff; if ((int16_t)lo > (int16_t)t) lo = 0xffff; else lo = 0x0000; r = hi - lo;
This version is better because it avoids the unpacking to 32 bits. Here is the translation into SSE2:
t = pmulhuw (a, b); a = pmullw (a, b); b = psrlw (a, 15); t = paddw (t, b); b = pxor (t, 0x7fff); a = pcmpgtw (a, b); a = psubw (t, a);
This is not only shorter, it also makes use of the full width of the SSE registers, computing eight results at a time.
Unfortunately SSE2 doesn’t have 8-bit variants of
psrlw, so we can’t use this trick for the more common case where
pixels have 8 bits per component.
Exercise: Why does the second version work?