In this post I'd like to promote the SSE 4.1 instruction set extension as it relates to the quantization loop I wrote about a few months ago. As you may recall, the modified code from "quantize_xrpow_lines" looked like this:
for (i = 0; i < l; i++) {
    float x0 = xr[i] * istep;
    int rx0 = (int)x0;
    x0 += adj43[rx0];
    ix[i] = (int)x0;
}
The function consumes about 10% of the run-time for constant bit-rate encoding. I had enabled compiler vectorization using "#pragma vector always" but the resulting SSE2 code didn't make a dent in the overall encode time.
The critical bottleneck with this loop is the "gather" or table-lookup sequence which demands individual reads from the adj43 array for each result. Our instruction set currently does not provide a method to execute such parallel loads, so each rx0 element must be extracted into a general purpose register (eax, ebx, etc.) to set up the loads and likewise each float value retrieved as adj43[rx0] must be packed back into an xmm register. The SSE2 form of this algorithm needs 22 total instructions to produce four results in one register.
My first crack at an SSE 4.1 version used the PEXTRD and INSERTPS instructions to tighten up the sequence moving data to and from the GPRs. This shortened the whole loop to 14 instructions, but oddly didn't improve performance by any measure. Breaking the function out into a micro-kernel testing jig actually showed that this version was slightly slower than the SSE2 code on a 3.2 GHz Nehalem system.
So what happened? I understood this better after digging into the micro-op sequences used in decoding each instruction. Micro-ops are the RISC-like elements that combine to execute each x86 instruction inside the CPU execution pipeline. Naturally, I wanted to choose instructions that decode into shorter micro-op flows so the CPU could get by doing less work. Our CPU definition files showed that PEXTRD and INSERTPS decode to 2 and 3 micro-ops respectively. So, while I reduced instruction count at the externally-visible level, my SSE4.1 code actually increased the number of micro-ops from 25 to 26 for the whole loop.
Curiously, while I was inspecting the micro-op flows I noticed that EXTRACTPS and PINSRD each used one less micro-op than their counterparts noted above. The difference between PEXTRD and EXTRACTPS is mainly conceptual, based on the data type: PEXTRD extracts 32-bit integers (DWORDs) and EXTRACTPS extracts 32-bit floats (packed single-precision data). I intuitively chose PEXTRD and INSERTPS because the algorithm was extracting DWORD array indices and inserting floats. So why the extra micro-ops? In the extract case, I am not sure... However, the INSERTPS instruction does offer the added ability to select the input operand from another xmm register, which explains why it uses one more shuffle micro-op than PINSRD.
Another factor to consider is port bindings. Each micro-op can only be dispatched on a certain subset of the ports (0-5 on Nehalem) and each port is limited to dispatching one operation per cycle (try Googling "Nehalem block diagram" for a good visual). My initial SSE4.1 loop was bound by having seven micro-ops that map to Port 0, so it had an ideal throughput of 7 cycles. To figure this out I listed all 26 micro-ops and tried to arrange the port mappings so they would be spread as evenly as possible over all ports. It sounds complex but some people here can do it in their head... for almost any piece of code... on any CPU from the 1995 Pentium Pro out to what we are designing for 2013. (In a similar fashion I often proof-read these blog posts over and over trying to figure out how I can reduce the number of words... but I'm not going to do that today.)
After all that I rewrote my algorithm using the combination of EXTRACTPS and PINSRD and saw a 2% speedup in encode time. Not much, you might think, but consider that I only used two new instructions. The selection went against the data type in each case but nonetheless reduced the micro-op count from 26 to 20. The throughput also improved to 5 cycles/result, now bound on ports 1 and 2.
Here is my intrinsics implementation, a bit ugly because of the casts needed to match the data types:
int i, t0, t1, t2, t3;
__m128i rx4;
__m128 xr4, x4, istep4 = _mm_set1_ps(istep);
for (i = 0; i < l; i += 4) {
    xr4 = _mm_load_ps(&xr[i]);
    xr4 = _mm_mul_ps(xr4, istep4);
    rx4 = _mm_cvttps_epi32(xr4);
    t0 = _mm_cvtsi128_si32(rx4);
    t1 = _mm_extract_ps(_mm_castsi128_ps(rx4), 1);  /* EXTRACTPS */
    t2 = _mm_extract_ps(_mm_castsi128_ps(rx4), 2);
    t3 = _mm_extract_ps(_mm_castsi128_ps(rx4), 3);
    x4 = _mm_load_ss(&adj43[t0]);
    x4 = _mm_castsi128_ps(_mm_insert_epi32(_mm_castps_si128(x4), *(int*)&adj43[t1], 1));  /* PINSRD */
    x4 = _mm_castsi128_ps(_mm_insert_epi32(_mm_castps_si128(x4), *(int*)&adj43[t2], 2));
    x4 = _mm_castsi128_ps(_mm_insert_epi32(_mm_castps_si128(x4), *(int*)&adj43[t3], 3));
    xr4 = _mm_add_ps(xr4, x4);
    rx4 = _mm_cvttps_epi32(xr4);
    _mm_store_si128((__m128i *)&ix[i], rx4);
}
Alternatively, I have put in a request for the Intel Compiler to emit similar code when vectorizing with the /QxS switch (targeting SSE4.1 code generation). This could appear in the 11.0 release, but no guarantees as yet.