One of the nice things about working on open source code is that any interesting findings can be freely discussed, such as in this blog. With that in mind I recently took up a project to optimize performance of the popular LAME mp3 encoder. Over the years I had seen LAME used in several other studies involving threading, compiler optimization, new architecture evaluation and the like. I wasn’t sure if any new frontier remained for me to discover. However, an initial VTune profiling session turned up some “low hanging fruit” optimization targets that I picked apart for a 70% reduction in encode time. I’ll try to cover these changes in detail over the next few posts.
The first thing that jumped out from the VTune run was a function called “quantize_xrpow_lines”. This was one of the top hotspot functions, consuming about 10% of the total run-time. Here is a link to the latest source file residing on Sourceforge:
http://lame.cvs.sourceforge.net/viewvc/lame/lame/libmp3lame/takehiro.c?revision=1.75&view=markup
The code employs a bit of trickery known as the “Takehiro IEEE754 hack” which uses a sequence of adds to convert a floating point value to its integer counterpart. The hack was conceived back in 2000, during the era of MSVC++ 6.0, which used an expensive _ftol service routine to convert floats to ints. Coincidentally I wrote an article about this issue back then that is still online at http://software.intel.com/en-us/articles/fast-floating-point-to-integer-conversions/ (note you may need to Google for it, the page has moved around more than transient NBA coach Larry Brown over the years). The gist of the paper is that prior to the Pentium III, the x86 ISA did not provide instructions that explicitly performed the float-to-int truncation cast required by the ANSI C standard. The _ftol routine had to modify the FP control word to achieve this behavior on the x87 stack. At that time, the “magic float” hack was a way to speed up the conversion without requiring any of the new SSE instructions that were emerging on the latest CPUs.
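For readers who have not seen the “magic float” idea before, here is a minimal sketch of the general technique. It is not the LAME source: the function and macro names are made up for illustration, and it assumes IEEE 754 single precision, the default round-to-nearest mode, and an input in the range [0, 2^23). Note that the trick rounds to nearest rather than truncating, so it is not a drop-in replacement for a C cast.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 2^23: a float of this magnitude has no fraction bits left. */
#define MAGIC_FLOAT 8388608.0f
/* IEEE 754 single-precision bit pattern of 2^23. */
#define MAGIC_INT 0x4b000000

/* Hypothetical helper (name is illustrative, not taken from LAME).
   Converts x to an int with one float add and one integer subtract.
   The result is rounded to nearest (ties to even), not truncated. */
int magic_float_to_int(float x)
{
    volatile float biased; /* volatile forces a true single-precision store */
    float stored;
    int32_t bits;

    /* Assumption: the trick is only valid for inputs in [0, 2^23). */
    assert(x >= 0.0f && x < MAGIC_FLOAT);

    /* After the add, the rounded integer sits in the low mantissa bits. */
    biased = x + MAGIC_FLOAT;
    stored = biased;

    /* Reinterpret the float's bits as an integer without aliasing issues. */
    memcpy(&bits, &stored, sizeof bits);

    /* Strip the sign/exponent pattern of 2^23, leaving the integer. */
    return (int)(bits - MAGIC_INT);
}
```

Calling magic_float_to_int(3.7f) gives 4 where (int)3.7f gives 3, which is why code built around the trick has to account for the different rounding behavior.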
Getting back to the present day mindset, SSE has been around for years and the most common compilers will at least generate scalar forms of the instructions. Also, the hardware implementations of convert-truncate instructions (e.g. CVTTSS2SI) have improved to where they only take a few cycles on Core2 Duo. Since the hotspot code does two such converts in a tight loop, I wanted to see if the Takehiro hack was still providing the benefits originally intended.
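To make the convert-truncate path concrete, the sketch below calls CVTTSS2SI through its intrinsic, which is roughly what a compiler emits for a plain (int) cast when it targets SSE. The function name is hypothetical, and the preprocessor guard and fallback are there only so the example builds on non-x86 targets.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: float-to-int conversion that truncates toward
   zero, matching the semantics of a C cast. */
int truncate_to_int(float x)
{
    /* Out-of-range inputs are undefined for the C cast, so reject them. */
    assert(x > -2147483648.0f && x < 2147483648.0f);

#if defined(__SSE__)
    /* _mm_cvttss_si32 maps to the CVTTSS2SI instruction. */
    return _mm_cvttss_si32(_mm_set_ss(x));
#else
    /* Portable fallback with the same truncating behavior. */
    return (int)x;
#endif
}
```

No control-word changes are needed here, which is the whole appeal over the old x87 _ftol path.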
Ultimately I found that disabling the hack and reverting to the original code leads to 30% faster encoding time, under certain compilation conditions. The best result comes from building with Intel Compiler 10.1 and the -QxT switch, which targets Supplemental SSE3 code generation (note, -QxW SSE2 code generation is nearly as fast but I’m sure our compiler group would like to promote the latest switches). MSVC++ 2005 compilation also chips off significant encoding time as long as you use the switch combination “/arch:SSE2 /fp:fast”. Those parameters allow the compiler to generate SSE2 code by default and relax precision requirements. (Without the latter, MSVC will do all calculations in double precision, even if the source code specifies float data types. Double precision is free on the x87 floating-point stack, but in the SSE context you’ll see many CVTSS2SD and CVTSD2SS instructions throughout the code, which will cripple performance.)
Finally, though it only gave a modest performance gain, I came up with a more concise coding of the “quantize_xrpow_lines” loop:
    for (i = 0; i < l; i++) {
        float x0 = xr[i] * istep;
        int rx0 = (int)x0;
        x0 += adj43[rx0];
        ix[i] = (int)x0;
    }
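For anyone who wants to experiment with the loop in isolation, here is one way it might be wrapped into a standalone function. The signature is an assumption for illustration and does not match the LAME source; in particular, the adj43_len parameter is added here only so the table index can be bounds-checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical standalone wrapper around the loop.
   xr        input values (assumed non-negative)
   ix        output integers
   adj43     adjustment table indexed by the truncated value
   adj43_len number of table entries (added for the bounds check) */
void quantize_lines(size_t l, float istep, const float *xr, int *ix,
                    const float *adj43, size_t adj43_len)
{
    size_t i;

    for (i = 0; i < l; i++) {
        float x0 = xr[i] * istep;
        int rx0 = (int)x0;          /* first truncating convert */

        /* The index must fall inside the table. */
        assert(rx0 >= 0 && (size_t)rx0 < adj43_len);

        x0 += adj43[rx0];           /* table-driven rounding adjustment */
        ix[i] = (int)x0;            /* second truncating convert */
    }
}
```

A quick sanity check is to fill a small table with a placeholder value such as 0.5f (not LAME's real table contents), in which case the loop behaves like round-half-up for non-negative inputs.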
The adj43 table lookup prevents a straightforward SIMD implementation, but the Intel Compiler can still vectorize this if you specify “#pragma vector always”. It uses shift and unpack operations to extract the rx0 indices into general purpose registers and gather the array values back into one xmm register. This measured about 20% faster in a microkernel (i.e., a test app separate from the full encoder), but didn’t trim off any appreciable encode time. Nonetheless, restructuring the loop in this fashion leaves it in a better position to leverage SIMD hardware improvements down the line.
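The gather-style vectorization described above can be approximated by hand with SSE2 intrinsics. The following is only a sketch of the idea, not the compiler's actual output: the function name is hypothetical, the indices are extracted through a small stack array rather than with shift and unpack instructions, and the inputs are assumed to produce valid table indices. A scalar loop handles the leftover elements and serves as the fallback when SSE2 is unavailable.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical hand-vectorized variant: four elements per iteration. */
void quantize_lines_gather(size_t l, float istep, const float *xr, int *ix,
                           const float *adj43)
{
    size_t i = 0;

#if defined(__SSE2__)
    const __m128 vstep = _mm_set1_ps(istep);

    for (; i + 4 <= l; i += 4) {
        int idx[4];
        __m128 adj;
        __m128 x0 = _mm_mul_ps(_mm_loadu_ps(xr + i), vstep);

        /* CVTTPS2DQ: four truncating converts in one instruction. */
        _mm_storeu_si128((__m128i *)idx, _mm_cvttps_epi32(x0));

        /* Manual gather: look up each table entry, then pack all four
           back into one xmm register (idx[0] lands in the low lane). */
        adj = _mm_set_ps(adj43[idx[3]], adj43[idx[2]],
                         adj43[idx[1]], adj43[idx[0]]);

        x0 = _mm_add_ps(x0, adj);
        _mm_storeu_si128((__m128i *)(ix + i), _mm_cvttps_epi32(x0));
    }
#endif

    /* Scalar tail (and full fallback without SSE2). */
    for (; i < l; i++) {
        float x0 = xr[i] * istep;
        int rx0 = (int)x0;
        x0 += adj43[rx0];
        ix[i] = (int)x0;
    }
}
```

The table lookups are still scalar loads, which fits the observation that the gain shows up in a microkernel but not in overall encode time.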
