Open source project - LAME mp3 encoder optimization
By Michael Stoner (Intel) (7 posts) on October 6, 2008 at 4:17 pm
One of the nice things about working on open source code is that any interesting findings can be freely discussed, such as in this blog. With that in mind I recently took up a project to optimize performance of the popular LAME mp3 encoder. Over the years I had seen LAME used in several other studies involving threading, compiler optimization, new architecture evaluation and the like. I wasn’t sure if any new frontier remained for me to discover. However, an initial VTune profiling session turned up some “low hanging fruit” optimization targets that I picked apart for a 70% reduction in encode time. I’ll try to cover these changes in detail over the next few posts.
The first thing that jumped out from the VTune run was a function called “quantize_xrpow_lines”. This was one of the top hotspot functions, consuming about 10% of the total run-time. Here is a link to the latest source file residing on Sourceforge:
http://lame.cvs.sourceforge.net/viewvc/lame/lame/libmp3lame/takehiro.c?revision=1.75&view=markup
The code employs a bit of trickery known as the “Takehiro IEEE754 hack”, which uses a sequence of adds to convert a floating-point value to its integer counterpart. The hack was conceived back in 2000, during the era of MSVC++ 6.0, which used an expensive _ftol service routine to convert floats to ints. Coincidentally, I wrote an article about this issue back then that is still online at http://software.intel.com/en-us/articles/fast-floating-point-to-integer-conversions/ (note you may need to Google for it; the page has moved around more than transient NBA coach Larry Brown over the years). The gist of the paper is that prior to the Pentium III, the x86 ISA did not provide instructions that explicitly performed the float-to-int truncation cast required by the ANSI C standard. The _ftol routine had to modify the FP control word to achieve this behavior on the x87 stack. At that time, the “magic float” hack was a way to improve conversion performance without requiring any of the new SSE instructions that were emerging on the latest CPUs.
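As a rough illustration of the general “magic float” idea (a generic sketch, not the actual LAME source), adding 2^23 to a small non-negative float pushes its integer part into the low mantissa bits, where it can be read back with an integer subtract. The helper name, the constant names and the input range check below are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 2^23: at this magnitude adjacent floats are exactly 1.0 apart. */
#define MAGIC_FLOAT 8388608.0f
/* Bit pattern of 2^23 as an IEEE 754 single-precision value. */
#define MAGIC_INT 0x4B000000

/* Hypothetical helper for illustration; valid only for 0 <= x < 2^23. */
static int magic_float_to_int(float x)
{
    volatile float biased; /* volatile forces a store rounded to single precision */
    float rounded;
    int32_t bits;

    assert(x >= 0.0f && x < MAGIC_FLOAT);

    biased = x + MAGIC_FLOAT;   /* the FP add rounds x to a whole number */
    rounded = biased;
    memcpy(&bits, &rounded, sizeof bits); /* reinterpret the float's bits */
    return (int)(bits - MAGIC_INT);       /* strip the 2^23 bias */
}
```

Note that this sketch rounds to nearest (ties to even under the default rounding mode) rather than truncating, so 2.6f becomes 3 where a C cast would give 2. Any code built on the trick has to account for that difference.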
Getting back to the present-day mindset, SSE has been around for years and the most common compilers will at least generate scalar forms of the instructions. Also, the hardware implementations of convert-truncate instructions (e.g. CVTTSS2SI) have improved to where they only take a few cycles on Core 2 Duo. Since the hotspot code does two such converts in a tight loop, I wanted to see if the Takehiro hack was still providing the benefits originally intended.
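For reference, here is a small sketch of the two ways a truncating convert can be written in C: as a plain cast, which an SSE-enabled compiler will typically lower to CVTTSS2SI, and through the `_mm_cvttss_si32` intrinsic from `<xmmintrin.h>`, which maps to that instruction directly. The function names and the `HAVE_SSE_CONVERT` macro are made up for this example, and the fallback branch is only there so the sketch still builds on targets without SSE.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_CONVERT 1
#else
#define HAVE_SSE_CONVERT 0
#endif

/* Truncating conversion written as an ordinary C cast. */
static int truncate_cast(float x)
{
    return (int)x;
}

/* The same conversion requested explicitly through the SSE intrinsic. */
static int truncate_sse(float x)
{
#if HAVE_SSE_CONVERT
    return _mm_cvttss_si32(_mm_set_ss(x)); /* CVTTSS2SI: convert with truncation */
#else
    return (int)x; /* fallback for targets without SSE */
#endif
}
```

Both forms truncate toward zero, so 2.7f gives 2 and -2.7f gives -2.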
Ultimately I found that disabling the hack and reverting to the original code (plain float-to-int casts) leads to 30% faster encoding, under certain compilation conditions. The best result comes from building with Intel Compiler 10.1 and the -QxT switch, which targets Supplemental SSE3 code generation (note: -QxW SSE2 code generation is nearly as fast, but I’m sure our compiler group would like to promote the latest switches). MSVC++ 2005 compilation also chips off significant encoding time as long as you use the switch combination “/arch:SSE2 /fp:fast”. Those parameters allow the compiler to generate SSE2 code by default and relax precision requirements. (Without the latter, MSVC will do all calculations in double precision, even if the source code specifies float data types. Double precision is free on the x87 floating-point stack, but in the SSE context you’ll see many CVTSS2SD and CVTSD2SS instructions throughout the code, which will cripple performance.)
Finally, though it only gave a modest performance gain, I came up with a more concise coding of the “quantize_xrpow_lines” loop:
    for (i = 0; i < l; i++) {
        float x0 = xr[i] * istep;
        int rx0 = (int)x0;
        x0 += adj43[rx0];
        ix[i] = (int)x0;
    }
The adj43 table lookup prevents a straightforward SIMD implementation, but the Intel Compiler can still vectorize this if you specify “#pragma vector always”. It uses shift and unpack operations to extract the rx0 indices into general-purpose registers and gather the array values back into one xmm register. This measured about 20% faster in a microkernel (i.e., a test app separate from the full encoder), but didn’t trim off any appreciable encode time. Nonetheless, restructuring the loop in this fashion leaves it in a better position to leverage SIMD hardware improvements down the line.
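To experiment with the loop outside the encoder, in the spirit of the microkernel mentioned above, it can be wrapped in a standalone function along these lines. This is only a sketch: the function name `quantize_lines`, the parameter list and the `ADJ_TABLE_SIZE` bound are assumptions for illustration and are not taken from the LAME source.

```c
#include <assert.h>

/* Hypothetical table size for this sketch; not taken from the LAME source. */
#define ADJ_TABLE_SIZE 16

/* Quantize l values from xr into ix, using adj43 as a rounding-offset table. */
static void quantize_lines(int l, float istep, const float *xr,
                           const float *adj43, int *ix)
{
    int i;
#ifdef __INTEL_COMPILER
#pragma vector always /* Intel-specific hint; skipped on other compilers */
#endif
    for (i = 0; i < l; i++) {
        float x0 = xr[i] * istep;
        int rx0 = (int)x0;          /* first truncating convert */
        /* Debug-only bounds check; build with NDEBUG when timing. */
        assert(rx0 >= 0 && rx0 < ADJ_TABLE_SIZE);
        x0 += adj43[rx0];           /* table lookup indexed by the truncated value */
        ix[i] = (int)x0;            /* second truncating convert */
    }
}
```

With a stand-in table filled with 0.5f (not the real adj43 values) and istep = 1.0f, the inputs {0.0f, 1.2f, 2.6f, 3.5f} quantize to {0, 1, 3, 4}, which is a convenient sanity check before timing the loop.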
Categories: Graphics & Media, Open Source, Parallel Programming, Software Tools