Optimizing Without Breaking a Sweat

Published: 03/14/2012   Last Updated: 03/13/2012

How we optimized DreamWorks Animation's rendering, animation, and special effects applications without recompiling or relinking using LD_PRELOAD to call highly optimized functions provided by Intel® Threading Building Blocks, Intel® Integrated Performance Primitives, and run-time libraries provided with Intel® C++ Compiler for Linux*


We present novel techniques we developed to optimize DreamWorks Animation's rendering, animation, and special effects applications without recompiling or re-linking by preloading highly optimized libraries at run-time using the Linux* loader's LD_PRELOAD environment variable. For an overview, see "DreamWorks Animation and Intel: Forging An Alliance To Advance S3D Entertainment". We optimized common bottlenecks in Linux applications including memory allocation, math functions, and zlib compression without any source or build system changes by using Intel® Threading Building Blocks, Intel® Integrated Performance Primitives, and run-time libraries provided with Intel® C++ Compiler for Linux*. This paper discusses the LD_PRELOAD optimization technique, the performance optimization methodology we used, and the performance gains we measured. As these techniques don't require any source code or build system changes, they can be used to quickly evaluate the impact of high performance libraries on your applications, including third-party libraries and applications for which you don't have the source code. The LD_PRELOAD technique is a great way to quickly determine if it is worth the effort to integrate optimized libraries into your software development environment.

LD_PRELOAD Optimization Technique Overview

This section describes the LD_PRELOAD optimization technique and how it is used to quickly evaluate the performance impact of using optimized libraries on your existing applications without making source code or build system changes. The Linux* loader's LD_PRELOAD environment variable forces a shared library to be loaded at run-time before all others, allowing you to replace or extend the functionality of your application. The loader does this by preempting functions in your application with the versions in the preloaded shared library. Because the preloaded library comes first in the loader's symbol search order, calls that are resolved through the Procedure Linkage Table (PLT) bind to the preloaded definitions instead of the original ones. This allows the replacement of globally bound, preemptable functions in your application, and is how we call highly optimized functions without any source or build system changes.

The readelf utility is used to determine which symbols can be preempted from an object file, shared library or executable named filename: readelf -s filename. Figure 14 shows example readelf output; symbols with TYPE=FUNC, BIND=GLOBAL, and VIS=DEFAULT can be preempted. See the section Extending the LD_PRELOAD Optimization Technique for additional details on readelf.

Figures 1a and 1b show how to preload the optimized math function library, libimf.so, provided with Intel® C++ and Fortran Compilers for Linux, first in sh shell syntax, followed by csh shell syntax:

prompt> export LD_PRELOAD=/opt/intel/Compiler/11.1/038/lib/intel64/libimf.so
prompt> time ./app.exe

Figure 1a. Setting the LD_PRELOAD environment variable for sh shell syntax.

prompt> setenv LD_PRELOAD /opt/intel/Compiler/11.0/027/lib/intel64/libimf.so
prompt> time ./app.exe

Figure 1b. Setting the LD_PRELOAD environment variable for csh shell syntax.

As shown, it is very easy to use this technique: just set the LD_PRELOAD environment variable to point to the high performance library you want to preload and then run your application.

Optimization Methodology

We discuss the optimization methodology we used to optimize DreamWorks Animation's rendering, animation & special effects applications. The optimization methodology consists of four phases and is summarized graphically in Figure 2. The first phase is Establish Baseline, which means establishing a performance baseline that will be used to evaluate the benefit of optimizations to your application. The critical part of this step is selecting workloads that are representative of work the application performs, that are repeatable, and are not too long or too short. The workloads used in this work ranged from 1 minute to 60 minutes. For most applications, workloads lasting a few minutes are preferred, as there is enough run-time to profile accurately down to individual lines of code, and it is possible to quickly evaluate different optimization techniques. While this phase seems straightforward, working with many different companies has shown some common problems to be changing the workload(s) or application during the optimization phase. These occur more frequently in very large development organizations, and can be addressed by making sure the entire organization agrees on the workloads to be used, and to make sure the application doesn't change during the optimization phase. One needs to know a change in performance is due to an optimization, and not from a co-worker changing other parts of the application or workloads.

The second phase is Performance Analysis, where you measure and analyze the performance of the application running the workload(s). Different techniques used to measure application performance include using the system time utility to run the application, inserting timer code into your application, determining the call graph of your application, and using a low overhead profiler to measure clock ticks at the library, function and source file level. The key to the performance analysis is to identify the functions of the application that take the most time.

The third phase is Optimization. The key to successful optimization is to focus on speeding up the parts of the application which take the most run-time, as identified in the Performance Analysis phase. Obviously, optimizing a function that is rarely called will not have a large impact on overall application performance. The optimization phase covers a wide variety of techniques that can be used to optimize applications, including introducing threads to allow the application to run in parallel. However, this paper focuses on how to optimize single threaded applications. The paper "Best Practices for Developing and Optimizing Threaded Applications, Part 1" discusses optimization of multi-threaded applications in detail, including an optimization methodology for threading applications and software development tools to improve performance and correctness of threaded applications. There are many different techniques to optimize an application. Examples include using improved algorithms, increasing compiler optimizations, and using high performance libraries. This paper focuses on using LD_PRELOAD to enable using high performance libraries on existing applications. Finally, the fourth phase is Validate Results, which makes sure the optimized application generates the correct results.

Figure 2. The steps in the Optimization Methodology. The fundamental idea is to iterate over the Performance Analysis, Optimization and Validate Results phases, trying different optimization techniques.

We used the Optimization Methodology in this work. For each application, we established a performance baseline in the Establish Baseline phase, using several different workloads that exercised different parts of the applications. The performance metric we used to measure the impact of the performance optimizations was the average speedup across all workloads for a given application. We used Intel® VTune™ Performance Analyzer for the Performance Analysis phase, as well as Intel® Performance Tuning Utility. We used the low overhead sampling to profile the applications, with the Sample After Value (SAV) set to (CPU frequency)/1000 to allow simple conversion from CPU clock samples to seconds. For this analysis, we mainly used CPU clock ticks to identify functions in the applications that took the most time. Detailed information on more advanced Performance Analysis can be found in the white paper Using Intel® VTune™ Performance Analyzer to Optimize Software on Intel® Core™ i7 Processors. After we identified functions taking the most time, we applied various high performance libraries to optimize memory allocation, math functions, memcpy, and zlib compression, and validated the results using the results from the baseline workloads.
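Setting the Sample After Value to (CPU frequency)/1000 means that each CPU clock sample stands for one millisecond of execution. As a minimal sketch of that conversion (the function name and the 3.0 GHz clock in the note below are illustrative assumptions, not values taken from the profiler or from our test systems):

```c
#include <stdint.h>

/* Convert a count of CPU-clock samples into seconds.
 * Each sample represents `sample_after_value` clock ticks, so
 *   seconds = samples * SAV / frequency.
 * With SAV = frequency / 1000 this reduces to samples / 1000. */
double samples_to_seconds(uint64_t samples,
                          uint64_t sample_after_value,
                          double cpu_hz)
{
    return (double)samples * (double)sample_after_value / cpu_hz;
}
```

For example, on a hypothetical 3.0 GHz processor with SAV = 3,000,000, a library that collects 69,000 samples accounts for about 69 seconds of run-time.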

Optimizing Memory Allocation

We first looked at an animation application and found when we profiled it that it spent a large fraction of time in memory allocation, which motivated us to look at the memory allocator available in Intel® Threading Building Blocks (TBB). New in Intel® TBB version 2.1 is the malloc proxy library, libtbbmalloc_proxy.so.2, which allows the Intel® TBB scalable memory allocator to be used as a drop-in replacement for the standard C/C++ library memory allocation routines using the LD_PRELOAD environment variable. One also needs to set the LD_LIBRARY_PATH environment variable to point to the directory where libtbbmalloc_proxy.so.2 and other Intel® TBB libraries are located. Additional information on the Intel® TBB memory allocator is available in the paper "The Foundations for Scalable Multi-Core Software in Intel® Threading Building Blocks", which includes benchmark results compared to other popular memory allocators. The Intel® TBB memory allocator is especially beneficial for C++ applications that build and destroy large numbers of small objects.
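A quick way to see whether a different allocator is worth evaluating is to time a small allocation-heavy routine with and without the proxy library preloaded. The sketch below is an illustrative workload, not code from the applications discussed in this paper; the function name and its parameters are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate, touch, and free `count` blocks of `size` bytes, `rounds` times.
 * Returns the number of allocations that succeeded. All calls go through
 * malloc/free, so whichever allocator the loader binds is the one exercised. */
size_t churn_small_blocks(size_t count, size_t size, int rounds)
{
    size_t succeeded = 0;
    void **blocks = malloc(count * sizeof *blocks);
    if (blocks == NULL)
        return 0;

    for (int r = 0; r < rounds; ++r) {
        for (size_t i = 0; i < count; ++i) {
            blocks[i] = malloc(size);
            if (blocks[i] != NULL) {
                memset(blocks[i], 0xAB, size);  /* touch the memory */
                ++succeeded;
            }
        }
        for (size_t i = 0; i < count; ++i)
            free(blocks[i]);                    /* free(NULL) is a no-op */
    }
    free(blocks);
    return succeeded;
}
```

Calling churn_small_blocks(1000000, 32, 10) from a small main and timing the program once normally and once with LD_PRELOAD set to libtbbmalloc_proxy.so.2 (with LD_LIBRARY_PATH pointing at the Intel® TBB library directory) gives a rough first comparison. A profile of the real application remains the deciding measurement.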

Figure 3a. Profile data showing the un-optimized animation application took 298 seconds to run.

Figure 3b. Profile data showing the application spent 69 seconds in libc.so.

Figure 3c. Profile data drilling down to the functions in libc.so, showing the majority of time is spent in memory allocation.

Figure 3 shows the baseline profile data we collected for the animation application. The application took 298 seconds to run, with 69 of those seconds spent within libc.so. Figure 3c shows the profile data for libc.so functions, clearly showing significant time in memory allocation.

Figure 4 shows the profile data we collected for the application optimized by using the Intel® TBB memory allocator. The application optimized with Intel® TBB spent 23.6 seconds in libc.so, compared to 69.5 seconds in libc.so for the un-optimized version, so the un-optimized version spent 69.5 - 23.6 = 45.9 seconds in libc.so memory allocation. Note that the figures don't show all of the functions in libc.so, so the total time spent in libc.so is slightly larger than the sum of the functions shown. Adding the time spent in libtbbmalloc.so.2 and libtbbmalloc_proxy.so.2 gives 19.6 seconds in the Intel® TBB memory allocator, a 2.3x speedup in memory allocation and a 10% speedup of the entire application, just by using the Intel® TBB memory allocator. We set LD_PRELOAD to use libtbbmalloc_proxy.so.2, set LD_LIBRARY_PATH to point to the Intel® TBB library directory, and ran our application - truly optimizing without breaking a sweat!
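The two speedup figures follow directly from the timings quoted above. As a small worked sketch (the function names are illustrative; the inputs in the note are the numbers from the text):

```c
/* Speedup of one component: time before divided by time after. */
double component_speedup(double before_s, double after_s)
{
    return before_s / after_s;
}

/* Whole-application speedup when only one component's time changes.
 * The new total is the old total minus the old component time
 * plus the new component time. */
double overall_speedup(double total_before_s,
                       double part_before_s,
                       double part_after_s)
{
    double total_after_s = total_before_s - part_before_s + part_after_s;
    return total_before_s / total_after_s;
}
```

With 45.9 seconds of allocation time replaced by 19.6 seconds, component_speedup(45.9, 19.6) is about 2.3, and overall_speedup(298.0, 45.9, 19.6) is about 1.10, which is the roughly 10% application-level gain reported here.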

Figure 4a. Profile data for animation application optimized using Intel® TBB memory allocator showing 24 seconds in libc.so. The application's memory allocation has been sped up by 2.3x.

Figure 4b. Profile data on the optimized application drilling down to the functions in libc.so, showing that the libc.so memory allocation routines have been replaced by those in libtbbmalloc.so.2 and libtbbmalloc_proxy.so.2.

We have explored the Intel® TBB memory allocator on the applications discussed in this paper and found it to be faster than the default allocator for all workloads. Figure 5 shows the speedup we measured using the Intel® TBB memory allocator on a second application, a cloth simulation. On average, memory allocation is 1.5x faster using the Intel® TBB allocator, leading to an 8% speedup of the entire cloth application, simply by using LD_PRELOAD to force the use of the Intel® TBB memory allocator. Obviously, the overall speedup of your application will depend on the fraction of time the application spends in memory allocation.

Figure 5. Application 2 showing an average 1.5x faster memory allocation using Intel® TBB memory allocator vs. libc.so.

Optimizing Math Functions

The math function library, libimf.so, provided with Intel® C++ Compiler for Linux* contains highly optimized versions of math functions found in the standard C run-time library (libm.so), that match or surpass the accuracy of versions in libm.so. We used the LD_PRELOAD technique to measure the speedup of using libimf.so compared to the system math library, libm.so on render and particle simulation applications built with gcc. As libimf.so implements the API in C/C++ math runtime library, it is a drop-in replacement for libm.so, and you can evaluate the benefit of using LD_PRELOAD to point to libimf.so.

Figure 6. Image from DreamWorks Animation's Monsters vs. Aliens generated by the render application.

Figure 6 shows an image produced by the render application from DreamWorks Animation's Monsters vs. Aliens that we used to measure the benefit of libimf.so. For this image, the run-time is dominated by the time spent in the system math library, libm.so.

Figure 7a. Application level profile data for un-optimized render shading application which took 3,366 seconds to run.

Figure 7b. Library level profile data for un-optimized render shading application showing 1,364 seconds were spent in system math library, libm.so. The majority of the time in libm.so was in the pow function.

Figure 7 shows the profile data for the application built with gcc, where 41% of the overall time was spent in libm.so; the majority of the time in libm.so was in the pow function. We then ran the application built with gcc and set LD_PRELOAD to use libimf.so. Figure 8 shows the resulting profile data: the 1,364 seconds spent in libm.so dropped to just 438 seconds in libimf.so, a 3x speedup in the amount of time spent in math functions and a 1.4x speedup of the overall application time.

Figure 8a. Application level profile data for optimized render shading application which took 2,442 seconds to run.

Figure 8b. Library level profile data for optimized render shading application showing 438 seconds were spent in optimized math function library, libimf.so, provided with Intel® C++ Compiler for Linux*.

Next, we investigated the performance benefits of libimf.so for the particle simulation application. Figure 9 shows an image from an animation sequence of a fire simulation produced by the particle simulation application.

Figure 9. Fire simulation generated by the particle simulation application.

The profile data in Figure 10 shows for the application built with gcc that we spent 5.8 seconds in libm.so. The profile data shows that the application for this workload doesn't spend a large fraction of time in libm.so. As discussed in the Optimization Methodology section, optimizing the parts of your application with the most runtime will lead to the greatest optimization potential. For this workload of the particle simulation, the application is dominated by the amount of time spent in libraries we will call libB.so and libC.so, and optimizing these libraries would have a bigger optimization potential than optimizing math functions. We investigated the functions in libB.so and libC.so and didn't find any general purpose functions that were available in high performance libraries. We were able to optimize the libB.so and libC.so libraries using the Intel® C++ Compiler for Linux*, improving the algorithms, and data structures, but these details are beyond the scope of this paper as this required hard work and this paper focuses on optimizing without breaking a sweat! We used LD_PRELOAD to force using libimf.so instead of libm.so and show the optimized profile data in Figure 11.

For this workload, libimf.so (1.338 seconds) was 4.3x faster than libm.so (5.752 seconds). Overall, using libimf.so on this workload improved the performance of the particle simulation by ~5%. The Intel® C++ Compiler provides a large number of advanced compiler optimizations, and will automatically call libimf.so and other highly optimized run-time libraries provided with the compiler. In addition to the performance benefits we measured with libimf.so, we were able to get additional performance by building the application with Intel® C++ Compiler.

Figure 10. Profile data from the un-optimized particle render simulation of fire shown in Figure 9, showing 5.8 seconds spent in libm.so.

Figure 11. Profile data from the optimized particle render simulation of fire shown in Figure 9, showing 1.3 seconds spent in libimf.so.

Extending the LD_PRELOAD Optimization Technique

The previous examples used premade shared libraries that just dropped in using LD_PRELOAD to replace a subset of functions. There are other optimized libraries that are currently not available as "drop in replacements", but could still provide substantial speedups in an application. By building your own shared libraries, you can harness the power of existing optimized libraries or even provide your own. This section describes optimizing memcpy, zlib, and qsort.

Optimize memcpy

High performance versions of commonly used functions may exist, but perhaps not in a form readily usable as a "drop in" replacement. For instance, Intel® C++ Compiler provides optimized versions of memcpy, memset, memcmp, and memmove inside the run-time library called libirc.a, although these versions have different names, such as _intel_fast_memcpy, _intel_fast_memset, and _intel_fast_memcmp. The Intel® C++ Compiler automatically generates calls to these functions for optimized builds, while non-optimized (-O0) builds call the versions in libc.so. That's great when using the Intel® C++ Compiler, but adopting a different compiler can be a large task.

Without changing compilers, we can provide some of the speedups possible with the Intel® C++ Compiler by using just the run-time libraries provided with Intel® C++ Compiler. One technique is to build our own shared library that links in libirc.a and contains symbols that map the libc.so symbols to the optimized versions in libirc.a, or even to a custom implementation. The key technique when building this shared library is to use the linker option -defsym to create a global symbol that points to another symbol. Rather than creating a shim function called memcpy that turns around and calls _intel_fast_memcpy, one can simply add the symbol memcpy pointing to _intel_fast_memcpy at link time and avoid the overhead of an extra function call. This technique allows global symbols to map directly to a different, faster version. See the Appendix for a detailed description, source code and build scripts to create the shared library libFastLibC.so with optimized versions of these functions. For additional information on creating shared libraries, see the paper How to Write Shared Libraries by Ulrich Drepper. One can also provide implementations of functions that aren't available in performance libraries. For example, Listing 1 in the Appendix shows straightforward implementations of strcmp and strlen we compiled with Intel® C++ Compiler that perform better than their counterparts in libc.so.
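When you compile the replacement function yourself, GCC and compatible compilers offer a source-level counterpart to -defsym: the alias attribute, which makes a second symbol name refer to the same code without a forwarding call. The sketch below is an illustration under stated assumptions rather than the build we used: my_fast_strcmp and the BUILD_PRELOAD_LIB macro are hypothetical names, and the alias is compiled only when that macro is defined so the routine can also be tested under its neutral name.

```c
/* shim.c (hypothetical file name).
 * Possible build as a preloadable library:
 *   gcc -O2 -fPIC -shared -DBUILD_PRELOAD_LIB shim.c -o libshim.so */

/* Replacement routine under a neutral name. */
int my_fast_strcmp(const char *s1, const char *s2)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return (int)*a - (int)*b;
}

#ifdef BUILD_PRELOAD_LIB
/* Export the same code under the libc name. The alias is a second symbol
 * for my_fast_strcmp, so callers of strcmp reach it with no extra call. */
int strcmp(const char *s1, const char *s2)
    __attribute__((alias("my_fast_strcmp"), visibility("default")));
#endif
```

The -defsym approach is still needed when the fast routine lives in a prebuilt archive such as libirc.a, because the alias attribute only works for a function defined in the same source file.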

Figure 12a. Profile data for the animation application using the memory allocator provided with Intel® TBB.

Figure 12b. Profile data for the animation application after using libFastLibC.so, which contains optimized versions of memcpy/memcmp/memset from libirc.a run-time library provided with Intel® C++ Compilers for Linux*, in addition to using Intel® TBB memory allocator. Notice that the total time spent in libc.so and libFastLibC.so is 21.2 seconds, compared to 23.6 seconds in Figure 12a.

Using LD_PRELOAD with libFastLibC.so in the animation application, we were able to reduce the time spent in libc.so from 23.6 to 12.3 seconds, shown in Figures 12a and 12b, respectively. This indicates that the work that used to take 11.3 seconds in libc.so now takes 8.9 seconds in libFastLibC.so, a 1.26x speedup over the libc.so versions.

ZLIB Compression Optimization

This section describes another example of how we extended the LD_PRELOAD technique to optimize zlib, the popular data compression library, using Intel® Integrated Performance Primitives (Intel® IPP) library. Intel® IPP is a collection of high performance libraries for digital media and data-processing applications, and includes optimized functions for lossless compression methods, such as those used in zlib (inflate and deflate) and libbzip2 libraries. We describe how we used Intel® IPP functions to create a shared library that is a drop-in replacement for zlib, and we present results on how we used LD_PRELOAD to optimize existing applications that use zlib compression.

Here are the steps we followed to create a "drop-in" zlib replacement optimized using Intel® IPP:

  1. Navigate to the Intel® IPP sample code, then select Application-level Samples to download samples related to zlib compression. The zlib files are in the directory ipp-samples/data-compression/ipp_zlib. There is a readme.htm file that contains additional information.
  2. Comment out link_to_ipp_lib at the beginning of deflate.c provided in the sample.
  3. Create an Intel® IPP optimized zlib shared library, where ${IPP_DIR} is the main directory where Intel® IPP was installed:

    gcc -O2 -g -I${IPP_DIR}/include -c -fPIC adler32.c crc32.c zutil.c trees.c compress.c uncompr.c inflate.c gzio.c infback.c inffast.c inftrees.c deflate.c
    ld -shared -soname libfastlibz.so -o libfastlibz.so -lc deflate.o adler32.o crc32.o trees.o compress.o zutil.o uncompr.o inflate.o gzio.o infback.o inffast.o inftrees.o ${IPP_DIR}/sharedlib/libippdcem64t.so ${IPP_DIR}/sharedlib/libippsem64t.so ${IPP_DIR}/sharedlib/libippcoreem64t.so

You can now LD_PRELOAD libfastlibz.so to evaluate the performance benefit of using Intel® IPP on your application. We used libfastlibz.so to optimize the convert application that is part of the ImageMagick* software available at http://www.imagemagick.org, converting a large JPEG file to the PNG file format, where re-compression is used when writing the PNG file. Using the system time command, we measured the default version of convert taking 2.3 seconds. Using LD_PRELOAD with libfastlibz.so reduced the runtime to 0.8 seconds, a 2.9x speedup from using Intel® IPP.

Finding Additional Optimization Opportunities: QSORT

Most of this paper has explained how to use LD_PRELOAD to drop in existing performance libraries. One can apply the same techniques to identify preemptable functions in library or client code that take a lot of run-time but are not available in existing libraries. If the function's use is well-defined and/or you have access to the source and/or documentation, you might be able to build your own version of the function(s) inside a shared library and use LD_PRELOAD to use the optimized version(s). Sometimes the same algorithm compiled with Intel® C++ Compiler will be faster. This approach can also be applied where the full build environment is unavailable, as only single functions would need to be built. It is common to start using the Intel® C++ Compiler to build hot functions, and to use gcc and g++ binary compatibility to link together a hybrid application built by both gcc and the Intel® C++ Compiler.

Figure 13: Profiling data for libc.so library for the animation application.

After additional optimizations to the animation application discussed in the Optimizing Memory Allocation section, we can see in Figure 13 that 5.5 seconds are spent in __GI__memcpy and 2.3 seconds in msort_with_tmp within libc.so. After perusing the libc.so source code, we discovered that __GI__ is the prefix for internal versions of functions, which are therefore not preemptable. However, we also found that qsort calls msort_with_tmp, which then calls __GI__mempcpy, so the majority of the 7.8 seconds could be the result of the function qsort. This could be confirmed with a call graph analysis using Intel® VTune™ Performance Analyzer.

Next we examined the output of the readelf utility, readelf -s filename, shown in Figure 14. We looked at the symbol table '.dynsym' for TYPE=FUNC, BIND=GLOBAL, and VIS=DEFAULT to identify functions that have the default visibility and are global. We find that qsort is a global symbol that is preemptable.

Figure 14: readelf -s output showing global functions that can be preempted.
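The Type, Bind and Vis columns that readelf prints are decoded from the st_info and st_other fields of each symbol table entry. As a minimal sketch of the same three-part check in C, assuming the <elf.h> header shipped with glibc on Linux (the function name is illustrative):

```c
#include <elf.h>
#include <stdbool.h>

/* True when a 64-bit ELF symbol entry matches the three readelf columns
 * named in the text: Type FUNC, Bind GLOBAL, Vis DEFAULT. */
bool is_preemptable(const Elf64_Sym *sym)
{
    return ELF64_ST_TYPE(sym->st_info) == STT_FUNC
        && ELF64_ST_BIND(sym->st_info) == STB_GLOBAL
        && ELF64_ST_VISIBILITY(sym->st_other) == STV_DEFAULT;
}
```

A symbol such as qsort passes this check, while internal names that are bound locally or given hidden visibility do not.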

Ideally, one would recommend changing the source code to use std::sort to avoid the overhead of the comparison function call in qsort, but the number of changes could be large and risky. Alternatively, we can implement qsort in a shared library and introduce it to the existing binary with LD_PRELOAD. We have now identified libc.so qsort as an optimization candidate. Using the same process as above, we saved over 6 seconds by compiling our own qsort with Intel® C++ Compiler, which automatically uses the optimized memcpy instead of the slower libc.so internal __GI__mempcpy. Improving the algorithm could improve performance further.
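To give a sense of what a preloadable qsort can look like, here is a short sketch with the same signature as the libc function. It is not the implementation we built: it swaps in an in-place heapsort, chosen for brevity and because it needs no temporary buffer, and the names shim_qsort and compare_ints are illustrative. qsort is not required to be stable, so heapsort is a valid substitute. Whether it is faster than the libc.so version for a given workload has to be measured.

```c
#include <stddef.h>

/* Swap two elements of `size` bytes each. */
static void swap_bytes(unsigned char *a, unsigned char *b, size_t size)
{
    while (size--) {
        unsigned char t = *a;
        *a++ = *b;
        *b++ = t;
    }
}

/* Restore the max-heap property for the subtree rooted at `root`. */
static void sift_down(unsigned char *base, size_t root, size_t n, size_t size,
                      int (*cmp)(const void *, const void *))
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n)
            return;
        if (child + 1 < n &&
            cmp(base + child * size, base + (child + 1) * size) < 0)
            ++child;                       /* pick the larger child */
        if (cmp(base + root * size, base + child * size) >= 0)
            return;
        swap_bytes(base + root * size, base + child * size, size);
        root = child;
    }
}

/* Same parameters as qsort; sorts in place in O(n log n) comparisons. */
void shim_qsort(void *base_ptr, size_t n, size_t size,
                int (*cmp)(const void *, const void *))
{
    unsigned char *base = base_ptr;
    if (n < 2 || size == 0)
        return;
    for (size_t i = n / 2; i-- > 0; )      /* build the heap */
        sift_down(base, i, n, size, cmp);
    for (size_t end = n - 1; end > 0; --end) {
        swap_bytes(base, base + end * size, size);  /* move max to the end */
        sift_down(base, 0, end, size, cmp);
    }
}

/* Example comparator for int elements. */
int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
```

To preempt the libc.so version, the routine would be built into a shared library and exported under the name qsort, for instance with the -defsym technique described earlier, and then loaded with LD_PRELOAD.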


Conclusion

We presented a novel optimization technique of using LD_PRELOAD and highly optimized replacement libraries to optimize existing applications without any source code or build system changes. We showed how we optimized common bottlenecks in Linux* applications, including memory allocation, math functions, memcpy/memset/memmove and zlib compression using Intel® Threading Building Blocks, Intel® Integrated Performance Primitives, and run-time libraries provided with Intel® C++ Compiler for Linux*. We were able to obtain impressive performance results using these highly optimized libraries on DreamWorks Animation's rendering, animation and special effects applications. We encourage readers to use the techniques presented in this paper to quickly evaluate if these libraries can benefit their applications.


Appendix

The Appendix describes how to create libFastLibC.so to take advantage of the optimized implementations of memcpy, memset, memcmp, and memmove in the run-time library called libirc.a provided with Intel® C++ Compiler for Linux*.

  1. Create a source file with a stub function that calls all the functions you want linked into your shared library. You could also add any custom implementations as well. See Listing 1 for source code.
  2. Build and link passing flags to control symbol visibility and remap global symbols. See Listing 2 for the build script.
    1. -defsym creates a global symbol pointing to the name of the library's internal version.
    2. Prevent the global symbols of the libraries used to build this shared library from being exposed, to avoid polluting the application's global symbol namespace.
      1. --exclude-libs=libirc (if your platform's ld supports it or…)
      2. -version-script option to explicitly define what symbols will be locally visible.
        1. This approach simulates the --exclude-libs feature by using a version script with all the symbols from the library you wish to exclude. It may be tempting to explicitly state the global symbols desired instead, but that could hide symbols required and cause compatibility problems.
        2. The tool nm is used to list all globally defined symbols.
        3. The tools echo and sed are used to filter and format the list of symbols into a version script file.
  3. Verify the library's contents with readelf -s libFastLibC.so

    typedef unsigned long size_t;

    #ifdef __cplusplus
    extern "C" {
    #endif

    /* NOTE: Internal function names may change with each compiler release. */
    extern int   _intel_fast_memcmp(const void *s1, const void *s2, size_t n);
    extern void *_intel_fast_memcpy(void *b, const void *a, size_t n);
    extern void *_intel_fast_memset(void *s, int c, size_t n);
    extern void *__intel_new_memmove(void *dest, const void *src, size_t n);

    /* Never called. It only references the libirc.a functions so that the
       linker pulls them into the shared library. */
    void ForceLinkerToLinkInAnyFunctionsItCalls()
    {
        void *a = 0;
        void *b = 0;
        const char *c = 0;
        _intel_fast_memset(a, 0, 1);
        _intel_fast_memcpy(a, c, 1);
        __intel_new_memmove(a, b, 1);
        _intel_fast_memcmp(a, b, 1);
    }

    /* Custom strcmp: compares as unsigned char and stops at the first
       difference or at the terminating NUL. */
    int strcmp(const char *s1, const char *s2) __attribute__((visibility("default")));
    int strcmp(const char *s1, const char *s2)
    {
        int dist = 0;
        while (!(dist = *(const unsigned char *)s1 - *(const unsigned char *)s2) && *s2)
            ++s1, ++s2;
        return dist;
    }

    /* Custom strlen: walks to the terminating NUL and returns the distance. */
    size_t strlen(const char *s1) __attribute__((visibility("default")));
    size_t strlen(const char *theString)
    {
        const char *startOfString = theString;
        while (*theString)
            ++theString;
        return theString - startOfString;
    }

    #ifdef __cplusplus
    }
    #endif

Listing 1: Source code FastLibC.cc.

    # Because --exclude-libs isn't supported on this platform and we want
    # to exclude all of the global symbols from libirc, we must use a version
    # map to make the symbols' visibility local (versus the default global).
    # Generate a visibility map containing all the externally defined symbols of libirc.
    echo "{ local:"`nm --defined-only --extern-only /rel/third_party/intelcompiler64/11.1.038_baseline/lib/intel64/libirc.a | sed -n 's/................ . \(.*\)$/\1;/p'`"};" > visLibIrcAsLocal.map

    -o libFastLibC.so

Listing 2. Build script to create libFastLibC.so.


Acknowledgements

We would like to acknowledge Charles Congdon, Ram Ramanujam and Sheng Fu for their contributions to this work. In addition, Alexandr Konovalov, Alexey Kukanov and Robert Reed on the Intel® TBB Team addressed questions on memory allocation optimization. Melanie Blower explained the fine details on how to exclude symbols in a shared library, and Knud Kirkegaard provided useful discussions on optimization and symbol preemption on Linux*.

About the Authors


  John O'Neill, Ph.D. is an application engineer for Intel in Applications and Solutions Engineering team within the Software and Services Group (SSG) working on optimizing applications to take advantage of the latest Intel software and hardware innovations. Previously at Intel, John was a technical consultant on the Intel® Compiler team and worked with numerous companies in digital content creation, financial services and enterprise/database industries to optimize their applications. Before joining Intel, John worked for the University of Minnesota as a researcher in high energy physics. John has published numerous articles in academic journals and books. He holds a Ph.D. in physics from the University at Albany, State University of New York.


  Alex Wells is a software engineer in Intel's Applications and Solutions Engineering team within the Software and Services Group (SSG). Alex developed a number of games and edutainment titles, co-architected a cross-platform runtime 3D engine in C++ targeting Windows*, Mac OS*, & Sony PlayStation*. Alex has also designed SOA business solutions with COM, C#, XML, WWF, and SQL Server. He specializes in the 3D tool chain and performance optimization. His current work is to maximize animation tool performance for DreamWorks Animation. He holds a BS in Computer Science from the University of California, Santa Barbara.


  Matt Walsh is a software engineer in Intel's Applications and Solutions Engineering team within the Software and Services Group (SSG). Matt's career has spanned computer software design and optimization in High Performance / Clustered computing, e-Commerce applications, PC and arcade games, FPGA hardware acceleration products and compilers. Prior to Intel he co-founded Boxaroo.com and worked at other startups. He holds a BS in Mechanical Engineering from the University of Illinois, Urbana-Champaign.



References

  1. "The Foundations for Scalable Multi-Core Software in Intel® Threading Building Blocks", http://www.intel.com/technology/itj/2007/v11i4/5-foundations/5-memory.htm
  2. Intel® C++ Compiler Professional Edition for Linux* - Documentation
  3. Intel® Integrated Performance Primitives product website; select Application-level Samples to download samples related to zlib compression.
  4. Intel® Threading Building Blocks (TBB) on GitHub*
  5. How to Write Shared Libraries by Ulrich Drepper
  6. Rethinking the Pipeline: DreamWorks Animation Advances the Art
