Modern high performance computers are built with a combination of resources including multi-core processors, many-core processors, large caches, high speed memory, high bandwidth inter-processor communications fabric, and high speed I/O capabilities. High performance software needs to be designed to take full advantage of this wealth of resources. Whether you are re-architecting or tuning existing applications for maximum performance, or architecting new applications for existing or future machines, it is critical to be aware of the interplay between programming models and the efficient use of these resources. Consider this a starting point for information regarding Code Modernization. When it comes to performance, your code matters!
Building parallel versions of software can enable applications to run a given data set in less time, run multiple data sets in a fixed amount of time, or run large-scale data sets that are prohibitively expensive with unoptimized software. The success of parallelization is typically quantified by measuring the speedup of the parallel version relative to the serial version. It is also useful to compare that measured speedup with the upper limit of the potential speedup. That upper limit can be estimated using Amdahl's Law and Gustafson's Law.
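As a rough illustration of those two upper limits, the sketch below evaluates both laws. The function names and the symbols p (the fraction of run time that can be parallelized) and n (the number of processors) are chosen here for illustration and do not come from the text above.

```c
/* Amdahl's Law: the problem size is fixed.
 * p = fraction of the serial run time that can be parallelized (0..1)
 * n = number of processors
 * The serial part (1 - p) is unchanged; the parallel part shrinks to p / n. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Gustafson's Law: the problem size grows with the processor count.
 * p = parallel fraction of the run time as measured on the n-processor run
 * The result is the "scaled speedup" relative to running that larger
 * problem on a single processor. */
double gustafson_speedup(double p, int n)
{
    return (1.0 - p) + p * n;
}
```

For example, with p = 0.95 Amdahl's Law caps the speedup below 20 no matter how many processors are added, because the remaining 5% of serial work never shrinks. Gustafson's Law is the more useful estimate when extra processors are used to solve a larger problem in the same time.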
Good code design takes into consideration several levels of parallelism: vector parallelism within a core, thread parallelism across cores, and multi-node (rank) parallelism across a cluster.
The goal of code modernization is code that uses all three levels of parallelism effectively and efficiently.
Factoring into these considerations is the impact of the memory model of the machine: amount and speed of main memory, memory access times with respect to location of memory, cache sizes and numbers, and requirements for memory coherence.
Poor data alignment can severely degrade the performance of vectorized code. Data should also be organized in a cache-friendly way; if it is not, performance suffers whenever the application requests data that is not in the cache. The fastest memory access occurs when the needed data is already in cache. Data is transferred to and from cache in whole cache lines, so if the next piece of data is not within the current cache line, or is scattered among multiple cache lines, the application may have poor cache efficiency.
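A minimal sketch of both ideas follows, assuming a 64-byte cache line (check the value for your target) and C11's aligned_alloc. The helper names are hypothetical and chosen only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Allocate n doubles starting on a cache-line boundary.
 * The byte count is rounded up because aligned_alloc expects a size
 * that is a multiple of the alignment. Release the buffer with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}

/* Return 1 if p starts on a cache-line boundary, 0 otherwise. */
int is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* Sum a rows x cols matrix stored in row-major order.
 * The inner loop walks consecutive addresses (unit stride), so every
 * element of a fetched cache line is used before the next line is needed.
 * Swapping the two loops would stride by cols elements on each step and
 * touch a different cache line on almost every access. */
double sum_row_major(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            total += a[i * cols + j];
    return total;
}
```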
Division and transcendental math functions are expensive even when directly supported by the instruction set. If your application performs many division and square root operations at run time, performance may degrade because the hardware has only a limited number of functional units for them, and those units can become the bottleneck of the pipeline. Since these instructions are expensive, the developer may wish to cache frequently used values to improve performance.
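One simple form of this caching is to compute a reciprocal once and reuse it. The sketch below uses a hypothetical helper, scale_by_divisor, to show the pattern.

```c
#include <stddef.h>

/* Divide every element of v by the same divisor.
 * The single division is done once, outside the loop, and its result is
 * cached in inv; the loop body then needs only a multiplication. */
void scale_by_divisor(double *v, size_t n, double divisor)
{
    const double inv = 1.0 / divisor;   /* one division, cached */
    for (size_t i = 0; i < n; ++i)
        v[i] *= inv;                    /* multiply instead of divide */
}
```

Note that x * (1.0 / d) is not always bit-identical to x / d in floating point, which is why compilers generally do not make this substitution on their own unless relaxed floating-point options are enabled. Confirm that the small rounding difference is acceptable for your application before applying it.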
There is no “one recipe, one solution” technique. A great deal depends on the problem being solved and the long-term requirements for the code, but a good developer will pay attention to all levels of optimization, both for today’s requirements and for the future.
Intel has built a full suite of tools to aid in code modernization - compilers, libraries, debuggers, performance analyzers, parallel optimization tools and more. Intel even has webinars, documentation, training examples, and best known methods and case studies which are all based on over thirty years of experience as a leader in the development of parallel computers.
The Code Modernization optimization framework takes a systematic approach to application performance improvement. This framework takes an application through five optimization stages, each stage iteratively improving the application performance. Before you start the optimization process, however, you should consider whether the application needs to be re-architected (given the guidelines below) to achieve the highest performance, and then follow the Code Modernization optimization framework.
By following this framework, an application can achieve the highest performance possible on Intel® Architecture. The stepwise approach helps the developer achieve the best application performance in the shortest possible time. In other words, it allows the program to maximize its use of all parallel hardware resources in the execution environment. The five stages are described in turn below.
At the beginning of your optimization project, select an optimizing development environment. The decision you make at this step will have a profound influence on the later steps. Not only will it affect the results you get, it could also substantially reduce the amount of work you have to do. The right optimizing development environment provides good compiler tools, optimized ready-to-use libraries, and debugging and profiling tools that pinpoint exactly what the code is doing at run time. Check out the webinars on the Intel® Advisor XE tool, which can be used to identify vectorization and threading opportunities.
Once you have exhausted the available optimization solutions, extracting greater performance from your application means beginning the optimization process on the application source code. Before you vectorize and parallelize the application, you need to make sure it delivers the right results. Equally important, you need to make sure it does the minimum number of operations to get that correct result. You should look at the data and algorithm related issues such as:
You may also have to deal with language-related performance issues. If you have chosen C/C++, the language-related issues are:
Try vector level parallelism. First try to vectorize the innermost loop. For efficient vector loops, make sure that there is minimal control flow divergence and that memory accesses are coherent. Outer loop vectorization is a further technique to enhance performance. By default, compilers attempt to vectorize the innermost loops in nested loop structures. In some cases, however, the number of iterations in the innermost loop is small, and inner-loop vectorization is not profitable. If an outer loop contains more work, a combination of elemental functions, strip-mining, and pragma/directive SIMD can force vectorization at this outer, profitable level.
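The sketch below shows two small inner loops written with those guidelines in mind. It assumes a C99 compiler and uses the OpenMP 4.0 `simd` directive as the pragma/directive SIMD mechanism; the function names are illustrative only.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * Accesses are unit stride, and 'restrict' promises that x and y do not
 * overlap, which removes a dependence the compiler would otherwise have
 * to assume. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Replace negative values with zero.
 * The body is a simple select rather than a branch with differing work on
 * each side, which compilers can usually turn into a masked or blend
 * operation so that all SIMD lanes follow the same control flow. */
void clamp_negatives(size_t n, float *restrict v)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        v[i] = (v[i] < 0.0f) ? 0.0f : v[i];
}
```

The pragma takes effect only when the compiler's OpenMP support is enabled (for example, -fopenmp with GCC or Clang); otherwise it is ignored and the loops still produce the same results. Whether a loop was actually vectorized should be confirmed from the compiler's vectorization report.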
Now we get to thread level parallelization. Identify the outermost loop level and try to parallelize it. This requires taking care of potential data races and moving data declarations inside the loop as necessary. It may also require that the data be laid out in a cache-efficient manner, to reduce the overhead of maintaining the data across multiple parallel paths. The rationale for choosing the outermost level is to provide as much work as possible to each individual thread. Amdahl’s law states: The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Since the amount of work needs to compensate for the overhead of parallelization, it helps to have as large a parallel effort in each thread as possible. If the outermost level cannot be parallelized due to unavoidable data dependencies, try to parallelize at the next-outermost level that can be parallelized correctly.
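A minimal sketch of that pattern with OpenMP follows. The function row_sums is a hypothetical example: it parallelizes the outer loop over rows, declares its temporary inside the loop so each iteration has a private copy, and uses a reduction clause to avoid a data race on the shared total.

```c
#include <stddef.h>

/* Compute the sum of each row of a rows x cols row-major matrix into
 * out[0..rows-1] and return the grand total. */
double row_sums(const double *a, size_t rows, size_t cols, double *out)
{
    double total = 0.0;

    /* Parallelize the outermost loop so each thread gets whole rows. */
    #pragma omp parallel for reduction(+:total)
    for (size_t i = 0; i < rows; ++i) {
        double s = 0.0;               /* declared inside the loop: private */
        for (size_t j = 0; j < cols; ++j)
            s += a[i * cols + j];
        out[i] = s;                   /* each iteration writes its own element */
        total += s;                   /* combined safely by the reduction */
    }
    return total;
}
```

With GCC or Clang, compile with -fopenmp to enable the pragma. Without it the pragma is ignored and the function runs serially. With floating-point data the reduction may add the partial sums in a different order than the serial loop, so the total can differ in the last bits.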
Lastly we get to multi-node (Rank) parallelism. To many developers, the Message Passing Interface (MPI) is a black box that “just works” behind the scenes to transfer data from one MPI task (process) to another. The beauty of MPI for the developer is that the algorithmic coding is hardware independent. The concern developers have is that on a many-core architecture with 60+ cores, the communication between tasks may create a communication storm, either within a node or across nodes. To mitigate these communication bottlenecks, applications should employ hybrid techniques that combine a few MPI tasks with many OpenMP threads.
A well-optimized application should address vector parallelization, multi-threading parallelization, and multi-node (Rank) parallelization. However, to do this efficiently, it is helpful to use a standard step-by-step methodology to ensure each stage is considered. The stages described here can be (and often are) reordered depending upon the specific needs of each individual application, and you can iterate on a stage more than once to achieve the desired performance.
Experience has shown that all stages must at least be considered to ensure an application delivers great performance on today’s scalable hardware as well as being well positioned to scale effectively on upcoming generations of hardware.
Give it a try!