Performance tuning is typically an afterthought. Most software development projects focus on analyzing, architecting, and implementing the core business requirements. Increasingly demanding time-to-market requirements force software deployment right after the core business requirements are implemented and tested, without much emphasis on performance tuning. Similarly, hardware deployment may not devote sufficient attention to the performance requirements of the software being hosted. Upgrading to the latest hardware achieves performance gains in most scenarios. However, more often than not, a lot of potential performance is left on the table. The need to exploit system capabilities is becoming more relevant in the era of multi-core processors and NUMA system configurations. It is essential to take advantage of the processor's micro-architectural enhancements. It is also critical to exploit platform-level improvements made in other subsystems: memory, storage, and networking. Intel® I/O Acceleration Technology (Intel® I/OAT) is one example of that. Making good use of new operating system capabilities is important as well.
Objective and Expectations
The purpose of Tuning Corner is to provide some tuning notes that should help boost system performance. Each tuning note highlights a small conceptual performance problem and demonstrates a simple technique to improve performance. It is important to note, however, that the amount of gain largely depends on the nature of the system itself and the optimization effort put in. The tuning techniques use Intel software tools to extract better system performance.
System performance issues discussed here refer to both software and hardware. Picking the "right" hardware is as important as optimizing the software being deployed on that hardware. The major focus here, however, is on software optimization. The approach is to get the most bang for the buck, with the emphasis on easy-to-make enhancements that deliver the majority of the performance gains.
Note: More tuning notes will be documented and made available here over time. Until then, treat the topics listed below as tuning note ideas to be completed in the near future. The posted tuning notes may be updated in the future, as time permits, to make necessary changes or to provide additional content. Some of the contents may be reorganized as necessary to cover both Windows and Linux, or to cover both 32-bit and 64-bit environments. Please feel free to recommend new topics for tuning notes that you may benefit from.
Performance tuning notes are categorized into three levels based on the amount of effort required and the anticipated performance gains:
Low Hanging Fruit
Tuning notes in this section cover quick and low-risk tuning activities that may be achieved in a matter of hours, not days or weeks.
- System-level bottlenecks:
  - Underutilized cores in the system – avoiding situations where a few cores in the system are utilized while the rest have little to do
  - Overutilized cores in the system – avoiding situations where a few cores have to do too many tasks
  - Overworked networking subsystem
  - Overworked storage subsystem
- Just link in the Intel Compiler's math library (Windows) and enjoy high-performance math functionality in your math-heavy applications
- Using the Intel Compiler without any code changes – take advantage of auto-vectorization, auto-parallelization, high-level optimization, a better math library, etc.
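To make the "underutilized cores" item above concrete, here is a minimal sketch in standard C++ that spreads a simple reduction over all hardware threads instead of leaving the work on one core. The function name `parallel_sum` and its parameters are hypothetical and chosen for illustration only; this is not taken from any tuning note, and the gain on a real workload depends on how much work each thread receives.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using up to `requested_threads` threads.
// Passing 0 asks for one thread per hardware thread reported by the runtime.
double parallel_sum(const std::vector<double>& data, unsigned requested_threads = 0) {
    std::size_t n_threads = requested_threads != 0
                                ? requested_threads
                                : std::thread::hardware_concurrency();
    if (n_threads == 0) n_threads = 1;  // hardware_concurrency() may return 0
    // Never start more threads than there are elements, and always at least one.
    n_threads = std::max<std::size_t>(1, std::min(n_threads, data.size()));
    assert(n_threads >= 1);

    // Each thread writes its result to its own slot, so no locking is needed.
    std::vector<double> partial(n_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;

    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0.0);
        });
    }
    for (std::thread& w : workers) w.join();

    // Combine the per-thread partial sums on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling `parallel_sum(values)` uses every hardware thread the runtime reports, while `parallel_sum(values, 1)` gives a single-threaded baseline to compare against. On some toolchains the program must be linked with thread support (for example `-pthread`).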
Tuning notes in this section cover lightweight tuning activities that may be achieved in a matter of days, not weeks or months. Once the low-hanging fruit and systemic bottlenecks are taken care of, a quick VTune analysis may be useful to identify the top 10 hotspots in your application. Some of the minor tweaks recommended here may boost performance significantly.
- VTune Analysis (detailed micro-architectural analysis with counters):
  - Last level cache (LLC) misses
  - Resource contention (locks)
- Intel Compiler usage:
  - Encouraging vectorization with code reorganization (SVML)
  - Minor tweaks to loops
  - Mild adjustments to data layout
  - OpenMP
- OS capabilities – NUMA systems:
  - Thread/process affinity
  - Memory allocation affinity
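As an illustration of the "mild adjustments to data layout" item in the list above, the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout for the same update loop. The particle types and function names are hypothetical. The assumption, which should be checked with the compiler's vectorization report on the target system, is that the unit-stride loop in the second form is easier for a compiler to auto-vectorize and wastes less cache bandwidth on fields the loop never touches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): all fields of one particle are adjacent, so a
// loop that touches only x and vx strides over the four unused fields.
struct ParticleAoS {
    float x, y, z, vx, vy, vz;
};

// Structure of arrays (SoA): each field is stored contiguously, so the same
// loop reads and writes memory with unit stride.
struct ParticlesSoA {
    std::vector<float> x, y, z, vx, vy, vz;
};

void advance_x_aos(std::vector<ParticleAoS>& p, float dt) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i].x += p[i].vx * dt;  // stride of sizeof(ParticleAoS) bytes
    }
}

void advance_x_soa(ParticlesSoA& p, float dt) {
    assert(p.x.size() == p.vx.size());
    float* x = p.x.data();
    const float* vx = p.vx.data();
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        x[i] += vx[i] * dt;  // unit stride over two contiguous arrays
    }
}

// Converts an AoS container to the SoA layout, field by field.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.vx.push_back(p.vx);
        soa.vy.push_back(p.vy);
        soa.vz.push_back(p.vz);
    }
    return soa;
}
```

Both functions compute the same result. Only the memory access pattern differs, which is why a change of this kind is a fairly low-risk adjustment when the hot loop touches only a few fields of a larger record.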
Tuning notes in this section cover heavyweight tuning activities that may take days, weeks, or months. Examples here tend to focus on re-architecting the application using cutting-edge technologies to gain serious performance improvements.
- Re-architecting the application
- Explicit threading – take advantage of multi-core systems with OS specific threading
- Worker Thread Pool with I/O completion port in Windows
- Windows Thread parallel Library
- Threading Building Blocks (TBB)
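The worker thread pool idea listed above can be sketched in portable standard C++. Note that this is a swapped-in technique for illustration: it uses `std::thread`, a mutex, and a condition variable rather than Windows I/O completion ports or TBB, and the class name `WorkerPool` and its interface are hypothetical. A fixed set of threads is created once and reused, so the cost of thread creation is not paid per task.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of workers pulls tasks from a
// shared queue. The destructor drains the queue and then joins the workers.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t n_workers) {
        assert(n_workers > 0);
        for (std::size_t i = 0; i < n_workers; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& w : workers_) w.join();
    }

    WorkerPool(const WorkerPool&) = delete;
    WorkerPool& operator=(const WorkerPool&) = delete;

    // Queues a task and wakes one idle worker.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last: started after the rest
};
```

A caller would construct the pool once, for example `WorkerPool pool(4);`, and hand it independent units of work with `pool.submit(...)`. A single shared queue is the simplest design, but it can itself become a point of lock contention under heavy load, which is one reason to consider libraries such as TBB.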