The Intel® C++ Compiler has long supported Profile-guided Optimization (PGO). PGO improves application performance by reorganizing code layout to reduce instruction-cache problems, shrinking code size, and reducing branch mispredictions. PGO provides information to the compiler about the areas of an application that are most frequently executed. Knowing these areas, the compiler can be more selective and specific in optimizing the application.
Fluid animate is one of a class of algorithms for calculating fluid flow. Specifically, it utilizes the Smoothed-Particle Hydrodynamics model. In this model, the fluid is represented as a gridless collection of particles that will move based on forces applied to the sample.
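To make the particle model concrete, here is a minimal sketch of a single time step. The `Particle` struct and the `advance` function are hypothetical names chosen for illustration; they are not taken from the FluidAnimate sample, and the integration scheme (a simple Euler-style update) is an assumption rather than the sample's actual method.

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle type for illustration; not the sample's data layout.
struct Particle {
    float x, y, z;     // position
    float vx, vy, vz;  // velocity
    float ax, ay, az;  // acceleration produced by the forces on this particle
};

// Advance every particle by one time step dt:
// velocity is updated from acceleration, then position from the new velocity.
void advance(std::vector<Particle>& particles, float dt) {
    assert(dt > 0.0f);
    for (Particle& p : particles) {
        p.vx += p.ax * dt;
        p.vy += p.ay * dt;
        p.vz += p.az * dt;
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}
```

There is no grid anywhere in this sketch: the whole fluid state lives in the particle list, which is what "gridless" means here.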
In this article, I will walk through applying PGO to the Fluid animate code and compare the code's performance with and without it. The code is taken from the Intel® C++ Compiler Code Samples; the full sample is available at http://software.intel.com/en-us/code-samples/intel-c-compiler/application-domains/cfd/fluid-animate. Note that I will only demonstrate the serial and vectorized versions of this sample, since PGO does not help much for the program's parallel version. This is common for parallel code, especially when the code is linked with third-party runtime libraries or does raw data processing in small calculation loop kernels.
1) After downloading and unpacking FluidAnimate.zip, double-click FluidAnimate.sln to open it in Visual Studio, then change the project's properties for PGO:
1. Select Intel Composer XE as the default compiler. First set “Profile-Guided Build Options” (under Configuration Properties -> General) to 'Disabled'. This build is used to run the application as a performance baseline.
Clean the project -> Build again -> Ctrl+F5 to see the output and the runtime of this application.
2. Change the value of “Profile-Guided Build Options” (under Configuration Properties -> General) to 'Phase 1: Instrument for Optimization (Qprof-gen)'. This compiles the application with instrumentation; the runtime information it collects is fed to the final optimization phase.
Clean the project -> Build again -> Ctrl+F5 to see the output and runtime of this application. This time the application takes longer to execute, because the instrumented code collects profile information as it runs.
3. Change the value of “Profile-Guided Build Options” (under Configuration Properties -> General) to 'Phase 3: Optimize with Profile Data (Qprof-use)'. This generates the final PGO-optimized application.
(No cleaning this time!) Simply Build -> Ctrl+F5 to see the output and the runtime of this application. This time the application takes less time to execute, since PGO has been applied.
As the two pictures above show, which compare the code with and without PGO, this optimization yields a performance gain of about 5%.
2) You can also add PERF_NUM to 'Preprocessor Definitions' (under Configuration Properties -> C/C++) to make the application run 5 times. This amortizes cache-related effects, giving a more representative baseline time and more thorough instrumentation.
For simplicity, I only attach the runtimes of the three phases after setting PERF_NUM, without further explanation.
1. Baseline runtime
2. Instrumented code after enabling the /Qprof-gen option
3. After Profile-guided Optimization, a performance gain of nearly 5% over the baseline code
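The effect of PERF_NUM described above can be sketched as a repeat-and-average timing loop. This is only an illustration under assumptions: how the sample actually uses PERF_NUM internally is not shown in this article, and `kRepeats` and `average_ms` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Assumption: with PERF_NUM defined the workload runs 5 times, otherwise once.
#ifdef PERF_NUM
constexpr int kRepeats = 5;
#else
constexpr int kRepeats = 1;
#endif

// Run `work` `repeats` times and return the average wall-clock time
// per run in milliseconds.
double average_ms(const std::function<void()>& work, int repeats = kRepeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

Averaging over several runs makes the first, cache-cold run weigh less in the reported time. In the instrumented build, the extra runs also give the profile more executions to record.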
3) Alternatively, you can enable PGO by building from the command line:
1. Compile the application with /Qprof-gen specified. This creates an instrumented executable.
2. Run the application using a reduced-size dataset that is representative of the actual workload. Each run creates a .dyn file with profile information, which the compiler merges into a .dpi file in the next step.
3. Compile the application with /Qprof-use specified. This will create an optimized executable.
4) The code was tested on the platform below.
|Modified Speedup|Compiler (Intel® 64)|Compiler options|System specifications|
|(code size 0.9x)| |/O3 /QxAVX /Qipo|Windows 7 Enterprise (X64)|
| | |/O3 /QxAVX /Qipo|Windows 7 Enterprise (X64)|
5) Performance analysis and conclusion.
PGO works best for code with many frequently executed branches that are difficult to predict at compile time. This also applies to the Computational Fluid Dynamics code here, which contains a number of 'if' branches, many of them located in the loop kernel.
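The kind of branch in question can be illustrated with a small neighbor-search kernel. This is a hypothetical sketch, not code from the sample: `Vec3`, `count_neighbors`, and the smoothing radius `h` are names assumed for illustration. Whether the `if` inside the inner loop is taken depends entirely on the particle positions at run time, so the compiler cannot know its frequency without a profile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical position type for illustration.
struct Vec3 {
    float x, y, z;
};

// For each particle, count the other particles closer than the radius h.
std::vector<int> count_neighbors(const std::vector<Vec3>& pos, float h) {
    assert(h > 0.0f);
    const float h2 = h * h;
    std::vector<int> counts(pos.size(), 0);
    for (std::size_t i = 0; i < pos.size(); ++i) {
        for (std::size_t j = i + 1; j < pos.size(); ++j) {
            const float dx = pos[i].x - pos[j].x;
            const float dy = pos[i].y - pos[j].y;
            const float dz = pos[i].z - pos[j].z;
            const float d2 = dx * dx + dy * dy + dz * dz;
            // Data-dependent branch inside the loop kernel: how often it is
            // taken is only known at run time, which is what a profile records.
            if (d2 < h2) {
                ++counts[i];
                ++counts[j];
            }
        }
    }
    return counts;
}
```

With profile data from a representative run, the compiler knows whether this branch is mostly taken or mostly skipped and can lay out the hot path accordingly.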
A 5% performance gain is acceptable for this application, since Computational Fluid Dynamics is compute-bound by its very nature. In addition, the runtime is only around 3000 ms, which is too short for PGO's effect to become apparent: after changing framenum from 200 to 2000, the executable gets an 8% performance gain from PGO.
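For reference, a gain figure like the ones quoted above can be computed from two measured runtimes as follows. This sketch assumes the gain is expressed as the reduction in runtime relative to the baseline; `percent_gain` is a hypothetical helper name, and the example values in the note below are illustrative, not measurements.

```cpp
#include <cassert>

// Percentage by which the PGO build's runtime is lower than the baseline's.
double percent_gain(double baseline_ms, double pgo_ms) {
    assert(baseline_ms > 0.0 && pgo_ms > 0.0);
    return (baseline_ms - pgo_ms) / baseline_ms * 100.0;
}
```

For example, with a baseline of about 3000 ms, a PGO build that ran in 2850 ms would correspond to a gain of roughly 5%.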
For details on PGO usage, please refer to the User and Reference Guide for the Intel® C++ Compiler 14.0. For PGO build instructions for the Computational Fluid Dynamics application on Linux and Mac OS X, please wait for upcoming Intel® C++ Compiler releases.