Develop a methodology for the analysis phase of the development cycle. Typically, the analysis stage for a threaded application involves profiling a serial application to determine regions of the application that are potential candidates for parallelization.
Prepare a baseline measurement of the performance of the serial application and determine the regions of potential parallelism in the application. Use a representative workload that exercises most of the code path being analyzed to gather performance data on the serial application.
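A baseline measurement can be as simple as timing the serial code on the chosen workload several times and recording a robust statistic. The following is a minimal sketch, not a prescribed procedure: `serial_kernel`, `measure_baseline`, the workload size, and the repeat count are hypothetical names and values chosen for illustration.

```python
import statistics
import time


def serial_kernel(n):
    """Stand-in for the serial region under study (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure_baseline(func, workload, repeats=5):
    """Run func(workload) several times and summarize wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(workload)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    baseline = measure_baseline(serial_kernel, 200_000)
    print(f"serial baseline: median {baseline['median']:.4f} s "
          f"(min {baseline['min']:.4f}, max {baseline['max']:.4f})")
```

Keeping the recorded baseline alongside the workload makes it possible to compare later threaded runs against the same input.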
Select workloads that are as small as possible while remaining representative, to keep the memory footprint and the application runtime low. The primary tool used in this phase is the VTune™ Performance Analyzer. The following figure shows the analysis stage captured as a flowchart:
Once one or more workloads have been selected, run the application on each workload and collect sampling and call graph statistics using VTune Performance Analyzer. Analyze the critical paths in the call graph and select the most time-consuming (or most parallelizable) path. Then examine the call sequence along the selected path and identify the most appropriate function (node) in which to make the threading calls.
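The path selection described above can be sketched as a walk down the call graph that always follows the most expensive callee. This is only an illustration under stated assumptions: the function names and timings are invented, the data is not VTune output, and the call graph is assumed to be acyclic (no recursion).

```python
# Hypothetical call graph: callees of each function, and inclusive (total)
# time per function in seconds. A real profile would supply these values.
CALLEES = {
    "main": ["load_input", "process_frames", "write_output"],
    "load_input": [],
    "process_frames": ["filter_frame", "encode_frame"],
    "filter_frame": ["convolve_row"],
    "encode_frame": [],
    "convolve_row": [],
    "write_output": [],
}
INCLUSIVE_TIME = {
    "main": 10.0,
    "load_input": 0.8,
    "process_frames": 8.5,
    "filter_frame": 6.0,
    "encode_frame": 2.3,
    "convolve_row": 5.6,
    "write_output": 0.7,
}


def critical_path(root, callees, inclusive_time):
    """Follow the most expensive callee from root down to a leaf."""
    path = [root]
    node = root
    while callees[node]:
        node = max(callees[node], key=lambda name: inclusive_time[name])
        path.append(node)
    return path


if __name__ == "__main__":
    # Print each node on the path with its share of total run time.
    for name in critical_path("main", CALLEES, INCLUSIVE_TIME):
        share = INCLUSIVE_TIME[name] / INCLUSIVE_TIME["main"]
        print(f"{name:15s} {INCLUSIVE_TIME[name]:5.1f} s  {share:6.1%}")
```

The path only lists candidates. Choosing the node at which to thread still takes judgment: in this invented example, a coarse-grained function such as `process_frames` may be a better place for threading calls than a fine-grained leaf such as `convolve_row`.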
The call-graph utility in VTune Performance Analyzer is recommended for determining the region of interest because sampling information is not suitable for all types of applications. Sampling data sometimes yields a flat profile, in which case picking the right level for threading may not be possible. It may also point to the functions that are called most often, which may not be at the right level in the code path for threading. To get the right perspective, use call graph data, even though instrumentation perturbs the execution time.
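The difference between a flat sampling profile and a call graph view can be illustrated by comparing self time (time spent in a function's own code) with inclusive time (self time plus everything it calls). The sketch below uses invented function names and numbers, and assumes a tree-shaped, non-recursive call graph in which each function has a single caller.

```python
# Hypothetical self times in seconds, invented to produce a flat profile.
SELF_TIME = {
    "main": 0.1,
    "simulate": 0.2,
    "update_cell": 0.3,
    "compute_flux": 1.1,
    "apply_boundary": 1.0,
    "interpolate": 0.9,
    "report": 0.4,
}
CALLEES = {
    "main": ["simulate", "report"],
    "simulate": ["update_cell"],
    "update_cell": ["compute_flux", "apply_boundary", "interpolate"],
    "compute_flux": [],
    "apply_boundary": [],
    "interpolate": [],
    "report": [],
}


def inclusive_time(name, self_time, callees):
    """Self time plus the inclusive time of every callee (tree-shaped graph only)."""
    return self_time[name] + sum(
        inclusive_time(child, self_time, callees) for child in callees[name]
    )


def rank(times):
    """Function names ordered from most to least time."""
    return sorted(times, key=times.get, reverse=True)


if __name__ == "__main__":
    totals = {name: inclusive_time(name, SELF_TIME, CALLEES) for name in SELF_TIME}
    print("by self time:     ", rank(SELF_TIME)[:3])
    print("by inclusive time:", rank(totals)[:3])
```

In this made-up data, the self-time ranking is led by three leaf functions of nearly equal cost, which gives little guidance on where to thread. The inclusive-time ranking surfaces `simulate` and `update_cell`, the callers that contain most of the work.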