Aggregation reduces the amount of data by combining events into thread groups and into function groups.
This topic discusses both aggregation types: aggregation into thread groups and aggregation into function groups.
A striking example of the benefit of thread groups is a parallel code that runs on a cluster of SMP systems. In fact, this scenario inspired the concept. To analyze the behavior of such an application, you check whether the achieved data transfer rate is plausible compared with the expected rate (which may be only a fraction of the advertised rate). The effective and expected data transfer rates differ for messages that travel inside an SMP node (intra-node) and between two SMP nodes (inter-node).
In the Intel® Trace Analyzer, selecting Aggregation into the predefined process group is enough to make the distinction between intra-node and inter-node messages very easy: in the Message Profile, the values for the intra-node messages appear on the diagonal of the matrix.
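The diagonal property described above can be illustrated with a minimal sketch. It is not the Intel Trace Analyzer API. The rank-to-node mapping, the message records, and the helper names `aggregate_by_node` and `split_intra_inter` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical mapping of rank -> SMP node; not taken from a real trace.
RANK_TO_NODE = {0: "node0", 1: "node0", 2: "node1", 3: "node1"}

# Hypothetical message records: (sender rank, receiver rank, bytes).
MESSAGES = [
    (0, 1, 1000),
    (1, 0, 500),
    (0, 2, 200),
    (3, 2, 800),
    (2, 1, 100),
]


def aggregate_by_node(messages, rank_to_node):
    """Sum message volume per (sender node, receiver node) pair."""
    matrix = defaultdict(int)
    for sender, receiver, size in messages:
        matrix[(rank_to_node[sender], rank_to_node[receiver])] += size
    return dict(matrix)


def split_intra_inter(matrix):
    """Diagonal cells are intra-node traffic; all other cells are inter-node."""
    intra = sum(v for (src, dst), v in matrix.items() if src == dst)
    inter = sum(v for (src, dst), v in matrix.items() if src != dst)
    return intra, inter


node_matrix = aggregate_by_node(MESSAGES, RANK_TO_NODE)
intra_bytes, inter_bytes = split_intra_inter(node_matrix)
print(node_matrix)
print("intra-node:", intra_bytes, "inter-node:", inter_bytes)
```

Once the matrix is keyed by node instead of by rank, every cell whose sender and receiver node match lies on the diagonal, so intra-node and inter-node volumes can be compared separately against their expected rates.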
Selecting a process group generally results in displaying the information for the group children (with the notable exception of the function profile). That is the reason why you cannot select single, unthreaded processes or single threads for aggregation.
The hardware hierarchy can be quite complicated: threads on the same core (due to Hyper-Threading), threads on different cores of the same CPU, threads on the same FSB in different CPUs, threads in the same SMP box on different FSBs, threads in different boxes connected by a faster interconnect, threads in SMP boxes connected by a slower interconnect, and so on. Deeply nested thread groups make it possible to model such hierarchies.
If you select the thread group representing a single node to concentrate on intra-node effects, the analysis becomes slower than with the flat thread group that contains all threads. There are two reasons. First, Intel Trace Analyzer does not have to do any aggregation for that thread group because it is flat (assuming no threads are used). Second, although only a single SMP node is chosen, all other threads still go through the analysis and are placed in the artificially created thread group Other. Click Advanced > Show Process Group 'Other' to make this group visible. To speed things up, choose a filter that only lets the threads of the selected SMP node pass.
Filtering and Aggregation are orthogonal mechanisms in the Intel Trace Analyzer.
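The following minimal sketch shows why the two mechanisms are independent. It is a toy model under stated assumptions, not the tool's implementation. The event records, the thread-to-node mapping, and the functions `aggregate` and `filter_node` are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: (thread id, function name, seconds).
EVENTS = [
    (0, "MPI_Send", 0.25),
    (1, "compute", 1.0),
    (2, "MPI_Recv", 0.5),
    (3, "compute", 2.0),
]

# Hypothetical thread -> node mapping.
THREAD_TO_NODE = {0: "node0", 1: "node0", 2: "node1", 3: "node1"}


def aggregate(events, selected_node):
    """Sum time per child of the selected node group.

    Threads outside the selected group are still processed and are
    collected in an artificial 'Other' bucket.
    """
    totals = defaultdict(float)
    for thread, _function, seconds in events:
        if THREAD_TO_NODE[thread] == selected_node:
            totals[f"T{thread}"] += seconds
        else:
            totals["Other"] += seconds
    return dict(totals)


def filter_node(events, node):
    """Keep only events from threads that live on the given node."""
    return [e for e in events if THREAD_TO_NODE[e[0]] == node]


unfiltered = aggregate(EVENTS, "node0")
filtered = aggregate(filter_node(EVENTS, "node0"), "node0")
print(unfiltered)
print(filtered)
```

In this model, aggregation alone only decides how events are bucketed, so every event is still touched. Applying the filter first reduces the number of events that reach the aggregation step, and the 'Other' bucket stays empty.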
Aggregation into function groups lets you decide at what level of detail to look at the activity of threads or thread groups. In many cases it is enough to see that a code spends a certain percentage of its time in MPI without knowing in which particular function. In some cases optimizing the serial parts of the program might seem more rewarding than optimizing the communication structure.
However, if the fraction spent in MPI exceeds the expectation, then it is interesting to know in which particular MPI call the time was spent. Function grouping allows exactly this shift in perspective by ungrouping the function group MPI.
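The shift in perspective from a coarse view to a per-call view can be sketched as follows. This is an illustrative model only. The group layout, the timing values, and the `profile` function are hypothetical and do not come from a real trace or from the Intel Trace Analyzer interface.

```python
# Hypothetical function groups: group name -> member functions.
FUNCTION_GROUPS = {
    "MPI": ["MPI_Send", "MPI_Recv", "MPI_Barrier"],
    "Application": ["solve", "assemble"],
}

# Hypothetical time spent per function, in seconds.
FUNCTION_TIMES = {
    "MPI_Send": 1.0,
    "MPI_Recv": 2.5,
    "MPI_Barrier": 0.5,
    "solve": 5.0,
    "assemble": 1.0,
}


def profile(groups, times, ungrouped=()):
    """Return time per group, expanding any group listed in `ungrouped`."""
    result = {}
    for group, members in groups.items():
        if group in ungrouped:
            # Show each member function on its own line.
            for function in members:
                result[function] = times[function]
        else:
            # Collapse the whole group into a single entry.
            result[group] = sum(times[function] for function in members)
    return result


coarse = profile(FUNCTION_GROUPS, FUNCTION_TIMES)
detailed = profile(FUNCTION_GROUPS, FUNCTION_TIMES, ungrouped={"MPI"})
print(coarse)
print(detailed)
```

The coarse view answers how much time goes into MPI overall. The detailed view ungroups only MPI while the application code stays aggregated, which mirrors the workflow described above.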
While the argument given above for having nested thread groups may not be that compelling, the reason for having nested function groups becomes clear as soon as nested modules, classes, and/or namespaces occur.
Provided that there are adequate function groups, it is also much easier to categorize code by library or by author. In this way, it is possible to concentrate precisely on the code that is considered tunable while code that is controlled by third parties is aggregated into coarse categories.
Selecting a function group generally results in displaying the information for the group's children. That is the reason why single functions cannot be selected for aggregation.
In timelines, certain events may not be visible at all times. This does not necessarily mean that they are absent. The finer the grouping, the less time is spent in each individual function or group, so short events may not show up at the current zoom level. Likewise, when you aggregate over processes in a time interval where some processes are idle, nothing is displayed because of the idle state of those processes. Zooming in helps to see these events better. For a better overview, check one of the corresponding profiles. If the timeline and the profile seem to contradict each other, the information from the profile is more precise.