by Michael Hebenstreit
The power of the modern HPC cluster is more often than not brought to the end user through software like StarCD* or Fluent*, two typical applications in computational fluid dynamics (CFD). As long as everything works correctly – or at least as expected – neither the engineer setting up the computations nor the administrator keeping the cluster running smoothly actually knows what goes on inside the binary package doing all the calculations.
Unfortunately, things tend to go wrong. At that point, the question "What is going on?" is fundamental to answering "How can I fix this problem?". If you cannot answer the former, everything you do is akin to feeling around in the dark. Even if you manage to solve the issue, you might still wonder exactly why your fix worked, and whether it is only a temporary solution. This article discusses the Intel Cluster Tools – specifically Intel Trace Collector, Intel Trace Analyzer, and the Intel MPI Benchmarks – which can shed some light into this binary darkness.
At a customer visit, the author was faced with a Hewlett-Packard (HP)* cluster of Intel Itanium 2 1.5 GHz nodes running StarCD* over a Gigabit Ethernet switch. The customer was very dissatisfied with the performance of the system, pointing out that they could neither run a simple 32-thread job with high efficiency nor reproduce the benchmark results from the time of purchase. During the visit the author quickly verified that the main problem was the CPU utilization of each node, which showed only around 60% instead of the usual 95–100%.
The customer and the author quickly agreed on a simple benchmark with four million cells, containing no intellectual property. Fortunately this benchmark showed the same behavior on-site, demonstrating that the problem lay in the system and not in the software itself. Both the author and Hewlett-Packard were able to verify that the problem did not show up on similar Intel Itanium systems installed at Intel or HP sites.
The author assumed this behavior was caused by a problem in the network hardware, as it was independent of the MPI implementation used. The next step had to be an investigation of StarCD's communication structure; Intel Trace Collector (ITC) and Intel Trace Analyzer are the tools of choice here. Luckily, an ITC-enabled version of StarCD was available at Intel. As the behavior of StarCD depends only on the dataset and the program version, one can run the test on any hardware available: although the timing will differ with the hardware, the communication structure will not. The investigation came up with three major findings:
1. StarCD uses many communication calls – as shown in this picture. Even within a timeframe of 0.1 s, the threads are exchanging data.
2. The main MPI function used is the alltoall call (MPI_Alltoall).
3. The load balancing on some 13 nodes is not perfect, but works well enough that the cluster should run with at least 90% efficiency.
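The second finding matters because alltoall traffic grows quadratically with the number of ranks: each of P processes sends a separate message to each of the other P−1. The following back-of-the-envelope sketch makes this concrete; the rank counts are illustrative assumptions, and the 35 KB message size is borrowed from the benchmark discussion below, not a measurement from this cluster:

```python
def alltoall_traffic(ranks: int, msg_bytes: int) -> int:
    """Bytes crossing the network for one alltoall call, assuming
    every rank sends msg_bytes to each of the other ranks."""
    return ranks * (ranks - 1) * msg_bytes

# Illustrative only: how per-call traffic grows with the job size.
for ranks in (8, 16, 32):
    mb = alltoall_traffic(ranks, 35 * 1024) / 1e6
    print(f"{ranks:2d} ranks: {mb:5.1f} MB per alltoall call")
```

Doubling the job size roughly quadruples the data crossing the switch per call, which is why the network layer dominates for this workload.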
These findings make it easy to test the MPI behavior on the cluster without relying on StarCD and introducing additional dependencies. The tools of choice are the Intel MPI Benchmarks, which produce automated and reliable measurements. HP measured the optimal timing on their own cluster and came up with a delay time of ~3µs using alltoall at a message size of 35 KB.
The corresponding measurements at the customer location gave erratic results, showing only about 50% of the achievable performance (base result in the picture).
A change in the configuration of the switch brought no real improvement (result "corrected switch").
The breakthrough came only after looking at the cabling of the cluster. The 63 nodes were all connected to a switch containing 5 inserts. A user would typically reserve 8 or 16 nodes (each with 2 CPUs) for a job – for instance nodes 1 to 16. As one can see from the layout, this configuration forces the load onto only one or two inserts – thereby overloading the switch.
A slight change in the cabling layout, proposed by the author, takes care of this problem and dramatically improves the response time within the cluster (result "striped switch" in the picture here).
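The effect of the re-cabling can be illustrated with a short sketch. The node and insert counts are taken from the text; the layout functions are simplified illustrations of sequential versus striped cabling, not the actual port map of the switch:

```python
def sequential_layout(num_nodes: int, ports_per_insert: int) -> dict:
    """Original cabling: consecutive nodes fill one insert after another."""
    return {n: n // ports_per_insert for n in range(num_nodes)}

def striped_layout(num_nodes: int, num_inserts: int) -> dict:
    """Proposed cabling: consecutive nodes spread round-robin over the inserts."""
    return {n: n % num_inserts for n in range(num_nodes)}

def inserts_loaded(layout: dict, job_nodes) -> int:
    """Number of distinct inserts a job on the given nodes keeps busy."""
    return len({layout[n] for n in job_nodes})

NODES, INSERTS = 63, 5
PORTS = -(-NODES // INSERTS)   # 13 ports per insert, rounded up
job = range(16)                # a typical reservation of 16 nodes

print(inserts_loaded(sequential_layout(NODES, PORTS), job))   # 2: load piles onto two inserts
print(inserts_loaded(striped_layout(NODES, INSERTS), job))    # 5: load spread over all inserts
```

With sequential cabling a 16-node job touches only two inserts; striped cabling spreads the same job over all five, so no single insert becomes the bottleneck.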
An even more detailed analysis of the message structure allows the user to find which message sizes cause the most delay in the cluster. This type of chart is not implemented in Intel Trace Analyzer at present, but the information can be extracted directly from the tracing files. After the run, first convert the data to ASCII using the xstftool supplied with Intel Trace Collector:
xstftool FILE.stf --dump
This will give you an output similar to:
1998984 EXCHEXT CPU 0:1 DOWNTO "MPI:MPI_Bcast"
1998984 GLOBALOP MPI_Bcast ON 0:1 COM COMM_WORLD(2) ROOT 1 SENT 4 RECVD 0 DURATION 499
1999149 EXCHEXT CPU 0:3 UPTO "Application:User_Code"
The first entry of each line is a timestamp, and the rest is easy to understand. In this case the application entered MPI_Bcast (line 1); the broadcast ran on group 0, thread 1 with root 1, sent 4 bytes, and lasted 499 ticks (line 2); then control returned to the application (line 3).
One has to write a program (or script) to parse the output, as the data of interest differ from application to application.
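A minimal sketch of such a parser is shown below. It is written against the sample dump format above; the SENT and DURATION fields are taken from that sample, and a real trace may contain record variants that need extra handling. The hypothetical helper `time_by_message_size` sums the DURATION of one collective operation per message size (bytes sent):

```python
import re
from collections import defaultdict

# Matches GLOBALOP records in 'xstftool FILE.stf --dump' output, e.g.:
# 1998984 GLOBALOP MPI_Bcast ON 0:1 COM COMM_WORLD(2) ROOT 1 SENT 4 RECVD 0 DURATION 499
GLOBALOP = re.compile(
    r"^\s*(?P<ts>\d+)\s+GLOBALOP\s+(?P<op>\S+)\s+ON\s+\S+"
    r".*?\bSENT\s+(?P<sent>\d+)\b.*?\bDURATION\s+(?P<dur>\d+)"
)

def time_by_message_size(lines, op="MPI_Alltoall"):
    """Sum DURATION ticks of the given collective, keyed by bytes sent."""
    totals = defaultdict(int)
    for line in lines:
        m = GLOBALOP.match(line)
        if m and m.group("op") == op:
            totals[int(m.group("sent"))] += int(m.group("dur"))
    return dict(totals)

# Usage sketch: feed it the converted dump file.
# with open("dump.txt") as f:
#     print(time_by_message_size(f, op="MPI_Alltoall"))
```

Grouping the resulting message sizes into buckets then yields a distribution of MPI time over message size like the one discussed next.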
The author analyzed the StarCD benchmark used above to get a distribution of the overall time spent in MPI_Alltoall calls over message size.
The chart shows that about 40% of the elapsed time is spent on messages of very small size, and another 40% on messages larger than 100 KB.
This kind of information could be used to further optimize the network and switches – although in the case shown optimizing for very large and small message sizes at the same time might be extremely difficult.
Although the Intel Cluster Tools were developed for software developers, they have evolved far beyond that. Even without access to source code, they provide insight into the basic message structure of MPI programs and can help in debugging problems.
This use case is a simple example of how system administrators and software users will find the tools helpful in their day-to-day tasks. From the user's point of view, the application – especially the load balancing of the job – worked very well. The administrator of the cluster became aware of how much communication StarCD does, and thereby how important the network layer is for performance. Faster CPUs or more memory would have little effect if the network itself is the bottleneck.
With the help of the MPI benchmarks, the administrator can model the behavior of the application far faster than by running the original StarCD case. This matters when one has to vary parameters to find the optimal solution – as when configuring a network at the lowest level.
None of these data points contain any specific intellectual property (IP), and they can therefore be exchanged with third parties. This becomes more important in an environment where many functions, like IT administration, have been outsourced.
Usage and administration of modern HPC systems is becoming a commodity, yet every system is unique and causes its own problems. Tools to debug these issues are in demand, and one solution to your organization's specific problem might be the Intel Cluster Tools.