Heterogeneous Support in the Intel Distribution for LINPACK Benchmark
Intel Distribution for LINPACK Benchmark achieves heterogeneous support by distributing the matrix data unequally between the nodes. The heterogeneous factor command-line parameter -f controls the amount of work to be assigned to the more powerful nodes, while the command-line parameter -c controls the number of process columns for the faster nodes:
./xhpl -n <problem size> -b <block size> -p <grid row dimn> -q <grid column dimn> -f <heterogeneous factor> -c <number of faster processor columns>
If the heterogeneous factor is 2.5, roughly 2.5 times the work will be put on the more powerful nodes. The more work you put on the more powerful nodes, the more memory you might waste on the other nodes if all nodes have an equal amount of memory. If your cluster includes many different types of nodes, you may need multiple heterogeneous factors.
Let P be the number of rows and Q the number of columns in your processor grid (PxQ). The work must be homogeneous within each processor column because vertical operations, such as pivoting or panel factorization, are synchronizing operations. When there are two different types of nodes, use MPI to process all the faster nodes first and make sure the "PMAP process mapping" (line 9) of HPL.dat is set to 1 for Column-major mapping. Because all the nodes must be the same within a process column, the number of faster nodes must always be a multiple of P, and you can specify the faster nodes by setting the number of process columns C for the faster nodes with the -c command-line parameter. The -f 1.0 -c 0 setting corresponds to the default homogeneous behavior.
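The grid constraint above can be checked before launching a run. The following is a minimal sketch, not part of the benchmark: the helper names are hypothetical, and `column_of_rank` assumes the usual column-major convention in which consecutive MPI ranks fill one process column of P rows before moving to the next.

```python
def faster_process_columns(num_faster_nodes: int, p: int, q: int) -> int:
    """Return C, the number of process columns made up of faster nodes.

    Hypothetical helper: enforces that the faster nodes fill whole columns.
    """
    if num_faster_nodes % p != 0:
        raise ValueError("number of faster nodes must be a multiple of P")
    c = num_faster_nodes // p
    if c > q:
        raise ValueError("more faster columns than columns in the grid")
    return c


def column_of_rank(rank: int, p: int) -> int:
    """Process column of an MPI rank, assuming column-major mapping."""
    return rank // p


p, q = 10, 10
c = faster_process_columns(20, p, q)
# If the faster nodes are launched first, ranks 0..P*C-1 land in columns 0..C-1.
fast_ranks = [r for r in range(p * q) if column_of_rank(r, p) < c]
print(c, len(fast_ranks))
```

With 20 faster nodes and P=10, the sketch yields C=2, and the first 20 ranks occupy the two faster columns.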
To understand how to choose the problem size N for a heterogeneous run, first consider a homogeneous system, where you might choose N as follows:
N ~= sqrt(Memory Utilization * P * Q * Memory Size in Bytes / 8)
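As a rough illustration, the homogeneous estimate can be evaluated directly. This sketch assumes one MPI process per node and interprets a "64 GB" node as 64 * 2**30 bytes; the function name and the default utilization of 0.8 are illustrative choices, not benchmark settings.

```python
import math


def homogeneous_problem_size(p: int, q: int, mem_bytes_per_node: float,
                             utilization: float = 0.8) -> int:
    """Estimate N from N ~= sqrt(utilization * P * Q * memory_bytes / 8).

    The divisor 8 is the size in bytes of one double-precision matrix element.
    """
    return int(math.sqrt(utilization * p * q * mem_bytes_per_node / 8))


# Assumed illustrative cluster: 10x10 grid, 64 GiB per node.
n = homogeneous_problem_size(p=10, q=10, mem_bytes_per_node=64 * 2**30)
print(n)
```

In practice the result is usually rounded down to a convenient value, which leaves some headroom in memory.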
Memory Utilization is usually around 0.8 for homogeneous Intel Xeon processor systems. On a heterogeneous system, you may apply a different formula for N for each set of nodes that are the same and then choose the minimum N over all sets. Suppose you have a cluster with only one heterogeneous factor F and the number of processor columns (out of the total Q) in the group with that heterogeneous factor equal to C. That group contains P*C nodes. First compute the sum of the parts: S = F*P*C + P*(Q-C). Note that on a homogeneous system S = P*Q, F = 1, and C = Q. Take N as
N ~= sqrt(Memory Utilization * P * Q * ((F*P*C)/S) * Memory Size in Bytes / 8)
or simply scale down the value of N for the homogeneous system by sqrt(F*P*C/S).
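The scaling above can be sketched as a small function that evaluates the formula exactly as written. The function name, the 64 * 2**30 byte interpretation of a 64 GB node, and the parameter values are assumptions for illustration. Because the expression as written evaluates to zero when C is 0, this sketch requires at least one faster column.

```python
import math


def heterogeneous_problem_size(p: int, q: int, c: int, f: float,
                               mem_bytes_per_node: float,
                               utilization: float = 0.8) -> int:
    """Estimate N for one heterogeneous factor F applied to C of Q columns.

    Evaluates N ~= sqrt(utilization * P * Q * (F*P*C / S) * memory_bytes / 8)
    with S = F*P*C + P*(Q-C).
    """
    if not 1 <= c <= q:
        raise ValueError("C must be between 1 and Q for this formula")
    s = f * p * c + p * (q - c)      # sum of the parts
    scale = (f * p * c) / s          # equals 1.0 when F = 1 and C = Q
    return int(math.sqrt(utilization * p * q * scale * mem_bytes_per_node / 8))


# Assumed illustrative values: 10x10 grid, 2 faster columns, factor 2.7.
n_het = heterogeneous_problem_size(p=10, q=10, c=2, f=2.7,
                                   mem_bytes_per_node=64 * 2**30)
print(n_het)
```

Setting f=1.0 and c equal to q reproduces the homogeneous estimate, which is a quick sanity check on the inputs.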
Example
Suppose the cluster has 100 nodes each having 64 GB of memory, and 20 of the nodes are 2.7 times as powerful as the other 80. Run one MPI process per node for a total of 100 MPI processes. Assume a square processor grid P=Q=10, which conveniently divides up the faster nodes evenly. Normally, the HPL documentation recommends choosing a matrix size that consumes 80 percent of available memory. If N is the size of the matrix, the matrix consumes 8N^2/(P*Q) bytes on each node. So a homogeneous run might look like:
./xhpl -n 820000 -b 256 -p 10 -q 10
If you redistribute the matrix and run the heterogeneous Intel Distribution for LINPACK Benchmark, you can take advantage of the faster nodes. But because some of the nodes will contain 2.7 times as much data as the other nodes, you must shrink the problem size (unless the faster nodes also happen to have 2.7 times as much memory). Instead of 0.8*64GB*100 total memory size, we have only 0.8*64GB*20 + 0.8*64GB/2.7*80 total memory size, which is less than half the original space. So the problem size in this case would be 526000. Because P=10 and there are 20 faster nodes, two processor columns are faster. If you arrange MPI so that these nodes come first in the process ordering, the command line looks like:
./xhpl -n 526000 -b 1024 -p 10 -q 10 -f 2.7 -c 2
The -m parameter may be misleading for heterogeneous calculations because it calculates the problem size assuming all the nodes have the same amount of data.
Warning
The number of faster nodes must be C*P. If the number of faster nodes is not divisible by P, you might not be able to take advantage of the extra performance potential by giving the faster nodes extra work.
While it suffices to simply provide -f and -c command-line parameters if you need only one heterogeneous factor, you must add lines to the HPL.dat input to support multiple heterogeneous factors. For the above example (two processor columns have nodes that are 2.7 times faster), instead of passing -f and -c command-line parameters you can modify the HPL.dat input file by adding these two lines to the end:
1 number of heterogeneous factors
0 1 2.7 [start_column, stop_column, heterogeneous factor for that range]
Note
Numbering of processor columns starts at 0. The start and stopping numbers must be between 0 and Q-1 (inclusive).
If instead there are three different types of nodes in a cluster and you need at least two heterogeneous factors, change the number in the first row above from 1 to 2 and follow that line with two lines specifying the start column, stopping column, and heterogeneous factor.
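The extra HPL.dat lines can be generated and range-checked with a short script. This is a sketch under assumptions: the function name is hypothetical, the second range in the usage example (columns 2 through 4 with factor 1.5) is made up for illustration, and the check that start does not exceed stop is an added sanity check rather than a documented rule.

```python
def heterogeneous_hpl_dat_lines(ranges, q):
    """Build the lines to append to HPL.dat for heterogeneous factors.

    ranges: list of (start_column, stop_column, factor) tuples, with
    zero-based, inclusive column numbers between 0 and Q-1.
    """
    lines = [f"{len(ranges)} number of heterogeneous factors"]
    for start, stop, factor in ranges:
        if not (0 <= start <= stop <= q - 1):
            raise ValueError(f"columns {start}..{stop} outside 0..{q - 1}")
        lines.append(
            f"{start} {stop} {factor} "
            "[start_column, stop_column, heterogeneous factor for that range]"
        )
    return lines


# One factor: columns 0 and 1 are 2.7 times faster (Q = 10).
print("\n".join(heterogeneous_hpl_dat_lines([(0, 1, 2.7)], q=10)))

# Two factors for three node types; the second range is a made-up example.
print("\n".join(heterogeneous_hpl_dat_lines([(0, 1, 2.7), (2, 4, 1.5)], q=10)))
```

Generating the lines this way keeps the count on the first line in step with the number of range lines that follow it.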
When choosing parameters for heterogeneous support in HPL.dat, primarily focus on the most powerful nodes. The larger the heterogeneous factor, the more balanced the cluster may be from a performance viewpoint, but the more imbalanced from a memory viewpoint. At some point, further performance balancing might affect the memory too much. If this is the case, try to reduce any changes done for the faster nodes (such as in block sizes). Experiment with values in HPL.dat carefully because wrong values may greatly hinder performance.
When tuning on a heterogeneous cluster, do not immediately attempt a heterogeneous run, but do the following:

Break the cluster down into multiple homogeneous clusters.

Make heterogeneous adjustments for performance balancing. For instance, if you have two different sets of nodes where one is three times as powerful as the other, it must do three times the work.

Figure out the approximate size of the problem (per node) that you can run on each piece.

Do some homogeneous runs with those problem sizes per node and the final block size needed for the heterogeneous run and find the best parameters.

Use these parameters for an initial heterogeneous run.
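The per-node sizing in the steps above can be approximated as follows. This is a sketch that assumes the matrix data is split so that each faster node holds F times as much as each slower node, consistent with the sum of parts S = F*P*C + P*(Q-C). The function names are hypothetical, and the actual distribution used by the benchmark may differ in detail.

```python
import math


def per_node_bytes(n: int, p: int, q: int, c: int, f: float):
    """Approximate matrix bytes held by one faster node and one slower node."""
    s = f * p * c + p * (q - c)   # sum of the parts
    total = 8 * n * n             # 8 bytes per double-precision element
    return total * f / s, total / s


def equivalent_homogeneous_n(bytes_per_node: float, num_nodes: int) -> int:
    """N for a homogeneous run on num_nodes with the same per-node footprint."""
    return int(math.sqrt(bytes_per_node * num_nodes / 8))


# Assumed values from the 100-node example: N=526000, P=Q=10, C=2, F=2.7.
fast_bytes, slow_bytes = per_node_bytes(526000, 10, 10, 2, 2.7)
n_fast = equivalent_homogeneous_n(fast_bytes, 20)   # the 20 faster nodes alone
n_slow = equivalent_homogeneous_n(slow_bytes, 80)   # the 80 slower nodes alone
print(n_fast, n_slow)
```

Running each homogeneous piece at its own size, with the block size planned for the heterogeneous run, gives a baseline for the parameters to carry into the first heterogeneous attempt.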