Download the PDF: HPL using Intel MPI
This is a step-by-step procedure for running the High Performance Linpack (HPL) benchmark on a Linux cluster using Intel MPI. It was carried out on a 128-node Linux cluster with Intel Nehalem processors at 2.93 GHz and 12 GB of RAM per node. The operating system was Red Hat Enterprise Linux 5.3. The nodes were interconnected via InfiniBand 4x DDR using the standard RHEL 5.3 drivers.
You can also use a simple PHP web tool: enter your system specs and it will suggest optimal input parameters for your HPL file before you run the benchmark on the cluster.
Download the HPL tool from SourceForge: http://sourceforge.net/projects/hpl-calculator/
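
The kind of sizing such a calculator performs can be sketched roughly in Python. This is not the code of the PHP tool. It assumes the commonly quoted rule of thumb that the N x N double-precision matrix should fill about 80% of total memory, with N rounded down to a multiple of the block size NB. The function name suggest_n, the 80% fraction, and NB=192 are illustrative assumptions; the 128 nodes and 12 GB per node come from the cluster described above (treated here as GiB).

```python
import math

def suggest_n(nodes, mem_per_node_gib, nb, mem_fraction=0.80):
    """Largest multiple of nb such that an N x N matrix of 8-byte
    doubles fits in mem_fraction of the cluster's total memory.

    mem_fraction=0.80 is a rule-of-thumb assumption, not a value
    taken from the article or from the PHP tool.
    """
    total_bytes = nodes * mem_per_node_gib * 2**30
    # N^2 * 8 bytes <= mem_fraction * total_bytes
    n_max = math.isqrt(int(mem_fraction * total_bytes / 8))
    return (n_max // nb) * nb

# Cluster from the article; NB=192 is a hypothetical block size.
print(suggest_n(nodes=128, mem_per_node_gib=12, nb=192))
```

The result is only a starting point; NB and the memory fraction usually need to be tuned by trial runs on the actual machine.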

Comments
can download this white paper...
In my smaller-scale experiments I obtained 1.258e+03 GFlops with hpl-2.00 on 128 cores in total, that is, 87.75% of theoretical peak (9.8281 GFlops/core attained versus 11.2 GFlops/core theoretical).
Experiment: 16 nodes x 8 Nehalem cores/node at 2.8 GHz with 22 GiB usable DRAM per node; CentOS 5.4; QDR IB
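
The efficiency figure above can be reproduced with a few lines of Python. This is a sketch, assuming 4 double-precision floating-point operations per cycle per core, which is what the quoted 11.2 GFlops/core implies at 2.8 GHz. The function names are illustrative.

```python
def peak_gflops_per_core(clock_ghz, flops_per_cycle=4):
    # Theoretical peak = clock rate x floating-point ops per cycle.
    # flops_per_cycle=4 is inferred from 11.2 GFlops/core at 2.8 GHz.
    return clock_ghz * flops_per_cycle

def hpl_efficiency(measured_gflops, cores, clock_ghz, flops_per_cycle=4):
    # Fraction of the theoretical peak actually reached by the HPL run.
    peak = cores * peak_gflops_per_core(clock_ghz, flops_per_cycle)
    return measured_gflops / peak

measured = 1.258e3   # GFlops reported by HPL
cores = 128          # 16 nodes x 8 cores
print(f"per core:   {measured / cores:.4f} GFlops")
print(f"efficiency: {100 * hpl_efficiency(measured, cores, 2.8):.2f} %")
```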
HPL-2.00 : N=199864, Nb=172, PxQ=4x4
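
As a sanity check on these parameters, the following Python sketch computes how much memory the N x N matrix occupies and whether N is a whole multiple of Nb. It assumes the 22 GiB of usable DRAM is per node and that matrix elements are 8-byte doubles. The function name check_hpl_params is hypothetical.

```python
def check_hpl_params(n, nb, p, q, nodes, usable_gib_per_node):
    """Return basic consistency figures for an HPL configuration."""
    matrix_gib = 8 * n * n / 2**30            # 8-byte doubles
    total_gib = nodes * usable_gib_per_node   # assumes the figure is per node
    return {
        "matrix_gib": matrix_gib,
        "memory_fraction": matrix_gib / total_gib,
        "n_multiple_of_nb": n % nb == 0,
        "ranks": p * q,
    }

info = check_hpl_params(n=199864, nb=172, p=4, q=4,
                        nodes=16, usable_gib_per_node=22)
for key, value in info.items():
    print(key, value)
```

With these inputs the matrix takes a bit under 300 GiB of the 352 GiB available, and the 4x4 grid gives 16 ranks, one per node.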
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1
I should add that I used MKL with OMP_NUM_THREADS=8 per MPI task, so the layout was 1 MPI rank/node and 8 OMP threads/node.
I also tried 2, 4, and 8 ranks/node with correspondingly fewer OMP threads per rank, and the results were not much worse.
Intel Compilers 11.1.073 and Intel MPI 4.0.0 (028); newer versions may give better performance ... :)
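
The rank/thread layouts mentioned above can be enumerated with a short Python sketch. It lists every split of 8 cores per node into MPI ranks and OMP threads, and for each one picks a near-square P x Q grid with P <= Q, which is the usual starting point for HPL. Only the 4x4 grid for 1 rank/node is reported in the comment; the grids printed for the other layouts are hypothetical, and both function names are illustrative.

```python
import math

def layouts(cores_per_node):
    """(ranks_per_node, threads_per_rank) pairs using every core once."""
    return [(r, cores_per_node // r)
            for r in range(1, cores_per_node + 1)
            if cores_per_node % r == 0]

def squarest_grid(ranks):
    """P x Q with P * Q == ranks, P <= Q, and P as large as possible."""
    p = math.isqrt(ranks)
    while ranks % p:
        p -= 1
    return p, ranks // p

nodes = 16
for ranks_per_node, threads in layouts(8):
    p, q = squarest_grid(nodes * ranks_per_node)
    print(f"{ranks_per_node} rank(s)/node, OMP_NUM_THREADS={threads}, PxQ={p}x{q}")
```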