Use of Intel® MKL in HPCC benchmark

HPCC Application Note

Step 1 - Overview

This guide is intended to help current HPCC users get better benchmark performance by utilizing Intel® Math Kernel Library (Intel® MKL).

HPCC stands for the High Performance Computing Challenge benchmark, which is actually a suite of benchmarks that measure the performance of the CPU, memory subsystem, and interconnect. It consists of 7 benchmark tests: HPL (High Performance LINPACK), DGEMM (Double-precision GEneral Matrix-Matrix multiply), STREAM, PTRANS (Parallel TRANSpose), RandomAccess, FFT (Fast Fourier Transform), and communication bandwidth/latency.

Please find more information on HPCC at http://icl.cs.utk.edu/hpcc/*.

Version Information

This application note was created to help users who benchmark clusters using HPCC to make use of the latest versions of Intel MKL on Linux platforms running on Intel Xeon processor-based systems. Please note that previous versions of Intel MKL may require other steps to successfully compile and link with HPCC.

Step 2 - Downloading HPCC Source Code

The HPCC source code can be downloaded from: http://icl.cs.utk.edu/hpcc/software/index.html*.

Prerequisites

1.     Intel MKL contains highly optimized FFT routines as well as the wrappers for FFTW. Intel MKL can be obtained in one of the following ways:

•  Download a FREE evaluation version of the Intel MKL product.

•  Download the FREE non-commercial* version of the Intel MKL product.

Both are available from the Intel® Math Kernel Library product web page. Intel® MKL is also bundled with several Intel® software development products.


2.     Intel MPI can be obtained from Intel® Cluster Tools. An open source MPI implementation (MPICH2) can be obtained from http://www.mcs.anl.gov/research/projects/mpich2/*.

Step 3 - Configuration

Use the following commands to extract the HPCC tar file from the downloaded hpcc-x.x.x.tar.gz.

[shell]$gunzip hpcc-x.x.x.tar.gz
$tar -xvf hpcc-x.x.x.tar
[/shell]
The above will create a directory named hpcc-x.x.x.

Make sure that MPI and the C++ and Fortran compilers are installed and in your PATH. Also set LD_LIBRARY_PATH to include your compiler (C++ and Fortran), MPI, and MKL library directories.
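As a minimal sketch, assuming the Intel compilers, Intel MPI, and Intel MKL are installed under /opt/intel (paths and script names are illustrative and vary by product version), the environment can be set up by sourcing the scripts shipped with each tool:

[shell]# Illustrative paths; adjust to your installation and versions
$source /opt/intel/bin/compilervars.sh intel64
$source /opt/intel/impi/<version>/bin64/mpivars.sh
$source <your MKL installation>/bin/mklvars.sh intel64[/shell]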

Step 4 - Building HPCC

•  Build the MKL MPI FFTW library.

Change the directory to <your MKL installation>/interfaces/fftw2x_cdft.

From the fftw2x_cdft directory, run the following command:

[shell]$make libintel64 PRECISION=MKL_DOUBLE interface=ilp64[/shell]
Here we are building for the Intel64 architecture with Intel MPI (the default for this Makefile; you may use a different MPI), the Intel compilers, DOUBLE precision, and the ilp64 interface. This will create the MKL MPI FFTW interface library libfftw2x_cdft_DOUBLE_ilp64.a in the <your MKL installation>/lib/intel64 directory.

Note: Setting the interface parameter to ilp64 builds the FFTW MPI wrappers so that they accept 64-bit integer parameters in their interface, matching the calls from HPCC. These 64-bit-aware wrappers are not to be used with ordinary applications that comply with the traditional FFTW interfaces. Execute $make with no arguments to see the full set of options.

•  Build the FFTW C wrapper library.

Change the directory to <your MKL installation>/interfaces/fftw2xc.

Then build the FFTW C wrapper by running the following command:

[shell]$make libintel64 PRECISION=MKL_DOUBLE[/shell]
This will create the libfftw2xc_intel.a library in the <your MKL installation>/lib/intel64 directory.

•  Build HPCC

Change directory to hpcc-x.x.x/hpl

Create a makefile for your build, e.g. Make.intel, from an existing one. You can reuse one from the hpl/setup directory.

Edit Make.intel as follows: modify the LAdir and LAlib lines as shown below to point to the MKL libraries. Note that the FFTW MPI wrapper library built above is the ilp64 variant, libfftw2x_cdft_DOUBLE_ilp64.a.
[shell]LAdir = /opt/intel/mkl/lib/intel64
LAlib = -Wl,--start-group $(LAdir)/libfftw2x_cdft_DOUBLE_ilp64.a $(LAdir)/libfftw2xc_intel.a $(LAdir)/libmkl_intel_lp64.a $(LAdir)/libmkl_intel_thread.a $(LAdir)/libmkl_core.a $(LAdir)/libmkl_blacs_intelmpi_lp64.a $(LAdir)/libmkl_cdft_core.a -Wl,--end-group -lpthread -lm
[/shell]
Please make sure the following compiler options are used on the compile line:

[shell]-DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS[/shell]
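In Make.intel these typically go on the compiler flags line. A sketch is shown below; the compiler choice and optimization flags are illustrative, and only the defines above are required by this recipe:

[shell]CC      = mpiicc
CCFLAGS = $(HPL_DEFS) -O3 -ansi-alias -DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS
LINKER  = $(CC)[/shell]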
Build HPCC by running:

[shell]$make all arch=intel[/shell]
This will create an executable named hpcc in the hpcc-x.x.x directory, along with a file _hpccinf.txt, which is a template input file for HPCC. Rename this file to hpccinf.txt.
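For example, from the hpcc-x.x.x directory:

[shell]$mv _hpccinf.txt hpccinf.txt[/shell]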

Step 5 - Running HPCC

Modify the configuration parameters in the hpccinf.txt file.

Run hpcc by executing the following command.

[shell]$mpirun -np 4 hpcc[/shell]
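For a multi-node run with Intel MPI, a launch along the following lines can be used; the host file name and process counts here are illustrative, and the total number of MPI processes should match P x Q in hpccinf.txt:

[shell]# Illustrative: 48 ranks, 12 per node, node names listed in ./hosts
$mpirun -f hosts -n 48 -ppn 12 ./hpcc[/shell]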
hpccinf.txt is the same as the standard HPL input file with a few additional lines. Please refer to our HPL application note for guidance on tuning the parameters in the configuration file.
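For reference, the additional lines appear at the end of the file. Below is a sketch of the tail of a typical hpccinf.txt; the values shown are the template defaults and may vary between HPCC versions:

[shell]##### This line (no. 32) is ignored (it serves as a separator). ######
0                            Number of additional problem sizes for PTRANS
1200 10000 30000             values of N
0                            number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64     values of NB[/shell]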

Appendix A - Performance Results

Below are the HPCC benchmark results for the Intel Endeavor cluster, which can also be found on the HPCC website*.

HPC Challenge Benchmark Record

System Information
Affiliation: Intel Corporation
URL: http://www.intel.com/
Location: USA, Washington, DuPont
System Use: Vendor
System Manufacturer: Intel
System Name: Intel Endeavor cluster
Interconnect Manufacturer: Mellanox
Interconnect Type: QDR InfiniBand (40 Mellanox MTS3600Q-1UNC switches, Mellanox MHGH28-XTC adapters on nodes, only one port used per adapter, slot type is PCIe x8 Gen2)
Operating System: Red Hat EL 5.4, kernel 2.6.18-164
MPI: Intel MPI 4.0
MPI Wtick: 0.000001
BLAS: Intel MKL 10.3
Language: C
Compiler: Intel C/C++ Compiler 11.1.064
Compiler Flags: -O2 -xSSE4.2 -ip -ansi-alias -fno-alias -DUSING_FFTW -DMKL_INT=long -DLONG_IS_64BITS -DRA_SANDIA_OPT2 -DHPCC_FFT_235 (and "-opt-streaming-stores always" for stream.c)
Processor Type: Xeon X5670 (SMT OFF, Turbo OFF, DDR3-1333)
Processor Speed: 2.93 GHz
Total Processors: 4320
Processors Entered: 4320
Processors determined: 4320
Cores per chip: 6
HPL Processes: 4320
MPI Processes: 4320
Threads Entered: 1
Threads determined: 1
FLOPs per cycle: 4
Theoretical peak: 50.6304 TFlop/s
Total memory: 8640 GiB
FFT library: Intel MKL 10.3

HPL
HPL: 43.722 TFlop/s
HPL time: 14002.6 seconds
HPL eps: 2.22045e-16
HPL Rnorm1: 0.000000140367
HPL Anorm1: 243723
HPL AnormI: 243671
HPL Xnorm1: 1011140
HPL XnormI: 6.36053
HPL N: 972000
HPL NB: 168
HPL NProw: 60
HPL NPcol: 72
HPL depth: 0
HPL NBdiv: 2
HPL NBmin: 4
HPL CPfact: R
HPL CRfact: C
HPL CPtop: 1
HPL order: R
HPL dMach EPS: 2.220446e-16
HPL sMach EPS: 0.0000001192093
HPL dMach sfMin: 2.2250739999999997e-308
HPL sMach sfMin: 1.1754939999999999e-38
HPL dMach Base: 2
HPL sMach Base: 2
HPL dMach Prec: 4.440892e-16
HPL sMach Prec: 0.0000002384186
HPL dMach mLen: 53
HPL sMach mLen: 24
HPL dMach Rnd: 0
HPL sMach Rnd: 0
HPL dMach eMin: -1021
HPL sMach eMin: -125
HPL dMach rMin: 2.2250739999999997e-308
HPL sMach rMin: 1.1754939999999999e-38
HPL dMach eMax: 1025
HPL sMach eMax: 129
HPL dMach rMax: 0
HPL sMach rMax: 0
dweps: 1.110223e-16
sweps: 0.00000005960464

PTRANS
PTRANS: 549.988 GB/s
PTRANS time: 3.43075 seconds
PTRANS residual: 0
PTRANS N: 486000
PTRANS NB: 232
PTRANS NProw: 60
PTRANS NPcol: 72

STREAM
S-STREAM Copy: 8.30307 GB/s
S-STREAM Scale: 8.2778 GB/s
S-STREAM Add: 11.0563 GB/s
S-STREAM Triad: 11.0009 GB/s
EP-STREAM Copy: 3.33023 GB/s
EP-STREAM Scale: 3.32376 GB/s
EP-STREAM Add: 3.48553 GB/s
EP-STREAM Triad: 3.5357 GB/s
STREAM Vector Size: 72900000
STREAM Threads: 1

RandomAccess
S-RandomAccess: 0.035379 Gup/s
EP-RandomAccess: 0.0166186 Gup/s
G-RandomAccess: 10.8309 Gup/s
G-RandomAccess N: 549755813888
G-RandomAccess time: 203.033 seconds
G-RandomAccess Check Time: 187.382 seconds
G-RandomAccess Errors: 1343419
G-RandomAccess Errors Fraction: 0.00000244366
G-RandomAccess TimeBound: -1
G-RandomAccess ExeUpdates: 2199023255552
RandomAccess N: 134217728

FFT
S-FFT: 2.3047 GFlop/s
EP-FFT: 1.14392 GFlop/s
MPIFFT: 1173.89 GFlop/s
MPIFFT N: 116640000000
MPIFFT Max Error: 0.00000000000000431742
MPIFFT time0: 0 seconds
MPIFFT time1: 0 seconds
MPIFFT time2: 0 seconds
MPIFFT time3: 0 seconds
MPIFFT time4: 0 seconds
MPIFFT time5: 0 seconds
MPIFFT time6: 0 seconds
FFTEnblk: 16
FFTEnp: 8
FFTEl2size: 1048576

DGEMM
S-DGEMM: 11.0582 GFlop/s
EP-DGEMM: 10.9366 GFlop/s
DGEMM N: 8537

RandomRing Latency/Bandwidth
RandomRing Latency: 6.43059 usec
RandomRing Bandwidth: 0.131166 GB/s

NaturalRing Latency/Bandwidth
NaturalRing Latency: 3.44515 usec
NaturalRing Bandwidth: 0.962355 GB/s

PingPong Latency/Bandwidth
Maximum PingPong Latency: 4.36604 usec
Maximum PingPong Bandwidth: 4.02814 GB/s
Minimum PingPong Latency: 0.238419 usec
Minimum PingPong Bandwidth: 1.48091 GB/s
Average PingPong Latency: 3.62335 usec
Average PingPong Bandwidth: 1.80222 GB/s

Size of Data Types
char: 1 byte
short: 2 bytes
int: 4 bytes
long: 8 bytes
void ptr: 8 bytes
float: 4 bytes
double: 8 bytes
size_t: 8 bytes
s64Int: 8 bytes
u64Int: 8 bytes

OpenMP
M OpenMP: -1
OpenMP Num Threads: 0
OpenMP Num Procs: 0
OpenMP Max Threads: 0

Memory
MemProc: -1
MemSpec: -1
MemVal: -1

CPS
CPS_HPCC_FFT_235: 1
CPS_HPCC_FFTW_ESTIMATE: 0
CPS_HPCC_MEMALLCTR: 0
CPS_HPL_USE_GETPROCESSTIMES: 0
CPS_RA_SANDIA_NOPT: 0
CPS_RA_SANDIA_OPT2: 1

Version: 1.4.1.b - Run Type: base
Created: 2010-11-01 - Exported: Thu Mar 17 06:32:04 2011


Appendix B - References

Intel Xeon Processor based Servers Homepage


