Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors

<h2>Introduction</h2>

<p>The MILC software represents a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. "Strong interactions" are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. MILC applications address fundamental questions in high-energy and nuclear physics and are directly related to major experimental programs in these fields. MILC is one of the largest compute-cycle users at many US and European supercomputing centers.</p>

<h3>Purpose</h3>

<p>This article provides code access, build, and run directions for the “ks_imp_rhmc” application on <strong>Intel® Xeon® Gold</strong> and Intel® Xeon Phi™ processors, targeting better performance on a single node.</p>

<p>The “ks_imp_rhmc” application is a dynamical RHMC (rational hybrid Monte Carlo algorithm) code for staggered fermions. In addition to the naive and asqtad staggered actions, the highly improved staggered quark (HISQ) action is also supported.</p>

<p>Currently, the Conjugate Gradient (CG) solver and the Gauge Force operations in the code use the QPhiX library. Efforts are ongoing to integrate other operations (such as the Fermion Force (FF)) with the QPhiX library as well.</p>

<p class="greyHighlight">The QPhiX library provides sparse solvers and Dslash kernels for Lattice QCD simulations optimized for Intel® architecture.</p>

<h2>Code Access</h2>

<p>The MILC software and the QPhiX library are the primary requirements. The MILC software can be downloaded from GitHub* here: <a href="https://github.com/milc-qcd/milc_qcd" target="_blank">https://github.com/milc-qcd/milc_qcd</a>. Download (git checkout) the “<strong>develop</strong>” branch. QPhiX support is integrated into this branch for the CG solver and the Gauge Force operator. QPhiX support for Gauge Force is currently available on Intel® Xeon® Gold and Intel® Xeon Phi™ processors only.</p>

<pre class="brush:plain;">git clone https://github.com/milc-qcd/milc_qcd.git
git checkout develop</pre>

<p>The QPhiX library and Code Generator for use with Wilson-Clover fermions (e.g., for use with Chroma) are available from <a href="https://github.com/jeffersonlab/qphix.git" target="_blank">https://github.com/jeffersonlab/qphix.git</a> and <a href="https://github.com/jeffersonlab/qphix-codegen.git" target="_blank">https://github.com/jeffersonlab/qphix-codegen.git</a>, respectively. For the most up-to-date version, it is suggested to use the “devel” branch of QPhiX.</p>
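<p>For example, the Jefferson Lab repositories can be fetched as follows (a sketch; the “devel” branch applies to the QPhiX repository, per the note above):</p>

<pre class="brush:plain;">git clone https://github.com/jeffersonlab/qphix.git
cd qphix
git checkout devel   # most up-to-date branch
cd ..
git clone https://github.com/jeffersonlab/qphix-codegen.git</pre>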

<p class="greyHighlight">The MILC version of QPhiX is currently not open source. Please contact the MILC collaboration group for access to the QPhiX (MILC) branch.</p>

<h2>Build Directions</h2>

<p><strong>Compile the QPhiX Library:</strong></p>

<p>Users need to build the QPhiX library before building the MILC package.</p>

<p>The QPhiX library has two repositories: <em>milc-qphix</em> and <em>milc-qphix-codegen</em>.</p>

<p>Use the “gauge_force” branch for both of the above repositories.</p>

<p><strong>Build milc-qphix-codegen:</strong></p>

<p>The files with intrinsics for QPhiX are built in the milc-qphix-codegen directory.</p>

<p>Enter the milc-qphix-codegen directory. Remember to check out the “gauge_force” branch.</p>

<p>Edit line #3 in “<code>Makefile_xyzt</code>” to set the “<code>milc=1</code>” variable.</p>
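<p>The preceding steps amount to the following (a sketch; the directory name is assumed to match the repository):</p>

<pre class="brush:plain;">cd milc-qphix-codegen
git checkout gauge_force
# Edit Makefile_xyzt: line #3 should set milc=1</pre>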

<p>Compile as:</p>

<pre class="brush:plain;">source /opt/intel/compiler/&lt;version&gt;/bin/compilervars.sh intel64
source /opt/intel/impi/&lt;version&gt;/mpi/intel64/bin/mpivars.sh
make avx512 # [for Intel® Xeon® Gold and Intel® Xeon Phi™ processors]
</pre>

<p><strong>Build milc-qphix:</strong></p>

<p>Enter the milc-qphix (mbench) directory. Remember to check out the “gauge_force” branch.</p>

<p>Use “<em>Makefile_qphixlib</em>” as the makefile.</p>

<p>Set “<em>mode=mic</em>” to compile with Intel® Advanced Vector Extensions 512 (Intel AVX-512) for Intel® Xeon Phi™ processors and “<em>mode=avx512</em>” to compile with Intel AVX-512 for Intel® Xeon® Gold processors.</p>

<p>To enable MPI, set “<em>ENABLE_MPI = 1</em>” in the makefile.</p>
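<p>Putting the settings together (a sketch; “mbench” is assumed to be the local directory name of the milc-qphix repository):</p>

<pre class="brush:plain;">cd mbench
git checkout gauge_force
# In Makefile_qphixlib: set mode=mic (Intel Xeon Phi) or mode=avx512 (Intel Xeon Gold),
# and ENABLE_MPI = 1 to enable MPI</pre>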

<p>Compile as:</p>

<pre class="brush:plain;">make -f Makefile_qphixlib mode=mic AVX512=1 # [Intel® Xeon Phi™ processor]
make -f Makefile_qphixlib mode=avx512 AVX512=1 # [Intel® Xeon® Gold processor]
</pre>

<p><strong>Compile MILC Code:</strong></p>

<ol>
<li>Install/download the MILC software from the GitHub location above (the “<strong>develop</strong>” branch, as described in the Code Access section)</li>
<li>Download the Makefile.qphix file from the following location<br>
<a href="http://physics.indiana.edu/~sg/MILC_Performance_Recipe/" target="_blank">http://physics.indiana.edu/~sg/MILC_Performance_Recipe/</a></li>
<li>Copy Makefile.qphix to the corresponding application directory. In this case, copy Makefile.qphix to the “ks_imp_rhmc” application directory and rename it Makefile</li>
<li>Make the following changes to the Makefile:
<ul>
<li>On line #17 - Add/Uncomment the appropriate ARCH variable
<ul>
<li>For example, ARCH = knl (compile with Intel AVX-512 for Intel® Xeon Phi™ Processor)</li>
<li>For example, ARCH = skx (compile with Intel AVX-512 for Intel® Xeon® Gold Processor)</li>
</ul>
</li>
<li>On line #28 - Change MPP variable to “true” if you want MPI</li>
<li>On line #34 - Pick the PRECISION you want
<ul>
<li>1 = Single, 2 = Double. We use Double for our runs</li>
</ul>
</li>
<li>Starting at line #37 - The compiler is set up, and this should just work if the directions above were followed. If not, customize starting at line #40</li>
<li>On line #124 - Setup of Intel compiler starts
<ul>
<li>Based on ARCH it will use the appropriate flags</li>
</ul>
</li>
<li>On line #407 - QPhiX customizations start
<ul>
<li>On line #413 - Set QPHIX_HOME to the correct QPhiX path (path to the milc-qphix directory)</li>
<li>The appropriate QPhiX FLAGS will be set if the above is defined correctly</li>
</ul>
</ul>
</li>
</ul>
</li>
<li>Build:
<pre class="brush:plain;">cd ks_imp_rhmc #The Makefile with the above changes should be in this directory
source /opt/intel/compiler/&lt;version&gt;/bin/compilervars.sh intel64
source /opt/intel/impi/&lt;version&gt;/mpi/intel64/bin/mpivars.sh
make su3_rhmd_hisq # Build su3_rhmd_hisq binary
make su3_rhmc_hisq # Build su3_rhmc_hisq binary</pre>
</li>
</ol>

<p>Compile the above binaries for both Intel® Xeon Phi™ and Intel® Xeon® Gold processors, editing the Makefile accordingly for each architecture.</p>
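<p>In summary, the edits in step 4 amount to the following settings (a sketch; line numbers refer to the downloaded Makefile.qphix and may drift between versions):</p>

<pre class="brush:plain;"># ks_imp_rhmc/Makefile (renamed copy of Makefile.qphix)
ARCH = skx        # line #17: skx for Intel Xeon Gold, knl for Intel Xeon Phi
MPP = true        # line #28: enable MPI
PRECISION = 2     # line #34: 1 = single, 2 = double
QPHIX_HOME = &lt;path-to&gt;/milc-qphix   # line #413: path to the milc-qphix directory</pre>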

<h2>Run Directions</h2>

<h3>Input Files</h3>

<p>There are two required input files: params.rest and rat.m013m065m838.</p>

<p>They can be downloaded from here:</p>

<p><a href="http://physics.indiana.edu/~sg/MILC_Performance_Recipe/" target="_blank">http://physics.indiana.edu/~sg/MILC_Performance_Recipe/</a></p>

<p>The file rat.m013m065m838 defines the residues and poles of the rational functions needed in the calculation. The file params.rest sets all the run-time parameters, including the lattice size, the length of the calculation (number of trajectories), and the precision of the various conjugate-gradient solutions.</p>

<p>In addition, a params.&lt;lattice-size&gt; file with the required lattice size will be created at runtime. This file essentially consists of the lattice size (Nx x Ny x Nz x Nt) to run, with params.rest appended to it.</p>
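<p>For example, for the 24 x 24 x 24 x 24 volume used below, the generated file begins with the following header, followed by the contents of params.rest:</p>

<pre class="brush:plain;">prompt 0
nx 24
ny 24
nz 24
nt 24</pre>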

<h3>The Lattice Sizes</h3>

<p>The size of the four-dimensional space-time lattice is controlled by the “nx, ny, nz, nt” parameters.</p>

<p>As an example, consider a problem of (nx x ny x nz x nt) = 32 x 32 x 32 x 64 running on 64 MPI ranks. To weak-scale this problem, the user would begin by multiplying <strong>nt</strong> by 2, then <strong>nz</strong> by 2, then <strong>ny</strong> by 2, then <strong>nx</strong> by 2, and so on, such that all dimensions are scaled in a round-robin fashion.</p>

<p>This is illustrated in the table below. The original problem size is 32 x 32 x 32 x 64. To keep the elements/rank constant (weak scaling) at a rank count of 128, first multiply <strong>nt</strong> by 2 (32 x 32 x 32 x 128). Similarly, for 512 ranks, multiply <strong>nt</strong> by 2, <strong>nz</strong> by 2, and <strong>ny</strong> by 2 from the original problem size to keep the same elements/rank.</p>
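<p>The bookkeeping can be checked with shell arithmetic (a sketch using the 128-rank column of the table below):</p>

<pre class="brush:plain;">ranks=128; nx=32; ny=32; nz=32; nt=128
echo $(( nx * ny * nz * nt ))           # 4194304 total elements
echo $(( nx * ny * nz * nt / ranks ))   # 32768 elements/rank (constant under weak scaling)</pre>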

<table align="center" border="1" cellpadding="0" cellspacing="0" class="no-alternate" width="700">
<tbody>
<tr>
<td><strong>Ranks</strong></td>
<td style="text-align: right;"><strong>64</strong></td>
<td style="text-align: center;"><strong>128</strong></td>
<td style="text-align: right;"><strong>256</strong></td>
<td style="text-align: right;"><strong>512</strong></td>
</tr>
<tr>
<td>Nx</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;">32</td>
</tr>
<tr>
<td>Ny</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;"><strong>64</strong></td>
</tr>
<tr>
<td>Nz</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;">32</td>
<td style="text-align: right;"><strong>64</strong></td>
<td style="text-align: right;">64</td>
</tr>
<tr>
<td>Nt</td>
<td style="text-align: right;">64</td>
<td style="text-align: right;"><strong>128</strong></td>
<td style="text-align: right;">128</td>
<td style="text-align: right;">128</td>
</tr>
<tr>
<td>Total Elements</td>
<td style="text-align: right;">2097152</td>
<td style="text-align: right;">4194304</td>
<td style="text-align: right;">8388608</td>
<td style="text-align: right;">16777216</td>
</tr>
<tr>
<td>Multiplier</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">8</td>
</tr>
<tr>
<td>Elements/Rank</td>
<td style="text-align: right;">32768</td>
<td style="text-align: right;">32768</td>
<td style="text-align: right;">32768</td>
<td style="text-align: right;">32768</td>
</tr>
</tbody>
</table>

<p style="text-align: center;">Table. Illustrates Weak Scaling of Lattice Sizes</p>

<h3>Running with MPI x OpenMP*</h3>

<p>The calculation takes place on a four-dimensional hypercubic lattice, representing three spatial dimensions and one time dimension. The quark fields have values on each of the lattice points, and the gluon field has values on each of the links connecting nearest neighbors of the lattice sites.</p>

<p>The lattice is divided into equal sub-volumes, one per MPI rank. The MPI ranks can be thought of as being organized into a four-dimensional grid of ranks. It is possible to control the grid dimensions with the params.rest file. Of course, the grid dimensions must be integer factors of the lattice coordinate dimensions.</p>

<p>Each MPI rank executes the same code. The calculation requires frequent exchanges of quark and gluon values between MPI ranks with neighboring lattice sites. Within a single MPI rank, the site-by-site calculation is threaded using OpenMP directives, which have been inserted throughout the code. The most time-consuming part of production calculations is the conjugate gradient (CG) solver. In the QPhiX version of the CG solver, the data layout and the calculation at the thread level are further organized to take advantage of the SIMD lanes of Intel® Xeon® and Intel® Xeon Phi™ processors.</p>

<h3>Running the Test-cases</h3>

<ol>
<li>Create a “run” directory in the top-level directory and add the input files obtained from above</li>
<li><code>cd &lt;milc&gt;/run</code>
<p>Note: Run the appropriate binary for each architecture.</p>
</li>
<li>Create the lattice volume:
<pre class="brush:plain;">cat &lt;&lt; EOF &gt; params.$nx*$ny*$nz*$nt
prompt 0
nx $nx
ny $ny
nz $nz
nt $nt
EOF
cat params.rest &gt;&gt; params.$nx*$ny*$nz*$nt
</pre>

<p>For this performance recipe, we evaluate single-node performance with the following weak-scaled lattice volume:</p>

<p>Single Node (nx * ny * nz * nt): 24 x 24 x 24 x 24</p>
</li>
<li>Run MILC (source the latest Intel compilers and Intel® MPI Library; Intel® Parallel Studio 2018 or later is recommended). A combined sketch of steps 3 and 4 follows this list.
<p><strong>Single node Intel® Xeon® Gold 6148</strong> (8 MPI ranks x 5 OpenMP threads = 40 threads, one per core on the 2S system):</p>

<pre class="brush:plain;">mpiexec.hydra -n 8 -env OMP_NUM_THREADS 5 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' &lt;path-to&gt;/ks_imp_rhmc/su3_rhmd_hisq.skx &lt; params.24x24x24x24</pre>

<p><strong>Single node Intel® Xeon Phi™ 7250</strong> (numactl -p 1 prefers allocations from MCDRAM, which is NUMA node 1 in flat mode):</p>

<pre class="brush:plain;">mpiexec.hydra -n 1 -env OMP_NUM_THREADS 64 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' numactl -p 1 &lt;path-to&gt;/ks_imp_rhmc/su3_rhmd_hisq.knl &lt; params.24x24x24x24</pre>
</li>
</ol>
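<p>Putting steps 3 and 4 together for the single-node volume above (a sketch; &lt;path-to&gt; is a placeholder for the build location):</p>

<pre class="brush:plain;">nx=24; ny=24; nz=24; nt=24
cat &lt;&lt; EOF &gt; params.${nx}x${ny}x${nz}x${nt}
prompt 0
nx $nx
ny $ny
nz $nz
nt $nt
EOF
cat params.rest &gt;&gt; params.${nx}x${ny}x${nz}x${nt}
mpiexec.hydra -n 8 -env OMP_NUM_THREADS 5 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' &lt;path-to&gt;/ks_imp_rhmc/su3_rhmd_hisq.skx &lt; params.24x24x24x24</pre>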

<h2>Performance Results and Optimizations</h2>

<p>The chart below shows the performance of the CG solver: the relative speedup, based on CG GFLOPS/sec, across the 2S Intel® Xeon® Gold processor, the 2S Intel® Xeon® processor E5-2697 v4, and the Intel® Xeon Phi™ processor.</p>

<p style="text-align: center;"><img data-fid="613093" src="/sites/default/files/managed/45/93/Recipe-Building-Running-MILC-Intel-Xeon-Processors-Intel-Xeon-Phi-Processors-fig01.png" typeof="foaf:Image"></p>

<p>The optimizations in the QPhiX library include data-layout changes to target vectorization and the generation of packed, aligned loads/stores; cache blocking; load balancing; and improved code generation for each architecture (Intel® Xeon® processor, Intel® Xeon Phi™ processor) with corresponding intrinsics where necessary. See the References section for details.</p>

<h3>Testing Platform Configurations</h3>

<p>The following hardware was used for the above recipe and performance testing.</p>

<table align="center" border="1" cellpadding="0" cellspacing="0" class="no-alternate" width="700">
<thead>
<tr>
<th><strong>Processor</strong></th>
<th><strong>Intel® Xeon® Processor E5-2697 v4</strong></th>
<th><strong>Intel® Xeon Phi™ Processor 7250</strong></th>
<th><strong>Intel® Xeon® Gold 6148 Processor</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p>Sockets / TDP</p>
</td>
<td>
<p>2S / 290W</p>
</td>
<td>
<p>1S / 215W</p>
</td>
<td>
<p>2S / 150W</p>
</td>
</tr>
<tr>
<td>
<p>Frequency / Cores / Threads</p>
</td>
<td>
<p>2.3 GHz / 36 / 72</p>
</td>
<td>1.4 GHz / 68 / 272</td>
<td>2.4 GHz / 40 / 80</td>
</tr>
<tr>
<td>DDR4</td>
<td>8x16GB 2400 MHz</td>
<td>6x16 GB 2133 MHz</td>
<td>12x16 GB 2666 MHz (192 GB)</td>
</tr>
<tr>
<td>MCDRAM</td>
<td>N/A</td>
<td>16 GB Flat</td>
<td>N/A</td>
</tr>
<tr>
<td>Cluster/Snoop Mode</td>
<td>Home</td>
<td>Quadrant/Flat</td>
<td>Home</td>
</tr>
<tr>
<td>Turbo</td>
<td>On</td>
<td>On</td>
<td>On</td>
</tr>
<tr>
<td>BIOS</td>
<td>GRRFSDP1.86B0271.R00.1510301446</td>
<td>GVPRCRB1.86B.0010.R02.1606082342</td>
<td>86B.01.00.0412</td>
</tr>
<tr>
<td rowspan="2">Operating System</td>
<td>Red Hat Enterprise Linux* 6.7</td>
<td>Red Hat Enterprise Linux 6.7</td>
<td style="width: 198px; height: 20px;">Red Hat Enterprise Linux 7.3</td>
</tr>
<tr>
<td style="width: 197px; height: 20px;">(3.10.0-229.20.1.el6.x86_64)</td>
<td style="width: 204px; height: 20px;">(3.10.0-229.20.1)</td>
<td style="width: 198px; height: 20px;">3.10.0-514.el7.x86_64</td>
</tr>
</tbody>
</table>

<h2>MILC Build Configurations</h2>

<p>The following configurations were used for the above recipe and performance testing.</p>

<table border="1">
<tbody>
<tr>
<td nowrap="nowrap">MILC Version</td>
<td nowrap="nowrap">Master version as of December 2017</td>
</tr>
<tr>
<td nowrap="nowrap">Intel® Compiler Version</td>
<td nowrap="nowrap">2018.1.163</td>
</tr>
<tr>
<td nowrap="nowrap">Intel® MPI Library Version</td>
<td nowrap="nowrap">2018.1.163</td>
</tr>
<tr>
<td nowrap="nowrap">MILC Makefiles used</td>
<td>Makefile.qphix, Makefile_qphixlib, Makefile</td>
</tr>
</tbody>
</table>

<h2>References and Resources</h2>

<ol>
<li>MILC Collaboration <a href="http://physics.indiana.edu/~sg/milc.html" target="_blank">http://physics.indiana.edu/~sg/milc.html</a></li>
<li>QPhiX Case Study - <a href="http://www.nersc.gov/users/computational-systems/cori/application-portin..." target="_blank">http://www.nersc.gov/users/computational-systems/cori/application-portin...</a></li>
<li>MILC Staggered Conjugate Gradient Performance on Intel® Xeon Phi™ Processor - <a href="https://anl.app.box.com/v/IXPUG2016-presentation-10" target="_blank">https://anl.app.box.com/v/IXPUG2016-presentation-10</a></li>
<li><a href='/en-us/forums/intel-software-guard-extensions-intel-sgx'>Intel® Xeon Phi™ Processor</a></li>
</ol>

For more complete information about compiler optimizations, see our Optimization Notice.