Recipe: Building and Optimizing MPI-HMMER for Intel® Xeon Phi™ Coprocessors

I. Overview

This article provides a recipe for compiling, optimizing and running MPI-HMMER for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.

II. Introduction

MPI-HMMER is an open-source implementation of the HMMER protein sequence analysis suite. The main search algorithms in this suite, hmmsearch and hmmpfam, have been modified to use MPI in order to provide higher-throughput HMMER searches on modern computational clusters. The key algorithm in hmmsearch is the Viterbi algorithm, implemented as a contained, doubly nested loop that can be vectorized with no data accuracy penalties.
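To make the discussion concrete, the following is a hypothetical, greatly simplified sketch of the shape of such a Viterbi hot loop: an outer loop over sequence positions and an inner loop over model states, doing integer add/max dynamic programming. The names (viterbi_fill, mmx, tsc, msc, L, M) are illustrative and do not match the MPI-HMMER source exactly.

```c
#include <limits.h>

/* Avoid overflow when adding scores to "minus infinity". */
#define NEG_INFTY (INT_MIN / 2)

static inline int imax(int a, int b) { return a > b ? a : b; }

/* Fill an (L+1) x (M+1) dynamic-programming matrix mmx in place.
 * tsc[k] is a transition score, msc[i][k] an emission score.
 * The inner loop is the integer max/add pattern that vectorizes
 * cleanly: no function calls, no early exits, unit-stride access. */
void viterbi_fill(int L, int M, int **mmx, const int *tsc, int **msc)
{
    for (int i = 1; i <= L; i++) {
        mmx[i][0] = NEG_INFTY;
        for (int k = 1; k <= M; k++) {
            /* best of "advance through the model" vs "stay",
             * plus the emission score for this cell */
            int sc = imax(mmx[i-1][k-1] + tsc[k], mmx[i-1][k]);
            mmx[i][k] = sc + msc[i][k];
        }
    }
}
```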

The hmmsearch executable can be compiled for and executed on Intel Xeon Phi coprocessors without changes. However, the code makes use of a number of pointer variables which inhibit vectorization. To obtain better performance, relatively simple changes can be made to the source code and compiler command line to help the compiler vectorize the code.

III. Preliminaries

1. If they are not already present, install Intel® MPI Library 4.1 and Intel® C++ Compiler 13.1 or higher on your host system.

2. Download MPI-HMMER 2.3. (Note - this file was originally obtained from http://www.mpihmmer.org/ and is licensed under GPL v2.)

IV. Compiling the unmodified version of MPI-HMMER

1. Set up the MPI and compiler environments:


$ source /opt/intel/impi/<version_number>/bin64/mpivars.sh

$ source /opt/intel/composerxe/bin/compilervars.sh intel64

2. Unpack the source code and build the executables for Intel Xeon processors:


$ export CC=mpiicc

$ export CFLAGS="-O2 -xAVX"

$ tar xvf hmmer-2.3.2-MPI-0.92.tar.gz

$ mv hmmer-2.3.2-MPI-0.92 hmmer-2.3.2-MPI-0.92-xeon

$ cd hmmer-2.3.2-MPI-0.92-xeon 

$ ./configure --prefix=/<install_directory>/xeon

$ make; make install

$ cd ..

3. Unpack the source code and build the executables for the Intel Xeon Phi coprocessor. Note that it is necessary to set the -mmic flag for the compiler and the --host=x86_64 option for the configure program in order to cross-compile for the coprocessor:


$ export CC=mpiicc

$ export CFLAGS="-O2 -mmic"

$ tar xvf hmmer-2.3.2-MPI-0.92.tar.gz

$ mv hmmer-2.3.2-MPI-0.92 hmmer-2.3.2-MPI-0.92-mic

$ cd hmmer-2.3.2-MPI-0.92-mic 

$ ./configure --prefix=/<install_directory>/mic --host=x86_64

$ make; make install

$ cd ..

V. Running hmmsearch

To run hmmsearch, you will need a profile HMM file and a sequence file. If you do not have files of your own, you can test the program using files from the tutorial directory that ships with the software. These are small toy files and will not showcase the speedup when you modify the code. To generate an HMM file from the globins50.msf file in the tutorial directory:


$ /<install_directory>/xeon/hmmbuild globin.hmm globins50.msf

1. Be sure your environment is properly set as shown in section IV step 1.

2. To run hmmsearch on one Intel Xeon processor, using 32 MPI ranks and data from the tutorial directory, run:


$ mpiexec -host node1 -n 32 /<install_directory>/xeon/bin/hmmsearch --mpi globin.hmm Artemia.fa

3. To run the code on one Intel Xeon Phi coprocessor using 122 MPI ranks, copy the coprocessor version of hmmsearch and the data files to the coprocessor. Also copy the contents of /opt/intel/impi/<version_number>/mic/bin on the host to /bin on the coprocessor, and the contents of /opt/intel/impi/<version_number>/mic/lib and /opt/intel/composerxe/lib/mic on the host to /lib64 on the coprocessor. Alternatively, you can NFS-mount these directories from the host onto the coprocessor. Then run:


$ export I_MPI_MIC=1; mpiexec -host mic0 -n 122 /<install_directory>/mic/bin/hmmsearch --mpi globin.hmm Artemia.fa

4. To run the code on both the host and one coprocessor together, first make sure the required files have been copied to the coprocessor, then run:


$ export I_MPI_MIC=1; mpiexec -host node1 -n 32 /<install_directory>/xeon/bin/hmmsearch --mpi globin.hmm Artemia.fa : -host mic0 -n 122 /<install_directory>/mic/bin/hmmsearch --mpi globin.hmm Artemia.fa

VI. Modifying hmmsearch to improve performance

Analysis showed that the P7Viterbi algorithm in fast_algorithms.c contains a hotspot loop that was not vectorizing. The loop uses several pointer variables (mc, dc, ic) which the compiler cannot prove are not aliased. Since we know that no aliasing is occurring, we can inform the compiler of this by using the -fno-alias option when compiling fast_algorithms.c. With this information, the compiler is able to determine that there are no actual loop-carried dependencies and vectorizes the loop. The same optimization can be used for both the host and the coprocessor executables.
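The -fno-alias option asserts the no-aliasing guarantee for the whole compilation unit. The same guarantee can be expressed per pointer with the C99 restrict qualifier, as in the following illustrative sketch. The function and pointer names mimic the shape of the inner-loop updates but are not the actual MPI-HMMER source.

```c
/* Illustrative only: without the `restrict` qualifiers (or -fno-alias
 * on the whole file), the compiler must assume the stores through mc
 * might modify the data read through dc or ic, which forces it to
 * keep the loop scalar. With the qualifiers, the loads and stores are
 * provably independent and the loop can be vectorized. */
void update_row(int n, int * restrict mc,
                const int * restrict dc,
                const int * restrict ic)
{
    for (int k = 0; k < n; k++)
        mc[k] = (dc[k] > ic[k]) ? dc[k] : ic[k];  /* elementwise max */
}
```

Modifying the declarations this way is a source change; -fno-alias achieves the same effect from the command line without touching the code, which is why it is used in this recipe.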

Further analysis of the code revealed that unrolling the loop by a factor of two improved performance. To unroll the loop, place a "#pragma unroll(2)" directive immediately before the "for" loop at line 132 in fast_algorithms.c:


#pragma unroll(2)

for (k = 1; k < M; k++) {

Finally, it is possible to merge the small one-line loop in the E state calculation at line 180 in fast_algorithms.c into the main inner loop, saving a small amount of loop overhead.


xme = -INFTY; // moved up from E state calculation

mpc = mmx[i]; // moved up from E state calculation

ep  = hmm->esc; // moved up from E state calculation 

#pragma unroll(2)

for (k = 1; k < M; k++) {

   <original body of loop>

   if ((mpc[k] + ep[k]) > xme) xme = mpc[k] + ep[k] ; // moved up from E state calculation

}

 

<additional code>

                                /* E state */

//xme = -INFTY;

//mpc = mmx[i];

//ep  = hmm->esc;

//for (k = 1; k <= hmm->M; k++)

//  if ((mpc[k] + ep[k]) > xme) xme = mpc[k] + ep[k] ;

xmx[i][XME] = xme;

VII. Compiling and executing the modified code

After making the code changes listed in section VI, rebuild MPI-HMMER.

1. Be sure your environment is properly set as shown in section IV step 1.

2. Rebuild MPI-HMMER for the processor:


$ cd hmmer-2.3.2-MPI-0.92-xeon/src 

$ mpiicc -O2 -DHAVE_CONFIG_H -I../squid -c fast_algorithms.c -fno-alias

$ cd ..

$ make; make install

$ cd ..

3. Rebuild MPI-HMMER for the coprocessor:


$ cd hmmer-2.3.2-MPI-0.92-mic/src 

$ mpiicc -O2 -mmic -DHAVE_CONFIG_H -I../squid -c fast_algorithms.c -fno-alias

$ cd ..

$ make; make install

$ cd ..

 

4. Run the code as shown in section V.

For more complete information about compiler optimizations, see our Optimization Notice.