Recipe: ROME1.0/SML for the Intel® Xeon Phi™ Processor 7250

Overview

This article provides a recipe for how to obtain, compile, and run ROME1.0 SML (statistical manifold learning-based deep classification) on Intel® Xeon® processors and Intel® Xeon Phi™ processors. Before you run the ROME SML processing phase, you need to run the ROME MAP processing phase, because ROME SML will use the output of ROME MAP. Please follow the instructions below to run the ROME MAP and ROME SML workloads.

The source and test workloads for this version of ROME can be downloaded from: http://ipccsb.dfci.harvard.edu/rome/download.html.

Introduction

ROME (Refinement and Optimization via Machine lEarning for cryo-EM) is one of the major research software packages from the Dana-Farber Cancer Institute. ROME is a parallel computing software system dedicated to high-resolution cryo-EM structure determination and data analysis, implementing advanced machine learning approaches optimized for HPC clusters. ROME 1.0 introduces SML (statistical manifold learning-based deep classification), following MAP-based (maximum a posteriori) image alignment. More information about ROME can be found at http://ipccsb.dfci.harvard.edu/rome/index.html.

The ROME system has been optimized for both Intel® Xeon® processors and Intel® Xeon Phi™ processors. Detailed information about the underlying algorithms and optimizations can be found at http://arxiv.org/abs/1604.04539.

In this document, we use three workloads: Inflammasome, RP-a, and RP-b. The workload descriptions are as follows:

  • Inflammasome data: 16306 images of NLRC4/NAIP2 inflammasome with a size of 250 × 250 pixels
  • RP-a: 57001 images of proteasome regulatory particles (RP) with a size of 160 × 160 pixels
  • RP-b: 35407 images of proteasome regulatory particles (RP) with a size of 160 × 160 pixels

In this document, we use “ring11_all” to refer to the Inflammasome workload, “data6” to refer to the RP-a workload, and “data8” to refer to the RP-b workload.

Preliminaries

  1. To match these results, the Intel Xeon Phi processor machine needs to be booted with BIOS settings for quad cluster mode and MCDRAM cache mode. Please review this document for further information. The Intel Xeon processor system does not need to be started in any special manner.
  2. To build this package, install the Intel® MPI Library for Linux* 5.1 (Update 3) and Intel® Parallel Studio XE Composer Edition for C++ Linux* Version 2016 (Update 3) or later on your systems.
  3. Download the source ROME1.0a.tar.gz from http://ipccsb.dfci.harvard.edu/rome/download.html
  4. Unpack the source code to /home/users.

    > cp ROME1.0a.tar.gz /home/users
    > tar -xzvf ROME1.0a.tar.gz

  5. The workloads are provided by the Intel® Parallel Computing Center for Structural Biology (http://ipccsb.dfci.harvard.edu/). As noted above, the workloads can be downloaded from http://ipccsb.dfci.harvard.edu/rome/download.html. Following the EMPIAR-10069 link, download Inf_data1.* (Set 1) and rename them ring11_all.*. Download RP_data2.* (Set 2) and rename them data8.*. Download RP_data4.* (Set 4) and rename them data6.*. The scripts referred to below can be obtained by pulling the file KNL_LAUNCH.tgz from http://ipccsb.dfci.harvard.edu/rome/download.html and unpacking the file in a similar manner. 
  6. Copy the workloads and run scripts to your home directory. You should have the following files:

    >cp ring11_all.star /home/users
    >cp ring11_all.mrcs /home/users
    >cp data6.star /home/users
    >cp data6.mrcs /home/users
    >cp data8.star /home/users
    >cp data8.mrcs /home/users
    >cp run_ring11_all_map_XEON.sh /home/users
    >cp run_ring11_all_sml_XEON.sh /home/users
    >cp run_ring11_all_map_XEONPHI.sh /home/users
    >cp run_ring11_all_sml_XEONPHI.sh /home/users
    >cp run_data6_map_XEON.sh /home/users
    >cp run_data6_sml_XEON.sh /home/users
    >cp run_data6_map_XEONPHI.sh /home/users
    >cp run_data6_sml_XEONPHI.sh /home/users
    >cp run_data8_map_XEON.sh /home/users
    >cp run_data8_sml_XEON.sh /home/users
    >cp run_data8_map_XEONPHI.sh /home/users
    >cp run_data8_sml_XEONPHI.sh /home/users
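The download-and-rename step in step 5 can be collected into a short shell helper. The source and target prefixes follow the instructions above; the loop simply skips any file set that has not been downloaded yet, so the sketch is safe to run as-is:

```shell
#!/bin/sh
# Rename the downloaded EMPIAR-10069 file sets to the names the run
# scripts expect, keeping each file's extension (.star, .mrcs, ...).
rename_set() {
    src="$1"    # downloaded prefix, e.g. Inf_data1
    dst="$2"    # prefix expected by the scripts, e.g. ring11_all
    for f in "$src".*; do
        [ -e "$f" ] || continue     # skip sets that are not downloaded
        mv "$f" "$dst.${f#$src.}"
    done
}

rename_set Inf_data1 ring11_all     # Set 1 -> Inflammasome workload
rename_set RP_data2  data8          # Set 2 -> RP-b workload
rename_set RP_data4  data6          # Set 4 -> RP-a workload
```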

Prepare the binaries for the Intel Xeon processor and the Intel Xeon Phi processor

  1. Set up the Intel® MPI Library and Intel® C++ Compiler environments:

    > source /opt/intel/impi/<version>/bin64/mpivars.sh
    > source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
    > source /opt/intel/mkl/<version>/bin/mklvars.sh intel64
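After sourcing the three scripts, it may be worth confirming that the Intel tools are actually on PATH before building. A quick check (it prints warnings rather than failing, so it is safe to run on any machine):

```shell
#!/bin/sh
# Confirm the sourced environment scripts put the Intel tools on PATH.
# The tool names follow the compiler/MPI scripts sourced above.
missing=0
for tool in mpiicpc mpirun icpc; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "WARNING: $tool not found in PATH" >&2
        missing=$((missing + 1))
    fi
done
echo "$missing of 3 tools missing"
```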

  2. Set environment variables for compilation of ROME:

    >export ROME_CC=mpiicpc
  3. Build the binaries for the Intel Xeon processor.

    >cd /home/users/ROME1.0a
    >make
    >mkdir bin
    >mv rome_map bin/rome_map
    >mv rome_sml bin/rome_sml

  4. Build the binaries for the Intel Xeon Phi processor.

    >cd /home/users/ROME1.0a
    >vi makefile
    Modify the FLAGS line as follows:
    FLAGS := -mkl -fopenmp -O3 -xMIC-AVX512 -DNDEBUG -std=c++11
    >make
    >mkdir bin_knl
    >mv rome_map bin_knl/rome_map
    >mv rome_sml bin_knl/rome_sml
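As an alternative to editing the makefile in vi, GNU make lets a command-line variable assignment override a "FLAGS :=" definition, so the Xeon Phi flags can be passed directly as make FLAGS="-mkl -fopenmp -O3 -xMIC-AVX512 -DNDEBUG -std=c++11" from the ROME1.0a directory. The snippet below demonstrates only the override mechanism, using a throwaway stand-in makefile so it can be run anywhere:

```shell
#!/bin/sh
# Demonstrate that a command-line assignment overrides a 'FLAGS :='
# definition, using a throwaway makefile as a stand-in for ROME's.
tmp=$(mktemp -d)
printf 'FLAGS := -O3 -xHost\nall:\n\t@echo $(FLAGS)\n' > "$tmp/makefile"
make -s -C "$tmp"                              # prints: -O3 -xHost
make -s -C "$tmp" FLAGS="-O3 -xMIC-AVX512"     # prints: -O3 -xMIC-AVX512
rm -rf "$tmp"
```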

Run the test workloads on the Intel Xeon processor (the scripts assume an Intel® Xeon® processor E5-2697 v4)

  1. Running the ROME MAP phase for these workloads:

        Notes:
    • This is just a setup phase for the ROME SML phase. You don’t need to do this step for the Intel Xeon processor if you have already done it on the Intel Xeon Phi processor – the output can be used as the input to the ROME SML phase for both the Intel Xeon processor and Intel Xeon Phi processor runs.
    • This step will take on the order of 1.5 hours for ring11_all, 18 hours for data6, and over 7 hours for data8. To speed this up, we recommend spreading the work over multiple nodes by specifying a hostfile and a total process count that is a multiple of the number of processes run per host. Modify the mpirun commands in the scripts from this:
      mpirun -env OMP_NUM_THREADS 9 -np 4
      to something like this:
      mpirun -hostfile <hostfile> -env OMP_NUM_THREADS 9 -perhost 4 -np <total_MPI_processes=4*number_hosts>
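The multi-node change described in the note can be sketched as follows. The node names are placeholders, and the final mpirun command is echoed rather than executed so the snippet can be tried without a cluster:

```shell
#!/bin/sh
# Build a hostfile and derive the multi-node mpirun invocation from it.
# The node names below are placeholders; replace them with your hosts.
cat > hostfile <<'EOF'
node01
node02
node03
node04
EOF
NUM_HOSTS=$(wc -l < hostfile)
PERHOST=4                           # MPI ranks per host
NP=$((NUM_HOSTS * PERHOST))         # total MPI ranks across all hosts
# Echoed for illustration; on a real cluster, execute it instead:
echo mpirun -hostfile hostfile -env OMP_NUM_THREADS 9 -perhost "$PERHOST" -np "$NP"
```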

    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_map_XEON.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_map_XEON.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_map_XEON.sh

  2. Running the ROME SML phase for these workloads:

    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_sml_XEON.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_sml_XEON.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_sml_XEON.sh

Note: When running on Intel Xeon processors, we have found that both ROME MAP and ROME SML phases run best with 4 MPI ranks per host, and 9 OpenMP threads per rank. The provided scripts show how to run this way on a single host.
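The 4-rank, 9-thread choice in the note above lines up with the hardware: 4 × 9 = 36 threads, one per physical core on a two-socket E5-2697 v4 (2 sockets × 18 cores). A minimal sanity check of that arithmetic:

```shell
#!/bin/sh
# Check that ranks * threads matches the physical core count, so each
# OpenMP thread gets its own core (values for a 2 x 18-core E5-2697 v4).
SOCKETS=2
CORES_PER_SOCKET=18
RANKS=4
THREADS=9
TOTAL_CORES=$((SOCKETS * CORES_PER_SOCKET))
if [ $((RANKS * THREADS)) -eq "$TOTAL_CORES" ]; then
    echo "OK: $RANKS ranks x $THREADS threads fill $TOTAL_CORES cores"
else
    echo "Mismatch: adjust RANKS or THREADS"
fi
```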

Run the test workloads on the Intel Xeon Phi processor

  1. Running the ROME MAP phase for these workloads:

    Notes:
    • This is just a setup phase for the ROME SML phase. You don’t need to do this step for the Intel Xeon Phi processor if you have already done it on the Intel Xeon processor – the output can be used as the input to the ROME SML phase for both the Intel Xeon processor and Intel Xeon Phi processor runs.
    • This step will take on the order of 2 hours for ring11_all, 26 hours for data6, and over 10 hours for data8. To speed this up, we recommend spreading the work over multiple nodes by specifying a hostfile and a total process count that is a multiple of the number of processes run per host. Modify the mpirun commands in the scripts from this:
      mpirun -env OMP_NUM_THREADS 64 -np 4
      to something like this:
      mpirun -hostfile <hostfile> -env OMP_NUM_THREADS 64 -perhost 4 -np <total_MPI_processes=4*number_hosts>
    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_map_XEONPHI.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_map_XEONPHI.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_map_XEONPHI.sh

  2. Running the ROME SML phase for these workloads:

    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_sml_XEONPHI.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_sml_XEONPHI.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_sml_XEONPHI.sh

Note: When running on Intel Xeon Phi processors, we have found that ROME MAP runs best with 4 MPI ranks per host, and 64 OpenMP threads per rank. In contrast, ROME SML runs best with 4 MPI ranks per host, and 16 OpenMP threads per rank. The provided scripts show how to run this way on a single host. 
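Since MAP and SML prefer different thread counts on the Xeon Phi, a small wrapper that picks OMP_NUM_THREADS by phase may be convenient. This wrapper is a sketch and not part of the ROME distribution; the commented mpirun line only indicates where a real launch command would go:

```shell
#!/bin/sh
# Select the per-rank OpenMP thread count by ROME phase on the Xeon Phi,
# per the note above: MAP runs best with 64 threads, SML with 16.
phase="${1:-map}"
case "$phase" in
    map) threads=64 ;;
    sml) threads=16 ;;
    *)   echo "usage: $0 {map|sml}" >&2; exit 2 ;;
esac
export OMP_NUM_THREADS="$threads"
echo "phase=$phase OMP_NUM_THREADS=$OMP_NUM_THREADS"
# A real run would then launch the matching binary, e.g. (hypothetical):
# mpirun -env OMP_NUM_THREADS "$threads" -np 4 ./bin_knl/rome_$phase ...
```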

Performance gain seen with ROME SML

For the workloads described above, the following graph shows the runtimes and speedups achieved by running this code on the Intel Xeon Phi processor. As you can see, the ROME SML phase exhibits up to a 3.04x speedup for the Inflammasome workload when running on one Intel® Xeon Phi™ processor 7250 versus one two-socket Intel Xeon processor E5-2697 v4 system. The data used below were stored on a Lustre* file system.

[Graph: ROME SML runtimes and speedups, Intel Xeon Phi processor 7250 vs. two-socket Intel Xeon processor E5-2697 v4]

Testing platform configuration:

Intel Xeon processor E5-2697 v4: BDW-EP node with dual sockets, 18 cores/socket HT enabled @2.3 GHz 145W (Intel Xeon processor E5-2697 v4 w/128 GB RAM), Red Hat Enterprise Linux Server release 6.7 (Santiago)

Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250 68 core, 272 threads, 1400 MHz core freq. MCDRAM 16 GB 7.2 GT/s, DDR4 96 GB 2400 MHz, Red Hat Enterprise Linux Server release 6.7 (Santiago), quad cluster mode, MCDRAM cache mode.

For more complete information about compiler optimizations, see our Optimization Notice.