Recipe: RELION for Intel® Xeon Phi™ 7250 processor

I. Overview

This article provides a recipe for how to obtain, compile, and run an optimized version of relion-1.4 on Intel® Xeon® processors and Intel® Xeon Phi™ processors.

The source for this version of relion-1.4 can be downloaded from: http://www2.mrc-lmb.cam.ac.uk/relion/index.php

II. Introduction

RELION is an image processing software package that is widely used to obtain high-resolution cryo-EM structures. It uses a Bayesian MAP+EM algorithm, which provides more reliable structures than existing methods and is better suited to heterogeneous data. RELION is distributed under a GPL license; it is completely free, open-source software for both academia and industry. The code is written in C++, and parallelization is achieved through MPI and Pthreads. For more information about RELION, see http://www2.mrc-lmb.cam.ac.uk/relion/index.php

This project optimizes the performance of the auto-refine part of RELION on both Intel® Xeon® processors and Intel® Xeon Phi™ processors.

Optimizations in this package include:

  1. Align data on 64-byte boundaries. This alignment yields about a 10% performance improvement for this workload.
  2. Vectorize the two hotspot loops. The first hotspot loop in particular is executed very frequently during a run. Vectorizing both loops yields a performance improvement of more than 30% for this workload.
  3. RELION is a memory-bound application, so it benefits from the fast MCDRAM available on the Intel® Xeon Phi™ 7250 processor. Using MCDRAM in cache mode gives about a 10% performance boost for this workload.
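
The first optimization in the list above can be sketched in isolation as follows. This is a minimal illustration, not RELION code: `AlignedBlock` and `isAligned` are hypothetical names, and `posix_memalign` is swapped in for `_mm_malloc` on the assumption of a POSIX system, so that the sketch builds without Intel intrinsics headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a class whose objects should be 64-byte aligned.
struct AlignedBlock {
    double values[16];

    // Class-level operator new: every heap-allocated AlignedBlock starts on a
    // 64-byte boundary (one cache line, or one 512-bit vector register).
    static void* operator new(std::size_t size) {
        void* p = nullptr;
        // Assumption: POSIX system. posix_memalign returns 0 on success.
        if (posix_memalign(&p, 64, size) != 0) throw std::bad_alloc();
        return p;
    }

    // Matching operator delete: memory from posix_memalign is released with free.
    static void operator delete(void* p) noexcept { std::free(p); }
};

// Returns true when the address is a multiple of `alignment` bytes.
inline bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With these definitions, `new AlignedBlock` goes through the class-level allocator, and `isAligned(ptr, 64)` can be used to confirm the alignment of the returned pointer.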

III. Preliminaries

  1. To match these results, the Intel® Xeon Phi™ processor machine needs to be booted with BIOS settings for quad cluster mode and MCDRAM cache mode. Please review this document for further information.
  2. To build this package, install the Intel® MPI Library for Linux* 5.1 (Update 3) and Intel® Parallel Studio XE Composer Edition for C++ Linux* Version 2016 (Update 3), or later versions, on your system.
  3. Download relion-1.4.tar.bz2 from http://www2.mrc-lmb.cam.ac.uk/relion/index.php
  4. Set up the Intel® MPI Library and Intel® C++ Compiler environments:

    > source /opt/intel/impi/<version>/bin64/mpivars.sh
    > source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
  5. Unpack the source code to /home/users.

    > cp relion-1.4.tar.bz2 /home/users
    > cd /home/users
    > tar -xjvf relion-1.4.tar.bz2
    > cd ./relion-1.4
  6. Contact Yanan Zhu at Peking University <ynzhu@pku.edu.cn> to obtain the test workload, and request the version used for the Intel KNL Recipes.
  7. Copy the workload to /home/users and unpack it:

    >cp relion-autorefine-5000.tar.gz /home/users
    >cd /home/users
    >tar -xzvf relion-autorefine-5000.tar.gz

IV. Add optimized code into relion

  1. Overload operators new and delete of class MultidimArray in src/multidim_array.h

    >cd /home/users/relion-1.4
    >vi src/multidim_array.h

    Insert the optimized code below before line 496:

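    // Allocate MultidimArray objects on 64-byte boundaries.
    void *operator new(size_t size)
    {
         return _mm_malloc(size, 64);
    }
    // Memory obtained from _mm_malloc must be released with _mm_free.
    void operator delete(void *p)
    {
         _mm_free(p);
    }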
  2. Vectorize the first hotspot loop

    >cd /home/users/relion-1.4
    >vi src/multidim_array.h

    Replace the original code with the optimized code.

    The original code (at about line 930) is:
           for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++)
                        for (long int j = 0; j < Xdim; j++)
                        {
                            T val;
                            if (k >= ZSIZE(*this))
                                val = 0;
                            else if (i >= YSIZE(*this))
                                val = 0;
                            else if (j >= XSIZE(*this))
                                val = 0;
                            else
                                val = DIRECT_A3D_ELEM(*this, k, i, j);
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
                        }

    Optimized code is:

    if ( (ZSIZE(*this)<= Zdim) && (YSIZE(*this)<= Ydim) && (XSIZE(*this)<= Xdim) ) {
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++) {
       #pragma simd
                        for (long int j = 0; j < Xdim; j++)
                        {
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = 0;
                        }
                    }
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < ZSIZE(*this); k++)
                    for (long int i = 0; i < YSIZE(*this); i++) {
       #pragma simd
                        for (long int j = 0; j < XSIZE(*this); j++)
                        {
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = DIRECT_A3D_ELEM(*this, k, i, j);
                        }
                    }
    } else {
    
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++)
                        for (long int j = 0; j < Xdim; j++)
                        {
                            T val;
                            if (k >= ZSIZE(*this))
                                val = 0;
                            else if (i >= YSIZE(*this))
                                val = 0;
                            else if (j >= XSIZE(*this))
                                val = 0;
                            else
                                val = DIRECT_A3D_ELEM(*this, k, i, j);
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
                        }
    }
  3. Vectorize the second hotspot loop

    >cd /home/users/relion-1.4
    >vi src/ml_optimiser.cpp

    Replace the original code with the optimized code.
    The original code (line 3652) is:

    FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
    {
        diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
        diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
        suma2 += norm(DIRECT_MULTIDIM_ELEM(Frefctf, n));
    }

    Optimized code is:

    Complex *opp;
    FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
    {
        diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
        diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
        opp = & DIRECT_MULTIDIM_ELEM(Frefctf, n);
        suma2 += opp->real * opp->real + opp->imag * opp->imag;
    }
  4. There is a known issue in relion 1.4, described at the following link:

    http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Known_issue

    Change line 405 in src/ml_optimiser_mpi.cpp from:

    length_fn_ctf = exp_fn_img.length() + 1; // +1 to include \0 at the end of the string

    to:

    length_fn_ctf = exp_fn_ctf.length() + 1; // +1 to include \0 at the end of the string
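
The idea behind the first hotspot loop change in step 2 can be sketched on a smaller case. The example below is a simplified 2-D illustration with hypothetical function names (`padBranchy`, `padSplit`) and `std::vector<float>` storage rather than RELION's MultidimArray. It assumes, as the outer `if` in the optimized code does, that the destination is at least as large as the source in every dimension; under that assumption the per-element bounds tests can be replaced by a zero-fill pass followed by a copy pass, both of which have branch-free inner loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: one loop nest with a bounds test on every element,
// mirroring the shape of the original resize loop (reduced to 2-D).
std::vector<float> padBranchy(const std::vector<float>& src,
                              std::size_t srcRows, std::size_t srcCols,
                              std::size_t dstRows, std::size_t dstCols) {
    std::vector<float> dst(dstRows * dstCols);
    for (std::size_t i = 0; i < dstRows; ++i)
        for (std::size_t j = 0; j < dstCols; ++j)
            dst[i * dstCols + j] =
                (i < srcRows && j < srcCols) ? src[i * srcCols + j] : 0.0f;
    return dst;
}

// Split version: valid only when the destination is at least as large as
// the source in both dimensions.
std::vector<float> padSplit(const std::vector<float>& src,
                            std::size_t srcRows, std::size_t srcCols,
                            std::size_t dstRows, std::size_t dstCols) {
    assert(srcRows <= dstRows && srcCols <= dstCols);
    std::vector<float> dst(dstRows * dstCols);
    // Pass 1: zero fill over the whole destination. std::vector already
    // zero-initializes; the explicit loop is kept to mirror the two passes.
    for (std::size_t i = 0; i < dstRows; ++i)
        for (std::size_t j = 0; j < dstCols; ++j)
            dst[i * dstCols + j] = 0.0f;
    // Pass 2: copy over the source extent only, with no bounds test inside.
    for (std::size_t i = 0; i < srcRows; ++i)
        for (std::size_t j = 0; j < srcCols; ++j)
            dst[i * dstCols + j] = src[i * srcCols + j];
    return dst;
}
```

For a 2x3 source padded into a 3x4 destination, both functions return the same array; the split version simply gives the compiler two simple inner loops to vectorize.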

V. Prepare for Intel® Xeon® processor

  1. Set environment variables for compilation of relion:

    >export CC=icc
    >export CXX=icpc
    >export F77=ifort
    >export MPICC=mpiicc
    >export MPICXX=mpiicpc
    >export CFLAGS="-O3 -xHost -fno-alias -align"
    >export FFLAGS="-O3 -xHost -fno-alias -align"
    >export CXXFLAGS="-O3 -xHost -fno-alias -align"

    Suggestion: you can also add -qopt-report=5 to CFLAGS/FFLAGS/CXXFLAGS to generate an optimization report.

  2. Build the library for the Intel® Xeon® processor.

    > cd /home/users
    > cd ./relion-1.4
    > ./INSTALL.sh

VI. Prepare for Intel® Xeon Phi™ processor

  1. Set environment variables for compilation of relion:

    >export CC=icc
    >export CXX=icpc
    >export F77=ifort
    >export MPICC=mpiicc
    >export MPICXX=mpiicpc
    >export CFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
    >export FFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
    >export CXXFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"

    Suggestion: you can also add -qopt-report=5 to CFLAGS/FFLAGS/CXXFLAGS to generate an optimization report.

  2. Build the library for the Intel® Xeon Phi™ processor.

    > cd /home/users
    > cd ./relion-1.4
    > ./INSTALL.sh

VII. Run the test workload on Intel® Xeon® processor

  1. Create a run script for this workload

    >vi auotrefine.sh
    #!/bin/sh
    nprocs=9
    nthreads=4
    mrcsfile="adkc_05000.mrcs.mrcs"
    starfile="adkc_05000.star"
    defocusfile="3eulerctf_05000.dat"
    
    echo "" > $starfile
    echo "data_" >> $starfile
    echo "" >> $starfile
    echo "loop_" >> $starfile
    echo "_rlnVoltage #1" >> $starfile
    echo "_rlnDefocusU #2" >> $starfile
    echo "_rlnDefocusV #3" >> $starfile
    echo "_rlnDefocusAngle #4" >> $starfile
    echo "_rlnSphericalAberration #5" >> $starfile
    echo "_rlnAmplitudeContrast #6" >> $starfile
    echo "_rlnImageName #7" >> $starfile
    
    awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
    mkdir -p Refine/adkc_05000
    mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads --iter 10 --split_random_halves --i $starfile  --particle_diameter 86 --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40
  2. Set PATH and LD_LIBRARY_PATH to run the relion autorefine workload

    >export PATH=$PATH:/home/users/relion-1.4/bin
    >export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib
  3. Run the workload

    >cd /home/users/5000_vec1
    >sh auotrefine.sh
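
The awk command in the run script turns each non-comment line of the defocus file into one row of the STAR file. As a rough sketch of that per-line logic, the function below does the same field selection in C++. The name `makeStarRow` is hypothetical, and the handling of short or empty lines (they are skipped here) is an assumption that differs from awk, which would print empty fields for them.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Builds one STAR data row from one line of the defocus file, following the
// field selection of the awk command: fields 1, 3, 4, 5, 6, 7, then
// "<particleIndex>@<stackName>". Returns an empty string for comment lines
// (first field starts with ';') and, as an assumption of this sketch, for
// lines with fewer than seven fields.
std::string makeStarRow(const std::string& line, int particleIndex,
                        const std::string& stackName) {
    std::istringstream in(line);
    std::vector<std::string> f;
    for (std::string tok; in >> tok;) f.push_back(tok);
    if (f.size() < 7 || f[0][0] == ';') return "";

    std::ostringstream row;
    // awk fields are 1-based; f[] is 0-based, so $3 is f[2], and so on.
    row << f[0] << ' ' << f[2] << ' ' << f[3] << ' ' << f[4] << ' '
        << f[5] << ' ' << f[6] << ' '
        << particleIndex << '@' << stackName << '\n';
    return row.str();
}
```

A caller would keep a counter starting at 1, increment it only for lines that produce a row, and append each returned row after the `loop_` header written by the echo commands.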

VIII. Run the test workload on Intel® Xeon Phi™ processor

  1. Create a run script for this workload

    >vi auotrefine.sh
    #!/bin/sh
    nprocs=65
    nthreads=4
    mrcsfile="adkc_05000.mrcs.mrcs"
    starfile="adkc_05000.star"
    defocusfile="3eulerctf_05000.dat"
    
    echo "" > $starfile
    echo "data_" >> $starfile
    echo "" >> $starfile
    echo "loop_" >> $starfile
    echo "_rlnVoltage #1" >> $starfile
    echo "_rlnDefocusU #2" >> $starfile
    echo "_rlnDefocusV #3" >> $starfile
    echo "_rlnDefocusAngle #4" >> $starfile
    echo "_rlnSphericalAberration #5" >> $starfile
    echo "_rlnAmplitudeContrast #6" >> $starfile
    echo "_rlnImageName #7" >> $starfile
    
    awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
    mkdir -p Refine/adkc_05000
    mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads --iter 10 --split_random_halves --i $starfile  --particle_diameter 86 --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40
  2. Set PATH and LD_LIBRARY_PATH to run the relion autorefine workload

    >export PATH=$PATH:/home/users/relion-1.4/bin
    >export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib
  3. Run the workload

    >cd /home/users/5000_vec1
    >sh auotrefine.sh
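
The script above starts 65 MPI ranks with 4 threads each, which is 260 software threads against the 272 hardware threads listed for the Intel® Xeon Phi™ 7250 in the platform configuration. The small sketch below makes that arithmetic explicit; the function names are hypothetical and the check is only a sanity test on the `nprocs` and `nthreads` values, not a tuning rule.

```cpp
#include <cassert>

// Total software threads started by the run: MPI ranks times threads per rank.
constexpr int totalThreads(int ranks, int threadsPerRank) {
    return ranks * threadsPerRank;
}

// True when the run does not request more threads than the node provides.
constexpr bool fitsOnNode(int ranks, int threadsPerRank, int hardwareThreads) {
    return totalThreads(ranks, threadsPerRank) <= hardwareThreads;
}
```

For example, `totalThreads(65, 4)` is 260, so `fitsOnNode(65, 4, 272)` holds for the 7250 settings used here.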

IX. Performance gain

For the autorefine workload described above, the following graph shows the speedup achieved from this optimization. Up to a 1.31x speedup can be achieved when running this code on one Intel® Xeon Phi™ 7250 processor versus one 2-socket Intel® Xeon® processor E5-2697 v4, and up to a 1.23x speedup when running it on one Intel® Xeon Phi™ 7210 processor versus one 2-socket Intel® Xeon® processor E5-2697 v4.

  • 2S Intel® Xeon® processor E5-2697 v4, (18 Ranks)
  • 1 Intel® Xeon Phi™ 7210 (63 Ranks)
  • 1 Intel® Xeon Phi™ 7250 (65 Ranks)

Testing platform configuration:

Intel® Xeon® processor E5-2697 v4: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz (Turbo ON), 18 Cores/Socket, 36 Cores, 72 Threads (HT off), DDR4 128GB, 2400 MHz, CentOS release 6.7 (Final)

Intel® Xeon Phi™ processor 7210 (64 cores): Intel® Xeon Phi™ processor 7210 64 core, 256 threads, 1300 MHz core freq. (Turbo ON), MCDRAM 16 GB 6.4 GT/s, BIOS GVPRCRB1.86B.0010.R00.1603251732, DDR4 96GB 2133 MHz, Red Hat 7.2(Maipo), quad cluster mode, MCDRAM cache mode

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), MCDRAM 16 GB 7.2 GT/s, BIOS GVPRCRB1.86B.0010.D42.1604182214, DDR4 96GB 2400 MHz, Red Hat 7.2(Maipo), quad cluster mode, MCDRAM cache mode

For more complete information about compiler optimizations, see our Optimization Notice.