NOAA NIM with Support for Intel® Xeon Phi™ Coprocessor

The Non-hydrostatic Icosahedral Model (NIM) is a weather forecasting model developed by NOAA. The G6 K96 data set is smaller and scales best up to 4 cluster nodes; the G9 data set is useful for studying larger clusters. The code supports the symmetric mode of operation of the Intel® Xeon® processor (referred to as 'host' in this document) with the Intel® Xeon Phi™ coprocessor (referred to as 'coprocessor' in this document), both on a single node and in a cluster environment.

To get access to the code:

Unzip the kit to a directory named nim/. The NIM benchmark contains five main subdirectories underneath the nim/ directory: xeonphinim_r2201/, data/, sms_r604/, F2C-ACC_V5.5/ and gptl_v5.0/. Subdirectory xeonphinim_r2201/ contains the source code for NIM's dynamics core, along with scripts to build and run the NIM dynamics; data/ contains the input data sets used by the NIM dynamics core; sms_r604/ contains the Scalable Modeling System; F2C-ACC_V5.5/ contains the F2C-ACC code translator; and gptl_v5.0/ contains the General Purpose Timing Library.

Build Directions

(Disclaimer: this document describes how to build and run on a specific cluster environment referred to as Endeavor. You will need to adjust these instructions accordingly for your cluster environment.)

  1. Get an interactive compute node on Endeavor to build for Intel® Xeon® processors and for Intel® Xeon Phi™ coprocessors
    bsub -R '1*{select[kncc0] span[ptile=1]}' -q hoelleq -Is -W 700 \
      -l MIC_ULIMIT_STACKSIZE=365536 -l MIC_TIMINGDEVICE=tsc /bin/bash
  2. You will need Intel® Composer XE 2013 or newer C/C++ and Fortran compiler and Intel® MPI Library 4.1.1 or newer.
    1. You can obtain Intel® Composer XE, which includes the Intel® C/C++ and Fortran Compilers, from https://registrationcenter.intel.com/regcenter/register.aspx, or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy.
  3. Set environment variables for Intel Composer XE and Intel MPI Library.
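    As a sketch, the compiler and MPI environments can typically be initialized by sourcing the vendor-provided scripts shown below; the install paths are common defaults and are assumptions here, so adjust them for your installation.

    ```shell
    # Initialize the Intel Composer XE compiler environment for 64-bit builds.
    # The install prefix below is a typical default; yours may differ.
    source /opt/intel/composerxe/bin/compilervars.sh intel64

    # Initialize the Intel MPI Library environment (version directory is an example).
    source /opt/intel/impi/4.1.1/bin64/mpivars.sh

    # Sanity check: the Fortran compiler and MPI wrapper should now be on PATH.
    which ifort mpiifort
    ```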
  4. All scripts mentioned here are provided in the NIM application kit.
  5. Building Libraries and Tools Used by NIM (SMS, GPTL, Ruby)
    NIM can run in symmetric mode. This means we need to compile both for the host and for the coprocessor.
    1. SMS
      1. Copy or soft-link one of the (GNU Make-compatible) profile* files from sms_r604/etc/ into  the sms_r604/ directory. Rename as "profile".
      2. Edit the profile file, supplying an appropriate value for each variable. The variables most likely to require custom definitions are: CC, CCSERIAL, FC, FCSERIAL, FFLAGS, FIXED and FREE (the last two are the flags indicating Fortran fixed or free form source for the compiler indicated by the FC and FCSERIAL variables). The variable TREETOP refers to the sms_r604/ directory during the build.
      3. In the sms_r604/ directory, type "make". If it succeeds, type "make install".
      4. If "make install" succeeds, new bin/, include/ and lib/ directories will have been created under the root-level sms/ (or sms_phi/, see below) directory.

      SMS must be built twice to support symmetric mode on machines with the coprocessor. In this case, first build for the host by starting with an etc/profile.* file that does not contain "phi" in the name. This will install SMS in the root-level sms/ directory. Then clean by executing "make distclean" and rebuild starting with an etc/profile.* file that does contain "phi" in the name. This will install the coprocessor SMS in the root-level sms_phi/ directory.
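      The two-pass SMS build described above can be sketched as follows; the profile file names are placeholders, so substitute whichever host and coprocessor profiles in etc/ apply to your system.

      ```shell
      cd sms_r604

      # Pass 1: host build -- pick a profile WITHOUT "phi" in the name
      cp etc/profile.linux.intel profile    # placeholder name; use your host profile
      make && make install                  # installs into the root-level sms/ directory

      # Pass 2: coprocessor build -- pick a profile WITH "phi" in the name
      make distclean
      cp etc/profile.linux.intel.phi profile  # placeholder name; use your phi profile
      make && make install                    # installs into the root-level sms_phi/ directory
      ```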

    2. GPTL

      NIM uses the GPTL timing library (jmrosinski.github.io/GPTL/) to measure wall-clock execution times for NIM components. We used v5.0 in gptl_v5.0/. The gptl_v5.0/INSTALL file includes instructions for building the library and installing it in the root-level gptl/ directory.

      GPTL must be built twice to support symmetric mode on machines with coprocessors, such as TACC's Stampede system. In this case, first build for the host by starting with a macros.make.* file that does not contain "phi" in the name. For example, on TACC Stampede, "cp jrmacros/macros.make.tacc ./macros.make", "make", and then "make install" should work out of the box with no modifications. This will install GPTL in the root-level gptl/ directory. Then "make clean" and rebuild starting with a macros.make.* file that does contain "phi" in the name. This will install the coprocessor GPTL in the root-level gptl_phi/ directory.

    3. Ruby

      The SMS source-to-source translator requires Ruby 1.9.x with YAML support. If your system does not have Ruby, you will need to build it from source:
      1. Obtain the LibYAML source from http://pyyaml.org/wiki/LibYAML. Version 0.1.4  is known to work. Unpack the archive, configure, make and make install.
      2. Obtain the Ruby source from http://ruby-lang.org. Version 1.9.3 is known to work. Unpack the archive, configure, make and make install. NOTE that you must use the --with-opt-dir= argument when running configure, providing as its value the path to your LibYAML installation from the previous step.
      3. Verify your installation
        $ /path/to/new/ruby -e "puts RUBY_VERSION" # should print 1.9.3 or similar
        $ /path/to/new/ruby -e "require 'yaml'"    # should return silently
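      Steps 1–3 above can be sketched as the following command sequence; the archive names and install prefixes under $HOME are illustrative examples, not fixed requirements.

      ```shell
      # Build LibYAML 0.1.4 into a private prefix (example path)
      tar xzf yaml-0.1.4.tar.gz && cd yaml-0.1.4
      ./configure --prefix=$HOME/libyaml-install
      make && make install
      cd ..

      # Build Ruby 1.9.3 against that LibYAML using --with-opt-dir
      tar xzf ruby-1.9.3-p484.tar.gz && cd ruby-1.9.3-p484
      ./configure --prefix=$HOME/ruby-1.9.3-install --with-opt-dir=$HOME/libyaml-install
      make && make install
      cd ..

      # Verify the installation
      $HOME/ruby-1.9.3-install/bin/ruby -e "puts RUBY_VERSION"  # should print 1.9.3
      $HOME/ruby-1.9.3-install/bin/ruby -e "require 'yaml'"     # should return silently
      ```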
  6. Building NIM
    Go to directory xeonphinim_r2201/src.
    1. Intel® Xeon Phi™ Coprocessor build
      1. Open macros.make.endeavorxeonphi.
        Make sure SMS__RUBY and SMS point to the correct directories:
        export SMS__RUBY=/panfs/users/Xjrosin/ruby-1.9.3-p484-install/bin/ruby
        SMS = /home/Xjrosin/sms_r604/xeonphi
      2. At the command prompt, type ./makenim -i
        and select:
        1. arch: endeavorxeonphi
        2. Set underlying hardware: cpu
        3. Set parallelism: parallel
        4. Set threading: yes
        5. Set bitwise: no
        6. Set double_precision: false
      3. Two directories are created in xeonphinim_r2201/:
        1. run_endeavorxeonphi_cpu_parallel_ompyes_bitwiseno
        2. src_endeavorxeonphi_cpu_parallel_ompyes_bitwiseno
    2. Intel® Xeon® Processor build
      1. Open macros.make.endeavor.
        Make sure SMS__RUBY and SMS point to the correct directories:
        export SMS__RUBY=/panfs/users/Xjrosin/ruby-1.9.3-p484-install/bin/ruby
        SMS = /home/Xjrosin/sms_r604/xeon
      2. At the command prompt, type ./makenim -i
        and select:
        1. arch: endeavor
        2. Set underlying hardware: cpu
        3. Set parallelism: parallel
        4. Set threading: yes
        5. Set bitwise: no
        6. Set double_precision: false
      3. Two directories are created in xeonphinim_r2201/:
        1. run_endeavor_cpu_parallel_ompyes_bitwiseno
        2. src_endeavor_cpu_parallel_ompyes_bitwiseno
  7. Running NIM for the G6 data set

    cd run_endeavor_cpu_parallel_ompyes_bitwiseno

    1. Create NIMnamelist for different configurations
      1. General Settings
        1. MaxQueueTime = '30' ! Run time for the complete job (HH:MM:SS)
        2. DataDir = "/home/ajha1/panfs_users_ajha1/workloads/NIM/nim2/nimdata/G6"
          Point to your G5/G6/G9 data location
        3. G6 specific
          &CNTLnamelist
          glvl = 6   ! Grid level
          gtype = 2   ! Grid type: Standard recursive (0), Modified recursive (2), Modified great circle (3)
          SubdivNum = 2 2 2 2 2 2 2 2 2 2 2 2   ! subdivision number for each recursive refinement
          nz = 96   ! Number of vertical levels
          ArchvTimeUnit = 'ts'   ! ts:timestep; hr:hour dy:day
          itsbeg = 1
          RestartBegin = 0   ! Begin restart if .ne.0
          ForecastLength = 100   ! Total number of timesteps (100/day),(2400/hr),(2400*60/min), (9600/ts)
          ArchvIntvl = 100   ! Archive interval (in ArchvTimeUnit) to do output  (10-day), (240-hr), (240*60-min), (960-ts)
          minmaxPrintInterval = 100   ! Interval to print out MAXs and MINs
          physics = 'none'   ! GRIMS or GFS or none for no physics
          GravityWaveDrag = .true.   ! True means calculate gravity wave drag
          yyyymmddhhmm = "200707170000"   ! Date of the model run
          pertlim = 0.   ! Perturbation bound for initial temperature (1.e-7 is good for 32-bit roundoff)
          curve = 3   ! 0: ij order, 1: Hilbert curve order (only for all-bisection refinement), 2:ij block order, 3: Square Layout
          NumCacheBlocksPerPE = 1   ! Number of cache blocks per processor. Only applies to ij block order
          tiles_per_thread = 1   ! multiplies OMP_NUM_THREADS to give num_tiles for GRIMS
          dyn_phy_barrier = .false.   ! artificial barrier before and after physics for timing
          read_amtx = .false.   ! false means NIM computes amtx* arrays (MUCH faster than reading from a file!)
          writeoutput = .false.
        4. ComputeTasks = 1         ! Compute tasks for NIM (set to 1 for serial)
        5. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 240
      2. 2-Node Host
        1. ComputeTasks = 2         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 240
      3. 4-Node Host
        1. ComputeTasks = 4    ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
              cpu_cores_per_node = 48
              max_compute_tasks_per_node = 1
              omp_threads_per_compute_task = 48
              num_write_tasks = 0
              max_write_tasks_per_node = 1
              root_own_node = .false.
              icosio_debugmsg_on = .false.
              max_compute_tasks_per_mic = 0
              omp_threads_per_mic_mpi_task = 240
      4. 1-Node Coprocessor
        1. ComputeTasks = 1      ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
              cpu_cores_per_node = 240
              max_compute_tasks_per_node = 1
              omp_threads_per_compute_task = 240
              num_write_tasks = 0
              max_write_tasks_per_node = 1
              root_own_node = .false.
              icosio_debugmsg_on = .false.
              max_compute_tasks_per_mic = 0
              omp_threads_per_mic_mpi_task = 0
      5. 2-Node Coprocessor
        1. ComputeTasks = 2         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 240
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 240
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 0
      6. 4-Node Coprocessor
        1. ComputeTasks = 4         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 240
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 240
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 0
      7. 1-Node SYMMETRIC
        1. ComputeTasks = 2         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 2
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 1
             omp_threads_per_mic_mpi_task = 240
      8. 2-Node SYMMETRIC
        1. ComputeTasks = 4         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 2
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 1
             omp_threads_per_mic_mpi_task = 240
      9. 4-Node SYMMETRIC
        1. ComputeTasks = 8         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 2
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 1
             omp_threads_per_mic_mpi_task = 240
    2. Running
      1. Copy the NIMnamelist for the desired cluster configuration (1/2/4 nodes) to NIMnamelist
      2. Run on Host
        1. Execute ./endeavorsubnim.host
      3. Run on Coprocessor
        1. Execute ./endeavorsubnim.mic
      4. Run SYMMETRIC
        1. Execute ./endeavorsubnim.symm
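      Steps 1–4 above can be combined as in this sketch; NIMnamelist.symm2 is a hypothetical name for a saved 2-node SYMMETRIC configuration, and the submit scripts are those shipped in the run directory.

      ```shell
      cd run_endeavor_cpu_parallel_ompyes_bitwiseno

      # Select the configuration to run (hypothetical saved copy of a namelist)
      cp NIMnamelist.symm2 NIMnamelist

      # Launch the matching submit script:
      #   host-only:        ./endeavorsubnim.host
      #   coprocessor-only: ./endeavorsubnim.mic
      #   symmetric:        ./endeavorsubnim.symm
      ./endeavorsubnim.symm
      ```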
    3. Results
    1. Running the above script creates a directory named like
      G6_K96_NONE_P1_44452
    2. cd to that directory; it contains run files such as
      1. stdout – standard output file; check it for any errors
      2. nodeconfig.ivb.txt – MPI node config file
      3. NIMnamelist – run config file generated above
      4. taskinfo.yaml – MPI task info
      5. timing.0 - timing for MPI 0
      6. timing.summary – timing summary for overall run
        1. Grab the wallmax for MainLoop – this is your application runtime
    name       ncalls  nranks  mean_time  std_dev  wallmax (rank thread)  wallmin (rank thread)
    Total           1       1    102.552    0.000  102.552 (    0     0)  102.552 (    0     0)
    MainLoop        1       1     68.203    0.000   68.203 (    0     0)   68.203 (    0     0)
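    As a convenience, the MainLoop wallmax can be pulled out of timing.summary with a one-line awk filter. The sketch below first writes a sample timing.summary matching the layout above (so it is self-contained); in a real run directory, only the awk line is needed. The field index assumes GPTL v5.0's column layout.

    ```shell
    # Create a sample timing.summary matching the layout shown above
    cat > timing.summary <<'EOF'
    name       ncalls  nranks  mean_time  std_dev  wallmax (rank thread)  wallmin (rank thread)
    Total           1       1    102.552    0.000  102.552 (    0     0)  102.552 (    0     0)
    MainLoop        1       1     68.203    0.000   68.203 (    0     0)   68.203 (    0     0)
    EOF

    # wallmax is the 6th whitespace-separated field on the MainLoop line
    runtime=$(awk '$1 == "MainLoop" { print $6 }' timing.summary)
    echo "MainLoop wallmax: ${runtime}s"   # prints: MainLoop wallmax: 68.203s
    ```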
    
For details on compiler optimizations, see our Optimization Notice.