High Bandwidth Memory (HBM): how will it benefit your application?

Purpose

The first step towards using MCDRAM, or High Bandwidth Memory (HBM), is assessing your application's memory bandwidth utilization.

This article provides basic instructions on how to profile and evaluate memory bandwidth utilization for your application using Intel® VTune™ Amplifier on Intel® Xeon® processors (IvyBridge/Haswell) and Intel® Xeon Phi™ coprocessors (Knights Corner).

Instructions

Collecting the Bandwidth Profile on Intel® Xeon® processors and/or Intel® Xeon Phi™ coprocessors using Intel® VTune™ Amplifier

Viewing the Bandwidth Profile on Intel® Xeon® processors and/or Intel® Xeon Phi™ coprocessors using Intel® VTune™ Amplifier

  1. Source the environment for the latest version of the Intel® VTune™ Amplifier.
    • Make sure to use Intel® VTune™ Amplifier 2015 Update 1 or later.

      Example:
        source /opt/intel/vtune_amplifier_xe_2015.<version>/amplxe-vars.sh
  2. Create an appropriate run script to run your application on Intel® Xeon® and/or Intel® Xeon Phi™.
    • Make sure to compile your application with "-g" to provide debug information.
  3. Make sure to set the paths needed to resolve any compiler and MPI dependencies for your application running on Xeon or Xeon Phi.

    Example:
     source /opt/intel/composer_xe_2015.1.133/bin/compilervars.sh intel64
     source /opt/intel/impi/5.0.2.044/bin/mpivars.sh
  4. Collecting Bandwidth:

    Intel® Xeon® processors (IvyBridge/Haswell)

    amplxe-cl -collect bandwidth -r <your-result-dir>  -- ./<xeon-binary> (or run script)

    Intel® Xeon Phi™ coprocessors (Knights Corner)

     Syntax when running the application natively on the Xeon Phi coprocessor
     (e.g. ssh mic0 "cd /tmp ; ./a.out")

     Run the amplxe-cl command from the host only:
    amplxe-cl -target-system=mic-native:`hostname`-mic<N> -collect bandwidth -r <your-result-dir> -- <full-path-to-app-to-launch-on-TARGET_CARD>
     

    Syntax when running the application on the Xeon Phi coprocessor from the host (e.g. mpirun from the host, offload, OpenCL, etc.):

    Run the amplxe-cl command from the host only:

    amplxe-cl -target-system=mic-host-launch:`hostname`-mic<N> -collect bandwidth -r <your-result-dir> -- <full-path-to-app-to-launch-on-host>

  5. Some additional handy VTune options:
    • By default, VTune limits the result data for any profile collection to 500 MB. You can add the following option to your "amplxe-cl" command to increase the size or even make it unlimited (by specifying 0):

      -data-limit=<MB> (default is 500)

      Limits the amount of raw data to be collected. For unlimited data size, specify 0.

    • If your application is very large and long-running, you can reduce the size of the collected data by adding the option below to your "amplxe-cl" command:

      -target-duration-type=veryshort | short | medium | long (default is 'short')

      This value affects the size of the collected data. For long-running targets, the sampling interval is increased to reduce the result size. For hardware event-based analysis types, the duration estimate affects a multiplier applied to the configured "Sample after" value.

  6. After collecting the bandwidth profile for your application, the next step is to view and analyze the results.
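The collection steps above can be sketched as a single wrapper script. Everything below is illustrative: the mic index, result-directory name, and application path are assumptions, and the actual launch is commented out so the script only prints the command it would run.

```shell
#!/bin/bash
# Sketch: assemble the native Knights Corner bandwidth-collection command
# from step 4. The mic index, result dir, and app path are assumptions.
build_cmd() {   # usage: build_cmd <mic-index> <result-dir> <app-path>
    echo "amplxe-cl -target-system=mic-native:$(hostname)-mic$1" \
         "-collect bandwidth -r $2 -- $3"
}

cmd=$(build_cmd 0 bw_result /tmp/a.out)
echo "$cmd"      # inspect the command line first
# eval "$cmd"    # uncomment to launch the actual collection
```

Printing the command before launching makes it easy to confirm the target card and result directory before a potentially long collection run.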

Viewing the Bandwidth Profile on Intel® Xeon® processors and/or Intel® Xeon Phi™ coprocessors using Intel® VTune™ Amplifier

  1. You need to open a "VNC" (or other X) session to view the results in the VTune GUI.
  2. Source the environment for the latest version of the Intel® VTune™ Amplifier.
     

    Example:
    source /opt/intel/vtune_amplifier_xe_2015.<version>/amplxe-vars.sh

  3. Open the result in the VTune GUI.

    amplxe-gui <your-result-dir>

  4. In the “Summary Tab” you can see the “Average Bandwidth” reported for your application
     

    Example:

    The results in the snapshots below are from one of Sandia's Mantevo mini-apps on Intel® Xeon® processors (2-socket Haswell) and an Intel® Xeon Phi™ coprocessor (Knights Corner).

    Note for Intel® Xeon®: Package_0 is Socket 0; Package_1 is Socket 1. The bandwidth is reported for each socket of the N-Socket Xeon® processor you run on.

     

    Intel® Xeon® (Haswell)

    Intel® Xeon Phi™ Coprocessor (Knights Corner)

  5. In the “Bottom-up” tab you can observe the “Peak Bandwidth” utilized and also a time-line view of your application’s bandwidth utilization.
    • You can also see the Read Bandwidth and Write Bandwidth utilization separately in this view.

      Example:
      The results in the snapshots below are from one of Sandia's Mantevo mini-apps on Intel® Xeon® processors (Haswell) and an Intel® Xeon Phi™ coprocessor (Knights Corner).

       

      Intel® Xeon® (Haswell)
      The total peak bandwidth reported for this Dual Socket Haswell run is (52.599*2 = 105.198 GB/s)

      Intel® Xeon Phi™ coprocessor (Knights Corner)
      The total peak bandwidth reported for this Xeon Phi coprocessor run is 158.580 GB/s

  6. In addition, one can select a portion of the time-line view and then zoom in and filter on that region (by clicking and dragging on the timeline, as shown in the snapshot below). The GUI will update the bandwidth utilization numbers to reflect the zoomed-in region.
     
    • This is especially important for applications which have a long initialization. Using this zoom-in feature, we can focus only on the required part (or phase) of the application.
      Snapshot: Users can zoom in on a particular region of the time line by clicking and dragging on the time line, and then selecting “Zoom in and Filter by Selection” menu option.
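If a GUI session is not available, the same result directory can also be summarized from the command line. This is a sketch using VTune's `-report` option; check `amplxe-cl -help report` for the report types available in your version.

```shell
# Print the summary (including average bandwidth) for a collected result
# without opening the GUI; <your-result-dir> is the directory from the
# collection step above.
amplxe-cl -report summary -r <your-result-dir>
```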

Analyzing the Bandwidth Profile on Intel® Xeon® processors and/or Intel® Xeon Phi™ coprocessors using Intel® VTune™ Amplifier

  1. Understanding your application's memory bandwidth profile and its limitations is important mainly because:
    • Bandwidth bottlenecks increase the latency at which cache misses are serviced.
  2. This is even more important for the next-generation Intel® Xeon Phi™ product (Knights Landing), since its on-package high bandwidth memory (MCDRAM: up to 16 GB) will have roughly 3-4x the memory bandwidth of DDR4.
    • Hence it is important to know which data structures/hot arrays one would need to allocate in MCDRAM as opposed to DDR4.
    • But that is the next step; first, one has to understand whether the application has a large memory footprint (> MCDRAM size) and is indeed memory bandwidth limited.
  3. The theoretical memory bandwidth peaks for the Intel® Xeon® processors (IvyBridge/Haswell) and Intel® Xeon Phi™ coprocessor (Knights Corner) can be computed as follows:

    Intel® Xeon® (IvyBridge/Haswell):

    Theoretical Peak (GB/s) [Per Socket] = (MT/s) * 8 bytes/Clock * <num channels> / 1000

    Example:

     For Dual Socket Haswell (2133 MT/s; 4 Channels per Socket)
     Theoretical Peak (GB/s) [Per Socket] = (2133 * 8 * 4) / 1000 = 68.256 GB/s
     Thus, Theoretical peak for Dual Socket = 68.256 * 2 = 136.512 GB/s


    Intel® Xeon Phi™ Coprocessor (Knights Corner):

    Theoretical Peak (GB/s) = (MT/s) * 4 bytes/Clock * <num channels> / 1000

    Example:

     For the Intel® Xeon Phi™ coprocessor (Knights Corner) (5500 MT/s; 16 channels per socket)

     Theoretical Peak (GB/s) [Per Socket] = (5500 * 4 * 16) / 1000 = 352 GB/s
  4. Due to certain memory limitations and bottlenecks, it is not always possible to achieve the theoretical memory bandwidth limits.
    • Hence, it is also necessary to compare the bandwidth rate of the profiled code with a by-design bandwidth-limited benchmark (such as STREAM).

    The peak STREAM Triad performance for the specified Intel® Xeon® processors (IvyBridge/Haswell) and Intel® Xeon Phi™ coprocessor (Knights Corner) is shown below:

     IvyBridge: 2.70 GHz dual socket, 12 cores/socket, EIST/Turbo on, SMT on, 64 GB RAM DDR3 1600 8*8GB
     Haswell: 2.60 GHz dual socket, 14 cores/socket, EIST/Turbo on, SMT on, 64 GB RAM DDR4 2133 8*8GB
     Knights Corner: 1.23 GHz, 61 cores/node, 5.5 GT/s, ECC on, Turbo off

                          IvyBridge  Haswell  Knights Corner
     STREAM Triad (GB/s)  87         110      177
  5. Analyzing the obtained bandwidth vs. the peaks for the run of one of Sandia's Mantevo mini-apps shown above:

    Intel® Xeon® Processor (Haswell):

     Profiled Bandwidth from VTune (Dual Socket): 105.198 GB/s

     Theoretical Peak (Dual Socket): 136.512 GB/s

     STREAM Triad: 110 GB/s
     

    The profiled bandwidth for the application is ~77% of the theoretical peak and ~95% of the practical peak (STREAM Triad). Thus this application is indeed memory bandwidth limited on Haswell (>75% of the theoretical and/or practical peaks).

    Intel® Xeon Phi™ Coprocessor (Knights Corner):

     Profiled Bandwidth from VTune: 158.580 GB/s

     Theoretical Peak: 352 GB/s

     STREAM Triad: 177 GB/s
     

    The profiled bandwidth for the application is ~45% of the theoretical peak and ~90% of the practical peak (STREAM Triad). Thus this application is indeed memory bandwidth limited even on Knights Corner (>75% of the theoretical and/or practical peaks).

  6. The next step in this exercise would be to understand the application's data structures, find the bandwidth-hungry arrays, obtain their memory profiles, and allocate those arrays in the on-package High Bandwidth Memory (HBM) accordingly.
    • This topic will be described in another white paper in the near future.
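The peak and utilization arithmetic in this section can be reproduced with a small helper script. The function names below are our own (not part of VTune), and the numbers are the ones quoted above; awk is used for the floating-point math.

```shell
#!/bin/sh
# Helper functions (our own names, not part of any tool) that reproduce
# the bandwidth arithmetic in this section.

# Theoretical peak (GB/s) = MT/s * bytes-per-clock * channels / 1000
peak() { awk -v m="$1" -v b="$2" -v c="$3" 'BEGIN { printf "%.3f\n", m*b*c/1000 }'; }

# Measured bandwidth as a percentage of a given peak
util() { awk -v m="$1" -v p="$2" 'BEGIN { printf "%.0f%%\n", 100*m/p }'; }

peak 2133 8 4        # Haswell, per socket -> 68.256
peak 5500 4 16       # Knights Corner     -> 352.000
util 105.198 136.512 # Haswell vs. theoretical peak -> 77%
util 158.580 177     # Knights Corner vs. STREAM Triad -> 90%
```

Plugging your own DIMM speed and channel count into `peak`, and your VTune-reported bandwidth into `util`, gives the ">75% of peak" test used above for deciding whether an application is bandwidth limited.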
For more complete information about compiler optimizations, see our Optimization Notice.