GROMACS recipe for symmetric Intel® MPI using PME workloads


This package (scripts plus instructions) provides a build and run environment for symmetric Intel® MPI runs; this file is the package's README. "Symmetric" means that a Xeon® executable and a Xeon Phi™ executable run together as a single MPI job, exchanging point-to-point messages and collective data through Intel MPI.

  • Developers
  • Partners
  • Students
  • Linux*
  • Server
  • C/C++
  • Intermediate
  • Intel® Parallel Studio XE Cluster Edition
  • symmetric MPI
  • native MPI
  • cmake
  • heterogeneous clusters
  • Intel® Many Integrated Core (Intel® MIC) Architecture
  • Message Passing Interface
  • OpenMP*
  • Academic
  • Cluster Computing
  • Intel® Core™ Processors
  • Intel® Many Integrated Core Architecture
  • Optimization
  • Parallel Computing
  • Porting
  • Threading