Stack overflow on 64-bit server but not on 32-bit notebook

One of my recent tasks is to parallelize a fairly large Fortran 90 program. The subroutine I am targeting spends most of its time in a computationally intensive loop. I have tried the following:

!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(array2D_1, array2D_2, .., scalar1, scalar2, .., scalarN)

<the calculations>

!$OMP END PARALLEL DO


A Parallel Stable Sort Using C++11 for TBB, Cilk Plus, and OpenMP

This article describes a parallel merge sort code and explains why it is more scalable than parallel quicksort or parallel samplesort. The code relies on C++11 move semantics. It also points out a scalability trap to watch out for with C++. The attached code has implementations in Intel® Threading Building Blocks (Intel® TBB), Intel® Cilk™ Plus, and OpenMP*.

Recipe: Building and Optimizing the Hogbom Clean Benchmark for Intel® Xeon Phi™ Coprocessors


    This article provides a recipe for compiling and running the Hogbom Clean benchmark for the Intel® Xeon Phi™ coprocessor and discusses the various optimizations applied to the code. 

The Chronicles of Phi - part 5 - Plesiochronous phasing barrier – tiled_HT3

    For the next optimization, I knew what I wanted to do; I just didn’t know what to call it. In looking for a word that describes loosely synchronous, I came across plesiochronous:

    In telecommunications, a plesiochronous system is one where different parts of the system are almost, but not quite, perfectly synchronized.

    The Chronicles of Phi - part 4 - Hyper-Thread Phalanx – tiled_HT2

    The prior part (3) of this blog showed the effects of the first-level implementation of the Hyper-Thread Phalanx. The change yielded a 9.7% performance improvement for the small model, and little to no improvement for the large model. This left part 3 of this blog with the questions:

    What is non-optimal about this strategy?
    And: What can be improved?

    There are two things: one is obvious, and the other is not.

    Data alignment

    Building on Mac OS X 10.9

    I'm trying to build openmprtl on Mac OS X 10.9, to be used with the OpenMP/Clang project. Is this supposed to be possible? New in 10.9 is that gcc is just an alias for clang, which may be confusing the build scripts.

    I try to build with:

    make compiler=clang

    And I get a build error, "Cannot parse GNU compiler version", because the build runs gcc and gets clang output (as gcc is just an alias for clang on 10.9). I was expecting that with "compiler=clang" the build would not look for gcc at all.

    How do I know in which core my thread is running

    Hello guys.

    I'm trying to scale a for loop, but I'm getting worse results.

    My serial code runs in 30 s, but my OpenMP version completes in 200 s.

    This is my pragma.

    int procs = omp_get_num_procs();
    #pragma omp parallel for num_threads(procs) \
        shared(u, v, w, nx, ny) private(i, j, k) \
        reduction(+: a, b, c, d, e, f, g, h)

    And these are my OpenMP exports:

    export OMP_NUM_THREADS=5
    export KMP_AFFINITY=verbose,scatter 

    And this is my verbose affinity output, running on 1 node with 8 cores

    The Chronicles of Phi - part 3 Hyper-Thread Phalanx – tiled_HT1 continued

    The prior part (2) of this blog provided a header and a set of functions that can be used to determine the logical core and the logical Hyper-Thread number within the core. This determination is then used in an optimization strategy called the Hyper-Thread Phalanx.
