omp, more threads more slowly

Recently I ported a program to MIC.

I just added an OpenMP pragma, but the more threads I use, the slower it runs.

For the schedule, I tried auto, "guided, 10", and "static, 10".

One thread needs 4 minutes, 2 threads need 6 minutes, ... and 240 threads need 10 hours.

I'm not familiar with OpenMP, so can anyone help?


fl eval_pairs_deriv(const precalculate& p, fl v, const interacting_pairs& pairs, const vecv& coords, vecv& forces) {
    const fl cutoff_sqr = p.cutoff_sqr();
    fl e = 0;

    #pragma omp parallel for reduction(+:e) schedule(runtime)
    for(int i = 0; i < (int)pairs.size(); ++i) {
        const interacting_pair& ip = pairs[i];
        vec r; r = coords[ip.b] - coords[ip.a]; // a -> b
        fl r2 = sqr(r);
        double tt = 0;

        if(r2 < cutoff_sqr) {
            pr tmp = p.eval_deriv(ip.type_pair_index, r2);
            vec force; force = tmp.second * r;
            curl(tmp.first, force, v);
            tt = tmp.first;
            forces[ip.a] -= force;
            forces[ip.b] += force;
            e += tt; // accumulate into the reduction variable
        }
    }
    return e;
}


Your code looks as if it has at least one correctness problem, aside from performance issues.

Looking at correctness first: once you execute your for loop in parallel, you have potentially racy accesses to the "forces" array (remember that += and -= are not atomic operations). The simplest (but most costly) fix is to make these updates atomic by rewriting that section like this:

#pragma omp atomic
    forces[ip.a] -= force;
#pragma omp atomic
    forces[ip.b] += force;

There are probably smarter ways to rewrite this using thread local arrays which you reduce by hand.

On performance: 

  1. What is the typical value of pairs.size()? That is all the parallelism you have. To run well on N logical CPUs you want it to be ~10*N, so to exploit 240 threads you want it to be ~2400.
  2. The amount of work in each iteration here looks rather small, so the cost of going parallel and then loop scheduling (particularly if you use a non-static schedule) may dominate.
  3. Have you investigated vectorization of this loop? That's a potentially productive thing to do too.
  4. Have you investigated scaling of the loop on Xeon, or looked at what's going on with VTune?



The problem I see with this loop is that the use of OOP (as performed in this program) interferes with vectorization.

While OOP eases the abstraction of the program, it also makes it nearly impossible to vectorize.

You have the equivalent of a race car with a V16 engine running on one cylinder (or a V8 on one cylinder). One of the major features of the MIC is its 512-bit vector support (16 floats, 8 doubles). Using anything less is inefficient.

The most efficient organization for vectorization is SoA (structure of arrays). You generally have a container that internally uses SoA but also has member functions to extract individual objects (AoS-style), and/or member functions to extract individual member variables of a given object within the container, or a composite such as vec getPos(n) that returns the position vector of particle n. The force calculations for the entire collection would be performed with a single container member function call (.calculateForces()); the force calculations would NOT use getPos(n), as that would defeat the vectorization.

Jim C also points out the race condition in the force accumulation. He shows the atomic fix and explains why it is costly.

For this type of problem (exclusive of vectorization), you partition the particle space (object space) into N groups, where N is typically larger than 2x the number of threads.

You then have one parallel loop over the N groups, with each thread computing the intra-group forces (this way no two threads update the same force).

Then you have a different type of parallel loop (a pairing permutation), where you choose pairings of groups such that no two threads work on the same group at the same time; this loop computes the inter-group forces (again, no two threads update the same force).

You have a trade-off between the number of groups and thread idle time as you near completion of the force calculations. This can be mitigated with different-sized sub-groups: near the end of the force calculations, when some threads become idle, you want the remaining pairings to be working on the smallest of the group pairs.

The above removes the requirement of using the expensive atomic operations.

Note, you can (and may already) do the above with repeated calls to eval_pairs_deriv, where any one particle is listed only once in the complete set of pairs.

Before doing this, rework your data layout to favor vectorization.

Jim Dempsey
