An optimization problem on summing large arrays

fyten1985's picture

This is an optimization problem on summing large arrays.

There are two arrays of type double; the code for this problem is as follows:

#pragma omp parallel for
for (long i = 0; i < 5000000; i++)
{
    array1[i] += array2[i];
}

My computer is a "Dell PowerEdge 2900III 5U" with two Xeon 5420 CPUs and 48 GB of memory.

The OS is MS Windows Server 2003 R2 Enterprise x64 Edition SP2.

The C++ compilers are VC++ 2008 and Intel C++ 11.0.061, and the solution platform is x64.

I compiled the program with both VC++ and the Intel compiler; the two results were basically the same.

Then I used a function from Intel MKL 10.1 to do the computation, as follows:

cblas_daxpy(5000000, 1, array2, 1, array1, 1);

The performance of the program was no different.

Then I used another function from Intel MKL 10.1:

vdAdd( n, a, b, y );

Program performance decreased significantly, to only about 80% of the original.
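
It is worth noting that vdAdd computes y[i] = a[i] + b[i] into an output array, so calling it with a distinct y streams roughly a third more data through memory than the in-place loop (four 40 MB streams instead of three, counting the write-allocate of y), which could account for the slowdown observed above. A minimal sketch of the in-place form, assuming the VML documentation for the MKL version in use permits the output to coincide with an input:

#include <mkl.h>

// In-place addition: the result overwrites array1, matching the OpenMP loop.
vdAdd(5000000, array1, array2, array1);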

I would like to know how to optimize this code to improve the program's performance.

Tim Prince's picture

Can you pick a single forum for this question?
If the performance of a vectorizing and a non-vectorizing OpenMP compiler are the same, and the extra operation in daxpy doesn't slow it down, it tends to confirm that memory performance is the main issue.
You might try 2 or 4 threads with the Intel OpenMP library, and set KMP_AFFINITY=scatter.
Further, you might experiment with software prefetch to get 4KB pages of both arrays in advance.
Up to a year ago, your CPU had the best memory performance of any Intel CPU, but of course it isn't competitive now.
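
For reference, the thread-count and affinity suggestion can be applied on Windows by setting environment variables before launching the program (KMP_AFFINITY is specific to the Intel OpenMP runtime):

set OMP_NUM_THREADS=2
set KMP_AFFINITY=scatter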

bigknife's picture

Hi, tim18,

How to prefetch 4 KB pages of both arrays in advance? Thanks!
Dmitry Vyukov's picture

Use _mm_prefetch(addr, _MM_HINT_NTA) from <xmmintrin.h> to do the prefetch.
You may also consider storing data with non-temporal store instructions; this can reduce memory bus pressure. Check out the _mm_stream_si128(addr, value) intrinsic.
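
For illustration, a minimal sketch of the streaming-store pattern; _mm_stream_pd is the double-precision counterpart of _mm_stream_si128 (and note Tim's caveat in the next post: it does not apply to this particular loop, where the destination is also read):

#include <cstddef>
#include <emmintrin.h>  // _mm_loadu_pd, _mm_stream_pd, _mm_sfence

// dst = src using non-temporal stores that bypass the cache.
// Assumes dst is 16-byte aligned and size is a multiple of 2.
void copy_nt(double* dst, const double* src, size_t size)
{
    for (size_t i = 0; i < size; i += 2)
        _mm_stream_pd(dst + i, _mm_loadu_pd(src + i));
    _mm_sfence();  // make the streaming stores globally visible
}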

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Tim Prince's picture

Non-temporal stores certainly should be considered for arrays which overrun the last-level cache. The arrays in this example aren't eligible, as both are read.

Dmitry Vyukov's picture

Tim, you are right. I missed that the output array is also read.
Actually, I'm not even sure about _mm_prefetch() with _MM_HINT_NTA. AFAIK, non-temporal prefetch is done into L2 on modern Intel processors, but non-temporal data occupies at most 1 way of the N-way associative L2 cache to minimize pollution (info from the Optimization Reference Manual, search for PREFETCHNTA). So if it happens that both arrays are equally aligned (which is quite likely since the arrays are big), they will evict each other. Probably _MM_HINT_T0 would be more appropriate. Hmm... Tim, can you comment on this?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Tim Prince's picture

I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.
A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

Dmitry Vyukov's picture

> I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.

Thank you!

> A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active)

What is 'adjacent sector prefetch'? Does the hardware prefetch the current and the next cache line on every software prefetch?

> A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

Btw, recursive C++ templates are good at unrolling loops. And note that PREFETCH does not cause segmentation/paging faults, so it's OK to prefetch beyond the end of the arrays.
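
For illustration, a minimal sketch of the template-unrolling idea (the Unroll name is hypothetical):

#include <cstddef>

// Recursively unrolled element-wise add: Unroll<8>::add(a, b) expands into
// eight straight-line additions at compile time.
template<size_t N>
struct Unroll
{
    static void add(double* a, const double* b)
    {
        Unroll<N - 1>::add(a, b);
        a[N - 1] += b[N - 1];
    }
};

template<>
struct Unroll<0>
{
    static void add(double*, const double*) {}
};

// Usage inside the main loop: Unroll<8>::add(array1 + i, array2 + i);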

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
bigknife's picture

Tim,

Can you provide a source-code solution to this problem, to illustrate how to "unroll the loop and issue a single prefetch each time"?

By the way, I would like to provide some details about the Intel Xeon E5420: Cores: 4; Threads: 4; L1 Data Cache: 4 x 32 KB, 8-way; L1 Instruction Cache: 4 x 32 KB, 8-way; L2 Cache: 2 x 6144 KB, 24-way. So the computer system has 8 cores, and 8 threads are used in the computation. Thank you very much, Tim! And Dmitriy's comments and solution are eagerly awaited.
Dmitry Vyukov's picture

My humble guess is:

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

void sum(double* array1, double* array2, size_t size)
{
    size_t size0 = size / 8 * 8;
    for (size_t i = 0; i != size0; i += 8)
    {
        // One prefetch per unrolled iteration, fetching data 4096 bytes ahead.
        _mm_prefetch((char*)(array1 + i) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + i) + 4096, _MM_HINT_T0);

        array1[i + 0] += array2[i + 0];
        array1[i + 1] += array2[i + 1];
        array1[i + 2] += array2[i + 2];
        array1[i + 3] += array2[i + 3];
        array1[i + 4] += array2[i + 4];
        array1[i + 5] += array2[i + 5];
        array1[i + 6] += array2[i + 6];
        array1[i + 7] += array2[i + 7];
    }

    // Handle the remaining size % 8 elements.
    for (size_t i = size0; i != size; i += 1)
    {
        array1[i] += array2[i];
    }
}

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov's picture

As for parallelization, the situation is more interesting.
Since your system is 2-processor and QPI-based, it's a NUMA system, i.e. each NUMA node (a processor in your case) has dedicated memory bandwidth. So parallelization, if done correctly, should speed up the program.
I believe that Win2k3 uses a so-called 'first-touch' NUMA allocation policy. So if you initialize your arrays in the main thread, they will be placed on a single NUMA node, and thus there is no speed-up from parallelization.
What you need is as follows: allocate the first half of the arrays on the first NUMA node and the second half on the second NUMA node; then process the first half on the first NUMA node and the second half on the second NUMA node.
I am not sure as to whether OpenMP guarantees stable binding of threads to data or not (i.e. given static schedule thread 0 always processes indexes from 0 to N/2 and thread 1 indexes from N/2 to N). But if it's the case (likely) then you just need something along the lines of:

// set KMP_AFFINITY=scatter; it's important: it will distribute your threads across NUMA nodes

#   pragma omp parallel for schedule(static) num_threads(2)
    for (int i = 0; i < size; i++)
    {
        // initialize array1[i] and array2[i]
        array1[i] = i + 100;
        array2[i] = i + 200;
    }

    // ...

#   pragma omp parallel for schedule(static) num_threads(2)
    for (int i = 0; i < size / 8; i++)
    {
        int j = i * 8;

        _mm_prefetch((char*)(array1 + j) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + j) + 4096, _MM_HINT_T0);

        array1[j + 0] += array2[j + 0];
        array1[j + 1] += array2[j + 1];
        array1[j + 2] += array2[j + 2];
        array1[j + 3] += array2[j + 3];
        array1[j + 4] += array2[j + 4];
        array1[j + 5] += array2[j + 5];
        array1[j + 6] += array2[j + 6];
        array1[j + 7] += array2[j + 7];
    }

    for (int i = size / 8 * 8; i != size; i += 1)
    {
        array1[i] += array2[i];
    }

Tim, does what I am saying make sense?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov's picture

Ah, OK, ignore everything I said. I've just checked: the Xeon 5420 is an FSB-based system:
http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
bigknife's picture
Maybe you missed the OpenMP pragma. I think your code should be:

void sum(double* array1, double* array2, size_t size)
{
    size_t size0 = size / 8 * 8;
#pragma omp parallel for
    for (size_t i = 0; i < size0; i += 8)  // '<' rather than '!=': OpenMP requires a relational test
    {
        _mm_prefetch((char*)(array1 + i) + 4096, _MM_HINT_T0);
        _mm_prefetch((char*)(array2 + i) + 4096, _MM_HINT_T0);
        array1[i + 0] += array2[i + 0];
        array1[i + 1] += array2[i + 1];
        array1[i + 2] += array2[i + 2];
        array1[i + 3] += array2[i + 3];
        array1[i + 4] += array2[i + 4];
        array1[i + 5] += array2[i + 5];
        array1[i + 6] += array2[i + 6];
        array1[i + 7] += array2[i + 7];
    }
    for (size_t i = size0; i != size; i += 1)
    {
        array1[i] += array2[i];
    }
}
bigknife's picture
    Quoting Dmitriy Vyukov: Ah, OK, ignore everything I said. I've just checked: the Xeon 5420 is an FSB-based system:
    http://ark.intel.com/Product.aspx?id=33929&processor=L5420&spec-codes=SLARP,SLBBR

    Then, what should we do?

    Dmitry Vyukov's picture

    Then there is nothing we can do with the program as-is. You have hit the memory bandwidth wall; the system is simply incapable of moving more memory per unit of time.

    However, you can try to reduce the memory bandwidth requirements. The easiest thing is to switch to floats (if that's acceptable, of course). You can also try to combine some phases of your application. For example, if you initialize, process and output the arrays in a single loop, then the amount of computation per unit of memory traffic will be much higher.
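
    A minimal sketch of the phase-combining idea, reusing the i + 100 / i + 200 initializers from the earlier example (the real initialization and output phases are application-specific):

    // Fused initialization + addition: each element is touched while it is
    // still in cache, instead of streaming both arrays through memory twice.
    for (size_t i = 0; i != size; i += 1)
    {
        array1[i] = i + 100;
        array2[i] = i + 200;
        array1[i] += array2[i];
    }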

    All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
    bigknife's picture
    Quoting tim18 I believe the hints are ignored on recent CPUs, including the one mentioned here, so all hints (or none) would be equal. Also, I believe 2-way (or more) associativity should be sufficient to avoid complete eviction of one array by the other.
    A single prefetch would be sufficient to bring in 2 cache lines (taking into account adjacent sector prefetch, if active), not an entire 4K page. A typical strategy is to unroll the loop and issue a single prefetch each time through, so as to reduce the number of redundant prefetches.

    I was told that the VC++ 2008 and Intel C++ 11.0 compilers can unroll loops very well.

    And I did some experiments to test it. The code is just like:

    {
        size_t size0 = size / 8 * 8;
    #pragma omp parallel for
        for (size_t i = 0; i < size0; i += 8)
        {
            array1[i + 0] += array2[i + 0];
            array1[i + 1] += array2[i + 1];
            array1[i + 2] += array2[i + 2];
            array1[i + 3] += array2[i + 3];
            array1[i + 4] += array2[i + 4];
            array1[i + 5] += array2[i + 5];
            array1[i + 6] += array2[i + 6];
            array1[i + 7] += array2[i + 7];
        }
        for (size_t i = size0; i != size; i += 1)
        {
            array1[i] += array2[i];
        }
    }

    I even changed the step from 8 to 16 and 32, but the experimental results showed only a small improvement.
    Dmitry Vyukov's picture

    > I even changed the step from 8 to 16 and 32, but the experimental results showed only a small improvement.

    The point is that you need to issue a prefetch on every 8th iteration. You need an additional 'if' for that.
    Probably the compiler will unroll the loop and then eliminate the additional if's, but if you do it manually you are on the safer side.

    All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
    bigknife's picture
    Quoting Dmitriy Vyukov: > I even changed the step from 8 to 16 and 32, but the experimental results showed only a small improvement.

    The point is that you need to issue a prefetch on every 8th iteration. You need an additional 'if' for that.
    Probably the compiler will unroll the loop and then eliminate the additional if's, but if you do it manually you are on the safer side.

    Probably you are right. But how do I add the 'if'?

    Dmitry Vyukov's picture
    for (size_t i = 0; i != size; i += 1)
    {
        if ((i % 8) == 0)
            _mm_prefetch((char*)(array1 + i) + 4096, ...);
        array1[i] += array2[i];
    }
    All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
    bigknife's picture
    Quoting Dmitriy Vyukov:

    for (size_t i = 0; i != size; i += 1)
    {
        if ((i % 8) == 0)
            _mm_prefetch((char*)(array1 + i) + 4096, ...);
        array1[i] += array2[i];
    }

    Then is explicit loop unrolling unnecessary?

    Do you believe that we would get a big performance improvement that way?

    Dmitry Vyukov's picture

    It depends.
    It depends on what you consider a "big improvement", what compiler you are using, and how much you want to depend on implementation details of the compiler.
    If performance is important to you, then manual loop unrolling puts you on the safer side. However, a lot of developers do not bother themselves with such things at all (parallelization, prefetching, etc.).

    All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
    Tim Prince's picture

    Intel C++ documentation includes examples of #pragma prefetch usage. Unrolling the loop by itself isn't likely to be effective in a memory-limited case; the point is to ensure that time isn't wasted by issuing too many extra prefetches.
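
    As a rough sketch of what such a pragma looks like on this loop (Intel C++ only; the exact form, including the optional var:hint:distance arguments, should be checked against the compiler documentation):

    // Ask the Intel compiler to generate prefetches for both arrays.
    #pragma prefetch array1, array2
    for (long i = 0; i < 5000000; i++)
    {
        array1[i] += array2[i];
    }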

    bigknife's picture
    Quoting tim18: Intel C++ documentation includes examples of #pragma prefetch usage. Unrolling the loop by itself isn't likely to be effective in a memory-limited case; the point is to ensure that time isn't wasted by issuing too many extra prefetches.

    So, Tim, I am eagerly looking forward to your source-code solution to this specific problem!

    bigknife's picture
    Quoting Dmitriy Vyukov: It depends.
    It depends on what you consider a "big improvement", what compiler you are using, and how much you want to depend on implementation details of the compiler.
    If performance is important to you, then manual loop unrolling puts you on the safer side. However, a lot of developers do not bother themselves with such things at all (parallelization, prefetching, etc.).

    "It depends" are political men's words. Just kidding! :)

    We are facing this specific problem with the VC++ 2008 and Intel C++ 11.0 compilers, running on a Dell server with 2 Xeon 5420 CPUs and 48 GB of memory.

    I have been doing a lot of work on it, but have not achieved a good improvement so far.

    Dmitry Vyukov's picture

    Well, simple addition of numbers is memory-bandwidth bound, so little can be done here. A single addition is just too small an amount of work per 16 bytes of memory.
    I guess Tim mentioned unrolling just as a good habit.
    To get a good improvement on this particular problem you must somehow reduce the memory bandwidth requirements (http://software.intel.com/en-us/forums/showpost.php?p=114059) or upgrade to a NUMA system.

    All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
    Tim Prince's picture

    Ideally, #pragma prefetch would take care of issuing a prefetch once per automatically unrolled loop. I'm not in a position to check this, but if this is an isolated loop (as implied above), and if there's a reason to make this effort for a possible small performance improvement on the old hardware, it may be worth a try.

    grant-haab (Intel)'s picture

    Dmitriy wrote:
    "I am not sure as to whether OpenMP guarantees stable binding of threads to data or not (i.e. given static schedule thread 0 always processes indexes from 0 to N/2 and thread 1 indexes from N/2 to N)."

    The current OpenMP v3.0 API (Table 2.1, p.40) does not guarantee this property for the code you included:

    "A compliant implementation of static schedule must ensure that the sameassignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied:
    1) both loop regions have the same number of loop iterations,
    2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and
    3) both loop regions bind to the same parallel region."

    In your code, only condition 2) is satisfied. However, if you unroll the first loop like the second, and stick them in the same parallel region, then the property you seek is guaranteed.
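
    A minimal sketch of that arrangement, assuming the arrays and size from earlier in the thread (and size % 8 == 0): both loops have the same iteration count, the same default chunk_size, and bind to the same parallel region, so all three conditions hold:

    void init_and_sum(double* array1, double* array2, long size)
    {
    #pragma omp parallel
        {
    #pragma omp for schedule(static)
            for (long i = 0; i < size; i += 8)  // initialization, unrolled like the add loop
                for (long k = 0; k < 8; k++)
                {
                    array1[i + k] = i + k + 100;
                    array2[i + k] = i + k + 200;
                }

    #pragma omp for schedule(static)
            for (long i = 0; i < size; i += 8)  // the add, same logical iteration space
                for (long k = 0; k < 8; k++)
                    array1[i + k] += array2[i + k];
        }
    }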

    Hope that helps.

    - Grant

    - Grant Haab OpenMP Language Committee Member TBB/OpenMP Software Engineer
    Dmitry Vyukov's picture

    Thank you, Grant!

    Hmm... can I specify a nowait clause for the first loop then? Just curious.

    All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
    farneville's picture

    Wow, I am getting lots of info here; as a noob I am really learning. Thanks for the code, this is what I am trying to figure out.

    bigknife's picture
    Quoting Grant Haab (Intel)

    Dmitriy wrote:
    "I am not sure as to whether OpenMP guarantees stable binding of threads to data or not (i.e. given static schedule thread 0 always processes indexes from 0 to N/2 and thread 1 indexes from N/2 to N)."

    The current OpenMP v3.0 API (Table 2.1, p.40) does not guarantee this property for the code you included:

    "A compliant implementation of static schedule must ensure that the sameassignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied:
    1) both loop regions have the same number of loop iterations,
    2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and
    3) both loop regions bind to the same parallel region."

    In your code, only condition 2) is satisfied. However, if you unroll the first loop like the second, and stick them in the same parallel region, then the property you seek is guaranteed.

    Hope that helps.

    - Grant

    Anyway, Grant, could you provide a source-code solution for this specific problem that achieves a significant performance improvement?

    grant-haab (Intel)'s picture

    Dmitriy wrote:

    > Hmm... can I specify a nowait clause for the first loop then? Just curious.

    Yes! As long as there are no dependences between corresponding chunks of the two loops. (Assuming, of course, no dependences between different chunks of the same loop, so that they can run in parallel as well.)
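
    Applied to the sketch above, the clause would go on the first loop, letting each thread start its chunk of the add as soon as it finishes its own chunk of the initialization (a hypothetical fragment):

    #pragma omp for schedule(static) nowait
        for (long i = 0; i < size; i += 8)
        {
            // ... initialization as in the earlier sketch ...
        }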


    - Grant Haab OpenMP Language Committee Member TBB/OpenMP Software Engineer
    Dmitry Vyukov's picture

    > Yes! As long as there are no dependences between corresponding chunks
    of the two loops.

    Ok, got it. Thank you.


    All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
    grant-haab (Intel)'s picture

    Bigknife,

    Likely I cannot provide a performance improvement for such code because it is memory-bandwidth limited, as Dmitriy already stated. There is simply not enough cache reuse of the data to overcome the more-than-100:1 time ratio of data fetch/store versus floating-point addition. To do that would require an architecture that allowed hundreds of outstanding memory loads/stores to be in flight simultaneously, which doesn't exist in commercial hardware.

    I suggest that you parallelize your code at a higher granularity than a simple vector addition. Surely your entire code doesn't just add two vectors together. To get real multi-threaded speedup on any machine, you need to do more computation with the data per memory fetch and/or make it fit into the cache, 1) so it doesn't take so long to fetch it each time AND 2) so that it doesn't saturate the memory system. Prefetching can only address the long fetch time; it cannot solve the memory-system saturation problem.

    Is it possible to apply OpenMP to a loop that contains the array-add loop?

    - Grant Haab OpenMP Language Committee Member TBB/OpenMP Software Engineer
    bigknife's picture
    Grant,

    Thank you very much for your suggestions. But why would parallelizing the code at a higher granularity be better than a simple vector addition? As we know, each vector in the code has a length of 5,000,000 and needs more than 38 MB of storage, which is much bigger than the cache size (12 MB) of the E5420. Therefore, if we need to add 100 vectors to one vector, wouldn't 'applying OpenMP to a loop that contains the array add loop' cause a higher cache-miss rate?

    Best regards,
    Bigknife

    Quoting Grant Haab (Intel): Bigknife,

    Likely I cannot provide a performance improvement for such code because it is memory-bandwidth limited, as Dmitriy already stated. There is simply not enough cache reuse of the data to overcome the more-than-100:1 time ratio of data fetch/store versus floating-point addition. To do that would require an architecture that allowed hundreds of outstanding memory loads/stores to be in flight simultaneously, which doesn't exist in commercial hardware.

    I suggest that you parallelize your code at a higher granularity than a simple vector addition. Surely your entire code doesn't just add two vectors together. To get real multi-threaded speedup on any machine, you need to do more computation with the data per memory fetch and/or make it fit into the cache, 1) so it doesn't take so long to fetch it each time AND 2) so that it doesn't saturate the memory system. Prefetching can only address the long fetch time; it cannot solve the memory-system saturation problem.

    Is it possible to apply OpenMP to a loop that contains the array-add loop?

    grant-haab (Intel)'s picture
    Quoting bigknife
    But why would parallelizing the code at a higher granularity be better than a simple vector addition? As we know, each vector in the code has a length of 5,000,000 and needs more than 38 MB of storage, which is much bigger than the cache size (12 MB) of the E5420. Therefore, if we need to add 100 vectors to one vector, wouldn't 'applying OpenMP to a loop that contains the array add loop' cause a higher cache-miss rate?

    Well, I didn't know what the outer loop did until you told me above. If you are adding 100 vectors of length 5 million, then memory bandwidth problems are going to limit speedup no matter what you try.

    Is that all the algorithm does with the 100 vectors? Is there another algorithmic step outside of the 100 vector additions (either around or after the vector adds) that might reuse some of the data? If not, then you are probably getting nearly all the speedup you can. If there is another step, then blocking the computation and use of the vectors, so that the results fit into the cache and can be reused, could result in better speedup.
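
    As an illustration of the blocking idea (the result/vectors names and the follow-up step are hypothetical): accumulate all 100 vectors into one cache-sized chunk of the result before moving to the next chunk, so each chunk of the result is loaded and stored once instead of 100 times; the source vectors themselves are still streamed once each:

    // BLOCK doubles = 512 KB, comfortably inside the E5420's 6 MB L2.
    const size_t BLOCK = 64 * 1024;
    for (size_t base = 0; base < size; base += BLOCK)
    {
        size_t end = (base + BLOCK < size) ? (base + BLOCK) : size;
        for (int v = 0; v < 100; v++)          // all 100 vectors touch this chunk
            for (size_t i = base; i < end; i++)
                result[i] += vectors[v][i];
        // ... a follow-up step that reuses result[base..end) would go here ...
    }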

    But without knowing the entire algorithm or code, it is very difficult to suggest anything more detailed than what I've already suggested. Could you post more of the code so we can see what algorithm you are implementing?

    - Grant Haab OpenMP Language Committee Member TBB/OpenMP Software Engineer
    bigknife's picture
    Quoting Grant Haab (Intel)
    Quoting bigknife
    But why would parallelizing the code at a higher granularity be better than a simple vector addition? As we know, each vector in the code has a length of 5,000,000 and needs more than 38 MB of storage, which is much bigger than the cache size (12 MB) of the E5420. Therefore, if we need to add 100 vectors to one vector, wouldn't 'applying OpenMP to a loop that contains the array add loop' cause a higher cache-miss rate?

    Well, I didn't know what the outer loop did until you told me above. If you are adding 100 vectors of length 5 million, then memory bandwidth problems are going to limit speedup no matter what you try.

    Is that all the algorithm does with the 100 vectors? Is there another algorithmic step outside of the 100 vector additions (either around or after the vector adds) that might reuse some of the data? If not, then you are probably getting nearly all the speedup you can. If there is another step, then blocking the computation and use of the vectors, so that the results fit into the cache and can be reused, could result in better speedup.

    But without knowing the entire algorithm or code, it is very difficult to suggest anything more detailed than what I've already suggested. Could you post more of the code so we can see what algorithm you are implementing?

    My code is very simple, and the only bottleneck is the addition of large vectors.

    Thank you very much, Grant!
