OpenMP min/max reduction

Posted by Tim Prince

A knowledgeable customer asked why there is no OpenMP min or max reduction for C, as there is for Fortran. I agreed this seems useful. If anyone has suggestions about this, it would help me in submitting a feature request. The main problem I see is the lack of a standard C operator equivalent, and the general C rather than C++ orientation of OpenMP, but now it appears a lot more esoteric C++ stuff is being coupled into OpenMP.

Posted by Michael Klemm (Intel)
Quoting - tim18

Hey tim18,

You're absolutely right: C lacks intrinsic operators for min/max. Fortran has them, which is why they were enabled in OpenMP. If you have a real-world case that would benefit from better OpenMP support for reductions (e.g. min/max, custom types), I'd be happy to gain access to it. I'm looking for good examples that justify new kinds of reductions in OpenMP, and also for the workarounds that people currently need to use.

What do you mean by esoteric C++ stuff being coupled into OpenMP? The OpenMP ARB does not focus on any one of the three languages but keeps new features compatible with all three base languages.

Cheers,
-michael

Posted by Tim Prince
Quoting - Michael Klemm, Intel


IDF lectures showed several examples of lambda operators processing a stack of queued tasks. That's esoteric to me. It also seems to obscure the answer to the question asked on the Threading forum about whether Microsoft intends to support OpenMP 3.0. I could imagine the OpenMP ARB wondering whether Intel and Microsoft are trying to fork off from OpenMP with C++0x-specific features. Maybe this has been worked out by drawing a clear distinction from OpenMP.
My own reduction code, which gives each thread groups of data segments and private variables to accumulate partial results, then combines the results in a critical section, scales to 8 threads on NHM with no problem, but doesn't scale further (to 12 threads on Westmere); it seems a little more attention to the platform would be needed there. I'll have to work on it more to see whether I can demonstrate an advantage for the Fortran built-in OpenMP reduction.

Posted by jimdempseyatthecove

Tim,

Have you considered creating a min/max reduction array, one element per thread/task, partitioning the work among the threads/tasks, and passing in the address of the results array? Then, on completion, the main thread performs the final min/max without using a critical section.

IOW, each thread computes a local min/max, then writes its local result once to the shared min/max array (no lock nor CAS). The main thread can then either wait for all threads to complete (parallel construct synchronization object) OR simply walk through the results array waiting for the preloaded values to change from HUGE_VAL (or -HUGE_VAL) (no task synchronization object).

I suspect that when you maxed out at 12 threads you hit a memory bandwidth problem. There is not much you can do after that (assuming you are already using SSE) other than reordering your code so that your min/max function proceeds while the data is in cache.

Overlapping functionality to take advantage of cache locality is often tricky, since adding this extra functionality to a well-written loop makes the loop less generalized and also tends to insert branches where there were none (slowing down the code).

What can aid in this overlapping, while keeping your well-written tight loops separate from the additional functionality (the min/max function), is recoding to use a pipeline architecture. Through a pipeline you can maintain cache locality as you pass through the functional pipes. Often no change (or very little change) is required to your original code; when change is required, it usually relates to function entry arguments rather than the execution of the functional algorithm.

Jim Dempsey

www.quickthreadprogramming.com
Posted by Tim Prince
Quoting - jimdempseyatthecove

Agreed that it should be possible to achieve full scaling to a reasonable number of threads, if each thread does as much local reduction as possible. The Fortran min reduction achieves that automatically, with no additional mental effort from me for new platforms, and doesn't run afoul of NUMA issues.
Since C++-specific threading features are becoming widespread, I agreed to put forward our colleague's proposal to add an OpenMP reduction for C analogous to the Fortran one.

Posted by Tim Prince

This feature request has been submitted as issue 564720.

Posted by Igor Levicki
Quoting - tim18 This feature request has been submitted as issue 564720.

Great, now if only they agree to implement it :)

Another thing I would really like to see in ICC is natural alignment for __m64, __m128, and other SIMD vector datatypes without the need to use __declspec(align(n)) all over the code. That would also allow us to use new and delete to allocate aligned arrays. Adding another overload for new and delete with an additional parameter that says "align to n bytes" would also be very welcome.

In my opinion those types already are a part of the language which should follow the hardware as it evolves instead of becoming a constraint and requiring additional work from the developers to get some basic things done.

-- Regards, Igor Levicki
Posted by Tim Prince

I agree that the attractiveness of C++ is diminished by the recommendation to use workarounds involving malloc and explicit adjustments or wrappers for alignment, in place of new and delete, on a 32-bit OS, particularly when some of the recommendations aren't at all portable. There was a project to improve the 32-bit linux ABI, but I'm not counting on it.
A 64-bit OS gives automatic 16-byte alignment in most of the situations where it is needed. I think it's an open question whether better support for 32-byte alignment will be needed for AVX. I've heard discussion of an argument to the icc-specific pragmas, such as
#pragma vector aligned(32)
so I suppose your suggestion of such an overload for new and delete may help with that problem as well as dealing with already existing problems on 32-bit OS.
I would expect such a feature request might better be submitted by a more expert C++ developer than myself. If there isn't already such an overload in TBB, maybe it could be proposed and adopted more quickly there.

Posted by Swapnil J.

Hi Friends,
Is a MIN or MAX operator implemented in OpenMP or not?
Min/max search operations are useful in business logic everywhere.

Posted by Tim Prince

Although you're unlikely to read much of my answer, and haven't read what others wrote elsewhere, I'll attempt to answer anyway in case someone is interested in the discussion (which can easily become excessively verbose).
min and max reductions for Fortran (but not indexed min/max) have been in OpenMP at least since OpenMP 2.0. The C counterparts were introduced in OpenMP 3.1, and I believe have some support in the Intel 12.1 and 13.0 compilers, although there may not be much coverage in test suites. However, the pragma simd reduction counterparts don't appear to exist, in spite of talk about them. I have some long-standing issues filed against Intel compilers about shortcomings in this area.
It's important to engage both vectorization and OpenMP reduction where applicable (unless you have artificial goals such as minimizing single-thread performance so as to claim high effectiveness of multiple threads). With current technology it's best to use two explicit levels of loops: an inner loop to engage simd reduction with data as local as possible to each thread, and an outer loop to merge the results from the various threads.
Intel Cilk+ tuning advice advocates the explicit separation of vector inner and outer multi-thread loop optimization, and cilk provides specific syntax for the vectorized inner loop max/min reductions. C++ STL reductions cover part of the same territory. All major compilers optimize certain cases if you pay attention to their varying requirements.
There is a gcc development branch devoted to introduction of Cilk syntax. One may be skeptical about whether this is in a condition suitable for general application development.
Proposals for OpenMP 4.0 are intended to facilitate vector parallel optimization of a single level of looping. I haven't seen public discussion or progress reports. Whether this will relieve the programmer of changing code to optimize for specific types of architecture remains to be seen. There appears to be a difference of opinion against the Cilk+ language designers on the feasibility of this.
While recent versions of Intel OpenMP for Xeon SSE2 are quite effective in implementation of reduction with a vector inner loop and an outer loop with critical region, on the Intel(r) Xeon Phi(tm) coprocessor, it may be more effective to store a batch of partial reduction results from individual threads in an array and finish up with a single thread reduction.
If you are diligent, you will find open-source examples of these cases online. As to showing or referring to such examples on Intel sites, policy decisions are continually delayed.

Posted by jimdempseyatthecove

TimP>> Agree that it should be possible to achieve full scaling to a reasonable number of threads

The Min, Max, and MinMax functions are mostly memory-bandwidth bound. Scaling would be limited to the number of channels to memory (or a very small multiple thereof, such as 2x the memory channels).

Attached is a small OpenMP illustration that is not optimized for SSE or AVX (that is an exercise for you).

Jim Dempsey

Attachment: parallelminmax.cpp (3.35 KB)
www.quickthreadprogramming.com
Posted by Tim Prince

Jim,
Yes, this is one of several ways to perform a vector inner parallel outer loop reduction. A likely performance obstacle is false sharing in
// get this thread's team member number
int iThread = omp_get_thread_num();
// Note, in event that a thread in the team for the parallel region
// processes more than one chunk, we include a test suitable for
// for first and subsequent passes
// (first pass) || (subsequent)
if((iMin[iThread] < 0) || (Min < Array[iMin[iThread]]))
{
iMin[iThread] = iiMin; // set / update the index to Min
}

but I suppose you may hope that it doesn't become more of an obstacle as the number of threads becomes large enough that multiple cache lines are in use in iMin[].

Posted by jimdempseyatthecove

If cache line eviction becomes a problem, then iMin can become an array of cache-line-aligned structs, one or two cache lines in size, containing "int val; char x[padd];", and then use

iMin[i].val = iiMin;

However, note that the above statement is issued only once per thread when Chunk == n / nThreads.
The reduction loop would then also have to read n / Chunk cache lines, as opposed to fewer.

Because the writes are somewhat buffered, and occur intermittently, my guess is the array of ints would perform faster than the array of structs.

It is easy enough to code both ways and run some tests.

Jim Dempsey

www.quickthreadprogramming.com
Posted by Tim Prince

The final reduction runs so much faster with vectorization at stride 1 that I've found it satisfactory to let each thread perform enough reductions to fill up a cache line or more, accomplishing the goal of having the threads store to separate cache lines. Of course, if you have a reason for using a compiler which can't vectorize it, that could influence your decision.

Posted by Tim Prince

C omp reduction max and min operators were included in the OpenMP 3.1 standard, but I'm still looking for a widely available compiler which implements them. The gcc testsuite includes Fortran but not C or C++ tests for max or min reduction. OpenMP 4.0 defines both parallel and simd capabilities for min and max reduction; apparently, Intel compilers will advertise OpenMP 4 support before these have been implemented. Other OpenMP 4 reductions are supported now in current icc.

icpc does an excellent job of vectorizing std::max() even without an omp simd reduction directive. Sometimes icc can vectorize equivalent C code, possibly with the help of #pragma vector; more often, gcc can accomplish max/min vectorization. Cilk(tm) Plus includes equivalent reducers, but they seem to be less reliably optimized than gcc is with standard C source code.

Posted by Armando Lazaro Alaminos Bouza

The inclusion of min/max as a reduction in C would be of great value for me. I have a large nearest-neighbor search problem, for surface registration, that calls for it. The lack of a min/max reduction forces alternative implementations that appear suboptimal and unclear.

Posted by Sergey Kostrov

>>...The inclusion of min/max as reduction in C would be of great value for me...

I think you're mixing at least three different things:

- C language ( I don't believe that min/max reduction will ever be a part of some future C or C++ standard )
- Algorithm to find min/max values for a data set ( of course, as fast as possible )
- OpenMP ( see Tim's comments )

Posted by Armando Lazaro Alaminos Bouza

I beg your pardon; English is not my native language and sometimes I cannot express myself in the right way. I was trying to support the idea of adopting the OpenMP 3.1 min/max reduction for C and C++, not asking for a modification of the C or C++ standard!

I only want to use something like :

#pragma omp parallel for reduction(max : max_value)

which should be very handy for lots of programming problems.

Posted by Sergey Kostrov

>>I only want to use something like :
>>
>>#pragma omp parallel for reduction(max : max_value)

I think you need to submit your proposal to the right place, for example http://www.openmp.org ( mailto:webmaster@openmp.org ).

Posted by Tim Prince

Quote:

Sergey Kostrov wrote:

>>I only want to use something like :
>>
>>#pragma omp parallel for reduction(max : max_value)

I think you need to submit your proposal to the right place, for example http://www.openmp.org ( mailto:webmaster@openmp.org ).

This form is already defined by OpenMP 3.1, although I've seen that denied. OpenMP 4.0 RC2 also defines

#pragma omp parallel for simd reduction(max : max_value)

to specify explicitly that both simd and thread parallel optimizations are desired, as well as forms for simd without threaded parallelism.

I guess Intel compilers are waiting to implement reduction(max/min: ) until there is documented demand for the OpenMP 4 forms.

I've drafted a white paper on the omp simd forms already supported, as well as alternatives for those which aren't (e.g. C++ std::max(), std::max_element(), ...).

Posted by Tim Prince

gcc development branch accessed by e.g.

svn co svn://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch gcc-omp4

includes early support for the OpenMP 3.1 and 4.0 max|min reductions. If it reaches sufficient maturity to be merged into gcc 4.9, this would bring support for these features well before they appear in Intel C/C++. I had to use the configure --disable-werror option in order to build this gcc.

As indicated elsewhere, the libiomp5 and open-source Intel OpenMP libraries should already include support for omp parallel reduction(min|max: ). It is not expected to appear in the icc release this week, although that release will include several other OpenMP 4.0 features. It should be possible to use linux clang as well as the gcc branch to test min|max reductions with the Intel and gomp libraries.

I didn't find any max|min reduction tests in the gcc testsuite. There are some gfortran max|min reduction unit tests, but the gomp-4_0-branch doesn't appear to add any gfortran OpenMP 4 features.

Posted by Armando Lazaro Alaminos Bouza

There are references to the use of min/max reductions in gcc 4.7 and up. See:

http://www.techdarting.com/2013/06/openmp-min-max-reduction-code.html

as in the OpenMP 3.1 specification ( http://openmp.org/wp/2011/07/openmp-31-specification-released/ ).

But I do not use gcc; I only found the references on several sites some time ago.
