Forum Jump

Select Group :
Select Forum :
Sorted By :
Sort Order :
From The :
 
Thread Tools  Search this thread 
tim18
Total Points:
66,417
Status Points:
66,417
Black Belt
October 26, 2008 1:51 PM PDT
Optimizing the vectors benchmark

http://sites.google.com/site/tprincesite/levine-callahan-dongarra-vectors

The original version of this benchmark may be found at

http://www.netlib.org/benchmark/vectors

The netlib version is a shar file.  On Windows, you could use cygwin sharutils to unpack it.

 

I show examples of how to optimize, using compiler vectorization and OpenMP where appropriate, in Fortran, C, and C++ (all sharing the same Fortran driver).

In a few cases, threading is possible only at the expense of optimization of individual threads, and more than 4 threads may be needed to overcome the deficit, even on the longest loop length.  Both versions are shown.

Makefiles are provided for Intel icpc/ifort (linux), ifort/ICL/MSVC (Windows), and gcc/gfortran.  The Intel libiomp5 libraries are recommended, regardless of compiler chosen.

For testing on multiple last level cache machines, or with HyperThreading enabled, affinity environment variables KMP_AFFINITY or  GOMP_CPU_AFFINITY are strongly recommended, so as to place adjacent threads on the same cache, and to set 1 thread per core.  Note that libiomp5 implements either environment variable, gcc only the latter, and MSVC neither.

Dmitriy Vyukov
Total Points:
24,747
Status Points:
24,747
Black Belt
November 6, 2008 7:21 AM PST
Rate
 
#1
Quoting - tim18

I show examples of how to optimize, using compiler vectorization and OpenMP where appropriate, in Fortran, C, and C++ (all sharing the same Fortran driver).

It seems that there is a lot of valuable detailed information, but where are base non-optimized cases? I can't find them (I was looking into C++ sources).



tim18
Total Points:
66,417
Status Points:
66,417
Black Belt
November 8, 2008 10:41 AM PST
Rate
 
#2 Reply to #1
Quoting - Dmitriy V'jukov

It seems that there is a lot of valuable detailed information, but where are base non-optimized cases? I can't find them (I was looking into C++ sources).

There is no base non-optimized C++.  As I explained in more detail in my earlier paper, which was removed from the Intel sites (possibly on account of lack of interest), the C++ is derived from automatic translation of the original netlib Fortran.  The translation itself already introduces some source optimizations, some of which aren't typical of usual C or C++ practice. Almost by definition, it fits the quip about writing Fortran in any language.  Still, it doesn't make sense to compare in all cases the ability of C,C++ and Fortran compilers, nor to demonstrate OpenMP, on literal equivalents of the legacy Fortran, so I try to make the comparisons as fair as possible.

The purpose of the original, sometimes ugly, Fortran, was to test compilers of that era to see how much optimization they could perform.  In several cases, you might say I have perverted that purpose, by "improving" style, and sometimes introducing OpenMP.   The original authors were kind enough to agree to this use of their benchmark.

I don't have a personal position on whether the introduction of transform() could be considered an optimization.  g++ does very well with it, and icpc does well with some cases which g++ misses.  My personal position is not in favor of such use of transform() without vector classes, of which I present only a single example.  The re-organization of loops so as to fit accumulate(), inner_product(), and transform() is an important optimization, so I present both C and C++ versions of it.



Dmitriy Vyukov
Total Points:
24,747
Status Points:
24,747
Black Belt
November 25, 2008 12:51 AM PST
Rate
 
#3 Reply to #2
Quoting - tim18

There is no base non-optimized C++.  As I explained in more detail in my earlier paper, which was removed from the Intel sites (possibly on account of lack of interest), the C++ is derived from automatic translation of the original netlib Fortran.  The translation itself already introduces some source optimizations, some of which aren't typical of usual C or C++ practice. Almost by definition, it fits the quip about writing Fortran in any language.  Still, it doesn't make sense to compare in all cases the ability of C,C++ and Fortran compilers, nor to demonstrate OpenMP, on literal equivalents of the legacy Fortran, so I try to make the comparisons as fair as possible.

The purpose of the original, sometimes ugly, Fortran, was to test compilers of that era to see how much optimization they could perform.  In several cases, you might say I have perverted that purpose, by "improving" style, and sometimes introducing OpenMP.   The original authors were kind enough to agree to this use of their benchmark.

I don't have a personal position on whether the introduction of transform() could be considered an optimization.  g++ does very well with it, and icpc does well with some cases which g++ misses.  My personal position is not in favor of such use of transform() without vector classes, of which I present only a single example.  The re-organization of loops so as to fit accumulate(), inner_product(), and transform() is an important optimization, so I present both C and C++ versions of it.

And is there non-optimized Fortran code? It's just relatively difficult to extract the essence of optimization (the reusable pattern) w/o non-optimized case. It's much easier when optimizations are applied incrementally to some base code.



tim18
Total Points:
66,417
Status Points:
66,417
Black Belt
November 25, 2008 5:15 AM PST
Rate
 
#4 Reply to #3
Quoting - Dmitriy V'jukov

And is there non-optimized Fortran code? It's just relatively difficult to extract the essence of optimization (the reusable pattern) w/o non-optimized case. It's much easier when optimizations are applied incrementally to some base code.

The original Fortran is available at the netlib URL reference.

In a few cases, optimum varies with platform; for example, s112(), s211() and s261() may be optimized for Core i7 and Barcelona at the expense of Core 2.



Dmitriy Vyukov
Total Points:
24,747
Status Points:
24,747
Black Belt
November 25, 2008 5:36 AM PST
Rate
 
#5 Reply to #4
Quoting - tim18

The original Fortran is available at the netlib URL reference.

Ok, thank you.

Here is non-optimized Fortran code, right?

http://www.netlib.org/benchmark/vectors

 

So, in order to extract the reusable pattern, I can take, for example, 'subroutine s211' and compare it to function s211_() from file 'loopsv.c'. Do I get it right?

 



tim18
Total Points:
66,417
Status Points:
66,417
Black Belt
November 25, 2008 8:35 AM PST
Rate
 
#6 Reply to #5
Quoting - Dmitriy V'jukov

Here is non-optimized Fortran code, right?

http://www.netlib.org/benchmark/vectors

So, in order to extract the reusable pattern, I can take, for example, 'subroutine s211' and compare it to function s211_() from file 'loopsv.c'. Do I get it right?

You could make that comparison to see the general style of translation.  I used 'f2c -aAR' for the original translation.  f2c itself already applies some optimization during that translation.  Addition of restrict keyword is the most often required subsequent optimization needed in the C code.



Dmitriy Vyukov
Total Points:
24,747
Status Points:
24,747
Black Belt
November 29, 2008 4:02 AM PST
Rate
 
#7 Reply to #6
Quoting - tim18

You could make that comparison to see the general style of translation.  I used 'f2c -aAR' for the original translation.  f2c itself already applies some optimization during that translation.  Addition of restrict keyword is the most often required subsequent optimization needed in the C code.

I am still a bit uncertain about this thing. Am I right in the following?

You get base-line fortran code (it can be seen in, for example, subroutine s211).

Then you convert it to C with f2c - this is the first source of optimization (as compared with situation when we are directly writing C code).

Then you apply some manual optimizations to it (vectorization, parallelization with OpenMP, something else?).

Resulted code can be seen in, for example, function s211_() in 'loopsv.c'.

Right?

Also you write: "In a few cases, threading is possible only at the expense of optimization of individual threads, and more than 4 threads may be needed to overcome the deficit, even on the longest loop length.  Both versions are shown."

Please, point to some example where 2 versions are shown (function names).

Sorry for so many stupid questions, I just can't get integral view, but it looks worthwhile...



tim18
Total Points:
66,417
Status Points:
66,417
Black Belt
November 29, 2008 8:41 AM PST
Rate
 
#8 Reply to #7

In several cases, the original f2c translation performs fairly well with auto-vector and auto-parallel compilation.  While one of the purposes of the original vectors benchmark was to test whether compilers could interchange loops effectively, I have written in most of those optimizations.  In particular, should one wish to use the STL optimizations, writing those in requires the loop interchange optimization. s125() and s141() require the explicit calculation of array subscripts to break sequential dependence in the outer loop.

s141(), in the original loop nesting, exhibited problems with reconciliation of single thread optimization and threading.  With loop interchange, so as to make the inner loop stride 1, this is resolved.

In a majority of the omp parallel loops,  performance with all cores in use does not gain much from vectorization, although the vectorization speedup is large for a single thread.  Threaded performance scaling is much higher with MSVC compilation, using Intel OpenMP run-time library, as the single thread performance is much lower with MSVC (lacking auto-vectorization) than with gnu or Intel compilers.

In the case of s125(), this may be associated with a relatively high rate of last level cache misses, even though this benchmark is constructed so that most tests don't see a normal cache miss rate.  s114() and s125() show significant gain from the tactic of initializing the arrays (subroutine cs2d) with the same static schedule as in the later timed use.

The loops with schedule(guided) are problematic with respect to affinity, particularly on multi-socket platforms.  The static scheduled use of the arrays during initialization and use without guided in other loops doesn't match the uncontrolled affinity of schedule(guided).





Intel Software Network Forums Statistics

8293 users have contributed to 31243 threads and 99119 posts to date.
In the past 24 hours, we have 12 new thread(s) 12 new posts(s), and 24 new user(s).

In the past 3 days, the most popular thread for everyone has been huge pages on linux? The most posts were made to Pipeline buffer between stages? The post with the most views is Another example attached (Tr

Please welcome our newest member bwillems