LCD vectorization + threading benchmark

Levine-Callahan-Dongarra “vectors” Benchmark

Abstract

Use of the public “LCD” benchmark is discussed with respect to compilers of 2007. Topics include translation of archaic syntax to more current Fortran and C[++], and optimizability on SSE4 architectures by unrolling, vectorization, and OpenMP threading.

Historical background

In 1991, when the reference http://citeseer.ist.psu.edu/levine91comparative.html was published, the High Performance Computing scene was entirely different from what it became a decade later. The original source code http://www.netlib.org/benchmark/vectors did not reflect the latest developments in language standards. In fact, it included examples of code which was already more than a decade obsolete.

For more than a decade, HPC had relied on hardware support for vectors of at least 16 (typically 64) elements in a single floating point instruction. All of the benchmarks in this suite relied on efficient vector support for loop operations of lengths 5 to 1000, the range this benchmark covers. The more popular hardware fully supported vectorization only of 64-bit data (double precision, or close to it). Contrary to the original instructions, we now emphasize single/float precision performance, with the reasonable expectation that what vectorizes effectively for single precision will do the same in double.

LCD tests individual performance of more than 100 distinct loops, at “vector lengths” 10, 100, and 1000. The times for the shorter loops are made comparable by repeating them so as to execute the same total number of loop iterations. Function calls with possible side effects prevent compilers from discarding the repetitions. However, the repetition assures “hot cache” behavior of the shorter loops. The longer loops are fairly certain to use data residing in last level cache, but not necessarily in first level cache; they are easily accommodated by hardware strided automatic prefetch. Cache considerations were ignored in the original development of the benchmark.

As OpenMP parallelization is ineffective for “vector length” 10, and is effective only for a few cases of nested loops of length 100, the if-clause is used to control which vector lengths are threaded. This has to be considered when comparing the 10/100/1000 performance.

In accordance with common lore, only vectorization options are useful for the inner loops of LCD.

All results are checksummed, catching most gross implementation errors. Minor variations in numerical behavior due to varying optimizations were not of primary concern.

It is interesting to note that the authors requested reports from people who ran the benchmark, the most modern acceptable medium being a 5” DOS floppy; none of those media remain viable. Although an e-mail address is given, it was not to be used for reporting.

As the contact information in the public postings is stale, Intel legal has declined to allow public use of them by Intel personnel. Thus, the issues of full trademark rights designations are generally moot.

Bringing LCD up to date

Compiler vectorization technology gained revived importance with the introduction of short vector support capability in SSE and Altivec. The LCD benchmark has retained importance for its showing of individual cases for vectorization. Not all cases can be vectorized effectively, particularly since the non-vector performance of current architectures often is satisfactory, and support for strides other than 1 is limited in hardware or software.

As developing language standards led to major changes in normal programming style around the time LCD was released, some of the performance quirks of the original LCD coding became irrelevant. In a few cases, Fortran 90 syntax promotes optimizations which were difficult in the original source; in others, it has shown up deficiencies in compilers until recently. C and C++ surpassed Fortran in popularity at the time, but not until the introduction of the C99 keyword restrict was it possible to optimize these benchmarks with separate compilation, in accordance with the LCD rules. C++ STL syntax facilitates better optimization in some cases, with compilers which handle its peculiarities and permit use of restrict together with C++.

The original Fortran code was updated automatically in accordance with the Fortran 77 standard, using procedures based on the struct processor (http://portal.acm.org/citation.cfm?id=811545).

In a few cases, this automatic translation introduces changes which are more consistent with the C translation.

Where appropriate, Fortran 90 syntax was introduced manually. In some cases, this inherently removes some of the obstacles to optimization in the original source.  In others, notably the pack/unpack, and frequently where/elsewhere and forall, there is no optimization advantage in practice to modern syntax, and it should be used only to improve readability.

The test kernel portion of the f77 code was translated to C by the f2c utility typically bundled in linux distributions until recently. Even after consolidating many repeated headers, the C code is about double the size of the Fortran. Where the Fortran 90 corresponds to C or C++ code which is more idiomatic or easily optimized, those changes were introduced.

As hinted above, restrict keywords had to be introduced to present the same optimization opportunities in C as in Fortran.  As restrict in C++ is not portable, -Drestrict=__restrict__ is used to change from icpc to g++ syntax.  g++ does not consistently apply __restrict__.  As this benchmark is derived from Fortran, g++ -fargument-noalias may be applied to change aliasing rules to Fortran style.  This flag is equivalent to the MS CL and icc option -Oa.
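
As an illustration, a minimal sketch (not the benchmark source) of a kernel whose arguments carry the restrict qualifier, so that, as in Fortran, the compiler may assume no aliasing between them:

/* compiles with icpc -restrict; for g++, -Drestrict=__restrict__
   maps the same text onto the GNU spelling */
void saxpy_r(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];   /* the no-alias guarantee permits vectorization */
}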

f2c was found not to translate the driver code correctly. In order to make a fair comparison between Fortran and C++, both are run from the same Fortran compilation of the driver.

Where OpenMP shows an advantage, it is included in the test kernels, with an if clause so that it is not invoked on the loops which are too short. For the most part, this is the same in the Fortran and C++. Some of the opportunities for OpenMP were uncovered by use of auto-parallelizing options, but it was necessary to use OpenMP to get consistent comparisons among platforms and compilers.
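
The shape of the threaded kernels is roughly as follows (a hedged sketch; the names and the threshold are illustrative, not the LCD source):

/* The if clause keeps the short "vector length" runs serial, where
   fork/join overhead would dominate, and threads only the long case. */
void vadd(int n, const float *b, const float *c, float *a)
{
#pragma omp parallel for if(n >= 1000)
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}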

Performance of recent platforms is such that standard C or Fortran timers don’t have satisfactory resolution (sub-millisecond resolution is required). Tests on IA platforms are based on the rdtsc macro of Intel and Microsoft C compilers, and must be performed with EIST disabled, as rdtsc on current platforms is not a CPU clock cycle count, but instead is directly related to elapsed time.

Optimization implications of syntax updates

As Fortran 90 encourages writing individual array assignments, it depends on compiler optimizations to fuse loops where required, for example to achieve better register locality (or cache locality in extremely long loops). Ifort began to do this (at -O3) with the 10.0 release, although it was a standard technique at least since the time LCD was published.

Conversely, array assignments give a better head start where a loop split is required for optimization. Examples are s211 and s261, on account of storing and re-using misaligned vectorizable array sections, where gfortran out-performs g++. For s233 and s261, a similar role is played by the introduction of transform() for C++ and f90 array assignments. S413 is also an example which does not optimize except with explicit splitting.

When not using separate array assignment or STL, Intel compiler DISTRIBUTE POINT directives may be used to force a loop split. I have introduced these only where it is necessary to reconcile a gap between icpc and g++ performance.
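
A sketch of the idea (illustrative code, not LCD source; the C/C++ spelling is #pragma distribute_point in Intel's documentation of this era, while Fortran uses !DEC$ DISTRIBUTE POINT):

void split_example(int n, float *restrict a, float *restrict d,
                   const float *restrict b, const float *restrict c)
{
    for (int i = 1; i < n; ++i) {
        a[i] = a[i - 1] + b[i];  /* loop-carried recurrence, stays scalar */
        /* ask icpc to split the loop here, so the independent
           statement below can be vectorized on its own */
#pragma distribute_point
        d[i] = b[i] * c[i];
    }
}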

Fortran 95 forall and where..endwhere are used liberally in the updated LCD. These constructs are somewhat controversial. Forall was added to Fortran in order to improve compatibility with HPF, even though it was already known that forall did not accomplish its goal of improving optimization. In fact, it has no satisfactory implementation in OpenMP, and so I have used it only for LCD inner loops where threaded parallelism is not valuable, mostly where conditionals prevent use of array assignments. In some LCD functions where VECTOR ALIGNED directives promote successful C vectorization, forall leads to failure of optimization, which at the moment still looks like a compiler bug. It takes more than 10 years for compiler optimization to catch up to language development.

Where..endwhere superficially would seem to promote vectorization. In reality it makes it more difficult, in effect introducing an extra loop to generate a mask array. The execution time required for allocation and de-allocation of the implicit mask could be mitigated only by the compiler hoisting it up to the function preamble (outside the timing loop, OK by the LCD rules). So the syntax is valuable not for optimization, but for human clarity. gfortran developers have made significant strides in efficient implementation, approaching the efficiency of C or f77 in more cases than ifort.

Fortran merge is the clearest case of modern syntax improving optimization. In these LCD cases, merge (and near equivalent C ‘?’ operator) require VECTOR ALIGNED directives for best effect. Intel compilers can take advantage of SSE4 for these functions.
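
The C rendering of such a case looks roughly like this (a sketch, assuming 16-byte-aligned arguments, which is what the VECTOR ALIGNED directive asserts):

void vmerge(int n, float *restrict a, const float *restrict b,
            const float *restrict c)
{
    /* the pure select below maps onto a masked/blend sequence
       (SSE4 blendvps) rather than a branch */
#pragma vector aligned
    for (int i = 0; i < n; ++i)
        a[i] = (b[i] > 0.f) ? b[i] + c[i] : b[i] - c[i];
}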

Notes on compilation options

The driver code must be compiled with safe optimizations, preferably with extra precision (non-SSE), in order to avoid false evaluations of checksums, e.g. ‘ifort -mp -c mains.F’ or ‘gfortran -O2 -mfpmath=387 -c mains.F’. The pre-processor is invoked to accommodate the different styles of OPEN required by Intel (an f95 extension) and by compilers such as gfortran which require f2003-style OPEN. That OPEN will return a failure code (handled by the source) when run on Windows, until such time as /proc/cpuinfo is supported there. Test kernel code is compiled with aggressive optimizations, preferably with correct observance of parentheses specified where there is such an option, e.g. ‘ifort -O3 -assume protect_parens’ or ‘gfortran -O3 -funroll-loops -ftree-vectorize’. icpc options include -ansi_alias -restrict.

Note the distinction between the option -ansi_alias (g++ equivalent -fstrict-aliasing is set by default) which asserts that the C++ code conforms with the ISO standard on argument aliasing, and the options g++ -fargument-noalias, or Microsoft style option -Oa, which in effect puts a default restrict qualifier on all arguments.  The latter is valid if all functions are called from standard-compliant Fortran.

The timer code requires Intel or Microsoft C, using a macro CLOCK_RATE to figure the conversion from reported ticks to seconds elapsed.  The standard OpenMP timer, or possibly Fortran system_clock(), might be used.
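
A hedged sketch of such a timer (the intrinsic header and the example rate are assumptions; CLOCK_RATE is supplied on the compile line, as in the build examples below):

#include <intrin.h>                 /* __rdtsc() on Intel/Microsoft C */

#ifndef CLOCK_RATE
#define CLOCK_RATE 2933000000ULL    /* ticks per second; set per CPU */
#endif

/* elapsed seconds; meaningful only with EIST disabled, since rdtsc
   here tracks elapsed time rather than core cycles */
double seconds_now(void)
{
    return (double)__rdtsc() / (double)CLOCK_RATE;
}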

Various schemes have been used to run on Windows, primarily either ‘ifort -Qlowercase -us’ to make the symbol mangling the same as on linux, or ‘icl -Dforttime_=FORTTIME’ to make the timer linkage agree with the Windows Fortran default.  Insertion of a C interoperability interface block simplifies the accommodation to Windows.

Gfortran can compile and link g++ code in one shot, e.g. ‘gfortran -O3 -funroll-loops -ftree-vectorize -ffast-math -Drestrict=__restrict__ mains.o loopstl.cpp forttime.o -lstdc++’.

Note that the gfortran/g++ option -ffast-math is equivalent to ifort/icpc -fp-model fast=2. g++ has no direct equivalent of the icpc default -fp-model fast=1. g++ -fstrict-aliasing (a default) is the equivalent of icpc -ansi_alias.

Ifort/icpc 10 default to the equivalent of 9.1 -xW. 9.1 could be optimized to some extent for the short vectors by -O1, but that option is not available in 10.

The gfortran/gcc option -ftree-vectorizer-verbose=1 would be needed by those who wish to tally reports of vectorized loops. A similar count from the ifort report would be misleading, as there are cases where the reported vectorization applies only to a loop which moves data from an unnecessary temporary array to the destination. Aside from that, current compilers have eliminated situations where vectorization performed with normal flag settings is unproductive.

For MS VC9 and ifort 10.1, the C++ code may be compiled by e.g.

cl /EHsc /c /Ox /favor:EM64T /GL- /openmp /fp:fast -Drestrict= loopstl.cpp
cl /EHsc /c /Ox /favor:EM64T /GL- /openmp -DCLOCK_RATE=2933000000 forttime.c
ifort -Qip- -QxW -Op -Qfpp -Qlowercase -us mains.F loopstl.obj forttime.obj /link -nodefaultlib:bufferoverflowu.lib libiomp5mt.lib
In the Windows ifort, it may be necessary to replace system_() by SYSTEMQQ().

Analysis of individual LCD tests

Following are comments based on extensive experience with LCD on current architectures. Unless otherwise specified, compilers tested are ifort 10, icpc 10, gfortran 4.3, g++ 4.3, all on x86-64 linux.

S111

A purely stride 2 benchmark, not amenable to IA SSE vectorization, so may perform better on Opteron. Performs poorly on IA-64 (possible bank conflicts).

S112

Overlapping source and destination; relies on vectorization at stride -1 for efficiency. None of the tested compilers optimizes well, even though there is no obstacle in hardware. Ifort will vectorize at stride +1 by use of a temporary buffer, but with performance inferior to the non-vector code produced by other compilers.

S113

Vector-scalar addition with mis-alignment. All compilers tested vectorize OK, except g++.

S114

Mixed strides, reading and writing separate sections of 2D array. All compilers produce fair performance. ifort uses an unnecessary temporary array, and fails to move integer multiplication out of inner loop.  Due to the mixed strides, IA64 or Opteron perform relatively well. Increasing loop length is backward from normal omp schedule(guided) usage, so the schedule type should be tested for each platform.  OpenMP gives good speedup up to 8 threads, using the if clause to restrict threading to the longest loop case. Intel NUMA platforms get more than double performance when the same OpenMP parallel structure is used in the initialization (subroutine set2d).
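
The NUMA point can be illustrated with a sketch (names, bounds, and threshold are illustrative, not the set2d source): initialization uses the same parallel structure as the compute loop, so each thread first-touches, and thereby places on its local socket, the pages it will later access.

void set2d_like(int n, float *aa)   /* n*n array, column-major */
{
#pragma omp parallel for schedule(guided) if(n >= 100)
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            aa[i + (long)j * n] = 0.f;   /* first touch decides page placement */
}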

S115

Dot product (dot_product or inner_product) is vectorized by batching sums. Only ifort 10 does this with sane options. Intel OpenMP library performs better than gnu. Inner_product is written with 0 initialization to make accuracy equivalent to Fortran.  Penryn speeds up significantly, presumably due to better handling of unaligned load.  Nehalem doesn't need aligned version.

S116

Overlapping sections of an array, replacing one of the sections. ifort 11 gets good performance for long loops, but generates far too much code, with an unnecessary temporary and an unused dead section of vector code left over from earlier versions.

S118

Reversed strided dot product. The Fortran-suggested dot product order is opposite from the original, in order to delay recursive use of the result of the previous inner loop. Reversing the C loop ruins accuracy. IA64 and Opteron should perform comparatively well. OpenMP Fortran parallelization might be achieved with omp ordered, requiring that each thread wait until the previous one finishes before using that result.  Speedup is limited by false sharing. More speedup may be achieved without threading, by unroll and jam.  ifort 11 is capable of fusing dot_product loops of identical length; to enable this, the peeling to equal lengths has to be done explicitly in source.  gfortran doesn't fuse loops, so that also has to be done in source.

S119

Reading and writing distinct sections of same array, with misalignment. All compilers perform OK. Misalignment reduces effectiveness of vectorization, so Penryn is better.  Unaligned version is never used.

S121

Explicitly overlapping sections of the same array. Icpc performs best, by vectorizing with a scalar load, on every other iteration, of the array element which may cross a cache line, avoiding cache line split stalls.  The special icpc optimization remains effective on the Penryn/Harpertown platform, once a good BIOS is installed.  ifort -xT degrades performance.

S122

Stride hidden from compiler.

S123

Data-dependent indexing.

S124,443

“Vector merge”: ifort and icpc vectorize effectively with the VECTOR ALIGNED directive. SSE4 (ifort -xS) can use blend, and then does not require the directive.

S125

Odd hidden vectorizability. All compilers perform the same.  Data must be stored local during set2d initialization for good NUMA performance.  Nehalem doesn't need aligned version, but is sensitive to socket locality of data.  GOMP_CPU_AFFINITY may be superior to KMP_AFFINITY.

S126

Awkward mixed strides with recursion as written. Ifort best, gfortran worst, IA-64 relatively good.  Penryn appears to have a handicap.

S127,128

Mixture of strides 1 and 2. C out-performs Fortran; not good for IA-64. icpc vectorizes s127, but the advantage over g++ is minimal. Ifort vectorizes likewise when using DO..ENDDO rather than forall; gfortran also optimizes better with the change.  S128 f90 code performs similarly to C only with those versions of ifort which fuse the 2 assignments.

S131,132

Difficult compiler analysis to determine vectorizability. All vectorize except gfortran, but misalignment reduces effectiveness. Penryn shows improvement on s132. g++ needs -fargument-noalias for full optimization.  The 4 versions generated by icpc (selection at run-time based on relative alignments) show performance benefit over the single version of g++ when running on Pentium D, but not on Conroe.  The unaligned vector loop version in s132 is never used.

S141

Original loop-recursive stride retained in C, removed from Fortran; OpenMP parallel with schedule(guided) used as intended. Intel compilers superior, but Fortran doesn’t reinsert the index recursion optimization.  OpenMP shows good scaling up to 8 threads.  The Fortran code originally attached has an error.  The time lost by re-calculating the index in the inner loop can be minimized by forcing a single calculation, and by replacing division by 2 with a shift, which produces the same result as /2 since the operand is positive: k+ishft((j-i)*(j-1+i),-1).  Penryn CPUs are less dependent on the shift substitution for performance.
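
In C terms, the dictated optimization amounts to computing the index once per iteration and shifting instead of dividing (a sketch; the helper name is illustrative):

/* C rendering of the Fortran k + ishft((j-i)*(j-1+i), -1);
   >>1 matches /2 exactly because the operand is positive here */
static inline int s141_index(int k, int i, int j)
{
    return k + (((j - i) * (j - 1 + i)) >> 1);
}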

S151

Interprocedural analysis needed to vectorize loop. All give fair performance except gfortran.  ifort uses a temporary, wiping out the advantage of vectorization. ifort -xT degrades it somewhat.

S152

In-lining required to assemble a vectorizable loop. ifort vectorizes effectively. 32-bit (but not 64-bit) icc vectorizes.

S161

Fortran where..elsewhere. C superior to f95, probably due to inefficient combination of setting an implied logical followed by its use in a conditional. 

S162

Optimization extractable from if() condition. All mediocre, ifort cancels advantage of vectorization by using temporary. Recent versions may fuse the initialization of the temporary with the assignment to the result, apparently without analyzing for possible adverse dependency.

S171

All OK except icpc.

S172

C++ optimized by transform() versioning for actual case. All OK.

S173,174

Distinct segments of a 1D array. 

Embedding the assignment in the C comparison in s174 avoids an undesirable "distribution," when the entirety is vectorized by VECTOR ALIGNED.  Distribution may also be avoided by DISTRIBUTE POINT pragma at the top (or bottom) of the loop body.

S175

Hidden stride, C better than Fortran.

S176

Reversed copy of stride -1 array seems to improve caching, as well as enabling vectorization.  OpenMP parallelization of the length 100 version was effective only without the cache and vector optimizations.  GOMP_CPU_AFFINITY may be needed to produce effective cache locality.

S211

Assignments switched from original for vectorizability. All OK, all vectorized except g++. Vectorization (a loop split is required due to misaligned store forwarding) is only a slight gain.  The aligned version shows no gain on Nehalem.

S212

Assignment switching makes vectorization easy. All vectorized effectively except g++.  ifort -xT gives a small loss.

S221,S222

Optimized by all. Partial vectorization by ifort doesn't gain, but loses at short loop length, while icc fails to optimize without partial vectorization.

S231

Vectorized by all compilers after loop nest switch.   No need for aligned version on Nehalem.

S232

Backwards use of schedule(guided). Unroll and jam, dictated in source, enables ifort to match gfortran.

S233

Partial vectorizability transformations written in, all optimize OK.  Penryn shows significant improvement in the misaligned vector loops here and the next 2.  Likewise, the aligned code version is not needed for Nehalem.

S234

All compilers perform same, vectorized or not: alignment problems. IA-64 performance is good.

S235

If loop nest transformations are written in, all optimize, even Microsoft C.  Intel compilers can do automatic distribution and loop interchange for vectorization, so can vectorize regardless of loop nests.

S241

IA-64 performs fairly well. Intel compilers vectorize in accordance with the hint about preloading, if tmp=a(i+1)*d(i) is calculated at the top of the loop.  This change inhibits Intel IA-64 compilers from producing unrolled loop versioning, reducing peak performance.  ifort -xT has a small loss.

S242

No vector benefit, ifort good. C++ partial_sum introduced to split off the other potentially vectorizable part, but this is beneficial only for a few IA CPUs, and only in comparison with the weak optimization of non-vector icpc.  g++ with -fargument-noalias (or with the dictated optimization tmp=*s1 + *s2) does as well as icpc on the split version, and much better on the original simple version, unless #pragma unroll(4) is set for Core architecture.

S243

Written without original false dependency, Intel better than gnu, but ifort stores one of the results to an unused temporary, unless VECTOR ALIGNED directive is given before the first f90 assignment.

S244

Without original false dependency, all OK except gfortran

S251

Squared expression, all vectorize OK

S252

Fortran eoshift; gfortran performance is greatly improved by using a vectorizing compiler to build memcpy() (64-bit gcc or icc, if glibc < 2.6.1-18.3).  ifort uses idiv twice per inner loop, and takes > 4 times as long as the other versions.

S253

Only icpc vectorizes conditional effectively; ifort could do so with similar source

S254

Fortran cshift, ifort and g++ are good.  Penryn shows a speedup. gfortran 32-bit library must be built with -fno-builtin-memcpy, and an optimized memcpy with effective movdqa supplied, for competitive performance.

S255

Double cshift, vectorizable by peeling first 2 iterations, which gfortran accomplishes automatically for f77 code, as does gcc.  Optimization was dropped from recent Intel compiler versions.

S256,257

Recursive 1D array in 2D loop. All compilers about the same, until we try loop swapping, as suggested but not performed by ifort IA-64. Compilers, except for gnu-ia64, perform major optimizations after manual swapping. Intel auto-parallel seems to compensate for non-optimum loop nesting, but it would be wrong to conclude that threaded parallelism is useful; the strange data dependencies (the result depends on first array elements left over from earlier function tests) prevent it being done automatically.  Significant speedup on Penryn.

S258

Conditional loop carried recursion, all about same

S261

Vectorizable if not fused. Vectorization with fusion produces store forwarding stall due to mis-alignment in re-use of modified array. ifort 10.1 recognizes DISTRIBUTE POINT directive between loops to prevent fusion.  ifort 9.1 distributed automatically without directive. icpc fails to vectorize the transform() without #pragma ivdep.

S271

Conditional, all vectorize except gfortran

S272

Conditional, only icpc vectorizes with VECTOR ALIGNED (ifort could vectorize with similar source).  g++ is best of non-vector results, with -fargument-noalias.

S273,275

Fortran where, all compilers similar performance. Intel compilers do well on S275, the first case of competitive performance for f95 where, 13 years after it became a standard.

S274

Where..elsewhere, ifort is worst, due in part to code generation with multiple conditional jumps.  Penryn is somewhat slow, probably for the same reason.  icc 9.1 achieved full vectorization with #pragma vector aligned, and that ability is restored in 11.0.  In addition, placing #pragma distribute point at the top of the loop instructs the compiler not to split the loop, so that no duplicated memory accesses are required.

S276

Original loop split according to index ranges. Gfortran misses vectorization of 2nd loop. g++ needs -fargument-noalias to catch it.

S277

Nested where is slow, both C++ compilers good, directives which were needed in previous icpc versions should be eliminated for icpc 10.1.

S278

Where..elsewhere, ifort worst, icpc best with VECTOR ALIGNED

S279

Where nested in elsewhere, icpc best with VECTOR ALIGNED

S2710

Vectorization based on loop invariant ifs.   Fastest version is ifort with DO loop, VECTOR ALIGNED directive, and OpenMP.  Slowest is gfortran with f90 assignments.

S2711

Bogus if != 0; icpc best with VECTOR ALIGNED, gfortran worst

S2712

Non-bogus if, behaves like s2711

S281

Not vectorizable because of mixed +- strides, all about same

S291

Cshift, ifort and g++ with -fargument-noalias are best, gfortran worst.  Penryn shows good improvement.

S292

Double cshift, similar to s255.

S293

Sets whole array to its first value (C++ fill()). No problem for anyone, but would need VECTOR NONTEMPORAL if the array were large compared with cache. In the LCD framework, nontemporal is extremely slow, underlining the hot cache nature of the benchmark. For optimization of C++ fill(), using a local scalar copy for the source operand is more effective than any ivdep or restrict pragmas or keywords.  For direct comparison with the Fortran and C versions, the equivalent measure should be taken, so as to get similar optimization and data alignment.
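
A sketch of that measure (illustrative names, not the LCD source):

#include <algorithm>

void s293_like(int n, float *a, const float *b)
{
    const float t = b[0];       // local scalar copy of the source operand,
                                // so writes to a[] cannot alias the source
    std::fill(a, a + n, t);     // C++ fill() with a loop-invariant value
}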

S2101

Matrix diagonal operation, all same performance.  Intel compilers miss the opportunity for strength reduction (moving imul out of inner loop), but DTLB misses appear to be the main bottleneck.  On several Intel platforms, performance is tied to latency of last level cache access, which may be relatively high on QPI platform.

S2102

Initialize identity matrix, vectorizability written in (C++ fill()), all OK

S2111

“Wavefront”: vectorizable by a classic mixed-stride method, which is unsuitable for SSE.  icc and gcc need a rewrite to registerize the stride 1 recursion, which the Fortrans optimize automatically:

for (j = 2; j <= i__2; ++j) {
    float tmp = aa[1 + j * aa_dim1];
    for (i__ = 2; i__ <= i__3; ++i__)
        aa[i__ + j * aa_dim1] = tmp += aa[i__ + (j - 1) * aa_dim1];
}

The tmp scalar variable expresses the recursion as required for icc and gcc to registerize fully, so there are 2 memory references per loop iteration.  OpenMP parallelization is dubious on account of the recursion on both i and j: the results for the previous j value, over the entire cache line containing the current i, must have been completed first.  With full optimization, parallelization isn't needed.  Current compilers benefit from an explicit unroll and jam strategy, with multiple outer loop iterations on each inner loop iteration, so as to registerize both the outer and inner loop recursions.
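
A hedged sketch of that unroll-and-jam strategy, jamming two outer (j) iterations into one inner loop so both recursions live in registers (a remainder column is needed when the j count is odd):

for (j = 2; j + 1 <= i__2; j += 2) {
    float t0 = aa[1 + j * aa_dim1];
    float t1 = aa[1 + (j + 1) * aa_dim1];
    for (i__ = 2; i__ <= i__3; ++i__) {
        t0 += aa[i__ + (j - 1) * aa_dim1];  /* aa(i,j) = aa(i-1,j) + aa(i,j-1) */
        aa[i__ + j * aa_dim1] = t0;
        t1 += t0;                           /* column j+1 reuses aa(i,j) from a register */
        aa[i__ + (j + 1) * aa_dim1] = t1;
    }
}
/* remainder: one plain column, as in the original loop, when the count is odd */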

S311

Bare vectorizable sum. Ifort best

S312

Product vectorized only by Intel

S313

Bare dot product; Intel best

S314

Maxval/maxelement. Ifort best; icc and gnu can vectorize only with a change to the C source.  VC9 gives good non-vector performance with C or C++.

S315

Maxloc/maxelement, ifort best, gfortran worst.  icc C vectorization matches VC9 non-vector C or C++. Minimum code size sequence is very slow.

S316

Min, like s314

S317

Exponentiation by loop, vectorized by ifort and icpc

S318

Maxloc/maxelement(abs) all good except possibly icpc with old g++

S319

“coupled reduction” could use forall/sum() or transform/accumulate. Icpc fuses 2 accumulate() STL operations. Ifort and icc optimize better with forall changed to DO/for with VECTOR ALIGNED (single loop, no f90 or STL stuff).  Penryn runs 3 times expected speed.  g++ gains with -fargument-noalias. 

S3110

Index of max in 2D array.   OpenMP threading shown in the C code is incorrect.  It works well, if the shared data names are shadowed in the inner loop, and an omp critical section is used to copy (conditionally) the results from each outer loop to the shared variables. A small improvement may be made in Fortran performance by enclosing the MAXLOC line by !$omp parallel workshare if(n>200)  ... !$omp end parallel workshare, but the -parallel option does better, and only a parallel do loop version matches C.  Inner loop optimization considerations same as s315.

S3111

Masked sum. Vectorized by ifort/icpc, g++ with -fargument-noalias

S3112

STL partial_sum. All same performance.  chksum is wrong in C++ version due to missing sum=b[*n]

S3113

Maxval(abs) all OK except g++

S321

Recurrence. All good but icpc, which needs #pragma unroll(4) for Core architecture. 

S322

2 step recurrence, Fortran parenthesized for shorter recurrence latency. Ifort best, but requires -protect_parens option to maintain performance on Windows.

S323

Source recurrence length reduced. The latest Intel compilers rightly prefer not to split into non-vectorizable and vectorizable parts (with the latter running after the non-vector section completes).  Avoidance of distribution is preferred, as the vectorizable portion uses only operands required also in the non-vectorizable portion.  ifort optimizes only with the statements in optimized order.  icc also requires that the scalar carry-over dependency be written explicitly.

S331,332

Linear search, special vector code dropped, all OK

S341,342

Fortran pack/unpack. C best, ifort uses extra temporary arrays.  icpc suffers on Core architecture, on s341, from time lost copying to both float and int registers.  gfortran 4.4 out-performs ifort on some platforms.

S343

Extract positive vals from 2D array. All compilers perform the same.

S351

Bogus unrolling.

S352

Source unrolling hurts C more than Fortran.  Parentheses in the Fortran version are meant for consistency with the C version.  Intel 9.1 and 10.0 compilers encounter an optimization failure when aggressive optimization is invoked, as without -protect_parens, and the result is slower.  Aggressive optimization invokes re-rolling, which removes the source unrolling, with the intention of producing the same vectorization as the non-unrolled case.

S353

Unrolling removed. All OK except gfortran.

S411,412

Originally “undisciplined” loops with goto, changed to do..exit (not easily pre-countable) or optimum C for() loops. C for(;;) with break would be closer equivalent to Fortran but is not as idiomatic. Intel compilers slightly better than gnu; Fortran syntax is considered non-vectorizable.
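
With a hypothetical body, the two C renderings discussed look like this; neither gives the compiler a trip count it can establish in advance:

for (i = 0; a[i] >= 0.f; ++i)   /* idiomatic C: exit test in the header */
    a[i] += b[i];

for (i = 0; ; ++i) {            /* for(;;) with break: literal do..exit */
    if (a[i] < 0.f)
        break;
    a[i] += b[i];
}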

S413

Original code is non-idiomatic for Fortran, so is replaced by array assignments. Explicit peeling and fusion facilitates vectorization.  It performs poorly with gcc, but OK with gfortran.  ifort requires DISTRIBUTE POINT directive with DO loop to avoid splitting, but then mostly unaligned code is generated (as with icc), which is faster only on i7 CPUs.

S414

Original goto loop changed to equivalent dowhile/while. No benefits from vectorization until Penryn, apparently due to misalignment.  Aligned code version is never used.

S415

Original goto loop changed to equivalent do..exit / for..break.  As it is intentionally not countable (no protection against non-termination), it is not effectively SSE vectorizable.  Surprisingly, ifort is fastest without counting; with counting, g++ is slowest.

S421

This loop is labeled “no overlap,” which is wrong. It tests whether Fortran EQUIVALENCE breaks aliasing analysis; C++ should be easy, as f2c translated EQUIVALENCE by #defining 2 pointers to the same definition, and transform() is introduced. Gfortran has minor difficulty with it.  Ifort stores a 2nd unused temporary copy.  

S422,423

Slightly more complicated versions of s421. Only gfortran has big trouble; ifort uses a temporary.  ifort improves with -xT.  The C version of s422 has a branch with #pragma ivdep, which was effective with icc 9.1, but not with current compilers.  Intel compilers don't generate an aligned code version for CPUs prior to i7, except by directive, although there is only 1 possible misaligned operand (same observation for s424).

S424

Overlap is in the “wrong” direction, but it could be vectorized effectively (as g++ does) either by reversing or by noticing that the overlap distance is bigger than the vector size. Ifort uses a temporary. ifort 10 stores non-temporal to the temporary, unless the array named array is arbitrarily dimensioned to a size < 100kB, or a LOOP COUNT directive is supplied.  As the temporary has to be read back immediately, this nontemporal store ruins performance.  Vectorization with a temporary is still faster than scalar.

S431

Tests whether aliasing analysis uses PARAMETER constants. F2c propagated them into the pointers, so it is not a severe test for C++. 

S432

Same as s431, using local static variables initialized to constants. F2c interpretation appears to conflict with f90 (f77 allowed variations). Only gfortran has a big problem. Icpc 10 has a regression from 9.1 in this and the preceding case in not vectorizing without #pragma ivdep.

S441

Original Fortran arithmetic ifs translated to double merge/’?’ Vectorized only with Intel VECTOR ALIGNED; g++ has the most difficulty.

S442

Original computed goto translated to select case/switch, as these are idiomatic in both Fortran 77 and C, and are the only fair equivalents.  gfortran performs well when linked with Intel OpenMP compatibility library.  Intel compilers produce erratic performance unless code alignment directives are added.

S443 behaves similarly to S124 after reconciling with modern syntax.

S451

Nonsensical combination of sin and cos. As C++ valarray doesn’t support float data types, and has no syntactic advantage, it was not tested. Intel compilers (except for 32-bit C) vectorize as svml calls.

S452,453

A vector operand (float) is generated from the loop counter (int). Intel compilers vectorize, but the advantage over scalar unrolling is small.  Microsoft compilers reached satisfactory performance only with latest 64-bit VC9.

S471

Attempts to measure overhead of empty external function call (over same-source file function call). Saving xmm/x87 registers et al makes this a problem. Intel and g++ improved recently.  Penryn was worse until BIOS was fixed.

S481,482

Terminating a loop by terminating program or exiting the function prevents effective IA vectorization. s482 can be optimized by moving the comparison ahead of the unconditional assignment, but icc fails to take advantage of that by eliminating the now redundant copy to a new register.  There appears to be a block against hardware register renaming, making efficient code more important.

S491

Standard “saxpy” with scatter (indirect indexed) store. All compilers same.

S4112

Saxpy with sparse (indirect indexed) operand. Ifort best, gfortran worst

S4113

S491,4112 combined. All compilers same.

S4114

More complicated version of s4112.

S4115

Dot_product with sparse operand. All compilers OK except gfortran.

S4116

“more complicated sparse” dot_product. Ifort and g++ best, gfortran worst.

S4117

Vector operand generated by right shift on loop index.  ifort 11 vectorizes effectively.

S4121

Saxpy disguised in statement function. Gfortran doesn’t accept the extension of statement function in forall, so original f77 code is used, expanded in line in C source. All compilers vectorize.

va

simple copy (literally C++ copy()).  gcc-4.3 and 4.4 templates no longer make the memmove() substitution understood by icpc; icpc falls back to in-line code, which does not optimize.

vag

plain gather (simplified S4112). All good except gfortran.

vas

plain scatter (simplified S491). All compilers OK.

vif

conditional copy. Icpc best (with VECTOR ALIGNED), gfortran worst

vpv

simple vector add (C++ transform). Missed C++ vectorization (regression in icpc 10)

vtv

multiply version of vpv, same comments

vpvtv

like saxpy but more memory access. Vectorized effectively by Intel compilers only.

vpvts

like saxpy with operand over-written. Vectorized by all.

vpvpv

add 3 vectors, over-writing an operand. Vectorized by Intel compilers only.  Parentheses in Fortran require the same order of evaluation as the C code, but ifort doesn't observe the order without -protect_parens.  The variant of parenthesis violation chosen by gnu -ffast-math degrades accuracy.

vtvtv

multiplication version of above. Gnu more competitive although not vectorizing.

vsumr

plain sum reduction (C++ accumulate). Intel vectorization more effective than gnu, due to more aggressive riffling
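
“Riffling” here means splitting the single accumulator into several partial sums, shortening the dependence chain; a sketch (names illustrative; valid only under relaxed FP association, e.g. -ffast-math or -fp-model fast):

float vsumr_like(int n, const float *a)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {    /* four independent partial sums */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)                  /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}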

vdotr

plain dot_product/inner_product. Intel vectorization more effective than gnu

vbor

billed as a test of flops speed, not memory. The original flops count is valid only without optimization. The source was optimized according to Fortran rules so as to reconcile optimizer differences, reducing the claimed 59 flops to 17. No optimizing Fortran compiler would do 59 flops. Vectorized effectively by all, but ifort gains an additional 50% performance with the VECTOR ALIGNED directive.  The equivalent C pragma breaks it.

Similar Platforms

The updated LCD has been tested on IA-64 to verify correctness and efficacy of OpenMP. Specific comments on individual LCD functions would be different. Only in the STL minelement/maxelement functions does g++ out-perform icpc (total opposite of the situation on SSE). Gfortran IA-64 is generally uncompetitive, in spite of continuing work and tacit encouragement from Intel.

Windows XP64 ifort/icpc perform the same as reported above. Windows apparently over-rides EIST sufficiently that good results are obtained regardless of EIST setting.

Multiple L2 cache platforms would depend on KMP_AFFINITY for best performance. Limited Kentsfield and Clovertown tests have been run on linux. Windows multiple core platforms may not perform as well, but this has not been checked.

32-bit ifort performs well on linux x86-64, the most common differences coming in reduced unrolling, which may be an advantage when unrolling is not carried out efficiently. 32-bit icpc 10.0 on linux x86-64 exhibits deficiencies, some of them regressions from 9.1 which are peculiar to compilation on the 64-bit OS.

Conclusions

The LCD benchmark gives useful insights into the vector efficiency of the various Fortran, C, and C++ operators as implemented by the tested compilers. Compilers have begun to take full advantage of f90 and C99 operators, even though the C99 support is not a default and restrict is contrary to the C++ standard. f95 vectorization support remains spotty.

f90 pack/unpack, and f95 features where and forall don't perform consistently with any tested compiler.

Icpc is stronger in vectorization than g++ (helped greatly by pragmas), but g++ sometimes excels in non-vectorizable situations. For ifort, the superiority over gfortran is less marked in the simple benchmarks included in LCD, aside from the directive-dependent cases. At least 2 of the missed vectorizations in gfortran and gcc have already been reported as fixed, while fixes in the Intel compilers are being reserved for consideration for a future 11.0 release. All compilers require a sizeable collection of non-default options to show up well on LCD. None of the requirements are out of line with normal applications, though they differ between the Intel and gnu compilers.

In view of the compiler policy of not making versions to take advantage of alignment when there are more than 4 operands, alignment directives are required in many cases for good performance on CPUs prior to i7.  For i7 (option SSE4.2), there are over 60 cases where multiple versions are generated unnecessarily.

Intel and gnu OpenMP appear to work successfully with backwards use of schedule(guided), where work balance requires increasing chunk sizes.

In summary, these compilers (and others) give excellent support to the vectorization capabilities of Intel platforms.

Attachment: lcd.tgz (44.06 KB)