Optimization problem with allocatable arrays

I detected a performance regression in ifort 14.0 when using allocatable arrays. Optimization works perfectly with fixed-size arrays, but it does not work properly when autoparallelization is combined with allocatable arrays. The same code ran at full performance with ifort 13.1.

When single-threaded code is produced, there is a partial but very significant loss of performance with allocatable arrays.

I attached two files: ifortregression.txt, which shows the compiler version, compiler parameters and execution times, and matmul.F, the source code of the test program.

Attachments:
ifortregression.txt (4.48 KB)
matmul.F (3.31 KB)

Apparently, the clockx function does not work with -parallel in my installation. Generally speaking, it would be better to use the Fortran system_clock intrinsic, with integer arguments of kind selected_int_kind(12), in order to see the wall clock time of threaded applications.
If you did collect cpu time successfully in the threaded case, it would show the total time spent by all threads, so that parallelization would be expected to increase the reported time.
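
The difference between the two timers can be illustrated with a minimal sketch. It is not taken from matmul.F; the program name and the placeholder workload are assumptions made only for illustration. It uses the standard intrinsics system_clock (wall clock, with a wide integer kind obtained from selected_int_kind) and cpu_time (processor time, which many compilers report as the sum over all threads in a threaded run).

```fortran
program wallclock_demo
  implicit none
  ! A wide integer kind gives a fine clock resolution and avoids early wrap-around.
  integer, parameter :: i8 = selected_int_kind(12)
  integer, parameter :: dp = kind(1.0d0)
  integer(i8) :: t0, t1, rate
  integer :: i
  real(dp) :: s, cpu0, cpu1

  call system_clock(count_rate=rate)   ! clock ticks per second
  call system_clock(t0)                ! wall clock start
  call cpu_time(cpu0)                  ! processor time start

  ! Placeholder workload (hypothetical); stands in for the matrix loops.
  s = 0.0_dp
  do i = 1, 50000000
     s = s + sqrt(real(i, dp))
  end do

  call cpu_time(cpu1)
  call system_clock(t1)

  print '(a,f10.3)', ' wall time (s): ', real(t1 - t0, dp) / real(rate, dp)
  print '(a,f10.3)', ' cpu time  (s): ', cpu1 - cpu0
  print *, 'checksum: ', s
end program wallclock_demo
```

In a single-threaded run the two numbers should be close; in a threaded run the wall time is the one that shows whether parallelization helped.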

I notice that my compiler fails to perform the "blocked by 128" optimization on the second version with allocatable arrays. It will be important to study the compiler reports such as those generated by -opt-report.
The compiler fails to parallelize the first version when -parallel -DSIZEARGS is set, apparently because so much loop interchanging is required to achieve single thread optimization, and the outer loop on j is required for effective parallelization. The compiler needs to know that the problem is large enough to benefit from parallelization at the possible cost of single thread performance.
In the allocatable array case, in the absence of loop count directives and the like, the compiler will assume much smaller arrays than what you have set in the fixed dimension version. Apparently, this determines whether the compiler chooses cache blocking optimizations.
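
A loop count hint for the allocatable case might look like the sketch below. This is not the code from matmul.F: the program name, the default size, and the hinted value of 4000 are assumptions chosen to match the -DARDIM=4000 runs in this thread. The `!DIR$ LOOP COUNT` line is an Intel-specific directive; other compilers treat it as an ordinary comment.

```fortran
program loopcount_demo
  implicit none
  integer :: n, i, j, k
  character(len=32) :: arg
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)

  ! The size is only known at run time (optional first command-line argument).
  n = 1000
  if (command_argument_count() >= 1) then
     call get_command_argument(1, arg)
     read(arg, *) n
  end if

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0

  ! Hint (assumed value): expect about 4000 iterations in each loop,
  ! so the compiler does not plan for small arrays.
  !DIR$ LOOP COUNT (4000)
  do j = 1, n
     !DIR$ LOOP COUNT (4000)
     do k = 1, n
        !DIR$ LOOP COUNT (4000)
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do

  print *, 'checksum: ', sum(c)
end program loopcount_demo
```

The hint does not change the result; it only informs the optimizer's choices, so it is worth checking the optimization report to see whether blocking or parallelization decisions actually change.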

I can run the program with "time"; the cpu and wall time output is consistent with the overall values reported by time:

cp003421_matmul> OMP_NUM_THREADS=4 time ./a.out 6000                                      
 Running with array sizes        6000 by        6000
  dtime:     0.480   real time:     0.121    init
  dtime:    17.740   real time:     4.437     ikj
  dtime:    17.960   real time:     4.494     jki
 Sum of elements:           533082291842201.125
36.09user 0.16system 0:09.21elapsed 393%CPU (0avgtext+0avgdata 877168maxresident)k
0inputs+0outputs (0major+3881minor)pagefaults 0swaps

-opt-report reveals that (with option -parallel) the 14.0 compiler "forgets" to replace the matrix multiplication with the matmul intrinsic, while the 13.1 compiler does this replacement. I wonder why this feature has been dropped.

Many loops require a par-threshold of 99 or less instead of the default 100 to be parallelized:

> ifort -O3 -parallel -par-threshold99 -par-report -DARDIM=4000 -DSIZEARGS -mavx matmul.F
matmul.F(77): (col. 18) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(106): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.
matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(120): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
> ifort -O3 -parallel -par-threshold -par-report -DARDIM=4000 -DSIZEARGS -mavx matmul.F 
matmul.F(106): (col. 7) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.
matmul.F(120): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.

In parallel mode both matrix multiplications are treated the same; loops are permuted if necessary. So I wonder why loop order jki is treated differently in sequential mode. In fact, starting with compiler version 10, jki is the only one of the 6 possible loop orders to show reduced performance.

 

-opt-matmul should be an effective way to parallelize. I don't know why it may no longer be automatic with parallel.

Unfortunately -opt-matmul is ignored by the 14.0 compiler if allocatable arrays are used. My full test program (source attached) shows that even calling the matmul intrinsic directly gives a slower executable than with compiler version 13.1. Calling dgemm directly is also slower, and this is definitely a compiler effect, not a library effect: I can run both executables with either libMKL and there is no difference in execution speed.

> ifort14.0 -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS -DDGEMM -mavx -mkl crunchtbvar.F                                                                                         
> OMP_NUM_THREADS=4 time ./a.out 4000
 Running with array sizes        4000 by        4000                                                
  dtime:     0.210   real time:     0.056    init                           
  dtime:     9.520   real time:     2.379     ijk
  dtime:     9.770   real time:     2.444     ikj
  dtime:     9.730   real time:     2.434     jik
  dtime:     9.590   real time:     2.398     jki
  dtime:     9.730   real time:     2.436     kij
  dtime:     9.540   real time:     2.385     kji
 Sum of elements:           105312418747995.297
  dtime:     9.520   real time:     2.387  matmul
  dtime:     6.840   real time:     2.050   dgemm
 Sum of elements:           105312418747995.297
74.41user 0.10system 0:19.02elapsed 391%CPU (0avgtext+0avgdata 400392maxresident)k
0inputs+0outputs (0major+5068minor)pagefaults 0swaps
> ifort13.1 -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS -DDGEMM -mavx -mkl crunchtbvar.F                                                                                         
> OMP_NUM_THREADS=4 time ./a.out 4000
 Running with array sizes        4000 by        4000
  dtime:     0.230   real time:     0.059    init
  dtime:     6.300   real time:     1.574     ijk
  dtime:     5.300   real time:     1.329     ikj
  dtime:     5.380   real time:     1.346     jik
  dtime:     5.390   real time:     1.348     jki
  dtime:     5.420   real time:     1.355     kij
  dtime:     5.450   real time:     1.361     kji
 Sum of elements:           105312418747995.312
  dtime:     5.390   real time:     1.356  matmul
  dtime:     5.420   real time:     1.357   dgemm
 Sum of elements:           105312418747995.297
44.27user 0.08system 0:11.22elapsed 395%CPU (0avgtext+0avgdata 400376maxresident)k
0inputs+0outputs (0major+3820minor)pagefaults 0swaps
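
For readers without the attachment, the "matmul" and "dgemm" rows above correspond to two ways of computing the same product on allocatable arrays. The sketch below is not taken from crunchtbvar.F; the program name and the size 500 are assumptions. It uses the standard BLAS dgemm interface and therefore has to be linked against a BLAS implementation (for example with -mkl on ifort, or -lblas elsewhere).

```fortran
program matmul_vs_dgemm
  implicit none
  integer, parameter :: n = 500          ! illustrative size (assumption)
  double precision, allocatable :: a(:,:), b(:,:), c1(:,:), c2(:,:)
  external :: dgemm                      ! provided by the linked BLAS library

  allocate(a(n,n), b(n,n), c1(n,n), c2(n,n))
  call random_number(a)
  call random_number(b)

  ! Variant 1: the Fortran intrinsic.
  c1 = matmul(a, b)

  ! Variant 2: BLAS, computing c2 = 1.0 * a * b + 0.0 * c2.
  c2 = 0.0d0
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c2, n)

  ! Both results should agree up to rounding.
  print *, 'max abs difference: ', maxval(abs(c1 - c2))
end program matmul_vs_dgemm
```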
 

Attachments:
crunchtbvar.F (5.24 KB)

You have so many different combinations of compiler options and tests that it's hard to see if there is any regression or not.  Can we just focus on one case at a time and make sure we do apples-to-apples comparisons?  In fact, the following case shows that 14.0.2 is about 2x faster than 13.1.2:

===================14.0.2=============

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120

$ ifort -O3 -parallel -par_threshold90 -par-report -DARDIM=4000 matmul.F -o matmul-14.0.2.144-pt90.x
matmul.F(73): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED
$ export OMP_NUM_THREADS=1
$ time ./matmul-14.0.2.144-pt90.x
 Running with array sizes        4000 by        4000
  dtime:     0.190   real time:     0.195    init
  dtime:     5.500   real time:     5.504     ikj
  dtime:     5.500   real time:     5.498     jki
 Sum of elements:           105312418747995.250

real    0m15.850s
user    0m11.123s
sys     0m0.087s
$ export OMP_NUM_THREADS=4
$ time ./matmul-14.0.2.144-pt90.x
 Running with array sizes        4000 by        4000
  dtime:     0.200   real time:     0.054    init
  dtime:     5.620   real time:     1.406     ikj
  dtime:     5.600   real time:     1.400     jki
 Sum of elements:           105312418747995.031

real    0m3.173s
user    0m11.336s
sys     0m0.106s
$

=========================13.1.2=====================

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.2.183 Build 20130514

$ ifort -O3 -parallel -par_threshold90 -DARDIM=4000 matmul.F -par-report -o matmul-13.1.2.183-pt90.x
matmul.F(73): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(82): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(90): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(103): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(119): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
matmul.F(136): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
$ export OMP_NUM_THREADS=1
$ time ./matmul-13.1.2.183-pt90.x
 Running with array sizes        4000 by        4000
  dtime:     0.190   real time:     0.195    init
  dtime:    11.000   real time:    10.999     ikj
  dtime:    10.970   real time:    10.970     jki
 Sum of elements:           105312418747995.250

real    0m22.492s
user    0m22.101s
sys     0m0.075s
$ export OMP_NUM_THREADS=4
$ time ./matmul-13.1.2.183-pt90.x
 Running with array sizes        4000 by        4000
  dtime:     0.200   real time:     0.053    init
  dtime:    10.990   real time:     2.753     ikj
  dtime:    10.990   real time:     2.747     jki
 Sum of elements:           105312418747995.016

real    0m7.385s
user    0m22.113s
sys     0m0.088s
$

Perhaps this is related to your test machine, which you indicated was a Xeon E31240.  That's an older Sandy Bridge box with a smallish 8 MB cache.  I ran my tests on an Ivy Bridge box with 20 MB of cache.

Patrick

The presentations on new versions of MKL raise suspicions that opt-matmul functionality is being replaced:
The following examples of code and link lines show how to partially inline Intel MKL functions in Fortran applications:

Include
mkl_inline_pp.fi, to be preprocessed by the Intel® Fortran Compiler preprocessor
mkl_inline.fi for each subroutine that calls *GEMM
# include "mkl_inline_pp.fi"
      program DGEMM_MAIN
      include 'mkl_inline.fi'
....
* Call Intel MKL DGEMM
....
      call sub1()
      stop 1
      end

* A subroutine that calls DGEMM
      subroutine sub1
* Need to include mkl_inline.fi for each subroutine that calls DGEMM
      include 'mkl_inline.fi'
* Call Intel MKL DGEMM

      end
Compile with /fpp compiler option and MKL_INLINE preprocessor macro to use threaded Intel MKL:
ifort /DMKL_INLINE /fpp your_application.f mkl_intel_lp64.lib mkl_core.lib mkl_intel_thread.lib /Qopenmp -I%MKLROOT%/include
Compile with -fpp compiler option and MKL_INLINE_SEQ preprocessor macro to use Intel MKL in the sequential mode:
ifort /DMKL_INLINE_SEQ /fpp your_application.f mkl_intel_lp64.lib mkl_core.lib mkl_sequential.lib -I%MKLROOT%/include

The presenter declined to discuss this.

-opt-matmul does not use MKL directly - it calls into an MKL-derived routine in the Fortran support library, since Fortran needs more than the generic xGEMM can supply.

Steve - Intel Developer Support

On the example originally presented in this thread, using the mkl_inline_pp, I get best performance at 2 threads (not significantly affected by SIZEARGS).
I am seeing a lack of optimization by opt-matmul with recent ifort versions when SIZEARGS is set, even with a MATMUL substitution in the source code. Without SIZEARGS, Qopt-matmul gives 25% better performance than GEMM, with further improvement up to 4 threads.
So, the current opt-matmul seems to give an advantage not present in dgemm, when it works.
As Steve said, ifort opt-matmul uses its own entry point into MKL, unlike gfortran which uses gemm (but that is designed to work with MKL only on linux).
I was wondering, in view of comments about changes in opt-matmul, along with the advertising of the new include file interface for gemm, whether we should expect changes in support of opt-matmul.

Remarks on Patrick's comment:

As I mentioned in ifortregression.txt the two compiler versions have different strategies regarding use of avx instructions. In single threaded mode both require the -mavx flag and linking against MKL does not improve performance; execution speed is the same for both compiler versions. In parallel mode the 14.0.2 compiler's matmul intrinsic uses avx by default, while the 13.1.2 compiler ignores the -mavx flag and requires linking against MKL to use avx. That's why I use different compiler flags for the two versions.

Without SIZEARGS both compilers produce executables with the same performance; in fact the 14.0.2 compiler is somewhat better, since both loop orders give the same speed, while with the 13.1.2 compiler loop order ikj is slower (and varies considerably when repeating the run). With SIZEARGS the 14.0.2 compiler does not replace the loops with the matmul intrinsic even when -opt-matmul is specified.

I believe that allocatable arrays are the standard for production software; therefore it is regrettable that the latest compiler version gives slower code.

What really worries me is that the execution time of dgemm increases significantly when it is called with allocated arrays. It looks as if the memory layout of the allocated arrays is not as optimized as it used to be with the 13.1.2 compiler.

Axel

Is your point that there are too many compiler options and directives to play with?

To my surprise, the option -align array32byte doesn't prove useful.  avx and avx2 are helpful only at -O3.

As I suggested, the loop count directives make a huge difference (>10x, if not setting par-threshold) to your allocatable array case.  Do you consider that unacceptable?  They don't entirely eliminate the difference, presumably because the generated code must still allow for the run-time determination of array sizes and so it makes more code version branches.

Auto-parallelization seems unusually effective for this case, once you learn about using loop count directives when you deny the compiler the fixed dimension information.  Given that various MKL solutions are effective, this doesn't make a strong case for auto-parallelization.

This seems to show a weakness of the opt-matmul scheme in that the loop count directives don't work for that case.

Correction:

Some of the performance deficits I have detected look like an initialization effect of MKL. When I repeat the first nested loop (ifort13 -parallel -mkl) or the dgemm call (ifort14 -parallel -mkl) in the source, the second execution of the identical code sections runs at full speed. Sorry that I didn't notice that earlier. I do not yet understand why this initialization overhead varies so much when I repeat the runs.

Two problems remain. First, both compiler versions produce slow code for the jki loop order in single-threaded mode, whereas with -parallel all 6 loop orders and the explicit matmul call run at the same speed. Second, compiler version 14.0.2 produces slower code when autoparallelization is used with allocatable arrays. This seems to be because it no longer offers the option to use MKL routines by specifying -mkl.

-mkl option is only a shortcut for linking commonly used groups of MKL libraries.  It doesn't have effects such as implying -opt-matmul.

As far as compiler version 13.1.2 is concerned, -mkl does improve performance. I can only guess that identically named internal procedures are replaced by higher performance mkl functions:

> ifort --version
ifort (IFORT) 13.1.2 20130514
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.
> ifort -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
 Running with array sizes        4000 by        4000
  dtime:     0.250   real time:     0.068    init
  dtime:    10.140   real time:     2.537     ikj
  dtime:    10.160   real time:     2.540     jki
 Sum of elements:           105312418747995.031
> ifort -O3 -parallel -par-threshold99 -mkl -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
 Running with array sizes        4000 by        4000
  dtime:     0.210   real time:     0.056    init
  dtime:     6.080   real time:     1.518     ikj
  dtime:     5.240   real time:     1.311     jki
 Sum of elements:           105312418747995.016
 

With compiler version 14.0.2, -mkl has no influence on performance:

> ifort --version
ifort (IFORT) 14.0.2 20140120
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
> ifort -O3 -parallel -par-threshold99 -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
 Running with array sizes        4000 by        4000
  dtime:     0.220   real time:     0.058    init
  dtime:    12.980   real time:     3.246     ikj
  dtime:    12.810   real time:     3.204     jki
 Sum of elements:           105312418747995.016
> ifort -O3 -parallel -par-threshold99 -mkl -DARDIM=4000 -DSIZEARGS matmul.F
> OMP_NUM_THREADS=4 ./a.out
 Running with array sizes        4000 by        4000
  dtime:     0.220   real time:     0.058    init
  dtime:    12.810   real time:     3.203     ikj
  dtime:    12.820   real time:     3.205     jki
 Sum of elements:           105312418747995.016
 

>>I have detected look like an initialization effect of MKL. When I repeat the first nested loop (ifort13 -parallel -mkl) or the dgemm call (ifort14 -parallel -mkl) in the source, the second execution of the identical code sections runs at full speed. Sorry that I didn't notice that earlier. I do not yet understand why this initialization overhead varies so much when I repeat the runs.

The default MKL library is multi-threaded through use of OpenMP. The first call contains the overhead of creating the OpenMP thread pool (*** internal to MKL). Therefore, for timing purposes, the standard procedure is either to discard the timing results for the first pass or to insert into your code, prior to the timed section, a call to MKL that you know establishes its thread pool.
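
Discarding the first pass can be sketched as follows. This is a minimal illustration, not code from the attached sources: the program name, the size 1000, and the use of the matmul intrinsic as the timed kernel are assumptions. Whether the first pass is really slower depends on the compiler options and the libraries in use.

```fortran
program warmup_demo
  implicit none
  integer, parameter :: i8 = selected_int_kind(12)
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000         ! illustrative size (assumption)
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(i8) :: t0, t1, rate
  real(dp) :: secs
  integer :: pass

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  call system_clock(count_rate=rate)

  do pass = 1, 3
     call system_clock(t0)
     c = matmul(a, b)                    ! the kernel being timed
     call system_clock(t1)
     secs = real(t1 - t0, dp) / real(rate, dp)
     if (pass == 1) then
        ! May include one-time costs such as thread pool creation.
        print '(a,f8.3,a)', ' pass 1 (warm-up, discard): ', secs, ' s'
     else
        print '(a,i1,a,f8.3,a)', ' pass ', pass, ': ', secs, ' s'
     end if
  end do

  print *, 'checksum: ', sum(c)
end program warmup_demo
```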

RE the ***

When your application is multithreaded you might want to experiment with linking the single-threaded MKL. In other words, each of your application threads can call MKL concurrently, and each call into MKL stays on the calling thread. Should each of your application threads call the multithreaded MKL concurrently, you tend to oversubscribe: the thread count becomes (number of concurrent calls) * (number of threads spawned per MKL instance). On a 4-core/8-thread system this could explode to 64 threads if done improperly.
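
Parallelism at the application level, with one independent product per thread, could be sketched as below. The program name, the number of blocks, and the block size are assumptions for illustration. Whether each matmul call really stays on its calling thread depends on the compiler options and the library linked, so treat that part as an assumption as well. The `!$` lines are only compiled when OpenMP is enabled (for example -qopenmp or -fopenmp); without it the loop simply runs sequentially.

```fortran
program outer_parallel_demo
  !$ use omp_lib
  implicit none
  integer, parameter :: nb = 8           ! independent problems (assumption)
  integer, parameter :: n  = 200         ! size of each problem (assumption)
  double precision, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  integer :: ib

  allocate(a(n,n,nb), b(n,n,nb), c(n,n,nb))
  a = 1.0d0
  b = 2.0d0

  ! One product per iteration; iterations are distributed over the threads.
  !$omp parallel do
  do ib = 1, nb
     c(:,:,ib) = matmul(a(:,:,ib), b(:,:,ib))
  end do
  !$omp end parallel do

  !$ print *, 'max application threads: ', omp_get_max_threads()
  ! Each element is a sum of n products 1.0 * 2.0, i.e. 2*n = 400.
  print *, 'c(1,1,1) = ', c(1,1,1)
end program outer_parallel_demo
```

With 8 application threads and a library that itself starts 8 threads per call, the product 8 * 8 = 64 is the oversubscription mentioned above, which is why a sequential library is the usual pairing for this pattern.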

Jim Dempsey

www.quickthreadprogramming.com

If your application offers opportunity for parallelism at a higher level than individual MKL function calls,  that should be useful.  As Jim said, this might be done with the mkl sequential library.  

Even if you call the threaded MKL from an OpenMP threaded region, MKL shouldn't use additional threads until you set OMP_NESTED.    Even though the library may be able to prevent over-subscription, there aren't satisfactory methods to maintain data locality, so I agree in general with Jim's cautions.
