# A basic question about auto vectorization of 3-level nested loop


Hi All,

I would appreciate your help with this basic question.

I found the following nested loop in the paper "Model-Driven SIMD Code Generation for a Multi-Resolution Tensor Kernel":

for(int i = 0; i < LEN2; i++) {
    for(int k = 0; k < LEN2; k++) {
        for(int j = 0; j < LEN2; j++) {
            cc[i][j] += aa[j][k] * bb[k][i];
        }
    }
}

I modified it as follows:

for(int i = 0; i < LEN2; i++) {
    for(int k = 0; k < LEN2; k++) {
        t = bb[k][i];
        for(int j = 0; j < LEN2; j++) {
            cc[i][j] += aa[j][k] * t;
        }
    }
}

I used ICC to test these two pieces of code, checking the execution time and the numbers of scalar and vector operations.

1. The CPU is Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
2. The ICC version is icc (ICC) 12.1.5 20120612.
3. OS: GNU/Linux

The output is:

Loop                 Time(Sec) Checksum
tensor_kernel      2.25        0.210001
PAPI_FP_OPS 8
PAPI_VEC_SP 26369718676

tensor_kernel     20.55       0.210001
PAPI_FP_OPS 9
PAPI_VEC_SP 26264735280

The execution times are quite different: one is about 2 seconds, the other around 20 seconds.

On the other hand, the PAPI results show that the counts of scalar and vector instruction operations are similar. To confirm the PAPI results, I checked the .s code and didn't find an obvious difference. I don't understand what causes this difference in execution time.

The code is attached; it uses the TSVC framework. (By default the PAPI version will also be built; you can remove it from the list.)

Thank you very much for reading this question.

Best Regards,

Susan

Attachment size: 3.49 KB

If you gave t local scope, e.g.

for(int k = 0; k < LEN2; k++) {
    float t = bb[k][i];

...

the compiler would be freer to optimize.  However, the code still looks intended to prevent the compiler from optimizing the loop nesting (at least if you set -O3 -ansi-alias) so as to get a unit-stride inner loop.  Then you might possibly also get automatic unroll-and-jam, so that perhaps only every other iteration on k writes back, if the optimized loop nesting didn't already accomplish that.

If you compared the opt-report results, you might get a text indication of which optimization is suppressed by giving t scope outside the nested loops.

It would be better to write the loops with optimized nesting yourself before inserting a temporary which might block optimization.

I don't know whether PAPI allows you to distinguish between "simulated gather" (where the compiler vectorizes with scalar loads) and stride-1 parallel data access; it would seem not.

Hi Tim,

Thank you very much for the reply.

I am still confused and need your further help. Since the two versions' assembly code has almost the same instruction sequence, I don't understand why the execution times differ. (BTW, I didn't see any runtime selection between alternative code paths in the .s code.) PAPI uses hardware counters to collect these numbers at runtime, and the numbers look reasonable to me given the identical instruction sequences in the .s files.

I tried the modification you mentioned; the performance showed no obvious improvement.

Thank you again!

Susan

Quote:

TimP (Intel) wrote:

If you would give t local scope e.g.

for(int k = 0; k < LEN2; k++) {
    float t = bb[k][i];

...

I would go even further. When I use this kind of nesting, I often write:

for(int k = 0; k < LEN2; k++) {
    const float t = bb[k][i];

to give the compiler a hint that it can optimize, since later modification is not possible.

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

Thank you very much for the further suggestion. I tried const and register; neither of them helps.

From the .s files and the counter results I can tell no obvious difference, but the actual execution times differ a lot; it is a little confusing...

Susan

Susan,

Changes like that one can have a severe impact on the performance of an application. There are two possibilities:

1. An alignment-related problem: the change may create some misalignment, and adding a __declspec( align( xx ) ) specifier, or a dummy variable of the right size, could fix the problem. Note: I had similar problems with two legacy C/C++ compilers, and it is a real problem, especially when the software needs to be portable. A solution: declare with an alignment specifier.

2. A variable declared in a local scope is not initialized: it is possible that assigning some default value will fix the problem ( it looks like "magic", but a couple of times I was really puzzled by such fixes! ).
...
int tensor_kernel_1()
{
    // example from the paper "Model-Driven SIMD Code Generation for a Multi-Resolution Tensor Kernel"
    clock_t start_t, end_t, clock_dif; double clock_dif_sec;
    float t = 0.0f;

    init( "tensor_kernel ");
...
Note: these variables, that is, start_t, end_t, clock_dif, and clock_dif_sec, are also not initialized.

A combined solution could look like this: declare the variable in the scope where it is used, with an alignment specifier, and initialize it ( along with all the other automatic variables ).

Sergey,

Thank you very much for your detailed reply. In the code I used __attribute__((aligned(x))) when allocating the global arrays.

I have another question about alignment: I have the impression that icc can generate conditional code to peel off the unaligned part of a loop. Is that true?

BTW, initializing the variables doesn't noticeably help performance in this case.

Susan

Yes, in most cases where a loop is vectorizable but alignment is not assured, the compiler peels some iterations so as to begin the vectorized portion at an aligned point in an array for which alignment is recognized as useful.

Peeling for alignment is not so important when dealing with 128-bit parallel instructions on second-generation Core i7 or newer, as those CPUs have excellent support for unaligned 128-bit parallel access.  So it is possible to see AVX-128 code which doesn't peel for alignment even though it supports misaligned data.  That option is chosen more frequently by gcc than by icc.

The matrix operation case which you pose at the beginning of this thread raises additional alignment issues.  In order for __attribute__((aligned())) to support inner vector loops without peeling for alignment, not only do the loops need to be nested appropriately, but the corresponding extents of the arrays must be multiples of 16 bytes (32 for AVX-256...)...

Tim,

Thank you very much for the information.

"but the corresponding extents of the arrays must be multiples of 16 bytes (32 for AVX-256...)" — I used 32-byte-aligned allocation, and the array size is a multiple of 32 bytes. Did I miss anything or misunderstand your point?

Since the instruction-related counters show similar results, I will try to check the L1/L2 counters with PAPI later. Besides SIMD, I want to check whether the memory hierarchy is the reason.

Best Regards,

Susan

Hi Susan, I'll check how your test project works on Ivy Bridge ( executables will be compiled with different C/C++ compilers ) and post results.

Sergey,

Thank you very much.

I got the following results related to the L1 and L2 caches: (L1 data miss: PAPI_L1_DCM; L1 instr miss: PAPI_L1_ICM; L2 data miss: PAPI_L2_DCM)

Loop                   Time(Sec)         Checksum

tensor_kernel_1         2.22                 0.210001
PAPI_FP_OPS 8
PAPI_VEC_SP 26351151240
PAPI_L1_DCM 953799685
PAPI_L1_ICM 788
PAPI_L2_DCM 120717240

tensor_kernel_2         19.44                0.210001
PAPI_FP_OPS 8
PAPI_VEC_SP 26288592672
PAPI_L1_DCM 13229545630
PAPI_L1_ICM 3169
PAPI_L2_DCM 7702311276

It seems the L1 data misses increased by about 13.8x, and that might be the reason. But I don't know why this happened; does it mean that every execution of t = bb[k][i] causes an L1 miss? I will try modifying this statement and have a look.

Susan

We've tried to explain how you've defeated compiler optimizations for data locality.  It's interesting how your statistics show the effects of backward nested loops.

>>...Seems that the L1 data miss got a great increase of 13.8x and it might be the reason. But I dont know why this
>>happened, does it mean that whenever it execute t=bb[][], it will cause a L1 miss? I will try to modify this
>>statement and have a look.

Susan, these numbers are very interesting. I will look at the datasheet of your CPU ( Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz ) and at the test case tomorrow.

About the increase in L1/L2 misses, I would still like your opinion on the following:

In the code, LEN2=256

So for the innermost loop (cache line size 64 B): LEN2*4 B are read/written for the cc array (one row); LEN2*64 B are read for the aa array (LEN2 cache lines in total); altogether 256*4 + 256*64 = 17 KB. The L1 data cache is 32 KB, so I don't understand why this still causes an obvious increase in L1/L2 misses. I changed LEN2 to 128; the L1 misses still increased by about 15x, and the L2 misses increased as well. If I have anything wrong, please pardon me and do me a favor of correcting it.

BTW, a personal impression: a normal (non-expert) programmer like me might think it is good to assign t = bb[k][i] and hoist it out of the innermost loop, since this array access does not depend on the innermost loop variable. Or I might just happen to write the code with such a t, feeling it makes no big difference since the value is a constant for the innermost loop iterations... The paper says that since the last index of each array in the statement is different (one i, one j, and one k) and hard to use for determining a loop permutation, the hand-written code they give just unrolls and hoists this kind of t out of the innermost loop.

I have an additional question. I know icc can recognize the typical matrix-multiply routine; can icc recognize a typical 4x4 (or 8x8 for AVX) floating-point matrix transpose and use pack/unpack/shuffle instructions to deal with it?

Susan, let me answer the last part of your post for the moment.

>>...can icc recogonize a typical 4x4 (or 8x8 for AVX) float point matrix transpose and use pack/unpack/shuffle to
>>deal with that?

It looks like no, but I'm not 100% sure. There is a macro _MM_TRANSPOSE4_PS declared in the xmmintrin.h header file. I have known about that macro for more than 5 years and used it in an image processing project. However, in 2012 I evaluated _MM_TRANSPOSE4_PS performance and it "failed" in a "competition" with a hard-coded non-SSE transpose of a 4x4 matrix of single-precision floating-point values.

The hard-coded non-SSE transpose of a 4x4 matrix looks like:
...
float fValue = 0.0f;

fValue = fMxIn[0][1]; fMxIn[0][1] = fMxIn[1][0]; fMxIn[1][0] = fValue;
fValue = fMxIn[0][2]; fMxIn[0][2] = fMxIn[2][0]; fMxIn[2][0] = fValue;
fValue = fMxIn[0][3]; fMxIn[0][3] = fMxIn[3][0]; fMxIn[3][0] = fValue;
fValue = fMxIn[1][2]; fMxIn[1][2] = fMxIn[2][1]; fMxIn[2][1] = fValue;
fValue = fMxIn[1][3]; fMxIn[1][3] = fMxIn[3][1]; fMxIn[3][1] = fValue;
fValue = fMxIn[2][3]; fMxIn[2][3] = fMxIn[3][2]; fMxIn[3][2] = fValue;
...
As you can see, there are no for-loops in the code, and it outperformed the _MM_TRANSPOSE4_PS macro.

I don't know of any AVX-based transpose macro or of support for one in the Intel C++ compiler. Where did you see or read about it? I just checked the immintrin.h header file and didn't find any macro for transposing an 8x8 matrix.

Note: transposes of 4x4 or 8x8 matrices are the trivial cases; if you are interested in transposing larger matrices, then diagonal or Eklundh-based algorithms need to be used.

When you force the compiler to follow the bad organization of loop nesting you quoted at the top of this thread, each element of aa[] comes from a different cache line, so you may expect to read the entire array each time through the inner loop.  Your papi statistics appear to verify this.

The compiler apparently vectorized the loop using "simulated gather," where elements read separately, each from a different cache line, are packed into a vector register.  Thus you execute roughly the same number of vector floating-point operations as in the stride-1 vectorized case, but you replace the parallel reads from aa[] with scalar reads.  You say you looked at the asm code; you would see the parallel data moves replaced by groups of insertps instructions.

Sergey, thank you very much for the useful and detailed info. I will try to implement one.

>>...Seems that the L1 data miss got a great increase of 13.8x and it might be the reason.

I used VTune and I confirm that ( reproduced on 3rd Gen i7 Ivy Bridge / Windows 7 Professional ).

>>...But I dont know why this happened, does it mean that whenever it execute t=bb[][], it will cause a L1 miss?

Yes, but only in the case of aggressive optimization by the compiler with the /O3 option. As a summary, this is what I saw in VTune ( for your "optimized" function ):

...A significant proportion of cycles is being spent on data fetches that miss in the L2 but hit in the LLC...

Here are results of investigation:

Optimization: /Od
Loop Time(Sec) Checksum
tensor_kernel 99.0000 0.214022
tensor_kernel_1 85.0000 0.214022

Optimization: /O1
Loop Time(Sec) Checksum
tensor_kernel 25.0000 0.214022
tensor_kernel_1 25.0000 0.214022

Optimization: /O2
Loop Time(Sec) Checksum
tensor_kernel 20.0000 0.213768
tensor_kernel_1 20.0000 0.214172

Optimization: /O3 Note: Performance negatively affected
Loop Time(Sec) Checksum
tensor_kernel 3.0000 0.213823
tensor_kernel_1 11.0000 0.214172

Source codes attached.

## Attachments:

Attachment size: 5.29 KB

Summary of VTune results is attached and here are a couple of screenshots:

## Attachments:

Attachment sizes: 2.82 KB, 216.32 KB, 132.9 KB

Hi Sergey,

Thank you very much for the info. That's really helpful.

Now I understand that I should be very careful with this modification; it is better used together with tiling (or other transformations) to avoid cache misses.

Best Regards,

Susan

This is a follow-up: you could try ( if interested ) a #pragma prefetch [var1 [:hint1 [:distance1]] [, var2 [:hint2 [:distance2]]]...] directive. The values for the hint arguments are as follows:

- _MM_HINT_T0 - for integer data that will be reused
- _MM_HINT_NT1 - for integer and floating point data that will be reused from L2 cache
- _MM_HINT_NT2 - for data that will be reused from L3 cache
- _MM_HINT_NTA - for data that will not be reused

and the directive is used as follows ( without a hint value in my example ):
...
float t;
...
#pragma prefetch t
for( int k = 0; k < LEN2; k++ )
{
    t = bb[k][i];
    for( int j = 0; j < LEN2; j++ )
    {
...

I detected a similar problem in a different algorithm, and the negative effect of such manual optimization is significant ( it slows execution by ~2x ). Just take a look at the numbers:

Intel C++ compiler - /O2

[ Problems with Cache Lines ( after manual-optimization similar to Susan's ) ]
...
Matrix Size: 1024 x 1024
...
Calculating...
Matrix multiplication - Pass 1 - Completed: 1.40400 secs
Matrix multiplication - Pass 2 - Completed: 1.18500 secs
Matrix multiplication - Pass 3 - Completed: 1.20200 secs
Matrix multiplication - Pass 4 - Completed: 1.20100 secs
Matrix multiplication - Pass 5 - Completed: 1.18500 secs
...

[ No Problems with Cache Lines - ~50% faster (!) ]
...
Matrix Size: 1024 x 1024
...
Calculating...
Matrix multiplication - Pass 1 - Completed: 0.82700 secs
Matrix multiplication - Pass 2 - Completed: 0.62400 secs
Matrix multiplication - Pass 3 - Completed: 0.62400 secs
Matrix multiplication - Pass 4 - Completed: 0.63900 secs
Matrix multiplication - Pass 5 - Completed: 0.62400 secs
...

This is a follow-up. For example, if this code:

...
// (A)
for( j = 0; j < uiSize; j++ )
{
    tC[i][j] = ( T )0;
    for( k = 0; k < uiSize; k += 4 )
    {
        tC[i][j] += ( ( tA[i][k  ] * tB[k  ][j] ) +
                      ( tA[i][k+1] * tB[k+1][j] ) +
                      ( tA[i][k+2] * tB[k+2][j] ) +
                      ( tA[i][k+3] * tB[k+3][j] ) );
    }
}
...

is changed as follows:

...
// (B)
{
    register T tR = ( T )0;
    for( k = 0; k < uiSize; k += 4 )
    {
        tR += ( ( tA[i][k  ] * tB[k  ][j] ) +
                ( tA[i][k+1] * tB[k+1][j] ) +
                ( tA[i][k+2] * tB[k+2][j] ) +
                ( tA[i][k+3] * tB[k+3][j] ) );
    }
    tC[i][j] = ( T )tR;
}
...

Then performance improves by ~3.9% compared to (A).

If tB is transposed:

...
// (C)
{
    register T tR = ( T )0;
    for( k = 0; k < uiSize; k += 4 )
    {
        tR += ( ( tA[i][k  ] * tB[k][j  ] ) +
                ( tA[i][k+1] * tB[k][j+1] ) +
                ( tA[i][k+2] * tB[k][j+2] ) +
                ( tA[i][k+3] * tB[k][j+3] ) );
    }
    tC[i][j] = ( T )tR;
}
...

Then performance improves by ~30.5% compared to (A).

This may be off the point of this thread, but examining your code, it is a matrix multiplication where the destination is neither of the sources. It might not hurt to investigate producing ccT, the transpose of the eventual result, and then, after completely computing ccT, transposing it to produce cc.

Although this adds extra work for the final transpose, it removes latencies by improving vectorization and cache utilization.

Also, it wouldn't hurt to add a test using MKL.

Jim Dempsey