Array notation surprisingly slow

Array notation surprisingly slow

Hi,

i made a unusual discovery today. I wanted to compare the speed of a program using array notation with the speed of a program without it. The program adds two arrays of 1,000,000 integers one thousand times and then reports the result (see below).

 int main(int argc, char ** argv) { 
 int array1[1000000];
 int array2[1000000];
 int a = 1;
 for (int i = 0; i < 1000000; i++) {
 array1[i] = a; 
 array2[i] = a;
 }
 for (int j = 0; j < 1000; j++) { 
 for (int i = 0; i < 1000000; i++) { 
 array1[i] += array2[i]; } 
 } 
 cout << array1[0]<< endl; 

The second program is its array-notation equivalent

 int main(int argc, char ** argv) {
 int array1[1000000];
 int array2[1000000];
 array1[0:1000000] = a;
 array2[0:1000000] = a;
 for (int j = 0; j < 1000; j++) {
 array1[0:1000000] += array2[0:1000000];
 }
 cout << array1[0]<< endl;
 

Without any optimization upon compilation, the first program (without the array-notation) took about 2.9 seconds and the second program took about 1.9 seconds. If i use optimization level -O3, the first program takes 0.6 seconds and the second program still takes 1.2 second. This is double the time the first program uses.

Obviously, the two for-loops of the first program get vectorized, when using the -O3 option. Still, i would have thought, that the array-notation version was at least as fast. Did i miss something there? Maybe there is a bug somewhere when trying to optimize array-notation-code, so it generates less efficient code.If not - what do have to do to get the array-notation as fast as the "normal" code?

I used a recent version of the cilkplus-branch of the gcc (a few days old) and a Core i7. 

5 posts / 0 nouveau(x)
Dernière contribution
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

I looked at the assembler-code (compiled it with flags -O3 and -S) and it seems that the second version of my program (array-notation version) is not being vectorized: When adding up the arrays, it does not use the %xmm-Register but the %edx-register. Oddly enough, when setting the values of the array = a (array1[0:100..] =a, array2 also) it does use the xmm0-register. So there seems to be a problem only with vectorizing array notation nested in a for-loop.

This is the first time i looked at assembler code so maybe someone else could try to replicate the problem and have a look too.

When you exchange the line

 array1[0:1000000] += array2[0:1000000];


with a for-loop that does the exact same thing, you get a vectorized version. It then takes the same time as the version of the program not using array notation and it uses the %xmm-register when looking at the Assembler-Code. So the problem really is with creating SIMD-Code from array-notation inside a for-loop.

It's hard to guess resulting code in a case like this where there is an opportunity for the compiler to short-cut your code and suppress loops.

I have the same problem when using array-notation with inline-if statements. Consider the edit-distance of two strings A,B. You can write that as


char A[100] = { ... }; // i didn't want to write the entire 100 symbols

char B[100] = { ... } ;

int s;

s[0:100] = (A[0:100] == B[0:100]) ? 1: 0;

cout << s[0:100]

Because of the output at the end, the compiler is forced to do the whole work, without shortcuts or anything. Still, when compiling with -O3 and looking at the assembler code, i get something like that:
EDIT: This is the part, where the two symbols are compared. Basically a for-loop that compares byte-for-byte two strings.


.L2:

 	movzbl	112(%rsp,%rax), %ecx

	cmpb	%cl, (%rsp,%rax)

	sete	%cl

	addl	$1, %edx

	movzbl	%cl, %ecx

	movl	%ecx, 224(%rsp,%rax,4)

	addq	$1, %rax

	cmpl	$99, %edx

	jle	.L2

	movl	224(%rsp), %esi

	movl	$_ZSt4cout, %edi

	call	_ZNSolsEi

	movl	228(%rsp), %esi

	movl	$_ZSt4cout, %edi

	call	_ZNSolsEi

	movl	232(%rsp), %esi

	movl	$_ZSt4cout, %edi

	call	_ZNSolsEi

	movl	236(%rsp), %esi

	movl	$_ZSt4cout, %edi

	call	_ZNSolsEi

	movl	240(%rsp), %esi

	movl	$_ZSt4cout, %edi

	call	_ZNSolsEi

	xorl	%eax, %eax

	addq	$632, %rsp

	.cfi_def_cfa_offset 8

	ret

	.cfi_endproc

The code itself is not that interesting, but note that there is not sign of %xmm registers or any commands that you would expect from SIMD-code (like movdqa or something like that). On the other hand, when you use a "normal" for-loop, you can generate SIMD-Code without a problem. Adding the '#pragma simd' leads to valid SIMD-Code as well, but i think it is because the compiler automatically detects that there is no dependence (because it works without the pragma as well).
I don't know whether this is only a problem with the gcc or also with an intel compiler, maybe someone could check that out.

Laisser un commentaire

Veuillez ouvrir une session pour ajouter un commentaire. Pas encore membre ? Rejoignez-nous dès aujourd’hui