Java polynomial approx. faster than C code polynomial approx.

Java polynomial approx. faster than C code polynomial approx.

Hi everybody!
This question resembles slightly my otherthread posted here http://software.intel.com/en-us/forums/showthread.php?t=105474
While porting my library of elementary and special functions from Java to C I implemented polynomial approximation as it was advised me by few posters.Now I use poly approximation in my librarywhere it is applicable.I was interested in performance measurement between the same implementation written in managed code and in native code.To my big surprise java code always executed faster than native code.
After studying asm code and knowing than Intel c++ compiler uses security cookie checking and fills the buffer with 128 int 3 (0xcc) instructionsright after function's prolog.I came to conclusion that this is compiler induced overhead which is responsible for slower execution of C code.
Here are the tests taken from my thread http://software.intel.com/en-us/forums/showthread.php?t=105474
Can anybody help me to understand why native code can be so slow when compared to Java code.

result for native code 1 million iterations.

start value of fastsin(): 39492698 // Native code
end value of fastsin() : 39492760
delta of fastsin() is : 62 millisec
sine is: 0.841470444509448080000000

java -server

C:\\Program Files\\Java\\jdk1.7.0\\bin>java -server SineFunc
start value : 1339596068015
end value : 1339596068045
running time of fastsin() is :30 milisec

java -client

C:\\Program Files\\Java\\jdk1.7.0\\bin>java -client SineFunc
start value : 1339596081083
end value : 1339596081130
running time of fastsin() is :47 milisec

Here is the fastsin() prologue

0000055 push ebp
000018b ec mov ebp, esp
0000381 ec 80 00 00
00 sub esp, 128; 00000080H
0000957 push edi
0000a8d 7d 80 lea edi, DWORD PTR [ebp-128]
0000db9 20 00 00 00 mov ecx, 32; 00000020H
00012b8 cc cc cc cc mov eax, -858993460; ccccccccH
00017f3 ab rep stosd <-- Can be this culprit for slower code execution

And here is the code.Java implementation is identical to this code.

double fastsin(double x){
double sum = 0;
double half_pi,zero_arg;
half_pi = Pi/2;
zero_arg = Zero;

if(x > half_pi){ // simple input checking range 0 return (x-x)/(x-x) ;
}else if (x < zero_arg){
return (x-x)/(x-x);
}else{

double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;
coef1 = -0.16666666666666666666666666666667;// 1/3!
coef2 = 0.00833333333333333333333333333333;// 1/5!
coef3 = -1.984126984126984126984126984127e-4;// 1/7!
coef4 = 2.7557319223985890652557319223986e-6;// 1/9!
coef5 = -2.5052108385441718775052108385442e-8;// 1/11!
coef6 = 1.6059043836821614599392377170155e-10;// 1/13!
coef7 = -7.6471637318198164759011319857881e-13;// 1/15!
coef8 = 2.8114572543455207631989455830103e-15 ;// 1/17!
coef9 = -8.2206352466243297169559812368723e-18;// 1/19!
coef10 = 1.9572941063391261230847574373505e-20;// 1/21!
coef11 = -3.8681701706306840377169119315228e-23;// 1/23!
rad = x;//
sqr = x*x; //x^2

sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));

}
return sum;
}

publicaciones de 12 / 0 nuevos
Último envío
Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.

Hi Iliya,

Quoting iliyapolak...To my big surprise java code always executed faster than native code. After studying asm code and
knowing than Intel c++ compiler uses
security cookie checking...

[SergeyK] Please see my comment 3 at the end of the post.

...and fills the buffer with 128 int 3 (0xcc)
instructionsright after function's prolog.I came to conclusion that this is compiler induced overhead which
is responsible for slower execution of C code.

[SergeyK] Could you provide a complete list of compiler options? I noticed that Intel C++ compilercodes
are always 2xslower when ALL optimizations are disabled.My point of viewis based on
my test cases verified with Intel, Microsoft, MinGW and Borland C++ compilers.
It is interesting that your C implementation is also 2x slower.

...
result for native code 1 million iterations.

...delta of fastsin() is : 62 millisec...

java -server

...running time of fastsin() is : 30 milisec...

java -client

...running time of fastsin() is : 47 milisec...

Here is the fastsin() prologue

0000055 push ebp
000018b ec mov ebp, esp
0000381 ec 80 00 00
00 sub esp, 128; 00000080H
0000957 push edi
0000a8d 7d 80 lea edi, DWORD PTR [ebp-128]
0000db9 20 00 00 00 mov ecx, 32; 00000020H
00012b8 cc cc cc cc mov eax, -858993460; ccccccccH
00017f3 ab rep stosd <--
Can be this culprit for slower code execution

And here is the code.Java implementation is identical to this code.

double fastsin(double x){
double sum = 0;
double
half_pi, zero_arg;
half_pi = Pi/2;
zero_arg = Zero;

if(x > half_pi){ // simple input checking range 0 return (x-x)/(x-x) ;
}else if (x < zero_arg){
return (x-x)/(x-x);
}else{

double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;
coef1 = -0.16666666666666666666666666666667;// 1/3!
coef2 = 0.00833333333333333333333333333333;// 1/5!
coef3 = -1.984126984126984126984126984127e-4;// 1/7!
coef4 = 2.7557319223985890652557319223986e-6;// 1/9!
coef5 = -2.5052108385441718775052108385442e-8;// 1/11!
coef6 = 1.6059043836821614599392377170155e-10;// 1/13!
coef7 = -7.6471637318198164759011319857881e-13;// 1/15!
coef8 = 2.8114572543455207631989455830103e-15 ;// 1/17!
coef9 = -8.2206352466243297169559812368723e-18;// 1/19!
coef10 = 1.9572941063391261230847574373505e-20;// 1/21!
coef11 = -3.8681701706306840377169119315228e-23;// 1/23!
rad = x;//
sqr = x*x; //x^2

sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));

}
return sum;
}

Here are a couple of tips:

1.Did you get 62 millisecforC codes in Debug or Release configuration?

2.Every time when you call the 'FastSin' function a memory for15 variables of type 'double' will be allocated
on the stack ( I'bolded' and 'underlined'declaration of all these variables ).

3.As soon as a memory is allocated for these 15 variables it has to be initialized with some default value,
like 0xcccccccc,( it willnever beinitialized with 0 ) andit looks like instruction 'rep stosd' does it.

4.Every time when you call the 'FastSin' functionyou initialize 'coef..' variables with the same constant
values, like '1/3!'. Why wouldn't you declare these 13constants as global?

5.And one more thing, 15 (variables)x 8 (sizeof(double)) is equal to 120. You mentioned that the assembler
codes are initializing some buffer with 128 '0xcc'.

Here isa screenshot that demonstratesa default initialization for '__int32' and '__int64' variables:

Quoting iliyapolak 0000a8d 7d 80 lea edi, DWORD PTR [ebp-128]
0000db9 20 00 00 00 mov ecx, 32; 00000020H
00012b8 cc cc cc cc mov eax, -858993460; ccccccccH
00017f3 ab rep stosd <-- Can be this culprit for slower code execution

I think No.

Best regards,
Sergey

Hi Sergey!
Thanks for your answer :)

>>1.Did you get 62 millisecforC codes in Debug or Release configuration?

62 millisec was measured for Debug configuration.
In release mode rep stosd instruction is gone and code is inlined inside the main()'s for-loop

>>3.As soon as a memory is allocated for these 15 variables it has to be initialized with some default value,
like 0xcccccccc

It could be also filled with x86 int3 (0xcc) instruction to force debugger break-in when the code executes out of return address.I read about this behaviour in Chris Eagle book"The Ida Pro Book"

>>4.values, like '1/3!'. Why wouldn't you declare these 13constants as global?

I preffer do not give to this values a global scope.

>>5.You mentioned that the assembler
codes are initializing some buffer with 128
0xcc

Look at this code

00 sub esp, 128; 00000080H<-- here 128 - bytes buffer
0000957 push edi
0000a8d 7d 80 lea edi, DWORD PTR [ebp-128]
0000db9 20 00 00 00 mov ecx, 32; 00000020H
00012b8 cc cc cc cc mov eax, -858993460<-- here
I think these are int3 instructions

This was also recognized by Intel amplifier as source of main hot-spot with 14 instructions executed per iteration and whole this block can add significant overhead of few nanosec.

P.S
Sergeygo to my thread here http://software.intel.com/en-us/forums/showthread.php?t=105474
I uploaded another book on accuracy and stabillity of numerical methods.:)

post #147

SergeyK] Could you provide a complete list of compiler options? I noticed that Intel C++ compilercodes
are always 2xslower when ALL optimizations are disabled.My point of viewis based on
my test cases verified with Intel, Microsoft, MinGW and Borland C++ compilers.
It is interesting that your C implementation is also 2x slower

Compiler options for release mode

Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Here is the optimization introduced by Microsoft compiler in release code.As you can see whole fastsin() code was inlined inside the main() there is also unrolling of fastsin() argument,but what is very strange that delta for 1e6 iteration measurement was 0.And for 1e7 iterations the result was it31 millisec i.e 3.1 nanosec per eration.Too slow to be true.

The result for 10 million iterations
running time of fastsin() release code is: 31 milisec
fastsin() is: 0.909297421962549370000000

asm code

; 23   :  int main(void){
  00000	55		 push	 ebp

  00001	8b ec		 mov	 ebp, esp

  00003	83 e4 c0	 and	 esp, -64		; ffffffc0H

  00006	83 ec 30	 sub	 esp, 48			; 00000030H
; 24   : 	double e1 = 0;

; 25   : 	 double sine;

; 26   : 	 sine = 0;

; 27   : 	double gam;

; 28   : 	gam = 0;

; 29   : 	double fastgam;

; 30   : 	fastgam = 0;

; 31   : 	 double arg1;

; 32   : 	 arg1 = 1.0f;
  00009	f2 0f 10 05 00

	00 00 00	 movsd	 xmm0, QWORD PTR _one

  00011	53		 push	 ebx

  00012	55		 push	 ebp

  00013	56		 push	 esi

  00014	57		 push	 edi
; 33   : 	unsigned int start2,end2;

; 34   : 	 start2 = GetTickCount();
  00015	8b 3d 00 00 00

	00		 mov	 edi, DWORD PTR __imp__GetTickCount@0

  0001b	f2 0f 11 44 24

	30		 movsd	 QWORD PTR _arg1$[esp+64], xmm0

  00021	ff d7		 call	 edi

  00023	f2 0f 10 15 00

	00 00 00	 movsd	 xmm2, QWORD PTR __real@3e7ad7f2a0000000

  0002b	f2 0f 10 25 00

	00 00 00	 movsd	 xmm4, QWORD PTR __real@3b4761b41316381a

  00033	f2 0f 10 2d 00

	00 00 00	 movsd	 xmm5, QWORD PTR __real@3bd71b8ef6dcf572

  0003b	f2 0f 10 35 00

	00 00 00	 movsd	 xmm6, QWORD PTR __real@3c62f49b46814157

  00043	f2 0f 10 5c 24

	30		 movsd	 xmm3, QWORD PTR _arg1$[esp+64]

  00049	8b f0		 mov	 esi, eax

  0004b	b8 40 42 0f 00	 mov	 eax, 1000000		; 000f4240H

$LL9@main:
; 35   : 	 for(int i2 = 0;i2<10000000;i2++){
  00050	48		 dec	 eax
; 36   : 		 arg1 += 0.0000001f;
  00051	f2 0f 58 da	 addsd	 xmm3, xmm2

  00055	f2 0f 58 da	 addsd	 xmm3, xmm2

  00059	f2 0f 58 da	 addsd	 xmm3, xmm2

  0005d	f2 0f 58 da	 addsd	 xmm3, xmm2

  00061	f2 0f 58 da	 addsd	 xmm3, xmm2

  00065	f2 0f 58 da	 addsd	 xmm3, xmm2

  00069	f2 0f 58 da	 addsd	 xmm3, xmm2

  0006d	f2 0f 58 da	 addsd	 xmm3, xmm2

  00071	f2 0f 58 da	 addsd	 xmm3, xmm2

  00075	f2 0f 58 da	 addsd	 xmm3, xmm2
; 37   : 		 sine = fastsin(arg1);
  00079	66 0f 28 cb	 movapd	 xmm1, xmm3

  0007d	f2 0f 59 cb	 mulsd	 xmm1, xmm3

  00081	66 0f 28 f9	 movapd	 xmm7, xmm1

  00085	f2 0f 59 fc	 mulsd	 xmm7, xmm4

  00089	66 0f 28 c5	 movapd	 xmm0, xmm5

  0008d	f2 0f 5c c7	 subsd	 xmm0, xmm7

  00091	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  00095	f2 0f 5c c6	 subsd	 xmm0, xmm6

  00099	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  0009d	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ce952c77030ad4a

  000a5	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000a9	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3d6ae7f3e733b81f

  000b1	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000b5	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3de6124613a86d09

  000bd	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000c1	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3e5ae64567f544e4

  000c9	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000cd	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ec71de3a556c734

  000d5	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000d9	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3f2a01a01a01a01a

  000e1	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000e5	f2 0f 58 05 00

	00 00 00	 addsd	 xmm0, QWORD PTR __real@3f81111111111111

  000ed	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  000f1	f2 0f 5c 05 00

	00 00 00	 subsd	 xmm0, QWORD PTR __real@3fc5555555555555

  000f9	f2 0f 59 cb	 mulsd	 xmm1, xmm3

  000fd	f2 0f 59 c1	 mulsd	 xmm0, xmm1

  00101	f2 0f 58 c3	 addsd	 xmm0, xmm3

  00105	f2 0f 11 44 24

	30		 movsd	 QWORD PTR _sine$[esp+64], xmm0

  0010b	0f 85 3f ff ff

	ff		 jne	 $LL9main  

Btw Sergey Do You preffer to continue our discussion in the Intel AVX and CPU instructions forum, because I posted there asm listings and measurements result moreover there are also very interesting responses from bronxzv here http://software.intel.com/en-us/forums/showthread.php?t=105474

Quoting iliyapolak...Do You preffer to continue our discussion in the Intel AVX and CPU instructions forum, because I posted there asm listings and measurements result moreover there are also very interesting responses from bronxzv here http://software.intel.com/en-us/forums/showthread.php?t=105474

I will startreading allposts I missed ( from Page #9 to the Last ) soon.

I think there is nothing wrongwith havingboth threads. However,the threadin the'Intel AVX and CPU' forum
has grown significantly. You could also make a note that all the rest discussions will be done, for example,in
the 'Software Tuning, Performance Optimization' forum.

So, it is up to you to decide what thread has to be 'alive' for discussions.

Best regards,
Sergey

Hi everybody!

I decided to continue this discussion in this thread http://software.intel.com/en-us/forums/showpost.php?p=187714

Have you make the test with JRockit JVM?

...and change the test with different values for -XX:MinYoung and others tuning variables...

...and change the test with different values for -XX:MinYoung and others tuning variables

The tests were made with Sun(Oracle) JVM with two settingsclient and server.Code was compiled by incremential EclipseJDT compiler and ran by Java RuntimeEnvironment.

java version "1.7.0"
Java SE Runtime Environment (build 1.7.0-b147)
Java HotSpot Client VM (build 21.0-b17, mixed mode)

Please bear in mind that the results posted in this thread are outdated because the C code was compiled in Debug mode hence highly optimizedbyJIT compiler code could easily outperform much slowernative code.
Please look here post #158 http://software.intel.com/en-us/forums/showthread.php?t=105474

And here are the results of aggresively optimized native code:

Tested today fastsin() 1e6 iterations and the result was 15 millisec i.e ~33.39 cycles per iterationfor my CPU.

results

start val of fastsin() 13214314
end val of fastsin() 13214329
running time of fastsin() release code is: 15 millisec
fastsin() is: 0.891207360591512180000000

No!
The compilerwas incrementialEclipse JDT.The tests were made with Sun JVM.

Inicie sesión para dejar un comentario.