3D engine

271 帖子 / 0 全新
最新文章
如需更全面地了解编译器优化,请参阅优化注意事项

I trying asm in line, and have error: Unknown opcode DD in asm instruction, with:

void __declspec(naked) make_rotation()
{
	// Naked functions must provide their own prolog...
	__asm{
				translation:		dd  0
                               ...
		  }
}

 

Citação:

shaynox s. escreveu:

I still don't understand Data structure alignment, i read this on wikipedia:

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of two separate but related issues: data alignment and data structure padding. When a modern computer reads from or writes to a memory address, it will do this in word sized chunks (e.g. 4 byte chunks on a 32-bit system) or larger.

why this system ? [0x0000_0000] point on first byte address memory and [0x0000_0001] point on second byte

mov       eax, [0x0000_0000]     ; begin to store 4 byte after the first byte of RAM to eax

mov       eax, [0x0000_0001]     ; begin to store 4 byte after the second byte of RAM to eax

can we disable this memory management, and access to data address byte after byte ?

Data is aligned laid out by chunks of double word (4-bytes). Data padding or structure padding is used to align structure on for example double word boundaries. If you have structure with 3 members 2 of them are uint and one of them is uchar so you will need 3 uchar bytes of padding. I do not think that you can disable memory management.

ok, and for declaration of data, i don't use C declaration, because icl don't align data, it put them randomly unfortunnaly.

Usually the code is incremented by 1,2 or 4.

mov eax, dword ptr [esi+4] 

In this case value at memory address esi+4 will be loaded into eax. From the CPU and Memory Controller point of view there are only memory cells at granularity of one byte.In order to address memory of uint array base address of first dword must be loaded in register and further that register must be incremented by 4. 

Quote:

shaynox s. wrote:

I trying asm in line, and have error: Unknown opcode DD in asm instruction, with:

void __declspec(naked) make_rotation()
{
	// Naked functions must provide their own prolog...
	__asm{
				translation:		dd  0
                               ...
		  }
}

 

IIRC in inline assembly block you cannot use MASM or other declaration directive. Variables must be declared before entry to _asm block.

http://msdn.microsoft.com/en-us/library/5sds75we.aspx

http://msdn.microsoft.com/en-us/library/4ks26t93.aspx

>>>ok, and for declaration of data, i don't use C declaration, because icl don't align data, it put them randomly unfortunnaly.>>>

​It depends on data type. Usually when working with the array data type compiler will lay it out linearly.

Please download AVX Cloth rendering source and also read included pdf document about the SoA optimization.

https://software.intel.com/en-us/articles/soa-cloth-simulation-with-256-...

Back, mysteriously if i put asm in line, i fall down at 70 fps (250 originaly)

	static float translation[5] = { 0,
									0,		// translation_x
									0,		//translation_y
									0,		// translation_z
									0		// reserved: color
								  };
	float angle = 0;
	float rotation_x = 0;
	static float _xmm0[6] = { 0,			// sin.x	0
							  1.0,			// cos.x	4
							  1.0,			// cos.x	8
							  0,			// sin.x	12
							  1.0,			// 1		16	
							  0			// sin.y	20				
							 };
	float rotation_z = 0;
	static float _xmm1[6] = { 0,			// sin.z	0
							  1.0,			// cos.z 4
							  1.0,			// cos.z	8
							  0,			// sin.z	12	
							  1.0,			// 1		16
							  0,			// sin.y	20					
							 };
	float rotation_y = 0;
	static float _xmm2[6] = { 0, 			// sin.y	0
							  1.0,			// cos.y 4
							  1.0,			// cos.y 8
							  0,			// sin.y	12	
							  1.0,			// 1		16
							  0			// sin.y	20
							 };
	float color_pixel = 0;
	float coordonee[5] = { x,		// x
						   y,		// y
						   z,		// z
						   0,		// reserved: color
						 };
	float end_coord[3] = { 0,		// _x
						   0,		// _y
						   0		// _z
						 };
	float conv_signe = -0.0;
	float rapport = RAPPORT;
	
	// x
	_xmm0[0] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[3] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[5] = sin(DEG2RAD(rotation_object[0]));

	_xmm0[1] = cos(DEG2RAD(rotation_object[0]));
	_xmm0[2] = cos(DEG2RAD(rotation_object[0]));

	// z
	_xmm1[0] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[3] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[5] = sin(DEG2RAD(rotation_object[2]));

	_xmm1[1] = cos(DEG2RAD(rotation_object[2]));
	_xmm1[2] = cos(DEG2RAD(rotation_object[2]));

	// y
	_xmm2[0] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[3] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[5] = sin(DEG2RAD(rotation_object[1]));

	_xmm2[1] = cos(DEG2RAD(rotation_object[1]));
	_xmm2[2] = cos(DEG2RAD(rotation_object[1]));
	
	__asm{
		// == == == == == == =
		// yaw
		// == == == == == == =
	Yaw:        // y
		// On applique la rotation au point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// On calcule x = x.cos(phi.y) * cos(phi.z) - y.cos(phi.y) * sin(phi.z) - z.sin(phi.y)
		//
		// On calcule  A = x.cos(phi.y), B = y.cos(phi.y) et C = z.sin(phi.y)
		movups	xmm0, [_xmm2 + 4]
			movups	xmm1, [coordonee]
			mulps	xmm0, xmm1

			// On calcule D = A * cos(phi.z), E = B * sin(phi.z) et C = C * 1
			movups	xmm1, [_xmm1 + 8]
			mulps	xmm0, xmm1

			// On calcule F = D - E, C = C - 0
			hsubps	xmm0, xmm0

			// On calcule xmm0 = F - C
			hsubps	xmm0, xmm0

			// On modifie x selon selon le rapport entre x et y pour que x soit proportionnelle à y
			movd	xmm1, [rapport]
			divps	xmm0, xmm1

			// On save la new coordonée
			movd	[end_coord], xmm0

			// == == == == == == =
			// / yaw
			// == == == == == == =

			// == == == == == == =
			// pitch
			// == == == == == == =
		Pitch:        // x
		// On applique la rotation au point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// On calcule y = x.(cos(phi.x) * sin(phi.z) - sin(phi.x) * cos(phi.z) * sin(phi.y)) +
		//				 y.(sin(phi.x) * sin(phi.z) * sin(phi.y) + cos(phi.x) * cos(phi.z)) -
		//				 z.(sin(phi.x) * cos(phi.y))
		//
		// On calcule A = cos(phi.x) * sin(phi.z), B = sin(phi.x) * cos(phi.z), E = cos(phi.x) * cos(phi.z) et F = sin(phi.x) * sin(phi.z)
		movddup xmm0, [_xmm0 + 8]
			movups 	xmm1, [_xmm1]
			mulps	xmm0, xmm1

			// on sauve xmm0 dans xmm7 pour le copier dans xmm0 de Roll car l'equation de y ressemblent a l'equation de z mis a part que la valeur sin(phi.y) est
			// multiplié par d'autres equations

			// On calcule C' = A' * sin(phi.y) et G' = E' * sin(phi.y)
			movddup	xmm7, [_xmm2 + 12]
			mulps	xmm7, xmm0

			// On calcule C = B * sin(phi.y) et G = F * sin(phi.y)
			movddup	xmm2, [_xmm2 + 16]
			mulps	xmm0, xmm2

			// Copie le contenu du haut(64..127) d'un paquet de valeurs réel de simple précision (4*32 bits) dans sa partie basse (0..31).
			// En somme on separe les deux partie x et y : xmm0 = A) cos(phi.x) * sin(phi.z)								xmm0 = cos(phi.x) * sin(phi.z)
			//											 			C) sin(phi.x) * cos(phi.z) * sin(phi.y) = > sin(phi.x) * sin(phi.y) * cos(phi.z)
			//														E) cos(phi.x) * cos(phi.z)								xmm1 = cos(phi.x) * cos(phi.z)
			//														G) sin(phi.x) * sin(phi.z) * sin(phi.y)							sin(phi.x) * sin(phi.y) * sin(phi.z)
			movhlps xmm1, xmm0

			// On calcule D = A - C
			hsubps xmm0, xmm0

			// On calcule H = E + G
			haddps xmm1, xmm1

			// On calcule sin(phi.x) * cos(phi.y) et cos(phi.x) * cos(phi.y)
			//
			// On calcule I.roll = cos(phi.x) * cos(phi.y) et I.Pitch = sin(phi.x) * cos(phi.y)
			movlps		xmm3, [_xmm0 + 8]
			movlps		xmm2, [_xmm2 + 4]
			mulps		xmm2, xmm3
			movshdup 	xmm3, xmm2
			// On calcule x.D + y.H - z.I
			//
			// On calcule J = x.D, K = y.H et L = z.I
			movups		xmm5, [coordonee]
			movsldup	xmm4, xmm1        // y.H
			movss		xmm4, xmm0        // x.D
			movlhps 	xmm4, xmm3        // z.I.Pitch
			mulps		xmm4, xmm5

			// On calcule M = J + K
			haddps	xmm4, xmm4

			// On calcule N = M - L
			hsubps	xmm4, xmm4

			// On save la new coordonée
			movd	[end_coord+4], xmm4

			// == == == == == == =
			// / pitch
			// == == == == == == =
			// == == == == == == =
			// roll
			// == == == == == == =
		Roll:        // z
		// On applique la rotation au point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// On calcule z' = x.(cos(phi.x) * cos(phi.z) * sin(phi.y) + sin(phi.x) * sin(phi.z)) +
		//				  y.(sin(phi.x) * cos(phi.z) - cos(phi.x) * sin(phi.z) * sin(phi.y)) +
		//				  z.(cos(phi.x) * cos(phi.y))
		//
		// Copie le contenu du haut(64..127) d'un paquet de valeurs réel de simple précision (4*32 bits) dans sa partie basse (0..31).
		// En somme on separe les deux partie x et y : xmm7 = C') cos(phi.x) * sin(phi.z) * sin(phi.y)				xmm7 =	C') cos(phi.x) * sin(phi.z) * sin(phi.y))
		//											 			B') sin(phi.x) * cos(phi.z)						 =>				B') sin(phi.x) * cos(phi.z)
		//														G') cos(phi.x) * cos(phi.z) * sin(phi.y)				xmm1 =	G') cos(phi.x) * cos(phi.z) * sin(phi.y)
		//														F') sin(phi.x) * sin(phi.z)										F') sin(phi.x) * sin(phi.z
		movhlps xmm1, xmm7

			// On calcule D' = -B' + C'
			movd	xmm6, [conv_signe]
			orps	xmm7, xmm6
			haddps	xmm7, xmm7

			// On calcule H' = G' + F'
			haddps	xmm1, xmm1

			// On calcule x.D' + y.H' + z.I'
			//
			// On calcule J = x.D', K = y.H' et L = z.I'
			movups		xmm3, [coordonee]
			movsldup	xmm4, xmm7        // y.D'
			movss		xmm4, xmm1        // x.H'
			movlhps 	xmm4, xmm2        // z.I'
			mulps		xmm4, xmm3

			// On calcule M' = J' + K'
			haddps	xmm4, xmm4

			// On calcule N' = M' + L'
			haddps	xmm4, xmm4
			movd	[end_coord+8], xmm4
			// == == == == == == =
			// / roll
			// == == == == == == =
	}

 

I have a problem with two function, what do you see like difference between those code:

	// x
	_xmm0[0] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[3] = sin(DEG2RAD(rotation_object[0]));
	_xmm0[5] = sin(DEG2RAD(rotation_object[0]));

	_xmm0[1] = cos(DEG2RAD(rotation_object[0]));
	_xmm0[2] = cos(DEG2RAD(rotation_object[0]));

	// z
	_xmm1[0] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[3] = sin(DEG2RAD(rotation_object[2]));
	_xmm1[5] = sin(DEG2RAD(rotation_object[2]));

	_xmm1[1] = cos(DEG2RAD(rotation_object[2]));
	_xmm1[2] = cos(DEG2RAD(rotation_object[2]));

	// y
	_xmm2[0] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[3] = sin(DEG2RAD(rotation_object[1]));
	_xmm2[5] = sin(DEG2RAD(rotation_object[1]));

	_xmm2[1] = cos(DEG2RAD(rotation_object[1]));
	_xmm2[2] = cos(DEG2RAD(rotation_object[1]));
		fld		dword ptr [rotation_object + 0]	// st0 = x
			// On convertit l'angle degre en radians: pi/180 * (angle en degré)
			fmul	dword ptr[pi_180]

			// On calcule les nouveaux sin et cos de l'angle de l'objet
			fsincos
				// sin
					fst 	dword ptr[_xmm0 + 0]		// st0
					fst 	dword ptr[_xmm0 + 12]
					fstp 	dword ptr[_xmm0 + 20]
				// cos
					fst 	dword ptr [_xmm0 + 4]		// st1
					fstp 	dword ptr [_xmm0 + 8]

		fld		dword ptr[rotation_object + 8]	// st0 = z
			// On convertit l'angle degre en radians: pi/180 * (angle en degré)
			fmul	dword ptr[pi_180]

			// On calcule les nouveaux sin et cos de l'angle de l'objet
			fsincos
				// sin
					fst 	dword ptr[_xmm1 + 0]		// st0
					fst 	dword ptr[_xmm1 + 12]
					fstp 	dword ptr [_xmm1 + 20]
				// cos
					fst 	dword ptr[_xmm1 + 4]		// st1
					fstp 	dword ptr[_xmm1 + 8]

		fld		dword ptr[rotation_object + 4]	// st0 = y
			// On convertit l'angle degre en radians: pi/180 * (angle en degré)
			fmul	dword ptr[pi_180]

			// On calcule les nouveaux sin et cos de l'angle de l'objet
			fsincos
				// sin
					fst 	dword ptr[_xmm2 + 0]		// st0
					fst 	dword ptr [_xmm2 + 12]
					fstp 	dword ptr [_xmm2 + 20]
				// cos
					fst 	dword ptr [_xmm2 + 4]		// st1
					fstp 	dword ptr [_xmm2 + 8]

For me, it's the same, but the one do correctly rotation, and other do correct rotation but inverse axes, i tried to swap the storage to  x, y and z on them but don't work.

Well, it's not fault to my asm code, but it's the icl fault apparently, i explain i have try with only one asm code:

	__asm{
		mov		rax, 2
	    }

and it's fall down to 90 fps :/

>>>Back, mysteriously if i put asm in line, i fall down at 70 fps (250 originaly)>>>

It is very hard to tell exactly what has happend hence I would like to advise downloading and trying Intel VTune profiler. You run various versions of your rotation code under VTune and post screenshots here.

https://software.intel.com/en-us/intel-vtune-amplifier-xe/try-buy

Citação:

shaynox s. escreveu:

Well, it's not fault to my asm code, but it's the icl fault apparently, i explain i have try with only one asm code:

	__asm{
		mov		rax, 2
	    }

and it's fall down to 90 fps :/

I do not understand what do you mean?

static float _xmm0[6] = { 0,            // sin.x    0
010
                          1.0,          // cos.x    4
011
                          1.0,          // cos.x    8
012
                          0,            // sin.x    12
013
                          1.0,          // 1        16 
014
                          0         // sin.y    20             
015
                         };

 

When working with auto-vectorizing compiler try to declare statically allocated float arrays with the size of XMM or YMM registers.

static float xmm0[4] = { 0.0f,0.0f,0.0f,0.0f};

static float ymm0[8] = {0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f,0.0f};

Use also "f"  float keyword  and fill your array with 0.f values in order to force compiler at compile time to calculate float values.

 

I decide to manually vectorize, need too much knowledgement: aligned memory, option compiler, intrinsic function for auto vectorization.

And for that i come back to nasm, sorry :/

For the problem of speed about integration asm inline, if i put only one  __asm    mov     rax, 2 for exemple, the program is being slowly.

I don't know if it's true generally, cause to a flag compiler maybe, try it for compare the speed.

>>>I decide to manually vectorize, need too much knowledgement: aligned memory, option compiler, intrinsic function for auto vectorization>>>

Actually you need Intel compiler for auto-vectorization. For memory alignment you can use _mm_malloc() intrinsic function which is allocating memory  aligned on 64-byte boundaries.

Not always you will be able to vectorize manually.

If i understand, intel compiler vectorize only for loop ?

I have build one exemple of VecSamples.zip, it's from sim2.cpp

void vec_copy(float *dest, float *src, int len)
{
    float ii;

#pragma simd
    for (int i = 0, ii = 0.0f; i < len; i++)
        dest[i] = src[i] * ii++;
}

 

And here the assembly code:

 

.B1.3::                         ; Preds .B1.1 .B1.3
        vxorps    xmm0, xmm0, xmm0                              ;20.22
        vcvtsi2ss xmm0, xmm0, eax                               ;20.22
        vmulss    xmm1, xmm0, DWORD PTR [r9+rdx*4]              ;20.22
        inc       eax                                           ;20.22
        vmovss    DWORD PTR [rcx+rdx*4], xmm1                   ;20.3
        inc       rdx                                           ;19.38
        cmp       rdx, r8                                       ;20.3
        jl        .B1.3         ; Prob 82%                      ;20.3

Always this -ss sufix :/

>>>Always this -ss sufix :/>>>

It seems that ICC did not vectorize the code. Try to use restrict keyword at least. Did you allocate dest and src aligned on 128-bit (4*float) usually failure to vectorize the code can be blamed on the array stride which is not exactly n*4*float.

Did try to follow this artice https://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers

 >>>vmulss    xmm1, xmm0, DWORD PTR [r9+rdx*4] >>>

Did you check with debugger if there is a vector load in xmm1 register?

.

>>>Always this -ss sufix :/>>>

Can you post ICC vectorization report?

Finnaly, icc have just vectorized few portion of code:

;;; 		*(object + _x) *= size_scale;    //x

        vmovss    xmm4, DWORD PTR [r8]                          ;222.5
$LN1366:
        vinsertps xmm5, xmm4, DWORD PTR [16+r8], 16             ;222.5
$LN1367:
        vmovss    xmm4, DWORD PTR [64+r8]                       ;222.5
$LN1368:
        vinsertps xmm1, xmm5, DWORD PTR [32+r8], 32             ;222.5
$LN1369:
        vinsertps xmm5, xmm4, DWORD PTR [80+r8], 80             ;222.5
$LN1370:
        vinsertps xmm3, xmm1, DWORD PTR [48+r8], 48             ;222.5
$LN1371:
        vinsertps xmm1, xmm5, DWORD PTR [96+r8], 96             ;222.5
$LN1372:
        vinsertps xmm4, xmm1, DWORD PTR [112+r8], 112           ;222.5
$LN1373:
        vinsertf128 ymm1, ymm3, xmm4, 1                         ;222.5
$LN1374:
        vmulps    ymm3, ymm1, ymm0                              ;222.5

The report speaks only about optimization loop, it is not intended to apply to my calculated rotation, i have read pdf and it's impossible to vectorize code out of loop, he talk only about optimisation loop .

I want talk about function sin and cos doing by intel compiler, i have try to know how he do it, and i have see this:

call      __libm_sse2_sincos 

Like i know fsincos, i ask to me, what this function do, and like i don't find source code, i dissassembly this code:

And i saw this:

sub 		rsp, 68
	movaps		ss:[rsp+0x40], xmm7
	movaps		ss:[rsp+0x30], xmm6
	movsd		ss:[rsp+0x70], xmm0
	pextrw 		eax, xmm0, 3
	and 		ax, 0x7FFF
	sub 		ax, 0x3030
	cmp 		ax, 0x10C5
	ja libmmd.1800FDC1F
	;{
		unpcklpd	xmm0, xmm0
		movapd 		xmm1, ds:[0x18020EB70]
		mulpd		xmm1, xmm0
		movapd		xmm2, ds:[0x18020EB60]
		cvtsd2si	edx, xmm1
		addpd		xmm1, xmm2
		movapd		xmm3, ds:[0x18020EB50]
		subpd		xmm1, xmm2
		movapd		xmm2, ds:[0x18020EB40]
		mulpd		xmm3, xmm1
		add			rdx, 0x1C7600
		movapd		xmm4, xmm0
		and			rdx, 0x3F
		movapd		xmm5, ds:[0x18020EB30]
		lea			rax, qword  ds:[0x18020D9E0]
		shl			rdx, 6
		add 		rax, rdx
		mulpd		xmm2, xmm1
		subpd 		xmm0, xmm3
		mulpd 		xmm1, ds:[0x18020EB20]
		subpd 		xmm4, xmm3
		movapd 		xmm7, ds:[rax+0x10]
		movapd	 	xmm3, xmm4
		subpd 		xmm4, xmm2
		mulpd		xmm5, xmm0
		subpd 		xmm0, xmm2
		movapd 		xmm6, ds:[0x18020EB10]
		mulpd 		xmm7, xmm4
		subpd		xmm3, xmm4
		mulpd 		xmm5, xmm0
		mulpd 		xmm0, xmm0
		subpd 		xmm3, xmm2
		movapd 		xmm2, ds:[rax]
		subpd 		xmm1, xmm3
		movapd 		xmm3, ds:[rax+0x30]
		addpd 		xmm2, xmm3
		subpd 		xmm7, xmm2
		mulpd 		xmm1, xmm7
		movapd 		xmm7, ds:[rax+0x10]
		mulpd 		xmm2, xmm4
		mulpd 		xmm6, xmm0
		mulpd 		xmm3, xmm4
		mulpd 		xmm2, xmm0
		mulpd 		xmm7, xmm0
		mulpd 		xmm0, xmm0
		addpd 		xmm5, ds:[0x18020EB00]
		mulpd 		xmm4, ds:[rax]
		addpd		xmm6, ds:[0x18020EAF0]
		mulpd		xmm5, xmm0
		movapd 		xmm0, xmm3
		addpd 		xmm3, ds:[rax+0x10]
		addpd		xmm6, xmm5
		movq 		xmm5, xmm6
		unpckhpd	xmm6, xmm6
		unpcklpd	xmm5, xmm5
		mulpd		xmm6, xmm7
		mulpd 		xmm2, xmm5
		movapd 		xmm7, xmm4
		addpd		xmm4, xmm3
		movapd		xmm5, ds:[rax+0x10]
		subpd 		xmm5, xmm3
		subpd		xmm3, xmm4
		addpd 		xmm1, ds:[rax+0x20]
		addpd 		xmm5, xmm0
		addpd 		xmm3, xmm7
		addpd 		xmm1, xmm5
		addpd 		xmm1, xmm3
		addpd 		xmm1, xmm2
		addpd 		xmm1, xmm6
		addpd 		xmm1, xmm4
		movq 		xmm0, xmm1
		unpckhpd 	xmm1, xmm1
	;} jmp libmmd.1800FDE1E
	; ... Prépare les donnée a etre traitée, a mon avis.
	libmmd.1800FDC1F:
	; ... Contient autant d'instruction SIMD que celles du 1st bloc.
	ret

Later i have see another function who look like a little bit same in this url:

https://github.com/mario007/renmas/blob/master/renmas3/asm/sincosps.py

I have transfer this code in my project (SDL asm), but my fps fall down, 20 less, relative of fsincos, is it normal ?

	;=============================================================================================================
	 ; float sin[4], cos[4] sincosps (float angle_radians[4])
	 ; Calcule les fonctions sin et cos des 4 angles contenu dans angle_radians[4].
	 ; Entrée : angle_radians[4]
	 ; Sotie: sin[4] et cos[4]
	 ; Destroyed: ebx - edx - ebp
	 ; DATA:
			_ps_am_inv_sign_mask	dd	0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF
			_ps_am_sign_mask		dd	0x80000000, 0x80000000, 0x80000000, 0x80000000
			_ps_am_pi_o_2			dd	1.57079632679, 1.57079632679, 1.57079632679, 1.57079632679
			_ps_am_2_o_pi			dd	0.63661977236, 0.63661977236, 0.63661977236, 0.63661977236
			_epi32_1				dd	1.0, 1.0, 1.0, 1.0
			_ps_am_1 				dd	1.0, 1.0, 1.0, 1.0
			_epi32_2 				dd	2.0, 2.0, 2.0, 2.0
			_ps_sincos_p3			dd	-0.00468175413, -0.00468175413, -0.00468175413, -0.00468175413
			_ps_sincos_p2 			dd	0.0796926262, 0.0796926262, 0.0796926262, 0.0796926262
			_ps_sincos_p1 			dd	-0.64596409750621,-0.64596409750621,-0.64596409750621,-0.64596409750621
			_ps_sincos_p0 			dd	1.570796326794896, 1.570796326794896, 1.570796326794896, 1.570796326794896
			
	;=============================================================================================================
	sincosps:
			movups 		xmm7, [sincosps_angle_radians]
			movups		xmm1, [_ps_am_inv_sign_mask]
		andps 		xmm0, xmm1
		movups		xmm1, [_ps_am_sign_mask]
		andps 		xmm7, xmm1
		movups		xmm1, [_ps_am_2_o_pi]
		mulps 		xmm0, xmm1
		pxor 		xmm3, xmm3
		movups 		xmm5, [_epi32_1]
		movups 		xmm4, [_ps_am_1]
		cvttps2dq 	xmm2, xmm0
		pand 		xmm5, xmm2
		pcmpeqd		xmm5, xmm3
		movups 		xmm3, [_epi32_1]
		movups 		xmm1, [_epi32_2]
		cvtdq2ps 	xmm6, xmm2
		paddd 		xmm3, xmm2
		pand		xmm2, xmm1
		pand 		xmm3, xmm1
		subps		xmm0, xmm6
		pslld 		xmm2, 30
		minps 		xmm0, xmm4
		subps 		xmm4, xmm0
		pslld 		xmm3, 30
		movaps 		xmm6, xmm4
		xorps 		xmm2, xmm7
		movaps		xmm7, xmm5
		andps 		xmm6, xmm7
		andnps 		xmm7, xmm0
		andps 		xmm0, xmm5
		andnps		xmm5, xmm4
		movups 		xmm4, [_ps_sincos_p3]
		orps 		xmm6, xmm7
		orps 		xmm0, xmm5
		movups 		xmm5, [_ps_sincos_p2]
		movaps 		xmm1, xmm0
		movaps 		xmm7, xmm6
		mulps 		xmm0, xmm0
		mulps 		xmm6, xmm6
		orps 		xmm1, xmm2
		orps 		xmm7, xmm3
		movaps 		xmm2, xmm0
		movaps 		xmm3, xmm6
		mulps 		xmm0, xmm4
		mulps 		xmm6, xmm4
		movups 		xmm4, [_ps_sincos_p1]
		addps 		xmm0, xmm5
		addps 		xmm6, xmm5
		movups 		xmm5, [_ps_sincos_p0]
		mulps 		xmm0, xmm2
		mulps 		xmm6, xmm3
		addps 		xmm0, xmm4
		addps 		xmm6, xmm4
		mulps 		xmm0, xmm2
		mulps 		xmm6, xmm3
		addps 		xmm0, xmm5
		addps 		xmm6, xmm5
		mulps 		xmm0, xmm1
		mulps		xmm6, xmm7
		movups		[sincosps_sin], xmm0 	; sinus(xmm0)
		movups		[sincosps_cos], xmm6 	; cosinus(xmm0)
		
		; to add in put_object
			movups		xmm0, [deg_rotation_x]
			movups		xmm1, [pi_180]
			mulps		xmm0, xmm1
			movups		[sincosps_angle_radians], xmm0
		call	sincosps
		
		movups		xmm1, [sincosps_cos]
		movsldup	xmm0, xmm1
		movsd 		[_xmm0 + 4], xmm0			; save cos(x)
		movhps 		[_xmm2 + 4], xmm0			; save cos(y)
		movups		xmm1, [sincosps_cos + 8]
		movsldup	xmm0, xmm1
		movsd		[_xmm1 + 4], xmm0			; save cos(z)
		
		; save sin(x)
			movups		xmm0, [sincosps_sin]
			movss		[_xmm0 + 0], xmm0
			movss		[_xmm0 + 12], xmm0
			movss		[_xmm0 + 20], xmm0
		
		; save sin(y)
			movups		xmm0, [sincosps_sin + 4]
			movss		[_xmm2 + 0], xmm0
			movss		[_xmm2 + 12], xmm0
			movss		[_xmm2 + 20], xmm0

		; save sin(z)
			movups		xmm0, [sincosps_sin + 8]
			movss		[_xmm1 + 0], xmm0
			movss		[_xmm1 + 12], xmm0
			movss		[_xmm1 + 20], xmm0
	ret
	;=============================================================================================================
	; / sincosps
	;=============================================================================================================

I saw in optimization reference manual, fsincos have 119 latency, and latency's instruction of SIMD are around 5-6.

Latency is like clock cycle ?

>>>Finnaly, icc have just vectorized few portion of code:>>>

Yes that portion of the code was vectorized. You can see also that loop was 6x unrolled. I suppose that auto-vectorizer will try to vectorize major code hotspots which are mainly loops. I did not try to vectorize  code which is not running inside the loop. Regarding that piece of vectorized loop disassembly you can see that compiler performed stride checks in order to calculate its length.

If you have multiple arrays loaded with doubles you can  unroll 8x the loop and issue prefetch for every array per cycle.

Pseudocode:

for(unsigned int i = 0; i < Len; i +=8 )

 {

       prefetch(array1+x) // prefetch distance x should be calculated by trial and error

        array1[i] // do some operation

        array2[i] // do some operation

        array3[i] // do some operation

        array4[i] // do some operation

        array5[i] // do some operation

          prefetch(array2+x) // // prefetch distance x should be calculated by trial and error

        array1[i+1] // do some operation

        array2[i+1] // do some operation on array

        array2[i+1] // do some operation on array

        array3[i+1] // do some operation on  array

       array4[i+1] // do some operation on array

       array5[i+1] // do some operation on array

        ............................... code continues

      array1[i+7] // do some operation on array

      array2[i+7] // do some operation on array

      array3[i+7] //  do some operation on array

      array4[i+7] // do some operation on  array

      array5[i+7] // do some operation on array

}

   

 

 

>>>I have transfer this code in my project (SDL asm), but my fps fall down, 20 less, relative of fsincos, is it normal ?>>>

Can you test speed of execution of your code with the call to  __libm_sse2_sincos function?

Usually libm_sse2_sincos should be faster than x87 fsincos.

>>>I saw in optimization reference manual, fsincos have 119 latency, and latency's instruction of SIMD are around 5-6.

Latency is like clock cycle ?>>>

I have that raw speed(without call and ret overhead)  of libm_sse2_sincos could be less than x87 fsincos. Crude estimation of  libm_sse2_sincos execution speed could be simple as calculation and summation  of the all machine code instructions. Pay attention that some of the instruction can be executed in paralel or out-of-order.

Latency is the time in CPU cycles needed to proces (retire) machine code instruction. For example latency of x87 fsincos is 112 CPU cycles that's mean that 112 cycles are needed to execute that instruction. At the hardware level fsincos is broken down into various  micro-ops which are scheduled to run on FP adder and mul units and accumulate the results into floating point physical registers. Horner Scheme is probably used for the result convergence.

Today I will upload my own implementation of trigo instructions.

Pseudocode:

for(unsigned int i = 0; i < Len; i +=8 )

 {

       prefetch(array1+x) // prefetch distance x should be calculated by trial and error

        array1[i] // do some operation

        array2[i] // do some operation

        array3[i] // do some operation

        array4[i] // do some operation

        array5[i] // do some operation

          prefetch(array2+x) // // prefetch distance x should be calculated by trial and error

        array1[i+1] // do some operation

        array2[i+1] // do some operation on array

        array2[i+1] // do some operation on array

        array3[i+1] // do some operation on  array

       array4[i+1] // do some operation on array

       array5[i+1] // do some operation on array

        ............................... code continues

      array1[i+7] // do some operation on array

      array2[i+7] // do some operation on array

      array3[i+7] //  do some operation on array

      array4[i+7] // do some operation on  array

      array5[i+7] // do some operation on array

}

 

Post #130 is related to post #126.

In post #130 as you can see 8x  unrolling is loading in each array 64-byte stride which can fit L1D cache line.

For difference of speed between fsincos and call      __libm_sse2_sincos, here's the code test:

the generate code made by cos ans sin block, generate 3 * call      __libm_sse2_sincos, that's why i put only 3 * fsincos.

I have try to calling __libm_sse2_sincos through asm inline, but he didn't find this function, and made error at compilation :/

float			rotation_object[4] = { 0 };     // 0: x , 1: y , 2: z

void        put_pixel(float *coord, int color, float* a)
{
	int		offset_pixel;

	int		a_time_sincosps, b_time_sincosps;
	int		a_time_fsincos, b_time_fsincos;
	unsigned int	loop_test;
	int		temp;

	__asm	rdtsc
	__asm	mov		[a_time_sincosps], eax

			trigo.cos[_x] = cos(DEG2RAD(rotation_object[_x]));
			trigo.cos[_y] = cos(DEG2RAD(rotation_object[_y]));
			trigo.cos[_z] = cos(DEG2RAD(rotation_object[_z]));

			trigo.sin[_x] = sin(DEG2RAD(rotation_object[_x]));
			trigo.sin[_y] = sin(DEG2RAD(rotation_object[_y]));
			trigo.sin[_z] = sin(DEG2RAD(rotation_object[_z]));

	__asm	rdtsc
	__asm	mov		[b_time_sincosps], eax
	
	printf("time_sincosps = %i\n", b_time_sincosps - a_time_sincosps);

	__asm	rdtsc
	__asm	mov		[a_time_fsincos], eax

			__asm		vxorpd    xmm1, xmm1, xmm1				
			__asm		vcvtss2sd xmm1, xmm1, [temp]	
			__asm		vmovsd    xmm15, [temp]
			__asm		vmulsd    xmm0, xmm15, xmm1				
			__asm		vzeroupper								
		__asm		fsincos

			__asm		vxorpd    xmm2, xmm2, xmm2				
			__asm		vmovapd   xmm14, xmm0					
			__asm		vcvtss2sd xmm2, xmm2, [temp]
			__asm		vcvtsd2ss xmm1, xmm1, xmm1			
			__asm		vmulsd    xmm0, xmm15, xmm2				
			__asm		vmovss[temp], xmm1
		__asm		fsincos

			__asm		vxorpd    xmm2, xmm2, xmm2		
			__asm		vmovapd   xmm13, xmm0			
			__asm		vcvtss2sd xmm2, xmm2, [temp]
			__asm		vcvtsd2ss xmm1, xmm1, xmm1
			__asm		vmulsd    xmm0, xmm15, xmm2
			__asm		vmovss    [temp], xmm1
		__asm		fsincos

		__asm		vcvtsd2ss xmm1, xmm1, xmm1
		__asm		vcvtsd2ss xmm14, xmm14, xmm14
		__asm		vcvtsd2ss xmm13, xmm13, xmm13
		__asm		vcvtsd2ss xmm0, xmm0, xmm0
		__asm		vmovss    [temp], xmm1
		__asm		vmovss    [temp], xmm14
		__asm		vmovss    [temp], xmm13
		__asm		vmovss    [temp], xmm0
	
	__asm	rdtsc
	__asm	mov		[b_time_fsincos], eax

	printf("time_fsincos = %i", b_time_fsincos - a_time_fsincos);

	while (1);
}

I have add store anc convert function because it's the code made by trigo.cos[_x] = ... and other stores.

can't do only cos(DEG2RAD(rotation_object[_x])); because he remove call      __libm_sse2_sincos.

And here the result:

time_sincosps = 19839
time_fsincos = 1981

And, is it possible to modify the processus of compilation by rewrite asm file ? it will be wonderfull if it's possible, i will be able to (correct) the code or rewrite it like i think.

Finnaly is it to possible to personalize the asm code make by icl ? ex:

Keep the display of number, like hexadecimal, and write it with 0x instead H at the end of number:

return 0xdeadbeef --> (origin) mov    eax, -559038737 --> (wish) mov    eax, 0xdeadbeef  ^^

I will try instrinsics function through this wonderfull web page: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

It's too complex to work with intrinsic function, load, set, only cause he don't allowed casting float [4] to __m128 :/

"error : argument of type "float" is incompatible with parameter of type "__m128"

"error : a value of type "__m128" cannot be assigned to an entity of type "float""

"error : cast to type "__m128" is not allowed"

It's strange that the compiler don't authorized this, one more forbidding add in the way programming's black list of high level compiler :/

Quote:

shaynox s. wrote:

It's too complex to work with intrinsic function, load, set, only cause he don't allowed casting float [4] to __m128 :/

"error : argument of type "float" is incompatible with parameter of type "__m128"

"error : a value of type "__m128" cannot be assigned to an entity of type "float""

"error : cast to type "__m128" is not allowed"

It's strange that the compiler don't authorized this, one more forbidding add in the way programming's black list of high level compiler :/

Do not cast __m128 to float array because __m128 is a struct of unions. You can simply initialize __m128 member fields , but this is not recommended or you can cast float pointer to  __m128 type pointer.

float vertex[4] = {0.f,0.f,0.f,0.f};

__m128 *ptr_m128 = (__m128 *)vertex;

http://stackoverflow.com/questions/11759791/is-it-possible-to-cast-float...

http://stackoverflow.com/questions/5118158/using-sse-to-speed-up-computa...

 

Ok, but like i do most store and load, it will be a little complexe, and i don't think it will be optimize if i do with that:

	static float	_xmm0[4], _xmm1[4], _xmm2[4], _xmm3[4], _xmm4[4], _xmm7[4];

	trigo.coord = coord;
	
	trigo._xmm0[0] = sin(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[1] = cos(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[2] = cos(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[3] = sin(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[4] = 1;
	trigo._xmm0[5] = sin(DEG2RAD(rotation_object[_x]));

	trigo._xmm1[0] = sin(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[1] = cos(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[2] = cos(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[3] = sin(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[4] = 1;
	trigo._xmm1[5] = sin(DEG2RAD(rotation_object[_z]));

	trigo._xmm2[0] = sin(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[1] = cos(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[2] = cos(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[3] = sin(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[4] = 1;
	trigo._xmm2[5] = sin(DEG2RAD(rotation_object[_y]));

		// == == == == == == =
		// yaw
		// == == == == == == =
	//Yaw:; y
		// On applique la rotation au point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// On calcule x = x.cos(phi_y) * cos(phi_z) - y.cos(phi_y) * sin(phi_z) - z.sin(phi_y)
		//
		// On calcule  A = x.cos(phi_y), B = y.cos(phi_y) et C = z.sin(phi_y)
			*_xmm0 = _mm_mul_ps(trigo._xmm2[1], trigo.coord);

		// On calcule D = A * cos(phi_z), E = B * sin(phi_z) et C = C * 1
			* _xmm0 = _mm_mul_ps(*_xmm0, trigo._xmm1[2]);

		// On calcule F = D - E, C = C - 0
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// On calcule xmm0 = F - C
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// On save la new coordonée
			trigo.end_coord[_x] = *_xmm0;

		// == == == == == == =
		// / yaw
		// == == == == == == =

		// == == == == == == =
		// pitch
		// == == == == == == =
	//Pitch:; x
		  // On applique la rotation au point | [esi + 0] = x
		  // | [esi + 4] = y
		  // | [esi + 8] = z
		  // On calcule y = x.(cos(phi_x) * sin(phi_z) - sin(phi_x) * cos(phi_z) * sin(phi_y)) +
		  //				 y.(sin(phi_x) * sin(phi_z) * sin(phi_y) + cos(phi_x) * cos(phi_z)) -
		  //				 z.(sin(phi_x) * cos(phi_y))
		  //
		// On calcule A = cos(phi_x) * sin(phi_z), B = sin(phi_x) * cos(phi_z), E = cos(phi_x) * cos(phi_z) et F = sin(phi_x) * sin(phi_z)
		
		//movddup		xmm0, [_xmm0 + 8]
			_xmm0[2] = trigo._xmm0[2];
			_xmm0[3] = trigo._xmm0[2];

		//movddup		xmm7, [_xmm2 + 12]
			*_xmm0 = _mm_mul_ps(*_xmm0, trigo._xmm1);

		// on sauve xmm0 dans xmm7 pour le copier dans xmm0 de Roll car l'equation de y ressemblent a l'equation de z mis a part que la valeur sin(phi_y) est
		// multiplié par d'autres equations

		// On calcule C' = A' * sin(phi_y) et G' = E' * sin(phi_y)
		
		// movddup		xmm2, [_xmm2 + 16]
			_xmm7[2] = trigo._xmm2[3];
			_xmm7[3] = trigo._xmm2[3];

			*_xmm7 = _mm_mul_ps(*_xmm7, *_xmm0);

		// On calcule C = B * sin(phi_y) et G = F * sin(phi_y)
		
		// movhlps		xmm1, xmm0
			_xmm2[2] = trigo._xmm2[4];
			_xmm2[3] = trigo._xmm2[4];

			*_xmm0 = _mm_mul_ps(*_xmm0, *_xmm2);

		// Copie le contenu du haut(64..127) d'un paquet de valeurs réel de simple précision (4*32 bits) dans sa partie basse (0..31).
		// En somme on separe les deux partie x et y : xmm0 = A) cos(phi_x) * sin(phi_z)								xmm0 = cos(phi_x) * sin(phi_z)
		//											 			C) sin(phi_x) * cos(phi_z) * sin(phi_y) = > sin(phi_x) * sin(phi_y) * cos(phi_z)
		//														E) cos(phi_x) * cos(phi_z)								xmm1 = cos(phi_x) * cos(phi_z)
		//														G) sin(phi_x) * sin(phi_z) * sin(phi_y)							sin(phi_x) * sin(phi_y) * sin(phi_z)
		
		// movhlps 	xmm1, xmm0
			_xmm1[1] = _xmm0[3];
			_xmm1[0] = _xmm0[2];
		// On calcule D = A - C
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// On calcule H = E + G
			*_xmm1 = _mm_hadd_ps(_xmm1, _xmm1);

		// On calcule sin(phi_x) * cos(phi_y) et cos(phi_x) * cos(phi_y)
		//
		// On calcule I.roll = cos(phi_x) * cos(phi_y) et I.Pitch = sin(phi_x) * cos(phi_y)
		// movlps		xmm3, [_xmm0 + 8]
			_xmm3[1] = trigo._xmm0[5];
			_xmm3[0] = trigo._xmm0[4];

		// movlps		xmm2, [_xmm2 + 4]
			_xmm2[1] = trigo._xmm2[2];
			_xmm2[0] = trigo._xmm2[1];

			*_xmm2 = _mm_mul_ps(*_xmm2, *_xmm3);

		// movshdup 	xmm3, xmm2
			_xmm3[0] = _xmm2[1];
			_xmm3[1] = _xmm2[1];
			_xmm3[2] = _xmm2[3];
			_xmm3[3] = _xmm2[3];
		// On calcule x.D + y.H - z.I
		//
		// On calcule J = x.D, K = y.H et L = z.I
		// movsldup		xmm4, xmm1	; y.H
			_xmm4[0] = _xmm1[0];
			_xmm4[1] = _xmm1[0];
			_xmm4[2] = _xmm1[2];
			_xmm4[3] = _xmm1[2];

		// movss		xmm4, xmm0	; x.D
			_xmm4[0] = _xmm0[0];

		// movlhps		xmm4, xmm3; z.I.Pitch
			_xmm4[2] = _xmm3[0];
			_xmm4[3] = _xmm3[1];

			*_xmm4 = _mm_mul_ps(*_xmm4, trigo.coord);

		// On calcule M = J + K
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// On calcule N = M - L
			_xmm4 = _mm_hsub_ps(*_xmm4, *_xmm4)

		// On save la new coordonée
			trigo.end_coord[_y] = *_xmm4;

		// == == == == == == =
		// / pitch
		// == == == == == == =
		// == == == == == == =
		// roll
		// == == == == == == =
	//Roll:; z
		// On applique la rotation au point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// On calcule z' = x.(cos(phi_x) * cos(phi_z) * sin(phi_y) + sin(phi_x) * sin(phi_z)) + 
		//				  y.(sin(phi_x) * cos(phi_z) - cos(phi_x) * sin(phi_z) * sin(phi_y)) +
		//				  z.(cos(phi_x) * cos(phi_y))
		//
		// Copie le contenu du haut(64..127) d'un paquet de valeurs réel de simple précision (4*32 bits) dans sa partie basse (0..31).
			// En somme on separe les deux partie x et y : xmm7 = C') cos(phi_x) * sin(phi_z) * sin(phi_y)				xmm7 =	C') cos(phi_x) * sin(phi_z) * sin(phi_y))
			//											 			B') sin(phi_x) * cos(phi_z)						 =>				B') sin(phi_x) * cos(phi_z)
			//														G') cos(phi_x) * cos(phi_z) * sin(phi_y)				xmm1 =	G') cos(phi_x) * cos(phi_z) * sin(phi_y)
			//														F') sin(phi_x) * sin(phi_z)										F') sin(phi_x) * sin(phi_z)
		// movhlps		xmm1, xmm7	
			_xmm1[0] = _xmm7[2];
			_xmm1[1] = _xmm7[3];

		// On calcule D' = -B' + C'
			or	xmm7[0], 0x80000000;
			*_xmm7 = _mm_hadd_ps(_xmm7, _xmm7);

		// On calcule H' = G' + F'
			*_xmm1 = _mm_hadd_ps(_xmm1, _xmm1);

		// On calcule x.D' + y.H' + z.I'
		//
		// On calcule J = x.D', K = y.H' et L = z.I'
		
		// movsldup		xmm4, xmm7	; y.D'	
			_xmm4[0] = _xmm7[0];
			_xmm4[1] = _xmm7[0];
			_xmm4[2] = _xmm7[2];
			_xmm4[3] = _xmm7[2];

		// movss		xmm4, xmm1	; x.H'
			_xmm4[0] = _xmm1[0];

		// movlhps 	xmm4, xmm2	; z.I'
			_xmm4[2] = _xmm2[0];
			_xmm4[3] = _xmm2[1];

			*_xmm4 = _mm_mul_ps(*_xmm4, trigo.coord);

		// On calcule M' = J' + K'
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// On calcule N' = M' + L'
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// On save la new coordonée
			trigo.end_coord[_z] = *_xmm4;

		// == == == == == == =
		// / roll
		// == == == == == == =

 

 

 

Citação:

shaynox s. escreveu:

For difference of speed between fsincos and call      __libm_sse2_sincos, here's the code test:

the generate code made by cos ans sin block, generate 3 * call      __libm_sse2_sincos, that's why i put only 3 * fsincos.

I have try to calling __libm_sse2_sincos through asm inline, but he didn't find this function, and made error at compilation :/

float			rotation_object[4] = { 0 };     // 0: x , 1: y , 2: z

void        put_pixel(float *coord, int color, float* a)
{
	int		offset_pixel;

	int		a_time_sincosps, b_time_sincosps;
	int		a_time_fsincos, b_time_fsincos;
	unsigned int	loop_test;
	int		temp;

	__asm	rdtsc
	__asm	mov		[a_time_sincosps], eax

			trigo.cos[_x] = cos(DEG2RAD(rotation_object[_x]));
			trigo.cos[_y] = cos(DEG2RAD(rotation_object[_y]));
			trigo.cos[_z] = cos(DEG2RAD(rotation_object[_z]));

			trigo.sin[_x] = sin(DEG2RAD(rotation_object[_x]));
			trigo.sin[_y] = sin(DEG2RAD(rotation_object[_y]));
			trigo.sin[_z] = sin(DEG2RAD(rotation_object[_z]));

	__asm	rdtsc
	__asm	mov		[b_time_sincosps], eax
	
	printf("time_sincosps = %i\n", b_time_sincosps - a_time_sincosps);

	__asm	rdtsc
	__asm	mov		[a_time_fsincos], eax

			__asm		vxorpd    xmm1, xmm1, xmm1				
			__asm		vcvtss2sd xmm1, xmm1, [temp]	
			__asm		vmovsd    xmm15, [temp]
			__asm		vmulsd    xmm0, xmm15, xmm1				
			__asm		vzeroupper								
		__asm		fsincos

			__asm		vxorpd    xmm2, xmm2, xmm2				
			__asm		vmovapd   xmm14, xmm0					
			__asm		vcvtss2sd xmm2, xmm2, [temp]
			__asm		vcvtsd2ss xmm1, xmm1, xmm1			
			__asm		vmulsd    xmm0, xmm15, xmm2				
			__asm		vmovss[temp], xmm1
		__asm		fsincos

			__asm		vxorpd    xmm2, xmm2, xmm2		
			__asm		vmovapd   xmm13, xmm0			
			__asm		vcvtss2sd xmm2, xmm2, [temp]
			__asm		vcvtsd2ss xmm1, xmm1, xmm1
			__asm		vmulsd    xmm0, xmm15, xmm2
			__asm		vmovss    [temp], xmm1
		__asm		fsincos

		__asm		vcvtsd2ss xmm1, xmm1, xmm1
		__asm		vcvtsd2ss xmm14, xmm14, xmm14
		__asm		vcvtsd2ss xmm13, xmm13, xmm13
		__asm		vcvtsd2ss xmm0, xmm0, xmm0
		__asm		vmovss    [temp], xmm1
		__asm		vmovss    [temp], xmm14
		__asm		vmovss    [temp], xmm13
		__asm		vmovss    [temp], xmm0
	
	__asm	rdtsc
	__asm	mov		[b_time_fsincos], eax

	printf("time_fsincos = %i", b_time_fsincos - a_time_fsincos);

	while (1);
}

I have add store anc convert function because it's the code made by trigo.cos[_x] = ... and other stores.

can't do only cos(DEG2RAD(rotation_object[_x])); because he remove call      __libm_sse2_sincos.

And here the result:

time_sincosps = 19839
time_fsincos = 1981

And, is it possible to modify the processus of compilation by rewrite asm file ? it will be wonderfull if it's possible, i will be able to (correct) the code or rewrite it like i think.

Finnaly is it to possible to personalize the asm code make by icl ? ex:

Keep the display of number, like hexadecimal, and write it with 0x instead H at the end of number:

return 0xdeadbeef --> (origin) mov    eax, -559038737 --> (wish) mov    eax, 0xdeadbeef  ^^

The piece of code which  calls library cos and sin function is expected to run slower than code which is using CPU built-in x87 fsincos machine code instruction. I would try to run both of pieces of code inside the double loop and calculate the average of the runs. When using rdtsc you must serialize the execution inside the CPU because of the  nature of out-of-order processing there is a possibility that part of the code will be scheduled to run before the execution of rdtsc instruction.

 

>>>Finnaly is it to possible to personalize the asm code make by icl ? ex:>>>

If I understood your question correctly you can load compiled exe file into IDA Pro disassembler and perform any assembly code changes after that you can recompile the code. I am not sure if this will work correctly.

Citação:

shaynox s. escreveu:

Ok, but like i do most store and load, it will be a little complexe, and i don't think it will be optimize if i do with that:

	static float	_xmm0[4], _xmm1[4], _xmm2[4], _xmm3[4], _xmm4[4], _xmm7[4];

	trigo.coord = coord;
	
	trigo._xmm0[0] = sin(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[1] = cos(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[2] = cos(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[3] = sin(DEG2RAD(rotation_object[_x]));
	trigo._xmm0[4] = 1;
	trigo._xmm0[5] = sin(DEG2RAD(rotation_object[_x]));

	trigo._xmm1[0] = sin(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[1] = cos(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[2] = cos(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[3] = sin(DEG2RAD(rotation_object[_z]));
	trigo._xmm1[4] = 1;
	trigo._xmm1[5] = sin(DEG2RAD(rotation_object[_z]));

	trigo._xmm2[0] = sin(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[1] = cos(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[2] = cos(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[3] = sin(DEG2RAD(rotation_object[_y]));
	trigo._xmm2[4] = 1;
	trigo._xmm2[5] = sin(DEG2RAD(rotation_object[_y]));

		// == == == == == == =
		// yaw
		// == == == == == == =
	//Yaw:; y
		// On applique la rotation au point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// On calcule x = x.cos(phi_y) * cos(phi_z) - y.cos(phi_y) * sin(phi_z) - z.sin(phi_y)
		//
		// On calcule  A = x.cos(phi_y), B = y.cos(phi_y) et C = z.sin(phi_y)
			*_xmm0 = _mm_mul_ps(trigo._xmm2[1], trigo.coord);

		// On calcule D = A * cos(phi_z), E = B * sin(phi_z) et C = C * 1
			* _xmm0 = _mm_mul_ps(*_xmm0, trigo._xmm1[2]);

		// On calcule F = D - E, C = C - 0
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// On calcule xmm0 = F - C
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// On save la new coordonée
			trigo.end_coord[_x] = *_xmm0;

		// == == == == == == =
		// / yaw
		// == == == == == == =

		// == == == == == == =
		// pitch
		// == == == == == == =
	//Pitch:; x
		  // On applique la rotation au point | [esi + 0] = x
		  // | [esi + 4] = y
		  // | [esi + 8] = z
		  // On calcule y = x.(cos(phi_x) * sin(phi_z) - sin(phi_x) * cos(phi_z) * sin(phi_y)) +
		  //				 y.(sin(phi_x) * sin(phi_z) * sin(phi_y) + cos(phi_x) * cos(phi_z)) -
		  //				 z.(sin(phi_x) * cos(phi_y))
		  //
		// On calcule A = cos(phi_x) * sin(phi_z), B = sin(phi_x) * cos(phi_z), E = cos(phi_x) * cos(phi_z) et F = sin(phi_x) * sin(phi_z)
		
		//movddup		xmm0, [_xmm0 + 8]
			_xmm0[2] = trigo._xmm0[2];
			_xmm0[3] = trigo._xmm0[2];

		//movddup		xmm7, [_xmm2 + 12]
			*_xmm0 = _mm_mul_ps(*_xmm0, trigo._xmm1);

		// on sauve xmm0 dans xmm7 pour le copier dans xmm0 de Roll car l'equation de y ressemblent a l'equation de z mis a part que la valeur sin(phi_y) est
		// multiplié par d'autres equations

		// On calcule C' = A' * sin(phi_y) et G' = E' * sin(phi_y)
		
		// movddup		xmm2, [_xmm2 + 16]
			_xmm7[2] = trigo._xmm2[3];
			_xmm7[3] = trigo._xmm2[3];

			*_xmm7 = _mm_mul_ps(*_xmm7, *_xmm0);

		// On calcule C = B * sin(phi_y) et G = F * sin(phi_y)
		
		// movhlps		xmm1, xmm0
			_xmm2[2] = trigo._xmm2[4];
			_xmm2[3] = trigo._xmm2[4];

			*_xmm0 = _mm_mul_ps(*_xmm0, *_xmm2);

		// Copie le contenu du haut(64..127) d'un paquet de valeurs réel de simple précision (4*32 bits) dans sa partie basse (0..31).
		// En somme on separe les deux partie x et y : xmm0 = A) cos(phi_x) * sin(phi_z)								xmm0 = cos(phi_x) * sin(phi_z)
		//											 			C) sin(phi_x) * cos(phi_z) * sin(phi_y) = > sin(phi_x) * sin(phi_y) * cos(phi_z)
		//														E) cos(phi_x) * cos(phi_z)								xmm1 = cos(phi_x) * cos(phi_z)
		//														G) sin(phi_x) * sin(phi_z) * sin(phi_y)							sin(phi_x) * sin(phi_y) * sin(phi_z)
		
		// movhlps 	xmm1, xmm0
			_xmm1[1] = _xmm0[3];
			_xmm1[0] = _xmm0[2];
		// On calcule D = A - C
			*_xmm0 = _mm_hsub_ps(*_xmm0, *_xmm0);

		// On calcule H = E + G
			*_xmm1 = _mm_hadd_ps(_xmm1, _xmm1);

		// On calcule sin(phi_x) * cos(phi_y) et cos(phi_x) * cos(phi_y)
		//
		// On calcule I.roll = cos(phi_x) * cos(phi_y) et I.Pitch = sin(phi_x) * cos(phi_y)
		// movlps		xmm3, [_xmm0 + 8]
			_xmm3[1] = trigo._xmm0[5];
			_xmm3[0] = trigo._xmm0[4];

		// movlps		xmm2, [_xmm2 + 4]
			_xmm2[1] = trigo._xmm2[2];
			_xmm2[0] = trigo._xmm2[1];

			*_xmm2 = _mm_mul_ps(*_xmm2, *_xmm3);

		// movshdup 	xmm3, xmm2
			_xmm3[0] = _xmm2[1];
			_xmm3[1] = _xmm2[1];
			_xmm3[2] = _xmm2[3];
			_xmm3[3] = _xmm2[3];
		// On calcule x.D + y.H - z.I
		//
		// On calcule J = x.D, K = y.H et L = z.I
		// movsldup		xmm4, xmm1	; y.H
			_xmm4[0] = _xmm1[0];
			_xmm4[1] = _xmm1[0];
			_xmm4[2] = _xmm1[2];
			_xmm4[3] = _xmm1[2];

		// movss		xmm4, xmm0	; x.D
			_xmm4[0] = _xmm0[0];

		// movlhps		xmm4, xmm3; z.I.Pitch
			_xmm4[2] = _xmm3[0];
			_xmm4[3] = _xmm3[1];

			*_xmm4 = _mm_mul_ps(*_xmm4, trigo.coord);

		// On calcule M = J + K
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// On calcule N = M - L
			_xmm4 = _mm_hsub_ps(*_xmm4, *_xmm4)

		// On save la new coordonée
			trigo.end_coord[_y] = *_xmm4;

		// == == == == == == =
		// / pitch
		// == == == == == == =
		// == == == == == == =
		// roll
		// == == == == == == =
	//Roll:; z
		// On applique la rotation au point | [esi + 0] = x
		// | [esi + 4] = y
		// | [esi + 8] = z
		// On calcule z' = x.(cos(phi_x) * cos(phi_z) * sin(phi_y) + sin(phi_x) * sin(phi_z)) + 
		//				  y.(sin(phi_x) * cos(phi_z) - cos(phi_x) * sin(phi_z) * sin(phi_y)) +
		//				  z.(cos(phi_x) * cos(phi_y))
		//
		// Copie le contenu du haut(64..127) d'un paquet de valeurs réel de simple précision (4*32 bits) dans sa partie basse (0..31).
			// En somme on separe les deux partie x et y : xmm7 = C') cos(phi_x) * sin(phi_z) * sin(phi_y)				xmm7 =	C') cos(phi_x) * sin(phi_z) * sin(phi_y))
			//											 			B') sin(phi_x) * cos(phi_z)						 =>				B') sin(phi_x) * cos(phi_z)
			//														G') cos(phi_x) * cos(phi_z) * sin(phi_y)				xmm1 =	G') cos(phi_x) * cos(phi_z) * sin(phi_y)
			//														F') sin(phi_x) * sin(phi_z)										F') sin(phi_x) * sin(phi_z)
		// movhlps		xmm1, xmm7	
			_xmm1[0] = _xmm7[2];
			_xmm1[1] = _xmm7[3];

		// On calcule D' = -B' + C'
			or	xmm7[0], 0x80000000;
			*_xmm7 = _mm_hadd_ps(_xmm7, _xmm7);

		// On calcule H' = G' + F'
			*_xmm1 = _mm_hadd_ps(_xmm1, _xmm1);

		// On calcule x.D' + y.H' + z.I'
		//
		// On calcule J = x.D', K = y.H' et L = z.I'
		
		// movsldup		xmm4, xmm7	; y.D'	
			_xmm4[0] = _xmm7[0];
			_xmm4[1] = _xmm7[0];
			_xmm4[2] = _xmm7[2];
			_xmm4[3] = _xmm7[2];

		// movss		xmm4, xmm1	; x.H'
			_xmm4[0] = _xmm1[0];

		// movlhps 	xmm4, xmm2	; z.I'
			_xmm4[2] = _xmm2[0];
			_xmm4[3] = _xmm2[1];

			*_xmm4 = _mm_mul_ps(*_xmm4, trigo.coord);

		// On calcule M' = J' + K'
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// On calcule N' = M' + L'
			_xmm4 = _mm_hadd_ps(*_xmm4, *_xmm4)

		// On save la new coordonée
			trigo.end_coord[_z] = *_xmm4;

		// == == == == == == =
		// / roll
		// == == == == == == =

 

 

 

trigo._xmm0[] array member should have size divisable by 4 in order to be loaded into xmm register. I do not think if it is a good way to use compiler intrinsic  *_xmm0 = _mm_mul_ps(trigo._xmm2[1], trigo.coord);

trigo._xmm0 have 6 float member
       ._xmm1 
       ._xmm2

So i will not read out of array, i have control of those array, don't worry. If of course icl will not fragment array's member.

 

It will be difficult with IDA :/

For test difference speed, i can't test fsincos through asminline, cause, like i wrote, only one asm in line fall down my fps of -170  (now) fps :D

The only way to see the truth is contact intel engineer :/

>>>So i will not read out of array, i have control of those array, don't worry. If of course icl will not fragment array's member.>>>

ICL will not fragment array data type it is by design of the compiler to allocate contigous chunk of memory for the array.

>>>It will be difficult with IDA :/>>>

Yes I know that, but for simple code patching it can be used.

>>>For test difference speed, i can't test fsincos through asminline, cause, like i wrote, only one asm in line fall down my fps of -170  (now) fps :D>>>

Does it mean that fsincos caused 170 fps drop?

No of course, i mean if i put only one mov, it fall down:

Ex:

              __asm      mov       eax, 0xdeadbeef

Two reason for this behaviour, one is icl don't like asm inline, second i don't know how to configure compiler's option correctly .

Intel need really must learn, in this world there isn't only hacker, but developers who want programming without obstacle.

I talk about new technologie of protections program and data: Intel SGX && intel MPX. I'm scared about futur, who we'll don't allow programming without enter through a lot of protocol, like 10_000 ^^

And of course the most powerfull technologie "anti-hack", pagination :D (hate this one particularly)

Do you know how to access to first pixel of thé screen, easier way like vesa. I would like to continue my project of 3D engine without OS in IA-32e mode. i don't know why VBE core is discontinue :/

I have learn we're king of the world if we can access to pixel of screen and second parameter is how many time need to access to them.

Do intel show calendar of new Extensions ISA ? i tried to search it

>>>Do you know how to access to first pixel of thé screen, easier way like vesa. I would like to continue my project of 3D engine without OS in IA-32e mode. i don't know why VBE core is discontinue :/>>>

I do not know how to do it.

I only suppose that you will need write your own display driver and manage memory-mapped region by yourself.

ok, i post my update: i have change SDL by WinAPI: (terraindata.obj is unusable for now, i don't change it for now)

Downloadapplication/zip winapi_test_4.zip

Thanks for the upload.

I will run it later at my home pc.

页面

发表评论

登录添加评论。还不是成员?立即加入