Optimization of sine function's taylor expansion

343 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

>>>This is a C-like, that is very obsolete, way of declaring a function. Also, why don't you want to have a class that wraps and consolidates all common things, variables, declarations, etc?>>>

Yes I know this and I'm only at early stage of library design.Later I will create static library wrapped in classes with static function members.
Now I only tested a few possible vectorization of input argument.

>>>Try to use 'typedef's for declaration of structures instead of 'struct':>>>
Yes I will do this.My intention is to wrap primitive types and data types in my own "typedefs".
Btw Is it recommended to declare and initialize structure in header file?
I tried to do this and even after using include guards I got compile errors.

>>>Lots of unnecesseary things in that piece of code:>>>
That piece of code simple fills and returns structure with NaN values,but I agree with you that this adds unnecesary clutter.I could
also fix the values with for exmaple HALF_PI and did recursive call , but better option is simply return NULL.

>> Tip # 5 <<

Think about applications and use cases of your API. Will a developer X be happy with your function(s) / class(es) or not?

>> Tip # 6 <<

Have a test-case(s) for your API from the very beginning.

>> Tip # 7 <<

Evaluate performance of your API from the very beginning. It will save time later and you won't need to do a re-design if some function doesn't work fast.

I will proceed with the design of library according to your advise.

Today I tested large composed structure with four 4D inner structures put simply 4x4D vectors matrix.I was able to load them into XMM registers and perform various vector-wise and scalar-wise operation.I could try to calculate 16 scalar sine values and return pointer to such a matrix.

I think that I will switch to intrinsics .

>>>>Try to use 'typedef's for declaration of structures instead of 'struct':
>>
>>Yes I will do this. My intention is to wrap primitive types and data types in my own "typedefs".
>>Btw Is it recommended to declare and initialize structure in header file?

- Declaration could be done in both. That is, in a header file or in a cpp-file
- Initialization has to be done in a cpp-file

Note: If you do initialization in a header file you can get a compilation error like 'A variable is already declared' or something like that ( every C++ compiler will display a different error ).

>>I tried to do this and even after using include guards I got compile errors.

Could you post a text of the compiler error? What about a really small reproducer?

Hi Iliya, Here are a couple of more tips:

- Use of macros help in reducing the overall size of sources ( Note: macros can't be debugged! )
- Try to use a Hungarian Notation when declaring variables or members of a class
- Consider C++ templates ( one implementation will cover many different types )
- If possible, consider portability ( Portability is always a challenge! )

I could provide some little examples and just let me know what you need.

Best regards,
Sergey

>>Regarding declaration of those sine coefficients and their structures I will "typedef" them and move them to header file. Is this
>>good solution to this problem?

Here is an example:


template < class T > class TSomeClass

{

public:

	TSomeClass( void )

	{

		InitData();

	};
	virtual ~TSomeClass( void )

	{

	};
private:

	inline void InitData( void )

	{

		m_tNF2 = ( T )1.0 / ( T )2.0;

		m_tNF3 = ( T )1.0 / ( T )6.0;

		m_tNF4 = ( T )1.0 / ( T )24.0;

		m_tNF5 = ( T )1.0 / ( T )120.0;

		...

	};
private:

	T m_tNF2;

	T m_tNF3;

	T m_tNF4;

	T m_tNF5;

	...

};

This is a test:

#include

void main( void )
{
printf("Pasted from VS 2005");
}

>>>- Declaration could be done in both. That is, in a header file or in a cpp-file
- Initialization has to be done in a cpp-file

Note: If you do initialization in a header file you can get a compilation error like 'A variable is already declared' or something like that ( every C++ compiler will display a different error ).>>>

I could declare and initialize my own typedefs for various trigo and special functions pre-calculated coefficients(primitive types) and put them in .h file,but I won't
be able to load them into XMM registers inside asm block.This is my main problem.Lets look at this from different perspective if I'm forced to use struct types I can declare them inside #include guarded "region" but I still need to initialize them in .cpp file probably as a global constants so I
will use less lines of code spent of every function.Or simply keep initializing those structures inside function.
As I wrote you earlier I was not able initialize even inside #ifdef block.

>>>- Use of macros help in reducing the overall size of sources ( Note: macros can't be debugged! )>>>
I do not like macro definition.Usually I put in header file various constants and function declaration , regarding length of source code
it does not matter for me:)

>>>Try to use a Hungarian Notation when declaring variables or members of a class>>>
I hate it but I will use it in order to prevent name conflicts.

>>>- If possible, consider portability ( Portability is always a challenge! )>>>
What do you mean by writing this?
Are you talking about the hiding differences between various OS or between various compilers?
I think that portability question is still too complicated to me. I have no more than 10 months of programming experience:)

Here is an example:

This is good design I will use it as a template.
My design will be based on static functions inside my own namespace.For such a library I still do not want to do full object orientation.
I still love structured programming paradigm design:)

>>...I do not like macro definition.Usually I put in header file various constants and function declaration...

I was talking about macros like this:
...
#define HrtExchange( Value1, Value2, Unused ) \
{ \
_asm MOV eax, [##Value1] \
_asm MOV edx, [##Value2] \
_asm MOV [##Value1], edx \
_asm MOV [##Value2], eax \
}
...

>>...Are you talking about the hiding differences between various OS or between various compilers?..

The 2nd part of your question '...between various compilers...'. For example, Intel and MSC C++ compilers must compile the sources and results have to be the same ( or almost the same ).

I was talking about macros like this:
...
#define HrtExchange( Value1, Value2, Unused ) \
{ \
_asm MOV eax, [##Value1] \
_asm MOV edx, [##Value2] \
_asm MOV [##Value1], edx \
_asm MOV [##Value2], eax \
}
Theoritically I could move simple scalar double argument trigo function computation to header file and define them as a macros,but what about the 4D
double vectors and complex multi-vector structures?I studied a design of xnamath library which was fully implemented inside header file.
What do You think about the such a design?
Such a macros are good for removing burden of coeffs calculation/initialization ,but my intention is to vectorize calculation and not to speed up it
Regarding initialization coeffs structure inside header file I abandoned it and simply will only declare them in .h file.Inside trigo function I will initialize/instatiate those structures by such a design it could be possible to reduce such a function to more than 100 lines of code.
Now I'm testing large 4x4D structure with 16 sine scalar argument for now I was able to do simple vectors computation.

>>>The 2nd part of your question '...between various compilers...'. For example, Intel and MSC C++ compilers must compile the sources and results have to be the same ( or almost the same ).>>>
How can you design such a library?
Do you perform an extensive testing of your library on various compilers an by observing the results can you develop some form of an abstraction which hides the differences between various compilers.
For example when your library function uses long double for sin calculation and this function must be compiled by Microsoft compiler will it by design switch to using double primitive types.

@Sergey
How can I insert formatted code in message like your template class example?

>>...it could be possible to reduce such a function to more than 100 lines of code...

You need to decide what is more important for you. That is, some number of extra lines ( could be hundreds and more ) in source codes, or performance.

>>...How can I insert formatted code in message like your template class example?..

Have you read that comment:
...
To enable syntax highlighting, surround the language with brackets, where language is one of the following languages: bash, csharp, cpp, css, fortran, jscript, java, perl, php, plain, python, r, ruby, sql, xml, html, javascript, s, splus. For example, *php*foo*/php* for PHP code.

NOTE: I replaced square brackets with a character *

Here is example for *cpp* ... */cpp*:


#include *stdio.h*
void main( void )

{

     printf( "Hello, formatted source codes!n" );

}

NOTE: Problems with arrow-left and arrow-right characters after #include directive and extra lines are still not fixed. Well, Intel Software Developers... Is it going to take another a couple of years?

>>...How can you design such a library?

Iliya, you need to create a prototype of your library with some numbers of stubs, use cases, test cases, etc from the very beginning.

Actually, this is a very big subject to discuss and it is all about a Right Software Engineering. If some operating system, or a platform, or a CPU, or a C/C++ compiler is not considered from the beginning it is very hard to add and provide support for it later and it involves re-design, re-implementation, re-testing etc. An example of a poorly designed library is MFC. An example of a really good design is OWL ( Object Window Library ) which implemented by Borland Corporation and many modern software developers don't know about it.

>>>You need to decide what is more important for you. That is, some number of extra lines ( could be hundreds and more ) in source codes, or performance.>>>

I decided try to not optimize for performance.Structures will be declared in header file and initialized in cpp file , for initialization array-like syntax will be used because of smal number af struct members.
Later I will try to optimize for performance by using macros.

Hi Sergey!
I completely redesigned vectorized fastsin() function.Reduced by more than 70% size of code (140 lines of code removed)and improved inline SSE Horner scheme evaluation.Inside _asm block I removed even power multiplication of xmm register and total count of instruction executed per one
polynomial term is 3(one mov,one mul ,one add).As you can see from the code below array-like initialization of structures was used.
Here is an improved version:

inline struct SinVector *fastsinVec4D(struct Test1 *test1ptr1){

if(test1ptr1 == NULL){
return NULL;
}else if(test1ptr1->c1 >= HALF_PI_FLT || test1ptr1->c2 >= HALF_PI_FLT || test1ptr1->c3 >= HALF_PI_FLT || test1ptr1->c4 >=HALF_PI_FLT)
{
return NULL;
}else if(test1ptr1->c1 <= NEG_HALF_PI_FLT || test1ptr1->c2 <= NEG_HALF_PI_FLT || test1ptr1->c3 <= NEG_HALF_PI_FLT || test1ptr1->c4 <= NEG_HALF_PI_FLT)
{
return NULL;
}else{

SinVector sinvec1 = {-0.1666666,-0.1666666,-0.1666666,-0.1666666},*sinvec1ptr;
sinvec1ptr = &sinvec1;
SinVector sinvec2 = {0.0083333,0.0083333,0.0083333,0.0083333},*sinvec2ptr;
sinvec2ptr = &sinvec2;
SinVector sinvec3 = {-1.9841269e-4,-1.9841269e-4,-1.9841269e-4,-1.9841269e-4},*sinvec3ptr;
sinvec3ptr = &sinvec3;
SinVector sinvec4 = {2.7557319e-6,2.7557319e-6,2.7557319e-6,2.7557319e-6},*sinvec4ptr;
sinvec4ptr = &sinvec4;
SinVector sinvec5 = {-2.5052108e-8,-2.5052108e-8,-2.5052108e-8,-2.5052108e-8},*sinvec5ptr;
sinvec5ptr = &sinvec5;
SinVector sinvec6 = { 1.6059043e-10, 1.6059043e-10, 1.6059043e-10, 1.6059043e-10},*sinvec6ptr;
sinvec6ptr = &sinvec6;
SinVector sinvec7 = {-7.6471637e-13,-7.6471637e-13,-7.6471637e-13,-7.6471637e-13},*sinvec7ptr;
sinvec7ptr = &sinvec7;
SinVector sinvec8 = {2.8114572e-15,2.8114572e-15,2.8114572e-15,2.8114572e-15},*sinvec8ptr;
sinvec8ptr = &sinvec8;
SinVector sinvec9 = {-8.2206352e-18,-8.2206352e-18,-8.2206352e-18,-8.2206352e-18},*sinvec9ptr;
sinvec9ptr = &sinvec9;
SinVector sinvec10 = {1.9572941e-20,1.9572941e-20,1.9572941e-20,1.9572941e-20},*sinvec10ptr;
sinvec10ptr = &sinvec10;
SinVector sinvec11 = {-3.8681701e-23,-3.8681701e-23,-3.8681701e-23,-3.8681701e-23},*sinvec11ptr;
sinvec11ptr = &sinvec11;

SinVector result = {0.0f,0.0f,0.0f,0.0f},*resultptr;
resultptr = &result;

_asm{

xorps xmm0,xmm0
xorps xmm1,xmm1
xorps xmm6,xmm6
xorps xmm7,xmm7
xorps xmm5,xmm5
movups xmm0,test1 //arg x,y,z,w
movups xmm7,xmm0 // copy of arg xmm7 accumulator
mulps xmm0,test1 //x^2
mulps xmm0,test1 //x^3
movups xmm1,sinvec1
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec2
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec3
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec4
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec5
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec6
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec7
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec8
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec9
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec10
mulps xmm0,xmm1
addps xmm7,xmm0
movups xmm1,sinvec11
mulps xmm0,xmm1
addps xmm7,xmm0
movups result,xmm7

}

return resultptr;
}
}

>>>An example of a poorly designed library is MFC.>>>
I'm not surprised reading this sentence.

>>>liya, you need to create a prototype of your library with some numbers of stubs, use cases, test cases, etc from the very beginning.>>>

I was interested mainly in innerworkings of such a library.
For example does your library function or maybe macros perform some tests to check exact CPU version by wrapping CPUID instruction.
Another check could be run against various OS's and compilers.

>>For example does your library function or maybe macros perform some tests to check exact CPU version by wrapping CPUID instruction.

No at the moment. A small R&D test was done, however. Unfortunately, there are so many CPUs with different instruction sets that it's very hard to support all of them. Identification of a CPU is not a problem, and once again, support of many CPUs is a problem.

>>Another check could be run against various OS's and compilers.

Yes and I'll post some codes ( examples ) later.

>>>Yes and I'll post some codes ( examples ) later.>>>
Thanks it would be very interesting to gain some insight into innerworking of such a library.

Now I'm planning to implement my own library of "primitive" intrinsics.
My intention is to wrap in c++ static methods, various SSEn instruction which will perform register loading ,vector and scalar calculation and some data conversion for example converting from "some_type" array[n] where 0

>>... What is your opinion about this?...

Please post a prototype of some function that does what you've described for a code review. Everything is possible but it's hard to tell anything now.

>>>Please post a prototype of some function that does what you've described for a code review. Everything is possible but it's hard to tell anything now.>>>
You are right how I could have forgotten:)

_VecLibDouble is typedef of double primitive type.
Here is the example which has been already tested:

_VecLibDouble *Vec_Add2D_d(_VecLibDouble a[],_VecLibDouble b[]){

_VecLibDouble sum[] = {0.0,0.0};

if(a == NULL || b == NULL){
return NULL;
else if(sizeof(a)/sizeof(a[0] )> 2 || sozeof(b)/sizeof(b[0])>2)
return NULL;
]else{

_asm
{
xorpd xmm0,xmm0
xorpd xmm1,xmm1
xor eax,eax
xor edx,edx
mov eax,[a]
mov edx,[b]
movupd xmm0,[eax]
movupd xmm1,[edx]
addpd xmm1,xmm0
movupd sum,xmm1

}

return sum;
}

}

Hi Iliya, Here is my feedback:

_VecLibDouble * Vec_Add2D_d( _VecLibDouble a[], _VecLibDouble b[] )
{
if( a == NULL || b == NULL )
{
return ( _VecLibDouble * )NULL;
else
if( sizeof(a)/sizeof(a[0] )> 2 || sizeof(b)/sizeof(b[0])>2 )
return ( _VecLibDouble * )NULL;
]
else
{
_VecLibDouble sum[] = { 0.0L, 0.0L };

_asm
{
xorpd xmm0,xmm0
xorpd xmm1,xmm1
xor eax,eax
xor edx,edx
mov eax,[a]
mov edx,[b]
movupd xmm0,[eax]
movupd xmm1,[edx]
addpd xmm1,xmm0
movupd sum,xmm1
}
return ( _VecLibDouble * )sum;
}
}

I always follow very strict rules:

- Don't initialize anything before all input parameters are verified
- I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already )
- When it comes to 'double's use 'L' suffix
- What about a use case on how the function will be used?

always follow very strict rules:

- Don't initialize anything before all input parameters are verified
- I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already )
- When it comes to 'double's use 'L' suffix
- What about a use case on how the function will be used?

Thank you for your programming tips I very appreciate it.

>>>I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already )>>>
Is it necessesery to cast sum pointer to _VecLibDouble type?I think that this is needed to reassure proper return type pointer.

>>>- When it comes to 'double's use 'L' suffix>>>
Are you referring here to long integer type.This rule probably enforces compiler to allocate 8 bytes for double type storage.

>>>- What about a use case on how the function will be used?>>>
Simplest use case could be passing two 2-component [x,y] 1D vectors and add them by using SIMD XMM registers.The result is accumulated
in xmm1 and passed to sum array.I think that inside _asm block i won't use eax register for passing an address of function argument.
I have created also typedef struct _Vec128_f which contains 4 floats aligned on 16-byte boundary.
Tommorow I will post more test-cases.

>>>>I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already )
>>
>>Is it necessesery to cast sum pointer to _VecLibDouble type?I think that this is needed to reassure proper return type pointer.

No, As I've told this is simply my personal style.

A C/C++ compiler must do it for you by default. But, when I do a code review I prefer to see a type of a return parameter explicitly casted to some type.

Hi Iliya, Here is a note about junk-codes. Even on a small project they are accumulating pretty quickly. After a couple of weeks a team could have hundreds of junk-codes. After a couple of months it could be already as many as thousands. An example is below:

_VecLibDouble * Vec_Add2D_d( _VecLibDouble a[], _VecLibDouble b[] )
{
if( a == NULL || b == NULL )
{ // Do you need a bracket here? That's depends on a style and keep it if you like it. Some companies could require the bracket (!).
return ( _VecLibDouble * )NULL;
} //...
else
if( sizeof(a)/sizeof(a[0] )> 2 || sizeof(b)/sizeof(b[0])>2 )
{ // Do you need a bracket here? That's depends on a style and keep it if you like it.
return ( _VecLibDouble * )NULL;
} //...
else // You don't need 'else' here. Note: In that case it could be your style of programming and that's OK.
{
_VecLibDouble sum[] = { 0.0L, 0.0L };
...
return ( _VecLibDouble * )sum;
}
}

and a modified function could look like:

_VecLibDouble * Vec_Add2D_d( _VecLibDouble a[], _VecLibDouble b[] )
{
if( a == NULL || b == NULL )
return ( _VecLibDouble * )NULL;
else
if( sizeof(a)/sizeof(a[0] )> 2 || sizeof(b)/sizeof(b[0])>2 )
return ( _VecLibDouble * )NULL;

_VecLibDouble sum[] = { 0.0L, 0.0L };
...
return ( _VecLibDouble * )sum;
}

>>>Hi Iliya, Here is a note about junk-codes. Even on a small project they are accumulating pretty quickly. After a couple of weeks a team could have hundreds of junk-codes. After a couple of months it could be already as many as thousands. An example is below:>>>

Yes I know that pretty well,but as You have deduced that using if-else with block statement brackets is my style.It will be changed in release build as a attempt to optimize my library.

I would like to ask you how could be implemented XMM register - loading intrinsic? I mean should I design such a functions with void return type or
maybe better option is to load xmm register with one of my "typedefs" and return pointer to structure.
I studied Intel load intrinsics and they are returning _m128 type when beign passed a float or double type array as an argument.

>>...should I design such a functions with void return type or maybe better option is to load xmm register with one of my "typedefs" and
>>return pointer to structure...

I would consider a solution with as Less As Possible Overhead in terms of performance. So, you could try both versions, evaluate performance and then select the fastest. It always makes sence to create as many as possible versions of some new software subsystem, or a function, in order to select the best solution.

>>>I would consider a solution with as Less As Possible Overhead in terms of performance. So, you could try both versions, evaluate performance and then select the fastest. It always makes sence to create as many as possible versions of some new software subsystem, or a function, in order to select the best solution.>>>
I'm designing my library exactly as you advised me.Two primitive vector type were created one is based on structures and second is based on arrays also more than dozen various load, convert and store functions are planned to be written and tested soon.I will constantly post various small test-cases.

>>...I will constantly post various small test-cases...

Hi Iliya, Thank you.

>>>Hi Iliya, Thank you.>>>
I will start posting test-cases in friday.I need to prepare myself for java programming related job interview.
I would like to ask you what kind of questions/tests I need to be prepared for?
What could I be asked by the reviewer?.

I 'am asking this here,because I cannot send you a private message(system is rejecting my attempts).

Iliya, please take a look at www.monster.com. It should have a section related to interviews, etc. Good Luck!

>>>Good Luck!>>>
Thank you very much.Regarding monster.com i found a few good sites with very helpful inteview preparation advises.

Hi Sergey!
I attended today a programming job interview , I was asked a few questions about the Spring and Hibernate frameworks sadly these frameworks are unknown to me.I talked to them about my projects ,but interviewers demand the knowledge in web programming,using spring framework and UI programming.I must wait their decision untill wednsdey next week.

Hi Sergey
I'm posting a few test-cases of my "intrinsinc" library more will follow soon.

_VecLibDouble (lib typedef of double) arbitrary size array addition.

_VecLibDouble *Vector_Add_d(_VecLibDouble a[],_VecLibDouble b[],int len){

if(a == NULL || b == NULL)
return (_VecLibDouble*)NULL;
if((sizeof(a)/sizeof(a[0]) - sizeof(b)/sizeof(b[0])) > 0)
return (_VecLibDouble*)NULL;

_asm{
xor eax,eax
xor ebx,ebx
xor ecx,ecx

xorpd xmm0,xmm0
xorpd xmm1,xmm1
mov ecx,len
mov eax,[a]
mov ebx,[b]
Back:
movupd xmm0,[eax]
movupd xmm1,[ebx]
addpd xmm0,xmm1
movupd [eax],xmm0
add eax,16
dec ecx
jnz Back

}
return a;
}

main() arguments and call

_VecLibDouble testar2[10],*testarptr;
_VecLibDouble testar3[10];
for(int i = 0;i < 10;i++){
testar2[i] = (double)i;
testar3[i] = 100.0;
}

testarptr = Vector_Add_d(testar2,testar3,10);

for(int i = 0;i < 10;i++) printf("array addition %.18f \n",testarptr[i]);

Here is the result:

array addition 100.000000000000000000
array addition 101.000000000000000000
array addition 102.000000000000000000
array addition 103.000000000000000000
array addition 104.000000000000000000
array addition 105.000000000000000000
array addition 106.000000000000000000
array addition 107.000000000000000000
array addition 108.000000000000000000
array addition 109.000000000000000000

Thanks for the update and I'll take a look some time later ( there will be a couple of very busy weeks for me ).

Best regards,
Sergey

>>>Thanks for the update and I'll take a look some time later >>>
Ok Thank you for your help and very insigthful tips.

Hi Sergey!
I'm posting an update related to my VecLib project.While testing slightly optimized version of sine function where the sine convergence
is achieved with the help of SSE inline assembly I ran into some problem.I eliminated one instruction which performed explicit multiplication of the argument by x^2 so the total count of instruction per one term was three,but the accurracy was greatly reduced up to 2-3 decimal places.Double precision and single precision primitives were used so the truncation can not be blamed for the inaccurate result.
I suspect that somehow combined multiplication of an argument by pre-calculated coefficient coupled with exponentiation of the argument all of it performed in the same register xmm0 which served as an accumulator caused the loss in accuracy.
I rewrote the inline asm block and removed the load of xmm0 register by adding another instruction which multiplies the argument by x^2 and that problem dissapeared.
Please look at Vec_Sin_f() function inline assembly code block and Vec_Cos_d() inline assembly block.
Thanks in advance.

Attachments: 

AttachmentSize
Download vec-lib.txt8.56 KB

Pages

Login to leave a comment.