rcpps is likely to be useful only where 12 bits of precision are adequate.
By the way, that method is used in Digital Signal Processing to generate tables of values for sin, cos and tan ( all based on Linear Interpolation ) for embedded systems with very constrained Read Only Memory ( ROM ) resources.
Hi Iliya,
>>...Is your summary project still relevant?
I just completed a quick review, and the most important piece of information that I wanted to use, the post number, has been removed. I really don't understand why the developers of the new IDZ website removed it. How are we going to refer to a specific post? Any ideas?
Best regards,
Sergey
>>...I just completed a quick review and the most important piece of information that I wanted to use is removed and this is "A Post Number". I really don't understand why developers of new IDZ website removed it. How are we going to refer to some post? Any ideas?
Yes, I know that. The developers removed 64 posts from this thread. I never anticipated such a loss of our accumulated knowledge.
Can we write a petition to the webmaster?
>>...Yes I know that.Developers removed 64 posts from this thread.I have never anticipated such a loss of our accumulated knowledge...
I didn't count how many posts are lost, but the damage to some source code examples is significant because some of them were re-formatted into an almost unreadable state.
>>...Can we write some petition to the webmaster?
I think yes, but is it going to work? I don't think so. He/She won't try to do anything to recover these lost posts. On my account I lost almost 500 posts and access to my private folders with screenshots, docs and zip archives. They don't want to change the font color from light grey to black in the new editor. I can't read clearly and need to move my head as close as possible to my display. Is it difficult to change the color? No, yet it has not been done in 4 weeks!
>>... didn't count have many posts are lost but a damage to some source code examples is significant because some of them are re-formatted to almost unreadable state.
I cannot understand the reason behind the forum redesign. Why did they need to do this? Everything was working OK. It was a very strange decision, not easily understood by us forum users.
>>>...I think yes, but is it going to work? I don't think so. He/She won't try to do anything to recover these lost posts. On my account I lost almost 500 posts and access to my private folders with screenshots, docs and zip archives. They don't want to change the font color from light grey to black in the new editor. I can't read clearly and need to move my head as close as possible to my display. Is it difficult to change the color? No, yet it has not been done in 4 weeks!
I feel sorry for you, Sergey, for the loss of your posts, code and docs. Do you have any backup of the code examples which were posted and have been lost?
>>...Do you have any backup of your code examples which were posted and have been lost?
Yes, I have, but I will need some time to identify, find and extract these test-cases. I have hundreds and hundreds of different tests in my own collection and it is always a bit of a challenge if something needs to be recovered. Unfortunately, I don't have time to re-post them.
@Sergey, a slightly off-topic question for you. I am now working on a multithreaded version of my SpecialFunctions library. My methods will mostly be array-filling functions, but I think that simply using multithreaded Java programming to fill an array is kind of boring. I would like to ask you what types of problems (presumably from math and/or physics) could be used to sharpen my multithreaded programming skills :)
>>...what type of problems(presumably from math and/or physics) could be used to sharpen my multithreading programming skills...
To begin with, I would consider a very simple case like the addition of two large matrices. Depending on the number of CPUs, the addition could be done in parallel. Another example is sorting large data sets, and the Merge sort algorithm is a perfect candidate for R&D.
>>>...At the beginning I would consider a very simple case like addition of two large matrices>>>
Thank you for your answer. I am already testing double-loop arrays filled with sums of various sine functions (fastsin being used) with varying frequency. My CPU is a Core i3 installed in a laptop, and from the tests I can see that with a "large pool" of more than 4 threads I can gain slightly improved performance. I think that the main reason behind the mediocre performance under heavy floating-point load is the Hyper-Threading technology.
>>...Cannot understand the reason behind the forum redesign.Why they needed to do this?
Here is my understanding of the situation:
- 1. It was hard to maintain the Old System
- 2. Intel wanted to make the upgrade before the IDF 2012 event in San Francisco ( in order to announce the new IDZ website )
- 3. As soon as new versions of Intel software are released, like Intel C++ compiler v13 or the IPP library, all posts must be placed in the New System
Best regards,
Sergey
>>...I think that main reason behind the mediocre performance under heavily floating-point load is the hyper-threading technology.
By the way, you're not the first developer who has experienced a problem with HTT when the FPU is used. Could you provide some real performance numbers?
Best regards,
Sergey
PS: Also, you could consider some simple algorithms like Matrix Transpose, finding the Min or Max value in a data set, scalar additions, etc. So, just use your imagination: almost all iterative algorithms ( with a 'FOR' statement ) could be implemented in parallel.
...>>>By the way, you're not the first developer who experienced a problem with HTT when a FPU is used. Could you provide some real performance numbers?....>>>
Yes, I expected such behaviour from an HT processor. Simply providing discrete logic such as general-purpose registers with their own APIC, while sharing the FPU between the physical and logical cores, will not accelerate even easily "threadable" floating-point workloads.
Soon I will post some results from my testing. But bear in mind that everything is written in Java, and the Java System.nanoTime() timing method is binary-translated to the RDTSC instruction, with all its consequences.
>>...He/She won't try to do anything to recover these lost posts.
If you or Sergey are missing a specific post or attachment, please raise the topic (e.g. here in the forum or by a PM to me). I cannot promise anything, but personally I was missing an article and a blog and both could be recovered.
>>...Soon I will post some results from my testing.But bear in mind that everything is written in java , and Java System.nano() timing
>>method is binary translated to RDTSC instruction with all its consequences.
Thank you in advance, Iliya!
>>...If you or Sergey are missing specific post or attachments, please raise the topic (e.g. here in the forum or by a pm to me)...
Thomas,
Iliya and I noticed that lots ( hundreds! ) of posts were deleted. As I said, in my case it is about 500. Also, I no longer have access to the 'Files' section of the Old System, where I kept hundreds of jpg files ( screenshots ), docs and zip archives. Unfortunately, my archive ( a 10-month job on the ISN! ) was damaged significantly!
rcpps was used only when I started to write various function expansions in asm. I simply implemented these formulas almost "as is", i.e. the coefficients were not pre-calculated, and knowing that divps is very slow I used rcpps to calculate the coefficients' reciprocals. I know that rcpps's precision is lacking, so I started to pre-calculate the Taylor series coefficients in Mathematica 8 and implemented the Horner scheme to speed up my library functions. For example, I was able to achieve a 4x improvement in my gamma Stirling function compared to the pure iterative method. Here is the link ( post #28 ): http://software.intel.com/en-us/forums/showthread.php?t=106032
>>don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients
As I stated earlier in my post, rcpps was used to calculate the coefficients of various Taylor expansions on the fly. It is clear that rcpps is of no use with pre-calculated coefficients.
So I decided to completely rewrite my assembly-based implementations.
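To illustrate the pre-calculated coefficients and the Horner scheme described above, here is a minimal sketch in C rather than asm. The function name `sin_taylor_horner` and the choice of a sine expansion through the x^11 term are my assumptions for the example; this is not the actual library code.

```c
/* sin(x) ~ x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11!  (x in radians).
 * The reciprocals of the factorials are compile-time constants, so no
 * division (and no rcpps) is needed at run time. */
double sin_taylor_horner(double x)
{
    static const double c3  = -1.0 / 6.0;
    static const double c5  =  1.0 / 120.0;
    static const double c7  = -1.0 / 5040.0;
    static const double c9  =  1.0 / 362880.0;
    static const double c11 = -1.0 / 39916800.0;

    double x2 = x * x;

    /* Horner form: one multiply and one add per coefficient. */
    return x * (1.0 + x2 * (c3 + x2 * (c5 + x2 * (c7 + x2 * (c9 + x2 * c11)))));
}
```

There is no range reduction here, so the argument is expected to be small, e.g. `sin_taylor_horner( 0.5235987755982989 )` for 30 degrees.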
Related to the issue of measuring the overhead of an empty 'for' statement.
>>...
>>It should not be more than few cycles per iteration when unoptimized by compiler...
>>...
In my tests the 'RDTSC' instruction is mapped to a CRT function 'CrtClock' using the Hardware Run-Time Abstraction Layer.
...
// Sub-Test 6.2 - Overhead of Empty For Statement
{
///*
    CrtPrintf( RTU("Sub-Test 6.2 - [ Empty For Statement ]\n") );
    RTclock_t ctClock1 = 0;
    RTclock_t ctClock2 = 0;
    ctClock1 = ( RTclock_t )CrtClock();
    for( RTint t = 0; t < 1000000; t++ )
    {
        ;
    }
    ctClock2 = ( RTclock_t )CrtClock();
    CrtPrintf( RTU("Sub-Test 6.2 - 1,000,000 iterations - %4ld clock cycles\n"),
               ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 1000000 ) );
//*/
}
// Sub-Test 6.3 - Overhead of Empty For Statement
{
///*
    CrtPrintf( RTU("Sub-Test 6.3 - [ Empty For Statement ]\n") );
    RTclock_t ctClock1 = 0;
    RTclock_t ctClock2 = 0;
    ctClock1 = ( RTclock_t )CrtClock();
    for( RTint t = 0; t < 10000000; t++ )
    {
        ;
    }
    ctClock2 = ( RTclock_t )CrtClock();
    CrtPrintf( RTU("Sub-Test 6.3 - 10,000,000 iterations - %4ld clock cycles\n"),
               ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 10000000 ) );
//*/
}
...
Output is as follows:
...
Sub-Test 6.2 - [ Empty For Statement ]
Sub-Test 6.2 - 1,000,000 iterations - 5 clock cycles
Sub-Test 6.3 - [ Empty For Statement ]
Sub-Test 6.3 - 10,000,000 iterations - 5 clock cycles
...
I used different numbers of iterations, up to 100,000,000. Results are consistent and always 5 clock cycles.
Hi Sergey!
What is the Hardware Run-Time Abstraction Layer? Is it somehow related to the Windows HAL?
What are these types, for example RTint? Is it a macro for the int type?
When you perform your tests you are using your wrapper library, which wraps the Win API (if I understood properly). Could such a layered approach add significant overhead to the timing of various functions?
Thanks for testing!
The results are exactly as I predicted: a few cycles of overhead. Bear in mind that in a real-world scenario the CPU logic will execute the for-loop statement in parallel with the statements inside the loop.
An interesting scenario will arise when you implement a for-loop based on a floating-point counter with floating-point calculations inside the loop. Here I suppose the CPU logic will interleave the execution of the fp instructions between Port0 and Port1.
What do you mean by saying this? Are you referring to division algorithms implemented in hardware?
The Intel compilers have default options -no-prec-div -no-prec-sqrt where, in some contexts, rather than using an IEEE accurate hardware instruction (or in some cases library function), there is explicit iteration in the generated code, starting with rcpps or rsqrtps. The Harpertown CPU made a big improvement in the efficiency of the IEEE accurate hardware instructions, and the Ivy Bridge core I7-3 is even better (aside from splitting the AVX-256 into sequenced 128-bit instructions). Yet we still have these iterative sequences.
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started.
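The kind of explicit iteration described above is usually a Newton-Raphson step, y' = y * ( 2 - x * y ), which roughly doubles the number of correct bits of a reciprocal estimate. Below is a portable scalar sketch of that idea only; it is not the sequence the Intel compilers actually emit. The helper `approx_rcp_12bit` is a hypothetical stand-in for rcpps that keeps only 12 mantissa bits, and `rcp_refined` is likewise an invented name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rcpps: 1/x truncated to 12 mantissa bits. */
static float approx_rcp_12bit(float x)
{
    float y = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFFF800u;            /* clear the low 11 of the 23 mantissa bits */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step: ~12 correct bits in, ~24 correct bits out. */
float rcp_refined(float x)
{
    float y = approx_rcp_12bit(x);
    return y * (2.0f - x * y);
}
```

For single precision one step is normally enough; a double-precision result would need a further step, which is part of why the trade-off against the hardware divide depends on the CPU.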
Hi Sergey!
What is Hardware Run-Time Abstraction Layer?
[SergeyK] Please take a look at a thread:
http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr
Post #3
Is it somehow related to the Windows HAL?
[SergeyK] No. It is an internal feature of the ScaLib project.
I will follow up on the rest of your questions later.
Best regards,
Sergey
Sergey
Could you explain how the Run-Time HAL is implemented in software? I think that a possible implementation would be to wrap inline assembly in custom library functions.
What are these types, for example RTint? Is it a macro for the int type?
[SergeyK] No. RTint is declared as a typedef. Here is a simple fix for you based on macros:
...
#define RTint int
#define RTclock_t clock_t
#define CrtPrintf printf
#define RTU( text ) text
#define SysGetTickCount ::GetTickCount
...
When you perform your tests you are using your wrapper library, which wraps the Win API (if I understood properly). Could such a layered approach add significant overhead to the timing of various functions?
[SergeyK]
Yes, and it is applicable to any function, not just to some Win32 API function. In almost 99.99% of my Test-Cases I use the Win32 API function 'GetTickCount' to measure time intervals. It completely satisfies the project needs.
Hi Iliya,
There is a technique in C/C++ that I would call 'macro substitution'. Actually, it is as old as the C language, and here is a small example:
#include <stdio.h>

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
    printf( "FunctionA called\n" );
    return 0;
}

int FunctionB( int iParamB )
{
    printf( "FunctionB called\n" );
    return 0;
}

#define _USE_FUNCTION_A
//#define _USE_FUNCTION_B

#ifdef _USE_FUNCTION_A
    #define Function FunctionA
#endif
#ifdef _USE_FUNCTION_B    // I always use explicit #ifdefs and I never use #ifdef-#else-#endif
    #define Function FunctionB
#endif

int main( void )
{
    Function( 1 );        // expands to FunctionA( 1 ) at compile time
    return 0;
}
In essence, HRT AL in the ScaLib library is based on that. Remember that this is a compile-time technique. If you need
a similar substitution at run-time, it can also be done.
Best regards,
Sergey
Thanks for explaining.
You know, when I first saw the word "HAL" I thought that it could be some kind of wrapper library which accesses the hardware, with software running on top of it that gets access to the underlying hardware through the library's wrapper functions. And all this access would be performed by a custom driver which reads the Device Object and Driver Object data structures, with the hardware accesses implemented in inline assembly.
Here is a Test-Case for a run-time substitution:
#include <stdio.h>

int ( *g_pFunction[2] )( int iParam );    // table of function pointers

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
    printf( "FunctionA called\n" );
    return 0;
}

int FunctionB( int iParamB )
{
    printf( "FunctionB called\n" );
    return 0;
}

int main( void )
{
    int iCondition = 0;
    g_pFunction[0] = FunctionA;
    g_pFunction[1] = FunctionB;
    if( iCondition == 0 )
        g_pFunction[0]( 0 );    // FunctionA will be executed
    else
        g_pFunction[1]( 1 );
    return 0;
}
Or
...
int main( void )
{
    int iCondition = 0;
    g_pFunction[0] = FunctionA;
    g_pFunction[1] = FunctionB;
    g_pFunction[iCondition]( 0 );    // FunctionA will be executed
    return 0;
}
...
>>...I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started...
This is a very long and interesting thread. Unfortunately, we're deviating all the time from the main subject...
Best regards,
Sergey
Here is a Template based API substitution technique:
...
class CApiSet1
{
public:
CApiSet1(){};
~CApiSet1(){};
void FunctionA(){ printf("CApiSet1::FunctionA\n"); };
void FunctionB(){ printf("CApiSet1::FunctionB\n"); };
};
class CApiSet2
{
public:
CApiSet2(){};
~CApiSet2(){};
void FunctionA(){ printf("CApiSet2::FunctionA\n"); };
void FunctionB(){ printf("CApiSet2::FunctionB\n"); };
};
template < class APISET > void ExecuteApiSet( APISET &as );
template < class APISET > void ExecuteApiSet( APISET &as )
{
as.FunctionA();
as.FunctionB();
};
int main( void )
{
    CApiSet1 as1;
    CApiSet2 as2;
    ExecuteApiSet( as1 );
    ExecuteApiSet( as2 );
    return 0;
}
...
Output is as follows:
CApiSet1::FunctionA
CApiSet1::FunctionB
CApiSet2::FunctionA
CApiSet2::FunctionB
Tim!
As I understand it, the rcpps instruction can be used as a fallback option to calculate Taylor series coefficients, possibly on the fly.
This thread is the most viewed thread of the last month. But sadly it has only a few active participants.
Sergey!
If you are interested in my other classes, I can share my work. I have two Java classes. The first class is based on complex numbers, with complex functions implemented as objects. This project contains only trigonometric and exponential functions in the complex domain. The second class deals with various bit manipulations and is based mostly on the book "Hacker's Delight"; it is implemented as static methods.
Application - ScaLibTestApp - WIN32_MSC
...
Hi Iliya,
I completed another set of tests and I'll post the results soon. Two tests for the Intel instruction 'fsin' were added:
one is in ticks and the other is in clock cycles.
Best regards,
Sergey
That would be great.
Afaik 'fsin' latency is around 40-100 cycles per instruction, compared to the Intel vectorized library VML sin function, which achieves a latency of ~23.90 cycles per element.
Set of Tests A:
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro
Best time: 62 ticks
Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 - C Macro
Best time: 74 ticks
Normalized Taylor Series 11t Sin(30.0) = 0.4999999999999643100000 - C Macro
Best time: 93 ticks
Normalized Series 7t Sin(30.0) = 0.4999999918690232700000
Best time: 203 ticks
Normalized Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616694400000
Best time: 203 ticks
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000
Best time: 203 ticks
Normalized Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000
Best time: 218 ticks
Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000
Best time: 218 ticks
Normalized Series 9t Sin(30.0) = 0.5000000000202800000000
Best time: 234 ticks
Normalized Series 11t Sin(30.0) = 0.5000000000000000000000
Best time: 265 ticks
Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616695500000
Best time: 266 ticks
CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support On
Best time: 281 ticks
Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000
Best time: 312 ticks
Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV3 - Optimized
Best time: 343 ticks
Intel instruction FSIN(30.0) = 0.4999999999999999400000 - C Macro
Best time: 359 ticks
Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV2 - Optimized
Best time: 406 ticks
Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV1 - Not Optimized
Best time: 453 ticks
CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support Off
Best time: 484 ticks
Set of Tests B:
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro
Best time: 24 clock cycles
Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 - C Macro
Best time: 34 clock cycles
Normalized Taylor Series 11t Sin(30.0) = 0.4999999999999643100000 - C Macro
Best time: 42 clock cycles
Intel instruction FSIN(30.0) = 0.4999999999999999400000 - C Macro
Best time: 140 clock cycles
Technical Details:
- Windows XP 32-bit / Visual Studio 2005
- All compiler optimizations are disabled / Release configuration / Platform Win32
- Double data type ( 53-bit precision )
- Number of calls for every test is 4194304 ( 2^22 )
- Process priority Normal
- All implementations are portable / N/A for cases: CRT 'sin' and Intel 'fsin' instruction
- CRT function 'sin' tested with SSE2 support On ( faster ) and Off ( slower )
- C Macro means it is implemented as a macro ( inline / no call overhead )
- Ordered from Fastest to Slowest
- Values in Ticks measured using GetTickCount Win32 API function
- Values in Clock Cycles measured using RDTSC instruction
>> Intel instruction FSIN(30.0 )= 0.4999999999999999400000 - C Macro
>> Best time: 140 clock cycles
>>...
A note regarding the 140 clock cycles for the Intel 'FSIN' instruction. The result of my test is very
consistent with what I found in the Intel documentation. Here are details from the Intel 64 and IA-32 Architectures
Optimization Reference Manual ( Nov 2007 edition ). Topic: Instruction Latency and Throughput
Summary for 'FSIN' instruction from Tables C-11 and C-11a from Pages C-25 and C-26:
...
Latency for a CPU with CPUID 0F_2H : 160 - 180
Latency for a CPU with CPUID 0F_3H : 160 - 200
Latency for a CPU with CPUID 06_17H: 82
Latency for a CPU with CPUID 06_0EH: 119
Latency for a CPU with CPUID 06_0DH: 119
...
Throughput for a CPU with CPUID 0F_2H : 130
Throughput for a CPU with CPUID 06_0EH: 116
Throughput for a CPU with CPUID 06_0DH: 116
...
Also, a Footnote #4 says:
...
4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions.
...
Sergey, try passing a random value as an argument to your functions.
I see that fastsin() is the slowest function, probably because of the 11 terms involved in the calculation.
The Chebyshev polynomial approximation has the lowest accuracy.
It can also depend on the value passed as an argument to 'fsin'.
That's a good idea to verify some time later. Thank you, Iliya!
While testing 'fsin', try passing 2^22 random values to the instruction.
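A minimal sketch of how the random-argument suggestion could be set up, so that every call sees a different value instead of a constant 30 degrees. All names here ( lcg_next, fill_random_angles, sum_over_inputs, taylor_sin_7t ) are hypothetical helpers, and the timer ( GetTickCount, RDTSC or anything else ) is deliberately left out; it would wrap the call to sum_over_inputs.

```c
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator so that runs are reproducible. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fill buf with n pseudo-random angles in [0, max_rad). */
static void fill_random_angles(double *buf, size_t n, double max_rad, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = 0; i < n; ++i) {
        /* Scale a 32-bit integer to [0, 1), then to [0, max_rad). */
        buf[i] = (lcg_next(&state) / 4294967296.0) * max_rad;
    }
}

/* Example function under test: 7th-degree Taylor polynomial, no range reduction. */
static double taylor_sin_7t(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))));
}

/* Call f on every input; summing the results keeps the calls from being optimized away. */
static double sum_over_inputs(double (*f)(double), const double *buf, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += f(buf[i]);
    return acc;
}
```

Filling the buffer before the timed region keeps the cost of the random number generator out of the measurement. With 2^22 doubles the buffer is 32 MB, so memory traffic becomes part of what is measured; a smaller buffer traversed several times is one way around that.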
Hi Iliya,
I'd like to verify our sources of data... Where did you see that number? I've found a link:
http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html
and I don't see the 23.90 number for the sine function ( real ). Here is a screenshot:
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro
Best time: 24 clock cycles
... [SergeyK] I'd like to note that it is just a two-code-line C macro and there is no range reduction.
Here is your comment from Post #278:
>>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...
I think it is unfair to compare highly optimized trigonometric functions that use the latest Intel instruction sets ( SSE4 / AVX / etc )
with highly portable C implementations of trigonometric functions.
@Sergey
Sorry, I did not remember the correct value when I wrote my post.
The correct value is 20.96 cycles per element for the HA ( High Accuracy ) double argument.
By the way, my source is the same as yours.
Completely agree with you.
I have the following trig functions, but I am wondering if there is a faster algorithm that I could implement:
Hi!
What mathematical formulas are your functions based on?
Did you use library trigonometric functions?
Quoting iliyapolak ...What mathematical formulas are your functions based on?
Trigonometric functions implemented in the test-case ( Post #290 ) are based on fundamental trigonometric identities:
sin(A+B) = sin(A)*cos(B) + cos(A)*sin(B)
cos(A-B) = sin(A)*sin(B) + cos(A)*cos(B)
tan((A+B)/2) = ( sin(A) + sin(B) )/ ( cos(A) + cos(B) )
By the way, that method is used in Digital Signal Processing to generate tables of values for sin, cos and tan ( all based on Linear Interpolation )
for embedded systems with very constrained Read Only Memory ( ROM ) resources.
I have my own test-case and I'll post it later...
Best regards,
Sergey
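As a rough illustration of how the identities quoted above can generate a table, here is a minimal sketch. It steps through the angles with the angle-addition forms sin(A+B) and cos(A+B); the cosine form is the standard counterpart of the cos(A-B) identity listed in the post. The function name build_trig_tables and its parameters are hypothetical, and this is not the test-case from Post #290.

```c
#include <stddef.h>

/*
 * Fill sin_tab[k] and cos_tab[k] with sin(k*step) and cos(k*step) for
 * k = 0 .. n-1 using only multiplications and additions.
 * sin_step and cos_step are the sine and cosine of the step angle,
 * supplied by the caller ( for example as precomputed constants ).
 */
static void build_trig_tables(double *sin_tab, double *cos_tab, size_t n,
                              double sin_step, double cos_step)
{
    double s = 0.0; /* sin(0) */
    double c = 1.0; /* cos(0) */
    for (size_t k = 0; k < n; ++k) {
        sin_tab[k] = s;
        cos_tab[k] = c;
        /* sin(A+B) = sin(A)*cos(B) + cos(A)*sin(B) */
        double s_next = s * cos_step + c * sin_step;
        /* cos(A+B) = cos(A)*cos(B) - sin(A)*sin(B) */
        double c_next = c * cos_step - s * sin_step;
        s = s_next;
        c = c_next;
    }
}
```

Rounding errors accumulate slowly with the number of steps, so for long tables it may be worth restarting the recurrence from exact values at a few anchor angles. Values between table entries would then come from linear interpolation, as mentioned in the post.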
Yes, that's true, but in his post he is looking for the fastest method of calculating trigonometric formulas.
Please take a look at results of my tests in a Post #279:
http://software.intel.com/en-us/forums/showpost.php?p=190992
There are two groups of test results, and the results in every set are ordered from the best to the worst.
Best regards,
Sergey
@Sergey
Is your summary project still relevant?
Hi Iliya,
.
>>...Is your summary project still relevant?
.
I just completed a quick review, and the most important piece of information that I wanted to use has been removed: the "Post Number". I really don't understand why the developers of the new IDZ website removed it. How are we going to refer to a particular post? Any ideas?
.
Best regards,
Sergey
Hi Sergey
>>...I just completed a quick review and the most important piece of information that I wanted to use is removed and this is "A Post Number". I really don't understand why developers of new IDZ website removed it. How are we going to refer to some post? Any ideas?
Yes, I know that. The developers removed 64 posts from this thread. I never anticipated such a loss of our accumulated knowledge.
Can we write some petition to the webmaster?
>>...Yes, I know that. The developers removed 64 posts from this thread. I never anticipated such a loss of our accumulated knowledge...
.
I didn't count how many posts were lost, but the damage to some source code examples is significant because some of them were re-formatted to an almost unreadable state.
.
>>...Can we write some petition to the webmaster?
.
I think yes, but is it going to work? I don't think so. He/She won't try to do anything to recover these lost posts. On my account I lost almost 500 posts and access to my private folders with screenshots, docs and zip archives. They don't want to change the font color from light grey to black in the new editor. I can't read clearly and need to move my head as close as possible to my display. Is it difficult to change the color? No, however it has not been done in 4 weeks!
>>...I didn't count how many posts were lost, but the damage to some source code examples is significant because some of them were re-formatted to an almost unreadable state.
I cannot understand the reason behind the forum redesign. Why did they need to do this? Everything was working OK; it was a very strange decision, not easily understandable by us, the forum users.
>>...I think yes, but is it going to work? I don't think so. He/She won't try to do anything to recover these lost posts. On my account I lost almost 500 posts and access to my private folders with screenshots, docs and zip archives. They don't want to change the font color from light grey to black in the new editor. I can't read clearly and need to move my head as close as possible to my display. Is it difficult to change the color? No, however it has not been done in 4 weeks!
I feel sorry for you, Sergey, for the loss of your posts, code and docs. Do you have any backup of your code examples which were posted and have been lost?
>>...Do you have any backup of your code examples which were posted and have been lost?
.
Yes, I have, but I will need some time to identify, find and extract these test-cases. I have hundreds and hundreds of different tests in my own collection and it is always a bit of a challenge when something needs to be recovered. Unfortunately, I don't have time to re-post them.
It is good that you have a backup.
@Sergey, a slightly off-topic question for you. I am now working on a multithreaded version of my SpecialFunctions library. My methods will mostly be array-filling functions, but I think that simply using multithreaded Java programming to fill an array is kind of boring. I would like to ask you what type of problems ( presumably from math and/or physics ) could be used to sharpen my multithreading programming skills :)
>>...what type of problems(presumably from math and/or physics) could be used to sharpen my multithreading programming skills...
.
At the beginning I would consider a very simple case like the addition of two large matrices. Depending on the number of CPUs, the addition could be done in parallel. Another example is sorting large data sets, and the Merge sort algorithm is a perfect candidate for R&D.
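A minimal sketch of the matrix-addition suggestion, written in C with an OpenMP pragma rather than Java threads. The function name matrix_add and the row-major layout are assumptions made for the example. Every output element depends only on the two matching input elements, which is why the rows can be handed to different threads without any synchronization.

```c
#include <stddef.h>

/*
 * c = a + b for two rows x cols matrices stored in row-major order.
 * With OpenMP enabled ( e.g. -fopenmp on GCC or Clang ) the pragma splits
 * the rows across the available cores; without it the pragma is ignored
 * and the loop simply runs serially.
 */
static void matrix_add(const double *a, const double *b, double *c,
                       size_t rows, size_t cols)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            size_t idx = (size_t)i * cols + j;
            c[idx] = a[idx] + b[idx];
        }
    }
}
```

Matrix addition does very little arithmetic per byte loaded, so on many machines the speed-up is limited by memory bandwidth rather than by the number of cores. That makes it a useful first experiment, but not necessarily a flattering one.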
Hi Sergey!
>>>...At the beginning I would consider a very simple case like addition of two large matrices>>>
Thank you for your answer. I am already testing double-loop arrays filled with sums of various sine functions ( fastsin being used ) with varying frequency. My CPU is a Core i3 installed in a laptop, and from the tests I can see that with a "large pool" of more than 4 threads I can gain slightly improved performance. I think that the main reason behind the mediocre performance under heavy floating-point load is the Hyper-Threading technology.
>>...Cannot understand the reason behind the forum redesign. Why did they need to do this?
.
Here is my understanding of the situation:
.
- 1. It was hard to maintain the Old System
- 2. Intel wanted to make the upgrade before the IDF 2012 event in San Francisco ( in order to announce the new IDZ website )
- 3. As soon as some new versions of Intel software are released, like Intel C++ compiler v13 or IPP library, all posts must be placed in the New System
.
Best regards,
Sergey
>>...I think that the main reason behind the mediocre performance under heavy floating-point load is the Hyper-Threading technology.
.
By the way, you're not the first developer who has experienced a problem with HTT when the FPU is used. Could you provide some real performance numbers?
.
Best regards,
Sergey
.
PS: Also, you could consider some simple algorithms like Matrix Transpose, finding Min or Max values in a data set, scalar additions, etc. So, just use your imagination: almost all iterative algorithms ( with a 'FOR' statement ) could be implemented in parallel.
>>...By the way, you're not the first developer who has experienced a problem with HTT when the FPU is used. Could you provide some real performance numbers?...
Yes, I expected such behaviour from the HT processor. Simply providing discrete logic such as GP registers with their own APIC, while sharing the FPU between the physical and logical cores, will not accelerate even easily "threadable" floating-point workloads.
Soon I will post some results from my testing. But bear in mind that everything is written in Java, and the Java System.nanoTime() timing method is binary translated to the RDTSC instruction, with all its consequences.
Quote:
If you or Sergey are missing specific posts or attachments, please raise the topic (e.g. here in the forum or by a pm to me). I cannot promise anything but, personally, I was missing an article and a blog and both could be recovered.
>>...Soon I will post some results from my testing. But bear in mind that everything is written in Java, and the Java System.nanoTime() timing
>>method is binary translated to the RDTSC instruction, with all its consequences.
.
Thank you in advance, Iliya!
>>...If you or Sergey are missing specific posts or attachments, please raise the topic (e.g. here in the forum or by a pm to me)...
.
Thomas,
.
Iliya and I noticed that lots ( hundreds! ) of posts have been deleted. As I said, in my case it is about 500. Also, I no longer have access to the 'Files' section of the Old System, where I kept hundreds of jpg-files ( screenshots ), docs and zip archives. Unfortunately, my archive ( a 10-month job on the ISN! ) was damaged significantly!