Optimization of sine function's taylor expansion


The rcpps is likely to be useful only where 12 bits precision may be adequate

rcpps was used only when I started writing various function expansions in asm. I simply implemented these formulas almost "as is", i.e. the coefficients were not pre-calculated, and knowing that divps is very slow I used rcpps to calculate the coefficients' reciprocals. I know that rcpps's precision is lacking, so I started to pre-calculate the Taylor series coefficients in Mathematica 8 and implemented Horner's scheme to speed up my library functions. For example, I was able to achieve a 4x improvement in my gamma Stirling function compared to the pure iterative method. Here is the link (post #28): http://software.intel.com/en-us/forums/showthread.php?t=106032

>>don't see enough context to understand how you would want rcpps for expansion of a series with known coefficients

As I stated earlier in my post, rcpps was used to calculate the coefficients of various Taylor expansions on the fly. It is clear that rcpps is not needed with pre-calculated coefficients.
So I decided to completely rewrite my assembly-based implementations.
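To illustrate the pre-calculated-coefficient approach described above, here is a minimal sketch in C rather than assembly. It is not the poster's library code: the function name `sin_horner`, the choice of six terms, and the assumption that the argument is already a small angle in radians are all illustrative.

```c
/* Pre-calculated Taylor coefficients of sin(x): (-1)^k / (2k+1)! for k = 1..5.
   Storing them as constants removes every division (and any rcpps estimate)
   from the evaluation itself. */
static const double C3  = -1.0 / 6.0;
static const double C5  =  1.0 / 120.0;
static const double C7  = -1.0 / 5040.0;
static const double C9  =  1.0 / 362880.0;
static const double C11 = -1.0 / 39916800.0;

/* Hypothetical helper: six-term sine polynomial evaluated with Horner's
   scheme in x*x. No range reduction is done, so it is only meant for
   small |x| given in radians. */
static double sin_horner(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (C3 + x2 * (C5 + x2 * (C7 + x2 * (C9 + x2 * C11)))));
}
```

Calling `sin_horner(0.52359877559829887)` (30 degrees in radians) should give a value very close to 0.5. Horner's form needs one multiply and one add per coefficient, which is where the speed-up over term-by-term evaluation comes from.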

Related to an issue with measuring the overhead of an empty 'for' statement.

>>...
>>It should not be more than few cycles per iteration when unoptimized by compiler...
>>...

In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer.

...

// Sub-Test 6.2 - Overhead of Empty For Statement
{
///*
	CrtPrintf( RTU("Sub-Test 6.2 - [ Empty For Statement   ]\n") );
	RTclock_t ctClock1 = 0;
	RTclock_t ctClock2 = 0;
	ctClock1 = ( RTclock_t )CrtClock();
	for( RTint t = 0; t < 1000000; t++ )
	{
		;
	}
	ctClock2 = ( RTclock_t )CrtClock();
	CrtPrintf( RTU("Sub-Test 6.2 -   1,000,000 iterations                  - %4ld clock cycles\n"),
			   ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 1000000 ) );
//*/
}

// Sub-Test 6.3 - Overhead of Empty For Statement
{
///*
	CrtPrintf( RTU("Sub-Test 6.3 - [ Empty For Statement   ]\n") );
	RTclock_t ctClock1 = 0;
	RTclock_t ctClock2 = 0;
	ctClock1 = ( RTclock_t )CrtClock();
	for( RTint t = 0; t < 10000000; t++ )
	{
		;
	}
	ctClock2 = ( RTclock_t )CrtClock();
	CrtPrintf( RTU("Sub-Test 6.3 -  10,000,000 iterations                  - %4ld clock cycles\n"),
			   ( RTint )( ( RTfloat )( ctClock2 - ctClock1 ) / 10000000 ) );
//*/
}

...

Output is as follows:

...
Sub-Test 6.2 - [ Empty For Statement ]
Sub-Test 6.2 - 1,000,000 iterations - 5 clock cycles
Sub-Test 6.3 - [ Empty For Statement ]
Sub-Test 6.3 - 10,000,000 iterations - 5 clock cycles
...

I used a different number of iterations up to 100,000,000. Results are consistent and always 5 clock cycles.
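The averaging idea used in these sub-tests (time the whole loop, then divide by the iteration count) can be sketched portably as follows. This is only an approximation of the test above: it uses the standard `clock()` function as a stand-in for the `CrtClock`/RDTSC path, reports nanoseconds instead of clock cycles, and the helper name `empty_loop_ns_per_iter` is hypothetical.

```c
#include <time.h>

/* Hypothetical helper: average cost of one iteration of an otherwise empty
   loop, in nanoseconds. The loop counter is volatile so that an optimizing
   compiler cannot delete the loop; this adds a load and a store per
   iteration compared with a register-only counter. */
static double empty_loop_ns_per_iter(long iterations)
{
    clock_t start = clock();
    for (volatile long t = 0; t < iterations; t++)
    {
        ;
    }
    clock_t stop = clock();

    /* Total elapsed time divided by the iteration count. */
    return (double)(stop - start) * 1e9 / (double)CLOCKS_PER_SEC / (double)iterations;
}
```

Because `clock()` has a coarse resolution, a large iteration count (millions, as in the sub-tests) is needed before the per-iteration figure becomes meaningful.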

>>In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer

Hi Sergey!

What is the Hardware Run-Time Abstraction Layer? Is it somehow related to the Windows HAL?
What are these types, for example RTint? Is it a macro for the int type?
When you perform your tests you are using your wrapper library, which wraps the Win32 API (if I understood properly). Could such a layered approach add significant overhead to the timing of various functions?

>>I used a different number of iterations up to 100,000,000. Results are consistent and always 5 clock cycles

Thanks for testing!
The results are exactly as I predicted: a few cycles of overhead. Bear in mind that in a real-world scenario the CPU will execute the for-loop bookkeeping in parallel with the statements inside the loop.
An interesting scenario will arise when you implement a for-loop based on a floating-point counter with floating-point calculations inside the loop. Here I suppose the CPU will interleave the execution of the fp instructions between Port0 and Port1.

Tim

>>We are still struggling with software implementation of divide

What do you mean by this? Are you referring to division algorithms implemented in hardware?

The Intel compilers have default options -no-prec-div -no-prec-sqrt where, in some contexts, rather than using an IEEE accurate hardware instruction (or in some cases library function), there is explicit iteration in the generated code, starting with rcpps or rsqrtps. The Harpertown CPU made a big improvement in the efficiency of the IEEE accurate hardware instructions, and the Ivy Bridge core I7-3 is even better (aside from splitting the AVX-256 into sequenced 128-bit instructions). Yet we still have these iterative sequences.
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started.
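The "explicit iteration starting with rcpps" mentioned above is typically a Newton-Raphson refinement of the approximate reciprocal. The sketch below shows one such step in scalar C. It is not the code the Intel compiler generates: `rough_reciprocal` is a hypothetical stand-in that only imitates a roughly 12-bit estimate, and both function names are made up for illustration.

```c
/* One Newton-Raphson step for 1/d: x1 = x0 * (2 - d * x0).
   If x0 has relative error e, x1 has relative error of about e*e,
   so the number of correct bits roughly doubles per step. */
static float refine_reciprocal(float d, float x0)
{
    return x0 * (2.0f - d * x0);
}

/* Hypothetical stand-in for a ~12-bit hardware estimate such as rcpps:
   the exact reciprocal perturbed by a relative error of 2^-12. */
static float rough_reciprocal(float d)
{
    return (1.0f / d) * (1.0f + 1.0f / 4096.0f);
}
```

Starting from an estimate with about 12 good bits, a single step gets close to full single precision, which is why such sequences can compete with a hardware divide when strict IEEE accuracy is not required.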

Quoting iliyapolak

In my tests 'RDTSC' instruction is mapped to a CRT-function 'CrtClock' using Hardware Run-Time Abstraction Layer

Hi Sergey!

What is Hardware Run-Time Abstraction Layer?

[SergeyK] Please take a look at a thread:
http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr
Post #3

Does it somehow relate to Windows HAL?

[SergeyK] No. It is an internal feature of the ScaLib project.
I will follow up on the rest of your questions later.

Best regards,
Sergey

Sergey
Could you explain how the Run-Time HAL is implemented in software? I think that a possible implementation would be to wrap inline assembly in custom library functions.

Quoting iliyapolak...
What are these types for example RTint ? Is it macro for int type?

[SergeyK]
No. RTint is declared as a typedef. Here is a simple fix for you based on macros:

...
#define RTint int
#define RTclock_t clock_t
#define CrtPrintf printf
#define RTU( text ) text
#define SysGetTickCount ::GetTickCount
...

When you perform all your tests you are using your wrapper library which wraps WIN API(as I understood properly) could such a layered approach add a significant overhead to the timing of various functions?

[SergeyK]
Yes, and it is applicable to any function, not just to some Win32 API function. In almost 99.99% of my Test-Cases
I use the Win32 API function 'GetTickCount' to measure time intervals. It completely satisfies the project needs.

Quoting iliyapolak... Sergey, could you explain how the Run-Time HAL is implemented in software? I think that a possible implementation would be to wrap inline assembly in custom library functions.

Hi Iliya,

There is a technique in C/C++ that I would call 'macro substitution'. Actually, it is as old as the C language, and here is
a small example:

#include <stdio.h>

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
	printf( "FunctionA called\n" );
	return iParamA;
}

int FunctionB( int iParamB )
{
	printf( "FunctionB called\n" );
	return iParamB;
}

#define _USE_FUNCTION_A
//#define _USE_FUNCTION_B

#ifdef _USE_FUNCTION_A
#define Function FunctionA	// 'Function' resolves to FunctionA at compile time
#endif
#ifdef _USE_FUNCTION_B // I always use explicit #ifdefs and I never use #ifdef-#else-#endif
#define Function FunctionB
#endif

int main( void )
{
	Function( 1 );
	return 0;
}

In essence, the HRT AL in the ScaLib library is based on that. Remember that this is a compile-time technique. If you need
a similar substitution at run time, it can also be done.

Best regards,
Sergey

Thanks for explaining.
You know, when I first saw the word "HAL" I thought that this could be some kind of wrapper library which accesses hardware, with software running on top of it that reaches the underlying hardware through the library's wrapper functions. And that all this access is performed by a custom driver which reads the Device Object and Driver Object data structures, with the hardware accesses implemented in inline assembly.

Here is a Test-Case for a run-time substitution:

#include <stdio.h>

int ( *g_pFunction[2] )( int iParam );	// table of function pointers filled in at run time

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
	printf( "FunctionA called\n" );
	return iParamA;
}

int FunctionB( int iParamB )
{
	printf( "FunctionB called\n" );
	return iParamB;
}

int main( void )
{
	int iCondition = 0;

	g_pFunction[0] = FunctionA;
	g_pFunction[1] = FunctionB;

	if( iCondition == 0 )
		g_pFunction[0]( 0 ); // FunctionA will be executed
	else
		g_pFunction[1]( 1 );
	return 0;
}

Quoting Sergey Kostrov

...
int main( void )
{
	int iCondition = 0;

	g_pFunction[0] = FunctionA;
	g_pFunction[1] = FunctionB;

	if( iCondition == 0 )
		g_pFunction[0]( 0 ); // FunctionA will be executed
	else
		g_pFunction[1]( 1 );
	return 0;
}

Or
...
int main( void )
{
	int iCondition = 0;

	g_pFunction[0] = FunctionA;
	g_pFunction[1] = FunctionB;

	g_pFunction[iCondition]( 0 ); // FunctionA will be executed
	return 0;
}
...

Quoting TimP (Intel)...
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started...

This is a very long and interesting thread. Unfortunately, we're deviating all the time from the main subject...

Best regards,
Sergey

Here is a template-based API substitution technique:

...
#include <stdio.h>

class CApiSet1
{
public:
	CApiSet1(){}
	~CApiSet1(){}
	void FunctionA(){ printf("CApiSet1::FunctionA\n"); }
	void FunctionB(){ printf("CApiSet1::FunctionB\n"); }
};

class CApiSet2
{
public:
	CApiSet2(){}
	~CApiSet2(){}
	void FunctionA(){ printf("CApiSet2::FunctionA\n"); }
	void FunctionB(){ printf("CApiSet2::FunctionB\n"); }
};

template < class APISET > void ExecuteApiSet( APISET &as );

// Works with any class that provides FunctionA() and FunctionB();
// the API set is selected at compile time by the argument type.
template < class APISET > void ExecuteApiSet( APISET &as )
{
	as.FunctionA();
	as.FunctionB();
}

int main( void )
{
	CApiSet1 as1;
	CApiSet2 as2;

	ExecuteApiSet( as1 );
	ExecuteApiSet( as2 );
	return 0;
}
...

Output is as follows:

CApiSet1::FunctionA
CApiSet1::FunctionB
CApiSet2::FunctionA
CApiSet2::FunctionB

The Intel compilers have default options -no-prec-div -no-prec-sqrt where, in some contexts, rather than using an IEEE accurate hardware instruction (or in some cases library function), there is explicit iteration in the generated code, starting with rcpps or rsqrtps.

Tim!
If I understood properly, the rcpps instruction can be used as a fallback option to calculate Taylor series coefficients, possibly on the fly.

This is a very long and interesting thread

This thread has been the most viewed thread for the last month. But sadly it has only a few active participants.

That would be awesome if you share your work

Sergey!
If you are interested in my other classes, I can share my work. I have two Java classes. The first class is based on complex numbers and complex functions implemented as objects. This project contains only trigonometric and exponential functions in the complex domain. The second class deals with various bit manipulations and is based mostly on the book "Hacker's Delight"; it is implemented as static methods.

Quoting Sergey Kostrov
Here is another set of performance numbers:

Application - ScaLibTestApp - WIN32_MSC
...

Hi Iliya,

I completed another set of tests and I'll post results soon. Two tests for the Intel instruction 'fsin' were added:
one is in ticks and another one is in clock cycles.

Best regards,
Sergey

>>I completed another set of tests and I'll post results soon. Two tests for Intel instruction 'fsin' added:
>>one is in ticks and another one is in clock cycles

That would be great.
AFAIK, 'fsin' latency is around 40-100 cycles per instruction, compared to the Intel vectorized library (VML) sin function, which achieves ~23.90 cycles per element.

Set of Tests A:

     Normalized Taylor Series        7t  Sin(30.0) = 0.4999999918690232200000 - C Macro
     Best time:  62 ticks

     Normalized Taylor Series        9t  Sin(30.0) = 0.5000000000202798900000 - C Macro
     Best time:  74 ticks

     Normalized Taylor Series       11t  Sin(30.0) = 0.4999999999999643100000 - C Macro
     Best time:  93 ticks

     Normalized Series               7t  Sin(30.0) = 0.4999999918690232700000
     Best time: 203 ticks

     Normalized Chebyshev Polynomial 7t  Sin(30.0) = 0.4999999476616694400000
     Best time: 203 ticks

     Normalized Taylor Series        7t  Sin(30.0) = 0.4999999918690232200000
     Best time: 203 ticks

     Normalized Chebyshev Polynomial 9t  Sin(30.0) = 0.4999999997875643800000
     Best time: 218 ticks

     Normalized Taylor Series        9t  Sin(30.0) = 0.5000000000202798900000
     Best time: 218 ticks

     Normalized Series               9t  Sin(30.0) = 0.5000000000202800000000
     Best time: 234 ticks

     Normalized Series              11t  Sin(30.0) = 0.5000000000000000000000
     Best time: 265 ticks

     Chebyshev Polynomial            7t  Sin(30.0) = 0.4999999476616695500000
     Best time: 266 ticks

     CRT                                 Sin(30.0) = 0.4999999999999999400000 - SSE2 support On
     Best time: 281 ticks

     Chebyshev Polynomial            9t  Sin(30.0) = 0.4999999997875643800000
     Best time: 312 ticks

     Normalized Taylor Series       23t  Sin(30.0) = 0.4999999999999999400000 - FastSinV3 - Optimized
     Best time: 343 ticks

     Intel instruction                  FSIN(30.0) = 0.4999999999999999400000 - C Macro
     Best time: 359 ticks

     Normalized Taylor Series       23t  Sin(30.0) = 0.4999999999999999400000 - FastSinV2 - Optimized
     Best time: 406 ticks

     Normalized Taylor Series       23t  Sin(30.0) = 0.4999999999999999400000 - FastSinV1 - Not Optimized
     Best time: 453 ticks

     CRT                                 Sin(30.0) = 0.4999999999999999400000 - SSE2 support Off
     Best time: 484 ticks

Set of Tests B:

     Normalized Taylor Series        7t  Sin(30.0) = 0.4999999918690232200000 - C Macro
     Best time:  24 clock cycles

     Normalized Taylor Series        9t  Sin(30.0) = 0.5000000000202798900000 - C Macro
     Best time:  34 clock cycles

     Normalized Taylor Series       11t  Sin(30.0) = 0.4999999999999643100000 - C Macro
     Best time:  42 clock cycles

     Intel instruction                  FSIN(30.0) = 0.4999999999999999400000 - C Macro
     Best time: 140 clock cycles

Technical Details:

- Windows XP 32-bit / Visual Studio 2005
- All compiler optimizations are disabled / Release configuration / Platform Win32
- Double data type ( 53-bit precision )
- Number of calls for every test is 4194304 ( 2^22 )
- Process priority Normal
- All implementations are portable / N/A for cases: CRT 'sin' and Intel 'fsin' instruction
- CRT function 'sin' tested with SSE2 support On ( faster ) and Off ( slower )
- C Macro means it is implemented as a macro ( inline / no call overhead )
- Ordered from Fastest to Slowest
- Values in Ticks measured using GetTickCount Win32 API function
- Values in Clock Cycles measured using RDTSC instruction

>>...
>> Intel instruction FSIN(30.0 )= 0.4999999999999999400000 - C Macro
>> Best time: 140 clock cycles
>>...

A note regarding the 140 clock cycles for the Intel 'FSIN' instruction. The result of my test is very
consistent with what I found in Intel documentation. Here are details from the Intel 64 and IA-32 Architectures
Optimization Reference Manual ( Nov 2007 edition ). Topic: Instruction Latency and Throughput

Summary for 'FSIN' instruction from Tables C-11 and C-11a from Pages C-25 and C-26:

...
Latency for a CPU with CPUID 0F_2H : 160 - 180
Latency for a CPU with CPUID 0F_3H : 160 - 200
Latency for a CPU with CPUID 06_17H: 82
Latency for a CPU with CPUID 06_0EH: 119
Latency for a CPU with CPUID 06_0DH: 119
...
Throughput for a CPU with CPUID 0F_2H : 130
Throughput for a CPU with CPUID 06_0EH: 116
Throughput for a CPU with CPUID 06_0DH: 116
...

Also, a Footnote #4 says:

...
4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions.
...

    >>Normalized Taylor Series       23t  Sin(30.0) = 0.4999999999999999400000 - FastSinV1 - Not Optimized
    >>Best time: 453 ticks

    Sergey, try to pass a random value as an argument to your functions.
    I see that fastsin() is the slowest function, probably because of the 11 terms involved in the calculation.
    The Chebyshev polynomial approximation suffers from the lowest accuracy.

    4. Latency and Throughput of transcendental instructions can vary substantially in a
    dynamic execution environment. Only an approximate value or a range of values
    are given for these instructions

    It can also depend on the value passed as an argument to 'fsin'.

    Quoting iliyapolak

    4. Latency and Throughput of transcendental instructions can vary substantially in a
    dynamic execution environment. Only an approximate value or a range of values
    are given for these instructions

    It can also depend on the value passed as an argument to 'fsin'.

    That's a good idea to verify some time later. Thank you, Iliya!

    >>That's a good idea to verify some time later

    While testing 'fsin', try to pass 2^22 random values to the instruction.
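One way to prepare such inputs is to fill an array with pseudo-random angles before the timed loop, so that the cost of generating them is not measured. The sketch below is only an assumption about how this could be done: the helper name `fill_random_angles`, the use of `rand()`, and the range [0, 2*pi) are illustrative choices, not something taken from the tests in this thread.

```c
#include <stdlib.h>

#define N_ARGS ( 1 << 22 )	/* 2^22, the call count used in the tests above */

/* Hypothetical helper: fill buf with n pseudo-random angles in [0, 2*pi).
   A fixed seed makes the timing runs repeatable. */
static void fill_random_angles(double *buf, size_t n, unsigned seed)
{
    const double two_pi = 6.28318530717958647692;

    srand(seed);
    for (size_t i = 0; i < n; i++)
    {
        /* rand() / (RAND_MAX + 1) lies in [0, 1) */
        buf[i] = two_pi * ((double)rand() / ((double)RAND_MAX + 1.0));
    }
}
```

The timed loop would then call the function under test once per element of a buffer of `N_ARGS` doubles, instead of evaluating the same constant argument each time.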

    Quoting iliyapolak...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element.

    Hi Iliya,

    I'd like to verify our sources of data... Where did you see that number? I've found a link:

    http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

    and I don't see the 23.90 number for the sine function ( real ). Here is a screenshot:

    Quoting Sergey Kostrov...
    Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro
    Best time: 24 clock cycles
    ...

    [SergeyK] I'd like to note that it is just a two-line C macro and there is no range reduction.

    Here is your comment from Post #278:

    >>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...

    I think it is unfair to compare highly optimized trigonometric functions that use the latest sets of Intel instructions ( SSE4 / AVX / etc )
    with highly portable C implementations of trigonometric functions.

    @Sergey
    Sorry, I did not remember the correct value when I wrote my post.
    The correct value is 20.96 cycles per element for HA double argument.
    Btw my source is the same as yours.

    Here is your comment from Post #278:

    >>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...

    I think it is unfair to compare highly optimized trigonometric functions

    Completely agree with you.

    I have the following trig functions, but I am wondering if there is a faster algorithm that I could implement:

    static const double SINMIN=0.0009999998333333; /* sin(0.001) */
    static const double COSMIN=0.9999995000000417; /* cos(0.001) */
    static const double TANMIN=0.0010000003333334; /* tan(0.001) */
    static const double E=2.718281828459045235360;
    static const double PI=3.14159265358979323846;
    static const double MINVAL=0.001; /* step angle; must match the three constants above */

    /* helpers used by this code but not shown in the post */
    double aabs(double x);
    int quadrant(double x);

    double sin(double ax);
    double cos(double ax);
    double tan(double ax);

    double sin(double ax)
    {
        double x=aabs(ax);
        if (x<=MINVAL) /* base case; <= avoids an exact floating-point comparison */
        {
            switch (quadrant(ax))
            {
            case 1:
            case 2:
                return SINMIN;
            default: /* quadrants 3 and 4 */
                return -SINMIN;
            }
        }
        /* sin(A+B) = sin(A)*cos(B) + cos(A)*sin(B), with A = MINVAL and B = x-MINVAL */
        return (SINMIN*cos(x-MINVAL)+COSMIN*sin(x-MINVAL));
    }
    double cos(double ax)
    {
        double x=aabs(ax);
        if (x<=MINVAL)
        {
            switch (quadrant(ax))
            {
            case 1:
            case 4:
                return COSMIN;
            default: /* quadrants 2 and 3 */
                return -COSMIN;
            }
        }
        /* cos(A+B) = cos(A)*cos(B) - sin(A)*sin(B) */
        return (COSMIN*cos(x-MINVAL)-SINMIN*sin(x-MINVAL));
    }
    double tan(double ax)
    {
        double x=aabs(ax);
        if (x<=MINVAL)
        {
            switch (quadrant(ax))
            {
            case 1:
            case 3:
                return TANMIN;
            default: /* quadrants 2 and 4 */
                return -TANMIN;
            }
        }
        /* tan(A+B) = (tan(A) + tan(B)) / (1 - tan(A)*tan(B)) */
        return ((TANMIN+tan(x-MINVAL))/(1.0000000000000-(TANMIN*tan(x-MINVAL))));
    }


    I have the following trig functions, but I am wondering if there is a faster algorithm that I could implement

    Hi!
    What mathematical formulas are your functions based on?
    Did you use library trigonometric functions?

    Hi Iliya,

    Quoting iliyapolak...What mathematical formulas are your functions based on?

    The trigonometric functions implemented in the test-case ( Post #290 ) are based on fundamental trigonometric identities:

    sin(A+B) = sin(A)*cos(B) + cos(A)*sin(B)

    cos(A-B) = sin(A)*sin(B) + cos(A)*cos(B)

    tan((A+B)/2) = ( sin(A) + sin(B) )/ ( cos(A) + cos(B) )

    By the way, that method is used in Digital Signal Processing to generate tables of values for sin, cos and tan ( all based on Linear Interpolation )
    for embedded systems with very constrained Read Only Memory ( ROM ) resources.
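As a rough illustration of how the angle-addition identities can generate such a table, the sketch below fills sine and cosine tables from a single pair of seed values. It is not Sergey's test-case and does not cover the interpolation step: the function name `build_trig_table` and its parameters are hypothetical, and the caller is assumed to supply sin(step) and cos(step) as constants.

```c
/* Hypothetical helper: fill sin_tab[k] = sin(k*step) and cos_tab[k] = cos(k*step)
   for k = 0..n-1 using only multiplications and additions.
   sin_step and cos_step are sin(step) and cos(step), supplied by the caller
   (for example as pre-calculated constants), so no math library is needed.
   Requires n >= 1. */
static void build_trig_table(double *sin_tab, double *cos_tab, int n,
                             double sin_step, double cos_step)
{
    sin_tab[0] = 0.0;
    cos_tab[0] = 1.0;
    for (int k = 1; k < n; k++)
    {
        /* sin(A+B) = sin(A)*cos(B) + cos(A)*sin(B) */
        sin_tab[k] = sin_tab[k - 1] * cos_step + cos_tab[k - 1] * sin_step;
        /* cos(A+B) = cos(A)*cos(B) - sin(A)*sin(B) */
        cos_tab[k] = cos_tab[k - 1] * cos_step - sin_tab[k - 1] * sin_step;
    }
}
```

With a one-degree step (sin_step = 0.017452406437283512, cos_step = 0.99984769515639124) and n = 91, entry 30 of the sine table should be very close to 0.5. Rounding error grows slowly with the table length, so long tables may need periodic re-seeding from exact values.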

    I have my own test-case and I'll post it later...

    Best regards,
    Sergey

    >>Trigonometric functions implemented in the test-case ( Post #290 ) are based on fundamental trigonometric identities

    Yes, that's true, but in his post he is looking for the fastest method for calculating trigonometric formulas.

    Quoting bestofshayari... I have the following trig functions, but I am wondering if there is a faster algorithm that I could implement...

    Please take a look at results of my tests in a Post #279:

    http://software.intel.com/en-us/forums/showpost.php?p=190992

    There are two groups of test results, and the results in every set are ordered from the best to the worst.

    Best regards,
    Sergey

    @Sergey
    Is your summary project still relevant?

    Hi Iliya,

    >>...Is your summary project still relevant?

    I just completed a quick review, and the most important piece of information that I wanted to use has been removed: the post number. I really don't understand why the developers of the new IDZ website removed it. How are we going to refer to a specific post? Any ideas?

    Best regards,
    Sergey

    Hi Sergey

    >>...I just completed a quick review and the most important piece of information that I wanted to use is removed and this is "A Post Number". I really don't understand why developers of new IDZ website removed it. How are we going to refer to some post? Any ideas?

    Yes, I know that. The developers removed 64 posts from this thread. I never anticipated such a loss of our accumulated knowledge.
    Can we write some petition to the webmaster?

    >>...Yes I know that. Developers removed 64 posts from this thread. I have never anticipated such a loss of our accumulated knowledge...

    I didn't count how many posts are lost, but the damage to some source code examples is significant because some of them were re-formatted to an almost unreadable state.

    >>...Can we write some petition to the webmaster?

    I think yes, but is it going to work? I don't think so. He/She won't try to do anything to recover these lost posts. On my account I lost almost 500 posts and access to my private folders with screenshots, docs and zip archives. They don't want to change the font color from light grey to black in the new editor. I can't read clearly and need to move my head as close as possible to my display. Is it difficult to change the color? No, however it has not been done in 4 weeks!

    >>...I didn't count how many posts are lost, but the damage to some source code examples is significant because some of them were re-formatted to an almost unreadable state.

    I cannot understand the reason behind the forum redesign. Why did they need to do this? Everything was working OK; it was a very strange decision, not easily understandable by us forum users.

    >>...I think yes, but is it going to work? I don't think so. He/She won't try to do anything to recover these lost posts. On my account I lost almost 500 posts and access to my private folders with screenshots, docs and zip archives. They don't want to change the font color from light grey to black in the new editor. I can't read clearly and need to move my head as close as possible to my display. Is it difficult to change the color? No, however it has not been done in 4 weeks!

    I feel sorry for you, Sergey, for the loss of your posts, code and docs. Do you have any backup of your code examples which were posted and have been lost?

    >>...Do you have any backup of your code examples which were posted and have been lost?

    Yes, I have, but I will need some time to identify, find and extract these test-cases. I have hundreds and hundreds of different tests in my own collection, and it is always a bit of a challenge if something needs to be recovered. Unfortunately, I don't have time to re-post them.

    It is good that you have a backup.

    @Sergey, a slightly off-topic question for you. I am now working on a multithreaded version of my SpecialFunctions library. My methods will be mostly array-filling functions, but I think that simply using multithreaded Java programming to fill an array is kind of boring. I would like to ask you what type of problems (presumably from math and/or physics) could be used to sharpen my multithreading programming skills :)

    >>...what type of problems(presumably from math and/or physics) could be used to sharpen my multithreading programming skills...

    At the beginning I would consider a very simple case, like the addition of two large matrices. Depending on the number of CPUs, the addition could be done in parallel. Another example is sorting large data sets; the Merge sort algorithm is a perfect candidate for R&D.

    Hi Sergey!

    >>>...At the beginning I would consider a very simple case like addition of two large matrices>>>

    Thank you for your answer. I am already testing double-loop arrays filled with sums of various sine functions (fastsin being used) with varying frequency. My CPU is a Core i3 installed in a laptop, and from the tests I can see that for a "large pool" of more than 4 threads I gain only slightly improved performance. I think that the main reason behind the mediocre performance under a heavy floating-point load is the hyper-threading technology.

    >>...Cannot understand the reason behind the forum redesign. Why they needed to do this?

    Here is my understanding of the situation:

    - 1. It was hard to maintain the Old System
    - 2. Intel wanted to make the upgrade before the IDF 2012 event in San Francisco ( in order to announce the new IDZ website )
    - 3. As soon as some new versions of Intel software are released, like Intel C++ compiler v13 or the IPP library, all posts must be placed in the New System

    Best regards,
    Sergey

    >>...I think that main reason behind the mediocre performance under heavily floating-point load is the hyper-threading technology.

    By the way, you're not the first developer who has experienced a problem with HTT when the FPU is used. Could you provide some real performance numbers?

    Best regards,
    Sergey

    PS: Also, you could consider some simple algorithms like Matrix Transpose, finding Min or Max values in a data set, scalar additions, etc. So, just use your imagination: almost all iterative algorithms ( with a 'FOR' statement ) could be implemented in parallel.

    ...>>>By the way, you're not the first developer who experienced a problem with HTT when a FPU is used. Could you provide some real performance numbers?....>>>

    Yes, I expected such behaviour from an HT processor. Simply providing discrete logic such as GP registers with their own APIC, while sharing the FPU between the physical and logical cores, will not accelerate even easily "threadable" floating-point work.
    Soon I will post some results from my testing. But bear in mind that everything is written in Java, and the Java System.nanoTime() timing method is binary translated to the RDTSC instruction, with all its consequences.

    Quote:

    iliyapolak wrote:

    He/She won't try to do anything to recover these lost posts.


    If you or Sergey are missing specific post or attachments, please raise the topic (e.g. here in the forum or by a pm to me). I cannot promise anything but, personally, I was missing an article and a blog and both could be recovered.

    >>...Soon I will post some results from my testing. But bear in mind that everything is written in Java, and the Java System.nanoTime() timing
    >>method is binary translated to the RDTSC instruction, with all its consequences.

    Thank you in advance, Iliya!

    >>...If you or Sergey are missing specific post or attachments, please raise the topic (e.g. here in the forum or by a pm to me)...

    Thomas,

    Iliya and I noticed that lots ( hundreds! ) of posts have been deleted. As I said, in my case it is about 500. Also, I no longer have access to the 'Files' section of the Old System where I kept hundreds of jpg-files ( screenshots ), docs and zip archives. Unfortunately, my archive ( a 10-month job on the ISN! ) was damaged significantly!
