Optimization of sine function's taylor expansion

343 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hi Sergey!

I have preliminary java multithreading tests results.I tested fastsin() method multiplied by java random.float() method to insert peudo-random noise into monotonic increasing data.The tested algorithm was simple bubble sort which sorted 1D array filled with 1000 and 1000000 elements.
Access to the main sorting loop was controlled by synchronized statement , multiple threads were waiting for each other completion.Timingt method
was java.nanoTime() which is binary translated to RDTSC instruction.
Here are the results:

1a.Bubble sort 1000 elements three threads(main and 2 spawned threads) with synchronized{} statement : ~13.235 miliseconds.
1b.Bubble sort 1000 elements three threads(main and 2 spawned threads)without synchronized{} statement:~12.678 miliseconds.
2a.Bubble sort 1000 elements five threads(main and 4 spawned threads)without synchronized{} statement:~7.603 miliseconds.
2b.Bubble sort 1000 elements five threads(main and 4 spawned threads)with synchronized{} statement: ~10.551 miliseconds.
3a.Bubble sort 1000 elements nine threads(main and 8 spawned threads)with synchronized{}statement: ~10.233 miliseconds.
3b.Bubble sort 1000 elements nine threads(main and 8 spawned threads)without synchronized{}statement:~9.664 miliseconds.
4a.Bubble sort 1e4 elements five threads(main and 4 spawned threads)without synchronized{}statement: ~71.133 miliseconds.
4b.Bubble sort 1e4 elements five threads(main and 4 spawned threads)with synchronized{}statement:~173.134 miliseconds.

Hi Iliya,
.
Thanks for the results. By the way, you have not included performance numbers for a single-thread case. It would be nice to see a performance improvement of a multi-threaded sorting over single-thread sorting.
.
>>... five threads(main and 4 spawned threads)without synchronized{} statement:~7.603 miliseconds...
.
It's the best case.

Hi Iliya,
.
>>...- 2. Intel wanted to make the upgrade before IDF 2012 event in San Francisco...
.
By the way, if you have some time and interest you could look at some videos from the IDF 2012. Here a couple of links:
.
http://www.intel.com/content/www/us/en/intel-developer-forum-idf/san-fra...

.
IDF 2012 San Francisco highlights from day two:
.
http://www.intel.com/content/www/us/en/intel-developer-forum-idf/san-fra...

.
Renee James in her keynotes announced about the new IDZ website.

Hi Sergey!

>>... five threads(main and 4 spawned threads)without synchronized{} statement:~7.603 miliseconds...
.
It's the best case.

Yes probably because of my CPU(2physical cores +2 logical),but when I looked at sorting results the numbers were not sorted at monotonically increased order over the whole domain(array length).For this I can blame the lack of global synchronization lock in the form of synchronized{} statement.With the synchronized{} statement only one thread was accessing an array and other threads were waiting on global lock.This can explain the better results pointed by you.
Soon I will add single-threaded results and integer values array sorting in order to eliminate dependency on FPU.
I also plan to test floating-point heavy calculation like a projectile motion.

Sergey thanks for posting info about the IDF

>>... the numbers were not sorted at monotonically increased order over the whole domain(array length).For this I can blame the lack of global synchronization lock in the form of
>>synchronized{} statement...
.
The Bubble Sorting algorithm is iterative and there has to be a last step in the sorting that uses as input a complete data set. I think only in that case the data set will be sorted completely. Did you consider a recursive Merge Sorting algorithm for your R&D?

Hi Sergey

>>>....Did you consider a recursive Merge Sorting algorithm for your R&D...>>>
No.I think that I cannot exclude an unsynchronized array accesses performed by the different threads running on two different physical CPUs.

I started recently a new project which involves arbitrary precision arithmetics represented by the Java.Math.BigDecimal class.Today i was able to calculate at very large precision reaching more than 300 decimal places.So I think that I can unleash the power of multithreading to perform calculation at very large arbitrary precision.

Hi Sergey

If you are interested I can post a test-case which involves arbitrary-precision arithmetics and multi-threading.
Today I performed a few tests of one of my Special Functions library method.This method (Airy function) is approximated by very large precision coefficients which have 100 decimal places of precision(coefficients were calculated by Mathematica 8) and final calculation is performed by two
Horner scheme polynomials multiplications.Every polynomial is constructed from 12 Java.BigDecimal coefficients multiplied by the function argument.
The result is accurate up to 15 decimal places were compared to Mathematica 8.
But the most interesting part is the time measurement of the method calculation - it is whopping ~2.1 microsecond per one loop iteration and I suppose that this is more than 40-50 time slower than similar method based on double precision arithmetics.

Hi Iliya,
.
>>...Today i was able to calculate at very large precision reaching more than 300 decimal places...
.
That's interesting and you could calculate your own highly accurate tables of any trigonometric functions. I still don't have time for Intel and Borland Big Decimal Numbers Libraries.

>>...This method (Airy function)...
.
Is it related to Optics?
.
>>...If you are interested I can post a test-case which involves arbitrary-precision arithmetics and multi-threading...
.
Did you do / consider any programming for Android? It is 99% Java based and IDZ has a dedicated forum for this.

Hi Sergey!

...>>>That's interesting and you could calculate your own highly accurate tables of any trigonometric functions. I still don't have time for Intel and Borland Big Decimal Numbers Libraries....>>>

Yes this is what I'am doing now.My new library will contain many trigonometric and special functions tables calculated up to 200 decimal places of precision.
If you are interested I will post the source code.

...>>>Is it related to Optics?...>>>

Yes , but my main reason behind dealing with those function is to simply optimize my Special Functions library.

...>>>Did you do / consider any programming for Android? It is 99% Java based and IDZ has a dedicated forum for this...>>>

No mainly because I'm a dedicated Windows "fanboy":)
Java is used for those areas of programming where I'am not fluent in(Windows system programming,Windows multithreading).

@Sergey

Here is the code for Airy(x) function written in Java.This program calculates the value of Airy function up to 225 decimal places.

package arbitrary_precision_functions_lib;
import java.math.BigDecimal;
import arbitrary_precision_functions_lib.HelperMethods;

public class ArbitraryPrecisionAi implements Runnable{

private BigDecimal[]a;
private Thread[]threads;
private BigDecimal result;
private int size,numofThreads;
private int startLoop,currLoop,endLoop;

private class AiRange{

public int start,end;
}

public ArbitraryPrecisionAi(int size,int numofThreads){

this.size = size;
this.numofThreads = numofThreads;
a = new BigDecimal[size];
threads = new Thread[numofThreads];
startLoop = currLoop = 0;
endLoop = size;
}

private synchronized AiRange getAiRange(){

AiRange ar = new AiRange();
if(currLoop > endLoop)
return null;

ar.start = currLoop;
currLoop += (endLoop - startLoop)/numofThreads+1;
ar.end = (currLoop < endLoop)?currLoop:endLoop;

return ar;

}

public void run(){

AiRange ar2;

while((ar2 = getAiRange()) != null){
multiThreadedAi(ar2.start,ar2.end,0.001);
System.out.println("Current thread = " + Thread.currentThread() + " ");

}
}

public void multiThreadedAi(int start,int end,double inc){

if( inc == 0.0)
throw new IllegalArgumentException("arg inc cannot be equal to zero");

double x;

x = 0;
BigDecimal sum = BigDecimal.ZERO;
BigDecimal term1 = BigDecimal.ZERO;
BigDecimal term2 = BigDecimal.ZERO;

for(int i = start;i < end;i++){
x += inc;

BigDecimal x1 = BigDecimal.valueOf(x);
BigDecimal x2 = x1;
BigDecimal c1 = BigDecimal.valueOf( 0.355028053887817);
BigDecimal c2 = BigDecimal.valueOf(0.258819403792807);
//double f_part,g_part,f_arg,g_arg,c1,c2;

BigDecimal fcoef1 = new BigDecimal("0.1666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666667");

BigDecimal fcoef2 = new BigDecimal("0.005555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555556");
BigDecimal fcoef3 = new BigDecimal("0.00007716049382716049382716049382716049382716049382716049382716049382716049382716049382716049382716049383");
BigDecimal fcoef4 = new BigDecimal("5.845491956603067714178825289936401047512158623269734380845491956603067714178825289936401047512158623e-7");

BigDecimal fcoef5 = new BigDecimal("2.783567598382413197228012042826857641672456487271302086116900931715746530561345376160190975005789821e-9");
BigDecimal fcoef6 = new BigDecimal("9.096626138504618291594810597473390985857700938795104856591179515410936374383481621438532598058136669e-12");
BigDecimal fcoef7 = new BigDecimal("2.165863366310623402760669189874616901394690699713120203950280837002603898662733719390126809061461112e-14");
BigDecimal fcoef8 = new BigDecimal("3.923665518678665584711357228033726270642555615422319210054856588772833149751329201793707987430183173e-17");
BigDecimal fcoef9 = new BigDecimal("5.589267120624879750301078672412715485245805719974813689536832747539648361469129917085054113148409079e-20");
BigDecimal fcoef10 = new BigDecimal("6.424444966235493965863308818865190212926213471235418033950382468436377426976011398948338061090125378e-22");
BigDecimal fcoef11 = new BigDecimal("6.083754702874520801006921229985975580422550635639600410937862186019296805848495642943501951789891456e-25");
BigDecimal fcoef12 = new BigDecimal("4.828376748313111746830889865068234587636944948920317786458620782554997464959123526145636469674517029e-28");

BigDecimal gcoef1 = new BigDecimal("0.08333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333");
BigDecimal gcoef2 = new BigDecimal("0.001984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984");
BigDecimal gcoef3 = new BigDecimal("0.00002204585537918871252204585537918871252204585537918871252204585537918871252204585537918871252204585538");
BigDecimal gcoef4 = new BigDecimal("1.413195857640302084746529190973635418079862524306968751413195857640302084746529190973635418079862524e-7");
BigDecimal gcoef5 = new BigDecimal("5.888316073501258686443871629056814241999427184612369797554982740167925353110538295723480908666093851e-10");
BigDecimal gcoef6 = new BigDecimal("1.737381935562572339700344898525535302663307861488498265626270824451461228589233787414424191552196750e-12");
BigDecimal gcoef7 = new BigDecimal("3.760566960092147921429317962176483339098068964260818756766819966345154174435570968429489592104321971e-15");
BigDecimal gcoef8 = new BigDecimal("6.267611600153579869048863270294138898496781607101364594611366610575256957392618280715815986840536618e-18");
BigDecimal gcoef9 = new BigDecimal("8.290491534594682366466750357531929759916377787171117188639373823512244652635738466555312151905471716e-21");
BigDecimal gcoef10 = new BigDecimal("8.914507026445895017706183180141859956899330953947437837246638519905639411436277921027217367640292168e-24");
BigDecimal gcoef11 = new BigDecimal("7.945193428204897520237239910999875184402255752181317145496112762839250812331798503589320292014520649e-27");
BigDecimal gcoef12 = new BigDecimal("5.964859931084757898076005939189095483785477291427415274396481053182620729978827705397387606617507995e-30");

BigDecimal f_arg = BigDecimal.valueOf(0);
BigDecimal g_arg = BigDecimal.valueOf(0);
BigDecimal f_part = BigDecimal.valueOf(0);
BigDecimal g_part = BigDecimal.valueOf(0);
BigDecimal one = BigDecimal.ONE;
f_arg = x1.pow(3);
g_arg = x1.pow(4);

// Example of Horner scheme polynomial multiplication for BigDecimal polynomial
f_part = one.add(f_arg.multiply(fcoef1)).add(f_arg.multiply(fcoef2)).add(f_arg.multiply(fcoef3)).add(f_arg.multiply(fcoef4)).add(f_arg.multiply(fcoef5)).add(f_arg.multiply(fcoef6)).add(f_arg.multiply(fcoef7)).add(f_arg.multiply(fcoef8)).add(f_arg.multiply(fcoef9)).add(f_arg.multiply(fcoef10)).add(f_arg.multiply(fcoef11)).add(f_arg.multiply(fcoef12));
g_part = x2.add(g_arg.multiply(gcoef1)).add(g_arg.multiply(gcoef2)).add(g_arg.multiply(gcoef3)).add(g_arg.multiply(gcoef4)).add(g_arg.multiply(gcoef5)).add(g_arg.multiply(gcoef6)).add(g_arg.multiply(gcoef7)).add(g_arg.multiply(gcoef8)).add(g_arg.multiply(gcoef9)).add(g_arg.multiply(gcoef10)).add(g_arg.multiply(gcoef11)).add(g_arg.multiply(gcoef12));

term1 = c1.multiply(f_part);
term2 = c2.multiply(g_part);

sum = term1.subtract(term2);

a[i] = sum;

}
}

public BigDecimal[] getBigDecimalAi(){

for(int i = 0;i < numofThreads;i++){

threads[i] = new Thread(this);
threads[i].setName(Integer.toString(i));
threads[i].start();
System.out.println("Active threads =" + Thread.activeCount() + " "
+ "names = " + threads[i].getName() + " " + " threads status = " + threads[i].getState()
+ " ");
}

for(int i = 0;i < numofThreads;i++){

try{
threads[i].join();

}catch(InterruptedException e){

e.printStackTrace();
}
}

return a;
}

/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
BigDecimal[]test;

ArbitraryPrecisionAi ap = new ArbitraryPrecisionAi(10,2);
long start,end;
start = System.nanoTime();
test = ap.getBigDecimalAi();
end = System.nanoTime();
for(int i = 0;i < test.length;i++){
System.out.println("test[i] = " + test[i] + " " + " precision = " + test[i].precision() + " " );
}
System.out.println("running time is = " + (end - start));
}

}

Hi Iliya, Thank you for the source codes!
.
>>>>Did you do / consider any programming for Android? It is 99% Java based and IDZ has a dedicated forum for this...
>>
>>No mainly because I'm a dedicated Windows "fanboy":)
>>Java is used for those areas of programming where I'am not fluent in(Windows system programming,Windows multithreading).
.
Do you know how many Android applications implemented in Java are downloaded from Google Play? Take a look at:
.
http://en.wikipedia.org/wiki/Google_Play

.
PS: It would be nice to know how many scientific applications uploaded to / exist on / downloaded from Google Play.

Hi Sergey!

...>>>Hi Iliya, Thank you for the source codes!...>>>
You are welcome.
When I will finish my project I will send you all the source code for the dozens of methods.

...>>>Do you know how many Android applications implemented in Java are downloaded from Google Play? Take a look at>>>

Thank you for the link.I would like to focus my effort on learning DirectX programming.It would be great to write an application which could be able
to visualize in 3D various math objects ie. functions.

>>>....PS: It would be nice to know how many scientific applications uploaded to / exist on / downloaded from Google Play.>>>

Do you refer to Android based scientific apps?

>>...Do you refer to Android based scientific apps?
.
Yes.

Hi Iliya,
.
>>...Today i was able to calculate at very large precision reaching more than 300 decimal places...
.
Did you try to calculate PI number? If interested I could give a very simple formula...

Hi Sergey

...>>>Did you try to calculate PI number? If interested I could give a very simple formula...>>>

No I did not,but I calculated Zeta(3) function only by means of an iterative method and precision was more than 200 decimal places.
Could you give me your formula?

Yesterday I tested fastsin() like function written with the help of arbitrary precision BigDecimal class the result reached almost 1025 decimal places ,but was accurate only to four decimal places when compared to similar calculation performed by Mathematica 8.

I also uploaded a text file with the results sadly an accuracy is up to 4 dec. places.I will write another version with 100 dec. places coefficients and singl-threaded to find the error.
here is the code for arbitrary precision fastsin():

package arbitrary_precision_functions_lib;

import java.math.BigDecimal;

public class ArbitraryPrecisionSin implements Runnable{

private BigDecimal[]a;
private Thread[]threads;
private BigDecimal result;
private int size,numofThreads;
private int startLoop,currentLoop,endLoop;
private int numofCPUs;

private class SinRange{

public int start;
public int end;
}

public ArbitraryPrecisionSin(int size,int numofThreads ){

if(numofThreads <= 0)
throw new IllegalArgumentException("parameter numofThreads must be greater than 0");
this.size = size;
this.numofThreads = numofThreads;
a = new BigDecimal[size];
threads = new Thread[numofThreads];
startLoop = currentLoop = 0;
endLoop = size;
}

public ArbitraryPrecisionSin(int size){
numofCPUs = Runtime.getRuntime().availableProcessors();
this.size = size;
numofThreads = numofCPUs;
a = new BigDecimal[size];
threads = new Thread[numofThreads];
startLoop = currentLoop = 0;
endLoop = size;

}

private synchronized SinRange getSinRange(int c){

if(c < 0 || c > 1)
throw new IllegalArgumentException("parameter c must be 0 or 1");
SinRange sr = new SinRange();

if(currentLoop > endLoop)
return null;

int cpu;
cpu = Runtime.getRuntime().availableProcessors();

switch(c){

case 0 : {

sr.start = currentLoop;
currentLoop += (endLoop - startLoop)/numofThreads;
sr.end = (currentLoop < endLoop)?currentLoop:endLoop;

}

break;

case 1 :{

sr.start = currentLoop;
currentLoop += (endLoop - startLoop)/cpu;
sr.end = (currentLoop < endLoop)?currentLoop:endLoop;

}

break;

default : {

System.out.println("default switch option hit for arbitrary number of threads enter 0 for CPU limited thread number enter 1");
}
}

return sr;
}

public void run(){

SinRange sr;

while((sr = getSinRange(1)) != null){
multiThreadedSin(sr.start,sr.end,0.01);
System.out.println("Current thread = " + Thread.currentThread() + " ");
}
}

public void multiThreadedSin(int start,int end,double inc){

if(inc == 0)
throw new IllegalArgumentException("parameter inc cannot be 0");

if(size * inc > (Math.PI/2) || size * -inc < (-Math.PI/2))
throw new IllegalArgumentException("parameter inc * size cannot excede PI/2");

double x;
x = 0;
BigDecimal rad = BigDecimal.ZERO;
BigDecimal sum = BigDecimal.valueOf(0);

for(int i = start;i < end;i++){

x += inc;

BigDecimal arg = BigDecimal.valueOf(x);
rad = BigDecimal.valueOf(x);
BigDecimal coef1 = new BigDecimal("-0.1666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666666667");
BigDecimal coef2 = new BigDecimal("0.008333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333");
BigDecimal coef3 = new BigDecimal("-0.0001984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984126984");
BigDecimal coef4 = new BigDecimal("2.755731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922398589065255731922e-6");
BigDecimal coef5 = new BigDecimal("-2.505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210838544171877505210839e-8");
BigDecimal coef6 = new BigDecimal("1.605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717015494793272571050348828126605904383682161459939237717e-10");
BigDecimal coef7 = new BigDecimal("-7.647163731819816475901131985788070444155100239756324412409068493724578380663036747692832348917005001661086317170973255629340285424941509597594253678909763565848221932878017534102190186846271502356158440814525470610126694782779438864094948751033407118063202719287375372031456687541343625999710655795311879967964624049280133936218592303248387904472560557216641872726528811184895840980497065153149809234465319121403777488433573089657745742401827057911713996370081026165682250338334994419650504306588962673618758274842930927587012243096899181555266211350867435523520179604835689491774147858803943460028116112772197428282084366740451396536052620708705364790020874676959333043989128645213301297957382613467269551925636581721237805893890549975206059862144518229174313830398486483142567798652454737110821766906422991079075735160391245047329703414359499015583671668327752983837639922296006952091608176264260920345576430232514888599544684200768856853512938169022825107481192137276793361449446105530761615417700e-13");
BigDecimal coef8 = new BigDecimal("2.811457254345520763198945583010320016233492735204531033973922240339918522302587039592953069454781250610693498959916638099022163759169672646174357970187413075679493357675741740478746392222893934689764132652399070077252461317198323111799613511409340852229118646796829180893917899831376333088128917571805838223516405900470637476550953052664848494291382557800235982620047357053270529772241568071010959277376955559339624076629990106491818287647730535996953675136059200796206709683211395007224450112716530394712778777515783429259930971726801169689436107114289498354235360148836650548446377889266155683833866217930954936868413370125165954608842875260553442937507674513587990089701885531328419594837273019657084394090307566809278605108047996314413992596376661113667027143528855324684767573033990712173096237833243746720248432044261487149753567431749815814552820466296968008763838206726473144151326535391272397185873687585483414926303192720870903254967991973905450407162202991645879912297590479974544711550625e-15");
BigDecimal coef9 = new BigDecimal("-8.220635246624329716955981236872280749220738991826114134426673217368182813750254501733780904838541668452320172397417070464977087015116001889398707515167874490290916250513864738241948515271619692075333721205845234144013044787129599742104133074296318281371691949698330938286309648629755359906809700502356252115545046492604203147809804247558036533015738473100105212339319757465703303427606924184242570986482326196899485604181257621321106104233130222213314839579120470164347104336875423997732310271100966066411633852385331664502722139552050203770281014954062860684898713885487282305398765758088174514134111748336125546398869503289958931604803728832027610928385013197625702016672179916164969575547581928821884193246513353243504693298385954135713428644376202086745693402131155920130899336356697988810222917641063586901311204807782126168870080209794783083487779141219204703987831013820096912723177004068047944987934759021881330193868984563950009517450268929548100605737435648087368164612837660744282782311769e-18");
BigDecimal coef10 = new BigDecimal("1.95729410633912612308475743735054303552874737900621765105396981365909114613101297660328116781870039725055242199938501677737549690836095283080921607503997011673593244059853922339094012268371897430365088600139172241524120113979276184335812692245150435270754570230912641387769277348327508569209754773865625050370120154585814360662138196370429441262279487454764409817602851368231031033990641052005775499678150623735702038194350262134777717055679139364682024457652274297667564352622608110204924975884258449352929856848990020993931936693273413363225333064167449990221810667554483816097840091669808958500480160042837182135613225614744957591474582503179343691593642472846222788911581994333830177546048126465695269261746751594759240529266105015139738407429716138601468778590526141615019878042652671239215235197860433356407791232620016201891663793990243305078617587853467607145123202444983225738239461294201925563848717342300177210186264707e-20");
BigDecimal coef11 = new BigDecimal("-3.868170170630684037716911931522812323179342646257347136470296074425081316464452522931385707151581812748127316204318214975050389146958404803970782756995988372995913914226362101563122772102211411667294241109469807144745456798009410757624756763738150894678944075709735995805716943642836137731418078534893775699014232304067477483441466331431411882653744811358980431177921963799032233873332827114738646238698628927583044233098653125033458547069984106066871277799322637946709535261093273102640838636881689284025801737429574470404066506470944007044175143494288942539478032131322831877187448596879434648096231765639057757575225627371522177491437854711099007589113971954463439684110756595221611883845088428769943625657120907793856904431764518226855556486154809941062343968629378844405655625991293990593931355938765098297247884814503164958060455585260108734936843187097310702045845573978965232788997272759292276015403142773330194896418682742306610915419851745505411540437340320010995748453245652524130802894678e-23");
BigDecimal coef12 = new BigDecimal("6.446950284384473396194853219204687205298904410428911894117160124041802194107420871552309511919303021246878860340530358291750648578264008006617971261659980621659856523710603502605204620170352352778823735182449678574575761330015684596041261272896918157798240126182893326342861572738060229552363464224822959498357053840112462472402443885719019804422908018931634051963203272998387056455554711857897743731164381545971740388497755208389097578449973510111452129665537729911182558768488788504401397728136148806709669562382624117340110844118240011740291905823814904232463386885538053128645747661465724413493719609398429595958709378952536962485729757851831679315189953257439066140184594325369353139741814047949906042761868179656428174052940863711425927476924683235103906614382298074009426043318823317656552259897941830495413141357505274930100759308766847891561405311828851170076409289964942054648328787932153793359005237955550324827364471237177684859033086242509019234062233866684992914088742754206884671491130e-26");

arg = arg.pow(2);

sum = rad.add(rad.multiply(arg)).add(arg.multiply(coef1)).add(arg.multiply(coef2)).add(arg.multiply(coef3)).add(arg.multiply(coef4)).add(arg.multiply(coef5)).add(arg.multiply(coef6)).add(arg.multiply(coef7)).add(arg.multiply(coef8)).add(arg.multiply(coef9)).add(arg.multiply(coef10)).add(arg.multiply(coef11)).add(arg.multiply(coef12));

a[i] = sum;

}

}

public BigDecimal[] getSinValues(int c){
if(c < 0 || c > 1)
throw new IllegalArgumentException("parameter c must be 0 or 1");

switch(c){

case 0 : {

for(int i = 0;i < numofThreads;i++){

threads[i] = new Thread(this);
threads[i].setName(Integer.toString(i));
threads[i].start();
System.out.println("Active threads =" + Thread.activeCount() + " "
+ "names = " + threads[i].getName() + " " + " threads status = " + threads[i].getState()
+ " ");
}

for(int i = 0;i < numofThreads;i++){

try{

threads[i].join();

}catch(InterruptedException e){
e.printStackTrace();
}
}
}
break;

case 1 : {

for(int i = 0;i < numofCPUs;i++){

threads[i] = new Thread(this);
threads[i].setName(Integer.toString(i));
threads[i].start();
System.out.println("Active threads =" + Thread.activeCount() + " "
+ "names = " + threads[i].getName() + " " + " threads status = " + threads[i].getState()
+ " ");
}

for(int i = 0;i < numofCPUs;i++){

try{

threads[i].join();

}catch(InterruptedException e){
e.printStackTrace();
}
}
}

break;
}

return a;
}

/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
BigDecimal[]test;
ArbitraryPrecisionSin ap = new ArbitraryPrecisionSin(100);
test = ap.getSinValues(1);
for(int i = 0;i < test.length;i++) System.out.println("test[i] = " + test[i] + " " + "precision = " + test[i].precision() + " ");
}

}

@Sergey

Uploaded text file with the results of 1000 decimal precision fastsin() test step 0.01 two threads.

Attachments: 

AttachmentSize
Download arbitraryprecisionsine.txt104.18 KB

>>>>Did you try to calculate PI number? If interested I could give a very simple formula...
>>
>>No I did not,but I calculated Zeta(3) function only by means of an iterative method and precision was more than 200 decimal places.
>>Could you give me your formula?
.
The value of PI can be calculated as follows:
.
PI / 4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + 1/13 - ...

>>>>Did you try to calculate PI number? If interested I could give a very simple formula...
>>
>>No I did not,but I calculated Zeta(3) function only by means of an iterative method and precision was more than 200 decimal places.
>>Could you give me your formula?
.
Here is a piece of C++ codes I tried ( single- and double precision data types ):
...
RTfloat fPI = ( 1.0f - 1.0f/ 3.0f + 1.0f/ 5.0f - 1.0f/ 7.0f + 1.0f/ 9.0f - 1.0f/ 11.0f + 1.0f/ 13.0f -
1.0f/ 15.0f + 1.0f/ 17.0f - 1.0f/ 19.0f + 1.0f/ 21.0f - 1.0f/ 23.0f + 1.0f/ 25.0f -
1.0f/ 27.0f + 1.0f/ 29.0f - 1.0f/ 31.0f + 1.0f/ 33.0f - 1.0f/ 35.0f + 1.0f/ 37.0f -
1.0f/ 39.0f + 1.0f/ 41.0f - 1.0f/ 43.0f + 1.0f/ 45.0f - 1.0f/ 47.0f + 1.0f/ 49.0f -
1.0f/ 51.0f + 1.0f/ 53.0f - 1.0f/ 55.0f + 1.0f/ 57.0f - 1.0f/ 59.0f + 1.0f/ 61.0f -
1.0f/ 63.0f + 1.0f/ 65.0f - 1.0f/ 67.0f + 1.0f/ 69.0f - 1.0f/ 71.0f + 1.0f/ 73.0f -
1.0f/ 75.0f + 1.0f/ 77.0f - 1.0f/ 79.0f + 1.0f/ 81.0f - 1.0f/ 83.0f + 1.0f/ 85.0f -
1.0f/ 87.0f + 1.0f/ 89.0f - 1.0f/ 91.0f + 1.0f/ 93.0f - 1.0f/ 95.0f + 1.0f/ 97.0f -
1.0f/ 99.0f + 1.0f/101.0f - 1.0f/103.0f + 1.0f/105.0f - 1.0f/107.0f + 1.0f/109.0f -
1.0f/111.0f + 1.0f/113.0f - 1.0f/115.0f + 1.0f/117.0f - 1.0f/119.0f + 1.0f/121.0f -
1.0f/123.0f + 1.0f/125.0f - 1.0f/127.0f + 1.0f/129.0f - 1.0f/131.0f + 1.0f/133.0f -
1.0f/135.0f + 1.0f/137.0f - 1.0f/139.0f + 1.0f/141.0f - 1.0f/143.0f + 1.0f/145.0f -
1.0f/147.0f + 1.0f/149.0f - 1.0f/151.0f + 1.0f/153.0f - 1.0f/155.0f + 1.0f/157.0f -
1.0f/159.0f + 1.0f/161.0f - 1.0f/163.0f + 1.0f/165.0f - 1.0f/167.0f + 1.0f/169.0f -
1.0f/171.0f + 1.0f/173.0f - 1.0f/175.0f + 1.0f/177.0f - 1.0f/179.0f + 1.0f/181.0f -
1.0f/183.0f + 1.0f/185.0f - 1.0f/187.0f + 1.0f/189.0f - 1.0f/191.0f + 1.0f/193.0f -
1.0f/195.0f + 1.0f/197.0f - 1.0f/199.0f + 1.0f/201.0f - 1.0f/203.0f + 1.0f/205.0f -
1.0f/207.0f + 1.0f/209.0f - 1.0f/211.0f + 1.0f/213.0f - 1.0f/215.0f + 1.0f/217.0f -
1.0f/219.0f + 1.0f/221.0f - 1.0f/223.0f + 1.0f/225.0f - 1.0f/227.0f + 1.0f/229.0f -
1.0f/231.0f + 1.0f/233.0f - 1.0f/235.0f + 1.0f/237.0f - 1.0f/239.0f + 1.0f/241.0f -
1.0f/243.0f + 1.0f/245.0f - 1.0f/247.0f + 1.0f/249.0f - 1.0f/251.0f + 1.0f/253.0f
) * 4.0f;
.
RTdouble dPI = ( 1.0L - 1.0L/ 3.0L + 1.0L/ 5.0L - 1.0L/ 7.0L + 1.0L/ 9.0L - 1.0L/ 11.0L + 1.0L/ 13.0L -
1.0L/ 15.0L + 1.0L/ 17.0L - 1.0L/ 19.0L + 1.0L/ 21.0L - 1.0L/ 23.0L + 1.0L/ 25.0L -
1.0L/ 27.0L + 1.0L/ 29.0L - 1.0L/ 31.0L + 1.0L/ 33.0L - 1.0L/ 35.0L + 1.0L/ 37.0L -
1.0L/ 39.0f + 1.0L/ 41.0L - 1.0L/ 43.0L + 1.0L/ 45.0L - 1.0L/ 47.0L + 1.0L/ 49.0L -
1.0L/ 51.0L + 1.0L/ 53.0L - 1.0L/ 55.0L + 1.0L/ 57.0L - 1.0L/ 59.0L + 1.0L/ 61.0L -
1.0L/ 63.0L + 1.0L/ 65.0L - 1.0L/ 67.0L + 1.0L/ 69.0L - 1.0L/ 71.0L + 1.0L/ 73.0L -
1.0L/ 75.0L + 1.0L/ 77.0L - 1.0L/ 79.0L + 1.0L/ 81.0L - 1.0L/ 83.0L + 1.0L/ 85.0L -
1.0L/ 87.0L + 1.0L/ 89.0L - 1.0L/ 91.0L + 1.0L/ 93.0L - 1.0L/ 95.0L + 1.0L/ 97.0L -
1.0L/ 99.0L + 1.0L/101.0L - 1.0L/103.0L + 1.0L/105.0L - 1.0L/107.0L + 1.0L/109.0L -
1.0L/111.0L + 1.0L/113.0L - 1.0L/115.0L + 1.0L/117.0L - 1.0L/119.0L + 1.0L/121.0L -
1.0L/123.0L + 1.0L/125.0L - 1.0L/127.0L + 1.0L/129.0L - 1.0L/131.0L + 1.0L/133.0L -
1.0L/135.0L + 1.0L/137.0L - 1.0L/139.0L + 1.0L/141.0L - 1.0L/143.0L + 1.0L/145.0L -
1.0L/147.0L + 1.0L/149.0L - 1.0L/151.0L + 1.0L/153.0L - 1.0L/155.0L + 1.0L/157.0L -
1.0L/159.0L + 1.0L/161.0L - 1.0L/163.0L + 1.0L/165.0L - 1.0L/167.0L + 1.0L/169.0L -
1.0f/171.0L + 1.0f/173.0L - 1.0f/175.0L + 1.0f/177.0L - 1.0f/179.0L + 1.0f/181.0L -
1.0f/183.0L + 1.0f/185.0L - 1.0f/187.0L + 1.0f/189.0L - 1.0f/191.0L + 1.0f/193.0L -
1.0f/195.0L + 1.0f/197.0L - 1.0f/199.0L + 1.0f/201.0L - 1.0f/203.0L + 1.0f/205.0L -
1.0f/207.0L + 1.0f/209.0L - 1.0f/211.0L + 1.0f/213.0L - 1.0f/215.0L + 1.0f/217.0L -
1.0f/219.0L + 1.0f/221.0L - 1.0f/223.0L + 1.0f/225.0L - 1.0f/227.0L + 1.0f/229.0L -
1.0f/231.0L + 1.0f/233.0L - 1.0f/235.0L + 1.0f/237.0L - 1.0f/239.0L + 1.0f/241.0L -
1.0f/243.0L + 1.0f/245.0L - 1.0f/247.0L + 1.0f/249.0L - 1.0f/251.0L + 1.0f/253.0L
) * 4.0L;
.
CrtPrintf( RTU("PI = %.32f ( Single-precision accuracy / Using Series )\n"), fPI );
CrtPrintf( RTU("PI = %.32f ( Double-precision accuracy / Using Series )\n"), dPI );
...
Of course, accuracy is very limited by constraints of IEEE 754 Standard.

>>...Of course, accuracy is very limited by constraints of IEEE 754 Standard...
.
Or something else...
.
...
PI = 3.14946579933166500000000000000000 ( Single-precision accuracy / Using Series )
PI = 3.14946654729979380000000000000000 ( Double-precision accuracy / Using Series )
...
Just to two digits!
.
Here is PI value up to 31 digits ( from Windows Calculator ):
.
3.1415926535897932384626433832795

Hi Iliya,
.
>>>>...Does your summary project is still relevant?
>>.
>>I just completed a quick review and the most important piece of information that I wanted to use is removed and this is "A Post Number".
.
If somebody will be interested in getting more information from our posts he/she will get it. I was thinking about a "Summary" for about one week and I would say let's save our time and forget it. However, I would do just one modification for a subject of the thread. What do you think about a change:
.
from 'Optimalization of sine function's taylor expansion' to 'Optimization of sine function's Taylor expansion'?
.
PS: Did you want to use a word 'Optimal' instead of 'Optimization'?

Hi Sergey!

Ok I agree with you.Someone who is interested will look through all our posts and find the relevant info.
Regarding the title Yes please change it.I think that proper word is optimization.

Soon i will post Pi calculation tests I want to test many Pi-approximations formulas , please follow this link http://en.wikipedia.org/wiki/Numerical_approximations_of_%CF%80

Now I'm working on my new project BigDecimal Complex numbers class.I have a BigComplex Constructor which receives two member fields
as a parameter BigDecimal BigRe and BigDecimal BigIm.My problem is that I cannot display properly the numerical values of BigComplex constructor when its members have values larger than maximal double-precision values ie. Double.Max_Value.

>>...Ok I agree with you.Someone who is interested will look through all our posts and find the relevant info.
>>Regarding the title Yes please change it.I think that proper word is optimization.
.
Unfortunately, I can't do this since you're the author ( owner ) of the thread.

@Sergey

Does Intel have its own BigDecimal - like library?

Hi Iliya,

>>...Does Intel have its own BigDecimal - like library?

Is that what you're looking for?
http://software.intel.com/en-us/articles/intel-decimal-floating-point-ma...

>>...the result reached almost 1025 decimal places ,but was accurate only to four decimal places...

Interesting results! I think you need to "step down" and try to calculate just for 4 or 6 decimals in order to understand what is wrong.

...>Is that what you're looking for?
http://software.intel.com/en-us/articles/intel-decimal-floating-point-ma...

No they have not a precision larger than 128-bit types.

...>>>nteresting results! I think you need to "step down" and try to calculate just for 4 or 6 decimals in order to understand what is wrong.>>>
Yes I know that something wrong has occurred during the calculation.The coefficients were calculated by Mathematica 8 and can be trusted as very accurate and my double-precision fastsin() was accurate up to 15 decimal places.I think that I can blame BigDecimal library string parsing methods
for mishandling those very large arguments passed to the constructors.Another explanation could be some rounding and truncation performed by the library algorithms.
Later I will test single-threaded low-precision version and post the results.

>>...Later I will test single-threaded low-precision version and post the results....

That's the right decision, Iliya. It always makes sence to simplify a test case when additional investigation is needed.

>>>...That's the right decision, Iliya. It always makes sence to simplify a test case when additional investigation is needed...>>>
I used global synchronized method to prevent a race condition and only one thread at time is allowed to calculate Horner polynomial.
So the lack of precision is very strange to me.

Hi Sergey
>>>...That's the right decision, Iliya. It always makes sence to simplify a test case when additional investigation is needed...>>>
I did an extensive set of tests on my BigDecimal version of fastsin,sadly the issue with the accuracy still persists and even got worse.
Up to an argument of 0.30 radian the tested method can achieve an accuracy of 4 decimal places.For larger arguments the value of arbitrary
precision version of fastsin exceeds 1.0.I tried various combination of precision truncation and method rounding and nothing can help.
I suppose that somehow BigDecimal arithmetics methods which works on objects are responsible for such a divergence.

>>...I did an extensive set of tests on my BigDecimal version of fastsin,sadly the issue with the accuracy still persists and even got worse...

Hi Iliya, take a look at:
.
http://software.intel.com/en-us/articles/haswell-support-in-intel-ipp

and here is a quote:
...
Floating Point Multiply Accumulate - Our floating-point multiply accumulate significantly increases
peak flops and provides improved precision to further improve transcendental mathematics. They are
broadly usable in high performance computing, professional quality imaging, and face detection.
They operate on scalars, 128-bit packed single and double precision data types, and 256-bit packed
single and double-precision data types. [These instructions were described previously, in the
initial Intel AVX specification].
...

Is it similar to FMA ( Fused Multiply-Add ) or something else?

>>>Hi Iliya, take a look at:
.
http://software.intel.com/en-us/articles/haswell-support-in-intel-ipp>>>

Thanks for link and info.
After reading through this article I came to conclusion that in my case it will not resolve my problem.
Hardware implemented FMA specially tailored for transcendentals calculation should be helpful in my case, but my code uses Java BigDecimal
types which are binary translated by JIT compiler to x87 FPU code.I'm not sure if JIT translates even to SSE3 and/or SSE4 so as you can see
it all depends on Oracle Java JIT developers.

If you are interested i rewrote arbitrary precision fastsin() method in its new form fastsin() uses iterative calculation and polynomial summation in order
to achieve a convergence and I was able to achieve a great accuracy up to dozens of decimal places when I run a tests against arbitrary precision
sin() implemented by Mathematica 8.

sine function test1
for sin(0.1), arguments in radian

Mathematica 8 result , precision of result 1000 decimal places.

BigDecimal MathematicaResult = new BigDecimal("0.09983341664682815230681419841062202698991538801798225999276686156165"+
"1744283292427609662443804063036267832503180935989035450807237470459378"+
"8733561019849184104968347730506328324943597890052224240860814227296674"+
"3717361429690723851578245480827955359733878534633879232333616788674249"+
"0732984534800581449739604751862203867889743928811306547167725769195052"+
"7939226260748108333691856379033824884726830161865556889346698245411612"+
"7350321922213400495628460096652233220746111279082638889248599762600486"+
"7369921168342258220078495944896623421021701909153051750727014383748742"+
"8538739455042219898916646878947958333218732148928243147445744176306144"+
"8319041883501260441677212938344509031562810353361999685976614314713635"+
"2376471332757607248532041985898211422170648603374528422580789056073829"+
"9833912269838132246283976008123694667742027949047688125913753569293073"+
"4542900571839934963830844812380293707930735198964874164274114958462665"+
"4721689231998075453149763755566049310161914743340176805534440027234162"+
"60117444319296284058355");

My own implementation BigDecimal type precision of result 1000 decimal places.

BigDecimal MyResult = new BigDecimal("0.09983341664682815230681419841062202698991538801798225999276691183645306851605639339348255072848269607650126301750623266285915014777363849429882494172304175818564569878093779910005002435262146480554144164068634503196275439405716427829467718846989916097461274251683687505626691144955116757431195882281097332046230322367528010330012325509400263897859673738501094946240128179295914379177052294570998660870974914229353299674791489086469695092248956464490776701589133882672522224862317928978624183262578915405699876841355904181868741141987361769147329635241904812103650693742736260834678043920025858501656467334341308687703560216612464405573345205965242336552029461830301279503719798956900959047668640280681892448507477838575486397731923600164078104802726389290258429170226935933291644091950079513081214487731344803942215805520434642873011293542131331139589003591171761381808368667811893808151688806855063014066906054722182073878502847520332849760979231929048065256809912370394113984135840931628730"+
"11562022880642264046740491556892673469828818826945815853664137950845961480297893453241670066891939");

Absolute error = -5.02748013242327639657838201069244196598086687598365702436274083429103031791154254688397381233476888109257303049666056902353473990407194600189110189078583484574868186484958398689089163018221892664037245135388883801689588181830395113773730713685686000993557855361744888167079262768697835439434623848746515679126073212404092820502626938000530208144724689918354744116716425629459662049636783854348084302260358988333561371816637813744627046208117987786899176141451952663897179644646397225693380352993781826329919231406184551157513054627516946914584505887721324611845355101608653146170470480065933909111732126971669336643695721498243335392664544424531523171711372228219914210245794701324446527354703217952356284393759416894571062761667332113325078466738366272532863917550533731836280758843056774906862734091656942807495988115860918681464258094805560424058255325887407095753254395729315604821573757373123912935892639139056787572348730943383156187380642264046740491556892673469828818826
9458158536641379508460E-62

>>...for sin(0.1) arguments in radian

>> Mathematica 8 result <<
sin( 0.1 ) = 0.099833416646828152306814198410622026989915388017982259992766861...
>> Java BigDecimal result ( identical to Mathematica 8 up to 60 digits )<<
sin( 0.1 ) = 0.099833416646828152306814198410622026989915388017982259992766911...
>> Windows Calculator ( up to 33 digits ) <<
sin( 0.1 ) = 0.099833416646828152306814198410622

Your results look impressive.

>>>Your results look impressive.>>>
Thank you.
I have already written a few large precision methods and I will post the results and source code soon.
Why do not you try to use Java BigDecimal class ?

>>...Why do not you try to use Java BigDecimal class?

I don't do Java programming but an R&D project for Android OS will finally "force" me...

>>>I don't do Java programming but an R&D project for Android OS will finally "force" me...>>>
You can classify Java programming as a your programming knowledge multiplier:)

Hi Sergey!
Here I'm posting vectorised version of my fastsin() function.By looking at code you can see more than 200 lines of code.
I still did not do an extensive tests for an accuracy , precision and speed of execution.Later I will post the results.
By studying code you can see that I must to declare and define 10 structures and use them as a scalar coefficients.
Inside _asm{} block I cannot declare MASM REAL types and trying to load XMM registers with double coeffs one at time
does not work, so I'm forced to produce such a monstrous code.But I'am able to do vector calculation directly on
XMM registers.
Source code :

inline struct vec4D fastsin4D(struct Argument4D arg1){

if(arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT){
vec4D vec;
vec.f1 = arg1.f1;
vec.f2 = arg1.f2;
vec.f3 = arg1.f3;
vec.f4 = arg1.f4;
vec.f1 = (vec.f1 - vec.f1)/(vec.f1 - vec.f1);
vec.f2 = (vec.f2 - vec.f2)/(vec.f2 - vec.f2);
vec.f3 = (vec.f3 - vec.f3)/(vec.f3 - vec.f3);
vec.f4 = (vec.f4 - vec.f4)/(vec.f4 - vec.f4);
return vec;
}
else if(arg1.f1 <= NEG_HALF_PI_FLT || arg1.f2 <= NEG_HALF_PI_FLT || arg1.f3 <= NEG_HALF_PI_FLT || arg1.f4 <= NEG_HALF_PI_FLT){
vec4D vec1;
vec1.f1 = arg1.f1;
vec1.f2 = arg1.f2;
vec1.f3 = arg1.f3;
vec1.f4 = arg1.f4;
vec1.f1 = (vec1.f1 + vec1.f1)/(vec1.f1 + vec1.f1);
vec1.f2 = (vec1.f2 + vec1.f2)/(vec1.f2 + vec1.f2);
vec1.f3 = (vec1.f3 + vec1.f3)/(vec1.f3 + vec1.f3);
vec1.f4 = (vec1.f4 + vec1.f4)/(vec1.f4 + vec1.f4);
return vec1;

}else{
struct CoeffFlt1
{
float c1,c2,c3,c4;
} coeffflt1;

struct CoeffFlt2
{
float c1,c2,c3,c4;
} coeffflt2;

struct CoeffFlt3
{
float c1,c2,c3,c4;
} coeffflt3;

struct CoeffFlt4
{
float c1,c2,c3,c4;
} coeffflt4;

struct CoeffFlt5
{
float c1,c2,c3,c4;
} coeffflt5;

struct CoeffFlt6
{
float c1,c2,c3,c4;
} coeffflt6;

struct CoeffFlt7
{
float c1,c2,c3,c4;
} coeffflt7;

struct CoeffFlt8
{
float c1,c2,c3,c4;
} coeffflt8;

struct CoeffFlt9
{
float c1,c2,c3,c4;
} coeffflt9;

struct CoeffFlt10
{
float c1,c2,c3,c4;
} coeffflt10;

coeffflt1.c1 = -0.1666666;
coeffflt1.c2 = -0.1666666;
coeffflt1.c3 = -0.1666666;
coeffflt1.c4 = -0.1666666;

coeffflt2.c1 = 0.0083333;
coeffflt2.c2 = 0.0083333;
coeffflt2.c3 = 0.0083333;
coeffflt2.c4 = 0.0083333;

coeffflt3.c1 = -1.9841269e-4;
coeffflt3.c2 = -1.9841269e-4;
coeffflt3.c3 = -1.9841269e-4;
coeffflt3.c4 = -1.9841269e-4;

coeffflt4.c1 = 2.7557319e-6;
coeffflt4.c2 = 2.7557319e-6;
coeffflt4.c3 = 2.7557319e-6;
coeffflt4.c4 = 2.7557319e-6;

coeffflt5.c1 = -2.5052108e-8;
coeffflt5.c2 = -2.5052108e-8;
coeffflt5.c3 = -2.5052108e-8;
coeffflt5.c4 = -2.5052108e-8;

coeffflt6.c1 = 1.6059043e-10;
coeffflt6.c2 = 1.6059043e-10;
coeffflt6.c3 = 1.6059043e-10;
coeffflt6.c4 = 1.6059043e-10;

coeffflt7.c1 = -7.6471637e-13;
coeffflt7.c2 = -7.6471637e-13;
coeffflt7.c3 = -7.6471637e-13;
coeffflt7.c4 = -7.6471637e-13;

coeffflt8.c1 = 2.8114572e-15;
coeffflt8.c2 = 2.8114572e-15;
coeffflt8.c3 = 2.8114572e-15;
coeffflt8.c4 = 2.8114572e-15;

coeffflt8.c1 = -8.2206352e-18;
coeffflt8.c2 = -8.2206352e-18;
coeffflt8.c3 = -8.2206352e-18;
coeffflt8.c4 = -8.2206352e-18;

coeffflt9.c1 = 1.9572941e-20;
coeffflt9.c2 = 1.9572941e-20;
coeffflt9.c3 = 1.9572941e-20;
coeffflt9.c4 = 1.9572941e-20;

coeffflt10.c1 = -3.8681701e-23;
coeffflt10.c2 = -3.8681701e-23;
coeffflt10.c3 = -3.8681701e-23;
coeffflt10.c4 = -3.8681701e-23;

vec4D vec0;
vec0.f1 = 0.0f;
vec0.f2 = 0.0f;
vec0.f3 = 0.0f;
vec0.f4 = 0.0f;

_asm{

xorps xmm0,xmm0
xorps xmm1,xmm1
xorps xmm6,xmm6
xorps xmm7,xmm7
xorps xmm5,xmm5
movups xmm0,arg1//sine x,y,z,w
movups xmm5,xmm0 //copy arg1 to multiply arg power
movups xmm7,xmm0 // copy arg to accumulator
mulps xmm0,xmm5 // x^2
movups xmm6,xmm0 // copy of x^2
mulps xmm0,xmm5 // x^3
movups xmm1,coeffflt1
mulps xmm1,xmm0
addps xmm7,xmm1 // xmm7 accumulator
movups xmm1,coeffflt2
mulps xmm0,xmm6 // x^5
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt3
mulps xmm0,xmm6 // x^7
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt4
mulps xmm0,xmm6 // x^9
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt5
mulps xmm0,xmm6 //x^11
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt6
mulps xmm0,xmm6 //x^13
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt7
mulps xmm0,xmm6 //x^15
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt8
mulps xmm0,xmm6
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt9
mulps xmm0,xmm6
mulps xmm1,xmm0
addps xmm7,xmm1
movups xmm1,coeffflt10
mulps xmm0,xmm6
mulps xmm1,xmm0
addps xmm7,xmm1
movups vec0,xmm7
}

return vec0;
}

}

Hi Iliya, I see how your problem looks like...

Could you also post declarations of HALF_PI_FLT and NEG_HALF_PI_FLT? Did you declare these two constants as '#define ...' or as 'const float ...'?

Declarations of 'vec4D' and 'Argument4D' are also needed.

>>>Hi Iliya, I see how your problem looks like...

Could you also post declarations of HALF_PI_FLT and NEG_HALF_PI_FLT? Did you declare these two constants as '#define ...' or as 'const float ...'?>>>
Yes it is going to be a serious problem.For example if for such a simple function as fastsin() is needed more than 200 lines of code what will be needed for gamma or bessel function.

HALF_PI_FLT is a float value of Pi/2 and NEG_HALF_PI_FLT is -Pi/2 both of them are declared as #define.constants.
The problem is heavy usage of structures declaration and definition needed for Horner scheme calculation.The bulk of code is due to this.
I have tried various combination of indirect addressing with [eax+float size] , but it does not work. Also as expected none of MASM directive and primitive type works.So I'm left with those long struct declaration and definition.I would also use less coefficients because of float type precision but I need to test it.

>>>Declarations of 'vec4D' and 'Argument4D' are also needed.>>>

struct vec4D
{
float c1,c2,c3,c4;
} vec0;

struct Argument4D
{
float a1,a2,a3,a4;
}arg1;

Hi Iliya,

Thanks for the update. My first impression is as follows: In overall, it looks good and can be optimized. But, you're trying to use a Java-like style of programming especially when it comes to declarations of different helper types, like structures, and initialization of members of these helper types. I'll take a look at your function tomorrow.

Best regards,
Sergey

>>> But, you're trying to use a Java-like style of programming especially when it comes to declarations of different helper types, like structures, and initialization of members of these helper types. I'll take a look at your function tomorrow.>>>
Thanks for answer.
Structures types can be accessed with pointers or with 'dot' operator.It is question of programming style.Today i will switch to the 2D ,3D,4D array
types and try to eliminate those pesky structures if it works I can save more than 100 lines of code.I will post the update.

Update on vectorised fastsin() with array type coefficients.
Sadly I still cannot load xmm registers with an array type coefficients as I told you earlier only passing structure reference to xmm register works.
I will investigate this issue with debugger today.

I did preliminary tests on execution speed of _asm{} block.The tests were made with the help of RDTSC instructions beign executed inside _asm{} block.After a few tests I got ~1567 cycles even if latency of accessing and zeroing eax and edx registers and RDTSC and CPUID instructions is included it could not be more than ~300-400 cycles for executing whole block.Inline assembly code should be very fast and total latency in cpi per one Horner scheme polynomial
term is not more than 4-5 cycles.So total time of calculation for 10 terms should be ~40-50 cycles excluding xoring xmm registers and copying x^2 value.
Could you run set of tests on this?

>> Tip #1 <<

Try to use 'typedef's for declaration of structures instead of 'struct':

typedef struct tagSOMESTRUCT
{
} SOMESTRUCT;

>> Tip # 2 <<

Lots of unnecesseary things in that piece of code:
...
if(arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT)
{
vec4D vec;
vec.f1 = arg1.f1;
vec.f2 = arg1.f2;
vec.f3 = arg1.f3;
vec.f4 = arg1.f4;
vec.f1 = (vec.f1 - vec.f1)/(vec.f1 - vec.f1);
vec.f2 = (vec.f2 - vec.f2)/(vec.f2 - vec.f2);
vec.f3 = (vec.f3 - vec.f3)/(vec.f3 - vec.f3);
vec.f4 = (vec.f4 - vec.f4)/(vec.f4 - vec.f4);
return vec;
}
...
Why don't you want to return just NULL:
...
if(arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT)
{
return (vec &)NULL;
}
...
If there is an error with one of input parameters I try to leave a function / method as soon as possible without doing any calculations.

>> Tip # 3 <<

inline struct vec4D fastsin4D( struct Argument4D arg1 )
{
...
}

This is a C-like, that is very obsolete, way of declaring a function. Also, why don't you want to have a class that wraps and consolidates all common things, variables, declarations, etc?

>> Tip # 4 <<

There are lots of declarations of some types and initializations of some constants inside of the function:

inline struct vec4D fastsin4D( struct Argument4D arg1 )
{
...
struct CoeffFlt1
{
float c1,c2,c3,c4;
} coeffflt1;
...
struct CoeffFlt10
{
float c1,c2,c3,c4;
} coeffflt10;
...
coeffflt1.c1 = -0.1666666;
coeffflt1.c2 = -0.1666666;
coeffflt1.c3 = -0.1666666;
coeffflt1.c4 = -0.1666666;
...
coeffflt10.c1 = -3.8681701e-23;
coeffflt10.c2 = -3.8681701e-23;
coeffflt10.c3 = -3.8681701e-23;
coeffflt10.c4 = -3.8681701e-23;
...
}

- Declarations of some types should be done only once even if it is a compile time thing
- Initializations of constants could be done only once during initialization of your library

>> Tip # 5 <<

Think about applications and use cases of your API. Will a developer X be happy with your function(s) / class(es) or not?

>> Tip # 6 <<

Have a test-case(s) for your API from the very beginning.

>> Tip # 7 <<

Evaluate performance of your API from the very beginning. It will save time later and you won't need to do a re-design if some function doesn't work fast.

>>>- Declarations of some types should be done only once even if it is a compile time thing
- Initializations of constants could be done only once during initialization of your library>>>
Thank You very much for your help I'm still learning and switching between 3 programming languages sometime confuses me:)

Regarding declaration of those sine coefficients and their structures I will "typedef" them and move them to header file.Is this good solution to
this problem?

Pages

Login to leave a comment.