I'm having performance problems with MKL. I'm not sure whether I'm doing something wrong, or whether MKL simply has no benefit in my case.
I've made a prototype implementation of the Black-Scholes algorithm for evaluating option prices, once using standard C functions and once using MKL functions from the VML library. My problem is that the MKL implementation is much slower than the plain float implementation. I've tried both single-threaded and multi-threaded. Could someone please take a look and suggest what else I could try? According to the documentation this is a high-performance library, but my results don't reflect this.
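For anyone reading without the attachment, the scalar baseline has roughly the following shape. This is only a minimal sketch, assuming the Black-76 call formula (suggested by the file name Black76.cpp); the function names, signatures and parameter layout are hypothetical and not copied from the attached code.

```cpp
#include <cmath>
#include <cstddef>

// Standard normal CDF expressed through the complementary error function:
// N(x) = 0.5 * erfc(-x / sqrt(2)).
inline float norm_cdf(float x) {
    return 0.5f * std::erfc(-x * 0.70710678f);
}

// Black-76 call price for a single option (hypothetical signature).
// F: forward price, K: strike, r: rate, sigma: volatility, T: time to expiry.
inline float black76_call(float F, float K, float r, float sigma, float T) {
    const float sig_sqrt_t = sigma * std::sqrt(T);
    const float d1 = (std::log(F / K) + 0.5f * sigma * sigma * T) / sig_sqrt_t;
    const float d2 = d1 - sig_sqrt_t;
    return std::exp(-r * T) * (F * norm_cdf(d1) - K * norm_cdf(d2));
}

// Plain loop over n options; this is the kind of code the VML version
// is being compared against.
inline void black76_call_batch(const float* F, const float* K, const float* r,
                               const float* sigma, const float* T,
                               float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = black76_call(F[i], K[i], r[i], sigma[i], T[i]);
    }
}
```

With only 16 elements per call, a loop like this does very little work per invocation, which is the regime I care about.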
I've attached the code. Just uncomment the mkl_domain_set_num_threads() call. The makefile also contains both the single-threaded and multi-threaded libraries; you just have to uncomment the corresponding lines.
When I use sequential linking:
icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o Black76.o Black76.cpp
icpc -c -w1 -O2 -xsse4.2 -DMKL_ILP64 -I. -I/opt/intel/composerxe/include -I/opt/intel/mkl/include -o main.o main.cpp
icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lm
I'm getting the following performance results:
Completed 1 passes in 0 : 001526118 seconds
Completed 2 passes in 0 : 000007518 seconds
Completed 3 passes in 0 : 000008536 seconds
Completed 10 passes in 0 : 000026468 seconds
Completed 100 passes in 0 : 000329301 seconds
Completed 1000 passes in 0 : 002591126 seconds
Completed 10000 passes in 0 : 014796280 seconds
Completed 100000 passes in 0 : 147133308 seconds
Completed 1000000 passes in 1 : 465677079 seconds
Completed 10000000 passes in 14 : 714433962 seconds
Something else is odd here: running 2 passes should not be quicker than running only 1, yet there is a huge difference between the two, and the time for 3 passes doesn't look realistic either. Even 100 passes complete faster than the first single pass. This shouldn't happen.
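One way to check whether the first measurement is paying some one-time startup cost would be to run an untimed warm-up pass before starting the clock. The following is a minimal sketch of that idea, not the timing code from the attachment: time_passes is a hypothetical helper, and it uses std::chrono rather than whatever timer main.cpp uses.

```cpp
#include <chrono>
#include <cstddef>

// Runs `kernel` `warmup` times without timing, then times `passes` further
// calls. Returns the wall-clock time in seconds of the timed calls only.
template <typename Kernel>
double time_passes(Kernel&& kernel, std::size_t passes, std::size_t warmup = 1) {
    using clock = std::chrono::steady_clock;

    // Untimed warm-up: any first-call overhead lands here.
    for (std::size_t i = 0; i < warmup; ++i) {
        kernel();
    }

    const auto start = clock::now();
    for (std::size_t i = 0; i < passes; ++i) {
        kernel();
    }
    const auto stop = clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

If the 1-pass figure drops to roughly the per-pass cost of the longer runs once a warm-up is added, the anomaly is in the first call rather than in the calculation itself.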
When I compile with multi-threading I use the following options:
icpc -L/opt/intel/mkl/lib/intel64 -L/opt/intel/lib/intel64 Black76.o main.o -o black76_intel -lrt -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm
I will have to perform 16 calculations repeatedly, so I defined ARRAYSIZE=16. I also tried increasing ARRAYSIZE to 16000 and enabling multi-threading, but sequential was still faster than multi-threaded. I'd like to improve performance for 16 calculations.
Can someone help me?