I'm a new MKL user and I would like to know whether the BLAS routines and VML functions are vectorized.
I'm currently using OpenMP to apply a complex filter to an image on a quad-core machine (hyperthreading disabled). I divide the image into 4 equal parts and process each part on its own core. Inside this processing, I have some loops that operate on vectors. Where possible, I have replaced these loops with BLAS or VML functions, but without any gain in run time. I expected the vectorized routines to be faster than my loops. Am I wrong? Maybe the vectors need to be larger than a certain size?

Another question: I don't expect the internal parallelization of BLAS or VML to be effective in my case, because the functions are called from within a single OpenMP thread. Am I wrong?
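For reference, here is a simplified sketch of the structure I am describing. It is not my actual filter: the names `row_kernel` and `process_image` are made up for this post, and the inner loop is just an axpy-style placeholder (y += a*x) standing in for the real per-vector work.

```c
#include <stddef.h>

/* Hypothetical inner kernel: y[i] += a * x[i] over one image row.
   Placeholder for the real filter step, which is not shown here. */
static void row_kernel(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
    /* The BLAS call I would swap in for this loop (needs <mkl.h>):
       cblas_daxpy((MKL_INT)n, a, x, 1, y, 1); */
}

/* Split the image into 4 horizontal strips, one per OpenMP thread.
   Without OpenMP enabled the pragma is ignored and the strips run serially. */
void process_image(size_t rows, size_t cols, double a,
                   const double *in, double *out)
{
    #pragma omp parallel for num_threads(4) schedule(static)
    for (int part = 0; part < 4; ++part) {
        size_t first = rows * (size_t)part / 4;
        size_t last  = rows * (size_t)(part + 1) / 4;
        for (size_t r = first; r < last; ++r)
            row_kernel(cols, a, in + r * cols, out + r * cols);
    }
}
```

In this layout each BLAS or VML call only sees one row (`cols` elements) at a time, which is why I am asking whether the vector length matters.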
I would be very grateful if someone could give me some help.