# Short Vector Math Library

Hi all,

I was reading about the short vector math library, and how it can dramatically improve performance. So I decided to give it a try, and see what I can achieve compared to a standard loop performing an exponentiation for all values of a matrix.

The results were not what I expected: my SVML implementation is slower than the conventional loop (0.32 sec for SVML vs. 0.28 sec for the plain loop, on a roughly 7500x7500 float matrix). In a way, this is good, because it's nice to see the Intel compiler doing a better job than my quick 1 hour hack. But I'd still like to know what it is doing better than me. If anybody is interested, my code is attached below.

Also, I read that the processor provides different ways to compute math functions, an accurate way vs. a fast LUT approximation. I would like to know where I can find info on how to control that.

Alex

###### CUT ######

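// NOTE: the header names were stripped from the posted listing;
// the ones below are what the code needs, not necessarily the original six

#include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter

#include <stdio.h>     // printf

#include <stdlib.h>

#include <malloc.h>    // _aligned_malloc, _aligned_free

#include <math.h>      // pow, exp, ceil

#include <xmmintrin.h> // __m128, _mm_load_ps, _mm_store_ps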

// memory alignment

#define ALIGNMENT (16)

// timer variables

LARGE_INTEGER start, end, frequency;

float totalHi, totalLo;

extern "C" { __m128 vmlsExp4(__m128 a); }

////////////////////

///// TIMER ROUTINES

////////////////////

void resetTimer(void)

{

start.LowPart = start.HighPart = end.LowPart = end.HighPart = 0;

totalHi = totalLo = 0.0f;

QueryPerformanceCounter(&start);

}

void stopTimer(char* msg)

{

float total_time=0.0f, frequency_time=0.0f;

QueryPerformanceCounter(&end);

total_time = ((float)end.HighPart - (float)start.HighPart) * (float)pow(2.0f,32) +

((float)end.LowPart - (float)start.LowPart);

QueryPerformanceFrequency(&frequency);

frequency_time = (float)((frequency.HighPart) * (pow(2.0f,32))) + (float)frequency.LowPart;

printf("%s: %f\n", msg, total_time / frequency_time);

}

////////////////////

///// MEMORY ROUTINES

////////////////////

void CreateFloatInput(float **output,

int width, int height, // width and height of image

int width_float) // width of 16-bytes aligned rows

{

*output = (float*) _aligned_malloc(width_float*height*sizeof(float), ALIGNMENT);

return;

}

////////////////////

///// BENCHMARK ROUTINES

////////////////////

void BenchmarkFloat(float *image, int width, int height, int width_float)

{

int row=0, col=0;

__assume_aligned(image,ALIGNMENT);

// 1st do a loop for nothing to set the cache

// so both subsequent loops can be compared meaningfully

for(row=0 ; row<height ; row++) {

for(col=0 ; col<width ; col++) {

image[row*width_float + col] = image[row*width_float + col] + 4.3f;

}

}

resetTimer();

for(row=0 ; row<height ; row++) {

#pragma vector always

#pragma ivdep

for(col=0 ; col<width ; col++) {

image[row*width_float + col] = exp(image[row*width_float + col]);

}

}

stopTimer("normal");

resetTimer();

float* line=NULL;

__m128 temp;

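for(row=0 ; row<height ; row++) {

line = (float *)&image[row*width_float];

// loop header reconstructed: the bound and the stride of 4 floats are assumed
for(col=0 ; col<width ; col+=4) {

// aligned load of 4 floats (no load appears in the posted listing)
temp = _mm_load_ps(&line[col]);

temp = vmlsExp4(temp);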

_mm_store_ps(&line[col], temp);

}

}

stopTimer("Intel short vector library");

return;

}

////////////////////

///// MAIN

////////////////////

int main(int argc, char* argv[])

{

// variables representing an input image

float *image=NULL;

int width, height, width_float;

QueryPerformanceFrequency(&frequency);

// 1st set-up the image

width = 7639;

height = 7419;

int alignment_needed = ALIGNMENT / sizeof(float);

width_float = (int)ceil((float)width/(float)alignment_needed) * alignment_needed;

CreateFloatInput(&image, width, height, width_float);

// benchmark

BenchmarkFloat(image, width, height, width_float);

// release resources

_aligned_free(image);

return 0;

}
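One detail of the timer routines may be worth a sketch. `stopTimer` converts the 32-bit halves of each counter to `float` before subtracting, and a `float` carries only 24 significant bits, so small tick differences can round away. Below is a minimal, portable sketch of the same conversion done with 64-bit integers and `double`. The function names are made up for illustration and are not part of the posted code; on Windows the inputs would presumably come from the `QuadPart` member of `LARGE_INTEGER`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original post): elapsed seconds from two
// raw tick counts and the counter frequency in ticks per second.
double elapsed_seconds(std::int64_t start_ticks, std::int64_t end_ticks,
                       std::int64_t ticks_per_second)
{
    assert(ticks_per_second > 0);
    // Subtract as integers first so no low-order ticks are lost,
    // then convert the difference to double.
    return static_cast<double>(end_ticks - start_ticks) /
           static_cast<double>(ticks_per_second);
}

// For contrast: the float-first subtraction pattern used in stopTimer.
float float_first_difference(std::uint32_t start_low, std::uint32_t end_low)
{
    // volatile keeps the intermediates in true single precision.
    volatile float s = static_cast<float>(start_low);
    volatile float e = static_cast<float>(end_low);
    return e - s;
}
```

For counter values around 4,000,000,000 adjacent `float` values are 256 apart, so a difference of a single tick disappears in the float-first version while the integer-first version keeps it. For a run of a few tenths of a second the effect on the printed time is small, so this is a robustness point rather than an explanation of the benchmark numbers.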

Please post the compile options used for the performance testing and thanks!

I tried your test case with the 10.0 compiler. The SVML version is faster for me.

C:\testissues>icl -DWIN32 -DWINDOWS -O2 -QxN t.cpp
Intel C++ Compiler for applications running on IA-32, Version 10.0 Build 20070426 Package ID: W_CC_P_10.0.025

t.cpp
t.cpp(55): (col. 3) remark: LOOP WAS VECTORIZED.
t.cpp(65): (col. 3) remark: LOOP WAS VECTORIZED.

-out:t.exe
t.obj

C:\testissues>t
normal: 0.636434
Intel short vector library: 0.594954

C:\testissues>icl -DWIN32 -DWINDOWS -O2 -QxO t.cpp
Intel C++ Compiler for applications running on IA-32, Version 10.0 Build 20070426 Package ID: W_CC_P_10.0.025

t.cpp
t.cpp(55): (col. 3) remark: LOOP WAS VECTORIZED.
t.cpp(65): (col. 3) remark: LOOP WAS VECTORIZED.

-out:t.exe
t.obj

C:\testissues>t
normal: 0.620056
Intel short vector library: 0.610974

C:\testissues>icl -DWIN32 -DWINDOWS -O2 -QxW t.cpp
Intel C++ Compiler for applications running on IA-32, Version 10.0 Build 20070426 Package ID: W_CC_P_10.0.025

t.cpp
t.cpp(55): (col. 3) remark: LOOP WAS VECTORIZED.
t.cpp(65): (col. 3) remark: LOOP WAS VECTORIZED.

-out:t.exe
t.obj

C:\testissues>t
normal: 0.616194
Intel short vector library: 0.604466

Hi Jeniffer and all,

Thanks a lot for the informative answer. I've kept playing around, and I've made some interesting observations that I unfortunately do not understand. That'd be great if somebody could enlighten me as to what is happening in my code.

Most surprising of all is that SVML does seem to become much slower under certain circumstances. My self-contained source code is attached to this message. To recap, my goal is to exponentiate a large float matrix as fast as possible. I am testing several exp() implementations, i.e. a plain C call to exp() vs. SVML's exp(). Furthermore, I am also testing 2 different designs: using the same buffer for input and output, or using 2 buffers.

For each design, I make lots of calls to both exp() implementations in semi-random order to perform some _very_ basic statistical analysis of each implementation in different contexts. The results are as follows:

SAME BUFFER
normal: 0.379070
Intel short vector library: 0.141328
normal: 0.374968
Intel short vector library: 0.141201

Intel short vector library: 0.348467 --> SUDDEN DROP
normal: 0.269847
Intel short vector library: 0.349904
normal: 0.268864
normal: 0.271135
Intel short vector library: 0.351629

DIFFERENT BUFFER
normal: 0.325714
Intel short vector library: 0.368096
normal: 0.260959
Intel short vector library: 0.349850
Intel short vector library: 0.350638
normal: 0.258922
Intel short vector library: 0.349910
normal: 0.259280
normal: 0.259503
Intel short vector library: 0.350372

As you can see, SVML is much faster than the standard exp() in the beginning, but then it becomes slower (from 0.14 to 0.34) at a certain point and never recovers. I do not understand why. However, I have tried to re-initialize the data before each block (to do so yourself, simply uncomment the 2 InitFloatInput calls; it's indicated in the source). The results are now as follows:

SAME BUFFER
normal: 0.469742
Intel short vector library: 0.148859
normal: 0.383944
Intel short vector library: 0.142378
Intel short vector library: 0.170809
normal: 0.270230
Intel short vector library: 0.364840 --> SUDDEN DROP
normal: 0.274514
normal: 0.272254
Intel short vector library: 0.353612

DIFFERENT BUFFER
normal: 0.412189
Intel short vector library: 0.120723 --> BUT RECOVERS HERE
normal: 0.366127
Intel short vector library: 0.121072
Intel short vector library: 0.120550
normal: 0.368461
Intel short vector library: 0.120768
normal: 0.366137
normal: 0.366620
Intel short vector library: 0.120850

There is still a drop occurring in the first block, but re-initializing the data for the second block makes SVML's exp() "recover" its speed. Any idea why that is? My guess would be that the processor throws some kind of mathematical exception, perhaps because I keep exponentiating the same data, so it reaches a maximum value (FLT_MAX), which makes the code stall at a certain point. Re-initializing could fix that?!

Another interesting issue is that the speed of the standard exp() call is correlated with the speed of SVML's exp(). If you look closely, when SVML's exp() becomes slow (0.14 -> 0.34), standard exp() becomes faster (0.37 -> 0.26), while when SVML's exp() becomes faster again, standard exp() becomes slower again. I have no idea about that one!!

Alex

PS : My command to compile is: icl -DWIN32 -DWINDOWS -O2 -QxO (-QxT for Core2Duo) ShortVectorMathLib.cpp

Your code is encountering floating point exceptions (most likely denormals or invalid FP values) because you iterate over the same data. Reinitialization between tests would fix that as you already noticed.
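How quickly repeated in-place exponentiation leaves the range of ordinary floats can be checked with a small sketch. It assumes IEEE single precision and a standard `exp`; the helper names are made up for illustration, and the starting values are examples rather than the contents of the posted buffer.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: how many times exp() must be applied to x, in place,
// before the stored float becomes infinite. Returns -1 if that does not
// happen within max_passes.
int passes_until_infinite(float x, int max_passes)
{
    assert(max_passes >= 0);
    for (int pass = 1; pass <= max_passes; ++pass) {
        x = std::exp(x);               // float overload, like one benchmark pass
        if (std::isinf(x)) return pass;
    }
    return -1;
}

// True if a single exp() of x already yields a denormal (subnormal) float.
bool exp_is_subnormal(float x)
{
    return std::fpclassify(std::exp(x)) == FP_SUBNORMAL;
}
```

Starting from 0 the sequence is 1, 2.718..., 15.15..., about 3.8e6, and then infinity, so a value overflows on the fifth pass; starting from 4.3 it overflows on the third. A sufficiently negative input such as -100 gives a denormal result on the first pass. Either way, a buffer that is exponentiated in place over several timed runs soon holds mostly special values.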

Rant directed at Intel CPU engineers:
I do not understand why Intel CPUs handle floating point exceptions so poorly in comparison to your competitor. I have performed some tests earlier on the Netburst architecture, and floating point exceptions in the code lead to some serious slowdowns -- the code on an Intel CPU is executed more than two times slower than on an AMD Opteron CPU. Once the exceptions are gone, the code performs similarly on both machines. Judging by the above test, that seems to be the case with the Core architecture as well.

I would really want to see that improved in future Intel CPUs because your competitor proved that it is possible, and because one really cannot find any justification for that slowness.

Regards,
Igor Levicki

CPUs from nearly all vendors are designed to perform better with abrupt underflow settings. Are you comparing the CPUs with IEEE gradual underflow set consistently? It's certainly true that IA-64, and original P4, CPUs had much higher than normal performance penalty for handling gradual underflow. Are you generalizing from those?
If you are talking about handling of NaN or Infinity, can you show a practical situation where there is value in processing those quickly?
Each compiler has different settings to control gradual underflow. For example, with recent Intel compilers, the correctness option -fp:precise sets gradual underflow, but it may be turned off again by following with -Qftz. SSE intrinsics are available to change the setting during execution.
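As a rough illustration of changing the underflow setting at run time, the sketch below toggles the SSE flush-to-zero mode with the macros from `<xmmintrin.h>`. It is x86-only and assumes the arithmetic is done in SSE registers (the default for x86-64 builds). The wrapper function names are invented for this example.

```cpp
#include <cassert>
#include <limits>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE

// Report whether flush-to-zero is currently enabled for this thread.
bool flush_to_zero_enabled()
{
    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
}

// Hypothetical wrapper: switch abrupt underflow on or off for this thread.
void set_flush_to_zero(bool enabled)
{
    _MM_SET_FLUSH_ZERO_MODE(enabled ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
    assert(flush_to_zero_enabled() == enabled);
}

// Halve the smallest normal float. The exact result is a denormal, so it
// comes back as 0 under flush-to-zero and as a tiny nonzero value otherwise.
float half_of_smallest_normal()
{
    volatile float x = std::numeric_limits<float>::min();
    volatile float h = 0.5f;
    volatile float r = x * h;   // volatile forces a run-time multiply
    return r;
}
```

The mode lives in the MXCSR register, so it is per thread and has to be set in every thread that does the arithmetic. It only covers denormal results; treating denormal inputs as zero is a separate MXCSR bit.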

> It's certainly true that IA-64, and original P4, CPUs had much higher than normal performance penalty for handling gradual underflow. Are you generalizing from those?

For Netburst (Prescott, not just the original P4) I have personally compared using the same executable compiled using Visual C++ 2003 compiler with and without SSE/SSE2 support. It really doesn't matter which type of FP exceptions happen, the net result is that the same code is taking 2.0x-2.5x longer to execute on Intel than on AMD CPU.

Moreover, check those numbers posted above, and you will see what I am talking about when I say that the same obviously applies to the Core architecture:

```
#1 Intel short vector library: 0.170809 <- First run in-place
#2 Intel short vector library: 0.364840 <- Repeated run in-place (now with invalid FP numbers from accumulation and thus FP exceptions)
#3 Intel short vector library: 0.120723 <- Out of place operation (unmodified source values, cached)
```

It looks pretty self-explanatory to me, and I bet that enabling exceptions and handling them would show that they exist in #2. IMO, they are the sole reason for such a drastic slowdown (2.13x in this case).

> If you are talking about handling of NaN or Infinity, can you show a practical situation where there is value in processing those quickly?

For example, you have a calculation which takes 1 hour to finish on an Opteron. You run it accidentally with invalid parameters, and on an Intel CPU it takes 2.5 hours, while on the Opteron, even with those invalid parameters, it still takes 1 hour.

Let's say that you are at work and it is 15:00, and the calculation takes one hour to complete. With the AMD CPU you would still be able to re-run the code with correct parameters and finish your job before 17:00. With the Intel CPU you would stay at work until 17:30 just to find out that you have made a mistake, and then an additional hour to fix it. That is clearly not the way I would want it to work.

Of course, you can debate whether software which could be affected by such behavior should have an exception handler and warn the user in case that invalid values occur, but you don't always have the control over the code.

Regards,
Igor Levicki

"Lets say that you are at work and it is 15:00, and if the calculation takes one hour to complete, with AMD CPU you would still be able to re-run the code with correct parameters and finish your job before 17:00h. With Intel CPU you would stay at work until 17:30 just to find out that you have made a mistake, and then an additional hour to fix it. That is clearly not the way I would want it to work."

Dear Igor,

The situation you are describing should be most familiar to engineers who work with lengthy simulations. An even more realistic scenario would be that the engineer, before she leaves the office in the afternoon, starts a script that launches 10 simulations (in sequence) that each take an hour to complete. Let us furthermore assume that 3 out of the 10 simulations end up with invalid numbers that increase the simulation time from 1 hour to 10 hours. Now, she really needs as many results as possible, because the next morning her boss will present the simulation results to a customer who may save the company from bankruptcy... How should she write her software to get as many results as possible without staying awake all night? Obviously, if the first two simulations involve invalid numbers, she won't have any results the next morning...

In my opinion, the best approach would be to write software that regularly reads the floating point status word (SSE or x87) and terminates the application if too many invalid numbers have been detected. This way, you can get the best of both worlds: you avoid the slowdown of tedious C++ exception handling (which hampers the efficiency of out-of-order engines), and you have the possibility to terminate the application if something gets out of control. Remember that the bits in the floating point status words are sticky, so she must decide how often to read the status word, and whether it should be cleared between readings. In addition, she must try to determine how many invalid numbers are allowed before the application terminates; this may be a difficult task that involves some trial and error (intuition/experience). Anyway, she will most certainly have 7 out of 10 successful simulations the next morning, and her company will be saved from bankruptcy!
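The polling idea can be sketched in portable C++ with `<cfenv>`, which exposes the same sticky flags as the hardware status words. This is only an illustration under stated assumptions: the class name, the limit, and the `faulty_step` stand-in are invented here, and a real application would poll from its own main loop or a timer.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical monitor: counts polling intervals in which the sticky
// "invalid operation" flag was found set, and reports when a limit is passed.
class InvalidOperationMonitor {
public:
    explicit InvalidOperationMonitor(int max_bad_intervals)
        : max_bad_intervals_(max_bad_intervals)
    {
        assert(max_bad_intervals >= 0);
        std::feclearexcept(FE_INVALID);      // start from a clean flag
    }

    // Call periodically. Returns true when the run should be abandoned.
    bool poll()
    {
        if (std::fetestexcept(FE_INVALID)) {
            ++bad_intervals_;
            std::feclearexcept(FE_INVALID);  // re-arm for the next interval
        }
        return bad_intervals_ > max_bad_intervals_;
    }

    int bad_intervals() const { return bad_intervals_; }

private:
    int max_bad_intervals_;
    int bad_intervals_ = 0;
};

// Stand-in for one step of a computation that has gone wrong:
// 0/0 raises the invalid-operation flag and yields NaN.
double faulty_step()
{
    volatile double zero = 0.0;
    volatile double result = zero / zero;    // volatile forces the divide
    return result;
}
```

Because the flag is sticky, any number of invalid operations inside one interval counts once, which is why the limit is expressed in intervals rather than in individual operations. Floating point exceptions stay masked throughout, so nothing traps; the monitor only reads and clears flags.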

:-)

Best Regards,

Lars Petter Endresen

Thanks all for the informative answers.

> the best would be to write a software that regularly reads the floating point status word (SSE or x87),

Alex

Lars, I agree that such applications should have error control. However, that is only possible if you have control over the source code.

As I already said, in many cases you don't because you are either working with binary or with 3rd party library.

Finally, what you are suggesting still doesn't change the fact that Intel CPUs, including the Core micro-architecture, are inferior when it comes to floating point exceptions, and that there is simply no proper justification for such poor behavior when it is obvious that there is room for at least a twofold improvement.

Regards,
Igor Levicki

Hello,

You do not need control of the source code. You need to be able to read the floating point status register for the application you are running. This can be done in a separate library that can be linked with the application. Each application only has one floating point status register, so you must ensure that you read this status register within the DLL or EXE that you are running. Of course, if you are unable to link in your own code that reads the floating point status register, then you do not have any chance to terminate the application if it is getting out of control! But then I think that you cannot be regarded as the person responsible for the numerical results generated by the application.

Lars Petter

Yes, you can find an excellent article by Microsoft here: http://msdn2.microsoft.com/en-us/library/9st43tcf(VS.80).aspx.

Typically you read (status87) and clear (clear87) once every second, and terminate if you observe for example more than 3 invalid numbers. As bits are sticky "one" detected invalid number in reality may mean millions of invalid numbers...

> Each application only has one floating point status register

Actually it has two. One for FPU and one for SSE2. That is why you have this function:

`void _statusfp2(unsigned int *px86, unsigned int *pSSE2)`

But you would need to be able to ENABLE floating point exceptions first.

MSDN says: _statusfp2 is recommended for chips (such as the Pentium IV and later) that have both an x87 and an SSE2 floating point processor. For _statusfp2, the addresses are filled in with the floating-point status word for both the x87 or the SSE2 floating-point processor. When using a chip that supports x87 and SSE2 floating point processors, EM_AMBIGUOUS is set to 1 if _statusfp or _controlfp is used and the action was ambiguous because it could refer to the x87 or the SSE2 floating-point status word.

> Of course, if you are unable to link in your own code that reads the floating point status register, then you do not have any chance to terminate the application if it is getting out of control!

Of course you have the chance -- TerminateProcess() API.

But as I said, if you only have a binary whether it is EXE or DLL, then it gets difficult. Of course you can use remote code injection techniques but that is not so trivial to accomplish and honestly I do not see any valid reason why I would need such an extensive use of software to work around poor hardware implementation, because in my opinion that 2x slowdown means just that -- poor hardware implementation.

Regards,
Igor Levicki

> But you would need to be able to ENABLE floating point exceptions first.

You can read the floating point status registers (both x87 and SSE) without enabling exceptions. As you may know, enabling exceptions may prevent some reordering transformations done by the Intel C++ Compiler, thus in certain cases you need to compile with /GX- to turn all exceptions off to obtain the maximal possible performance. Then, with all C++ exceptions disabled, you can still read the floating point status registers to detect illegal calculations. I think that you are possibly mixing up the status registers with the control registers? They are separate and can be used separately!

> in my opinion that 2x slowdown means just that -- poor hardware implementation.

Oh it is difficult to satisfy all customers! This is equivalent to state that it is not only important to be able to do the right things extremely fast, it is desirable to do the wrong things extremely fast too.... running around like a headless chicken....

> You can read the floating point status registers (both x87 and SSE) without enabling exceptions.

True but what if you are dealing with extended precision numbers like many scientific applications do?

Intel SDM Volume 1, chapter 8.5.2 says: If the denormal value being loaded is a double extended-precision floating-point value, the denormal-operand exception is not reported.

> Oh it is difficult to satisfy all customers! This is equivalent to state that it is not only important to be able to do the right things extremely fast, it is desirable to do the wrong things extremely fast too.... running around like a headless chicken....

I see this is fun for you... Ok smart guy, try to compile this using any compiler you like and run it on Intel and on AMD CPU of your choice:

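```
// NOTE: the header names, the rdtsc() function header and the timing calls in
// main() were missing from the posted listing. They are reconstructed here
// from how the code uses them and should be treated as assumptions.
#include <stdio.h>
#include <math.h>

void f1(float *w, int n, float sigma)
{
	for (int i = 0; i < n; i++) {
		float t = -i * i / (sigma * sigma);
		w[i] = expf(t);
	}
}

void f2(float *w, int n, float sigma)
{
	for (int i = 0; i < n; i++) {
		float t = -i * i / (sigma * sigma);
		w[i] = (t > -80 ? expf(t) : 0); // "optimization"
	}
}

// Read the time-stamp counter (32-bit MSVC/Intel inline assembly;
// the 64-bit result is returned in EDX:EAX).
__declspec(naked) unsigned __int64 rdtsc(void)
{
	__asm	{
		rdtsc
		ret
	}
}

// w1, w2 forced to separate pages to prevent
// hardware prefetch of w2 from affecting timing
__declspec(align(256))	float w1[3];
float dummy[1024];
__declspec(align(256))	float w2[3];

int main(int argc, char *argv[])
{
	unsigned __int64 t0, t1, t2, t3;

	float sigma = 0.2214f;

	t0 = rdtsc();
	f1(w1, 3, sigma);
	t1 = rdtsc();

	t2 = rdtsc();
	f2(w2, 3, sigma);
	t3 = rdtsc();

	printf("   FP Exceptions, clocks : %f\n", (double)(t1 - t0));
	printf("NO FP Exceptions, clocks : %f\n", (double)(t3 - t2));

	return 0;
}
```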

Feel free to report the numbers.

Moreover, feel free to experiment with the code -- f1() was part of the function which was getting called in a loop roughly 600 times.

You can either try that yourself by adding an outer loop or guesstimate the slowdown using simple math such as multiplication.

Oh, and sigma is variable and obviously not under the control of your code.

Have fun!

Regards,
Igor Levicki

> I see this is fun for you... Ok smart guy, try to compile this using any compiler you like and run it on Intel and on AMD CPU of your choice:

Dear Igor,

Sorry about that. I did not intend to be so sarcastic. I will run your case on my Core 2 Duo and come back with results soon.

Kind Regards and sorry again,
Lars Petter

Dear Igor,

Yes, there is a dramatic difference. However, reading the status register reveals that results are erroneous, and you can kindly ask your software friends to call _statusfp2 and terminate if it is getting out of control. Try this:

```
_statusfp2(&px86, &pSSE2);
printf("Status = %.4x\n", pSSE2);
```

Note that exception handling is not enabled, but you are still able to detect that somewhere along the calculations something went wrong.

Kind Regards,
Lars Petter Endresen

> Sorry about that. I did not intend to be so sarcastic.

Apology accepted. I already got used to being bashed for caring about things other find unimportant on various forums. I also learned that the only way to "fight" is by using examples.

> Yes, there is a dramatic difference.

Good. That is what I was talking about. Now please compile it with say /QaxTO and run it on an AMD Athlon 64 or Opteron CPU if you have access to one of those or ask someone to do it for you.

> However, reading the status register reveals that results are erroneous

I do not want to be so anal-retentive about error checking, especially when it can degrade performance, not to mention that I do not have the authority to crash the application.

The above piece of code resides in a DLL responsible for realtime 3D CAT scan visualization, which is used by an end user (a doctor) through a GUI application. He can manipulate some parameters in order to filter the scan and bring out particular details. If the parameters get out of the valid range (which is unknown at compile time, since it can depend on the dataset itself), this code, which is part of a downsampling function, slows the whole function down to a crawl, in turn rendering the user interface unusable because of delayed texture updates.

But I digress... I do not want you to help me solve that problem, because we already solved it with a simple conditional test like the one used in f2().

What I want is to prove that Intel CPUs (Itanium, Pentium 4, Prescott, Core2) all have inferior floating point exception recovery compared to the competition.

I am hereby voicing my discontent with that inferiority which is dragging through so many CPU generations without being fixed.

I want it fixed on a hardware level. I do not want software workarounds.

Regards,
Igor Levicki

Bad news... I figured out that this slowdown also has to do with the compiler!

Let me explain in details what this code does -- it is for 3D blurring and sigma is a variable parameter. Smaller values mean less softening so it is required that you allow small numbers.

First of all, this is a precision issue, I am getting _SW_INEXACT (0x00000001) by running the test code.
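The inputs of the posted test program can be checked by hand. With sigma = 0.2214 the exponent is t = -i²/σ², so the three elements computed in the test get t = 0, about -20.4, and about -81.6. The sketch below, assuming IEEE single precision and a standard `exp`, classifies the resulting weights; the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Exponent argument, written the same way as in f1/f2 of the test program.
float gaussian_exponent(int i, float sigma)
{
    assert(sigma != 0.0f);
    return -i * i / (sigma * sigma);
}

// Classify exp(t) for element i: FP_NORMAL, FP_SUBNORMAL, FP_ZERO, ...
int classify_weight(int i, float sigma)
{
    return std::fpclassify(std::exp(gaussian_exponent(i, sigma)));
}
```

For i = 2 the exponent is just below the -80 cutoff used in f2, yet exp(-81.6) is roughly 3.6e-36, which is still a normal float (the smallest normal float is about 1.2e-38). For i = 3 the exponent is about -183.6 and the result underflows all the way to zero. So for this sigma none of the final weights is a denormal, which appears consistent with only the inexact flag being reported for the three-element test; it does not by itself explain where the extra clocks in f1 go.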

I have tried to use long double and to set /Qpc80 and /fp:strict but with Intel C++ Compiler it doesn't have any effect and the difference in number of clocks is ~100x (8155 .vs. 84 clocks) on my Core 2 Duo.

Then I tried to use Visual C++ 2005 compiler and at first I had worse results (12313 .vs. 763 clocks) than with Intel C++ Compiler.

Then I set /fp:strict and I get 938 .vs. 700 clocks. So, second loop is slower but the first is considerably faster than with Intel C++ Compiler.

There is something fishy going on here!!!

Regards,
Igor Levicki

long double is treated as double until you set -Qlong-double. This should over-ride the Windows convention that precision mode is always set to 53. As the Windows libraries come from Microsoft, they don't support 80-bit long double. libsvml, of course, has no long double support.
On the other subject, of terminating a case which has gone bad, usually it can be done with low overhead by checking some strategic values with isnan() and isinf().
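A minimal sketch of such a spot check, assuming the values of interest sit in a float array and that inspecting every `stride`-th element is representative enough; the function name is invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical spot check: inspect every `stride`-th element and report
// whether all inspected values are finite (neither NaN nor infinity).
bool sampled_values_finite(const float* data, std::size_t count,
                           std::size_t stride)
{
    assert(stride > 0);
    for (std::size_t i = 0; i < count; i += stride) {
        if (std::isnan(data[i]) || std::isinf(data[i])) {
            return false;
        }
    }
    return true;
}
```

A larger stride keeps the overhead low but can miss isolated bad values, so the choice depends on how quickly NaN or infinity spreads through the data in the computation at hand.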

Hm, that explains one part of this but it doesn't explain why the f1() function has that much higher clock count.

In the real application we observe a 4 second slowdown when the value of sigma is 0.22 and no slowdown when it is 0.5.

I will definitely have to take a more detailed look at this issue; perhaps a check of the assembler code is due.

Regards,
Igor Levicki

tim18, and anyone from Intel (Lexi?):

You might want to take a look at this more closely, there is definitely something weird going on here.

I tested using various compiler switches to compile this test code. I found out that the clock count for f1() decreases drastically when I:

• Drop /QxT and let the compiler generate only generic code
• Write /QaxT instead of /QxT (keep in mind that I have a Core 2 Duo E6300!)
• Write /Qipo instead of /Qip while keeping /QxT

I haven't yet got around to checking what is the difference in generated code but one thing I am certain of is that something doesn't look right if performance makes a tenfold dive when you flip a switch which should improve it.

Regards,
Igor Levicki

If you could submit an example which shows /QaxT performing better than /QxT on Core 2 Duo, please do so, on premier.intel.com. I belong to the school which prefers to minimize the variety of options which need to be tried, for example by making /QxW work well on CPUs like Core 2 Duo, so I think we may have some agreement there.
Investigation of some of these differences might require invoking VTune on your code.

> If you could submit an example which shows /QaxT performing better than /QxT on Core 2 Duo, please do so, on premier.intel.com.

Sigh... It is the same example I posted here. I hoped someone could look at it without me having to explain it all over again on Premier Support.

I have already wasted too much time diagnosing this problem and on a showstopper bug in 10.0.025 which I submitted 6 days ago and which has been confirmed as reproduced immediately but I haven't got any progress update since then and we have an application to ship. Why should I bother submitting code performance issues when they seem not to care about broken code issues? What exactly is the priority if showstoppers aren't?

About the code sample -- not only is it performing better with /QaxT, it is also performing better without /Qx[n] altogether, and with /QxK, while all the others -- /Qx(W|N|P|T|O) -- perform much worse unless you set /Qipo.

> I belong to the school which prefers to minimize the variety of options which need to be tried, for example by making /QxW work well on CPUs like Core 2 Duo, so I think we may have some agreement there.

Agreement on what? I am sorry, I do not understand your point here. I believe I have already minimized both the test case and the switches. Next thing someone will ask me to fix it, right?

> Investigation of some of these differences might require invoking VTune on your code.

I believe I am not the one who should be doing it. I do not own VTune and I am not going to download and install the trial just to troubleshoot performance issue Intel compiler developers should be troubleshooting.

Regards,
Igor Levicki

I submitted it anyway, #439834.

Regards,
Igor Levicki