ipp-6.1.1.035-windows-ia32 v8 slower than t7 and w7 for ippiCopy_8u_C1R

Hi,
while going through some performance testing for part of my code, I found it odd that the v8 path was slower than t7 or w7. After a little digging, I found that it was ippiCopy_8u_C1R that was really slower.
Has anyone got an explanation for this behaviour?
I'm using the non-threaded static libs. I didn't test with 16-byte aligned data; I guess Visual Studio aligns data on 8-byte boundaries.

Regards,
Matthieu

Here are the timings:

ippiCopy_8u_C1R
px : 2992241 clocks
px : 1500 us
w7 : 817182 clocks
w7 : 410 us
t7 : 960961 clocks
t7 : 482 us
v8 : 2208160 clocks
v8 : 1107 us

Here is the code:
static Ipp8u pSrc8u[1280*1024];
static Ipp8u pDst8u[1280*1024];
static IppiSize srcSize = {1280, 1024};
static int step8u = 1280 * sizeof(Ipp8u);

int i;
int freq;
Ipp64u begin, end; /* cycle counters used by the macro below */

ippGetCpuFreqMhz(&freq);

#define SPEED_8u_C1R(fn, it,cpu, cpuType, ...) {ippInitCpu(cpuType); \
ippiImageJaehne_8u_C1R(pSrc8u, step8u, srcSize); \
begin=ippGetCpuClocks(); \
for(i=it; i!=0; --i) \
fn(pSrc8u, step8u, pDst8u, __VA_ARGS__); \
end=ippGetCpuClocks(); \
printf(cpu" : %12.0f clocks\n", (float)(end-begin)/it); \
printf(cpu" : %12.0f us\n", ((float)(end-begin)/it)/freq); }

printf("ippiCopy_8u_C1R\n");
SPEED_8u_C1R(ippiCopy_8u_C1R, 1000, " px", ippCpuUnknown, step8u, srcSize);
SPEED_8u_C1R(ippiCopy_8u_C1R, 1000, " w7", ippCpuSSE2, step8u, srcSize);
SPEED_8u_C1R(ippiCopy_8u_C1R, 1000, " t7", ippCpuSSE3, step8u, srcSize);
SPEED_8u_C1R(ippiCopy_8u_C1R, 1000, " v8", ippCpuSSSE3, step8u, srcSize);

printf("ippiMirror_8u_C1R\n");
SPEED_8u_C1R(ippiMirror_8u_C1R, 1000, " px", ippCpuUnknown, step8u, srcSize, ippAxsHorizontal);
SPEED_8u_C1R(ippiMirror_8u_C1R, 1000, " w7", ippCpuSSE2, step8u, srcSize, ippAxsHorizontal);
SPEED_8u_C1R(ippiMirror_8u_C1R, 1000, " t7", ippCpuSSE3, step8u, srcSize, ippAxsHorizontal);
SPEED_8u_C1R(ippiMirror_8u_C1R, 1000, " v8", ippCpuSSSE3, step8u, srcSize, ippAxsHorizontal);


Since you brought it up, I guess you could try allocating the arrays using ippiMalloc, since you'd get 32-byte alignment. I'm curious how big an effect it would have, although it should not be the factor of 2 that you show in your numbers.

Peter

Quoting - pvonkaenel

Since you brought it up, I guess you could try allocating the arrays using ippiMalloc, since you'd get 32-byte alignment. I'm curious how big an effect it would have, although it should not be the factor of 2 that you show in your numbers.

Peter

I forgot to mention I'm running this on a C2D E4400. The problem doesn't occur on a Xeon 5520 running Windows XP x64 (so running 32-bit under WOW64), nor does it occur on the Xeon in x64.

I checked the alignment: memory was 8-byte aligned, not 16 or 32. I tried 32-byte alignment with no luck...

Here are the results with 32-byte alignment:

ippiCopy_8u_C1R
px : 2874015 clocks
px : 1441 us
w7 : 762926 clocks
w7 : 382 us
t7 : 818958 clocks
t7 : 411 us
v8 : 2117159 clocks
v8 : 1061 us

Quoting - matthieu.darbois

I forgot to mention I'm running this on a C2D E4400. The problem doesn't occur on a Xeon 5520 running Windows XP x64 (so running 32-bit under WOW64), nor does it occur on the Xeon in x64.

I checked the alignment: memory was 8-byte aligned, not 16 or 32. I tried 32-byte alignment with no luck...

Here are the results with 32-byte alignment:

ippiCopy_8u_C1R
px : 2874015 clocks
px : 1441 us
w7 : 762926 clocks
w7 : 382 us
t7 : 818958 clocks
t7 : 411 us
v8 : 2117159 clocks
v8 : 1061 us

The same seems to happen in x64 mode when running the SSE3 code instead of the SSE2 code (which is the default there).

Hi Matthieu,

Have you tried other, previous versions (e.g. 6.0.2?) to see if the same applies there?

I am (unfortunately) continually amazed, and worried, to hear about such things. It seems obvious to me that a unit-test should have caught this. There are similar situations where some function has been rewritten for perfectly valid reasons, and then a problem appears that was not there before; we as a community have experienced this numerous times. Proper unit-tests would have caught it, and they would also reduce the support time and resources needed. Yes, it takes more time and effort to write proper unit-tests, but it usually (in my view: always) pays off in the long run.

Intel: More unit-tests of both the functions and the samples, please, so we can all avoid these kinds of situations! (The unit-tests could even be part of the public package.)

- Jay

Quoting - j_miles
Hi Matthieu,

Have you tried other, previous versions (e.g. 6.0.2?) to see if the same applies there?

I am (unfortunately) continually amazed, and worried, to hear about such things. It seems obvious to me that a unit-test should have caught this. There are similar situations where some function has been rewritten for perfectly valid reasons, and then a problem appears that was not there before; we as a community have experienced this numerous times. Proper unit-tests would have caught it, and they would also reduce the support time and resources needed. Yes, it takes more time and effort to write proper unit-tests, but it usually (in my view: always) pays off in the long run.

Intel: More unit-tests of both the functions and the samples, please, so we can all avoid these kinds of situations! (The unit-tests could even be part of the public package.)

- Jay

Hi Jay,
I have not tried previous versions of IPP yet. I will certainly do so tomorrow and post the results.

I totally agree with you regarding unit-tests. In fact, I caught this while running unit-tests on functions that use both IPP and custom intrinsic/assembly code. I want to make sure that my so-called optimized code is in fact optimized. Not all functions make use of SSSE3, for example, but knowing that IPP might, I always copy the SSE2 or SSE3 code into the SSSE3 branch and verify that performance is not impacted. I first thought the problem was on my end, until I realized it affected every function calling ippiCopy. Time was wasted, and a simple test at Intel would have done the trick...

Regards,
Matthieu

Matthieu,

Thanks for your code. We will check the v8 and t7 code performance and provide you with more information.

Regards,
Chao

Quoting - Chao Y (Intel)

Matthieu,

Thanks for your code. We will check the v8 and t7 code performance and provide you with more information.

Regards,
Chao

Here are the test results with other IPP versions. This time the alignment is 32 bytes for both source and destination.

ippi : 5.3 Update 4 build 85.38
ippiCopy_8u_C1R
ippipxl.lib
px 2853282 clocks
px 1430 us
ippiw7l.lib
w7 745643 clocks
w7 374 us
ippit7l.lib
t7 750691 clocks
t7 376 us
ippiv8l.lib
v8 2150453 clocks
v8 1078 us

ippi : 6.0 build 167.26
ippiCopy_8u_C1R
ippipxl.lib
px 2877288 clocks
px 1442 us
ippiw7l.lib
w7 743875 clocks
w7 373 us
ippit7l.lib
t7 788125 clocks
t7 395 us
ippiv8l.lib
v8 2095032 clocks
v8 1050 us

ippi : 6.0 Update 1 build 167.32
ippiCopy_8u_C1R
ippipxl.lib
px 2875982 clocks
px 1442 us
ippiw7l.lib
w7 821651 clocks
w7 412 us
ippit7l.lib
t7 893756 clocks
t7 448 us
ippiv8l.lib
v8 2183061 clocks
v8 1094 us

ippi : 6.0 Update 2 build 167.41
ippiCopy_8u_C1R
ippipxl.lib
px 2973393 clocks
px 1490 us
ippiw7l.lib
w7 800967 clocks
w7 401 us
ippit7l.lib
t7 818101 clocks
t7 410 us
ippiv8l.lib
v8 2152733 clocks
v8 1079 us

ippi : 6.1 build 137.16
ippiCopy_8u_C1R
ippipxl.lib
px 3037524 clocks
px 1523 us
ippiw7l.lib
w7 813190 clocks
w7 408 us
ippit7l.lib
t7 821024 clocks
t7 412 us
ippiv8l.lib
v8 1993940 clocks
v8 1000 us

ippi : 6.1 build 137.20
ippiCopy_8u_C1R
ippipxl.lib
px 2888991 clocks
px 1449 us
ippiw7l.lib
w7 744724 clocks
w7 373 us
ippit7l.lib
t7 752056 clocks
t7 377 us
ippiv8l.lib
v8 2155080 clocks
v8 1081 us

All these tests were run on a C2D E4400, Windows XP SP3 x86.
The code to run these tests is attached to this post.
As you can see, nothing has changed from IPP 5.3.4 up to 6.1.1.

Regards,
Matthieu

Attachments:
IPPTest.cpp (3.77 KB)

Instead of copying the same image a thousand times, could you try copying a larger buffer? I bet v8 with SSSE3 uses non-temporal (NTA) streaming loads, thus copying the data without using the CPU cache.

If you're using ippiCopy for something that fits in your L2 cache, copies will be very fast after the first time, but if the data exceeds the cache size, the copy will constantly thrash the whole cache, giving NTA the advantage. If that's the case, it might make sense for ippiCopy to take an extra argument, or for Intel to add an ippiStream.

Hello,

We provide a special function, ippiCopyManaged, whose flags parameter lets you specify that the cache should not be used:

/* ////////////////////////////////////////////////////////////////////////////
// Name: ippiCopyManaged
//
// Purpose: copy pixel values from the source image to the destination image
//
//
// Returns:
// ippStsNullPtrErr One of the pointers is NULL
// ippStsSizeErr roiSize has a field with zero or negative value
// ippStsNoErr OK
//
// Parameters:
// pSrc Pointer to the source image buffer
// srcStep Step in bytes through the source image buffer
// pDst Pointer to the destination image buffer
// dstStep Step in bytes through the destination image buffer
// roiSize Size of the ROI
// flags The logic sum of tags sets type of copying.
// (IPP_TEMPORAL_COPY,IPP_NONTEMPORAL_STORE etc.)
*/

IPPAPI( IppStatus, ippiCopyManaged_8u_C1R,
( const Ipp8u* pSrc, int srcStep,
Ipp8u* pDst, int dstStep,
IppiSize roiSize, int flags ))

Regards,
Vladimir

Quoting - Chao Y (Intel)

Matthieu,

Thanks for your code. We will check the v8 and t7 code performance and provide you with more information.

Regards,
Chao

Hello,

We checked your test code for ippiCopy_8u_C1R performance.
ippiCopy_8u_C1R has a threshold at which it starts to use non-temporal stores: (src_len + dst_len) >= L2 size. The threshold for w7 and t7 is 2MB; for v8 it is 4MB (it was tuned for a Merom system). The E4400 has a 2MB L2, so in this case t7 (1280*1024*2 = 2.6MB) uses non-temporal stores but v8 does not. This can be solved by changing the v8 threshold in the IPP function. We are going to fix it in a future release.

You can also check the function suggested by Vladimir to control whether the data should be filled into the cache.

Thanks,
Chao

Quoting - Chao Y (Intel)

Hello,

We checked your test code for ippiCopy_8u_C1R performance.
ippiCopy_8u_C1R has a threshold at which it starts to use non-temporal stores: (src_len + dst_len) >= L2 size. The threshold for w7 and t7 is 2MB; for v8 it is 4MB (it was tuned for a Merom system). The E4400 has a 2MB L2, so in this case t7 (1280*1024*2 = 2.6MB) uses non-temporal stores but v8 does not. This can be solved by changing the v8 threshold in the IPP function. We are going to fix it in a future release.

You can also check the function suggested by Vladimir to control whether the data should be filled into the cache.

Thanks,
Chao

Thanks for the clarification. I didn't think ippiCopy used NT stores (I thought only ippiCopyManaged provided that option), so it all makes sense now. But, as you said, the threshold should be 2MB and not 4MB for v8.
Do all IPP functions work like this? What would you recommend for benchmarking? My first thought was to make all the data fit into cache, so that the measurement covers only processing and not memory transfer. (Although that wouldn't be very smart for copy, which is just memory transfer...)
As for the NT loads suggested above, I don't think they were implemented until Penryn, and even then they wouldn't work with ordinary RAM, which isn't memory of the right kind. (Correct me if I'm wrong; I know little about WC and WB memory.)

Regards,
Matthieu

Quoting - matthieu.darbois

Thanks for the clarification. I didn't think ippiCopy used NT stores (I thought only ippiCopyManaged provided that option), so it all makes sense now. But, as you said, the threshold should be 2MB and not 4MB for v8.
Do all IPP functions work like this? What would you recommend for benchmarking? My first thought was to make all the data fit into cache, so that the measurement covers only processing and not memory transfer. (Although that wouldn't be very smart for copy, which is just memory transfer...)
As for the NT loads suggested above, I don't think they were implemented until Penryn, and even then they wouldn't work with ordinary RAM, which isn't memory of the right kind. (Correct me if I'm wrong; I know little about WC and WB memory.)

Regards,
Matthieu

Hi,

this is a quote from the ippi manual: "When flags is set to IPP_TEMPORAL_COPY, the function is identical to the function ippiCopy_8u_C1R". That's why I thought ippiCopy didn't use NT stores (I knew I had read it somewhere...). Maybe it would be good to add the information you gave in this post to the manual.

Regards,
Matthieu

Quoting - matthieu.darbois

Thanks for the clarification. I didn't think ippiCopy used NT stores (I thought only ippiCopyManaged provided that option), so it all makes sense now. But, as you said, the threshold should be 2MB and not 4MB for v8.
Do all IPP functions work like this? What would you recommend for benchmarking? My first thought was to make all the data fit into cache, so that the measurement covers only processing and not memory transfer. (Although that wouldn't be very smart for copy, which is just memory transfer...)
As for the NT loads suggested above, I don't think they were implemented until Penryn, and even then they wouldn't work with ordinary RAM, which isn't memory of the right kind. (Correct me if I'm wrong; I know little about WC and WB memory.)

Regards,
Matthieu

This behaviour applies only to ippiCopy. If the dst image will be used soon, a temporal store is helpful. But if you expect the data not to be used in the near future, you can use a non-temporal store, which will not contaminate the cache.

I agree we can improve the documentation on this point. Thank you for your feedback.

Thanks,
Chao

Quoting - Chao Y (Intel)

Hello,

We checked your test code for ippiCopy_8u_C1R performance.
ippiCopy_8u_C1R has a threshold at which it starts to use non-temporal stores: (src_len + dst_len) >= L2 size. The threshold for w7 and t7 is 2MB; for v8 it is 4MB (it was tuned for a Merom system). The E4400 has a 2MB L2, so in this case t7 (1280*1024*2 = 2.6MB) uses non-temporal stores but v8 does not. This can be solved by changing the v8 threshold in the IPP function. We are going to fix it in a future release.

You can also check the function suggested by Vladimir to control whether the data should be filled into the cache.

Thanks,
Chao

This problem is fixed in the latest IPP 6.1 update 3.
