IPP 7.0 Beta: SSE2 on IPP 7.0

Hi,

You state that the 32-bit SSE2 optimization layer (w7) and the SSE3 optimization layers (t7/m7) have been removed, but also state that the base 32-bit optimization layer of the library (px) has been compiled for higher performance and now requires a processor that conforms to the SSE2 processor architecture.

So does that mean that CPUs that only support SSE2 are still supported with the same performance as with IPP 6.1, or, to support these processors, which are still widely used, do we need to stick with IPP 6.1?

Thanks

Jonathan

Hello Jonathan,

In the 7.0 beta 32-bit version of the library the SSE2 and SSE3 optimizations have been removed. In the 64-bit version of the library the SSE3 optimization has been removed (there never was an SSE2 optimization layer). The base layer of the 32-bit library (px) has been compiled for SSE2, so SSE2 optimization in that layer is being provided by the compiler. This is consistent with the previously existing situation in the 64-bit version of the library (the "mx" layer).

You may see a reduction in performance for some primitives on SSE2 processors compared to the 6.x version of the library when running 32-bit code. How much reduction, if any, you experience depends on your application and the number and frequency of the IPP functions you use. Some functions will see little or no reduction in performance.

Other than the base level optimizations ("px" and "mx"), the lowest level of optimization in both the 32-bit and 64-bit versions of the library is now tuned for SSSE3, which corresponds to the Atom and Core 2 processors.

Your feedback regarding target processor platforms for 2011 and beyond is welcome.

Regards,

Paul

Hi Paul,

Isn't that change intentionally crippling all AMD processors? A bleeding edge Phenom II hexa-core CPU is detected as ippCpuSSE3 and you're saying this hand-tuned code path was removed in favor of a compiler-generated SSE2 code path. All 64-bit AMD processors support only SSE2 and SSE3 and there is no more hand-tuned code for them in the newest IPP.

We use IPP so our software works best on the latest Intel processors, but supporting the latest Intel offerings shouldn't so drastically reduce performance for older or competitive offerings. This leaves us with the option of making slower processors even slower, or making faster processors (SSE4, AVX) even faster. Slower processors need the speed boost the most so we're stuck with IPP 6, meaning we'll never be able to encourage the adoption of newer Intel processors through competitive performance-enhancing instruction sets.

Couldn't you merge the existing hand-tuned SSE2/SSE3 code into the px/mx baseline layer to at least let older/other processors perform at their best? Until now, fairness in IPP performance on all x86 processors is what kept us investing in IPP development. It was kind of nice knowing the library would get the most out of Nehalem while also getting the most out of a VIA C7 or any other 3rd-party CPU supporting only SSE3. The lowest level optimization (SSSE3) is now only supported by Intel processors. :(

Thanks for your feedback regarding SSE2 and SSE3 support. I will forward it to the appropriate individuals.

Can you tell me what sort of IPP applications you are creating, and what target platforms you will build IPP applications for in the 2011 time frame and beyond?

Please reply with a private thread if you do not wish to share such information publicly.

Paul

Since my end users are using any mix of processors, from any source (Dell, street shops, A-brands), I strongly disagree with any decision to drop IPP support for SSE2 and SSE3, as that would partly remove the advantage of using IPP at all.

I understand that Intel must be careful to not drag legacy code into the future, but that is why PX exists. PX should support legacy, and some other (SX?) should support SSE2 and SSE3.

Hi Paul

Thank you for the clarification, but that's not what I wanted to hear!

If we purchase 7.0 can we still legally use 6.1?

We need to support clients with older CPUs, and as another poster has said, they're the ones who need the greatest performance benefit! IMO the best compromise would be to make the base level (px) the SSE2 optimised code rather than rely on the compiler's optimised version. This would appear to maintain the same level of CPU support but with maximal optimisation for all CPUs.

There are still an awful lot of P4 class CPUs out there unfortunately and we're using the IPP to extract the absolute best of them, and think that you need to support them for a little while longer!

Regards

Jonathan

Paul,

We are creating image processing and video coding applications with IPP, although we really use almost all libraries (ippi, ipps, ippvc, ippsc, ippj, ippdc, etc.). Our target platforms are whatever mainstream customers still use. That means an overwhelmingly large proportion of SSE3-only processors, even among Intel's own processors. While we do ship bleeding-edge Nehalem Xeon systems to some customers, a quick look in our engineering department reveals 99% support for SSE3 but only roughly a quarter with SSE4 support, despite all processors being at least dual-core. We do test and optimize our software evenly on AMD processors with lots of Opterons and Phenoms around and we would really appreciate even performance improvements rather than degradation. Regarding our 2011 targets, please note AMD will still be shipping brand new SSE3-only processors. SSE4a doesn't count.

There was already a big scandal in the past over the Intel compilers generating optimized code paths only for Intel CPUs. In fact, it was settled only 6 months ago.

2.3 TECHNICAL PRACTICES

Intel shall not include any Artificial Performance Impairment in any Intel product or require any Third Party to include an Artificial Performance Impairment in the Third Party's product. As used in this Section 2.3, "Artificial Performance Impairment" means an affirmative engineering or design action by Intel (but not a failure to act) that (i) degrades the performance or operation of a Specified AMD product, (ii) is not a consequence of an Intel Product Benefit and (iii) is made intentionally to degrade the performance or operation of a Specified AMD Product. For purposes of this Section 2.3, "Product Benefit" shall mean any benefit, advantage, or improvement in terms of performance, operation, price, cost, manufacturability, reliability, compatibility, or ability to operate or enhance the operation of another product.

http://download.intel.com/pressroom/legal/AMD_settlement_agreement.pdf

Removing SSE3 optimizations is not a failure to act but would be seen as a design action reducing 3rd-party performance, so I'm really hoping this was done to reduce library size rather than to give Intel an unfair advantage, as this would mistakenly hurt your own IPP customers. Intel already has the performance crown regardless, so please bring optimized code paths for weaker SSE2/SSE3 processors in the non-beta release.

P.S.: I don't work for AMD. I am just a 3rd-party engineer with no CPU bias whatsoever, as IPP should be.

Wow, sneaky! I think not providing those optimizations would fall under the category of "failure to act" since they're not actively checking to see if they're running on an AMD and turning stuff off. They're degrading performance on their own processors as well. IANAL however. If I'm interpreting this correctly, it also sounds like there will be a performance hit if you're using MS's compiler since I don't think it injects SSE instructions for anything other than doing division. I could be wrong though. Dirty, dirty, dirty.

If this goes through and we measure a performance hit on those platforms, I doubt we'll move from 6.1 to 7.0. If we do, we'll at least try to limit our usage of IPP to the bare minimum.

-Mark

Intel does not sell SSE3 processors anymore. I do not think any company has a legal obligation to support end-of-lifed products. Otherwise we just would not be able to deliver new technologies like Westmere or AVX processors (which are coming soon).

The functionality you are looking for will still be available in the IPP 6.1 product.

By the way, performance-oriented customers are migrating to the newest platforms. I personally would not consider those who use old or even end-of-lifed platforms to be performance-oriented customers. If they do not care about performance, why should anyone else?

Regards,
Vladimir

Hello Jonathan,

Yes, you can still use your 6.1 product, even if you purchase the 7.0 product. Once you have purchased the product there is no expiration on your use of that product to build and distribute applications. The expiration of a development license impacts your ability to get access to upgrades and prior versions of the product from our download site, and your access to premier support; a development license expiration does not impact your ability to build or distribute products based on old versions of the library.

I will forward your concerns about the changes in the optimization layers to the appropriate managers.

Regards,

Paul

Paul, I just benchmarked IPP 6.1 vs 7.0 to quantify the impact of the potential performance loss you mentioned. I used the H.264 decoder sample as it's a pretty complete code base using multiple IPP primitives.

Intel Xeon X5560 (Nehalem, SSE4.2) : 4% faster
AMD Phenom II X6 1055T (Thuban, SSE3) : 431% slower! (mx vs m7)

Since many additions to IPP 7.0 were results of our feature requests, we would hate to be stuck with 6.1, but such degradation on modern, high-end competing hardware would leave us with no choice. I know you do not sell SSE3 parts anymore, but for your library to be viable in the real world, it must extract the best of the latest Intel parts without crippling the rest. Consider that those SSE3 parts will still be around in 5 years, and as an ISV we want our software to be competitive on them. If IPP offers such pitiful performance on anything but Intel i7, then IPP 7.0 will see no meaningful adoption outside of technophile walls and we'll all be stuck in 2009, not taking advantage of AVX.

Isn't IPP supposed to be win-win for Intel and ISV's? We used to write our own SSE code and wouldn't initially bother with AVX due to limited market share. Now with IPP we could, but we won't since that'd be shooting ourselves in the foot. Sure there's some AVX support in 6.1, but just replace AVX by whatever comes next in 7.1.

Thank you for considering our concerns.

Dear Customers,

Thanks for your input here.

We are almost certainly raising more fears than needed here. You are technically correct that the changes may reduce performance on some older-generation processors, including processors sold by Intel. That's not what we expect, and that's not why we made the change. We made the change to reduce the size of IPP, which had become a concern that we needed to deal with, and we did reduce IPP's size with this release. Size was a significant challenge, and one we decided to address. Of course, we will happily make the 6.1 version available if needed, but that is not a good long-term solution for any of us if that turns out to be necessary. We would benefit from feedback if the changes cause actual reductions in performance, and would appreciate hearing about it if that happens in practice for you. We know it is theoretically possible to see reductions, but we made a choice that reducing the size and complexity of IPP in future releases was more real and important. If we've erred for any of your applications, please help us understand the details and we'll revisit our decision.

Thanks,
Ying

1. I cannot agree with "changes may reduce performance on some older generation processors".
A Phenom X4 is not an older generation processor. It is a current generation processor that does not have SSE4, so it can only use SSE3. By removing the hand-optimized SSE3 library, you cripple end-users with Phenom processors.

2. I fully agree that the IPP size is something to improve. 275MB for IPP 6.1.5 is a lot of bytes for a graphics library. I have personally attacked this problem in another way: by removing unused IPP functions and compiling my own custom set of DLL files:
Ipp.dll (main library; detects the CPU, loads the specific DLL, and also contains a bunch of library code, such as IPP core, JPEG, etc.)
Ipp_gen.dll
Ipp_sse2.dll
Ipp_sse3.dll
Ipp_sse41.dll
Ipp_ssse3.dll

The total size is 23.2MB, and the loaded size is 7.5MB (main DLL + CPU DLL).
This is very manageable, both for the setup and for the loading time.

It took a lot of hard work to get this setup working.
I ask Intel to create a new framework that implements this, of course with OMP support.
I can imagine an EXE file that scans ippi.h and displays all functions, and keeps a config file where it saves the checkbox selections of all chosen functions. It also contains a Build button that generates proper make files and then optionally calls the Intel or MS compiler to build the DLL files.

If I could do it, Intel can do it also.
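The layout described above (one main DLL that detects the CPU and then binds to a CPU-specific implementation) boils down to a table of function pointers filled in once at startup. Here is a minimal, hypothetical C sketch of that idea: `my_add`, `dispatch_init`, `add_generic` and `add_sse2` are illustrative names, not IPP functions, and both variants are plain C so the example compiles anywhere. In the real setup each variant would live in its own DLL (Ipp_gen.dll, Ipp_sse2.dll, etc.) and be loaded with the platform's loader instead of being linked in statically.

```c
#include <stddef.h>

/* Hypothetical signature shared by every per-CPU variant. */
typedef void (*add_fn)(const float *a, const float *b, float *dst, size_t n);

/* Stand-in for code that would live in a generic DLL. */
static void add_generic(const float *a, const float *b, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Stand-in for code that would live in an SSE2 DLL; a real build would
 * use SSE2 intrinsics here. Plain C keeps the sketch portable. */
static void add_sse2(const float *a, const float *b, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

typedef enum { LEVEL_GENERIC, LEVEL_SSE2 } cpu_level;

/* The dispatch "table": one pointer per exported function. */
static add_fn g_add = NULL;

/* Called once at startup, after CPU detection, to bind the table. */
void dispatch_init(cpu_level level)
{
    g_add = (level >= LEVEL_SSE2) ? add_sse2 : add_generic;
}

/* Public entry point: callers never see which variant runs. */
void my_add(const float *a, const float *b, float *dst, size_t n)
{
    if (g_add == NULL)
        dispatch_init(LEVEL_GENERIC); /* safe fallback if init was skipped */
    g_add(a, b, dst, n);
}
```

Keeping the public entry point separate from the bound pointer means only the main DLL ever needs to know which CPU-specific DLL is loaded.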

I strongly disagree with this change as well, I was really surprised to see that the next version is supposed to require at least SSE3 (assuming that no SSE3 = generic, FPU-based layer?).
Edit: wait, I read it's worse, is it really

We do mainstream software (I would even say a little higher than mainstream, as it's music sequencing/audio processing), and I can tell you that new instructions take A LOT of time to reach everyone. It's only last year that we started requiring SSE1 for our software (and we still have a few users who can't run it). I don't think we will dare to require SSE2 for a few more years, so SSE3?

I also don't understand this layer system. SSE2 brought important new instructions, but SSE3 & 4 are kinda marginal. Shouldn't 3/4 of the functions in IPP be able to use SSE1 & 2 instructions only? I know I never needed SSE3 for my own code. So I'm pretty sure that in your SSE3 layer, most of the functions use SSE1 & 2 instruction only, but they're in the SSE3 layer because it's a "one layer for all functions" system?

Microsoft too seems to have no clue (or doesn't care) about the mainstream market, they keep introducing APIs that (on top of being Windows-only, but that's obvious) are restricted to the latest OS, not understanding that by the time they become usable, they're already obsolete.

If I understand correctly:

SSE2 is now minimum required CPU. Less than this, IPP 7 will not work at all.
SSE2 is placed in PX, only with compiler optimization for SSE2, not with hand-optimized code for SSE2.
SSE3 is not used in IPP 7.

If hand-optimized code is required (and of course it is required!), you need an SSE4 or AVX CPU, because the only hand-optimized libraries are V8 (SSE4) and G9 (AVX).

Intels' choice means that ONLY SSE4 and higher will achieve highest performance.

I strongly hope they at least put back the SSE3 library (T7).

I also hope they read and implement my text earlier in this topic about reducing the size of IPP.

That's not completely right. Let me try to explain more details on cpu-specific code available in IPP 7.0 beta:

PX library (32-bit generic code) supports the SSE2 instruction set. That is applicable to the Intel Pentium 4 processor family.
MX library (64-bit generic code) supports the SSE3 instruction set. The reason is that SSE3 was the minimal instruction set for processors supporting the Intel 64 architecture. That is applicable to the processor family code-named Prescott (brand name Intel Pentium 4 processor with Hyper-Threading Technology).

V8 (32-bit) and U8 (64-bit) libraries support SSSE3 (note the additional 'S'), basically optimized for the Intel Core 2 processor family (code-named Merom architecture).
P8 (32-bit) and Y8 (64-bit) libraries support the SSE4.x instruction set, introduced in the processors code-named Penryn, Nehalem and Westmere (branded mostly as Intel Core iX processors, where X might be 3, 5 or 7).
G9 (32-bit) and E9 (64-bit) libraries support the AVX instruction set.

In dynamic libraries there are also S8 (32-bit) and N8 (64-bit) variants, which contain Atom-specific code.

Regards,
Vladimir
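To map the list above onto a given machine, the selection rules can be written down as a small C function. This is a minimal sketch of the rules as stated in this post, not IPP's actual dispatcher, and it uses no IPP API: the `cpu_features` struct and `ipp70_beta_layer` function are hypothetical, the flags are assumed to have been filled in from CPUID elsewhere, SSE4.1 is assumed to be the threshold for P8/Y8, and a CPU reporting a newer extension is assumed to have the older ones too.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of CPU features. In a real program the flags
 * would be filled in from CPUID; this struct is not part of the IPP API. */
typedef struct {
    bool sse2, sse3, ssse3, sse41, avx;
    bool is_atom;   /* Atom core, eligible for the Atom-specific s8/n8 code */
    bool is_64bit;  /* choose among the Intel 64 layers */
} cpu_features;

/* Returns the IPP 7.0 beta layer prefix that matches the feature set,
 * following the list in the post above, or NULL when the CPU lacks the
 * baseline (SSE2 for 32-bit px, SSE3 for 64-bit mx). */
const char *ipp70_beta_layer(const cpu_features *f, bool dynamic_lib)
{
    if (f->is_64bit) {
        if (f->avx)   return "e9";
        if (f->sse41) return "y8";
        if (f->ssse3) return (dynamic_lib && f->is_atom) ? "n8" : "u8";
        if (f->sse3)  return "mx";
        return NULL;  /* e.g. an SSE2-only 64-bit Opteron */
    }
    if (f->avx)   return "g9";
    if (f->sse41) return "p8";
    if (f->ssse3) return (dynamic_lib && f->is_atom) ? "s8" : "v8";
    if (f->sse2)  return "px";  /* no w7/t7 layers in the beta */
    return NULL;                /* e.g. Pentium III */
}
```

Under these rules an SSE3-only Phenom II lands in px (32-bit) or mx (64-bit), i.e. the compiler-generated layers whose performance is discussed in this thread.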

Wow, now I'm confused again. PX and MX exist in IPP 7, right? And then the kicker is that instead of containing SSE2 and SSE3 instructions written by a human, it's done by the compiler, correct? Does IPP just crash and burn if you try to run it on a P3 or does it have a way of reverting to non-SSE code?

-Mark

Vladimir, the main problem with the IPP 7.0 beta is that PX and MX offer a 4x performance decrease compared to previously hand-tuned SSE2/SSE3 code paths (W7, T7, M7) at this time. The compiler auto-vectorization isn't clever enough for all primitives.

That is a substantial regression affecting over half of the CPU market, therefore over half of our own customers. There is no way we can upgrade to IPP 7 under such conditions.

The previous suggestion about library size reduction is a good one. We already use static libraries for that reason, but you should make an easy tool that creates custom libraries/DLLs including X functions/domains with variants for Y requested optimization layers (SSE2, SSE3, SSSE3, SSE4, AVX). Honestly, at this point I would only include SSE2, SSE4 and AVX layers and perfectly cover the whole market (from P4 to i7 and beyond).

Also, if you could merge W7 & T7 into PX and M7 into MX, we wouldn't have to deal with performance regressions in the first place. That might be acceptable for everyone.

That's correct: PX and MX are compiler-generated code, with autovectorization done by the compiler where possible using the SSE2 or SSE3 instruction set (not earlier ones). The hand-tuned SSE2 and SSE3 code was removed in IPP 7.0.

And yes, pre Pentium 4 processors are not supported by IPP 7.0 (including Pentium III processor). There will be invalid instruction exception when you run IPP 7.0 based application on Pentium III or earlier processors.

Vladimir
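Since the 32-bit px layer in the 7.0 beta raises an invalid instruction exception on pre-SSE2 CPUs, an application that must still start on a Pentium III can check for SSE2 itself before calling into the library. A minimal sketch, assuming GCC or Clang on x86 (it uses the compiler's `<cpuid.h>` helper `__get_cpuid`); `edx_has_sse2` and `cpu_has_sse2` are hypothetical names, and the fallback path is left to the application. SSE2 support is reported in bit 26 of EDX for CPUID leaf 1.

```c
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_GNU_CPUID 1
#endif

/* CPUID leaf 1, EDX bit 26 reports SSE2 (bit 25 reports SSE). */
#define CPUID1_EDX_SSE2 (1u << 26)

/* Pure decoding step, separated out so it can be checked without CPUID. */
bool edx_has_sse2(unsigned int edx)
{
    return (edx & CPUID1_EDX_SSE2) != 0;
}

/* Queries the running CPU. Returns false when CPUID leaf 1 is not
 * available, or when not built with GCC/Clang for x86. */
bool cpu_has_sse2(void)
{
#ifdef HAVE_GNU_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_sse2(edx);
#else
    return false;
#endif
}
```

An application would call `cpu_has_sse2()` once at startup and route work to its own C code (or to a build linked against IPP 6.1) when it returns false, instead of letting the first IPP 7.0 call fault.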

Vlad, it would be preferable if MX only included SSE2 code, rather than SSE3. Before Intel adopted AMD-64, there were AMD Athlon processors with SSE2 only. This would cause them to crash.
http://en.wikipedia.org/wiki/Opteron#Opteron_.28130_nm_SOI.29

Since SSE3 really only adds LDDQU, it would be better left in M7, which hopefully comes back!

Vladimir, would it be possible to replace the compiler-generated SSE2/SSE3 by the previously hand-tuned code? Wouldn't it result in smaller libraries and higher performance?

I would not agree with excluding SSSE3 optimization, as in my view Intel Core 2 based systems are the majority right now, with a quickly growing number of Intel Core i7 based systems. That said, it is not my level of decision. As we pointed out earlier in this thread, we will gather your feedback and review the decision made for IPP 7.0, taking into account the many different factors that contribute to the way we decide to go.

Vladimir

Jonathan,

IPP 7.0 does support SSE2/SSE3-capable processors through compiler-generated code instead of hand-tuned code as in IPP 6.1. We are analysing the performance difference between IPP 6.1 and IPP 7.0 when running on a Pentium 4 processor, and it is a mixed picture. Some functions degraded, while others got a substantial performance gain.

Regards,
Vladimir

It is clear, that if IPP 7 does not support high-performance AMD Phenom (max SSE3), then IPP 7 is a useless product to me.

It eats at me that IPL/IPP could be used efficiently for graphics tasks on almost any processor, and now only on recent Intel processors.

If I understand correctly:

SSE2 is now minimum required CPU. Less than this, IPP 7 will not work at all.

Really? So there's no more "generic, guaranteed to work on everything" layer? Well, that's totally useless to me. I still write a generic version of all of my SSE2 code, so I'm not going to use a library that forces us to add SSE2 to the requirements (which, again, isn't realistic for the slow-moving mainstream market).

& what's with replacing hand-written code with compiler code? Isn't speed the whole point of using IPP?

IPP 7.0 does support SSE2/SSE3-capable processors through compiler-generated code instead of hand-tuned code as in IPP 6.1. We are analysing the performance difference between IPP 6.1 and IPP 7.0 when running on a Pentium 4 processor, and it is a mixed picture. Some functions degraded, while others got a substantial performance gain.

Why not just keep the functions that brought a gain, & keep the handwritten code that was faster?

I mean, sometimes I spend a lot of time writing an SSE version of FPU stuff, only to realize it's slower. I just bite the bullet & keep the FPU version..

More, if IPP was "open", you could make challenges for users to suggest faster code for existing functions. Of course I'm not sure I would like it myself, because that would probably compromise safety/stability (but hey, IPP isn't really bug-free either right now). But really, making a library that was partly generated by a compiler, I don't see the point, unless that compiler is more clever than humans.

If you want open source, maybe it's time to consider AMD's performance libraries. (http://developer.amd.com/cpu/libraries/Pages/default.aspx) The Framewave project is all open source. It doesn't look like it has nearly the same amount of functionality as IPP but if you're like me and are only using IPP's image processing stuff, it appears that Framewave might have you covered. I know I'm going to evaluate it in light of what's going on with IPP 7.

-Mark

I was eagerly waiting for the 7.0 version to get some bug fixes and feature requests, but I am very disappointed that there is a performance penalty in using the new library for many of our customers! Especially since we are using IPP for video coding and decoding, which needs better optimization.
We do suggest to our customers to use the latest processors, but many of them for the time being will not change their systems.

Dear Customers,

Thank you for offering us your insights into the performance benefits and concerns you have seen regarding our next-generation Intel IPP libraries, and the processors on which you have seen them. Note that one of the main reasons we released this beta is to capture this valuable feedback and have the opportunity to react to it before the product release. We appreciate your use of Intel IPP, and want you to keep using it (for the reasons you chose it in the first place). We also appreciate some of the suggestions you have offered to help us address concerns associated with the size of IPP. We are in the midst of reviewing this and other input we have received regarding the beta release, and are actively exploring our options. Our goal is to offer a product that best addresses areas you have asked us to change, while keeping the features/optimizations you need most. We'll plan to send a summary of the path we decide to take by the middle of August. Thanks again for all of the valuable feedback you are sending us.

Sincerely,
Ying Song
Intel Corp.

I came to this forum to see if there's a way to run IPP 5.3 alongside 6.x so I can compile an SSE(1)-optimized version of my code. People still seem to be using such old equipment!

I'm working on audio processing, which often runs on a PC that does nothing else. Many people use a very old PC for this for example to run a webradio station. I know that my software is being used for several FM community stations in Africa as well. The faster things run, the more (real-time) processing they can enable, and the more they can increase quality settings.

I dropped non-SSE support last year because non-SSE code is really way too slow (and I don't think there'll be many non-SSE PC's in use anymore). But dropping (or reducing the speed of) SSE2 at this moment is WAY too soon for me - I wouldn't even consider that for the next 4 years or so...

What is the status on this? Are there any plans to reintroduce SSE2/3 support in IPP 7.0, or should we just ignore this release and stick to 6.1 for the next years?

Dear Customers,

Thanks again for your input. We will provide a summary of the path we decide to take in the upcoming Intel IPP 7.0 product release in a few weeks. In the meantime, please continue to participate in the Intel IPP 7.0 beta program to explore other features and provide us your feedback via the survey.

Sincerely,
Ying Song
Intel Corp.

Hi Ying,

You mention "upcoming Intel IPP 7.0 product release" - is that a new beta, or the final release (which I would think was way too soon)?

I would also like to voice our concerns over the change in optimizations in IPP 7.0. I can only echo what all other concerned users have been stating: The base "px" 32-bit layer should run on any ia32/x86 compatible processor. In addition, the "hand-tuned" versions of the SSE2 and SSE3 should be kept.
I am, by the way, amazed (and a bit worried about the general quality) that the performance of some parts of the previous IPP releases could be further improved by a simple recompilation! That would seem like a natural step in any tuning process, i.e. to check whether a human-written version is faster at all than a machine-generated (compiled) version...?!

  • Size is (almost) a non-issue for us. We use the static libraries. Basing the choice on the size of the dynamic libraries alone seems quite limited.
  • There are new features in 7.0 that will be important for us, but the degradation in performance will likely prevent us from upgrading, so we would lose out on the performance improvements that will indeed be there for brand-new (and upcoming) processors.
  • Our customers are using many different platforms and configurations. Due to the nature of the installations and our products (24/7/365 operations), many are not keen on changing their systems too often. Performance is indeed an issue, but try convincing a customer that an upgrade of a product (i.e. to get some new feature) on a system that was performing well will result in a dramatic degradation in performance. That simply will not be acceptable.
  • It would be nice to know these things a significant time ahead (years) so that migration plans can be made in advance. The time elapsed between the last production of a specific processor type (and not only Intel processors) and when the change occurs should be much longer. The time span was definitely longer when the Pentium III optimizations were removed (although the heads-up should have come earlier).

Thanks for listening.

- Jay

To our valued Intel IPP Customers,

Thank you very much for providing valuable Intel IPP 7.0 beta feedback regarding SSE2/SSE3 support. Based on your input, we are adjusting our plan for the upcoming Intel IPP 7.0 product releases as follows:

For the Intel IPP 7.0 product release, currently scheduled for release at the end of this year:

- We plan to include the w7* library (for IA-32 Architecture), which was hand-optimized for processors utilizing the SSE2 instruction set (e.g. Intel Pentium 4). The w7 library will also support the SSE3 instruction set.

- We plan to include the m7* libraries (for Intel 64 Architecture), which were hand-optimized for processors utilizing the SSE3 instruction set (e.g. Intel Pentium 4 Processor supporting HT Technology)

- The t7* library (for IA-32 Architecture) which was hand-optimized for processors utilizing the SSE3 instruction set (e.g. Intel Pentium 4 Processor supporting HT Technology) will not be part of the product release.

Soon after the Intel IPP 7.0 product release, we expect to have the px* (C-optimized for all IA-32 processors) and mx* (C-optimized for all Intel 64 platforms) libraries offered as a separate download to support processors utilizing pre-SSE2 instruction sets (e.g. Intel Pentium III).

All products, computer systems, specifications, functionality descriptions, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Do not finalize a design with this information.

Thanks again for your time and effort in helping us best serve your needs. Please also check out other new features offered in the latest IPP 7.0 beta and provide any additional feedback you may have.

Sincerely,

The Intel IPP team

*Check this article for more reference

That's great news Ying. Thanks for taking our feedback into consideration.
We will eagerly be awaiting the next beta update.

Wonderful! Thank you so much! It's great to see that you're listening to your customers.

Just wanted to say thank you!

Hi Ying,

Now that the final 7.0 is available, I am curious as to when the separate px/mx libraries will be available?

Thanks.

- Jay

Hi Jay,

The separate px/mx libraries will be available in the next release update later this month. Please stay tuned.

Thanks,
Ying

Dear Customers,

As you may have noticed, the Intel IPP px/mx libraries are now available in the latest Intel IPP 7.0.1a release. These libraries are separate downloads; please go to the Intel Registration Center to get them.

Please note that this update release only adds the px/mx libraries. If you have installed Intel IPP 7.0 and do not need these libraries, you can skip this release update. Our next release update, with more feature enhancements and bug fixes, is scheduled for early Q1 2011.

Thanks,

Ying

It seems that IPP 8.0 also crashes on AMD Phenom II CPU.

Why is that?

Hi Royi,

let's continue the discussion on this in the other thread you've created; I've already asked you several questions there.

Regards, Igor
