Sandy Bridge: SSE performance and AVX gather/scatter

Hi all,

I'm curious how the two symmetric 128-bit vector units on Sandy Bridge affect SSE performance. What is the peak throughput, and the sustainable throughput, for legacy SSE instructions?

I also wonder when parallel gather/scatter instructions will finally be supported. AVX is great in theory, but in practice parallelizing a loop requires the ability to load/store elements from (slightly) divergent memory locations. Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck, which partially cancels its theoretical benefits.
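To make the pattern concrete, here is a minimal scalar sketch of the kind of loop meant here (all names are illustrative, not from any real codebase): every iteration is independent, so it vectorizes trivially, except that the indexed load would require a gather when done eight elements at a time.

```cpp
#include <cassert>
#include <cstddef>

// A loop that is embarrassingly parallel except for the indexed load:
// an 8-wide AVX version needs src[idx[i]] for eight different i at once,
// i.e. eight loads from (slightly) divergent addresses - a gather.
void scale_indexed(float* dst, const float* src, const int* idx,
                   float s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = s * src[idx[i]];   // the part that needs gather support
}
```

Without a gather instruction, a vectorized version must extract each index and insert each loaded element serially, which is exactly the bottleneck described above.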

Sandy Bridge's CPU cores are actually more powerful than its GPU, but the lack of gather/scatter will limit the use of all this computing power.

Cheers,

Nicolas

Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck,

For "slightly divergent locations", i.e. most elements in the same 64B cache line, AFAIK with SSE the best solution was indirect jumps to a series of static shuffles (controls as immediates) in order to maximize 128-bit loads/stores. Now with AVX we can use dynamic shuffles (controls in YMM registers) via VPERMILPS. Based on IACA the new AVX solution is more than 2x faster than legacy SSE; I suppose it will be even more than 2x faster on real hardware, since the main issue with the indirect-jump solution is the high branch miss rate.

Quoting bronxzv: For "slightly divergent locations", i.e. most elements in the same 64B cache line, AFAIK with SSE the best solution was indirect jumps to a series of static shuffles (controls as immediates) in order to maximize 128-bit loads/stores. Now with AVX we can use dynamic shuffles (controls in YMM registers) via VPERMILPS. Based on IACA the new AVX solution is more than 2x faster than legacy SSE; I suppose it will be even more than 2x faster on real hardware, since the main issue with the indirect-jump solution is the high branch miss rate.

VPERMILPS can only permute elements within one vector register. It's of no use for real gather/scatter.

Since gather/scatter is the parallel equivalent of load/store, it would allow parallelizing almost any loop, even when it contains indirect loads and stores (the only limitations being that the loop iterations should not alias and that the indexing uses 32-bit offsets at most - both of which are easy to guarantee for multimedia applications).

Since all software contains many performance critical loops, just imagine how much faster things would be when four or eight iterations can execute simultaneously!

Sandy Bridge already has two 128-bit load units, so I don't think it would take a lot of extra logic to turn them into two 128-bit gather units, Larrabee style. With AVX the bottleneck of sequentially inserting/extracting elements is just too big. And it's only getting worse with FMA and future 512-bit and 1024-bit vectors. Amdahl's Law becomes a performance wall if load/store isn't parallelised as well.

If the cost of an optimal gather/scatter implementation is too big to justify it (e.g. if it adds another cycle of latency to L1 accesses), I believe it is still critical to add these instructions sooner rather than later, initially using a cheaper implementation. For instance, each of the two load units could collect one element per cycle, meaning a gather operation would take just four cycles (throughput), which is already way better than extracting the offsets from one vector and inserting elements into another vector (16 cycles). This would allow developers to start using these instructions early on, and later architectures (with increased transistor budget) could improve the performance of existing vectorized software with a faster gather/scatter implementation.
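The semantics being asked for are simple to state. Here is a scalar model of an 8-wide gather (the function name and test values are illustrative); a microcoded first implementation could issue these eight loads sequentially and still beat the 16-cycle extract/insert emulation estimated above:

```cpp
#include <array>

// Scalar model of the requested 8-wide gather: result[i] = base[offsets[i]].
// On real hardware the eight loads could initially be issued sequentially
// (microcoded), and sped up transparently by later implementations.
std::array<float, 8> gather8(const float* base,
                             const std::array<int, 8>& offsets) {
    std::array<float, 8> r{};
    for (int i = 0; i < 8; ++i)
        r[i] = base[offsets[i]];
    return r;
}
```

The key point of the argument is that the instruction's contract stays the same while its implementation improves, so software written against it today gets faster on future CPUs without a rewrite.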

It would finally make the SIMD instruction set complete, giving every scalar operation a parallel equivalent...

Much to my astonishment I just found out that AVX doesn't support integer operations on ymm registers. Frankly this makes it useless for a large range of multimedia applications. And to boot, vinsertps doesn't support ymm registers either. This means that even for floating-point applications, loading/storing all elements is even slower than I expected.

I can understand that completely duplicating the 128-bit SSE units would have been expensive, but it's beyond me why the engineers didn't add the 256-bit instructions anyhow and execute them in two cycles on a 128-bit unit (the same way SSE used to be executed on 64-bit units on the Pentium 3/4). This means we're stuck with the following roadmap:

AVX2 - FMA, half-float support
AVX3 - integer support
AVX4 - gather/scatter support

Frankly this looks like it's going to become the same mess as SSE1-SSE4.2, with the same slow adoption issues. Supporting all these different extensions is a nasty software issue (even today lots of applications have an SSE2 path and a scalar path, and don't bother complicating things with other extensions).

With 200+ SP GFLOPS Sandy Bridge looks good on paper, but in practice it will have limited use. I guess I'll stick to SSE then after all. Hopefully the extra add and multiply units still offer a tiny bit of performance improvement...

Please Intel, add integer and gather/scatter support sooner rather than later! The first implementation doesn't need to be optimal, but at least the instructions should be available so developers can actually start using them!

And to boot, vinsertps doesn't support ymm registers either. This means that even for floating-point applications, loading/storing all elements is even slower than I expected.

you just need one extra vinsertf128 for 8 vinsertps and 8 32-bit loads, the performance impact should be very low (< 5%) and even negligible (< 1%) if you have even moderate L1D$ misses

Quoting bronxzv: you just need one extra vinsertf128 for 8 vinsertps and 8 32-bit loads, the performance impact should be very low (< 5%) and even negligible (< 1%) if you have even moderate L1D$ misses

So what you're saying is, because it's horrendously slow to emulate a gather operation with extract/insert anyway, it's ok to make it even slower with AVX? Please note that with FMA this will mean you'll be able to do 16 operations per cycle, but you'll still only be able to emulate a gather operation in 18 uops. In other words, it's Amdahl's Law at its worst. Fast vector operations are useless when the memory accesses are sequential.
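The arithmetic behind this can be sketched as a toy Amdahl's-law model (the function and its closed form are mine; the 16 flops/cycle peak and the ~18-cycle emulated gather are the rough figures from the post):

```cpp
// Toy model: a loop body does `flops` useful FP operations per 8-wide
// iteration at a peak of `peak` flops/cycle, but each iteration also pays
// `gather_cycles` of serial gather emulation. The serial part dominates.
double effective_flops_per_cycle(double flops, double gather_cycles,
                                 double peak = 16.0) {
    double compute_cycles = flops / peak;
    return flops / (compute_cycles + gather_cycles);
}
```

With 32 flops of useful work per iteration, an 18-cycle emulated gather drops the effective rate from 16 to 32 / (2 + 18) = 1.6 flops/cycle, while even a crude 4-cycle hardware gather would recover over 5 flops/cycle - which is the whole argument for adding the instruction early.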

Anyway, since insert/extract will become practically redundant with gather/scatter support, it's probably best to leave them as is. But I seriously hope Intel's intention is to make the AVX instruction set complete by adding 256-bit integer operations and gather/scatter support.

>So what you're saying is, because it's horrendously slow to emulate a gather operation with extract/insert anyway, it's ok to make it even slower with AVX?

no, I'm just saying that the fact that there is no 256-bit variant of vinsertps doesn't matter, since in real-world use cases the impact will be negligible

>that with FMA this will mean you'll be able to do 16 operations per cycle

hint: we are already able to do 16 flops per clock with balanced vmulps / vaddps

Quoting bronxzv: no, I'm just saying that the fact that there is no 256-bit variant of vinsertps doesn't matter, since in real-world use cases the impact will be negligible

It does matter, not because it would have that much of an impact on real world performance, but because it would make it easier for developers to convert their SSE code to AVX. Together with the lack of integer instructions, not having 256-bit insert/extract instructions makes AVX less attractive.

It's not that developers are too lazy to emulate the 256-bit operation with some extra instructions, the problem is that these instructions are expected to be added sooner or later anyway. So developers will have to rewrite/update their software again and again. It's very expensive for software companies to make use of all the latest extensions (rearchitecting, implementation, QA, marketing, support, etc).

So what will happen instead is that lots of developers won't even look at AVX and will stick to the more complete SSE instruction set. For Intel this means it takes even longer for these extra transistors to pay off.

Note once again that software developers like myself aren't asking for optimal implementations right away. If integer AVX instructions were executed in two cycles on a 128-bit unit, that would still make it worth starting to rewrite the software right away. The extra register space helps hide memory latencies so it would already be slightly faster. Later implementations can then have true 256-bit execution units for all AVX instructions, and the software would run a lot faster without requiring a rewrite. That's a big incentive for consumers to buy that next generation, since there would already be software making use of these instructions!

So it's in everyone's interest that instruction sets should be as complete as possible, as early as possible. I was really hoping AVX would put an end to the mess created by all the different SSE extensions, but it has gotten off to a disappointing start...

From my POV it's not very important whether the instructions are in the ISA or not, since I can afford to recompile my code for new targets, and I don't program in assembly, nor even with the intrinsics, but with higher level wrapper classes to enjoy far more readable and maintainable code thanks to C++ operator overloading. For example I typically already have a 256-bit packed integer class, or functions like Scatter/Gather/Compress/Swizzle/Deswizzle/... Only actual timings are important for the final users, and only the quality of the source code should be important for the coders, well IMHO.

The *exact same source code* can be compiled to target SSE or AVX, for example just by changing some headers (in fact just a compilation flag), and generally I can't see much potential for improvements in the ASM dumps, so it's IMO the best solution to cope with changes in the ISA and to start writing and *validating* code before the CPUs are available. That's of paramount importance, since software development cycles are longer than ISA enhancements (with roughly each year some changes in the ISA for x86).
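The wrapper-class idea described here can be sketched roughly as follows (all names are illustrative, not from bronxzv's actual library): application code is written against a generic vector type with overloaded operators, and a header or compile flag selects the backing implementation - a plain scalar loop here, where an SSE or AVX header would swap in intrinsics behind the same operators.

```cpp
#include <cstddef>

// Generic 8-wide float vector; the scalar fallback implementation.
// An AVX build would back the same interface with __m256 intrinsics.
struct vec {
    static constexpr std::size_t N = 8;   // 256 bits worth of floats
    float v[N];
    friend vec operator+(const vec& a, const vec& b) {
        vec r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
    friend vec operator*(const vec& a, const vec& b) {
        vec r;
        for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] * b.v[i];
        return r;
    }
};
```

Code written as `d = a * b + c` then recompiles unchanged for each target, which is the "same source, multiple paths" workflow described in the post.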

Quoting bronxzv

From my POV it's not very important whether the instructions are in the ISA or not, since I can afford to recompile my code for new targets, and I don't program in assembly, nor even with the intrinsics, but with higher level wrapper classes to enjoy far more readable and maintainable code thanks to C++ operator overloading. For example I typically already have a 256-bit packed integer class, or functions like Scatter/Gather/Compress/Swizzle/Deswizzle/... Only actual timings are important for the final users, and only the quality of the source code should be important for the coders, well IMHO.

The *exact same source code* can be compiled to target SSE or AVX, for example just by changing some headers (in fact just a compilation flag), and generally I can't see much potential for improvements in the ASM dumps, so it's IMO the best solution to cope with changes in the ISA and to start writing and *validating* code before the CPUs are available. That's of paramount importance, since software development cycles are longer than ISA enhancements (with roughly each year some changes in the ISA for x86).

Taking optimal advantage of new ISA extensions with merely a recompile is a luxury the majority of software developers don't have. There are lots of different extensions, so you need multiple code paths to have optimal code for each. Managing multiple paths is very messy. It's bad enough to have to manage your own releases with varying features; having to deal with varying implementations within a release can become infeasible. Most developers opt for an SSE2 path and a C path and don't bother with the rest. AVX could have been an interesting third path, but it's not complete yet (it's like SSE without SSE2) so most will skip it.

Also note that new extensions can have far-reaching consequences for the software architecture. When SSE2 was introduced it was possible to convert code which previously used MMX for integer operations and SSE for floating-point to only use SSE2. But as a consequence you only had 8 registers for storing both integer and floating-point data. If previously you had nicely tuned MMX+SSE code which didn't need to spill any registers to the stack, SSE2 required you to rework all of that so the register pressure wouldn't cancel the benefit of SSE2 integer instructions. Ironically x64 then solved that, but lots of people still have 32-bit operating systems. So MMX+x87, MMX+SSE, SSE2, x64, etc. all have different optimal usage, and compilers are of very little help.

And that's just the development phase. Debugging, code maintenance, feature extensions, customer support, it all gets a lot more complicated with a highly fragmented ISA. And it means it gets adopted a lot more slowly than it would be if an extension was complete (even if sub-optimal).

I fully understand that CPU designers can't introduce it all at once, but at least with AVX they had the opportunity not to make some of the same mistakes again. It seems to me they could have easily extended all SSE2 integer operations to 256-bit AVX instructions by executing them in two sequential 128-bit chunks. It would eliminate at least one additional code path for those who make use of every extension, and for others would make it more attractive to start coding for it right away instead of waiting for AVX3. It doesn't matter much if the integer execution units are extended to 256-bit in two years or four, the software will be ready to instantly take advantage of the hardware improvements.

The AVX emulator may have allowed you to validate your code early, but it didn't allow you to evaluate whether it's worth the trouble without integer operation support. The solution to the fact that "software development cycles are longer than ISA enhancements" is not to offer early emulation of the extensions alone, but to offer more complete extensions over longer cycles. Which in turn can be done at a manageable transistor cost by implementing less critical instructions in the most straightforward way.

So once again I'd also like to ask the Intel engineers to add gather/scatter instructions sooner rather than later. Even if initially they're just microcoded as sequential load/store operations, developers can actually use them in practice and Intel can evaluate when the time is right to give them more optimal hardware support. By then the software making use of the instructions will already be on the market and the speedup will be instant when people upgrade. So by helping software developers Intel helps itself sell newer CPUs.

Note that AMD is already in the position to extend SSE2 integer instructions to 256-bit instructions and have them executed as a single 256-bit operation, since each Bulldozer module seems to have a pair of fully symmetric 128-bit SSE2 capable execution units. And NVIDIA's 'Project Dover' may quickly become an interesting multimedia platform if they apply their SIMD experience to ARM architectures. Because gather/scatter allows most loops to be auto-vectorized by the compiler, it can result in very high performance/Watt even if other parts of the chip are still slightly inferior.

Having a SIMD instruction set with the parallel equivalent of every scalar instruction is just as important as the multi-core revolution. ILP, TLP, DLP - you need all of them to maximize performance/Watt for the architecture that will dominate the future of computer technology.

Taking optimal advantage of new ISA extensions with merely a recompile is a luxury the majority of software developers don't have. There are lots of different extensions, so you need multiple code paths to have optimal code for each. Managing...

well that's not a mere recompile but the careful design of the variant of your building blocks (for ex. your beloved Gather, 256-bit packed integers) that will be inlined on the new specialized code path (and yes you need multiple paths for your hotspots if you want high performance code)

after that you simply work at the higher level and recompile all your paths from the same source code; it's quite easily manageable I'll say, after many years doing just that

Quoting bronxzv: well that's not a mere recompile but the careful design of the variant of your building blocks (for ex. your beloved Gather, 256-bit packed integers) that will be inlined on the new specialized code path (and yes you need multiple paths for your hotspots if you want high performance code)

after that you simply work at the higher level and recompile all your paths from the same source code; it's quite easily manageable I'll say, after many years doing just that

Look, some extensions target specific applications and make them significantly faster, while other extensions help a wide range of applications but only by a small amount. There's nothing wrong with that in itself, I welcome every incremental improvement, but there's the potential for an extension which speeds up a large range of applications significantly. It may still take years for such a 'complete' extension to be implemented optimally, but since it already holds the promise of speedups, developers won't have to wait for a set of smaller extensions to form one complete whole. It's a very attractive idea to know you can invest effort into developing a new code path and see your application become faster over the course of several years without having to worry about having to change it over and over again. And like I said it also helps the CPU manufacturer, because it gives consumers a reason to buy newer CPUs which will significantly speed up existing applications, instead of having to wait for applications to appear which make use of an incremental new extension for which support was just added.

AVX with integer operations and gather/scatter would be such a complete extension. I'm not saying incremental extensions are bad, I'm just saying complete extensions are better. I'm happy for you that for your application you seem to be fine with incremental extensions, but I'm sure that for the majority of developers a complete extension would have been very welcome. AVX with integer operations and gather/scatter allows parallelizing nearly every loop, and in many cases automatically. I'm talking about optimizing the hotspots in pretty much every application, not just those which take considerable effort to make use of incremental extensions.

Apparently for your application area you found a way to make it manageable by abstracting the instructions into "building blocks". That's great, but I wouldn't call that a "simple" solution. And for the record I use a powerful abstraction too: LLVM. It supports generic vector operations which are JIT-compiled to make use of whatever extensions your CPU supports. But even with LLVM I need to make extensive use of its support for intrinsics to get the best possible performance. And even when no intrinsics are needed it's still important to have different code paths at the level of the abstract vector operations. For instance it would be unwise to use a 256-bit packed integer "building block" if the CPU doesn't support SSE2, because with MMX you can only store two such vectors in registers. The latency from having to spill registers to the stack and read them back a couple of instructions later makes it a lot slower than using 64-bit packed integers and keeping things in registers. So LLVM made things a little more manageable for me, but it took considerable effort to make use of it in the first place and it's not a silver bullet.

So despite some help from software abstractions, I'm still joining the large group of developers who'd like to see AVX support integer operations and gather/scatter as soon as possible, so it becomes widely applicable and worth the coding effort...

So despite some help from software abstractions, I'm still joining the large group of developers who'd like to see AVX support integer operations and gather/scatter as soon as possible, so it becomes widely applicable and worth the coding effort...

I will also welcome full support in AVX for 256-bit packed integers, though for my own application field (realtime 3D rendering) the most important 256-bit integer related instructions are already here: 8 x 32-bit int<->float conversions

It will also be nice to have LRBni-like vgather/vscatter, though for 3D applications it's generally better (i.e. faster, since it minimizes L1 DCache misses and replaces a series of 32-bit loads by 128-bit or 256-bit loads) to keep data in AoS form for the cases requiring gathers (and that's the only case where SoA/hybrid SoA are to be avoided, I'll say); a series of gathers is then replaced by a swizzle operation
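The AoS-plus-swizzle trick described here can be modeled in scalar code as follows (the Vec4 layout and names are illustrative): instead of gathering x[i0..i3], y[i0..i3], z and w with 16 scalar loads, load four contiguous xyzw records (four 128-bit loads on real hardware) and transpose ("swizzle") them into SoA form.

```cpp
// One AoS record: a 128-bit xyzw vertex as kept in memory.
struct Vec4 { float x, y, z, w; };

// Transpose four contiguous AoS records into four SoA lanes.
// On SSE/AVX this is a handful of wide loads plus shuffles instead of
// sixteen scalar gather-style loads.
void aos_to_soa4(const Vec4 in[4],
                 float x[4], float y[4], float z[4], float w[4]) {
    for (int i = 0; i < 4; ++i) {
        x[i] = in[i].x;
        y[i] = in[i].y;
        z[i] = in[i].z;
        w[i] = in[i].w;
    }
}
```

The design trade-off is exactly as stated in the post: the wide contiguous loads touch fewer cache lines than element-by-element gathers, at the cost of the shuffle work to get back to SoA registers.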

I have to side with Nicolas here.

Gather/Scatter are sorely missed from the ISA for many purposes.

AVX not having FMA as promised initially is a disappointment I can live with.

However, AVX not having integer operations is detrimental for its adoption by developers including me. Integer operations would greatly benefit video coding/decoding and other media applications.

Intel has once again released a half-baked instruction set extension (with the exception of the three-operand syntax) even though they said "we won't do that anymore", "orthogonality is important", etc, etc. AVX may eventually become complete over several iterations, but if the trend of incremental additions continues, the code we are writing will spend more time checking which features are supported by the CPU than it spends doing some useful work, and Intel's CPU documentation in the year 2020 may read "to determine whether your CPU supports AVX10.11b check bit 255 of the XMM7 register after executing CPUID with EAX=0x7FFFFFFF".

Coupled with SandyBridge's unjustified socket change to force us to buy new mainboards (1156 to 1155 pins with no significant interface changes such as PCIE3.0, more memory controller channels, or support for faster DDR3), Intel CPU and chipset business seems more and more marketing instead of innovation and performance driven.

I am disappointed with first SandyBridge CPUs to the point that I will not upgrade (with i7-920 @ 3.4GHz I have the same performance as I would have with 2600K anyway), and I do not intend to support AVX, at least not this unfinished one, because it is simply not worth the effort.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

your mileage may vary; my #1 request for the future will be more L1 DCache bandwidth. The true limiter of performance for 256-bit code on Sandy IMHO (and the source of disappointing SSE to AVX speedups) is this limit of one 256-bit load per clock (vs two 128-bit loads per clock with SSE); it's simply not matching well with the 16 flops per clock we can theoretically get

When I was at the IDF last year, I specifically asked whether SB will have 256-bit paths throughout the chip or will Intel cut corners like they did with Pentium 4. I have been told that it will be fully fledged 256-bit chip but it turns out that my concern was justified after all -- they just combined two 128-bit loads into single 256-bit one instead of expanding them both to 256 bits. Now I am starting to doubt whether FPMUL/FPADD also works as 2x 128-bit and thus cannot bring any speedup in this first AVX capable CPU generation.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

the speedups are there, though disappointing compared to the IACA estimates

Many applications which depend on mid-level cache (L2) see limited gain for AVX-256 over 128-bit vectorized code, due to the 128-bit path to L2 in the current implementation. The MKL [DS]GEMM showed substantial gains after months of hand coding to gain L1 locality while implementing AVX-256. I suppose IACA doesn't consider cache locality.
Public presentations allude to the lack of hardware support for fast mis-aligned 256-bit access in the current implementation; that is dealt with by explicitly splitting into 128-bit moves (at the instruction level; compilers do it automatically), which take advantage of hardware support for 128-bit moves on 4- and 8-byte alignments.
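The split described here can be modeled in plain C++ (the function name is mine; the underlying behavior - an unaligned 256-bit store issued as two 128-bit halves - is what the compilers are said to emit):

```cpp
#include <cstring>

// Model of the compiler workaround for slow unaligned 256-bit access:
// a _mm256_storeu_ps-style store is split into two 128-bit (16-byte)
// halves, which the hardware handles well at 4- and 8-byte alignments.
void storeu256_split(float* dst, const float src[8]) {
    std::memcpy(dst,     src,     4 * sizeof(float));  // low 128-bit half
    std::memcpy(dst + 4, src + 4, 4 * sizeof(float));  // high 128-bit half
}
```

On real hardware the two halves would be VEX-128 vmovups instructions; the point is that no 32-byte alignment is required of `dst`.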

yes, I've noticed for the 256-bit unaligned moves that the Intel compiler now expands things like _mm256_storeu_ps into 2 VEX-128 vmovups; aligned moves such as _mm256_store_ps are now generating a 256-bit vmovups (instead of a vmovaps in previous versions). I was a big fan of vmovaps since it was handy for catching missing 32B alignment

I'm not sure about the L1 to L2 bandwidth limitation, since I get slightly better speedups with HT enabled (and thus only ~16 KB L1D per thread)

Though in a lot of critical loops I have something like one 256-bit load every 3 instructions, so I suppose this is a serious bottleneck; the 2nd load port makes 128-bit code fly, thus decreasing the SSE to AVX speedup

Tim,

Can you share with us the information whether current implementation of AVX is capable of executing ADD/MUL/DIV/AND/OR/XOR upon 8 pairs of single precision FP values in a single operation or is it being split internally into two 128-bit operations? Also, what about the upcoming LGA2011 version (if you are allowed to comment of course) -- will there be any architectural differences apart from support for PCIE3.0 and four-channel memory controller?

That is my biggest concern about SandyBridge and AVX, apart from L2 cache path width.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

The AVX-256 parallel operations perform the operations on 8 single precision operands in parallel, not sequenced/pipelined with the second 128-bits following right after the first 128 bits, except for div/sqrt, which are sequenced. You could still argue that those simpler operations are split (as evidenced by restrictions on shuffling), but the FP resources are expanded to double the parallelism going from 128-bit to 256-bit data. I'm not qualified to comment on effect of socket/PCI/memory controller changes, but I don't see them influencing the FP unit internals.

Tim,

What I wanted to know when I asked about the socket is whether some of those architectural limitations you are mentioning will be removed because after all it will be a different die with different litography masks due to socket change. Will Intel use the chance to further improve AVX performance with the new socket in Q3 2011, or will we have to wait for the next CPU generation to realize its full potential?

Also, do you have any data on how much of an improvement can three-operand syntax bring without other code changes?

Furthermore, when you say DIVPS is sequenced, I presume it is preferable to use MULPS with 1/X instead of DIVPS with X?

Finally, are those restrictions on shuffling caused by the lack of 256-bit ALU? Is that the reason why Intel did not even attempt to implement integer part of the AVX, even pipelined?

Sorry for so many questions, but we developers really need to know this and so far SandyBridge reviews only offer marketing hype.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Three operand syntax often gives more than 10% reduction in number of instructions required. Of course, the net effect on performance is far less. I doubt this will be sufficient incentive to change many scalar applications over to AVX.
In my personal view, the invert and multiply scheme as default implementation for single precision vectorization seemed a relic of the weak divide performance of early SSE CPUs, made unnecessary by the strong divide performance of recent production CPUs. As you suggest, the AVX implementation, with little improvement in divide and sqrt, may push us back in that direction.
Restricting most new instruction introduction to floating point operations seems to be a consequence of priority on increasing number of cores and improving multi-core scaling as a more effective way to gain both integer and floating point performance, postponing additional integer instructions until after another process shrink.
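The invert-and-multiply scheme mentioned above can be sketched as follows (names are illustrative; real SSE code might use RCPPS plus a Newton-Raphson step instead of a full divide for the reciprocal):

```cpp
// Invert-and-multiply: replace n divisions by the same x with one divide
// and n multiplies. Historically worthwhile because packed divides were
// slow and unpipelined; only valid where the (usually < 1 ulp) extra
// rounding error of a[i] * (1/x) vs a[i] / x is acceptable.
void scale_by_inverse(float* a, float x, int n) {
    float inv = 1.0f / x;          // one divide up front
    for (int i = 0; i < n; ++i)
        a[i] *= inv;               // cheap multiplies in the loop
}
```

Tim's point is that strong divide hardware on recent CPUs had made this transformation unnecessary, but AVX-256's sequenced divide may push compilers and hand-tuners back toward it.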

Tim,

What I don't understand from your answer is how Intel engineers expect to gain both integer and floating-point performance through multi-core scaling if they do not scale the integer units by the same amount by which they scale the floating-point units?

Wouldn't doubling integer performance (and thus performance of many multimedia application -- image processing, video coding/decoding, etc) by supporting wider integer vectors in AVX be a low hanging fruit?

Since new integer instructions will be added to AVX anyway, I presume that the cost of their decoding is not a factor which made Intel decide not to support them yet.

With that in mind, is the cost of 256-bit ALU in terms of silicon real estate really so prohibitive that Intel had to give up on instruction set orthogonality and full-fledged shuffle, not to mention the chance to improve performance of integer applications?

I also have to wonder why Intel has wasted resources on QuickSync instead of investing those resources into improving integer performance (by making a 256-bit ALU and integer AVX), thus allowing a wide variety of software (including for example the x264 encoder) to get a 2x performance gain. What good is such a specialized transcoding engine when the output image quality has been sacrificed for speed?

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

I suppose a number of applications which were considered important in the decision didn't support integer vectorization, but were expected to scale well with threading. This is speculation on my part, with no inside knowledge. SPEC 2006 is still considered important for marketing, and has practically no integer vectorization. SPECrate, at least, benefits from multiple cores. Evolution toward a wider group of applications for which platforms will be designed is slow.
Hardware designers recognize the need to prepare for the applications which will be important 5 years in the future, in time to make many of these decisions, but there's a lack of useful data.

I must say that I disagree on the "lack of useful data" -- data is out there, but it seems that the wrong people are looking at it.

Intel is trumpeting multimedia performance, but by not having integer AVX instructions it is not doing any favor to multimedia application developers or the users alike.

What I have in mind is the following:

1. Image processing (Photoshop, Gimp, etc)
2. Video processing (Premiere, VirtualDub, AVISynth, etc)
3. Video codecs (x264, VP8, XviD, etc)
4. Sound processing (SoundForge, Logic Audio, Cubase, Sonar, etc)
5. String processing (all text processing applications/libraries that are using SSE4.2 instructions)

All of those applications would benefit from wider integer vectors and better integer performance and very quickly because they are being actively developed with short update cycles. Please do not tell me that Intel engineers are unable to notice that those applications will not go away -- their performance can only become more important in the future as the amount of data we need to process and store keeps growing at a staggering rate.

Moreover, it is time to stop chasing the benchmarks because they are useless in real life, and are misleading your customers. For example, in the Sandra Dhrystone benchmark SandyBridge 2600K has a 30% advantage over Core i7-920 clocked at the same speed, but once you compare integer performance in real applications this advantage melts down to a miserable 6%.

Finally, I was under the impression that Intel asked for our opinion and, based on what we said, decided to give us early access to new features so we could pave the way in our code for the real performance boost that will come with later CPU revisions. By releasing another incomplete instruction set extension, you have again made that impossible. How do you expect integer vectorization to take off if the hardware support is not there?

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Quoting bronxzv: I will also welcome full support in AVX for 256-bit packed integers, though for my own application field (realtime 3D rendering) the most important 256-bit integer related instructions are already here: 8 x 32-bit int<->float conversions

It will also be nice to have LRBni-like vgather/vscatter, though for 3D applications it's generally better (i.e. faster, since it minimizes L1 DCache misses and replaces a series of 32-bit loads with 128-bit or 256-bit loads) to keep data in AoS form for the cases requiring gathers (and those are the only cases where SoA/hybrid SoA is to be avoided, I'd say); a series of gathers is then replaced by a swizzle operation

That's interesting because my main application field is also real-time 3D rendering (I'm the lead developer of SwiftShader). I can't think of a single operation that would help 3D rendering more than gather/scatter though. In particular it would help speed up texture sampling, which requires fetching many 32-bit texels at various memory locations. Even though a 2x2 footprint of texels can wrap/clamp/mirror, in the general case they are close to each other, so it's perfect for a gather instruction as implemented by Larrabee. It would also substantially speed up vertex attribute fetch and transcendental functions.

Note that a 240 mm² 6-core CPU with FMA would actually have the same computing power as a 240 mm² GeForce GTS 450. So CPUs and GPUs appear to be on a collision course, but the lack of gather/scatter still makes the CPU hopelessly inefficient at some workloads.

Another example is ray-tracing. There hasn't been a real breakthrough yet because GPUs are not good at recursion (too much stack space per thread, and GPUs need thousands of threads to achieve good utilization). But while CPUs are good at recursion they're not good at ray-tracing because the rays may slightly diverge and need the ability to access multiple memory locations in parallel.

Giving the CPU gather/scatter capabilities would make it possible to drop the IGP and replace it with generic CPU cores instead. This unification results in an architecture which is more powerful overall and allows developers to create something entirely new and achieve great performance.

As I said, I agree with you, and I have no better idea for a useful instruction than vgather (vscatter is less important for my stuff). The most important step before that, though, is a true high-performance 256-bit implementation: at the moment (on Sandy) we can load two XMM registers per clock with SSE but only one YMM register per clock. That's a strong limit for my renderer, where in a lot of kernels the load:store ratio is well above 2:1, so sustained load bandwidth from the L1 DCache is a key limiter.

>In particular it would help speed up texture sampling, which requires fetching many 32-bit texels at various memory locations.

with bilinear interpolation that will be 64-bit loads (LDRI) or 128-bit loads (HDRI), since you always access two adjacent texels in a row; with gradient data packed together (for bump mapping, for example) there is even more data to access in parallel, hence my Swizzle-instead-of-Gather comment in a previous post

>a gather instruction as implemented by Larrabee.
the only thing I know about the implementation in Larrabee is Abrash's comment that it will not be at top performance (I read it as microcoded and slow) in the first batch of chips; maybe you know more than that?

>It would also substantially speed up vertex attribute fetch

if you have, for example, a mesh topology, each vertex will be shared by many faces, so you are better off keeping vertex data in AoS form (XYZUVW) and then using Swizzle instead of Gather

>transcendental functions.

if you use a LUT-based implementation you access multiple items in parallel (the coefficients of the spline), so the Swizzle argument also applies here

The point is that we already have swizzle in form of shuffle instructions (SHUFPS/PSHUFD). What we don't have is parallel load from different locations indexed by elements of another vector.

What we are trying to say is that:

1. Parallel load is a prerequisite for swizzle, not vice versa.
2. Swizzle can be implemented as part of a parallel load instruction if necessary, not vice versa.
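For readers following along, the operation being asked for here can be sketched in plain scalar C. This is only a reference model of the semantics (the function name and the 8-wide width are illustrative, chosen to match AVX packed floats), not how hardware would implement it:

```c
#include <stddef.h>

/* Hypothetical scalar model of an 8-wide gather: dst[i] = base[idx[i]].
   A hardware vgather would perform all eight loads as one instruction,
   ideally touching each unique cache line only once. */
static void gather8_ref(float dst[8], const float *base, const int idx[8])
{
    for (size_t i = 0; i < 8; ++i)
        dst[i] = base[idx[i]];
}

/* The swizzle/shuffle primitives we already have (SHUFPS, PSHUFD,
   VPERMILPS) only permute data that is already inside a register; they
   cannot perform this vector-indexed load, which is the missing piece. */
```

Nothing more is being requested than this vector-indexed load (and its store counterpart, scatter); the debate is only about how fast a first hardware implementation needs to be.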

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

be assured I understand the behavior of vgather and vscatter; whatever their implementation (in code, microcode, or with monster xbars), maximizing data locality will always be a good idea. A bad data layout with true hardware vgather will give you less performance than plain AVX code doing 256-bit loads (masked if required) and then swizzling with the *existing* instructions

in most of the cases I'm aware of for 3D rendering you not only access N elements in parallel (N=8 with AVX packed floats) to use all the SIMD computation slots, but you generally need to gather M distinct values with the same integer vector (indices)

for example, for vertex data (X,Y,Z,U,V,W) M = 6, for FP32 color data (R,G,B,A) M = 4, for 3rd-order polynomial approximations M = 4, and so on

with a SoA layout you will have to use M gather operations with different base addresses and the same indices vector (M*N distinct memory locations)

with an AoS layout you will have to use N swizzle operations from N distinct memory locations
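To make the two access patterns above concrete, here is a scalar C sketch (layouts and names are hypothetical; N=8 matches AVX packed floats and M=6 matches the (X,Y,Z,U,V,W) vertex example). Both paths produce the same transposed result; the difference is that the SoA path touches M*N scattered locations while the AoS path does N contiguous M-element loads followed by an in-register transpose:

```c
#include <stddef.h>

#define N 8  /* SIMD lanes (AVX packed floats) */
#define M 6  /* components per vertex: X,Y,Z,U,V,W */

/* SoA: M separate arrays, one gather (N scattered loads) per component,
   so M*N distinct memory locations are touched. */
static void fetch_soa(float out[M][N], const float *soa[M], const int idx[N])
{
    for (size_t c = 0; c < M; ++c)
        for (size_t i = 0; i < N; ++i)
            out[c][i] = soa[c][idx[i]];
}

/* AoS: one packed (X,Y,Z,U,V,W) record per vertex. Each lane needs one
   contiguous M-element load (a 128/256-bit load in real code) followed
   by a transpose ("swizzle") -- only N wide loads in total. */
static void fetch_aos(float out[M][N], const float *aos, const int idx[N])
{
    for (size_t i = 0; i < N; ++i)
        for (size_t c = 0; c < M; ++c)   /* aos[idx[i]*M .. +M-1] is one load */
            out[c][i] = aos[(size_t)idx[i] * M + c];
}
```

The scalar loops hide the cost difference, but the memory-touch counts (M*N versus N) are exactly the trade-off being argued.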

Tim, you talk about a "128-bit path to L2 in the current implementation", though in the optimization manual http://www.intel.com/Assets/PDF/manual/248966.pdf, page 2-16, the bandwidth per core for the L2 cache is documented as "1 x 32 bytes per cycle"; which is correct?

Quoting bronxzv: >In particular it would help speed up texture sampling, which requires fetching many 32-bit texels at various memory locations.

with bilinear interpolation that will be 64-bit loads (LDRI) or 128-bit loads (HDRI), since you always access two adjacent texels in a row; with gradient data packed together (for bump mapping, for example) there is even more data to access in parallel, hence my Swizzle-instead-of-Gather comment in a previous post

>a gather instruction as implemented by Larrabee.
the only thing I know about the implementation in Larrabee is Abrash's comment that it will not be at top performance (I read it as microcoded and slow) in the first batch of chips; maybe you know more than that?

>It would also substantially speed up vertex attribute fetch

if you have, for example, a mesh topology, each vertex will be shared by many faces, so you are better off keeping vertex data in AoS form (XYZUVW) and then using Swizzle instead of Gather

>transcendental functions.

if you use a LUT-based implementation you access multiple items in parallel (the coefficients of the spline), so the Swizzle argument also applies here

You can't just load two adjacent texels. Due to texture addressing modes like wrap/mirror/clamp (and different possibilities in multiple dimensions), the texels you need can be in quite different locations than the typical 2x2 footprint.

Furthermore, if you want to improve the cache hit ratio by using texture swizzling (i.e. fitting 2D texel blocks into cache lines by swapping addressing bits around), the address for each texel can vary even more (while still improving overall locality). With a gather instruction, this would be no problem at all.
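The "swapping addressing bits around" mentioned above is commonly done with a Morton (Z-order) layout. As an illustration (the helper names are mine, and the 16-bit-per-axis coordinate range is an assumption), the address math looks like this:

```c
#include <stdint.h>

/* Spread the low 16 bits of v out to the even bit positions
   (classic bit-twiddling expansion). */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton/Z-order index: interleave the bits of x and y so that small
   2D neighborhoods of texels map to nearby addresses, and hence into
   the same cache lines. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

With such a layout the per-texel addresses are no longer a simple linear function of (x, y), which is precisely why a gather instruction (rather than address arithmetic plus wide loads) is the natural fit.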

Tom Forsyth's presentation (software.intel.com/file/15545) claims that for Larrabee "offsets referring to the same cache line can happen on the same clock". So basically the throughput would be the same as the number of unique cache lines that need to be accessed. Could you point me to the document where Abrash said it would not be at top performance for the first chips? I wonder if he was talking about early prototypes which load elements sequentially, or whether he was actually talking about the ability to improve performance further by splitting the gather operation over multiple load units.

As for vertex data, if you read 8 xyzuvw structures and want to transpose this into 6 AVX registers you need a ton of swizzle instructions. Furthermore, the attributes could be stored in separate streams. A gather instruction would solve both problems at once.

Tiny lookup tables can be implemented with swizzle instructions, but they are too small to get good accuracy. NVIDIA's presentation on G80's SFU (http://arith.polito.it/foils/11_2.pdf) can give you an indication of the size of tables that are required for high accuracy: between 6.5 and 13 cache lines. Since high correlation can be expected, the gather instruction would only need to access a few cache lines though. With a pair of 128-bit gather units AVX would be able to do these parallel lookups in very few cycles.

Anyway, while I'm passionate about graphics with no limitations, I believe gather/scatter would be of great value well beyond that. As Igor noted, there are other multimedia applications which would immediately benefit from it. But I also believe it would enable new applications to emerge: things which some people are currently trying to fit onto the GPU architecture (GPGPU applications) but aren't very successful at, due to the GPU's limited programming model. A CPU with gather/scatter support would revolutionize throughput computing.

What I really don't understand is why you are arguing against gather instruction when you already have all you need for your purpose? Why don't you just keep doing your 256-bit masked loads and swizzling to your heart's content with your well organized data, and leave others here to discuss further improvements to what you are already satisfied with? It is not as if you are going to lose anything if we get what we want.

There are certain cases where you simply cannot rearrange the data, or where data rearranging is prohibitively expensive (read: large 3D datasets).

Finally, just take a look at the optimization manual linked (Page 505, Example 11-8) assembler code for gather emulation, and then tell me again that real hardware wouldn't be faster.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

>you are arguing against gather instruction
huh? where ?

>It is not as if you are going to lose anything if we get what we want.
well, it will depend on the area it takes on the chip; I'd prefer 256-bit datapaths to the L1 DCache if I have to choose, though having both would be cool

>just take a look at the optimization manual linked (Page 505, Example 11-8) assembler code for gather emulation, and then tell me again that real hardware wouldn't be faster.

yup, I remember reading it the other day; the indices are read from a buffer IIRC. In a more useful implementation you'll have the indices in a YMM register and move each individual index to a GPR with PEXTRD (after a VEXTRACTF128 for the high part), then INSERTPS from memory. That's 18 instructions including the final VINSERTF128, and the throughput is better than "single instructions" like VSQRTPS or VDIVPS for workloads fitting in the L1 DCache. I have no idea of the speedup a hardware implementation would provide, though as I stated several times I will welcome the instruction even if the speedup is modest

> NVIDIA's presentation on G80's SFU (http://arith.polito.it/foils/11_2.pdf) can give you an indication of the size of tables that are required for high accuracy: betw

they use 2nd-order polynomials, so what I was calling M is 3 in this case; with 3rd-order polynomials M = 4, which is a better fit for 128-bit loads. If the table is big it's even more important to use an AoS layout, i.e. (c0,c1,c2) packed together in your example

I'll see if I can find a pointer to Abrash's comment, and I will post it here

>So basically the throughput would be the same as the number of unique cache lines that need to be accessed.

I'm afraid you are way too optimistic

Forsyth also says (slide 47)

"
Gather/scatter limited by cache speed

L1$ can only handle a few accesses per
clock, not 16 different ones
Address generation and virtual->physical are expensive
Speed may change between different processors
"

still searching for the Abrash reference...

I just found the Abrash reference:

http://www.drdobbs.com/high-performance-computing/216402188;jsessionid=UXUQLDWA2IBWJQE1GHRSKH4ATMY32JVN?pgno=5

"
Finally, note that in the initial version of the hardware, a few aspects of the Larrabee architecture -- in particular vcompress, vexpand, vgather, vscatter, and transcendentals and other higher math functions -- are implemented as pseudo-instructions, using hardware-assisted instruction sequences, although this will change in the future.

"

Not having 256-bit datapaths while adding gather is out of the question; gather would not be very useful without them.

So, in the best case you will use 18 instructions to do a single parallel load instead of one?

1. What if register pressure is high, and you don't have GPRs to spare?

You will be spilling registers to memory and reloading them, generating additional cache/bus traffic for an already memory-intensive operation with poor data locality. How will that help?

2. How will the compiler "learn" to perform such a parallel load, in order to be able to vectorize loops where such a load is needed?

The best you can do is write it on your own each time you need it, using intrinsics or inline assembler. Instead of writing one intrinsic/instruction, or letting the compiler take care of it, you have to write 18 or more.

As I said multiple times, the initial implementation of gather does not have to be faster than the current alternative as far as I am concerned -- at least it will make code clearer, enable the compiler to auto-vectorize more loops, and pave the way for future hardware implementations which will be considerably faster.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Quoting Igor Levicki: As I said multiple times, the initial implementation of gather does not have to be faster than the current alternative as far as I am concerned -- at least it will make code more clear, enable the compiler to auto-vectorize more loops, and pave the way for future hardware implementations which will be considerably faster.

Amen.

>1. What if register pressure is high, and you don't have GPRs to spare?

you re-use a *single* architected GPR 8x; that's a non-issue

>2. How will the compiler "learn" to perform such a parallel load in order to be able to vectorize loops where such a load is needed?

much like if there were a vscatter instruction, but by simply instantiating the optimized code; the compiler would be free to optimize across multiple scatters (unlike if it were a hardware instruction). In most of the cases I'm aware of you use a series of scatters with the same packed indices; the access pattern can be optimized when considering all the scatters together, and that's of paramount importance since the bottleneck is clearly the L1 DCache (number of ports being the #1 limitation) and always will be

Amen to what ?

if you use high-level constructs like inlined vscatter(), vcompress(), etc. functions, I don't see why generating a single instruction instead of several will make the source code clearer

maybe you are talking about the ASM

Quoting bronxzv: >So basically the throughput would be the same as the number of unique cache lines that need to be accessed.

I'm afraid you are way too optimistic

Forsyth also says (slide 47)

"
Gather/scatter limited by cache speed

L1$ can only handle a few accesses per clock, not 16 different ones

Address generation and virtual->physical are expensive

Speed may change between different processors
"

He's merely saying that there should be some coherence to achieve a good speedup with gather/scatter. The worst case is when each of the elements is on a different cache line, which will take 16 L1 cache accesses. But for all the applications already mentioned here there will be high coherence between the addresses of the elements being loaded/stored, so typically it will be much faster.

It's interesting that Forsyth mentions that L1 can handle "a few" accesses. With two read ports, the worst case for gather would be just 8 cycles. In the case of AVX it's a mere 4 cycles (versus 18 to emulate it). And again, that's the worst case; the typical case is probably between 1 and 2 cycles!
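That cost model (one L1 access per distinct cache line touched) can be made concrete with a small scalar sketch; the 64-byte line size, 32-bit elements, and 8-wide gather are assumptions to match the AVX discussion:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of distinct 64-byte cache lines touched by an 8-wide gather of
   32-bit elements from base + idx[i]*4. Under the "same cache line on
   the same clock" model, this count is the gather's cycle lower bound. */
static int unique_cache_lines(uintptr_t base, const int idx[8])
{
    uintptr_t lines[8];
    int n = 0;
    for (size_t i = 0; i < 8; ++i) {
        uintptr_t line = (base + (uintptr_t)idx[i] * 4u) >> 6; /* /64 */
        bool seen = false;
        for (int j = 0; j < n; ++j)
            if (lines[j] == line) { seen = true; break; }
        if (!seen)
            lines[n++] = line;
    }
    return n;
}
```

For eight consecutive indices the count is 1 (best case); for indices 64 bytes apart it is 8 (worst case), which is where the 8-cycle/4-cycle figures above come from.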

Another option is to organize the L1 cache into banks. Basically each of the load/store units would have its own L1 cache bank (or a pair of load/store units could share a multi-port cache bank, so there can be four load/store units in total with merely two banks). Since there can be duplicate data in each of the banks it may be necessary to increase the total L1 cache size to ensure good temporal coherency, but note that this is actually already how a multi-core architecture works, and it's obviously a lot cheaper to double the L1 size than to double the number of cores. Anyway, cache banking might only be worth it when expecting very high gather/scatter performance even with incoherent data locations. Given that Larrabee has wider vectors and is definitely running lots of SIMD code, my guess is that Abrash and Forsyth are talking about the possibility of further improving gather/scatter performance, closer to how GPUs perform. Quoting the rest of Forsyth's slide 47:

"Offsets referring to the same cache line canhappen on the same clock

A gather where all offsets point to the same cache line will be much faster than one where they point to 16 different cache lines

Gather/scatter allows SOA/AOS mixing, but data layout design is still important for top speed"

So it's all just about clarifying that a compromise has been used to keep the logic size reasonable. With a sensible data layout this compromise doesn't stand in the way of achieving high performance.

There are clearly many options, ranging from the trivial microcoded implementation to multi-banking + multi-porting + multiple LSUs. But the point is that it's a risk-free investment. I think it's perfectly fine if the first implementation takes 4 cycles by using the two load units. And if after several years Intel benchmarks the applications which use these instructions and finds that it's not worth the transistors to reduce that to 1 cycle, that's fine too. If they find that gather/scatter is widely used and a faster implementation would make a significant difference, great!

Quoting bronxzv: > NVIDIA's presentation on G80's SFU (http://arith.polito.it/foils/11_2.pdf) can give you an indication of the size of tables that are required for high accuracy: betw

they use 2nd-order polynomials, so what I was calling M is 3 in this case; with 3rd-order polynomials M = 4, which is a better fit for 128-bit loads. If the table is big it's even more important to use an AoS layout, i.e. (c0,c1,c2) packed together in your example

With AVX (N = 8), you'd need 8 of these 128-bit loads, and then a transpose of this 4x8 matrix. If I'm not mistaken, that's 16 shuffle instructions. Extracting the individual addresses takes 9 instructions too. And I'm not even counting the spilling instructions.

Just to be clear here: That's at least 33 instructions before you can do any useful work! With FMA the polynomial itself takes only 4 instructions.

With a gather instruction, these table lookups would only take 4 instructions.
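For concreteness, the lookup-then-polynomial pattern under discussion can be sketched in scalar C. This is an illustration under stated assumptions, not anyone's actual implementation: M=4 coefficients stored AoS per table entry, a cubic evaluated in Horner form (which maps directly onto FMA once vectorized), and the per-lane table[] access being exactly the step a gather instruction would replace:

```c
#include <stddef.h>

#define LANES 8  /* SIMD lanes (AVX packed floats) */

/* One AoS table entry: the four coefficients of one cubic segment. */
struct seg { float c0, c1, c2, c3; };

/* Per lane: one contiguous 4-float lookup (the parallel-load step a
   gather/swizzle would perform), then a Horner-form cubic evaluation,
   i.e. 3 FMAs per lane once vectorized. */
static void eval8(float out[LANES], const struct seg *table,
                  const int idx[LANES], const float x[LANES])
{
    for (size_t i = 0; i < LANES; ++i) {
        struct seg s = table[idx[i]];
        out[i] = ((s.c3 * x[i] + s.c2) * x[i] + s.c1) * x[i] + s.c0;
    }
}
```

The instruction-count argument above is about the `table[idx[i]]` line: emulated, it expands into the extract/load/insert sequence; with gather, the eight lookups collapse into a handful of instructions.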

well, I think we basically all agree; I simply think we can start writing code today with the "complete" SIMD philosophy in mind. FYI I just posted a request on the Intel C++ forum:

http://software.intel.com/en-us/forums/showthread.php?t=80085

at the moment I'm very sceptical about new instructions, since sometimes not only are the speedups not there but we even get slowdowns when using them. One example from the past was BLENDVPS, which was slower than the ANDPS/ANDNPS/ORPS equivalent when introduced (though now it's way faster; it screams on SNB thanks to the dual blend units). A current example is the fact that using two 128-bit VMOVUPS is faster than a single 256-bit VMOVUPS (for unaligned moves); thus the code that is optimal for SNB will be up to 2x slower in the future, when we eventually get 256-bit datapaths

here is a post of mine about this very topic:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=115959&threadid=115645&roomid=2

the best solution is probably to use the intrinsic, since it generates two instructions for SNB and will generate a single instruction for future targets

Sigh...

I said "when you have no GPRs to spare" and you answered with "you reuse one
GPR 8x". Which part of "no GPRs to spare" you did not understand?!?

There are situations (algorithms) where you don't have even that one GPR free -- you have to spill one to memory and reload it later, and often you have to do that inside a loop.

Furthermore, no access pattern optimization is possible if indices are calculated during runtime.

Finally, inlining 18+ instructions each time you use gather/scatter increases code size, which in turn reduces decoding throughput and thus IPC, not to mention that it also prevents the Loop Stream Detector from kicking in.

So no, I wouldn't say we all agree.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

>So no, I wouldn't say we all agree.

you were the one stating that "initial implementation of gather does not have to be faster than the current alternative as far as I am concerned" so I fail to see where we disagree

>Furthermore, no access pattern optimization is possible if indices are calculated during runtime.

huh? a lot of optimizations are possible with AoS layouts and each individual gather from the same base address + single element offsets (and exactly the same packed indices, computed dynamically); we are talking about 128-bit moves instead of 32-bit moves in the cases I have in mind

now, well, if you really think that spilling a single GPR to the L1 DCache will consume "bus traffic" (sic), I suppose there isn't much left to discuss as far as real-world timings are concerned

FYI, much of the LSD optimization is gone on SNB (the decoded icache makes it redundant for high-performance code); in fact loop unrolling is more important than before (I get pretty good speedups by toying with #pragma unroll)

one goal of the game with NHM was to try loop fission to maximize LSD usage; now with SNB the goal is just reversed: loop fusion, to avoid like the plague the useless loads/stores required by multi-pass loop fission

Quoting bronxzv: Amen to what ?

if you use high-level constructs like inlined vscatter(), vcompress(), etc. functions, I don't see why generating a single instruction instead of several will make the source code clearer

maybe you are talking about the ASM

It's not about making the source code clearer, not even the assembly code (although it's helpful when debugging).

It's really about knowing that the single instruction has the potential of becoming faster than the instruction sequence.

Software development cycles can be quite long, and customers don't upgrade their software the minute you release a new version. So to speed up the adoption rate (that's ROI for who's paying the bills), it's important to have early access to new instructions, even if initially they're not (much) faster than a sequence of instructions.

Gather/scatter is most likely already on the roadmap. But instead of waiting for the transistor and engineering budget for a high-performance implementation (after which it still takes many years for developers to make use of it and get the applications into the hands of customers), they could add a cheap implementation in the near future, and by the time the high-performance implementation is ready there will be an instant speedup for the applications customers are already using. That's a big incentive for people to buy the new hardware.

Again, it's faster ROI. Everyone wins.

>It's not about making the source code clearer,

well, you said "Amen" after this comment of Igor's:

"-- at least it will make code more clear"

Quoting bronxzv:

well, I think we basically all agree; I simply think we can start writing code today with the "complete" SIMD philosophy in mind. FYI I just posted a request on the Intel C++ forum:

http://software.intel.com/en-us/forums/showthread.php?t=80085

at the moment I'm very sceptical about new instructions, since sometimes not only are the speedups not there but we even get slowdowns when using them. One example from the past was BLENDVPS, which was slower than the ANDPS/ANDNPS/ORPS equivalent when introduced (though now it's way faster; it screams on SNB thanks to the dual blend units). A current example is the fact that using two 128-bit VMOVUPS is faster than a single 256-bit VMOVUPS (for unaligned moves); thus the code that is optimal for SNB will be up to 2x slower in the future, when we eventually get 256-bit datapaths

here is a post of mine about this very topic:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=115959&threadid=115645&roomid=2

the best solution is probably to use the intrinsic, since it generates two instructions for SNB and will generate a single instruction for future targets

In theory developers can indeed start writing fully parallel SIMD code today. In practice, there's a lot more involved. If I tell my superiors we should start investing time and money into rewriting SSE code into AVX code, using (abstracted) gather/scatter operations which will have an influence on the entire architecture, they'll want a justification for that. Right now, sticking to SSE and benefiting from the extra 128-bit execution units sounds like a better plan. So, like I said before, it will still take many years for the majority of multimedia software companies to consider using AVX, and even longer for Intel to see a return on its investment.

BLENDVPS is a good example... of doing it all wrong. Adding instructions which initially are slower will obviously not get adopted any faster; you can expect developers to check for AVX support before attempting to use BLENDVPS. So the adoption is delayed by three years, and what's worse, it cost transistors. Either way, this failure should not make you sceptical about gather/scatter. It merely shows that the initial implementation has to be at least as fast as the instruction sequence that emulates it.

In a way it even makes me hopeful that we'll see gather/scatter sooner rather than later. Maybe BLENDVPS was added that early just for code density reasons (both assembly and binary code). If that's enough of a reason, then certainly gather/scatter must look awesome. Ok, maybe there was a pinch of sarcasm there, but still, I honestly can't think of a reason not to add gather/scatter instructions at the earliest possible opportunity.

Spilling registers to L1D, which on Sandy Bridge has a best-case latency of 4 cycles, has a considerable performance impact and is even discouraged in the latest optimization reference manual.

Furthermore, I really hate when I have to quote myself because someone is putting words in my mouth:

You will be spilling registers to memory and reloading them, generating additional cache/bus traffic for an already memory-intensive operation with poor data locality. How will that help?

So what you wrote above is not only an incorrect quote of my post; it also implies that I am incompetent, and that is pure malice on your part. If you cannot use facts instead of ad hominem attacks, then perhaps I should just ignore the rest of your rambling and use the report button instead.

Regarding "at least it will make code more clear" quote -- again you are taking what I said out of context to suit your purpose of attacking people you debate with. There is a continuation to that sentence that says "enable compiler to auto-vectorize more loops, and pave a way for
future hardware implementations which will be considerably faster."
.

We do not agree because:

- We believe that we should use one new instruction

To get better performance in our case:

a) User has to buy a new CPU

- You believe that we should use an intrinsic function

To get better performance in your case:

a) Developer must buy a new compiler which emits a faster sequence for the intrinsic function
b) Developer must recompile, QA test, and release
c) User has to buy a new CPU
d) User has to pay for new software version due to our development costs
e) User has to spend time reinstalling software

Which one of those gives the end user a better incentive to upgrade?

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Quoting bronxzv: >It's not about making the source code clearer,

well, you said "Amen" after this comment of Igor's:

"-- at least it will make code more clear"

I said amen to the entire comment. Clearer (assembly) code is a welcome bonus, but certainly not the main reason to want gather/scatter, even if initially it isn't faster.
