Sandy Bridge: SSE performance and AVX gather/scatter

bronxzv's picture

>If I tell my superiors we should start investing time and money into rewriting SSE code into AVX code, using (abstracted) gather/scatter operations

something you can say is that other pure software renderers like my Kribi 3D stuff tested here *before any tuning* are already on a fast track to AVX optimizations

http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=99&Itemid=1&limit=1&limitstart=6

bronxzv's picture

honestly I feel like we are going nowhere, I can't come up with a more concrete idea than requesting intrinsics that will map to instructions if they become available at some point in the future

in your "We" vs "You" argument there isseveralelephants in the room:
- how willthe backward compatibility with legacy targets be assured using only one instruction, isn't it also a good idea to optimize for the installed base ?
- which one can producevalidated applications today andwhich ones are just waiting still for a better future ?

Igor Levicki's picture

Legacy targets do not have AVX. We are discussing lack of gather in AVX.

Optimization is happening right now for this Sandy Bridge.

If this Sandy Bridge had a gather instruction and we used it now in our applications, then the next Sandy Bridge with a faster gather instruction could still run the same software we are optimizing now, which would automatically run faster by means of a CPU upgrade alone.

Is that concept so difficult to grasp or what?

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
c0d1f1ed's picture
Quoting bronxzv something you can say is that other pure software renderers like my Kribi 3D stuff tested here *before any tuning* are already on a fast track to AVX optimizations

http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=99&Itemid=1&limit=1&limitstart=6

Cool!

May I ask what your expectations are after tuning? 11% sounds like only a minor improvement for something that's supposed to double the computing power (although obviously the SSE path also benefits from Sandy Bridge's extra execution units).

For SwiftShader I found Sandy Bridge to be 30% faster clock-for-clock compared to Nehalem. Unless a significant part of that is due to other things than the execution units, this means that in theory the use of AVX could speed things up by at most 50%.

So are you expecting to get closer to that 50%, or are you limited by the lack of integer AVX instructions, bandwidth, or dare I ask... load/store and swizzle?

bronxzv's picture

>11% sounds like only a minor improvement for something that's supposed

sure, I was very sad when M.S. told me the scores he was getting, and I got a 2600K PC only a few days later so I wasn't able to profile anything before then

I now hope for something like 20% vs SSE on Sandy Bridge, or around 50% better IPC overall vs Nehalem

Some loops had no speedup, or even slowdowns (now fixed), in the version tested by Michael S.

when testing with a single thread and turbo off, I made these measurements:

- no speedup for aligned copies of arrays or set/clear buffers (16B/clock L1D cache write bottleneck)
- slowdowns for unaligned copies of arrays, due to an issue with the implementation in Sandy Bridge: two 128-bit vmovups are faster than a single 256-bit vmovups
- poor speedups for the L2-cache blocked case (1.15x overall)
- common speedup around 1.3x for the L1D-cache blocked case with normal load/store
- best observed speedup so far 1.82x (L1D-cache blocked with lower than average load/store)
- generally speaking I get more than the overall speedup for all the swizzle/compress/gather stuff thanks to new instructions like VPERMILPS (and PEXTRD/VINSERTPS, which are SSE4.1 stuff that I'm not using in the fallback SSE-2 path), maybe that explains why I'm not seeing Amdahl's law so hard at work ATM as some other people...
- VBLENDVPS is clearly a killer instruction, I get 1.6x speedups in loops with heavy VBLENDVPS, it's in all cases way faster than masked moves; masked moves allow for cleaner and shorter ASM, they are just way slower (see the sketch after this list)
- the lack of 256-bit integers isn't very important in my case since the most useful instructions, like the packed conversions between floats and ints, are here at full speed

- all in all I expect at most a 1.3x overall speedup with turbo off and a single thread, which should amount to roughly a 1.2x speedup with 8 threads (8x more LLC/memory bandwidth requirements) and turbo on
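
to illustrate the VBLENDVPS vs masked-move point above, here is the same conditional update written both ways (a simplified sketch, not my production code):

#include <immintrin.h>

// register blend (compiles to VBLENDVPS): select src where the mask MSB is set
static inline void blend_update(float *dst, const float *src, __m256 mask)
{
    __m256 cur = _mm256_loadu_ps(dst);
    __m256 res = _mm256_blendv_ps(cur, _mm256_loadu_ps(src), mask);
    _mm256_storeu_ps(dst, res);
}

// masked store (compiles to VMASKMOVPS): shorter ASM, but much slower in my loops
static inline void maskmove_update(float *dst, const float *src, __m256 mask)
{
    _mm256_maskstore_ps(dst, _mm256_castps_si256(mask), _mm256_loadu_ps(src));
}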

>although obviously the SSE path also benefits from Sandy Bridge's extra execution units).

my understanding is that the 2nd load port and the decoded icache are very effective at maximizing legacy SSE code throughput, but AFAIK the only extra execution unit for SSE is the 2nd blend unit, which execution units do you have in mind?

bronxzv's picture

>Optimization is happening right now for this Sandy Bridge.

sure, and this good Sandy has no gather instruction, it's a hard fact of life; hey man, not even the SDE (which has FMA3 and half-floats already, btw) has a gather instruction, that's why I use an inlined gather() function that generates optimal (or pretty optimal) AVX code and also optimized fallback paths like SSE-2 from the same source code

now let's imagine some alternate reality where Sandy has a gather instruction, it's even in the specs since day 1 and wasn't removed (unlike FMA4, for example), it's all cool and dandy and we are all happy with our "complete SIMD ISA". Still, our customers have mostly SSEn-enabled machines, a lot still run XP, an OS that will probably never support AVX, or run Seven and will not install SP1 very fast (yes, there will be 100'000s of Sandy machines sold with no AVX support due to the lack of support in Seven ATM). What can we do about it? We will *need* to generate at least 2 code paths, and a sensible way will be using a single intrinsic allowing the compiler to generate optimized code for all targets (see the sketch at the end of this post)

Is that concept so difficult to grasp or what?
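
a rough sketch (not my actual Kribi code) of the kind of single inlined gather() helper I mean, compiled per target from the same source; HAVE_HW_GATHER and _mm256_hwgather_ps are hypothetical here, since no such instruction or intrinsic exists today:

#include <immintrin.h>

static inline __m256 gather8(const float *base, const int idx[8])
{
#if defined(HAVE_HW_GATHER)
    // hypothetical hardware gather path
    return _mm256_hwgather_ps(base, _mm256_loadu_si256((const __m256i *)idx));
#else
    // AVX software path: build two 128-bit halves and merge them; an SSE-2
    // fallback would simply return the two __m128 halves from the same indices
    __m128 lo = _mm_set_ps(base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
    __m128 hi = _mm_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]]);
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
#endif
}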

Igor Levicki's picture

In that "alternate reality" customers would have an incenitve to upgrade hardware because speedup would be immediate and automatic. Gather instruction would map to an intrinsic so no difference there for you and you could still have two code paths -- with gather instruction and with SSE emulation.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
Tim Prince's picture
Quoting bronxzv
- VBLENDVPS is clearly a killer instruction, I get 1.6x speedups in loops with heavy VBLENDVPS, it's in all cases way faster than masked moves; masked moves allow for cleaner and shorter ASM, they are just way slower

my understanding is that the 2nd load port and the decoded icache are very effective at maximizing legacy SSE code throughput

The compilers should choose vblend instructions for vectorization whenever possible (likely requiring the VECTOR ALIGNED pragma). Masked move is useful as a last resort when it enables auto-vectorization. I see 2.0x speedups comparing AVX-256 vectorization with masked move against non-vector SSE code.

The 2nd load port doubles the speed of existing single thread code which gathers operands into a packed operand. In my tests, old SSE code does as well as AVX-256 in that case. In the multi-threaded case, there might be an advantage for hardware which could perform gather operations across cores without replicating all cached data. A first micro-coded implementation of a gather instruction would likely not deal with the cache line replication.

The decoded icache is supposed to avoid the performance obstacles encountered when alignment and unrolling optimizations for the loop stream detector are missing.

Your figures for L1D and L2 cache blocking are interesting.

bronxzv's picture

Tim, FYI here is an ASM dump of our kernel with the best speedup so far, the AVX version is 1.82x faster than the equivalent SSE path (2600K at 3.4 GHz, turbo off, single thread, L1D$ blocked, 100% 32-B alignment)

it computes a 3D bounding box from XYZ data in SoA form, so it's arguably useful code with a great speedup

; LOE eax edx ecx ebx esi edi ymm0 ymm1 ymm2 ymm3 ymm4 ymm5
.B16.14:                        ; Preds .B16.14 .B16.13
        vmovups   ymm6, YMMWORD PTR [esi+edi*4]                 ;325.22
        vmovups   ymm7, YMMWORD PTR [ebx+edi*4]                 ;326.22
        vminps    ymm4, ymm4, ymm6                              ;325.5
        vmaxps    ymm1, ymm1, ymm6                              ;325.5
        vminps    ymm5, ymm5, ymm7                              ;326.5
        vmaxps    ymm0, ymm0, ymm7                              ;326.5
        vmovups   ymm6, YMMWORD PTR [edx+edi*4]                 ;327.22
        add       edi, 8                                        ;323.29
        vminps    ymm2, ymm2, ymm6                              ;327.5
        cmp       edi, eax                                      ;323.21
        vmaxps    ymm3, ymm3, ymm6                              ;327.5
        jb        .B16.14       ; Prob 82%                      ;323.21

Thomas Willhalm (Intel)'s picture

Bronxzv, Igor,

Following your discussion, I think that both of you have made your points very clear. However, you are coming to different conclusions, as your willingness to write and support different code paths and software versions differs. Also, your estimations of how likely customers are to buy new hardware or software differ a lot. As both questions depend on the industry and the company, there is probably no clear answer. Furthermore, this is actually not a technical but a business question. I therefore suggest that you leave this argument as it is. Your high-quality contributions are highly appreciated and it would be a pity to spoil this otherwise interesting thread with a flame war.

Kind regards
Thomas

Igor Levicki's picture

Thomas,

If you were following my contributions so far, you would have known that I was always ready to support different code paths. That has not changed. I just don't think that intrinsics are the solution to every problem -- I always prefer hardware implementation over software.

Regarding customers, I have built many systems for many people, including myself, so far. I have also upgraded many systems. I cannot speak about the USA market, but there are other markets I am familiar with such as Eastern and Central Europe, Russia, China, India, etc, which prefer simple upgrades that bring noticeable performance improvements. The simplest upgrade is to replace just the CPU -- you don't even have to reinstall the operating system for that. My estimation is also based on the fact that with this economic crisis the majority of people cannot afford to pony up for a full hardware and software upgrade every year or even sooner.

Finally, if implementation of gather and scatter is indeed reduced to some business decision, instead of driven by the need for innovation and enabling, then I am truly disappointed in Intel.

I will never again mention those, and I will also refrain from other suggestions as well from now on -- I don't want them ending up in someone's drawer waiting for the marketing team to figure out how to pitch them to the illiterate masses.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
bronxzv's picture

>you are coming to different conclusions

I don't think so, we basically want the same thing: a gather instruction in AVX much like VGATHER in LRBni; it will map to one (or several) intrinsic(s) like all other AVX instructions, and I'm sure Igor isn't against that (see his last post on the subject)

Now my request is merely for a new intrinsic, much like _mm256_set_ps (which is already a variant of gather, btw) but with a base address and 8 packed indices in a __m256i argument instead of the 8 floats of set_ps, something like the sketch below
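
roughly the signature I have in mind (the intrinsic name is hypothetical), together with what a compiler could lower it to on plain AVX via set_ps:

#include <immintrin.h>

// hypothetical:  __m256 _mm256_gather_ps(const float *base, __m256i indices);
// possible AVX lowering, from the same 8 packed 32-bit indices:
static inline __m256 gather_ps_fallback(const float *base, __m256i indices)
{
    int i[8];
    _mm256_storeu_si256((__m256i *)i, indices);
    return _mm256_set_ps(base[i[7]], base[i[6]], base[i[5]], base[i[4]],
                         base[i[3]], base[i[2]], base[i[1]], base[i[0]]);
}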

c0d1f1ed's picture
Quoting bronxzv my understanding is that the 2nd load port and the decoded icache are very effective at maximizing legacy SSE code throughput, but AFAIK the only extra execution unit for SSE is the 2nd blend unit, which execution units do you have in mind?

I must have incorrectly interpreted these 'Execution Cluster' images: http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-expo...

I was under the impression that two ports were now each able to take 128-bit ADD and MUL instructions. Apparently they got extended in the other direction though, overlapping logic from the integer pipelines.

Frankly I have to say this makes AMD's Bulldozer architecture look really interesting, at least for AVX in its current form.

On the other hand, it means that whatever is responsible for the 30% performance increase for SwiftShader is really impressive! I've only been able to play with a Sandy Bridge demo system for 15 minutes, so I don't have a detailed analysis, but likely the dual load port is a big help for the sequential load hotspots (texture sampling, vertex fetch, lookup tables, etc.).

Still, given that a gather instruction would ideally be able to perform 8 load operations and a matrix transposition in a single cycle, I think that would help even more than 30%. With integer AVX support the throughput could theoretically double, so it's important not to make it data-starved with sequential load/store. There's plenty of cache-line coherence, which can be exploited with gather/scatter.

So my only hope is that Intel doesn't leave things half-done. As I've mentioned before, in the future developers will need architectures with a balanced mix of ILP, DLP and TLP. It looks like Sandy Bridge's dual load ports are a step forward in ILP, and AVX is an attempt at increasing DLP, but the potential of more than doubling the performance is held back by the lack of integer operations and gather/scatter. I fully realize these things take time, but it would have been incredibly useful to already have access to these instructions even if not implemented optimally.

As for TLP, I believe the number of cores should keep increasing, but not at the expense of completing the AVX instruction set. Here's why: it will still take many years before compilers and tools assist or automate multi-threaded development with good scaling behavior. So for now it's easier to achieve higher performance in a wide range of software by parallelizing performance-critical loops, rather than attempting to split the work into threads.

But that's just my take. I'm curious what Intel's vision of the long-term future is like and how they plan to obtain a synergy with software development.

Thomas Willhalm (Intel)'s picture

Igor,

I am aware that you already support different code paths, but my impression was that you would prefer fewer code paths than bronxzv does. In any case, you certainly have a valid point that requiring customers to upgrade hardware and software at the same time puts an extra burden on them, and I don't think it's important whether you label this as "business" or "technical".

I would like to stress that I cannot speak on behalf of Intel. Personally, however, I highly appreciate your technical insights and see tremendous value in your suggestions and feedback. In fact, I had already pointed out this thread to Intel architects who are working on future instruction sets and who read it with great interest.

Kind regards
Thomas

Igor Levicki's picture

Thomas,

So it was a misunderstanding. I am not against multiple code paths.

Regarding customers having to upgrade both hardware and software, please bear in mind that if something is done only with intrinsics, it also forces developers to upgrade software (compiler) to be able to pass the benefit down to the end user.

That is not only costly for developers in terms of money, but it requires a lot of work on our side just to be sure that upgrading the compiler doesn't break build or backward compatibility, or that it does not introduce some regressions elsewhere. On larger projects, the cost of such an effort often blocks the compiler upgrade initiative.

When it comes to hardware, apart from adding some missing instructions such as HMINPS/HMAXPS/HMINPOSPS/HMAXPOSPS/etc, what I would love to see implemented in hardware as soon as possible is:

- GATHER
- SCATTER
- FMA
- LERP (linear interpolation)

First two I already explained.

FMA is useful for DSP tasks (and has many other uses as well). The first implementation does not need to be faster than MUL+ADD -- it just has to provide the extra accuracy that comes from the lack of intermediate rounding.

LERP (linear interpolation) is also useful for all sorts of DSP, scientific, and medical imaging tasks, and with CPUs and GPUs converging adding it would be a logical step forward -- with three- and four-operand non-destructive syntax available now I don't see any issues with adding it.

If you had GATHER and SCATTER instructions, it would also be useful to have chunky (packed) to planar and planar to chunky conversion instructions. You could fetch 16 8-bit pixels, and write them out to 8 different memory locations (planes) with SCATTER as 16 consecutive bits belonging to each plane.

An example:

C2PBW XMM0, XMM1

Would reorder:

XMM1 (src) (numbers mean bit.byte)
15.07 15.06 15.05 15.04 15.03 15.02 15.01 15.00 14.07 ... 00.07 ... 00.00

To:

XMM0 (dst)
15.07 14.07 ... 00.07 15.06 14.06 ... 00.06 ... 15.00 ... 00.00

Hypothetical P2CWB would do the reverse transformation.

Of course, this "bit shuffle" could be generalized with additional parameter(s) to become useful beyond the packed-to-planar conversion I had in mind.

I have SIMD code written for this purpose, but it takes a lot of instructions to do this seemingly simple transformation. This could be useful for image and video manipulation software.
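
For illustration, one possible SSE2 approach (a sketch, probably not the fastest way and not my actual code): it extracts one bit plane per step with the byte-MSB mask trick.

#include <emmintrin.h>
#include <stdint.h>

// chunky-to-planar for 16 8-bit pixels: plane b receives bit b of each of the
// 16 input bytes, pixel 15 in the MSB of the 16-bit plane word (as described above)
static inline void c2p_16x8_sse2(__m128i pixels, uint16_t planes[8])
{
    for (int b = 7; b >= 0; --b) {
        // MSB of every byte -> one bit per pixel
        planes[b] = (uint16_t)_mm_movemask_epi8(pixels);
        // move the next lower bit of every byte into the MSB position; the 16-bit
        // shift leaks one bit across the byte boundary, but only into bit 0 of the
        // high byte, which movemask never reads
        pixels = _mm_slli_epi16(pixels, 1);
    }
}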

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
c0d1f1ed's picture

By the way, even for floating-point applications integer vector operations are important. For example see my 2x implementation here: http://www.devmaster.net/forums/showpost.php?p=43569&postcount=10

c0d1f1ed's picture
Quoting Igor Levicki LERP (linear interpolation) is also useful for all sorts of DSP, scientific, and medical imaging tasks, and with CPUs and GPUs converging adding it would be a logical step forward -- with three- and four-operand non-destructive syntax available now I don't see any issues with adding it.

I believe NVIDIA implements LERP using only FMA execution units (i.e. using multiple instructions).

I don't think it makes sense to add hardware support for LERP. First of all, would it have to be implemented as a*x+b*(1-x), or as a+(b-a)*x? There can be significant differences in the result due to rounding, denormals, or NaNs. Either way you're looking at adding another multiplier or adder, which increases latency. And it won't be used very often (except maybe for very select applications) so there's not a good return in performance for the transistor investment.

I believe two FMA units per core would offer a much better tradeoff between area and performance.
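
For what it's worth, both formulations are already short when written with FMA intrinsics (a sketch assuming a future FMA3-capable target; Sandy Bridge has no FMA):

#include <immintrin.h>

// a + (b - a) * x : one subtract + one FMA, but loses accuracy when b - a rounds badly
static inline __m256 lerp_fast(__m256 a, __m256 b, __m256 x)
{
    return _mm256_fmadd_ps(_mm256_sub_ps(b, a), x, a);
}

// (a - a*x) + b*x, i.e. a*(1-x) + b*x up to rounding: exact at both endpoints, two FMAs
static inline __m256 lerp_stable(__m256 a, __m256 b, __m256 x)
{
    return _mm256_fmadd_ps(b, x, _mm256_fnmadd_ps(a, x, a));
}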

bronxzv's picture

note that the common E(x) idiom

cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2

can be directly extended to AVX-256 since there is a 256-bit variant of these two goodies

only your paddd and pslld will require 2 instructions instead of 1, it doesn't matter much since they aren't in the critical path

I'm sure you know that, but if you want to use this function in a loop you are better off keeping your constants (C0, C1, ...) in registers as much as possible; it isn't that important for 128-bit code, but too many loads are a big limiter for SSE to AVX-256 scalability

all in all this is a wonderful example to show AVX-256 in a good light
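
for illustration, one way to spell the "2 instructions instead of 1" point above with AVX intrinsics (a sketch; a compiler keeping the two halves in separate xmm registers would avoid the extract/insert overhead):

#include <immintrin.h>

// 256-bit paddd emulated on AVX (no 256-bit integer ops) as two 128-bit adds
static inline __m256i add_epi32_avx(__m256i a, __m256i b)
{
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a), _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1), _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}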

Igor Levicki's picture

a + (b - a) * x is what I believe to be the most commonly used variant. I don't think it would be too hard to implement. I don't mind if it gets implemented using more than one uop internally as long as it performs the same as or better than the mix of instructions needed to perform the same operation.

I have also been suggesting an instruction that returns just the fractional part of float to int conversion (hypothetical FRACPS).

For example I have 1.55f. To get 0.55f I have to:

CVTTPS2DQ(1.55f) = 1
CVTDQ2PS(1) = 1.0f
SUBPS(1.55f - 1.0f) = 0.55f

That is also a pretty common operation. Even better if the instruction could return both the fractional and integer parts.
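
In intrinsics form the sequence looks like this (a sketch, SSE2 only), returning both the integer part (as int) and the fraction (as float):

#include <emmintrin.h>

static inline __m128i trunc_and_frac_ps(__m128 x, __m128 *frac)
{
    __m128i ipart = _mm_cvttps_epi32(x);    // CVTTPS2DQ: 1.55f -> 1
    __m128  fint  = _mm_cvtepi32_ps(ipart); // CVTDQ2PS:  1 -> 1.0f
    *frac = _mm_sub_ps(x, fint);            // SUBPS:     1.55f - 1.0f = 0.55f
    return ipart;
}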

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
c0d1f1ed's picture
Quoting bronxzv only your paddd and pslld will require 2 instructions instead of 1, it doesn't matter much since they aren't in the critical path

I'm sure you know that, but if you want to use this function in a loop you are better off keeping your constants (C0, C1, ...) in registers as much as possible; it isn't that important for 128-bit code, but too many loads are a big limiter for SSE to AVX-256 scalability

all in all this is a wonderful example to show AVX-256 in a good light

Indeed there will be a nice performance improvement, but my point was that even for floating-point applications you need integer operations. So a new code path will be required once the 256-bit instructions appear.

Anyway, I respect your opinion that for you this doesn't matter much. For other projects it's a big deal to go through another development cycle though. And my only point with this example was that it will likely also affect applications which are highly floating-point oriented.

Of course nobody can change anything about that now, but if for example it's possible to reasonably easily add the 256-bit packed integer instructions to Ivy Bridge (by executing them as two 128-bit operations), instead of waiting for Haswell, that would be beneficial for everyone. Or if Haswell is only planned to add FMA and 256-bit packed integer support was planned even later (when full 256-bit execution units are feasible), I think it would still be better to expose the integer instructions in Haswell.

It's clear to me that the execution core has to be extended in stages. It doesn't make sense for Intel to invest a lot of transistors into something that won't be widely used for several years, and for which the usage pattern isn't clear yet. But that doesn't mean the instructions aren't useful yet, even if not executed at full rate. In fact it might make sense to leave it that way...

Just look at NVIDIA's Fermi architecture. Some implementations have 4 FMA units for every SFU unit, while other implementations have 6 FMA units for every SFU unit. Also for some implementations every pair of FMA units can execute a double-precision floating-point operation, while other implementations appear to use the SFU for that. The instructions haven't changed, they just have different latency. This allows them to adjust the hardware to the 'mix' of instructions applications use.

For AVX this means it might really make sense to only have 128-bit packed integer execution units for many years to come. But the 256-bit instructions are needed so developers can use them and Intel can in turn analyze their usage and evolve the hardware accordingly. It allows them to very accurately determine when it makes sense to extend the integer execution units to 256-bit, if ever. Analyzing the SSE packed integer instructions is not entirely the same, because developers may decide to use a different implementation due to register pressure and additional instructions to get the data to and from the upper half of the YMM registers.

The 128-bit execution of 256-bit DIV and SQRT instructions is a prime example how to do this the right way. And it shows that 256-bit packed integer instructions were within reach for Sandy Bridge, which makes me hopeful that they'll be added to Ivy Bridge or Haswell at the latest.

c0d1f1ed's picture
Quoting Igor Levicki a + (b - a) * x is what I believe to be the most commonly used variant. I don't think it would be too hard to implement. I don't mind if it gets implemented using more than one uop internally as long as it performs the same as or better than the mix of instructions needed to perform the same operation.

I have also been suggesting an instruction that returns just the fractional part of float to int conversion (hypothetical FRACPS).

For example I have 1.55f. To get 0.55f I have to:

CVTTPS2DQ(1.55f) = 1
CVTDQ2PS(1) = 1.0f
SUBPS(1.55f - 1.0f) = 0.55f

That is also a pretty common operation. Even better if the instruction could return both the fractional and integer parts.

a + (b - a) * x is susceptible to precision issues. If a = 1.0f and b = 1.0e-24f, then (b - a) rounds to -1.0f, and when x = 1.0f the result is 0.0f. This may lead to a division by zero. So generally a * (1 - x) + b * x is preferred, which doesn't suffer from this issue. However, it's clearly even more expensive to add hardware support for it.

Note that neither DirectX 10 nor OpenCL defines a LERP instruction or macro. It's just too tricky to know what the developer expects. And I haven't even touched the issue of what should happen with the rounding bits. Clearly it's better to just let the developer write the lerp version he wants explicitly. I don't think there's any realistic potential for optimizing it in hardware, except for making use of FMA.

As for a fraction operation, SSE4 features the ROUNDPS instruction, which can perform four different rounding operations (FLOOR, CEIL, ROUND and TRUNC), depending on the immediate operand. So a fraction operation only takes two instructions, which I think is already quite good. And it works outside of the range of 32-bit integers as well. Anyway, the upper five bits of the immediate operand are reserved, so they might extend it to include a FRAC mode too (the integer part obviously corresponds to your choice of FLOOR, CEIL or TRUNC). My most important use of ROUNDPS is to compute the fraction part so it would be welcome indeed.
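
For example (a sketch, SSE4.1 only; FLOOR is used here, CEIL or TRUNC work the same way and also give you the integer part):

#include <smmintrin.h>

static inline __m128 frac_ps(__m128 x)
{
    __m128 ipart = _mm_floor_ps(x);   // ROUNDPS with the FLOOR immediate
    return _mm_sub_ps(x, ipart);      // frac = x - floor(x)
}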

Igor Levicki's picture

Regarding lerp(), I just checked, GPU hardware uses this form:

dest = src0 * src1 + (1 - src0) * src2

Furthermore, various shader compilers often replace LRP instruction with FMA when possible nowadays, probably because hardware implementations have changed considerably since LRP was first introduced (i.e. I don't believe GPUs had FMA back then).

Note that I did list LERP after FMA because I was aware that they partially overlap. What I was trying to say is that if you have one, you can have the other as well without breaking the budget.

Regarding ROUNDPS, yes I know about that instruction, but other than being able to pick any rounding mode for the integer (as float) result, there is no advantage in my case -- I need both the integer (as int) result and the fraction (as float) result. So no, unfortunately I cannot reduce the number of instructions with ROUNDPS.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
bronxzv's picture

>big deal to go through another development cycle though. And my only point with this

it's not a new "development cycle" if you work at a high level, just a recompilation. The point I was trying to make all along is that development cycles are longer than ISA refresh cycles, and we already have to plan for FMA3 anyway (and then 512-bit vectors, etc.); btw, imagine what we would have to do if Larrabee were here as yet another "x86" target! So it's no longer practical to work in assembly, or even directly with the intrinsics, for any multi-man-year project, otherwise you spend all your human resources optimizing zillions of different code paths instead of adding features and improving your algorithms

>And it shows that 256-bit packed integer instructions were within reach for Sandy Bridge, which makes me hopeful that they'll be added to Ivy Bridge or Haswell at the latest.

provided that "post 32-nm" instructions like the FP16 conversions and even FMA3 are already supported by the Intel C++ compiler and the SDE, but not the 256-bit integer instructions, I'll not count on it ATM for any project

Igor Levicki's picture

>it's not a new "development cycle" if you work at a high level, just a recompilation...

Well, that depends heavily on the project size, company size and organisational structure.

If it is a "one man band" (i.e. if you are working alone), then yes, it is possible just to recompile whenever you want, although I sincerely hope that you at least perform functional and performance regression testing.

I had a situation where new version of Intel C++ compiler together with recompilation reduced overall performance by 10% -- I had to change compilation options and restructure parts of the code just to stay at the previous performance level, not to mention that intrinsic code output (and thus its performance) can also differ between two compiler versions.

And what about the situation where you want to have multiple code paths and new path is introduced while old one you still need is removed at the same time?

For example, how will you keep supporting a Pentium 3, which no longer has a code path in IPP 7.0, and everything else up to and including Sandy Bridge? You have to keep using IPP 6.1 for that Pentium 3, and IPP 7.0 for Sandy Bridge.

Another example, compiler options change, with next compiler and another recompile you can no longer generate specialized code paths for specific architectures even though in some cases you had considerable benefits from doing so.

So no, it is not possible to do it all "at a high level", and there is a reason why you implement things in hardware and write low-level code. Software implementations are way too fluid and introduce too many variables, not to mention additional cost.

Regarding 256-bit packed integer support versus FP16 and FMA3 support -- for me it is clear that Intel is playing catch up with GPUs.

They will fail at that simply because GPUs have much shorter release cycle, and are advancing performance and functionality at a much faster rate than CPUs.

GPUs have an industry that drives the change (gaming/multimedia), while CPUs still benchmark their base performance on synthetic workloads from five years ago, which is IMO rather pathetic goal given how much R&D money is being burned every year.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
bronxzv's picture

>sincerely hope that you at least perform functional

sure we do

>and performance regression testing.

yep, we do, we explain it here

http://www.inartis.com/Products/Kribi%203D%20Engine/Default.aspx

"

The essential task in the final phase of the development of a 3D rendering engine, much like for a racing car engine, is tuning for the best possible performances on actual machines. After each small change in the program code, very precise timings show us the amount of speedup achieved (if any). For this purpose, we time with a stopwatch the rendering of a sequence of images, the laps of our racecourse.

"

>Pentium 3 which no longer has a code path in IPP 7.0

IIRC they have reverted some of the new limitations (i.e. removed paths) after users complained, in the XE 2011 release (IPP 7 update 1); not sure about the Katmai path though

c0d1f1ed's picture
Quoting Igor Levicki Furthermore, various shader compilers often replace LRP instruction with FMA when possible nowadays, probably because hardware implementations have changed considerably since LRP was first introduced (i.e. I don't believe GPUs had FMA back then).

Note that I did list LERP after FMA because I was aware that they partially overlap. What I was trying to say is that if you have one, you can have the other as well without breaking the budget.

GPUs had FMA support from the very beginning, and LERP has always been a macro which expanded to multiple instructions. Nothing has changed there as far as I'm aware. In fact like I said LERP is gone from DirectX 10 and OpenCL.

So I don't see the point in adding a LERP instruction for the CPU either. It's just additions and multiplies so you need adders and multipliers. If you want LERP to be faster, you just need FMA execution units, and if you want it to be faster still, you need wider vectors. Both FMA and wider vectors help all other arithmetic too. But there's no way you can gain anything from a LERP instruction (unless you want to waste die space).

I think it's very helpful to have early access to new instructions if there's the potential they will become faster in later generations, but for the case of LERP I just don't see that potential.

Igor Levicki's picture

Well, LERP uses a constant (1.0) which you need to load from memory or keep in a register. A hardcoded instruction could use an internal constant, as was possible with the FPU. That would either remove one cache/memory access, or free one SIMD register, compared to the sequence of instructions that does the same thing.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
jan v.'s picture

Anyone already experimented with AVX2 gather ?

I've been using it for texture interpolation. Unfortunately I've seen only small speedups from porting my code from SSE2 to AVX2 -- less than 10%. Gather has the rather annoying property that it resets the mask. From a software point of view this is totally useless and requires more code. A gather without a mask would also have been nice, to save a register.

From what I read, the gather is executed by microcode, producing uops, explaining the poor performance.... At least it's there and hopefully will get better later.

Tim Prince's picture

Guessing as to what you refer to, gather with mask is used to avoid potential access failures when using uninitialized pointers in the masked off elements.  Did you check whether -opt-assume-safe-padding is applicable to your compiler choices?

I don't know whether there would be a 32-bit signed vs. unsigned index vs. 64-bit pointer performance issue, in part since I can't see your code from here and anyway am not an expert in your subject.

As you indicate, the initial implementation of gather was not advertised as a performance improvement over equivalent compiler simulated gather.  But you didn't say whether you tried sse4 with simulated gather, for example.

jan v.'s picture

Having the mask is fine. The mask indicates which elements to load; the annoying thing is that after the gather has finished the mask is set to 0. With the VS2012 compiler, using the gather intrinsic, the compiler doesn't even take into account that the mask is set to 0, leading to wrong code. I already filed a bug for that with Microsoft.

The reason I see no performance improvement is, imho, simply that the cache cannot be loaded in parallel from up to 8 addresses, and everything is serialized there...

Tim Prince's picture

OK; as you say, I haven't seen VS2012 generate simulated (and certainly not AVX2) gather, and AVX is the only /arch: there which could be used.

I agree, the 8 separate addresses could each touch a different cache line, and the accesses may be limited to 2 vector elements per clock cycle, which you could probably achieve with SSE2.

bronxzv's picture

Quote:

jan v. wrote:Anyone already experimented with AVX2 gather ?

I ported a big project to AVX2 well before I was able to test it on actual hardware (it was simply validated with the SDE). As soon as I could start tuning my code on a retail Core i7 4770K setup, my first change was to comment out all my specialized AVX2 gather code paths. Software-synthesized gather (as used for my AVX code path) is faster, so the gather "optimization" was in fact leading to a performance regression!

As you can see in the Optimization Manual https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf page C-5, gather instructions have a high reciprocal throughput (~10 clocks for 256-bit gather instructions). Hint: there are a few tips at pages 11-59/11-60 for gather optimizations.

Also of note:
- VTune Amplifier reports a lot of performance warning events when using the gather instructions.
- There is a bunch of errata related to gather in the Haswell specification update http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf.

My impression so far is that if someone gets significant speedups from gather instructions, it says more about how poorly optimized his/her software-synthesized gather path (required for legacy targets) is than about how fast the gather instructions are. Btw, there is a common myth that gather instructions are somehow required for proper vectorization; that's clearly not true, since these instructions are slower (or maybe slightly faster in some corner cases) than their emulation with other basic instructions, so there are no more and no fewer vectorization opportunities.

All in all, I'd say that AVX2 gather instructions aren't ready for prime time at the moment.

iliyapolak's picture

>>>gather instructions have a high reciprocal throughput (~10 clocks for 256-bit gather instructions). Hint: there are a few tips at pages 11-59/11-60 for gather optimizations.>>>

The Optimization Manual also states that the CPI for gather instructions is measured when memory references are cached in L1. If that is not the situation, the CPI will rise due to the latency of memory access.

bronxzv's picture

Quote:

iliyapolak wrote:the CPI for gather instructions is measured when memory references are cached in L1. If that is not the situation, the CPI will rise due to the latency of memory access.

sure, in other words the gather instructions are worthless even in the situation where they should provide the best speedup (low L1D cache misses)

 

iliyapolak's picture

Very true.

Btw I am waiting for new computer to start testing AVX2.

Tim Prince's picture

Enough people have been saying they wanted the gather instructions to simplify intrinsics coding under Visual Studio that the CPU architects appear to have been encouraged to include them, one of the goals being to see whether there would be sufficient use to justify later hardware features to accelerate it.

You might compare the Intel(c) Xeon Phi(tm) version of gather, which should fetch all operands from a single cache line simultaneously, but currently requires iteration over the group of cache lines involved (which the compiler handles implicitly). 

c0d1f1ed's picture

It's good to avoid ISA fragmentation by providing new instructions even if their implementation is not optimal yet. But that requires them to not be slower than legacy code. Correct me if I'm wrong, but that doesn't seem to be the case for Haswell's gather implementation. Developers will still need multiple code paths so the purpose of providing these instructions early on was defeated. I'm glad there's an implicit promise in the docs that it will get faster, but developers will have to test for AVX2 support and the new architecture before using gather.

For the same reasons of ISA fragmentation I'm very disappointed that TSX support is not available on all Haswell models.

The reason this is critically important is because most developers don't bother with more than a couple of code paths. More paths mean higher QA cost and having to budget in a higher support cost for when one of the code paths needs maintenance. They'd much rather make a one-time investment to support AVX2 with a not-slower gather implementation and TSX. Without all of that, it becomes harder to justify as there is no clarity on when such an investment will provide value. It's all about the ROI. But this also affects Intel's ROI. If AVX2 and TSX are underutilized then it's just dead silicon that swallowed up a lot of R&D cost and will take longer to become a selling point.

bronxzv's picture

Quote:

c0d1f1ed wrote:But that requires them to not be slower than legacy code. Correct me if I'm wrong, but that doesn't seem to be the case for Haswell's gather implementation

it was an overall regression in my own use cases, though I'm not sure it applies to other people's use cases or even to all my functions, maybe some individual functions get a speedup and others a slowdown?

to get a deeper insight I'll try to write a series of gather-focused microbenchmarks and I'll publish my findings here

jan v.'s picture

I've been updating my software renderer, making use of AVX2 gather (see bottom web page). There is a speedup, of around 10%. Clearing the destination register caused some extra speedup. It's faster than the DX10 version on the HD4600, but only with an external PCIe3 GPU to write the software-rendered images to.

bronxzv's picture

Quote:

c0d1f1ed wrote:Correct me if I'm wrong, but that doesn't seem to be the case for Haswell's gather implementation
  

I just wrote a microbenchmark where hardware gather provides some speedup, I tried to make a simple example where the legacy AVX path uses 18 instructions (the very same implementation we discussed in the past), the code is a simplistic float-to-float LUT-based conversion

Source code (partial):

void GatherTest(const float *lut, const float *src, float *dst, int n)  
{
  for (int i=0; i<n; i+=8) Store(dst+i,Gather(lut,Trunc(OctoFloat(src+i))));
}
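
roughly what the AVX2 path of this loop boils down to in plain intrinsics (a sketch; OctoFloat/Trunc/Gather/Store are wrappers from my own library, shown here only by their approximate intrinsic equivalents):

#include <immintrin.h>

void GatherTestIntrin(const float *lut, const float *src, float *dst, int n)
{
  for (int i=0; i<n; i+=8)
  {
    __m256i idx = _mm256_cvttps_epi32(_mm256_loadu_ps(src+i)); // Trunc(OctoFloat(src+i))
    __m256  v   = _mm256_i32gather_ps(lut, idx, 4);            // Gather(lut, idx), scale = sizeof(float)
    _mm256_storeu_ps(dst+i, v);                                // Store(dst+i, v)
  }
}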

AVX path, aka "SW gather":

.B3.3:                          ; Preds .B3.3 .B3.2
        vcvttps2dq ymm2, YMMWORD PTR [esi+eax*4]                ;123.51
        vmovd     edi, xmm2                                     ;123.40
        vextracti128 xmm6, ymm2, 1                              ;123.40
        vmovss    xmm0, DWORD PTR [ecx+edi*4]                   ;123.40
        vpextrd   edi, xmm2, 1                                  ;123.40
        vinsertps xmm1, xmm0, DWORD PTR [ecx+edi*4], 16         ;123.40
        vpextrd   edi, xmm2, 2                                  ;123.40
        vinsertps xmm3, xmm1, DWORD PTR [ecx+edi*4], 32         ;123.40
        vpextrd   edi, xmm2, 3                                  ;123.40
        vinsertps xmm0, xmm3, DWORD PTR [ecx+edi*4], 48         ;123.40
        vmovd     edi, xmm6                                     ;123.40
        vmovss    xmm4, DWORD PTR [ecx+edi*4]                   ;123.40
        vpextrd   edi, xmm6, 1                                  ;123.40
        vinsertps xmm5, xmm4, DWORD PTR [ecx+edi*4], 16         ;123.40
        vpextrd   edi, xmm6, 2                                  ;123.40
        vinsertps xmm7, xmm5, DWORD PTR [ecx+edi*4], 32         ;123.40
        vpextrd   edi, xmm6, 3                                  ;123.40
        vinsertps xmm1, xmm7, DWORD PTR [ecx+edi*4], 48         ;123.40
        vinsertf128 ymm2, ymm0, xmm1, 1                         ;123.40
        vmovups   YMMWORD PTR [ebx+eax*4], ymm2                 ;123.28
        add       eax, 8                                        ;123.22
        cmp       eax, edx                                      ;123.19
        jl        .B3.3         ; Prob 82%                      ;123.19

AVX2 path, aka "HW gather":

.B4.3:                          ; Preds .B4.3 .B4.2
        vcvttps2dq ymm0, YMMWORD PTR [edi+eax*4]                ;139.53
        vpcmpeqd  ymm1, ymm1, ymm1                              ;139.40
        vxorps    ymm2, ymm2, ymm2                              ;139.40
        vgatherdps ymm2, YMMWORD PTR [ecx+ymm0*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [esi+eax*4], ymm2                 ;139.28
        add       eax, 8                                        ;139.22
        cmp       eax, edx                                      ;139.19
        jl        .B4.3         ; Prob 82%                      ;139.19

timings (single thread):

    128 elts      (1536 B): SW gather 0.569 ns/elt  HW gather 0.562 ns/elt  HW speedup = 1.012 x
    256 elts      (3072 B): SW gather 0.578 ns/elt  HW gather 0.547 ns/elt  HW speedup = 1.058 x
    512 elts      (6144 B): SW gather 0.568 ns/elt  HW gather 0.543 ns/elt  HW speedup = 1.048 x
   1024 elts     (12288 B): SW gather 0.560 ns/elt  HW gather 0.544 ns/elt  HW speedup = 1.030 x
   2048 elts     (24576 B): SW gather 0.725 ns/elt  HW gather 0.692 ns/elt  HW speedup = 1.047 x
   4096 elts     (49152 B): SW gather 0.719 ns/elt  HW gather 0.678 ns/elt  HW speedup = 1.061 x
   8192 elts     (98304 B): SW gather 0.694 ns/elt  HW gather 0.607 ns/elt  HW speedup = 1.144 x
  16384 elts    (196608 B): SW gather 0.650 ns/elt  HW gather 0.568 ns/elt  HW speedup = 1.143 x
  32768 elts    (393216 B): SW gather 0.777 ns/elt  HW gather 0.782 ns/elt  HW speedup = 0.994 x
  65536 elts    (786432 B): SW gather 0.975 ns/elt  HW gather 0.991 ns/elt  HW speedup = 0.984 x
 131072 elts   (1572864 B): SW gather 1.323 ns/elt  HW gather 1.362 ns/elt  HW speedup = 0.971 x
 262144 elts   (3145728 B): SW gather 1.526 ns/elt  HW gather 1.539 ns/elt  HW speedup = 0.992 x
 524288 elts   (6291456 B): SW gather 1.790 ns/elt  HW gather 1.835 ns/elt  HW speedup = 0.975 x
1048576 elts  (12582912 B): SW gather 2.186 ns/elt  HW gather 2.250 ns/elt  HW speedup = 0.972 x
2097152 elts  (25165824 B): SW gather 4.056 ns/elt  HW gather 4.081 ns/elt  HW speedup = 0.994 x
4194304 elts  (50331648 B): SW gather 6.224 ns/elt  HW gather 6.236 ns/elt  HW speedup = 0.998 x
8388608 elts (100663296 B): SW gather 7.915 ns/elt  HW gather 7.919 ns/elt  HW speedup = 1.000 x

the best speedup (x1.14) is for the total workset (input & output arrays + LUT) in the L2 cache

 

Configuration: Core i7 4770K @ 3.5 GHz (both core ratio and cache ratio fixed at 35 with bclock = 100.0 MHz) + DDR3-2400, HT disabled 

bronxzv's picture

I get better speedups after 8x unrolling (~the best unroll factor) since HW gather gets faster while SW gather is unchanged

Source code (partial):

void GatherTest(const float *lut, const float *src, float *dst, int n) 
{
#pragma unroll(8)
  for (int i=0; i<n; i+=8) Store(dst+i,Gather(lut,Trunc(OctoFloat(src+i))));
}

AVX path not shown, way too complex after 8x unrolling...

AVX2 path:

.B4.7:                          ; Preds .B4.7 .B4.6
        vcvttps2dq ymm1, YMMWORD PTR [edx+ebx]                  ;139.53
        inc       ecx                                           ;139.3
        vmovdqa   ymm2, ymm0                                    ;139.40
        vgatherdps ymm3, YMMWORD PTR [edi+ymm1*4], ymm2         ;139.40
        vmovups   YMMWORD PTR [edx+eax], ymm3                   ;139.28
        vcvttps2dq ymm4, YMMWORD PTR [32+edx+ebx]               ;139.53
        vmovdqa   ymm5, ymm0                                    ;139.40
        vgatherdps ymm6, YMMWORD PTR [edi+ymm4*4], ymm5         ;139.40
        vmovups   YMMWORD PTR [32+edx+eax], ymm6                ;139.28
        vcvttps2dq ymm7, YMMWORD PTR [64+edx+ebx]               ;139.53
        vmovdqa   ymm1, ymm0                                    ;139.40
        vgatherdps ymm2, YMMWORD PTR [edi+ymm7*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [64+edx+eax], ymm2                ;139.28
        vcvttps2dq ymm3, YMMWORD PTR [96+edx+ebx]               ;139.53
        vmovdqa   ymm4, ymm0                                    ;139.40
        vgatherdps ymm5, YMMWORD PTR [edi+ymm3*4], ymm4         ;139.40
        vmovups   YMMWORD PTR [96+edx+eax], ymm5                ;139.28
        vcvttps2dq ymm6, YMMWORD PTR [128+edx+ebx]              ;139.53
        vmovdqa   ymm1, ymm0                                    ;139.40
        vgatherdps ymm7, YMMWORD PTR [edi+ymm6*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [128+edx+eax], ymm7               ;139.28
        vcvttps2dq ymm1, YMMWORD PTR [160+edx+ebx]              ;139.53
        vmovdqa   ymm2, ymm0                                    ;139.40
        vgatherdps ymm3, YMMWORD PTR [edi+ymm1*4], ymm2         ;139.40
        vmovups   YMMWORD PTR [160+edx+eax], ymm3               ;139.28
        vcvttps2dq ymm4, YMMWORD PTR [192+edx+ebx]              ;139.53
        vmovdqa   ymm5, ymm0                                    ;139.40
        vgatherdps ymm6, YMMWORD PTR [edi+ymm4*4], ymm5         ;139.40
        vmovups   YMMWORD PTR [192+edx+eax], ymm6               ;139.28
        vcvttps2dq ymm7, YMMWORD PTR [224+edx+ebx]              ;139.53
        vmovdqa   ymm1, ymm0                                    ;139.40
        vgatherdps ymm2, YMMWORD PTR [edi+ymm7*4], ymm1         ;139.40
        vmovups   YMMWORD PTR [224+edx+eax], ymm2               ;139.28
        add       edx, 256                                      ;139.3
        cmp       ecx, esi                                      ;139.3
        jb        .B4.7         ; Prob 99%                      ;139.3

timings (single thread):

    128 elts      (1536 B): SW gather 0.570 ns/elt  HW gather 0.481 ns/elt  HW speedup = 1.185 x
    256 elts      (3072 B): SW gather 0.586 ns/elt  HW gather 0.456 ns/elt  HW speedup = 1.286 x
    512 elts      (6144 B): SW gather 0.571 ns/elt  HW gather 0.447 ns/elt  HW speedup = 1.276 x
   1024 elts     (12288 B): SW gather 0.560 ns/elt  HW gather 0.444 ns/elt  HW speedup = 1.263 x
   2048 elts     (24576 B): SW gather 0.634 ns/elt  HW gather 0.572 ns/elt  HW speedup = 1.109 x
   4096 elts     (49152 B): SW gather 0.718 ns/elt  HW gather 0.591 ns/elt  HW speedup = 1.216 x
   8192 elts     (98304 B): SW gather 0.695 ns/elt  HW gather 0.575 ns/elt  HW speedup = 1.209 x
  16384 elts    (196608 B): SW gather 0.696 ns/elt  HW gather 0.624 ns/elt  HW speedup = 1.116 x
  32768 elts    (393216 B): SW gather 0.740 ns/elt  HW gather 0.733 ns/elt  HW speedup = 1.011 x
  65536 elts    (786432 B): SW gather 1.066 ns/elt  HW gather 1.080 ns/elt  HW speedup = 0.987 x
 131072 elts   (1572864 B): SW gather 1.351 ns/elt  HW gather 1.357 ns/elt  HW speedup = 0.996 x
 262144 elts   (3145728 B): SW gather 1.539 ns/elt  HW gather 1.542 ns/elt  HW speedup = 0.998 x
 524288 elts   (6291456 B): SW gather 1.714 ns/elt  HW gather 1.767 ns/elt  HW speedup = 0.970 x
1048576 elts  (12582912 B): SW gather 2.195 ns/elt  HW gather 2.103 ns/elt  HW speedup = 1.044 x
2097152 elts  (25165824 B): SW gather 4.057 ns/elt  HW gather 4.017 ns/elt  HW speedup = 1.010 x
4194304 elts  (50331648 B): SW gather 6.220 ns/elt  HW gather 6.203 ns/elt  HW speedup = 1.003 x
8388608 elts (100663296 B): SW gather 7.909 ns/elt  HW gather 7.908 ns/elt  HW speedup = 1.000 x

the best speedup is now 1.26-1.28x when the whole workset fits in the L1D cache

I also tested this code under VTune and I don't get the same performance warnings (for ex. "Machine Clears") as with my full project, there are simply a lot of assists (Filled Pipeline Slots -> Retiring -> Assists: 0.092) which looks normal since gather is implemented as microcode at the moment

 

Configuration: Core i7 4770K @ 3.5 GHz (both core ratio and cache ratio fixed at 35 with bclock = 100.0 MHz) + DDR3-2400, HT disabled

bronxzv's picture

Quote:

jan v. wrote:Clearing the destination register caused some extra speedup.

interesting, I see that the Intel compiler does just that, see vxorps ymm2, ymm2, ymm2 in my example above (AFAIK this is a "zeroing idiom"), though it's omitted in the unrolled version for some reason

bronxzv's picture

Quote:

jan v. wrote:I've been updating my software renderer, making use of AVX2 gather (see bottom web page). There is a speedup, of around 10%.

out of curiosity I tested your demo, keeping the default initial view point I get these scores:

FQuake64.exe : 208 fps
FQuake64 AVX2.exe : 195 fps

so it looks like the AVX2 path is slower than the other one (supposedly AVX?), how can I measure the 10% speedup you are referring to?

 

Configuration: Core i7 4770K @ 3.5 GHz (turbo up to 4.0 GHz) + DDR3-2400, HT enabled

Sergey Kostrov's picture

>>...the best speedup ( x1.14 ) is for the total workset ( input & output arrays + LUT ) in the L2 cache...

It actually means 14% improvement for that kind of processing. If you ever looked at the Intel MKL Release Notes you could see improvement numbers like 2% or 3% for some functions (algorithms) and it makes a difference when large data sets need to be processed.

bronxzv's picture

Quote:

Sergey Kostrov wrote:It actually means 14% improvement for that kind of processing.

note that my goal was to find an upper bound for the use case most detrimental to legacy "software" gather, namely when it is implemented as a generic function equivalent to AVX2 hardware gather (starting from a 256-bit vector of indices and storing to a 256-bit destination); after further tests the best speedup I have reached so far is 1.29x (29%), example shown below (this code has no practical purpose):

AVX2 path (unrolled 8x):

.B6.7:                          ; Preds .B6.7 .B6.6
;;;   {
;;;     const OctoInt indices(Trunc(work));
        vcvttps2dq ymm3, ymm1                                   ;170.22
        inc       edx                                           ;168.3
;;;     checkSum ^= Gather(lut,indices);
        vmovdqa   ymm4, ymm2                                    ;171.13
        vgatherdps ymm5, YMMWORD PTR [esi+ymm3*4], ymm4         ;171.13
;;;     work = work * work;
        vmulps    ymm3, ymm1, ymm1                              ;172.19
        vxorps    ymm6, ymm0, ymm5                              ;171.10
        vcvttps2dq ymm0, ymm3                                   ;170.22
        vmovdqa   ymm1, ymm2                                    ;171.13
        vgatherdps ymm7, YMMWORD PTR [esi+ymm0*4], ymm1         ;171.13
        vxorps    ymm4, ymm6, ymm7                              ;171.10
        vmulps    ymm6, ymm3, ymm3                              ;172.19
        vcvttps2dq ymm0, ymm6                                   ;170.22
        vmovdqa   ymm1, ymm2                                    ;171.13
        vgatherdps ymm5, YMMWORD PTR [esi+ymm0*4], ymm1         ;171.13
        vxorps    ymm1, ymm4, ymm5                              ;171.10
        vmulps    ymm4, ymm6, ymm6                              ;172.19
        vcvttps2dq ymm7, ymm4                                   ;170.22
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm3, YMMWORD PTR [esi+ymm7*4], ymm0         ;171.13
        vxorps    ymm6, ymm1, ymm3                              ;171.10
        vmulps    ymm1, ymm4, ymm4                              ;172.19
        vcvttps2dq ymm5, ymm1                                   ;170.22
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm7, YMMWORD PTR [esi+ymm5*4], ymm0         ;171.13
        vxorps    ymm4, ymm6, ymm7                              ;171.10
        vmulps    ymm6, ymm1, ymm1                              ;172.19
        vcvttps2dq ymm0, ymm6                                   ;170.22
        vmovdqa   ymm3, ymm2                                    ;171.13
        vgatherdps ymm5, YMMWORD PTR [esi+ymm0*4], ymm3         ;171.13
        vxorps    ymm1, ymm4, ymm5                              ;171.10
        vmulps    ymm4, ymm6, ymm6                              ;172.19
        vcvttps2dq ymm7, ymm4                                   ;170.22
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm3, YMMWORD PTR [esi+ymm7*4], ymm0         ;171.13
        vxorps    ymm6, ymm1, ymm3                              ;171.10
        vmulps    ymm1, ymm4, ymm4                              ;172.19
        vcvttps2dq ymm5, ymm1                                   ;170.22
        vmulps    ymm1, ymm1, ymm1                              ;172.19
        vmovdqa   ymm0, ymm2                                    ;171.13
        vgatherdps ymm7, YMMWORD PTR [esi+ymm5*4], ymm0         ;171.13
        vxorps    ymm0, ymm6, ymm7                              ;171.10
        cmp       edx, eax                                      ;168.3
        jb        .B6.7         ; Prob 99%                      ;168.3
 

I still have to do more tests, for example the same kind of code in a multi-threaded application with HT enabled, but I suppose it answers c0d1f1ed's main question: it makes sense to consider using AVX2 gather instructions for today's code (for some use cases) since there isn't always a performance regression

Quote:

Sergey Kostrov wrote:If you ever looked at the Intel MKL Release Notes you could see improvement numbers like 2% or 3% for some functions (algorithms).

I can't see any mention of gather in the latest MKL Release Notes (MKL 11.0 update 4), so you probably read it somewhere else, if you find it again please let me know

Quote:

Sergey Kostrov wrote:and it makes a difference when large data sets need to be processed.

actually it makes more of a difference with small data sets: the best 29% speedup I reached is with a 16 KiB (read-only) dataset entirely in the L1D cache; with truly big sets you are mostly LLC-cache bound, or worse, memory bound

c0d1f1ed's picture

That's actually not bad at all! The worst case seems to be a 3% performance loss, which is negligible, especially since 14% or more can be gained. And with even more to be gained in future implementations, I see no reason to hold back on using gather.

bronxzv, was the performance regression you observed due to not clearing the destination register? Apparently that is required to break a dependency chain.

While the mask and blend functionality of gather may not seem very valuable from a software point of view, it is actually essential for handling interrupts at the hardware level. It can store the partial result in the destination register and track where it left off by updating the mask register, before both get stored in memory for a thread switch to occur. The alternative would have been to discard the partial result and start the gather operation all over again when resuming. This would have made it perform worse than an extract/insert sequence.
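
For reference, this is what the dependency-breaking pattern looks like at the intrinsics level (a sketch; the compiler output posted above achieves the same thing with vxorps and vpcmpeqd): a zeroed destination plus an all-ones mask.

#include <immintrin.h>

static inline __m256 gather_all_ps(const float *base, __m256i idx)
{
    const __m256 zero = _mm256_setzero_ps();                        // fresh destination, no dependency on the previous value
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1)); // gather every element
    return _mm256_mask_i32gather_ps(zero, base, idx, mask, 4);
}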

c0d1f1ed's picture

Quote:

jan v. wrote:I've been updating my software renderer, making use of AVX2 gather (see bottom web page). There is a speedup, of around 10%.

Does that include widening to 256-bit?

Quote:

It's faster than the DX10 version on the HD4600, but only with an external PCIe3 GPU to write the software-rendered images to.

That is hugely impressive! Hopefully that helps Intel take CPU-GPU unification seriously. If the IGP was replaced with more CPU cores, that would make software like yours faster overall than using 'dedicated' hardware. It's much easier to program than a heterogeneous system, and there would be no limitations.

That just leaves power consumption as an issue. But that could be addressed by equipping each core with two clusters of 512-bit SIMD units, running at half frequency, each dedicated to one thread. The loss of Hyper-Threading for SIMD operations (not at the software level but at the hardware level) can be replaced by executing AVX-1024 operations in two cycles, which would be more power efficient to boot.

jan v.'s picture

Quote:

bronxzv wrote:

out of curiosity I tested your demo, keeping the default initial view point I get these scores:

FQuake64.exe : 208 fps
FQuake64 AVX2.exe : 195 fps

so it looks like the AVX2 path is slower than the other one (supposedly AVX?), how can I measure the 10% speedup you are referring to?

 

Configuration: Core i7 4770K @ 3.5 GHz (turbo up to 4.0 GHz) + DDR3-2400, HT enabled

I'm running on: Core i7 4770K @ 3.5 GHz (turbo does 3.9 GHz for all cores) + DDR3-1600, HT enabled, Win 7, at 2560x1440 resolution

FQuake64.exe : 295 fps            --> actually this is a SSE2 version so 128 bit
FQuake64 AVX2.exe : 324 fps   -> 256 bit

That is with the screen to a discrete GPU, supporting PCIe3 (8GB/s), AMD 7970.

With the IGP displaying and the CPU rendering, I'm only getting a mere 100 fps; some buffer copying must be going on, eating all the memory bandwidth. The IGP in DX10 does 260 fps.

jan v.'s picture

Quote:

c0d1f1ed wrote:

Does that include widening to 256-bit?

That is hugely impressive! Hopefully that helps Intel take CPU-GPU unification seriously. If the IGP was replaced with more CPU cores, that would make software like yours faster overall than using 'dedicated' hardware. It's much easier to program than a heterogeneous system, and there would be no limitations.

Yes, 256-bit.

I'm winning here mainly because the type of rendering I'm doing is very bandwidth efficient. Given more memory bandwidth, like with the integrated 128 MB cache, the IGP would be faster. And indeed it uses some 50 Watts more.

bronxzv's picture

Quote:

jan v. wrote:I'm running on: Core i7 4770K @ 3.5 GHz (turbo does 3.9 GHz for all cores) + DDR3-1600, HT enabled, Win 7, at 2560x1440 resolution

FYI, I tested at 1600 x 1200 with the 4770K iGPU. Btw, it looks like there is a big serial portion of code in your demo at each frame (thus low overall CPU usage, less than 85% on my machine; in other words you are leaving more than 15% performance potential on the table at the moment). It's probably when you copy the rendered frames to the front buffer; I'd advise doing that in parallel with the rendering of the next frame (triple buffering), this way copying the rendered frames is concurrent with rendering. Typically one thread is enough to saturate the PCIe bandwidth, so a very sensible setup is an 8-thread pool for rendering and a single extra thread for copying the final frames (see the rough sketch below)

EDIT: I measured the time to copy a 32-bit 1600x1200 frame with Direct2D and the iGPU (CPU at 3.9 GHz fixed) and it takes 3.28 +- 0.02 ms (executing in parallel with 8 running threads, > 99% overall CPU usage), i.e. the hard limit is at ~300 fps if copying the rendered frames overlaps with rendering the next frames (which my own engine does, btw)
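
a rough sketch of the kind of overlap I mean (simplified, not our actual engine code): the render threads produce into one buffer while a single presenter thread copies the previously finished buffer to the front buffer

#include <condition_variable>
#include <mutex>

struct PresentQueue {
    std::mutex m;
    std::condition_variable cv;
    const void *pending = nullptr;        // finished frame waiting to be copied

    void submit(const void *frame)        // called once per frame by the render side
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return pending == nullptr; }); // previous copy done
        pending = frame;
        cv.notify_all();
    }

    template <class CopyFn>               // CopyFn: user-supplied blit, e.g. a Direct2D copy
    void presenterLoop(CopyFn copyToFrontBuffer)
    {
        for (;;) {
            const void *frame;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return pending != nullptr; });
                frame = pending;
            }
            copyToFrontBuffer(frame);     // runs concurrently with rendering the next frame
            std::lock_guard<std::mutex> lock(m);
            pending = nullptr;
            cv.notify_all();
        }
    }
};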

Quote:

jan v. wrote:FQuake64.exe : 295 fps            --> actually this is a SSE2 version so 128 bit
FQuake64 AVX2.exe : 324 fps   -> 256 bit

ah! so your 10% is not the gain from gather in isolation. Now, a 10% speedup looks quite low for SSE2 to AVX2; you can probably get more from going from 128-bit to 256-bit floats and ints + FMA + gather, I'm getting around 45% speedup for my own texture mapping code (using 256-bit unpack and shuffle + FMA, but using neither gather nor generic permute at the moment)
