Why no FMA in AVX in Sandy Bridge?

Igor Levicki:

I have heard that Sandy Bridge won't have FMA implementation.

If that rumor is true, I would really like to know who decided that x86 developers should wait even longer to finally get a fused multiply-add instruction. Is it so useless in real code, or has the marketing department once again started doing the engineers' job?

Hereby I publicly voice my displeasure over that poor decision.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

Quoting - Igor Levicki

I have heard that Sandy Bridge won't have FMA implementation.

If that rumor is true, I would really like to know who decided that x86 developers should wait even longer to finally get a fused multiply-add instruction. Is it so useless in real code, or has the marketing department once again started doing the engineers' job?

Hereby I publicly voice my displeasure over that poor decision.

Below is a response from the Engineering team:

Hi Igor,

Sandy Bridge will not have FMA; it's targeted for a future processor. I apologize for any confusion I (or Intel) caused. In our defense, we did discuss feature timing at the last two Intel Developer Forums (and now, to my embarrassment, I see that presentation has been removed from the IDF content catalog at http://www.intel.com/idf ; we'll have it up in time for the upcoming IDF on Oct 20). And it's on a separate CPUID feature flag (and in a separate section of the document) in the programming reference.

Anyway, enough of my justifications. There is no intent to 'market' here; we're just engineers. Our strategy going forward is to disclose our directions to the industry early, first to get feedback on the value (and definition) of features like wider vectors, FMA, and new instructions, and second to get software ready as early as possible. From your perspective is this the right strategy, or are we just confusing people? (And for anyone else reading this: while I appreciate the private mails, I especially like feedback discussions to happen in public forums...) So far I have collected a lot of feedback on the definition and direction, and we hope to provide a public response to it shortly.

It sounds like you are an FMA supporter - beyond the raw FLOPS improvement, do you have any sensitivity to the numerical advantages FMA can provide? There are obviously a lot of tradeoffs in the implementations we can provide, and having some data to understand how you would use it would be very helpful.

Regards,

Mark Buxton

Igor Levicki:

Mark,
Early disclosure of new technologies is always a good thing in my opinion, especially here in the computer industry -- it is a symbiosis of hardware feature set and software support that makes computers tick.
There are several reasons why FMA is important to developers:
- Porting -- AltiVec has FMA and x86 doesn't (take a look at the FFTW 3.0+ codelets, which use FMA on Motorola hardware for a speed boost; see their paper).
- DAW -- Any DSP operation where there is accumulation (like the FIR/IIR filters used for delay lines, reverberation, etc.) would benefit from FMA's additional precision.
- Larrabee -- being x86-based it is neither a CPU nor a GPU, but it still has to have FMA if you want it to succeed as a number cruncher.
Especially with Larrabee in the picture it doesn't make any sense to keep FMA out of x86 for so long. Trust me, you really need developers to start using FMA before Larrabee hits the shelves, and you definitely want Larrabee to have FMA. The only way to accomplish both is to make FMA part of x86 as fast as possible.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

Hi Igor,

Sandy Bridge will not have FMA; it's targeted for a future processor. I apologize for any confusion I (or Intel) caused. In our defense, we did discuss feature timing at the last two Intel Developer Forums (and now, to my embarrassment, I see that presentation has been removed from the IDF content catalog at http://www.intel.com/idf ; we'll have it up in time for the upcoming IDF on Oct 20). And it's on a separate CPUID feature flag (and in a separate section of the document) in the programming reference.

Anyway, enough of my justifications. There is no intent to 'market' here; we're just engineers. Our strategy going forward is to disclose our directions to the industry early, first to get feedback on the value (and definition) of features like wider vectors, FMA, and new instructions, and second to get software ready as early as possible. From your perspective is this the right strategy, or are we just confusing people? (And for anyone else reading this: while I appreciate the private mails, I especially like feedback discussions to happen in public forums...) So far I have collected a lot of feedback on the definition and direction, and we hope to provide a public response to it shortly.

It sounds like you are an FMA supporter - beyond the raw FLOPS improvement, do you have any sensitivity to the numerical advantages FMA can provide? There are obviously a lot of tradeoffs in the implementations we can provide, and having some data to understand how you would use it would be very helpful.

Regards,

Mark Buxton

(OK, that was weird, my final post went through two days late....)

Igor, thanks for your response/support!
Regards,
Mark

On Itanium, for applications I support, we saw about a 5% performance difference between FMA disabled (using separate instructions for multiply and for add) and enabled. I don't know how such figures from other platforms would translate to a Sandy Bridge successor, where instruction issue rate is not as much of a limiter as it was in past architectures. The big gains from FMA occur in serial code, where the full latency of the add and multiply is exposed. We do everything we can to avoid such situations; maybe someone thinks those are the only cases that count. I expect the gain from the initial AVX to be much larger than the subsequent gain from FMA, but I don't put FMA in the category of instruction additions that caused more noise than benefit.

On the MIPS R8000, there were disastrous situations where applications broke with FMA, for example sqrt(a*a - b*b) producing a NaN or a run-time abort when a == b, because one product is rounded and the other is not. Other than that, FMA usually gives more accurate results; the major objection is these small inconsistencies.

Igor Levicki:

You are welcome.
@tim18:
I am not sure which applications show only a 5% performance gain. Could you please elaborate?

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

Quoting - Igor Levicki

Mark,
Early disclosure of new technologies is always a good thing in my opinion, especially here in the computer industry -- it is a symbiosis of hardware feature set and software support that makes computers tick.
There are several reasons why FMA is important to developers:
- Porting -- AltiVec has FMA and x86 doesn't (take a look at the FFTW 3.0+ codelets, which use FMA on Motorola hardware for a speed boost; see their paper).
- DAW -- Any DSP operation where there is accumulation (like the FIR/IIR filters used for delay lines, reverberation, etc.) would benefit from FMA's additional precision.
- Larrabee -- being x86-based it is neither a CPU nor a GPU, but it still has to have FMA if you want it to succeed as a number cruncher.
Especially with Larrabee in the picture it doesn't make any sense to keep FMA out of x86 for so long. Trust me, you really need developers to start using FMA before Larrabee hits the shelves, and you definitely want Larrabee to have FMA. The only way to accomplish both is to make FMA part of x86 as fast as possible.

I second all of these points. FMA has been implemented in GPUs for years now as a very effective way to double raw FLOPS performance. In graphics and multimedia code the occurrence of a multiply followed by an addition is so common that effective performance increases of over 50% are no exception.

It seems to me that the transistor budget required to widen datapaths to 256-bit for AVX is far greater than that for adding FMA support. So I don't fully understand why it has been postponed.

If a compromise was really necessary, I believe supporting the instructions without actual FMA execution units (by splitting them into two operations) would have been a better option. This way software developers could use them early, and when their binaries run on a future CPU with actual FMA units they would get a performance boost without code changes.

I'm convinced this applies to additional instructions as well. For instance scatter/gather operations are still sorely missing so they should be added as soon as reasonably possible, even if early implementations are not optimal. Developers need functional instructions, not specifications on paper, for fast adoption of new ISA extensions. Popular instructions are then automatically put in the spotlight so you know what deserves a faster implementation for later processors...

Quoting - c0d1f1ed

(snip)
It seems to me that the transistor budget required to widen datapaths to 256-bit for AVX is far greater than that for adding FMA support. So I don't fully understand why it has been postponed.

If a compromise was really necessary, I believe supporting the instructions without actual FMA execution units (by splitting them into two operations) would have been a better option. This way software developers could use them early, and when their binaries run on a future CPU with actual FMA units they would get a performance boost without code changes.

(snip)

Doing a full FMA (that doubles FLOPS) will indeed be very expensive for us, unfortunately I can't discuss all the reasons. When we looked at the performance benefit vs. cost for a wide variety of workloads, 256-bit vectors came out on top - at least when the user is able to put the effort into vectorizing their code :). That's why we did wider vectors first.

You have an interesting suggestion about deploying a 2-uop FMA. Would you still support it if the performance were not equal to or better than (in all cases) the alternative mul+add - i.e. if the additional latency of putting the multiply on the critical path were not compensated by higher throughput in such an architecture? Some codes are sensitive to this effect (or you would have to be really smart about where you deploy such an FMA).

Regards,

Mark

Igor Levicki:

Mark,
If the suggested 2-uop FMA had the same precision as a fully implemented FMA (in other words, no intermediate rounding), and if the performance stayed the same as for the MUL + ADD combination, I would use it.
Moreover, I have suggested (on several occasions) that SCATTER and GATHER instructions need to be added even if the first implementation isn't more efficient than the current ways of performing the same operation (although my guess is that it would be more efficient anyway, if only because of the reduced code size). Later, in Larrabee, those could map automatically onto the texture fetch hardware.
I also asked for an instruction that returns the integral and fractional parts of an XMM register.
Rationale and examples for those instructions can be found here:

http://software.intel.com/en-us/forums/showthread.php?t=52844
http://software.intel.com/en-us/forums/showpost.php?p=56866
I am really hoping someone will finally notice those ideas and add those sorely missing instructions to the ISA.
You should also consider a linear interpolation instruction (aka LRP in GPU assembly) now that Larrabee is in the pipeline.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

The (uncommitted) gather instruction we have been discussing is not a gather in the usual sense of HPC or compiler technology. It is simply a strided load, not an indirect indexed load.
The value of the strided load in several applications hinges on whether it would accelerate matrix transposition.
Previous discussions of indirect indexed load indicated that the most likely instruction implementation would not improve on the performance of compiled code.

Igor Levicki:

Quoting - tim18

Previous discussions of indirect indexed load indicated that the most likely instruction implementation would not improve on the performance of compiled code.

What? Are you 100% positive about that? Have you run a simulation?

Old code I posted can be a bit shorter now that we have INSERTPS, but still, take a look at this mess again:

; SSE4.1 gather emulation:

	mov		esi, dword ptr [data]
	mov		edx, dword ptr [index]
loop:
	...
	mov		eax, dword ptr [edx]
	movss		xmm0, dword ptr [esi + eax]
	mov		eax, dword ptr [edx + 4]
	insertps	xmm0, dword ptr [esi + eax], 0x10	; insert into element 1
	mov		eax, dword ptr [edx + 8]
	insertps	xmm0, dword ptr [esi + eax], 0x20	; insert into element 2
	mov		eax, dword ptr [edx + 12]
	insertps	xmm0, dword ptr [esi + eax], 0x30	; insert into element 3
	...
	jnz		loop

; HYPOTHETICAL gather instruction:

	mov		esi, dword ptr [data]
	mov		edx, dword ptr [index]
loop:
	...
	gmovps		xmm0, xmmword ptr [edx]
	...
	jnz		loop

So, is it really faster to fetch, decode, and execute eight instructions taking 37 bytes (more in 64-bit mode) on a critical code path, instead of a single, possibly less than optimally implemented, instruction?

I would really like to understand why a single instruction might be slower.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

Quoting - mjbuxton
You have an interesting suggestion about deploying a 2-uop FMA. Would you still support it if the performance were not equal to or better than (in all cases) the alternative mul+add - i.e. if the additional latency of putting the multiply on the critical path were not compensated by higher throughput in such an architecture? Some codes are sensitive to this effect (or you would have to be really smart about where you deploy such an FMA).

Absolutely. It's really about adoption and compatibility:
Scenario 1: FMA instructions are added later when single uop execution units are available.
Let's say this happens in four years. At that point developers will be eager to use FMA, but they have to be careful to still support older processors. So they have the choice of writing two code paths, or just not using FMA till it's ubiquitous. Maintaining multiple code paths is a software engineer's daily nightmare (it's not just FMA; it's other ISA extensions and many other system parameters as well). So it's not uncommon to only start supporting new instructions years later. In fact, I believe that only recently has it become relatively safe to assume SSE2 support as a minimum (i.e. putting that on the box won't cost us a significant number of clients). That's a full 7 years after its introduction! So in this scenario FMA would suffer pretty slow adoption up to the year 2019...
Scenario 2: FMA instructions are added sooner and executed in two uops.
Developers can and will experiment with these instructions sooner. Compilers and other tools will support them years sooner too. Code size, extra precision, and the potential of seeing faster implementations in future processors (without requiring a code rewrite) are enough incentive for the early adopters. By the time single-uop FMA processors become available they'll see a nice boost in performance. That's good for Intel too, since real-world applications can be used as benchmarks, which is a lot more convincing for consumers than numbers on paper, and a much earlier return on investment. And just as importantly, those 2-uop FMA processors will still run applications that have one code path and demand FMA as a minimum. They won't run them faster than an application with two code paths (one using separate mul and add), but at least they'll run them. There's nothing more frustrating than not being able to run an application because the hardware doesn't support it (and guess who gets the blame).
So I think scenario 2 is a win for everybody (hardware guys, software guys and consumers). And I strongly believe it applies to much more than FMA. Of course you can't just blindly start adding instructions, but if you already decided you're going to invest transistors into a feature at some point, it really doesn't hurt to have a functional 'interface' much sooner. In fact, if it turns out that developers are not so interested in the feature after all, you have the option of postponing the full-fledged implementation a couple years till they're more interested, investing those transistors elsewhere in the meantime.
Lastly, in case anyone's worried about the marketing aspects: it's simply a case of not marketing to consumers until the faster execution units are added. Core 2's vastly increased SSE performance has been a grand success even though SSE has been around for a decade. It's easy to market when the numbers speak for themselves. ;)

Quoting - tim18

The (uncommitted) gather instruction we have been discussing is not a gather in the usual sense of HPC or compiler technology. It is simply a strided load, not an indirect indexed load.
The value of the strided load in several applications hinges on whether it would accelerate matrix transposition.
Previous discussions of indirect indexed load indicated that the most likely instruction implementation would not improve on the performance of compiled code.

Personally I think a strided load would be a waste in the long term. Sooner or later true scatter/gather will be added (*) and the strided load becomes another superseded legacy instruction that you have to drag with you till the end of days.

If that's not a concern, fine, but please consider adding the gather instruction as soon as possible. An early implementation could work just like in Larrabee: using multiple wide loads until all the elements have been 'gathered'. It would definitely be faster than using individual insertps instructions, with a minimal latency equal to that of a movups (for sequential indexes, or indexes all in the same vector).

And it would be useful for a lot more than just matrix transposition. It opens the door to things that aren't even conceivable today. Truly any loop with independent iterations could be (automatically) parallelized once we have scatter/gather instructions, no matter how the data is organized, even in the presence of pointer chasing. So it's not just for HPC or multimedia (although those would benefit massively as well). If you think that's radical, please realise that the rules for writing high-performance software already changed dramatically when we went multi-core. So you might as well finish what you started and add scatter/gather support, or the CPU will keep losing terrain to the GPU. You're nearing the point where people just buy the cheapest CPU available and instead invest in a more powerful GPU to do the 'real work'. The competition (both AMD and NVIDIA) are in rather sweet spots to take the biggest pieces of the pie in this scenario. So you'd better give people good reasons to keep buying the latest CPUs, by adding instructions to support algorithms that would otherwise run better outside the CPU. The only reason I care is that I believe it's better for the end user.

Anyhow, I like Igor's suggested syntax for a gather instruction, but I believe the following would be even more powerful:

movups ymm0, [r0+ymm1*4]

Note that I'm using the same mnemonic as a regular load. In fact, I believe it could use the same encoding except for one bit to indicate the use of a vector register as index(es). Also note how r0 is used as a base pointer instead of requiring the implicit use of rsi, and how I can scale the indices (all using regular SIB byte encoding).

(*) P.S.: It's really not a question of whether scatter/gather will be necessary. As you continue to widen the vectors, accessing data at different locations becomes a massive bottleneck. AVX can scale up to 1024-bit (32 dwords), so you'd better have flexible and fast ways to get data in and out of such vectors. Neither insertps nor a strided load helps much when an arithmetic operation on up to 32 elements costs one cycle (throughput) while the load costs 32 cycles or more! So it seems obvious to me to architect the scatter/gather instructions sooner rather than later and make them as future-proof as possible.

Quoting - c0d1f1ed

You're nearing the point where people just buy the cheapest CPU available and instead invest in a more powerful GPU to do the 'real work'.

I almost thought I was exaggerating, and then I found this: NVIDIA Introduces NVIDIA Quadro CX...

Igor Levicki:

@c0d1f1ed:
I completely agree that the strided load is useless. On the other hand, both FMA and gather are essential.
As for the gather syntax, it was just an example off the top of my head. Your idea is better.
@Mark:
Of course we don't want to add instructions blindly, but it has been done before by Intel engineers, completely ignoring developers' needs in the process.
So, being more picky about what gets added from now on is simply not enough -- you need to evaluate our ideas and add the instructions that we really need, instead of just more carefully picking from your own ideas, which seem to be severely restricted by knowing the implementation cost and marketing value in advance.
Moreover, we only hear excuses about how the suggested instructions would be impractical or inefficient, but we need some sort of proof, because trust has been broken by the useless instructions added to the ISA in the past.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

Quoting - Igor Levicki

I completely agree that the strided load is useless.

That's not exactly what I said. It's definitely somewhat useful in certain cases. But it's just going to be entirely superseded by a gather instruction sooner or later, so why bother? It would be yet another stopgap that makes x86 look even messier in the long run. I'd much rather have a gather instruction that in its first implementation doesn't provide much if any benefit over insertps, but is entirely flexible and holds the promise of faster implementations over time with no code change required.

Igor Levicki:

Quoting - c0d1f1ed

That's not exactly what I said. It's definitely somewhat useful in certain cases. But it's just going to be entirely superseded by a gather instruction sooner or later, so why bother? It would be yet another stopgap that makes x86 look even messier in the long run. I'd much rather have a gather instruction that in its first implementation doesn't provide much if any benefit over insertps, but is entirely flexible and holds the promise of faster implementations over time with no code change required.

I was simply trying to amplify the point that a strided load instruction (which can be emulated using INSERTPS today) will be possible to emulate via a gather instruction in the future, but not vice versa -- i.e. you cannot emulate gather with a strided load. And most likely the strided load instruction would suffer from the same cache-line split penalty as all current implementations of unaligned load.

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.

This gather/scatter instruction you are describing would definitely be one of the more useful instructions - freeing SSE from only working efficiently on 'chunks'. I hope to see it soon!
