Converging AVX and LRBni

Posted by c0d1f1ed

Hi all,

With Larrabee being canned as a discrete GPU, I was wondering whether it makes sense to let the CPU take on the role of GPU and high-throughput computing device. Obviously power consumption is a big issue, but since AVX is specified to support up to 1024-bit registers, such wide operations could be executed on SNB's existing 256-bit execution units in four cycles (throughput). Since it's one instruction, this takes a lot less power than four 256-bit instructions. Basically you get the benefit of in-order execution within an out-of-order architecture.

The only other thing that would be missing to be able to get rid of the IGP (and replace it with generic cores) is support for gather/scatter instructions. Since SNB already has two 128-bit load units, it seems possible to me to achieve a throughput of one 256-bit gather every cycle, or 1024-bit every four cycles. In my experience (as lead SwiftShader developer) this makes software texture sampling perfectly feasible, while also offering massive benefits in all other high-throughput tasks.

Basically you'd get Larrabee in a CPU socket, without compromising any single-threaded or scalar performance!

Thoughts?

Nicolas
Posted by Matthias Kretz

Have you missed the MIC announcement? I agree that it would be nice to have LRBni features in the CPU. But at least Intel hasn't completely given up on the Larrabee developments yet.

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc
Posted by c0d1f1ed
Quoting Matthias Kretz: Have you missed the MIC announcement? I agree that it would be nice to have LRBni features in the CPU. But at least Intel hasn't completely given up on the Larrabee developments yet.

No, I haven't missed the MIC announcement. It looks like Knights Corner will be quite impressive and could be very successful in the HPC market (NVIDIA's Tesla chips are selling like mad for supercomputers, despite the lack of a fully generic/flexible programming model and relatively low effective performance at complex tasks).

That said, the MIC product just seems to have to make up for the investment into Larrabee as a discrete GPU. Frankly it's too soon for high-end graphics to become fully programmable. There's potential to do things more efficiently but the software market isn't going to radically change course overnight, and Larrabee can't match the performance of the competition's cards which have been fully dedicated to Direct3D/OpenGL rendering for ages. Larrabee could have been great as a game console chip, but it looks like Intel lost that opportunity as well.

Low-end graphics doesn't have that disadvantage though. The competition isn't nearly as fierce, and a CPU with high throughput performance would be welcome in a very wide variety of markets. It can deliver adequate low-end graphics with limitless features, but it would also be awesome at other tasks. By increasing the number of cores they can even conquer the mid-range graphics market and eventually the high-end market. This bottom-up approach seems a lot more realistic and low-risk to me than Larrabee's unsuccessful top-down attack on the market. Imagine hardcore gamers buying 256-core CPUs by the end of this decade, and other markets having other core counts to satisfy their computing needs. I don't think heterogeneous architectures are the future; developers detest them (i.e. are limited by their fixed features), and a homogeneous architecture which combines the best of both worlds is within reach.

Intel could leverage what it learned from Larrabee to obtain graphics market dominance through its CPU product line. Executing 1024-bit instructions on 256-bit execution units can help keep the power consumption in check (and help hide memory latency), while gather/scatter is essential for the variety of memory access patterns and data ordering operations needed by graphics and many other computing tasks. Both of these features seem perfectly feasible, without hurting performance for legacy workloads.

Posted by c0d1f1ed

Could any Intel engineer/scientist tell me what the expected power efficiency would be for executing AVX-1024 on 256-bit execution units?

Since out-of-order execution takes a lot of power, it seems to me that replacing four AVX-256 instructions with one AVX-1024 instruction, without widening the execution unit, could be a fairly significant power saving, close to that of in-order execution (like Larrabee or other throughput oriented architectures). It seems to me it could combine the advantages of both the GPU and CPU. Or am I overlooking something that makes heterogeneous architectures more attractive? Then why does Sandy Bridge have more GFLOPS in its CPU than in its IGP? All that's lacking to get good efficiency is gather/scatter and some technology to lower the power consumption...
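
To make this concrete, here's a minimal behavioral sketch (illustrative C++ only; nothing here is a real intrinsic or a hardware description) of what sequencing one 1024-bit operation through a 256-bit execution unit would mean: the instruction is fetched, decoded and scheduled once, and the unit then steps through four 256-bit beats on its own.

#include <immintrin.h>

// Behavioral model of a hypothetical "AVX-1024" packed add: issued once,
// executed as four 256-bit beats on the existing execution unit. The loop
// stands in for the unit's internal sequencing; a, b and dst are assumed
// to be 32-byte aligned arrays of 32 floats.
void vaddps_1024(float *dst, const float *a, const float *b)
{
    for (int beat = 0; beat < 4; ++beat) {            // 4 cycles of throughput
        __m256 va = _mm256_load_ps(a + 8 * beat);     // next 256-bit slice
        __m256 vb = _mm256_load_ps(b + 8 * beat);
        _mm256_store_ps(dst + 8 * beat, _mm256_add_ps(va, vb));
    }
}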
Posted by c0d1f1ed

I just realized I might be able to partially estimate the power consumption impact myself, using existing instructions which operate on registers wider than the actual execution units.

One of these is the vdivps instruction. It's executed on a 128-bit execution unit, so vdivps essentially replaces two divps instructions. I wonder, though, whether vdivps is split into multiple uops which are scheduled separately, or whether it's one uop and the execution unit takes care of the sequencing. Unfortunately it looks like the former, since Agner Fog's documents say it takes 3 uops (essentially two divps and a vinsertf128). Can anyone think of other instructions, or perhaps an entirely different approach, to estimate the power consumption impact of executing 1024-bit AVX instructions on (sequencing) 256-bit execution units? Is this even a viable idea at all?
Posted by bronxzv

> this makes software texture sampling perfectly feasible

pure software texture sampling is already clearly feasible with today's AVX; look at this example:

http://www.inartis.com/Products/Kribi%203D%20Player/FeaturesLab/page_example/Materials_bump_map.aspx

the dummies have a diffuse map, a bump map and a reflection map, and the scene is rendered (with per-sample lighting and sample-exact shadow casting) at 40+ fps @ 1920 x 1080 on a 2600K at stock frequency

Posted by admin
That's not good enough. Don't get me wrong, you've got an impressive software renderer. But let's face it, that scene is pretty simple. There are only two low-poly objects, the lighting is simple by today's standards (no long shaders), it appears to be using per-polygon mipmapping instead of per-quad mipmapping, I can't spot any trilinear or anisotropic filtering, and based on previous conversations we had, you're packing multiple textures together. The latter is a clever trick to compensate somewhat for the lack of gather/scatter support, but unfortunately it's not generally applicable. To really converge the GPU and CPU into a superior homogeneous architecture, higher effective performance is required, with lower power consumption. With FMA support already on the way, I think gather/scatter will be absolutely critical to be able to make efficient use of all this computing power. And executing 1024-bit instructions on 256-bit execution units should help keep the power consumption in check.
Posted by bronxzv

> There are only two low-poly objects

sure indeed, that's why it's an interesting example when talking about texture sampling, since a significant share of the CPU budget goes to texture operations rather than scene graph traversal, geometry, etc. btw the 3 textures (diffuse map, bump gradient map and reflection map) are fully independent (3 distinct mipmap pyramids)

with a high poly count model such as:

http://www.inartis.com/Company/Lab/KribiBenchmark/KB_Robots.aspx
or
http://www.inartis.com/Company/Lab/KribiBenchmark/KB_Skyline_City.aspx

less than 5% of the time is spent on textures

Posted by c0d1f1ed
Quoting bronxzv:
> sure indeed, that's why it's an interesting example when talking about texture sampling, since a significant share of the CPU budget goes to texture operations rather than scene graph traversal, geometry, etc.

Which is exactly why it's not good enough. Real 3D applications are far more complex, so you need the extra efficiency gather/scatter would bring (not just for texturing but for other tasks as well), and you also need the lower power consumption of executing 1024-bit instructions on 256-bit execution units (for integer operations even 128-bit would probably work well).

I can't see why you'd argue against that. I hate to tell you, but WebGL and Flash Molehill (featuring SwiftShader support) could quickly make Kribi obsolete. The only way the software renderer can still prove useful is if it's actually consistently faster and more flexible than an IGP. You absolutely need gather/scatter to achieve that.

Posted by c0d1f1ed

By the way, the value of converging AVX and LRBni goes way beyond graphics. So gather/scatter support is likely to be added to increase SIMD efficiency anyhow, regardless of whether you see much need for it in your software renderer. The other big thing that separates the CPU and Larrabee is performance/Watt, due to the out-of-order architecture, but that might be fixable by reducing the instruction rate...

Unless anyone sees any reasons why such features are not likely to be feasible?

Posted by bronxzv

> Unless anyone sees any reasons why such features are not likely to be feasible?

as a matter of fact, neither gather/scatter nor 1024-bit vectors have been announced for AVX yet, so we can't plan for them

in both cases I think they don't match well with SMT; 4-way SMT is a more likely future IMHO

Posted by c0d1f1ed

It's not about what's announced yet or not. It's about what makes sense. FMA will be added sooner or later, and at that point the bottleneck from irregular data access patterns will be unbearable. To make all those transistors spent on SIMD really count, they should add the one thing to make it complete: gather/scatter. It's only a matter of time before Intel or AMD realizes how big of a win that would be (making every CPU capable of efficient and flexible high-throughput computing like Larrabee).

1024-bit vectors haven't been announced yet either, but they are already part of the AVX spec! So it seems relatively simple to implement them by executing the instructions on 256-bit execution units. If it significantly improves performance/Watt, it makes sense and will be added at some point.

I also can't help but think Intel already has long-term plans for AVX. It must have cost a lot of transistors and a lot of design hours to implement it in Sandy Bridge. FMA and AVX-1024 clearly indicate they intend to invest even more time and resources into it, and suggest they don't think the IGP is suitable for anything beyond legacy graphics. Perhaps the plans beyond FMA haven't solidified yet, so I don't think it hurts to discuss what I think makes most sense, on an Intel forum.

Besides, your product will become obsolete unless it can outperform the IGP and offer superior features. So you'd better join the discussion on what makes sense, and evaluate the cost/gain of all the options. It would give you very valuable insights once some of these things do get announced. In fact you could lead the custom software rendering revolution by offering frameworks with unique features.

I don't think 4-way SMT is worthwhile. First of all, 2-way SMT offers about 30% speedup at best, so 4-way SMT will likely offer much less. Also, I believe the thread count should be kept as low as possible. It's not unlikely that the gains from 4-way SMT are nullified by the synchronization overhead of managing twice the number of threads. Core counts will keep increasing anyway, so there's no need to make things even harder with more threads per core.

Correct me if I'm wrong, but I think what you're really after is 'strands', not threads: synchronous execution like on a GPU. I believe this is exactly what can be achieved by executing 1024-bit instructions on 256-bit execution units. It offers similar latency hiding advantages, and doesn't suffer from thread synchronization overhead. What's more, unlike 4-way SMT it reduces the instruction rate, offering power savings not unlike a GPU's. Larrabee's high power consumption must have been partly due to 4-way SMT, and made it miss its goals. We don't want Intel to make that same mistake twice. For the CPU I can't really think of a reason not to implement AVX-1024 on the existing 256-bit and 128-bit execution units...

Posted by bronxzv
> Correct me if I'm wrong, but I think what you're really after is 'strands', not threads: synchronous execution like on a GPU.

no, no, I mean 4 thread contexts, as in POWER7 or Larrabee

> I believe this is exactly what can be achieved by executing 1024-bit instructions on 256-bit execution units. It offers similar latency hiding advantages,

not at all: in case of a cache miss the microcoded execution of the 4 sub-parts will be stalled, unlike with SMT, where at least one thread among four will do useful work most of the time

in my experience (i.e. the graphics workloads I'm accustomed to; it may not apply to others) it's easier to achieve good scalability with more cores than with wider vectors; a reasonably well optimized renderer will avoid synchronization between threads within the computation of a frame. Also, stacked DRAM is around the corner, so the bandwidth/latency gap will widen shortly, maybe by more than 2x, and having more thread contexts will be welcome just to keep this 30% speedup

Posted by c0d1f1ed

I know you meant 4 thread contexts, but I don't think it offers the best cost/gain balance.

With my AVX-1024 proposal, execution won't easily stall because of a cache miss. Independent instructions take four times longer to execute (but also perform four times more work), so by the time only dependent instructions are left, the cache miss has very likely been resolved; for instance, a scheduler window holding 40 independent 1024-bit uops keeps a 256-bit unit busy for 160 cycles, versus 40 cycles for the same window of 256-bit uops. So it really does help hide miss latency. And in the case of RAM accesses, there's still 2-way SMT to cover for that. In total this solution is twice as effective at hiding memory access latency as 4-way SMT alone, and then there's the thread synchronization and power consumption advantage.

Indeed it's currently hard to achieve good scalability with wider vectors, but that's exactly because of the lack of gather/scatter!

Posted by Thomas Willhalm (Intel)

The latest Intel microarchitecture, code-named "Sandy Bridge", comes with a decoded instruction cache. If the decoded instructions are executed from this cache, the power consumption of decoding is avoided altogether.

Posted by c0d1f1ed
Quoting Thomas Willhalm (Intel):

The latest Intel microarchitecture, code-named "Sandy Bridge", comes with a decoded instruction cache. If the decoded instructions are executed from this cache, the power consumption of decoding is avoided altogether.

I know. And that's obviously a nice power saving feature, but the CPU's peak performance/Watt is still lower than that of the IGP. However, the IGP's absolute performance is lower, and it's not useful for anything other than legacy graphics (e.g. I can't imagine it efficiently performing physics calculations while also rendering graphics, all while the CPU's computing power is left unused). The CPU will pull even further ahead when support for FMA instructions is added! To make the IGP catch up with that, and to make it more flexible for GPGPU tasks, it would have to become a lot bigger and a lot more complex. But that seems pointless to me, given that we've already got all this flexible processing power in the CPU.

So it seems more worthwhile to me to think about how to (further) lower the CPU's power consumption instead. As far as I'm aware, out-of-order instruction scheduling is still responsible for much of it. Since getting rid of out-of-order execution itself is obviously not an option, the amount of arithmetic work per instruction should be increased instead. Executing 1024-bit AVX operations on the existing execution units over multiple cycles seems to me like the perfect way to achieve that.

Gather/scatter support would also massively improve performance/Watt. Currently it takes 18 extract/insert instructions to perform a 256-bit gather of 32-bit elements. Each of them occupies ALU pipelines and moves around 128 bits of data. In total that's a lot of data movement and thus lots of heat. Not to mention each of these 18 instructions is scheduled separately and there's a long (false) dependency chain. Performing gather/scatter in the load/store units instead would free up the ALU pipelines, drastically reduce the data movement, and turn it into a single instruction. Even a relatively straightforward implementation, where a gather operation is executed as two sets of four load operations, would increase the peak throughput to one gather instruction every four cycles. The combination of both lower power consumption and higher performance should make it pretty irresistible.
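
For clarity, the reference semantics of such a gather are trivial; what follows is a scalar model of the 256-bit dword gather discussed here, with the suggested decomposition into two sets of four loads (both the instruction and the split are hypothetical, not an announced feature):

// Scalar model of a 256-bit gather of 32-bit elements. In the proposed
// implementation each 128-bit half would be handled as a set of four
// loads by one of the two load units.
void gather_d32(float dst[8], const float *base, const int idx[8])
{
    for (int half = 0; half < 2; ++half)        // one half per load unit
        for (int i = 0; i < 4; ++i)             // four loads per half
            dst[4 * half + i] = base[idx[4 * half + i]];
}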

Posted by Igor Levicki

Well, gather and scatter instructions are something I was arguing for ever since Pentium IV days.

The CPU still has a long road ahead to reach some of the performance points which are commodity for today's GPUs:

1. Memory bandwidth (today's high-end video card has >1GB RAM with 192GB/sec bandwidth)
2. High thread count
3. Dedicated tessellation, texturing and video decoding/processing hardware

Etc, etc. On the other hand, look how fast GPUs are getting CPU-related functionality:
http://www.nvidia.com/docs/IO/100940/GeForce_GTX_580_Datasheet.pdf

Did you ever think that C++ on a GPU will be possible?

Last year the CPU's main selling point for HPC was double precision and 64-bit memory addressing. How about now, when the GPU has both, and when the GPU release cycle is every 6 months while new CPUs take 2 years to reach the market?

Not to mention that to get a better GPU you just have to get another video card, while with a CPU you need to change everything, because CPU designers don't care about hardware compatibility; and yet they bring the smallest increases in performance and functionality, while almost every new GPU is a revolution in itself.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
Posted by bronxzv

> Each of them occupies ALU pipelines and moves around 128 bits of data

Nope, they are mostly 32-bit moves from/to the 256-bit PRF

>each of these 18 instructions is scheduled separately

which is great, since you can interleave them with instructions from the other thread; it's faster overall, in case of cache misses, than a fat serializing monolithic instruction

Posted by c0d1f1ed
Quoting bronxzv:
> Each of them occupies ALU pipelines and moves around 128 bits of data

Nope, they are mostly 32-bit moves from/to the 256-bit PRF

> each of these 18 instructions is scheduled separately

which is great, since you can interleave them with instructions from the other thread; it's faster overall, in case of cache misses, than a fat serializing monolithic instruction

extractps is a port 5 uop, not coincidentally the same port as all types of shuffle operations: 128 bits of data go in, and 32 bits get extracted. insertps is also a port 5 uop, and if the inserted data comes from memory (as for gather) there's an extra port 2/3 uop for the load: 128 bits + 32 bits go in, and 128 bits come out.

The reason it's hauling around all this data is that the result has to be carried back to the top of the ALU by the forwarding network, in case the next instruction needs it as input. The PRF is hardly involved as long as you're executing dependent instructions. Since the CPU doesn't know you're planning to overwrite each element, it needlessly pushes around lots of data.

There's absolutely nothing great about scheduling 18 instructions which aren't even intended to be dependent. Gather could simply be a pair of port 2/3 uops instead. That doesn't make it a fat monolithic instruction: each of the four loads for every 128-bit half can finish in any order, it doesn't occupy port 5, your thread can advance faster, and there's less of a chance that the second thread also runs out of work and the whole core stalls.

So really, there's nothing but advantages to having hardware support for gather/scatter.

Posted by sirrida

It would be fine if at least the gather command were there; in my experience the scatter command is not used as often.

Programming such endless chains of insertxx commands is ugly, stupid, slow, space-inefficient and really should be unnecessary. All these commands must be decoded and executed, and they needlessly fill the memory and uop caches. A gather command could (in a first implementation) at least generate the flood of uops by itself, and the out-of-order mechanism would take care of the proper interleaving with other commands which can be executed in parallel (superscalar). I know that such a command will have a *long* latency.

...and things get even worse if one wants to gather small entities such as bytes. I have often missed commands such as "rep xlat" ("jecxz end; redo: lodsb; xlat; stosb; loop redo; end:" with al preserved). If gather were there, it could be simulated with some code around "gather.byte xmm1,[ea]", where the bytes of xmm1 hold unsigned offsets from ea (the effective memory address).
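
A scalar reference model of that hypothetical byte gather might look as follows (all names illustrative; as described above, the register doubles as index input and result, and the byte offsets confine all accesses to a 256-byte window):

// "gather.byte xmm1, [ea]": each byte of xmm1 is an unsigned offset from
// ea and is replaced by the byte it points to (hypothetical instruction).
void gather_byte(unsigned char xmm1[16], const unsigned char *ea)
{
    for (int i = 0; i < 16; ++i)
        xmm1[i] = ea[xmm1[i]];
}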

By the way: I've never understood why e.g. the handy commands "rep stosb", "rep movsb", "jecxz", "loop", "enter" (level 0) and "leave" can be outperformed by any replacement stuff. Modern CPUs ought to be smart enough to execute such dedicated commands much faster. It's a shame that they don't!

Posted by c0d1f1ed
Igor, while I share your love for gather/scatter, I don't think your other expectations are very realistic.
1. High memory bandwidth at low latency is very expensive. GPUs owe their high bandwidth to GDDR memory with high latency, but that would be unacceptable for a CPU, since lots of software relies on low RAM latency and you don't want to compromise its performance. Besides, the CPU greatly compensates for lower RAM bandwidth by having large caches. Future denser cache technology like T-RAM could make RAM bandwidth even less of an issue.
2. Having lots of threads is not a good thing. It comes with extra synchronization overhead, and some software just doesn't scale well to a high thread count. If you can achieve the same peak throughput with fewer threads, that's always a good thing in practice. Besides, GPUs do not have a high thread count. The terminology is confusing, but truly independent instruction streams are called kernels on the GPU, and only very recently have they managed to get a few kernels to execute concurrently. What GPU manufacturers like to call threads are really dependent strands within the same kernel which run on tightly coupled SIMD lanes. In this sense a quad-core Hyper-Threaded CPU with two 256-bit AVX execution units processing 1024-bit vectors could be considered to have 512 of these "threads" in flight, while only really running 8 independent "kernels". Anyhow, again, the CPU isn't as far behind as it might appear.
3. You don't need dedicated tessellation or texturing units to outperform the IGP. A homogeneous architecture has more area available, and using dynamic code generation you only pay for the features that are actually in use. Also, tessellation and even texturing are very likely to become fully programmable at some point in the future anyhow. Furthermore, dedicated hardware is only useful for legacy graphics; for anything else it would be idle silicon (read: a waste). Gather/scatter and AVX-1024 would also benefit lots of other high-throughput applications.

C++ on the GPU is still a bit of a gimmick, really. Each thread (strand) only has access to a 1 kB stack, so you can forget about deep function calls and long recursion; it even starts spilling data to slow memory long before this limit is reached. The performance per strand is also really low, so you'd better have thousands of them to compensate for the latency. And last but not least, you can't launch kernels from within kernels. So basically you shouldn't expect a quick port of your existing C++ code to run efficiently (or even run at all).

That said, the clock is indeed ticking since GPU manufacturers also see the benefits of evolving their architecture into something more CPU-like. Fortunately Intel merely has to add FMA, gather/scatter and power saving features like executing AVX-1024 on execution units of lesser width to regain dominance in the HPC market (and far beyond).

Posted by bronxzv

you may be right for the vinsertps (6/8 loads), though I'd love a confirmation from someone in the know
for the low elements you can use 32-bit vmovss (2/8 loads), and it looks like an easy optimization to simply clear the 96 MSBs instead of "moving" 128 bits; here again I'll welcome an explanation of the actual workings beyond our guesswork

for extracting the 32-bit indices from the very same XMM
vmovd edi, xmm2
...
vpextrd edi, xmm2, 1
...
vpextrd edi, xmm2, 2
...
vpextrd edi, xmm2, 3
...
it looks rather odd to move 128 bits each time; at least nothing in the ISA asks for it, so it can be optimized in forthcoming chips if it's really as bad as you say

Posted by bronxzv

>Programming such endless chains of insertxx commands is ugly, stupid, slow

Programming gather is no more effort than calling an inlined Gather() function (*1), or using the array notation for gather, A[B[:]], if you use Intel C++

>Gather commands could at least generate the flood of ops

I can't see how it will be a real improvement, since the uop cache pressure will be the same as with a software-synthesized gather and execution will be on par with the current situation; only the x86 code density will be better, but that's really not an important point for high performance code with a lot of inlining/unrolling

*1: Examples of generic gather functions (4 & 8 FP32 elements)

#include <immintrin.h>

#define INLINE __forceinline

// Gather 4 FP32 elements: load the first element with a scalar move,
// then insert the three remaining individually loaded elements
INLINE __m128 Gather(const float *base, const __m128i &indices)
{
    __m128 res = _mm_load_ss(base + _mm_cvtsi128_si32(indices));
    res = _mm_insert_ps(res, _mm_load_ss(base + _mm_extract_epi32(indices, 1)), _MM_MK_INSERTPS_NDX(0, 1, 0));
    res = _mm_insert_ps(res, _mm_load_ss(base + _mm_extract_epi32(indices, 2)), _MM_MK_INSERTPS_NDX(0, 2, 0));
    res = _mm_insert_ps(res, _mm_load_ss(base + _mm_extract_epi32(indices, 3)), _MM_MK_INSERTPS_NDX(0, 3, 0));
    return res;
}

// Gather 8 FP32 elements: two 128-bit gathers combined into one 256-bit vector
INLINE __m256 Gather(const float *base, const __m256i &indices)
{
    const __m128 low  = Gather(base, _mm256_extractf128_si256(indices, 0)),
                 high = Gather(base, _mm256_extractf128_si256(indices, 1));
    return _mm256_insertf128_ps(_mm256_castps128_ps256(low), high, 1);
}

compiling the 256-bit variant generates the 18 instructions discussed with c0d1f1ed:

vmovd edi, xmm2
vextractf128 xmm6, ymm2, 1
vmovss xmm0, DWORD PTR [ecx+edi*4]
vpextrd edi, xmm2, 1
vinsertps xmm1, xmm0, DWORD PTR [ecx+edi*4], 16
vpextrd edi, xmm2, 2
vinsertps xmm3, xmm1, DWORD PTR [ecx+edi*4], 32
vpextrd edi, xmm2, 3
vinsertps xmm0, xmm3, DWORD PTR [ecx+edi*4], 48
vmovd edi, xmm6
vmovss xmm4, DWORD PTR [ecx+edi*4]
vpextrd edi, xmm6, 1
vinsertps xmm5, xmm4, DWORD PTR [ecx+edi*4], 16
vpextrd edi, xmm6, 2
vinsertps xmm7, xmm5, DWORD PTR [ecx+edi*4], 32
vpextrd edi, xmm6, 3
vinsertps xmm1, xmm7, DWORD PTR [ecx+edi*4], 48
vinsertf128 ymm2, ymm0, xmm1, 1

Posted by sirrida

All these commands cost code space and many more uops than necessary; even a naive implementation of a gather command would need fewer.
Also, all these commands must be decoded, and they trash my precious registers (in your example edi and ymm0..7).
Would you create a whopping bunch of 32 or more commands for my example of gathering bytes in xmm registers?
For the assumed AVX extension for integer ops on ymm, things get worse...

Posted by bronxzv

> many more uops than necessary

it's pretty difficult to tell, since the ISA of the uops isn't disclosed; I don't see where it can be significantly simplified: you need to extract the integer indices from vector registers to GPRs, compute the base + index addresses, load the values and insert them into vector registers
the hot spot is clearly the loads anyway, limited by the cache hierarchy, not the ISA

to improve code density, and maybe also execution speed, an interesting middle ground (between the current situation and a fat indivisible vgather instruction that may trigger multiple cache misses) would be to introduce a new addressing mode where we can use a single element of a ymm register as offset

ASM code may look like:
vinsertps xmm1, xmm1, DWORD PTR [ecx+ymm0[DWORD 1]*4], 16
vinsertps xmm1, xmm1, DWORD PTR [ecx+ymm0[DWORD 2]*4], 32

>they trash my precious registers (in your example edi and ymm0..7)

yes, good point, though it's compiler generated code; the code is different if register pressure is higher, and the insertions may then look like vinsertps xmm0, xmm0, DWORD PTR [ecx+edi*4], 16, wasting fewer registers

Posted by c0d1f1ed
Quoting bronxzv:
you may be right for the vinsertps (6/8 loads), though I'd love a confirmation from someone in the know
for the low elements you can use 32-bit vmovss (2/8 loads), and it looks like an easy optimization to simply clear the 96 MSBs instead of "moving" 128 bits; here again I'll welcome an explanation of the actual workings beyond our guesswork

for extracting the 32-bit indices from the very same XMM
vmovd edi, xmm2
...
vpextrd edi, xmm2, 1
...
vpextrd edi, xmm2, 2
...
vpextrd edi, xmm2, 3
...
it looks rather odd to move 128 bits each time; at least nothing in the ISA asks for it, so it can be optimized in forthcoming chips if it's really as bad as you say

It's not guesswork. Every modern pipelined processor uses result forwarding to eliminate read-after-write hazards. Trust me, I'm in the know: I have a master's degree in computer science and engineering (and a minor in embedded systems). You can also read about the added latency for bypassing results between execution domains in Intel's Optimization Reference Manual.

Forwarding also affects extract instructions. Take the following code sequence:

paddd xmm0, xmm1
pextrd eax, xmm0, 3
sub eax, 123

You might think the sub could directly use the fourth element of the paddd result right after it finishes executing (eliminating the pextrd), but this would complicate the forwarding network in multiple places, adding gate delay. That's delay and complication right where you don't want it, so making extract instructions more efficient would compromise everything else. Instead the hardware just forwards all 128 bits as-is, and uses the next cycle to execute the pextrd, after which the result is forwarded to the sub instruction.

But while there's nothing that can be done in the above case, a gather operation really doesn't need any of this forwarding; it shouldn't even involve the ALU pipelines at all! It's also fine if a gather instruction takes extra latency (it would still be much faster than 18 instructions, and throughput is far more critical for SIMD code anyway). It could also use a weaker memory consistency model. These things should make it feasible to prevent it from affecting the performance of regular load operations.

Posted by sirrida

...and as we have seen in "Copy and modify", forwarding does not work as well as it should, at least on i7 and Atom...

Posted by c0d1f1ed
Quoting bronxzv: I can't see how it will be a real improvement, since the uop cache pressure will be the same as with a software-synthesized gather and execution will be on par with the current situation; only the x86 code density will be better

No, a misaligned load instruction is still one uop, even if it has to access two cache lines. So likewise a 128-bit gather instruction can be just a single uop even if it has to access four cache lines. There's no benefit at all in having multiple uops and scheduling them individually. Dependent instructions can't commence anyway until all data has been loaded. So whether it's an aligned load, a misaligned load, or a gather, the scheduler can treat it as one uop which has either finished or not. It's the load unit's responsibility to fetch each portion of the data. Even a vmovaps load instruction is a single uop, but it issues on both port 2 and 3.

This doesn't just free up lots of uop cache space (from 18 fused uops down to 1), but also avoids all of the power consumption related to scheduling and register renaming and such.

Posted by bronxzv

> It's not guesswork.

sure it is; I'd be interested in input from an insider, though

Posted by bronxzv

> So likewise a 128-bit gather instruction can be just a single uop

Sorry, but I was answering sirrida's comment "Gather commands could at least generate the flood of ops"; I'm sure sirrida meant an instruction decoded into multiple uops, like SSE on the PIII/P4 or vdivps/pd and vsqrtps/pd on SNB. That looks interesting, since it would match well with SMT, unlike a fat vgather

now if it's really possible to implement a single-uop generic vgather, and a thread can issue a vgather before the other thread's vgather(s) retire, potentially thousands of cycles later (i.e. if vgather is not serializing like vdivps, for example), rest assured I'll use it from day one; otherwise it will clearly be the #1 source of stalls and the #1 instruction to avoid, well, IMHO

Posted by sirrida

I meant "Gather commands could at least generate the flood of ops" as a means to easily get a first implementation. Later CPU generations will surely have a better implementation.

Posted by bronxzv

> I meant "Gather commands could at least generate the flood of ops" as a means to easily get a first
> implementation. Later CPU generations will surely have a better implementation.

yes, you were clear about it, and I suppose it would be quite simple to implement the multi-uop solution, though I'm not sure it will really provide concrete speedups (since the bottlenecks are elsewhere: the number of load ports, the cache and memory hierarchies); the incentive to use it will be low, as not many ISVs want one more code path without a significant speedup

Posted by sirrida

At least the 108 bytes of your solution would become about 5 bytes, and writing, reading and debugging will be much easier at assembly level.
OK, there's one more code path, but this is the price to pay.
The anticipated speedup might come later...

Posted by c0d1f1ed
Quoting bronxzv: to improve code density, and maybe also execution speed, an interesting middle ground (between the current situation and a fat indivisible vgather instruction that may trigger multiple cache misses) would be to introduce a new addressing mode where we can use a single element of a ymm register as offset

I'm sorry, but it's pretty pointless and even wasteful to add an instruction which will be superseded by gather/scatter at some point in the near or far future. It's also not obvious how to encode your instruction in the first place. And you're sending a large vector to the load unit for each of these instructions, while only using a minor portion (the same issue as the forwarding that takes place with the extract instructions). And last but not least, each of your insert instructions is still dependent and needlessly carries lots of data around.

It could shave off a few cycles, but it remains flawed, so in my opinion it would be a lot more worthwhile to invest the transistors in an actual gather implementation.

Posted by bronxzv

> And you're sending a large vector to the load unit for each of these instructions, while only using a minor portion

here again I don't see why it would not be possible to move only 32 bits between the PRF and the AGU using a multiplexer; it looks far simpler than a fully functional vgather, and a concrete step that would fit nicely with SMT. One more time, it would be nice to have input from someone nearer to the action than we are

Posted by bronxzv

> writing, reading and debugging will be much easier at assembly level

so true, though in my experience readable ASM and fast code are orthogonal issues at best; most of the time faster code is less readable in the ASM dump, like when we get a significant speedup from a "#pragma unroll" directive, or when using two 128-bit vmovups is way faster than a single 256-bit vmovups with unaligned arrays

Posted by c0d1f1ed
Quoting bronxzv:
> It's not guesswork.

sure it is; I'd be interested in input from an insider, though

Did you read the documents I linked? Forwarding is the only way a fully pipelined architecture can execute dependent instructions back-to-back (which has been possible on every Intel chip since the 486). Reading and writing the register file simply takes too many cycles, so results are directly looped back in case the next instruction requires the new value instead of the old value it read a couple of cycles earlier. Here's a good explanation: Modern Microprocessors (figure 4).

Posted by bronxzv

hey Nicolas, I don't see anything at your link going against the fact that it's possible to move only 32 bits (or only 64 / 128 bits, *not always 256 bits*) to an input that doesn't need more than 32 bits

with a P6-like design (Nehalem) the move can be from a result buffer (in case of forwarding) or from the RRF when the result is no longer available in a result buffer

my understanding is that with Sandy Bridge it's simply always from the unified PRF; it can forward as soon as the result is available, before the RAT update is completed

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=5

"
Allocation, renaming, scheduling and retiring are all different for Sandy Bridge, and minimize the movement and replication of data within the processor
"

""
PRF-based renaming and scheduling is substantially more power efficient because it eliminates movement of 32, 64, 128 or 256-bit data values
"

Posted by jimdempseyatthecove

Bronxzv,

I do not have my Sandy Bridge system yet so I cannot try this out.

Sandy Bridge has HT. Run some performance test code that uses two threads, HT siblings. Have one thread be the master thread and the other a slave thread. Have both threads set up MONITOR to monitor a list of addresses used as a mailbox. The master thread, when it knows it will require a gather some time in the future (a fairly long time), writes the pointer to the gather list into the mailbox; the slave thread, MWAITing on the mailbox, gets the gather list, gathers the data and writes it as a 256-bit item into the mailbox gather results. With sufficient advance notice this could complete prior to the master thread needing the data. The master thread should be able to read the gathered data in one gulp if ready (or issue MWAIT till ready, or go get the data itself in the event the slave thread got preempted). The slave thread could serve as a data pipeline prefetcher and post-storer. This will "waste" one thread, one that you do not want using AVX anyway.
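
A minimal sketch of this mailbox scheme (illustrative only: MONITOR/MWAIT are privileged instructions, so a plain user-mode sketch has to spin on an atomic flag instead; all names are made up for the example):

#include <atomic>
#include <immintrin.h>

struct Mailbox {
    std::atomic<const int*> gather_list{nullptr}; // indices posted by master
    alignas(32) float result[8];                  // 256 bits of gathered data
    std::atomic<bool> ready{false};
};

// Slave (HT sibling): wait for a gather list, gather, publish the result.
void slave(Mailbox &mb, const float *base)
{
    const int *idx;
    while (!(idx = mb.gather_list.load(std::memory_order_acquire)))
        _mm_pause();                              // stand-in for MWAIT
    for (int i = 0; i < 8; ++i)
        mb.result[i] = base[idx[i]];              // the gather itself
    mb.ready.store(true, std::memory_order_release);
}
// The master posts the index pointer early, does other work, and once
// mb.ready is set reads mb.result in one gulp with _mm256_load_ps.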

Jim Dempsey

www.quickthreadprogramming.com
Posted by bronxzv

Jim,

It's an interesting idea, along the lines of software speculative precomputation. It will be particularly effective when "gathering" a big chunk of data (i.e. you pass it an array of indices and it fills an array of values) with a high cache miss rate, and if the master thread is able to do other useful work in the meantime.

> one thread, one that you do not want using AVX anyway.
In my use cases I typically work with a pool of threads (1 thread per logical processor), so each hardware thread needs AVX support; btw the speedups from hyperthreading are slightly better for AVX code

Posted by c0d1f1ed
Quoting bronxzv:

hey Nicolas, I don't see anything at your link going against the fact that it's possible to move only 32 bits (or only 64 / 128 bits, *not always 256 bits*) to an input that doesn't need more than 32 bits

with a P6-like design (Nehalem) the move can be from a result buffer (in case of forwarding) or from the RRF when the result is no longer available in a result buffer

my understanding is that with Sandy Bridge it's simply always from the unified PRF; it can forward as soon as the result is available, before the RAT update is completed

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=5

"Allocation, renaming, scheduling and retiring are all different for Sandy Bridge, and minimize the movement and replication of data within the processor"

"PRF-based renaming and scheduling is substantially more power efficient because it eliminates movement of 32, 64, 128 or 256-bit data values"

Although Sandy Bridge eliminates storing data in the ROB, the forwarding network and register file accesses remain performance critical; the use of a PRF doesn't change that. Extracting 32-bit data from a vector takes several gate delays, and you can't squeeze that in without affecting timings. So the only sensible solution for inserting/extracting parts of a register is to send the entire vector to the execution unit and take a full clock cycle to perform the operation as a separate instruction.

Gather/scatter on the other hand wouldn't affect those critical parts of arithmetic instruction execution at all. They are parallel load/store operations, so they only affect the load/store units. And I think they can be implemented without impacting regular load/store performance. First of all, a gather can take advantage of the fact that the index values can be extracted sequentially: the first element can be used directly, and for each subsequent index the unit has a full cycle to shift the vector to the right. This merely requires sending the index vector once, and doesn't require any control bits to select which element to extract. Furthermore, in the case of gather the elements can be inserted in arbitrary order. This can be handled by a separate piece of logic which collects the elements into a dedicated gather register the next cycle, sending it to the PRF when complete. This adds a cycle of latency to gather, but should leave regular load latency unaffected.
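
A behavioral sketch of this cheap implementation (purely illustrative; mapping two loads per cycle onto ports 2 and 3 is my assumption, not something Intel has described):

// Cheap gather modeled cycle by cycle: the two low indices are used
// directly, the index vector is shifted down for the next cycle, and the
// loaded elements collect in a dedicated register before a single PRF
// writeback. Note that idx is consumed destructively, as the shifting
// hardware would.
void gather_cheap(float collect[8], const float *base, int idx[8])
{
    for (int cycle = 0; cycle < 4; ++cycle) {
        collect[2 * cycle]     = base[idx[0]];    // load port 2
        collect[2 * cycle + 1] = base[idx[1]];    // load port 3
        for (int i = 0; i < 6; ++i)               // shift indices down
            idx[i] = idx[i + 2];
    }
    // one extra cycle: 'collect' is sent to the PRF as the result
}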

An advanced implementation could check which indices point to the same cache line, and gather up to four elements in parallel. Although it's more challenging to prevent this from affecting regular load/store performance, it seems well worth it to me to further improve performance/Watt of throughput computing (and beyond). Tons of algorithms contain loops which could be successfully parallelized with gather/scatter support. Of course it makes sense to only invest in this advanced implementation once software already makes use of gather/scatter. So I think the above cheap implementation with a maximum throughput of one gather every four cycles makes most sense for the near future.

Posted by c0d1f1ed
Quoting sirrida: At least the 108 bytes of your solution would become about 5 bytes, and writing, reading and debugging will be much easier at assembly level.
OK, there's one more code path, but this is the price to pay.
The anticipated speedup might come later...

A gather/scatter instruction which expands into ~18 (fused) uops could indeed save code bytes and reduce register pressure, but overall it doesn't seem worthwhile to me. Assembly cleanliness certainly isn't much of a convincing argument.

The real issue with it, though, is that a custom code sequence might be faster. Take for instance the example of approximating a function with piecewise polynomials. A gather/scatter based solution would perform parallel table lookups for each of the coefficients. But you could also look up all coefficients at once and transpose the elements (from AoS to SoA). Despite requiring many shuffle operations, the latter would be faster if the gather/scatter implementation were a straight expansion into extract/insert operations.
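
For reference, the gather-based variant might look like the sketch below, reusing the software Gather() helper bronxzv posted above (with hardware support each Gather() call would collapse to a single gather instruction; the coefficient layout and function names are made up for the example):

// Piecewise cubic evaluation: one table lookup per coefficient (4 gathers
// for a 3rd-order polynomial), then Horner's rule. c0..c3 are per-segment
// coefficient tables and 'seg' holds each lane's segment index.
INLINE __m256 EvalPiecewiseCubic(const float *c0, const float *c1,
                                 const float *c2, const float *c3,
                                 const __m256i &seg, const __m256 &x)
{
    __m256 r = Gather(c3, seg);                              // a
    r = _mm256_add_ps(_mm256_mul_ps(r, x), Gather(c2, seg)); // a*x + b
    r = _mm256_add_ps(_mm256_mul_ps(r, x), Gather(c1, seg)); // (a*x + b)*x + c
    return _mm256_add_ps(_mm256_mul_ps(r, x), Gather(c0, seg));
}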

So in my humble opinion we really need a gather/scatter implementation which offers a sizable advantage from the get-go. A 256-bit gather operation with a throughput of one every 4 cycles can't be beaten by custom code, offers many other advantages thanks to taking only a single uop, and seems quite feasible to me. I don't see any point in asking for anything less.

Posted by bronxzv
1> A gather/scatter based solution would perform parallel table lookups for each of the coefficients.

yes, for a 3rd order polynomial and FP32 coefficients it will be 4 distinct gathers of unrelated 32-bit values

2> could also look up all coefficients at once and transpose the elements

yes, this is my favorite solution; for a 3rd order polynomial you move only aligned 128-bit values from the LUT

IMHO (2) is faster in all cases, since a hardware implementation will typically miss the 128-bit-moves optimization when processing the series of gathers in (1). Also, with (1) each cache line will be visited 4x more often than with (2).

more generally, the access pattern to the cache/memory hierarchy has more impact on performance and power consumption than the particular implementation of gather; a software-synthesized gather can *optimize across multiple gathers*, an opportunity that a hardware implementation will typically lack

Posted by c0d1f1ed
Quoting jimdempseyatthecove: Sandy Bridge has HT. Run some performance test code that uses two threads, HT siblings. Have one thread be the master thread and the other a slave thread. Have both threads set up MONITOR to monitor a list of addresses used as a mailbox. The master thread, when it knows it will require a gather some time in the future (a fairly long time), writes the pointer to the gather list into the mailbox; the slave thread, MWAITing on the mailbox, gets the gather list, gathers the data and writes it as a 256-bit item into the mailbox gather results. With sufficient advance notice this could complete prior to the master thread needing the data. The master thread should be able to read the gathered data in one gulp if ready (or issue MWAIT till ready, or go get the data itself in the event the slave thread got preempted). The slave thread could serve as a data pipeline prefetcher and post-storer. This will "waste" one thread, one that you do not want using AVX anyway.

Not all Sandy Bridge processors have Hyper-Threading.

Anyway, assuming all future processors will feature HT, I still don't think your suggestion would work well in practice. First of all, passing data between the threads has considerable bandwidth and synchronization overhead. Also, very few algorithms can deal with such high latency, or they would require restructuring which counteracts any potential advantage. But most importantly, I don't think performing data gathering in another thread on the same core really helps increase the total instruction rate. You might as well do it in the same thread: Sandy Bridge has a substantial out-of-order instruction window, so just place the extract/insert instructions early enough and let the hardware take care of the rest. Even if a miss causes a stall, you still have the second thread to issue more work.

Also, in my experience you can gain about 30% performance from Hyper-Threading, even for SIMD-heavy code. I seriously doubt that for practical algorithms you can gain more from splitting the work unevenly.

What we really need is higher throughput, lower latency, fewer uops, lower register pressure, lower power consumption, etc. All of these things can be achieved with a minor amount of logic. It would enable revolutionary new software, which might then make it worthwhile to invest in an advanced implementation with even more advantages.

Posted by sirrida

> A gather/scatter ... but overall it doesn't seem worthwhile to me.
> Assembly cleanliness certainly isn't much of a convincing argument.

Would you think the same for a future gather command like "gather.byte zmm0,[eax]", reading 64 bytes, which would expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?
In my opinion even the effort of 18 commands to emulate 1 is far too much.
And as I must debug my hand-written code, I appreciate any cleanliness as long as it does not cost too much performance.

We process e.g. a lot of ISO A0-sized images at 600 dpi, true color.
Most operations are at byte or word level. I miss the byte/integer operations for AVX very much!
A gather command for bytes and/or words for SSE/AVX has been sorely missed for ages.
See also my comment about the suggested "rep xlat".

Posted by bronxzv

> Would you think the same for a future gather command like "gather.byte zmm0,[eax]", reading 64 bytes, which would expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?

I don't think it's really possible to "expand" it as you think, i.e. to emit more than 100 uops; it would take ages each time the x86 instruction is decoded, and the Sandy Bridge complex decoder, for example, can emit at most 4 uops

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4

So it will most probably run from microcode, like the x87 transcendental instructions, and the latency/throughput figures will be monstrous, beating FCOS, FPTAN and such. In case of many cache misses (there can be 32 misses with byte indices and an 8x scaling, the maximum scaling for LRBni VGATHERD) this single instruction will completely stall a core (L2 misses) or choke the whole CPU (LLC misses/TLB misses)

Posted by c0d1f1ed
Quoting sirrida:
> A gather/scatter ... but overall it doesn't seem worthwhile to me.
> Assembly cleanliness certainly isn't much of a convincing argument.

Would you think the same for a future gather command like "gather.byte zmm0,[eax]", reading 64 bytes, which would expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?
In my opinion even the effort of 18 commands to emulate 1 is far too much.
And as I must debug my hand-written code, I appreciate any cleanliness as long as it does not cost too much performance.

Yes, it can be annoying to debug hand-tuned code, but you have to weigh that against the benefits, and if necessary develop tools to assist you (ranging from a simple macro to a full-blown compiler). Try to look at this from the perspective of Intel and its customers: what's the value of adding such a microcoded instruction to the average person buying a computer? Also, where does this stop? Intel won't just add any instruction an assembly programmer might find convenient. So seriously, it absolutely has to add sizable benefit before it will be taken into consideration.

An actual gather instruction which is faster than the sequential code sequence is the only thing I can realistically imagine Intel to find valuable.

By the way, note the comment for Figure 12 in this article: A First Look at the Larrabee New Instructions (LRBni). It appears to imply that Intel thinks it's feasible to have a single non-microcoded gather instruction for collecting 16 dwords. So I'm fairly confident that getting Sandy Bridge's pair of load units to each manage 4 loads is well within reach.

Posted by c0d1f1ed
Quoting bronxzv:
> Would you think the same for a future gather command like "gather.byte zmm0,[eax]", reading 64 bytes, which would expand into at least 128 commands costing about 800 bytes of code if fully expanded (emulated)?

I don't think it's really possible to "expand" it as you think, i.e. to emit more than 100 uops; it would take ages each time the x86 instruction is decoded, and the Sandy Bridge complex decoder, for example, can emit at most 4 uops

http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4

So it will most probably run from microcode, like the x87 transcendental instructions, and the latency/throughput figures will be monstrous, beating FCOS, FPTAN and such. In case of many cache misses (there can be 64 misses...) this single instruction will completely stall a core (L2 misses) or choke the whole CPU (LLC misses/TLB misses)

There can't be that many misses for byte translate: the offsets can only access 4 cache lines. The bigger problem, again, is extracting each of the indices and inserting each of the loaded values. The only way to prevent that from taking extra ALU cycles and constantly moving the whole vector around is an actual gather instruction executed by the LSU.

In my opinion byte gather is less valuable than dword gather though, and it could impede an advanced implementation, since gathering up to 16 bytes from a single cache line is much harder than gathering 4 dwords. A mixed implementation might work, but personally I'd already be ecstatic if the first version just got us a dword gather. Byte (and word) gather have some value too, but can also be achieved with a few dword gathers. Especially with the advanced implementation it would still be blazing fast (since cache misses should be rare).

Posted by bronxzv

sorry, but it looks like you answered an older version of my post (see my remark about the 8x scaling, which I added after posting the 1st version)

> There can't be that many misses for byte translate: the offsets can only access 4 cache lines.

nope, with an 8x scaling (the max for VGATHERD) the address range will be 256*8 = 2048 bytes, or 32 cache lines (assuming 64B lines)

Posted by bronxzv

> note the comment for Figure 12 in this article:

"Figure 12: vgatherd v1 {k1}, [rbx + v2*4]. This is a simplified representation of what is currently a hardware-assisted multi-instruction sequence, but will become a single instruction in the future."

good catch, it's another piece of evidence, besides Abrash's comment at the end of the paper (the one we discussed the other day *1), that Knights Ferry doesn't have full hardware support for VGATHERD; the way it's phrased sounds as if the compiler expands an intrinsic into a series of instructions, much like an inlined function

hopefully the 1st MIC product will have true hardware support for the instruction; it will be fun to test
btw, is there any vector computer ever produced with a generic gather, LRBni style? AFAIK the classical vector supercomputers (see also the Tarantula EV8 proposal) provide support for non-unit-stride gather but not a vector indirect addressing mode

*1: "Finally, note that in the initial version of the hardware, a few aspects of the Larrabee architecture -- in particular vcompress, vexpand, vgather, vscatter, and transcendentals and other higher math functions -- are implemented as pseudo-instructions, using hardware-assisted instruction sequences, although this will change in the future."

Posted by sirrida

Most of us cry for a gather operation. I definitely need it for bytes and words.

> So I'm fairly confident that getting Sandy Bridge's pair of load units
> to each manage 4 loads is well within reach.

This would also mean that my sketched worst case for ZMM (64 bytes) would cost 16 cycles (16 x 4 byte lookups) plus memory latency.
I think 16 cycles (+ memory latency) for such a mighty command is not too much, and it will outperform the replacement coding for sure; several other commands have a higher latency. For XMM it would cost only 4 cycles plus memory latency, which is quite fast even in absolute terms.
For byte lookups never more than a block of 256 adjacent bytes will be read, since all the indexes/offsets are in the range 0..255.
For words there will be only 8 x 4 word lookups and 32 cache misses for ZMM in the index range 0..65535 (offset range 0..128 Ki). This will probably be the real worst case.

I know how to use macros and compilers, but hand-tuned code is often much faster, because I can manually schedule the commands, at the cost of maintainability.
