>>First of all, passing data between threads has a considerable bandwidth and synchronization overhead. You do not pass data element by element from thread to thread. Depending on the design of the HT core, HT siblings share the L1 and/or L2 cache. One of the threads (the slave) takes the hit for memory latency by performing the read gathers and the write to RAM (which additionally writes to L1 and L2, assuming a non-temporal store is not used). This would be (should be) beneficial when the pre-gathered data to be processed tends not to reside in L1/L2 cache AND when the code combines this with SSE/AVX instructions operating on other data that is currently in L1/L2. The code path length, independent of the gather, would have to take more clocks than the gather (4, 8 or more clocks per 4/8-way gather). While this adds 2 cycles to the throughput (per cache line written by the slave and read by the master), should the code path after the gather consume more than 6/10 cycles you should see a net gain. However, I would guess that you need more than ~30% additional clock cycles (8/11). An additional concern is the overall effect of preemption by the O/S on one of the pair of threads (as this will defeat the goal of priming the cache).
Note that using two (or more) HT siblings in this manner effectively produces a pipelined stream, IOW code structured as a parallel pipeline.
Jim, I am just speculating, but I am afraid that if the master thread is doing some useful work, then prefetching by the slave thread at the same time would evict useful data from L1 and L2 and thus actually decrease performance.
Gather/scatter is intended to improve throughput when the data does reside in L1/L2. If it's higher up, then I expect the CPU will pile up several misses, and the second thread (executing similar code) should simply take over and cover (most of) the latency.
Having lots of misses means there are plenty of cycles to do the extract/insert sequentially, so gather/scatter wouldn't help in that case. But since the L1 cache typically has a hit rate of around 95-99%, hardware gather/scatter can often speed things up considerably. Besides, if you do expect lots of misses you should use software prefetching to increase the likelihood that the data will be close when you perform the actual gather (or scatter) operation. Note that LRBni includes gather/scatter prefetch instructions. The same could be added to AVX, so you don't need 18 instructions for the prefetch and 18 instructions for the actual gather. There should even be plenty of elements located within the same cache line, so an advanced gather implementation can collect them all at once and throughput is increased further.
Here's a document which shows how badly CPUs need gather/scatter to maximize SIMD performance: Debunking the 100X GPU vs. CPU myth.
With AVX, and soon FMA, sequentially loading elements which reside in L1/L2 cache is a massive bottleneck. Storing the data in AoS format requires lots of shuffle instructions to allow efficient SoA-style processing. In other words, there's a growing discrepancy between arithmetic processing power and the ability to get data into the pipelines, even if it's stored in only mildly irregular locations. So I don't think doing sequential extract/insert in a second thread solves anything. We need dedicated gather/scatter support.
>if the master thread is doing some useful work, then prefetching by the slave thread at the same time would evict useful data from L1 and L2 and thus actually decrease performance.
Good point, though there is probably some range of buffer sizes where it will make sense (1 KB - 1 MB once packed); after all, people do get speedups with software speculative precomputation in some cases.
Even if the packed data does not fit entirely in L1D and you are "gathering" very sparse data (it doesn't apply to the use cases discussed with c0d1f1ed, though), you may increase the spatial coherence by 10x or more, thus decreasing by 10x or more (*1) the bandwidth required to move it from L2 to L1D the 2nd time. You'll also evict 10x fewer L1D lines the 2nd time, and the latency will be more than 10x better since the hardware streamer will kick in (loads with regular strides the 2nd time, random addresses the 1st time).
*1: if there is a single useful float per cache line in the source data, it will be 64/sizeof(float) = 16x less bandwidth-hungry the 2nd time; this is a common case in linear algebra with big matrices, for example.
>>Gather/scatter is intended to improve throughput when the data does reside in L1/L2. If it's higher up then I expect the CPU will pile up several misses and then the second thread (executing similar code) should simply take over and cover (most of) the latency
Yes, but since we are talking about HT sibling threads, under many circumstances the second thread would cause unnecessary eviction from the smaller L1/L2 (core-shared) caches. As to whether cooperative HT programming helps or hinders, this depends on what is being done. IOW, use the technique when appropriate.
I think gather
We just posted information on the gather instructions we plan to support in Intel microarchitecture Haswell. See http://software.intel.com/file/36945. Blog coming soon.
Heh... finally some serious flexibility -- after gather, I especially like shifts with independent counts per element. How long will we have to wait for Haswell, and how many kidneys will we need to sell for the platform upgrade this time? :D
Thanks for a very exciting read; the first word that comes to mind is "WOW!"
In both cases I think they do not match well with SMT; 4-way SMT is a more likely future IMHO.
Reading the above, I would think you're not aware of the future AVX or current MIC gather instructions. How does this square with http://software.intel.com/en-us/forums/showthread.php?t=83399&o=a&s=lr
Even if the initial version is only micro-coded, without a performance advantage over current AVX code, it opens the door for you to show the extent of your intended use and to provide examples for optimization of future hardware.
Likewise, you might provide more detail on how you expect to show an advantage for 4-way SMT. I understand that it has shown a measurable advantage over 2 or 3 threads per core for linpack benchmarks, but not in the context of gather/scatter.
>>We just posted information on the gather instructions we plan to support in Intel microarchitecture Haswell. See http://software.intel.com/file/36945. Blog coming soon.
I'm very happy to see that most of the integer instructions have been promoted to YMM. Unfortunately, the gather instruction only works on elements of 32 bits or greater; as described in my reply #53, we really need it for bytes and words as well. I understand that there are some obstacles to implementing that if page faults occur during execution of such an instruction.
>Reading the above, I would think you're not aware of the future AVX
Hey! Consider the timing: I'm neither an insider nor under a CNDA with Intel, so I wasn't aware of them on May 23, 2011 (the date I posted this), though I'm well aware of them now, as you can see here: http://software.intel.com/en-us/forums/showpost.php?p=152492
>or current MIC gather instruction
I was talking about *AVX*; since the whole thread subject is "Converging AVX and LRBni", they are two distinct x86 ISAs in the context of this thread. Anyway, be assured I have been well aware of the VGATHERD instruction in LRBni since the day Abrash's first paper was published (more than 2 years ago). I have zero fresh news about the MIC products, though.
>Likewise, you might provide more detail on how you expect to show an advantage for 4-way SMT.
Well, exactly the same way I enjoy nice speedups (up to 1.3x on Sandy Bridge) with 2 threads: simply allocate a pool of threads with 1 thread per hardware thread context, minimize synchronization as much as possible, etc. Since the canceled EV8, POWER7 and Larrabee 1 all feature 4 hardware threads, there are obviously good reasons (i.e. simulations have validated the idea) to have more than 2 hardware threads, particularly when considering the ever-widening RAM bandwidth/latency gap.
>but not in the context of gather/scatter.
Exactly, thus my comment "they are not matching well with SMT". I said that because I suspect that at least the first implementation will not be able to track more than one gather instruction at a time. If it has high latency (huge in case of cache miss(es)) and is serializing like VSQRTPS, it can choke code that uses VGATHERDPS intensively in critical loops, due to threads fighting for the gather resource. Now, I'd love to be surprised and find that the executions of several gathers are allowed to overlap.
I've done some research and I can probably answer my own question now...
Each execution port has its own scheduler (reservation station), so when an AVX-1024 uop is issued on a 256-bit execution unit, the corresponding scheduler could be clock gated for 3 out of 4 cycles. While the scheduler is asleep a small and power efficient piece of sequencing logic can keep feeding in the 256-bit registers. NVIDIA uses similar register stringing (and Intel has a patent agreement with them) so I'm confident it's feasible to implement AVX-1024 as a single uop executed in 4 cycles.
Higher up in the pipeline, dispatch, register rename and decoding stages can also be clock gated when the buffers are full. Can anyone tell me to which degree these things have already been implemented?
With AVX2 already offering gather and FMA support, AVX-1024 would be the only thing remaining to turn the CPU into a power-efficient high-throughput device, without sacrificing sequential performance. AVX-1024 implemented this way would not only have the potential to drastically reduce power consumption during DLP-limited workloads, but in practice would also better cover long cache or RAM access latencies. Out-of-order execution enables higher ILP than Larrabee and GPUs offer, which reduces cache contention (2-way Hyper-Threading still seems like a valuable compromise though).
So does anyone know of any obstacles which would cause you not to expect AVX-1024 support for the Skylake timeframe?