Sandy Bridge: SSE performance and AVX gather/scatter


Quote:

bronxzv wrote:

FYI, I tested at 1600 x 1200 with the 4770K iGPU. Btw, it looks like there is a big serial portion of code in your demo at each frame (hence the low overall CPU usage, less than 85% on my machine; in other words you are leaving more than 15% performance potential on the table at the moment). It's probably when you copy the rendered frames to the front buffer. I'd advise doing that in parallel with the rendering of the next frame (triple buffering), so that copying the rendered frames is concurrent with rendering. Typically one thread is enough to saturate the PCIe bandwidth, so a very sensible solution is an 8-thread pool for rendering plus a single extra thread for copying the final frames.

Indeed there is still a serial part that could be parallelized: the geometry processing and the setup for parallel rendering of screen blocks. The renderer doesn't do any frame copying; that would take too much bandwidth anyhow.
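For reference, a minimal sketch of the triple-buffering idea from the quote above: one extra thread presents the last finished frame while the render threads work on the next one. The names and structure are illustrative assumptions, not taken from either renderer.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <vector>

    struct FramePipeline {
        std::vector<uint32_t> buffers[3];   // triple-buffered frames
        int  ready = -1;                    // index of the last fully rendered frame
        bool quit  = false;
        std::mutex m;
        std::condition_variable cv;

        // called by the render threads once a frame is complete
        void publish(int idx) {
            { std::lock_guard<std::mutex> lk(m); ready = idx; }
            cv.notify_one();
        }

        // runs on the single copy/present thread
        void presentLoop() {
            int last = -1;
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return quit || ready != last; });
                if (quit) return;
                last = ready;
                lk.unlock();
                // copyToFrontBuffer(buffers[last]);  // hypothetical present call
            }
        }
    };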

Quote:

bronxzv wrote:

ah! so you are not comparing the gain from gather in isolation. Btw, the speedup looks quite low from SSE2 to AVX2; I'm getting around a 45% speedup for my own texture mapping code (using 256-bit unpack and shuffle + FMA, but neither gather nor generic permute at the moment)

The speedup is so low because the SSE2 perspective-correct bilinear interpolation is already so efficient that it completely overlaps with the memory reads. AVX2 does not make the math faster, as the math is not the bottleneck. The bottleneck is texel fetching, and that is where gather brings a modest improvement.
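To illustrate where gather enters the picture, here is a minimal sketch of the texel-fetch part of a bilinear filter with AVX2 (my own illustration, not jan v.'s code; the fixed-point layout and the omitted blend are assumptions):

    #include <immintrin.h>
    #include <stdint.h>

    // u, v: 16.16 fixed-point texture coordinates for 8 pixels,
    // tex: 32-bit texels stored row-major with a stride of 'pitch' texels.
    // Gathers the four bilinear neighbours; the weighted blend with the
    // fractional parts of u and v is omitted for brevity.
    static inline void gather_bilinear_neighbours(const uint32_t* tex, int pitch,
                                                  __m256i u, __m256i v,
                                                  __m256i* t00, __m256i* t01,
                                                  __m256i* t10, __m256i* t11)
    {
        const __m256i ui  = _mm256_srli_epi32(u, 16);            // integer parts
        const __m256i vi  = _mm256_srli_epi32(v, 16);
        const __m256i row = _mm256_mullo_epi32(vi, _mm256_set1_epi32(pitch));
        const __m256i idx00 = _mm256_add_epi32(row, ui);         // y*pitch + x
        const __m256i idx01 = _mm256_add_epi32(idx00, _mm256_set1_epi32(1));
        const __m256i idx10 = _mm256_add_epi32(idx00, _mm256_set1_epi32(pitch));
        const __m256i idx11 = _mm256_add_epi32(idx10, _mm256_set1_epi32(1));

        *t00 = _mm256_i32gather_epi32((const int*)tex, idx00, 4);  // 8 texel fetches each
        *t01 = _mm256_i32gather_epi32((const int*)tex, idx01, 4);
        *t10 = _mm256_i32gather_epi32((const int*)tex, idx10, 4);
        *t11 = _mm256_i32gather_epi32((const int*)tex, idx11, 4);
    }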

Quote:

jan v. wrote:
The speedup is so low because the SSE2 perspective-correct bilinear interpolation is already so efficient that it completely overlaps with the memory reads. AVX2 does not make the math faster, as the math is not the bottleneck. The bottleneck is texel fetching, and that is where gather brings a modest improvement.

your textures look so low-res and your mesh / BSP tree so simple that it's strange that memory bandwidth is an issue. Do you know the size of your working set at each frame? Also, if you are memory-bandwidth bound, or even LLC-bandwidth bound, there is no way gather can buy you a 10% speedup (based on my microbenchmark results above in this thread)

EDIT: I just tested with VTune and your demo does indeed appear to be maxing out memory bandwidth. For example, FQuake64 AVX2.exe uses a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bandwidth), with spikes at 20.8 GB/s; it never fell below 17.8 GB/s in a ~30 s test run

Quote:

bronxzv wrote:

your textures look so low-res and your mesh so simple that it's strange that memory bandwidth is an issue. Do you know the size of your working set at each frame? Also, if you are memory-bandwidth bound, or even LLC-bandwidth bound, there is no way gather can buy you a 10% speedup

The textures and geometry are from a game of decades ago. They are even 8-bit with a palette, but I turn them into 32-bit for rendering (except for the sky and mud, where I do byte gathering instead of int, plus an extra gather through the palette). The other textures are actually larger than you would think, as all texturing on the walls/floors is unique. They are blended from a material texture and a light texture into a third texture buffer that is used for rendering.

By memory reads I mean reads as seen from a software point of view, so any instruction with a memory operand. I'm reasonably sure all textures seen in a frame fit in the L3 cache. The L1D hit rate should be pretty high, so most texel fetches will hit the L1 cache. That must be about the best-case scenario for gather.
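As an aside, the paletted path described above could look roughly like this with AVX2 gathers (a sketch based on my reading of the description, not the demo's actual code; since gather only handles 32/64-bit elements, the 8-bit indices are fetched as dwords and masked):

    #include <immintrin.h>
    #include <stdint.h>

    // indices: 8-bit paletted texture, palette: 256 x 32-bit colors,
    // texelOffsets: byte offsets of 8 texels into 'indices'.
    static inline __m256i fetch_paletted(const uint8_t* indices,
                                         const uint32_t* palette,
                                         __m256i texelOffsets)
    {
        // first gather: load the dwords containing the index bytes
        // (note: may read up to 3 bytes past the last texel)
        const __m256i raw = _mm256_i32gather_epi32((const int*)indices, texelOffsets, 1);
        const __m256i idx = _mm256_and_si256(raw, _mm256_set1_epi32(0xFF));
        // second gather: translate the indices through the palette
        return _mm256_i32gather_epi32((const int*)palette, idx, 4);
    }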

Quote:

jan v. wrote:
The textures and geometry are from a game of decades ago. They are even 8-bit with a palette, but I turn them into 32-bit for rendering (except for the sky and mud, where I do byte gathering instead of int, plus an extra gather through the palette). The other textures are actually larger than you would think, as all texturing on the walls/floors is unique. They are blended from a material texture and a light texture into a third texture buffer that is used for rendering.

ah, that makes sense, so it's indeed well bigger than I expected. I was thinking that you were doing two texture passes per sample (low-res tiled albedo + light map). EDIT: I see that it's clearly explained on your site that you do a single pass, though it's not mentioned how often you update the LRU cache with the composited texture

please note that I edited my previous message with bandwidth measurements of your demo

Quote:

bronxzv wrote:

EDIT: I just tested with VTune and your demo does indeed appear to be maxing out memory bandwidth. For example, FQuake64 AVX2.exe uses a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bandwidth), with spikes at 20.8 GB/s; it never fell below 17.8 GB/s in a ~30 s test run

Is that still with the IGP displaying? If so, you are mainly measuring the Intel driver copying buffers.

As I'm getting 324 fps (max 380 fps looking at a wall) at 2560x1440, that is 5.2 GB/s just for writing the rendered images. If these can be sent to the discrete GPU with one read, there should be no more than 10.4 GB/s of memory bandwidth usage (as all textures fit in the L3 cache imho).

Remark that 400 fps seems to be the limit allowed by PCIe3 bandwidth. Even with point sampling (hit the space bar) and close to a wall it doesn't get any faster.

Quote:

jan v. wrote:

Quote:

bronxzv wrote:

EDIT: I just tested with VTune and your demo does indeed appear to be maxing out memory bandwidth. For example, FQuake64 AVX2.exe uses a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bandwidth), with spikes at 20.8 GB/s; it never fell below 17.8 GB/s in a ~30 s test run

Is that still with the IGP displaying? If so, you are mainly measuring the Intel driver copying buffers.

the IGP, sure, as a true fan of pure software rendering I have no discrete gfx card on this system... frame buffer size = 1600 x 1200 x 4 B = 7.68 MB; at 195 fps that is just ~1.5 GB/s of write bandwidth, or 3 GB/s of aggregate copy bandwidth; the display is at 60 Hz, so ~0.5 GB/s of extra read bandwidth for the screen refresh

Quote:

jan v. wrote:
Even with point sampling (hit the space bar) and close to a wall it doesn't get any faster.

thanks for the hint, I tested it again with this keyboard shortcut. FQuake64 AVX2.exe is at 194-195 fps with bilinear interpolation and 200-201 fps with point sampling, so indeed it looks like the bottlenecks are somewhere else. An idea to better show the advantage of AVX2 vs SSE2 for your texture mapping code would be to have more than one texture per final sample, for example using transparent surfaces, bump maps and reflection maps

Quote:

bronxzv wrote:

so indeed it looks like the bottlenecks are somewhere else. An idea to better show the advantage of AVX2 vs SSE2 for your texture mapping code would be to have more than one texture per final sample, for example using transparent surfaces, bump maps and reflection maps

Actually I already had something different in mind: bicubic interpolation :-)

Quote:

jan v. wrote:
Is that still with the IGP displaying?

out of curiosity I measured the time it takes to display a frame (copying one of my own back buffers to the IGPU front buffer) without the rendering fighting for bandwidth; I get these timings at 1600 x 1200 32-bit (alpha ignored):

Direct 2D : 2.45 ms per frame
GDI : 10.85 ms per frame

 

 

Quote:

jan v. wrote:

Quote:

bronxzv wrote:

so indeed it looks like the bottlenecks are somewhere else. An idea to better show the advantage of AVX2 vs SSE2 for your texture mapping code would be to have more than one texture per final sample, for example using transparent surfaces, bump maps and reflection maps

Actually I already had something different in mind: bicubic interpolation :-)

sure, bicubic interpolation is also a nice idea, since GPUs don't have direct hardware support for it (AFAIK), so GPU people must resort to fragment shaders to get it working
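For reference, the per-axis weights of a software bicubic filter boil down to a handful of polynomial evaluations. A sketch of the Catmull-Rom variant with AVX + FMA (an assumed example; neither poster has shared their actual formulation):

    #include <immintrin.h>

    // Catmull-Rom cubic weights for 8 fractional coordinates t in [0,1),
    // evaluated with FMA; w0+w1+w2+w3 == 1 in every lane.
    static inline void cubic_weights(__m256 t, __m256* w0, __m256* w1,
                                     __m256* w2, __m256* w3)
    {
        const __m256 half = _mm256_set1_ps(0.5f);
        const __m256 t2   = _mm256_mul_ps(t, t);
        const __m256 t3   = _mm256_mul_ps(t2, t);
        const __m256 c2   = _mm256_set1_ps(2.0f);
        const __m256 c3   = _mm256_set1_ps(3.0f);
        const __m256 c4   = _mm256_set1_ps(4.0f);
        const __m256 c5   = _mm256_set1_ps(5.0f);
        // w0 = 0.5 * (-t^3 + 2t^2 - t)
        *w0 = _mm256_mul_ps(half, _mm256_sub_ps(_mm256_fmsub_ps(c2, t2, t3), t));
        // w1 = 0.5 * (3t^3 - 5t^2 + 2)
        *w1 = _mm256_mul_ps(half, _mm256_fmadd_ps(c3, t3, _mm256_fnmadd_ps(c5, t2, c2)));
        // w2 = 0.5 * (-3t^3 + 4t^2 + t)
        *w2 = _mm256_mul_ps(half, _mm256_fnmadd_ps(c3, t3, _mm256_fmadd_ps(c4, t2, t)));
        // w3 = 0.5 * (t^3 - t^2)
        *w3 = _mm256_mul_ps(half, _mm256_sub_ps(t3, t2));
    }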

Quote:

c0d1f1ed wrote:
bronxzv, was the performance regression you observed due to not clearing the destination register? Apparently that is required to break a dependency chain.

it's possible, the culprit code looks like this (used in a lot of different places):

        vpcmpeqd   ymm0, ymm0, ymm0                              ;118.3
        vmovdqu    xmm4, XMMWORD PTR [rdx]                       ;118.3
        vmovdqu    xmm1, XMMWORD PTR [16+rdx]                    ;118.36
        vmovdqa    ymm5, ymm0                                    ;118.3
        vpgatherdq ymm3, YMMWORD PTR [rcx+xmm4], ymm5            ;118.3
        vpgatherdq ymm2, YMMWORD PTR [rcx+xmm1], ymm0            ;118.36
 

the ymm3 and ymm2 destinations aren't cleared before the gather instructions; this is generated by the Intel compiler from the intrinsic _mm256_i32gather_epi64

when I try to explicitly clear the destination variables with _mm256_setzero_si256(), the compiler ignores it, so this is more challenging to test than I would like

EDIT: I managed to test it by resorting to inline ASM for clearing the destination (vpxor used before vpgatherdq) and I get no speedup, in fact a slowdown unfortunately:

test                      duration                   speedup

AVX baseline              13523 - 13665 kclocks
AVX2 gather               14289 - 14358 kclocks      0.946 x
AVX2 gather w/ clr dst    15052 - 15167 kclocks      0.898 x
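For what it's worth, one way to express a cleared destination at the intrinsics level (rather than via inline ASM) is the masked form of the gather. A minimal sketch, under the assumption that the indices are element indices (the asm above uses byte offsets, i.e. scale 1):

    #include <immintrin.h>

    // Gather four 64-bit elements with an explicitly zeroed source/destination,
    // using the masked variant of the gather intrinsic.
    static inline __m256i gather4_epi64_zero_dst(const long long* base, __m128i idx)
    {
        const __m256i zero = _mm256_setzero_si256();       // cleared destination
        const __m256i all  = _mm256_set1_epi64x(-1);       // gather all 4 lanes
        return _mm256_mask_i32gather_epi64(zero, base, idx, all, 8);
    }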

 

 

 

I've adapted my SSE2 versus AVX2 texture mapping demo. Now it can run full screen when hitting the Enter key. With only the HD 4600 displaying it's now much faster compared to windowed; apparently the extra bit blitting was a bit too much overhead.

Edit: full screen works on Win7; on Win8 it seems it doesn't...

>>>bicubic interpolation is also a nice idea since GPUs don't have direct hardware support for it>>>

Bicubic interpolation uses instructions that are easily representable (at the machine-code level) and implemented inside the shader processors.

>>>bicubic interpolation is also a nice idea since GPUs don't have direct hardware support for it>>>

Or do you mean hardware accelerated, like a transcendental functions unit?

Hi bronxzv,

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition): how is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

Thanks in advance

Quote:

iliyapolak wrote:

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition): how is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

He already mentioned that:

"I measured the time to copy a 32-bit 1600x1200 frame with Direct 2D and the iGPU (CPU at 3.9 GHz fixed) and it takes 3.28 ± 0.02"

Quote:

iliyapolak wrote:

>>>bicubic interpolation is also a nice idea since GPUs don't have direct hardware support for it>>>

Bicubic interpolation uses instructions that are easily representable (at the machine-code level) and implemented inside the shader processors.

by "no direct hardware support " I was meaning that the Texturing Units can't be used directly for it, so you must resort to a software based solution on GPUs too (so it levels the playing field when comparing GPU and CPU rendering speed), there was a paper about that in one of the GPU Gems or Shader X book, sorry but I don't remember exactly the title of the paper

Quote:

iliyapolak wrote:

Hi bronxzv,

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition): how is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

Thanks in advance

I was using DirectX in the past (actually the IDirect3D9 interface), with even my own memory copy routines for some specific targets; for compatibility with all platforms there is also a GDI-based fallback

now I have replaced the custom D3D version with one based on Direct 2D (requires >= Windows 7, or Vista with a patch); the code is much simpler now and the performance was the same (D3D vs D2D) when I tested it on a low-end discrete GPU

here are the timings (D2D and GDI) that I get on Haswell with the IGPU: http://software.intel.com/en-us/comment/reply/285867/1740857
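For readers unfamiliar with that presentation path, here is a minimal sketch of copying a CPU-rendered 32-bit frame to a window with Direct 2D (my own illustration of the general shape, not bronxzv's actual code; error handling omitted):

    #include <windows.h>
    #include <d2d1.h>
    #pragma comment(lib, "d2d1")

    static ID2D1Factory*          g_factory = nullptr;
    static ID2D1HwndRenderTarget* g_rt      = nullptr;
    static ID2D1Bitmap*           g_bitmap  = nullptr;

    void InitD2D(HWND hwnd, UINT width, UINT height)
    {
        D2D1CreateFactory(D2D1_FACTORY_TYPE_SINGLE_THREADED, &g_factory);
        g_factory->CreateHwndRenderTarget(
            D2D1::RenderTargetProperties(),
            D2D1::HwndRenderTargetProperties(hwnd, D2D1::SizeU(width, height)),
            &g_rt);
        g_rt->CreateBitmap(
            D2D1::SizeU(width, height),
            D2D1::BitmapProperties(
                D2D1::PixelFormat(DXGI_FORMAT_B8G8R8A8_UNORM, D2D1_ALPHA_MODE_IGNORE)),
            &g_bitmap);
    }

    // backBuffer: width x height 32-bit pixels produced by the software renderer
    void PresentFrame(const void* backBuffer, UINT width, UINT height)
    {
        g_bitmap->CopyFromMemory(nullptr, backBuffer, width * 4);  // upload the frame
        g_rt->BeginDraw();
        g_rt->DrawBitmap(g_bitmap);   // stretched copy to the window
        g_rt->EndDraw();
    }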

 

 

 

Quote:

jan v. wrote:

Quote:

iliyapolak wrote:

sorry for asking this twice (a large part of our fx-related conversation was lost during the forum transition): how is writing to the frame buffer managed by software rendering? Do you use DirectX to do that?

He already mentioned that:

"I measured the time to copy a 32-bit 1600x1200 frame with Direct 2D and the iGPU (CPU at 3.9 GHz fixed) and it takes 3.28 ± 0.02"

I must admit that I did not read the whole thread :)

>>>by "no direct hardware support" I meant that the Texturing Units can't be used directly for it, so you must resort to a software-based solution on GPUs too>>>

I see what you mean.

>>>here are the timings (D2D and GDI) that I get on Haswell with the IGPU: http://software.intel.com/en-us/comment/reply/285867/1740857>>>

I think that D2D is faster because more of its operations better utilize hardware acceleration.

Quote:

bronxzv wrote:

EDIT: I just tested with VTune and your demo does indeed appear to be maxing out memory bandwidth. For example, FQuake64 AVX2.exe uses a very steady (when looking around the room) 19 GB/s aggregate bandwidth (10.2 GB/s average read bandwidth), with spikes at 20.8 GB/s; it never fell below 17.8 GB/s in a ~30 s test run

Did you check the new version I put up, which can run in full screen (Enter key)? Did you see any frame rate improvement? (it only works on Win 7 and Vista)

Quote:

jan v. wrote:
Did you check the new version I put up, which can run in full screen (Enter key)? Did you see any frame rate improvement? (it only works on Win 7 and Vista)

Hi jan, no I didn't, and I have Windows 8 on the Haswell system, so if I understand correctly full-screen mode will not work on it; once you have a Windows 8 version available I'll be glad to test it

Quote:

bronxzv wrote:

Serially inserting and extracting elements was still somewhat acceptable for SSE, but with 256-bit AVX it becomes a serious bottleneck.

For "slightly divergent locations", i.e. most elements in the same 64B cache line, AFAIK with SSE the best solution was indirect jumps to a series of static shuffles (controls as immediates) in order to maximize 128-bit loads/stores. Now with AVX we can use dynamic shuffles (controls in YMM registers) using VPERMILPS. Based on IACA the new AVX solution is more than 2x faster than legacy SSE; I suppose it will be even more than 2x faster on real hardware, since the main issue with the indirect-jump solution is the high branch miss rate.

Could you point me at an example of loading 4 double floats into a YMM register from non-contiguous locations, using AVX intrinsics? This is exactly the problem I have (on Sandy Bridge), and the only way I can see to do it is to "marshal" the data into contiguous locations, which is obviously a non-starter from a performance standpoint.

Quote:

JEROME B. (Intel) wrote:

Could you point me at an example of loading 4 double floats into a YMM register from non-contiguous locations, using AVX intrinsics? This is exactly the problem I have (on Sandy Bridge), and the only way I can see to do it is to "marshal" the data into contiguous locations, which is obviously a non-starter from a performance standpoint.

It's generally slower to arrange the data in memory just before filling a YMM register (among other reasons because store-to-load forwarding is blocked with 4 x 64-bit stores followed by a 256-bit load)

I suppose the Intel compiler generates the best code sequence if you use _mm256_set_pd
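For illustration, two common ways to express this with AVX intrinsics on Sandy Bridge, i.e. without gather (a hedged sketch; the pointer names are illustrative only):

    #include <immintrin.h>

    // Let the compiler pick the load/insert sequence.
    static inline __m256d load4_set(const double* p0, const double* p1,
                                    const double* p2, const double* p3)
    {
        return _mm256_set_pd(*p3, *p2, *p1, *p0);   // element 0 = *p0
    }

    // Spell out the sequence by hand: two 128-bit halves, then one insert.
    static inline __m256d load4_manual(const double* p0, const double* p1,
                                       const double* p2, const double* p3)
    {
        const __m128d lo = _mm_loadh_pd(_mm_load_sd(p0), p1);   // {*p0, *p1}
        const __m128d hi = _mm_loadh_pd(_mm_load_sd(p2), p3);   // {*p2, *p3}
        return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    }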
