How to extract DWORD from upper half of 256-bit register?

Igor Levicki:

Congratulations to Intel's CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?

Any ideas/tricks to work around this engineering "oversight"?

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as the best answer where it applies. Thank you.

Igor, there are many intrinsic functions for extraction in the immintrin.h header file (search for all the places where the word 'extract' is used). If the instruction you expected to see is missing, why not apply a workaround and use what is available now?

I understand that you need to extract signed or unsigned 32-bit values from the __m256i union:

...
typedef union _MMINTRIN_TYPE(32) __m256i {
#if !defined(_MSC_VER)
    /*
     * To support GNU-compatible initialization with initializer lists,
     * make the first union member an int64 type.
     */
    __int64 m256i_gcc_compatibility[4];
#endif
    __int8           m256i_i8[32];
    __int16          m256i_i16[16];
    __int32          m256i_i32[8];
    __int64          m256i_i64[4];
    unsigned __int8  m256i_u8[32];
    unsigned __int16 m256i_u16[16];
    unsigned __int32 m256i_u32[8];
    unsigned __int64 m256i_u64[4];
} __m256i;
...

Is that correct?

Igor Levicki:

Sergey,

What I want is to extract an arbitrary DWORD from, say, the YMM0 register. For the XMM0 register, the instruction for extracting DWORD 3 is PEXTRD eax, XMM0, 3, while there is no such instruction to extract DWORD 7 from YMM0.

Yes, I could use intrinsics -- write __m256i val = _mm256_load_si256(mem) and then DWORD part = val.m256i_u32[7] -- but that does not translate to a single assembler instruction. You can understand my post as a complaint about the non-orthogonality of the AVX2 extensions.


Hi Igor,

>>...What I want is to extract arbitrary DWORD from say YMM0 register. For XMM0 register, the instruction for extracting
>>DWORD 3 is PEXTRD eax, XMM0, 3 while there is no such instruction to extract DWORD 7 from YMM0.
>>
>>Yes, I could use intrinsics, write __m256i val = _mm256_load_si256(mem) and then DWORD part = val.m256i_u32[7] but
>>that does not translate to a single assembler instruction. You can understand my post as a complaint about
>>non-orthogonality of AVX2 extensions.

Thanks for the clarification. I'll take a look at the Instruction Set Manual; I'm surprised that such an extraction is not available.

Igor Levicki:

You will notice that this is not the only missing instruction.

The whole AVX business reminds me of extending AX to EAX -- you get access to 32 bits (EAX) and 16 bits (AX), but there is no cheap access to the upper 16-bit half of the register except through shifts and masks. Same with AVX, just with 256 and 128 instead of 32 and 16.

Another place where they did not make the instruction set orthogonal is parallel bit shifts -- they do not exist for words and bytes, which in my opinion would be the most common use cases.

The final part of my complaint is that if they had already decided not to implement VPEXTRD eax, ymm0, 7, they could at least document the fastest 2- or 3-instruction alternative instead of having all of us guess and test.


What about these two intrinsic functions?

[ immintrin.h ( Intel version ) ]
...
extern __m128i __ICL_INTRINCC _mm256_extractf128_si256( __m256i, const int );
...
extern __m128i __ICL_INTRINCC _mm256_extracti128_si256( __m256i, const int );
...

I think they are almost what you need, but they still don't return a DWORD type.

Note: Microsoft's version of immintrin.h doesn't have a declaration for the 2nd function, _mm256_extracti128_si256.

Quote:

Igor Levicki wrote:
why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?

to be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior for both 128-bit lanes, with no cross-lane dependency), a 256-bit VPEXTRD would have to return 2 results in two destination GPRs, which isn't possible with VEX encoding

Quote:

Igor Levicki wrote:
Any ideas/tricks to work around this

extracts: depending on your use case, a single VPERMD will do the trick (with the proper indices in a register initialized outside of your critical loop); you'll have your result in the low doubleword of the destination YMM. If you really need the result in a GPR, the fastest sequence AFAIK is VEXTRACTI128 followed by VPEXTRD

inserts: for insertions from a GPR I suggest using a VPINSRD, VINSERTI128 sequence
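
For illustration, here is a rough intrinsics sketch of both sequences (the helper names are mine, not from any header; dword 7 of a YMM register is dword 3 of its upper 128-bit lane):

#include <immintrin.h>

/* VEXTRACTI128 + VPEXTRD: dword 7 of v into a GPR */
static inline unsigned int extract_dword7(__m256i v)
{
    __m128i hi = _mm256_extracti128_si256(v, 1);     /* upper 128-bit lane   */
    return (unsigned int)_mm_extract_epi32(hi, 3);   /* dword 3 of that lane */
}

/* VPERMD: move element idx (0..7, may be a runtime value) into the low dword
   of the destination YMM; idx_vec would be prepared outside the hot loop,
   e.g. idx_vec = _mm256_set1_epi32(idx) */
static inline __m256i move_to_low_dword(__m256i v, __m256i idx_vec)
{
    return _mm256_permutevar8x32_epi32(v, idx_vec);  /* VPERMD */
}

/* VPINSRD + VINSERTI128: insert a GPR value as dword 7 of v */
static inline __m256i insert_dword7(__m256i v, unsigned int x)
{
    __m128i hi = _mm256_extracti128_si256(v, 1);
    hi = _mm_insert_epi32(hi, (int)x, 3);            /* VPINSRD     */
    return _mm256_inserti128_si256(v, hi, 1);        /* VINSERTI128 */
}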

Igor Levicki:

Quote:

bronxzv wrote:
to be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior for both 128-bit lanes, with no cross-lane dependency), a 256-bit VPEXTRD would have to return 2 results in two destination GPRs, which isn't possible with VEX encoding

But I disagree!

While for other instructions doing the same thing in the lower and upper lanes is essential, INSERT/EXTRACT instructions are a different thing altogether -- they should not have been promoted in the same way. Their purpose is scalar access to vector elements, not parallel processing, so they should simply have been extended to allow access to all elements.

Quote:

bronxzv wrote:
extracts: depending on your use case, a single VPERMD will do the trick (with the proper indices in a register initialized outside of your critical loop); you'll have your result in the low doubleword of the destination YMM. If you really need the result in a GPR, the fastest sequence AFAIK is VEXTRACTI128 followed by VPEXTRD

inserts: for insertions from a GPR I suggest using a VPINSRD, VINSERTI128 sequence

Yes, I figured that out, but it would still be better if the set had been made orthogonal to begin with. I see no good reason not to expand PEXTRD/PINSRD to allow indices from 4 to 7.


Quote:

Igor Levicki wrote:
But I disagree!

the choice was probably made to simplify the hardware design rather than for the programmer's convenience; one can also argue that pack/unpack isn't convenient the way it was expanded to 256 bits, or that 128-bit shifts aren't promoted to 256-bit shifts, which isn't "orthogonal" either

all in all, I'd say that VPERMD is more convenient than the legacy extracts, since the element index can be set dynamically (in a YMM index register) instead of statically (as an immediate value). It is incredibly useful for a lot of other use cases; I found a new use for it just yesterday, for example: a dynamically specified broadcast -- unlike the native broadcast, where the low element is replicated, you can specify the index of the element to be replicated
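
As a sketch of that dynamically specified broadcast (AVX2 intrinsics; the helper name is just for illustration):

#include <immintrin.h>

/* Broadcast element idx (0..7) of v to all eight dword lanes with VPERMD;
   unlike VPBROADCASTD, the source element does not have to be element 0
   and idx can be a runtime value. */
static inline __m256i broadcast_dword(__m256i v, int idx)
{
    __m256i idx_vec = _mm256_set1_epi32(idx);        /* every lane selects element idx */
    return _mm256_permutevar8x32_epi32(v, idx_vec);  /* VPERMD */
}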

Igor Levicki:

I wonder... did you manage to get the theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?

I am seeing ~33% so far; this may well be caused by the "simplified hardware design" you mention.


Quote:

Igor Levicki wrote:
I wonder... did you manage to get theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?

actually, the max theoretical speedup is 2x, i.e. 100% (even more with new instructions like VPERMD), but I have no test with only integer instructions, so I can't report any real-world values for integer-only code. The best speedup I measured with production code is 1.82x (82%) for mixed int and fp, comparing an SSE2 path with an AVX2 path (incl. FMA); note that this is for a single kernel with high L1D cache locality, not a full application

Quote:

Igor Levicki wrote:
I am seeing ~33% so far, this may well be caused by the "simplified hardware design" you mention.

my "simplified design" remark was for the two fully distinct execution stacks with duplicated 128-bit execution units, it has nothing to do with any throughput limitation, your deceptive speedup may be due to incomplete vectorization (hint: you mentioned scalar inserts/extracts as important for you so I suppose they are used in some of your hotspots) or L2$/LLC$/memory bandwidth limitation (or both)

if you want better optimization advice, I suggest posting code snippets of your hotspots

Igor Levicki:

When I said 50%, I actually meant 50% shorter execution time, which would translate into a 2x speedup. Sorry for the confusion.

Attached is the code with a simple test driver. My results are:

     test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
  test_AVX2 : 2190.420 ms

The difference is 1.80x here too, but it gets smaller (1.51x) if you change the pragma for the SSE4.1 function and let the compiler generate 128-bit SSE with the VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- the compiler uses vpbroadcastb, which is not in the SSE4.1 set. I didn't bother to check whether the speedup comes from the vpbroadcastb use or from VEX+3op, but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify an arch, the compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code, causing a severe performance penalty from state transitions.

The CPI for test_AVX2() is 0.345, against a theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.

Finally, I don't understand why the compiler is avoiding aligned memory access in AVX2 code when the memory is aligned -- it still uses vmovdqu. I think I will just go back to using pure assembler and living with the nightmare of maintaining two versions of ASM code for 32-bit and 64-bit rather than letting the compiler do whatever it wants with intrinsics.

Attachment: test.cpp (4.7 KB)

Quote:

Igor Levicki wrote:
When I said 50% I actually meant 50% shorter execution time which would translate into 2x speedup. Sorry for confusion.

so the 33% you mentioned stands for a 1.49x speedup as per this definition: http://en.wikipedia.org/wiki/Speedup -- this looks pretty good already

Quote:

Igor Levicki wrote:
Attached is the code with simple test driver. My results are:

     test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
  test_AVX2 : 2190.420 ms

Difference is 1.80x here too,

a 1.80x speedup looks very good to me; there is maybe not much room left for improvement, and probably nothing obvious

Quote:

Igor Levicki wrote:
but it gets smaller (1.51x) if you change the pragma for the SSE4.1 function and let the compiler generate 128-bit SSE with the VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- the compiler uses vpbroadcastb, which is not in the SSE4.1 set. I didn't bother to check whether the speedup comes from the vpbroadcastb use or from VEX+3op, but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify an arch, the compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code, causing a severe performance penalty from state transitions.

The CPI for test_AVX2() is 0.345, against a theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.

this CPI looks very good indeed, so I suppose your optimizations are already well done

Quote:

Igor Levicki wrote:
Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned -- it still uses vmovdqu.

because the encoding is more compact AFAIK (so potentially slightly fewer uop-cache/icache misses on a big application); beyond second-order effects like icache misses, vmovdqu is exactly the same speed as vmovdqa. Note that it is the same for fp code, where vmovups is preferred (by the Intel compiler) over vmovaps.

> Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned

AFAIK, in Sandy Bridge and later CPUs, movdqa and movdqu are equivalent when the memory is aligned. See the Architecture Optimization Manual, Table C-12a. vmovdqa and vmovdqu are even closer, as vmovdqu doesn't fault on unaligned memory. I think I even saw a recommendation to always use vmovdqu somewhere, but I can't remember the document now.

Igor Levicki:

Quote:

andysem wrote:
AFAIK, in Sandy Bridge and later CPUs, movdqa and movdqu are equivalent when the memory is aligned. See the Architecture Optimization Manual, Table C-12a. vmovdqa and vmovdqu are even closer, as vmovdqu doesn't fault on unaligned memory. I think I even saw a recommendation to always use vmovdqu somewhere, but I can't remember the document now.

Well, 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.


There's no unaligned penalty on SB, IB, and HW (for 128-bit loads), so long as you stay within the same cache line. When a memory access spans a cache line or a page, you take a significant latency hit of ~5 and ~28 clks respectively on that load. So, as long as you don't span cache lines or pages, your loads in SSE/AVX, whether aligned or unaligned, will not take longer.

Perfwise

Quote:

Igor Levicki wrote:
Well, 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.

as I posted above (sorry, my post was delayed by moderation for several days!), maybe the compiler uses unaligned moves because the encoding is more compact (to be verified)

Just an update: on HW with 256-bit loads there is no alignment penalty for loads that are misaligned with respect to 256-bit alignment when using VMOVUPS, but there is a penalty for spanning a cache-line or a page boundary.

Perfwise

Igor Levicki:

But if you write const __m256i var = something; isn't the compiler free to align/order that value properly in a read-only data segment?
Why would it ever need to use unaligned loads, then, when it can guarantee that the data will be properly aligned even without an explicit __declspec(align(32))?

By the way, specifying alignment on a __m256i variable doesn't force aligned loads in 13.1 update 5.


This is a short follow-up on Igor's test results:

>>Attached is the code with simple test driver. My results are:
>>
>> test_C : 6345.035 ms
>>test_SSE4.1 : 3944.771 ms
>> test_AVX2 : 2190.420 ms

Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846

[ 64-bit Windows 7 Professional / 64-bit test ]

test_C : 12904.534 ms
test_SSE4.1 : 6502.829 ms

[ 64-bit Windows 7 Professional / 32-bit test ]

test_C : 12423.721 ms
test_SSE4.1 : 7097.624 ms

iliyapolak:

Regarding the test_C function, I wonder whether removing the branch from inside the loop and putting the for_ loop inside the if_else block could lead to some speedup in execution.

Igor Levicki:

The action in the if/else block depends directly on the source data, so I don't see how it could be moved out of the loop -- you need to test every bit in the source to decide what to write to the destination.

The only other thing I can think of would be to use a lookup table of 256 x 24 bytes (6,144 bytes in size), fetch a byte from the source, and memcpy() the corresponding row from the table. To do that, you would have to precompute the table, because the background and foreground colors can differ each time, so the speedup might be noticeable only for large pictures. You could also try a smaller table (16 x 12 bytes), which would be faster to precompute, and split the source byte into two 4-bit nibbles for lookup, but it would be less efficient to copy from such a table (dwords instead of qwords). On the other hand, it would compete less for L1 cache bandwidth. Without testing it is impossible to say which one would be faster.
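
For what it's worth, a rough sketch of the 256 x 24-byte table idea, assuming the output is 8 pixels of 24-bit RGB per source byte (names and layout are illustrative, not the actual code from the attachment):

#include <string.h>
#include <stddef.h>

static unsigned char lut[256][24];

/* build the table once per foreground/background color pair:
   each source byte expands to 8 pixels of 3 bytes each */
static void build_lut(const unsigned char fg[3], const unsigned char bg[3])
{
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++) {
            const unsigned char *c = (b & (0x80 >> bit)) ? fg : bg;
            memcpy(&lut[b][bit * 3], c, 3);   /* one 3-byte pixel per source bit */
        }
}

/* expand one row of the monochrome source: one table row per source byte */
static void expand_row(unsigned char *dst, const unsigned char *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        memcpy(dst + i * 24, lut[src[i]], 24);   /* copy the precomputed 24-byte row */
}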

For a monochrome bitmap of 4,593 x 6,000 pixels I am getting ~18 ms for the AVX2 code I wrote, which is somewhere around 1460 MB/sec, or 11,680 MPixels/sec.

iliyapolak:

Sorry, I misunderstood the code.

Quote:

Igor Levicki wrote:

When I said 50% I actually meant 50% shorter execution time which would translate into 2x speedup. Sorry for confusion.

Attached is the code with simple test driver. My results are:

     test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
  test_AVX2 : 2190.420 ms

The difference is 1.80x here too, but it gets smaller (1.51x) if you change the pragma for the SSE4.1 function and let the compiler generate 128-bit SSE

after a rapid check of your AVX2 code path, it appears that you are effectively using only 75% of the YMM register width (24 bytes out of 32), so 25% of the computations (the ones for the unused upper 8 bytes) are pure waste; this is a classical case of partial vectorization

for better speedups you'll have to process more pixels in parallel; I'm not sure whether that's possible in your case, though

just another remark: the operation below is useless:
blend_mask = _mm256_and_si256(blend_mask, sign_mask);

the BLENDVx instructions use only the MSB of each mask element, so clearing the lower bits isn't required

iliyapolak:

>>>The only other thing I could think of would be to use a lookup table of 256 x 24 bytes (6,144 bytes in size), fetch a byte from source, and memcpy() the corresponding row from the table>>>

memcpy() could be a potential bottleneck in this case. IIRC, at the machine-code level this function uses rep movsb(d). I was thinking about some kind of streaming load/store function (based on inline assembly) that could speed up the data transfer.

Igor Levicki:

Quote:

bronxzv wrote:
for better speedups you'll have to process more pixels in parallel; I'm not sure whether that's possible in your case, though

I am afraid it is not, at least not efficiently.

Quote:

bronxzv wrote:
just another remark: the operation below is useless:
blend_mask = _mm256_and_si256(blend_mask, sign_mask);

the BLENDVx instructions use only the MSB of each mask element, so clearing the lower bits isn't required

Yes, I am aware of that. I did it while I was visualizing the data flow; it can be removed now.

Igor Levicki:

Quote:

iliyapolak wrote:
memcpy() could be a potential bottleneck in this case. IIRC, at the machine-code level this function uses rep movsb(d). I was thinking about some kind of streaming load/store function (based on inline assembly) that could speed up the data transfer.

memcpy() is long past a simple rep movsb in every compiler -- it is replaced by an optimal sequence of instructions and inlined for short copies.

iliyapolak:

@Igor

Do you think memcpy() could be implemented with the help of streaming store instructions, and probably loop unrolling? Did you disassemble the memcpy() function?

Quote:

Igor Levicki wrote:

Quote:

iliyapolak wrote: memcpy() could be a potential bottleneck in this case. IIRC, at the machine-code level this function uses rep movsb(d). I was thinking about some kind of streaming load/store function (based on inline assembly) that could speed up the data transfer.

memcpy() is long past a simple rep movsb in every compiler -- it is replaced by an optimal sequence of instructions and inlined for short copies.

actually, REP MOVSB is again a sensible choice for memset/memcpy, at least for some dataset sizes (>= 128 bytes), since it is optimized for best throughput on Ivy Bridge and Haswell, with more compact code than unrolled sequences of 16-byte moves but similar speed; that's what they call ERMSB in the IA optimization reference manual -- have a look at pages 3-65 to 3-68 of the June 2013 edition
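
For reference, a minimal way to get a REP MOVSB copy without inline assembly is the __movsb() intrinsic from <intrin.h> (a sketch only; whether the fast ERMSB path is used depends on the CPU, so the ERMSB feature flag should be checked first):

#include <intrin.h>
#include <stddef.h>

/* let REP MOVSB do the copy; on IVB/HSW with ERMSB this is competitive with
   unrolled 16-byte moves for the sizes discussed above */
static void ermsb_copy(void *dst, const void *src, size_t nbytes)
{
    __movsb((unsigned char *)dst, (const unsigned char *)src, nbytes);
}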

iliyapolak:

@bronxzv

so rep movsb(d) can be used for specific dataset sizes? Strange, because an unrolled streaming version seems to be faster, but at the cost of more machine code being executed.

Quote:

iliyapolak wrote:

@bronxzv

so rep movsb(d) can be used for specific dataset sizes? Strange, because an unrolled streaming version seems to be faster, but at the cost of more machine code being executed.

note that streaming stores are much slower than regular (cached) stores when your working set fits in the LLC, or even worse if you can work with L2 cache blocking; with streaming stores you force slow and power-hungry memory transactions that should not occur with temporal data used for auxiliary intermediate results, as is typical with multi-pass algorithms and cache blocking. Btw, it will be interesting to see how the L4 cache in Iris Pro deals with streaming stores (do streaming stores bypass the L4 as well, or not?)

please refer to the optimization manual for advice about REP MOVSB (for which dataset sizes its usage is sensible, etc.), since I have no practical experience with it on IVB/HSW; exactly like Igor, I thought it was something of the distant past until it was resurrected in IVB and I read about it in the guide only a few weeks ago

iliyapolak:

@bronxzv

Thanks for the reply. Initially I was confused by the seemingly much larger memory bandwidth that could go through the load/store ports when combined with prefetching and loop unrolling, but I see that this is not the case. Btw, my understanding is that streaming stores can be used for large movements of non-temporal data.

Igor Levicki:

From my own experience, streaming stores can be used to avoid cache pollution. Therefore, they are only useful if the data will not be consumed immediately. Before consuming the data you need a memory fence to make sure all outstanding writes have completed. Also, keep in mind that when using intrinsics the compiler may reorder your reads and writes unless you use a barrier as well. Streaming stores compete for write buffers (important when considering threading on logical cores), and they will not work as intended if you do not write out a full cache line of data at a time, because they use write-combining buffers.

Overall, they are useful in a very limited number of scenarios and must be used with great care. That is all from memory (I worked with them a long time ago), so it might not be 100% accurate, or it might not even be true for the latest CPUs.
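
A minimal sketch of that pattern with AVX intrinsics (assuming a 32-byte-aligned destination and a size that is a whole number of 64-byte cache lines):

#include <immintrin.h>
#include <stddef.h>

/* fill a buffer with non-temporal stores, then fence before anyone reads it;
   dst must be 32-byte aligned and n_vectors a multiple of 2 (one 64-byte line
   per pair of stores) */
static void stream_fill(__m256i *dst, __m256i value, size_t n_vectors)
{
    for (size_t i = 0; i < n_vectors; i += 2) {
        _mm256_stream_si256(dst + i,     value);  /* VMOVNTDQ, bypasses the caches */
        _mm256_stream_si256(dst + i + 1, value);
    }
    _mm_sfence();  /* make the write-combined stores globally visible */
}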

Now regarding memcpy() -- I couldn't get the 13.1 update 5 compiler to emit rep movsb/w/d for memcpy(), and I was too lazy to write it in assembler (hey, it's Friday :)), but here are the results of using memcpy() to copy 24 bytes from a fixed source to a fixed destination:

// IA32            (mov) = 1990.800 ms
// SSE4.1        (movsd) = 1532.112 ms
// SSE4.2 (movups/movsd) = 1312.965 ms
// AVX2 (vmovdqu/vmovsd) = 1312.950 ms

In all cases the compiler inlined the memcpy() call and replaced it with 12xMOV, 6xMOVSD, 2xMOVUPS+2xMOVSD, or 2xVMOVUPS+2xVMOVSD. The code size (not counting the source byte fetch and one shift for address calculation) was: MOV 53 bytes, MOVSD 38 bytes, MOVUPS+MOVSD 23 bytes, and VMOVUPS+VMOVSD 25 bytes.

From the perspective of code size, (V)MOVUPS+(V)MOVSD seems the most efficient, and it is also the fastest in the test. REP MOVSD might be even shorter, but I am not sure about the speed.
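
For what it's worth, the (V)MOVUPS+(V)MOVSD pattern corresponds to something like the sketch below (my own illustration, not the compiler's exact output):

#include <emmintrin.h>

/* copy 24 bytes as one 16-byte and one 8-byte unaligned move pair */
static inline void copy24(void *dst, const void *src)
{
    __m128i lo = _mm_loadu_si128((const __m128i *)src);                  /* MOVUPS load  */
    __m128d hi = _mm_load_sd((const double *)((const char *)src + 16));  /* MOVSD load   */
    _mm_storeu_si128((__m128i *)dst, lo);                                /* MOVUPS store */
    _mm_store_sd((double *)((char *)dst + 16), hi);                      /* MOVSD store  */
}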

Finally, bear in mind that the quick test I did for this is not realistic, because it uses a color table precalculated once in advance (not measured) and it always copies the same data, whereas in real life it would have to build the table every time (because of different foreground/background colors) and it would copy from different table rows, not always the same one (depending on the source byte). To make it realistic I would have to create a test driver with real-life data (i.e. a large monochrome BMP image).


>>// IA32 (mov) = 1990.800 ms
>>// SSE4.1 (movsd) = 1532.112 ms
>>// SSE4.2 (movups/movsd) = 1312.965 ms
>>// AVX2 (vmovdqu/vmovsd) = 1312.950 ms

I tested memcpy some time ago, and the overhead of calling the function could eat up all the advantages unless it is inlined, especially when you copy just 24 bytes. Also, my test results show that memcpy is faster than my FastMemCopy128 if the memory block is smaller than 128K ( 131072 bytes ). The 2nd function is based on:
...
for( j = i; j < ( i + iPageSize ); j += 32 )
{
_mm_stream_ps( ( RTfloat * )( ( RTchar * )pDst + j ), _mm_load_ps( ( RTfloat * )( ( RTchar * )pSrc + j ) ) );
_mm_stream_ps( ( RTfloat * )( ( RTchar * )pDst + j + 16 ), _mm_load_ps( ( RTfloat * )( ( RTchar * )pSrc + j + 16 ) ) );
}
...

Igor Levicki:

Sergey,

I am pretty sure you could get more performance with streaming stores if you use MOVAPS to prefetch 2KB of data into the L1 cache and then MOVNTPS to stream those 2KB out to memory. That requires three loops: an outer loop going in 2KB blocks, and two inner loops, one to prefetch and the other to stream. You need to write out a full cache line worth of data at a time in both inner loops. You can also play with the prefetch distance and different block sizes to see if more bandwidth can be squeezed out. It goes without saying that you should use VirtualAlloc() and VirtualLock() for the 4KB page where you buffer data for streaming, and that source and destination must be aligned.
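
A rough intrinsics sketch of that scheme, under stated assumptions only (16-byte-aligned src/dst, size a multiple of 2 KB; prefetcht0 stands in for the MOVAPS read pass so the compiler cannot drop dead loads):

#include <immintrin.h>
#include <stddef.h>

/* copy in 2 KB blocks: touch the block first, then stream it out a full
   64-byte line at a time */
static void block_copy_nt(void *dst, const void *src, size_t nbytes)
{
    const size_t BLOCK = 2048;
    for (size_t off = 0; off < nbytes; off += BLOCK) {
        const char *s = (const char *)src + off;
        char       *d = (char *)dst + off;
        for (size_t i = 0; i < BLOCK; i += 64)       /* pass 1: pull the block into L1 */
            _mm_prefetch(s + i, _MM_HINT_T0);
        for (size_t i = 0; i < BLOCK; i += 64) {     /* pass 2: stream the block out   */
            _mm_stream_ps((float *)(d + i),      _mm_load_ps((const float *)(s + i)));
            _mm_stream_ps((float *)(d + i + 16), _mm_load_ps((const float *)(s + i + 16)));
            _mm_stream_ps((float *)(d + i + 32), _mm_load_ps((const float *)(s + i + 32)));
            _mm_stream_ps((float *)(d + i + 48), _mm_load_ps((const float *)(s + i + 48)));
        }
    }
    _mm_sfence();   /* complete all streaming stores before the data is consumed */
}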

John D. McCalpin:

Streaming stores can be slower than ordinary stores even when the data is going to/from memory, if you are only using one or two threads per chip. This is because performance is limited by the number of cache miss buffers that a single core can utilize, and the limiting value is often a small fraction of the peak bandwidth available to the chip. With normal stores, the store misses can be prefetched into the L2 cache, so the fill buffers are occupied for a much shorter duration. With streaming stores, each transaction holds onto a fill buffer until the line is transferred all the way to the memory controller, so fewer transactions can be performed per unit time.

Once you are using several threads, the extra read traffic associated with the store misses becomes the limiting factor, and streaming stores become more efficient.

On my Xeon E5-2670 systems, a single thread runs the STREAM benchmark quite a bit faster if I disable streaming stores. (I can't find the numbers right now, but I think it was ~14 GB/s without streaming stores and ~10 GB/s with streaming stores.) When using all cores, the performance ratio tracks the ratio of total traffic, so the case with streaming stores gives ~38 GB/s (per chip), while the case without streaming stores is about 2/3 of that performance for the Copy and Scale kernels and 3/4 of it for the Add and Triad kernels.

John D. McCalpin, PhD "Dr. Bandwidth"

Quote:

Igor Levicki wrote:
I am afraid it is not, at least not efficiently.

sure, I've learned the hard way that it's far easier to spot this kind of problem than to fix it

anyway, your SSE4.1 path has the same issue (75% useful computations), so it doesn't explain the disappointing SSE4.1 (VEX.128 AVX) to AVX2 scaling. It may be a problem with your test framework, with a function call overhead at each loop iteration (which isn't "fair" to the fastest code path). I'm quite sure you'll get better timings (and a better SSE4.1 to AVX2 speedup) if you introduce a small inner loop calling your function with inlining, for example 100 iterations in the inner loop (workload entirely in the L1D cache, with random input data, more real-world like) and 10 M iterations for the outer loop (in order to have 1 G calls to the profiled function like the current version, and thus comparable timings); see the sketch below
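
Something along these lines, as a sketch of the loop structure only (kernel() is a stand-in for the inlined function under test, not a real function from the attachment):

/* 10 M outer iterations x 100 inlined inner calls = 1 G calls in total,
   with the working set staying in L1D across the inner loop */
enum { INNER = 100, OUTER = 10 * 1000 * 1000 };

static __forceinline void kernel(void)
{
    /* body of test_SSE41 / test_AVX2 would go here */
}

static void run_benchmark(void)
{
    for (int o = 0; o < OUTER; o++)
        for (int i = 0; i < INNER; i++)
            kernel();
}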

if you test it again this way I'll be interested to hear about your findings

Igor Levicki:

Well, I am testing it with no inlining because I wanted to eliminate variations in the generated code between functions caused by global compiler optimizations -- this version tests exactly the "kernel" of each version, and since all of them pay the same penalty for CALL/RET, that overhead can safely be discarded. Furthermore, this code should already be keeping everything in L1D, since it always reads the same source and writes the same destination address. Finally, there is no difference in calculation speed due to randomness of the input data, since the input is the same all the time (the only time I would want random data is to measure the impact of cache misses on table lookups, which I don't have in this code).

Anyway, feel free to experiment with the code, and by all means let me know if you find an efficient way to pack the data and use the full register width. Just bear in mind that it wouldn't be the first time that the most obvious solution is also the fastest one when it comes to assembler optimization (at least it happens to me often) :)


Quote:

Igor Levicki wrote:

 optimizations -- this version is exactly testing the "kernel" of each version and since all have the same penalty for CALL/RET it can be safely

that's exactly what I call being "unfair" to the fastest path: the more the tested function is optimized, the more this fixed function-call overhead matters and biases the comparison (leveling down the speedup). Your AVX2 path looks very good, and the way you expand the 8-bit mask to a 256-bit mask is very clever and already the optimal solution; it would be a shame if its true potential didn't show up in the measurements

Quote:

Igor Levicki wrote:
Anyway, feel free to experiment with the code

I have just tested it (plugged your routines into a test framework of mine, with a small inner loop as outlined above in this thread) to see the impact of the function call overhead; my findings are below:

NOINLINE [1]:
baseline 3698.9 ms optimized 2854.97 ms speedup = 1.296 x

INLINE [2]:
baseline 3539.13 ms optimized 2241.87 ms speedup = 1.579 x

the studied function is called 1e9 times, as in your example; "baseline" is your function test_SSE41 unchanged besides the INLINE/NOINLINE prefix, and "optimized" is your function test_AVX2 unchanged

after removing the useless code in test_AVX2, i.e. "blend_mask = _mm256_and_si256(blend_mask, sign_mask);" and "_mm256_zeroupper();", I get a better speedup (INLINE case shown):

baseline 3528.59 ms optimized 2122.16 ms speedup = 1.663 x

 

configuration: Core i7 4770K @ 4 GHz, Intel C++ compiler v13.1.3.198 (64-bit)

[1] #define NOINLINE __declspec(noinline)

[2] #define INLINE __forceinline

Igor Levicki:

Quote:

John D. McCalpin wrote:
Streaming stores...

Hello and welcome to my humble thread :)

Regarding streaming stores, this is the fastest variant for me on Haswell:

void memcopy(void *dst, const void *src, size_t nbytes)
{
    __asm {
        mov         esi, src
        mov         edi, dst
        mov         ecx, nbytes
        shr         ecx, 6              ; number of 64-byte blocks
main_loop:
        test        ecx, ecx
        jz          main_loop_end
        prefetcht0  [esi + 64 * 30]     ; prefetch 30 cache lines ahead
        movaps      xmm0, [esi]
        movaps      xmm1, [esi + 16]
        movaps      xmm2, [esi + 32]
        movaps      xmm3, [esi + 48]
        movntps     [edi], xmm0         ; stream out one full cache line
        movntps     [edi + 16], xmm1
        movntps     [edi + 32], xmm2
        movntps     [edi + 48], xmm3
        add         esi, 64
        add         edi, 64
        sub         ecx, 1
        jmp         main_loop
main_loop_end:
        sfence                          ; complete all streaming stores
    }
}


>>...Regarding streaming stores, this is the fastest variant for me on Haswell...

Thanks, Igor. I'll check whether it is the fastest version for Ivy Bridge as well. By the way, that code is well known, and I think the Intel Optimization Manual has a chapter on it.

Igor Levicki:

Quote:

Sergey Kostrov wrote:
Thanks, Igor. I'll check whether it is the fastest version for Ivy Bridge as well. By the way, that code is well known, and I think the Intel Optimization Manual has a chapter on it.

You may need to tweak the prefetch distance; this one is tuned for Haswell. The code might be well known, but I don't think many people know that an AVX2 version with 2 YMM registers and VEX prefixes is slower on Haswell than this.

iliyapolak:

It looks like a tweaked, single-loop version of the example from the ISDM.

iliyapolak:

I think the code posted by Igor could execute two loads and one store from the unrolled loop simultaneously, occupying the three load/store execution ports.

>>>>...
>>>>prefetcht0 [ esi + 64 * 30 ]
>>>>...
>>
>>...You may need to tweak prefetch distance, this is tuned for Haswell...

I've noticed that the magic number is 1920 ( = 64 * 30 ).

I simply would like to confirm that this is not 1080p/i related ( Full HD / 1920x1080 resolution ) and that it comes from something else?

Igor Levicki:

Quote:

Sergey Kostrov wrote:
I've noticed that the magic number is 1920 ( = 64 * 30 ).

I simply would like to confirm that this is not 1080p/i related ( Full HD / 1920x1080 resolution ) and that it comes from something else?

It is a coincidence -- I determined the number by trial and error (I made a loop that increments the prefetch distance by one cache line and re-tests).
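
The tuning loop is nothing more than a sweep, roughly like this (memcopy_with_prefetch() and now_ms() are placeholders for the real routine and timer, not actual code from this thread):

#include <stddef.h>

extern void   memcopy_with_prefetch(void *dst, const void *src, size_t nbytes, int dist);
extern double now_ms(void);

/* sweep the prefetch distance one cache line at a time and keep the best */
static int tune_prefetch_distance(void *dst, const void *src, size_t nbytes)
{
    int    best_dist = 64;
    double best_ms   = 1e30;
    for (int dist = 64; dist <= 64 * 64; dist += 64) {   /* 1..64 cache lines ahead */
        double t0 = now_ms();
        memcopy_with_prefetch(dst, src, nbytes, dist);
        double t1 = now_ms();
        if (t1 - t0 < best_ms) { best_ms = t1 - t0; best_dist = dist; }
    }
    return best_dist;
}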


>>...Regarding streaming stores, this is the fastest variant for me on Haswell...

Igor, did you verify / compare the performance of your memcopy function against the CRT function memcpy?

I'll post my results for two systems later. Thanks.

Note: USR stands for USER

*** Performance Results for a system with Ivy Bridge CPU - Prefetch Offset is 1920 bytes ***

Number of bytes to copy is 262144 ( 256KB )
Test Case 1 - USR FastMemCopy
Completed in: 31 ticks
Test Case 2 - CRT memcpy
Completed in: 16 ticks

Number of bytes to copy is 524288 ( 512KB )
Test Case 1 - USR FastMemCopy
Completed in: 62 ticks
Test Case 2 - CRT memcpy
Completed in: 47 ticks

Number of bytes to copy is 1048576 ( 1024KB )
Test Case 1 - USR FastMemCopy
Completed in: 109 ticks
Test Case 2 - CRT memcpy
Completed in: 94 ticks

Number of bytes to copy is 2097152 ( 2048KB )
Test Case 1 - USR FastMemCopy
Completed in: 219 ticks
Test Case 2 - CRT memcpy
Completed in: 172 ticks

Number of bytes to copy is 4194304 ( 4096KB )
Test Case 1 - USR FastMemCopy
Completed in: 437 ticks
Test Case 2 - CRT memcpy
Completed in: 468 ticks

Number of bytes to copy is 8388608 ( 8192KB )
Test Case 1 - USR FastMemCopy
Completed in: 905 ticks
Test Case 2 - CRT memcpy
Completed in: 983 ticks

Number of bytes to copy is 16777216 ( 16384KB )
Test Case 1 - USR FastMemCopy
Completed in: 2043 ticks
Test Case 2 - CRT memcpy
Completed in: 2184 ticks

Number of bytes to copy is 33554432 ( 32768KB )
Test Case 1 - USR FastMemCopy
Completed in: 4274 ticks
Test Case 2 - CRT memcpy
Completed in: 4524 ticks

Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8642 ticks
Test Case 2 - CRT memcpy
Completed in: 9080 ticks

*** Performance Results for a system with Ivy Bridge CPU - Different Prefetch Offsets ***

Prefetch Offset is 64 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 9033 ticks
Test Case 2 - CRT memcpy
Completed in: 9095 ticks

Prefetch Offset is 128 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8970 ticks
Test Case 2 - CRT memcpy
Completed in: 9079 ticks

Prefetch Offset is 256 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8829 ticks
Test Case 2 - CRT memcpy
Completed in: 9032 ticks

Prefetch Offset is 512 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8690 ticks
Test Case 2 - CRT memcpy
Completed in: 9157 ticks

Prefetch Offset is 1024 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8612 ticks
Test Case 2 - CRT memcpy
Completed in: 9016 ticks

Prefetch Offset is 2048 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8612 ticks
Test Case 2 - CRT memcpy
Completed in: 9110 ticks

Prefetch Offset is 4096 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8611 ticks
Test Case 2 - CRT memcpy
Completed in: 9049 ticks

Prefetch Offset is 8192 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8580 ticks
Test Case 2 - CRT memcpy
Completed in: 9095 ticks

Prefetch Offset is 16384 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 8658 ticks
Test Case 2 - CRT memcpy
Completed in: 9079 ticks

Prefetch Offset is 32768 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 10265 ticks
Test Case 2 - CRT memcpy
Completed in: 9017 ticks

Prefetch Offset is 65536 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 10265 ticks
Test Case 2 - CRT memcpy
Completed in: 9064 ticks

Prefetch Offset is 131072 bytes
Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 10561 ticks
Test Case 2 - CRT memcpy
Completed in: 9110 ticks

*** Performance Results for a system with Ivy Bridge CPU - Different Prefetch Offsets - 2048 vs 8192 ***

Prefetch Offset is 2048 bytes
Number of bytes to copy is 33554432 ( 32768KB )
Test Case 1 - USR FastMemCopy
Completed in: 4259 ticks
Completed in: 4275 ticks
Completed in: 4274 ticks
Completed in: 4290 ticks
Completed in: 4275 ticks
Completed in: 4274 ticks
Completed in: 4274 ticks
Completed in: 4291 ticks
Test Case 2 - CRT memcpy
Completed in: 4524 ticks
Completed in: 4508 ticks
Completed in: 4508 ticks
Completed in: 4524 ticks
Completed in: 4524 ticks
Completed in: 4524 ticks
Completed in: 4524 ticks
Completed in: 4509 ticks

Prefetch Offset is 8192 bytes
Number of bytes to copy is 33554432 ( 32768KB )
Test Case 1 - USR FastMemCopy
Completed in: 4274 ticks
Completed in: 4275 ticks
Completed in: 4274 ticks
Completed in: 4243 ticks
Completed in: 4197 ticks
Completed in: 4274 ticks
Completed in: 4275 ticks
Completed in: 4258 ticks
Test Case 2 - CRT memcpy
Completed in: 4477 ticks
Completed in: 4493 ticks
Completed in: 4493 ticks
Completed in: 4524 ticks
Completed in: 4524 ticks
Completed in: 4524 ticks
Completed in: 4508 ticks
Completed in: 4524 ticks
