How to extract DWORD from upper half of 256-bit register?


Note: Simply for comparison in order to see a performance difference between Ivy Bridge and Pentium 4 systems.

*** Performance Results for a system with Pentium 4 CPU - Prefetch Offset is 1920 bytes ***

Number of bytes to copy is 262144 ( 256KB )
Test Case 1 - USR FastMemCopy
Completed in: 187 ticks
Test Case 2 - CRT memcpy
Completed in: 172 ticks

Number of bytes to copy is 524288 ( 512KB )
Test Case 1 - USR FastMemCopy
Completed in: 734 ticks
Test Case 2 - CRT memcpy
Completed in: 828 ticks

Number of bytes to copy is 1048576 ( 1024KB )
Test Case 1 - USR FastMemCopy
Completed in: 1516 ticks
Test Case 2 - CRT memcpy
Completed in: 1641 ticks

Number of bytes to copy is 2097152 ( 2048KB )
Test Case 1 - USR FastMemCopy
Completed in: 2797 ticks
Test Case 2 - CRT memcpy
Completed in: 2937 ticks

Number of bytes to copy is 4194304 ( 4096KB )
Test Case 1 - USR FastMemCopy
Completed in: 5985 ticks
Test Case 2 - CRT memcpy
Completed in: 6562 ticks

Number of bytes to copy is 8388608 ( 8192KB )
Test Case 1 - USR FastMemCopy
Completed in: 11859 ticks
Test Case 2 - CRT memcpy
Completed in: 12391 ticks

Number of bytes to copy is 16777216 ( 16384KB )
Test Case 1 - USR FastMemCopy
Completed in: 20094 ticks
Test Case 2 - CRT memcpy
Completed in: 20343 ticks

Number of bytes to copy is 33554432 ( 32768KB )
Test Case 1 - USR FastMemCopy
Completed in: 44031 ticks
Test Case 2 - CRT memcpy
Completed in: 47250 ticks

Number of bytes to copy is 67108864 ( 65536KB )
Test Case 1 - USR FastMemCopy
Completed in: 78812 ticks
Test Case 2 - CRT memcpy
Completed in: 79281 ticks

Igor,

I see that FastMemCopy ( based on the example you've provided / I made some modifications to support 32-bit and 64-bit platforms ) doesn't outperform CRT memcpy for smaller memory blocks up to some threshold ( 2MB / 4MB, depending on the system / see performance data ). For larger memory blocks FastMemCopy is faster, but the performance difference shrinks as the size of the memory block increases.

Quote:

Sergey Kostrov wrote:

Igor,

I see that FastMemCopy ( based on the example you've provided / I did some modifications to support 32-bit and 64-bit platforms ) doesn't outperform CRT memcpy for smaller memory blocks up to some threshold ( 2MB / 4MB and it depends on a system / see performance data ). For larger memory blocks FastMemCopy is faster but a performance difference drops down as soon as the size of a memory block increases.

I never claimed that it outperforms CRT memcpy(), because the CRT version is optimized for many more cases, especially small copies.

The code I posted is just a demo of streaming stores and prefetch. The reason it has slightly better performance on blocks larger than 2MB/4MB is that streaming bypasses the cache (it uses write-combining buffers), so the speed is not affected when your dataset is too big to fit in the last-level cache.

>>...Reason why it has a tiny bit better performance on blocks larger than 2MB/4MB is because streaming bypasses cache...

This is exactly what I've observed on two systems with different CPUs.

Here is an AVX-based variation of the memory copying routine. It is based on eight-way loop unrolling and non-temporal stores through the YMM registers. Prefetching is software-based. Later I will test iteration unrolling and hardware prefetching.

#include <stdio.h>
#include <stdlib.h>

void FastAVX_MemCpy(const void *source, void *dest, const unsigned int length){

    const unsigned int Len = length;
    if(NULL == source || NULL == dest){
        if(!source){
            printf("Null pointer has been passed [%p] \n", &source);
            exit(1);
        }
        else if(!dest){
            printf("Null pointer has been passed [%p] \n", &dest);
            exit(1);
        }
    }
    else if(Len % 32 != 0){
        printf("length argument must be a multiple of 32: %d \n", Len);
        exit(1);
    }
    else{
        /* MSVC 32-bit inline assembly */
        _asm{
            mov edi,dest
            mov esi,source
            mov edx,source
            add edx,dword ptr Len
            align 32
copy_loop:
            prefetcht0 [esi+256 * 32]
            vmovaps ymm0, ymmword ptr [esi]
            vmovaps ymm1, ymmword ptr [esi+32]
            vmovaps ymm2, ymmword ptr [esi+64]
            vmovaps ymm3, ymmword ptr [esi+96]
            vmovaps ymm4, ymmword ptr [esi+128]
            vmovaps ymm5, ymmword ptr [esi+160]
            vmovaps ymm6, ymmword ptr [esi+192]
            vmovaps ymm7, ymmword ptr [esi+228]
            vmovntps ymmword ptr [edi], ymm0
            vmovntps ymmword ptr [edi+32], ymm1
            vmovntps ymmword ptr [edi+64], ymm2
            vmovntps ymmword ptr [edi+96], ymm3
            vmovntps ymmword ptr [edi+128],ymm4
            vmovntps ymmword ptr [edi+160],ymm5
            vmovntps ymmword ptr [edi+190],ymm6
            vmovntps ymmword ptr [edi+228],ymm7
            add esi,256
            add edi,256
            cmp esi,edx
            jne copy_loop
            sfence
        }
    }
}

Two errors in the previous post's code snippet: the ECX register should be zeroed in the prefetching loop, and the offset during unrolling should be esi+ecx+224.

Quote:

iliyapolak wrote:

copy_loop:
                    prefetcht0 [esi+256 * 32]
                    vmovaps ymm0, ymmword ptr [esi]
                    vmovaps ymm1, ymmword ptr [esi+32]
                    vmovaps ymm2, ymmword ptr [esi+64]
                    vmovaps ymm3, ymmword ptr [esi+96]
                    vmovaps ymm4, ymmword ptr [esi+128]
                    vmovaps ymm5, ymmword ptr [esi+160]
                    vmovaps ymm6, ymmword ptr [esi+192]
                    vmovaps ymm7, ymmword ptr [esi+228]

You read 256 bytes per iteration but prefetch only 64 bytes (one cache line).

It will probably be faster to add three more prefetches, or to simply unroll 2x, in both cases with perfectly matched prefetch and read sizes.

EDIT clumsy comment of mine removed

I still need to tweak that code and test it. I think the prefetching distance is greater than 64 bytes: on every iteration of the loop the prefetched address is 256*32 bytes ahead of the ESI register.

Keep in mind two things:

1. The prefetch instruction is merely a hint. It may or may not do anything, depending on various factors.

2. When it does do something, the prefetch instruction competes for memory bus bandwidth, so use it sparingly or you will get the opposite of the effect you wanted.

 

One prefetch instruction typically fetches one cache line (64 B), so you are explicitly prefetching only 1/4 of your data in your example.

Quote:

iliyapolak wrote:
I think that prefetching distance is greater than 64 bytes.On every iteration of the loop esi register is incremented by 256*32 bytes.

Your prefetch distance is another issue: it looks quite big at 8 KB. That can be a problem if you have another thread fighting for the L1D cache, and you also won't prefetch the initial 8 KB; but as you say, it's a matter of tuning.

 

Thanks for correcting me. It seems I did not read the sentence in the Optimization Manual which states that a single cache line is loaded by a prefetch instruction.

The prefetching distance can be directly related to the data needed for the computation. The problem is finding how far ahead to prefetch the data. Prefetching too far can saturate the bus and, as you pointed out, can cause issues when another thread is competing for the L1 data cache. For a particle diffusion program, a prefetching distance of at least one particle object could be sufficient (of course, the memory layout of such objects should be taken into account).
