Cilk_for returns wrong data in array.

Cilk_for returns wrong data in array.

Hello everyone. I am new to multi threading programming. Recently, i have a project, which i apply cilk_for into it. Here is the code:

void myfunction(short *myarray)
{
	m128i *array = (m128i*) myarray
	cilk_for(int i=0; i<N_LOOP1; i++)
		{
			for(int z = 0; z<N_LOOP2; z+=8)
			{
				array[z]        =  _mm_and_si128(array[z],mym128i);
				array[z+1]        =  _mm_and_si128(array[z+1],mym128i);
				array[z+2]        =  _mm_and_si128(array[z+2],mym128i);
				array[z+3]        =  _mm_and_si128(array[z+3],mym128i);
				array[z+4]        =  _mm_and_si128(array[z+4],mym128i);
				array[z+5]        =  _mm_and_si128(array[z+5],mym128i);
				array[z+6]        =  _mm_and_si128(array[z+6],mym128i);
				array[z+7]        =  _mm_and_si128(array[z+7],mym128i);
				array+=8;
			}
		}
}

After the above code ran, ridiculous thing happens. The data in array isn't updated correctly. For example, if i have an array with 1000 elements, there is a chance that the array will be updated correctly (1000 elements are AND-ed). But there is also a chance that some parts of the array will be omited (first element to 300th element are AND-ed, 301st element to 505th element aren't AND-ed, 506th element to 707th element are AND-ed, etc,...). These omited parts are random in each individual run, so i think the problem here is about cache miss. Am I right? Please tell me, any help is appreciated. :)

 

publicaciones de 9 / 0 nuevos
Último envío
Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.

This code has a data race.  The increment on line 16 should  be removed, and the array indexes calculated as a function of i and z.

 

Cita:

bradley@mit.edu escribió:

This code has a data race.  The increment on line 16 should  be removed, and the array indexes calculated as a function of i and z.

 

 

Thank you bradley, you're right. The data race is in line 16. Here is my complete function:

void myfunction(short *myarray, int N_LOOP1, int N_LOOP2)
{
	__m128i mym128i = _mm_set1_epi16(0xdfff);
	__m128i *array = (__m128i*) myarray;
	cilk_for(int i=0; i<N_LOOP1; i++)
		{
		int step = i*N_LOOP2; 
			for(int z = 0; z<N_LOOP2; z+=8)
			{
				array[step+z]        =  _mm_and_si128(array[step+z],mym128i);
				array[step+z+1]        =  _mm_and_si128(array[step+z+1],mym128i);
				array[step+z+2]        =  _mm_and_si128(array[step+z+2],mym128i);
				array[step+z+3]        =  _mm_and_si128(array[step+z+3],mym128i);
				array[step+z+4]        =  _mm_and_si128(array[step+z+4],mym128i);
				array[step+z+5]        =  _mm_and_si128(array[step+z+5],mym128i);
				array[step+z+6]        =  _mm_and_si128(array[step+z+6],mym128i);
				array[step+z+7]        =  _mm_and_si128(array[step+z+7],mym128i);
			}
		}
}

after the small modification, the data in array is now correct. But the executed time of cilk_for is about 1,5 time of normal for-loop. Do you have any ideal?

Update1 - Funny thing: My build mode is "Release" with /O3 flag. If i run this code by windows debugger (release mode) in visual studio, the executed time of cilk_for is about 1,5 time of normal for-loop, as I said above, and the CPU usage of cilk_for code is only about 80% (2 physical core, 4 thread). But if i run this code by *.exe file, the executed times of cilk_for and normal for are the same and the CPU usage of cilk_for code is 100%. This makes me so confused.  

What is N_LOOP1 and N_LOOP2? There is overhead associated with a cilk_for loop. You need to have enough work to amortize that overhead.

All timing should always be done outside of the debugger.  When you're running under the debugger, there are various events (such as as module load and thread creation) which cause the application to "coordinate" with the debugger. When that happens, all threads are suspended until the event is dealt with. See http://msdn.microsoft.com/en-us/library/windows/desktop/ms679302(v=vs.85).aspx for more information.

    - Barry

Barry's comments are the first thing I'd check.  Here's a 2nd-order thing to check after that.

The cilk_for is thwarting code-generation optimizations in the loop body.  With a "for", each statement of the loop body becomes two instructions (separated in the code to account for delays):

vpand     offset(%r10,%r9), %xmm0, %xmmju
vmovdqu   %xmmj, offset(%r10,%r9) 

That's as good as it can get.  The vpand and vmovdqu pair in the code are separated somewhat for sake of instruction scheduling.  These two instructions can be pipelined with a throughput of one pair per clock cycle, assuming that sufficient cache bandwidth exists.  

With the cilk_for, I'm seeing a longer sequence:

lea       offset(%r11), %r15d                          
movslq    %r15d, %r15                                 
shlq      $4, %r15                       
vpand     (%r9,%r15), %xmm0, %xmm2                
vmovdqu   %xmm2, (%r9,%r15)       

I'm not sure of the pipelined throughput for this sequence, but it definitely requires more resources.  

The code-generation issue is related to a known issue with phase order in the compiler, where cilk_for loops are too soon and confuse some of the loop-oriented optimizations.  Basically, the lowering of a cilk_for creates a function for executing the body.  The function takes a struct as an argument.  The struct contains fields that point to variables outside the loop that are used by the loop body.  It's all very similar to lowering of a C++ lambda expression declared with "[&]{...}".)  The extra level of indirection confuses the loop optimizer.  (

A possible work-around is "do it yourself" optimization.  Immediately inside the cilk_for loop, create local temporary variables that capture the values of interest.  In this example, there's a simple way to do that: move the initializations of mym128i and array to inside the cilk_for, like this:

void myfunction(short *myarray, int N_LOOP1, int N_LOOP2)
{
        cilk_for(int i=0; i<N_LOOP1; i++)
                {
                __m128i mym128i = _mm_set1_epi16(0xdfff);
                __m128i *array = (__m128i*) myarray;
                int step = i*N_LOOP2;
                        for(int z = 0; z<N_LOOP2; z+=8)
                        {
                                array[step+z]          =  _mm_and_si128(array[step+z],mym128i);
                                array[step+z+1]        =  _mm_and_si128(array[step+z+1],mym128i);
                                array[step+z+2]        =  _mm_and_si128(array[step+z+2],mym128i);
                                array[step+z+3]        =  _mm_and_si128(array[step+z+3],mym128i);
                                array[step+z+4]        =  _mm_and_si128(array[step+z+4],mym128i);
                                array[step+z+5]        =  _mm_and_si128(array[step+z+5],mym128i);
                                array[step+z+6]        =  _mm_and_si128(array[step+z+6],mym128i);
                                array[step+z+7]        =  _mm_and_si128(array[step+z+7],mym128i);
                        }
                }
}

Now mym128i and array are dumb simple scalar variables, not indirects off structure fields, that the compiler understands.  With the code written that way, I'm seeing assembly code for the inner loop that has the efficient vpand/vmovdqu pairs.

FYI, I'm using icc 14.0.3.174, with "icc -S -O3 -xHost" on an Intel(R) 4th Generation Core(TM) processor.  I tried 15.0 beta, and it had the same problem.

Thank you, Barry and Arch :)

@Barry: I want to divide my array into 2 section to manipulate it, that is why I have N_LOOP1 and N_LOOP2. The total array's element is equal to N_LOOP1 * N_LOOP2 :).

As I mentioned in the above post, the CPU usage of cilk_for code is 100%, but in the debugging mode (Release), it can't reach to 100%. Same thing happens if I build my project into DLL for another project, CPU usage of cilk_for code is about 80~90%. I think that is because some communicate event happen between DLL and running project. Is that right?

@Arch: So, if the cilk_for code makes it harder for the code-generation to be optimized, one cilk_for loop with total iteration (N_LOOP1 * N_LOOP2) would be better? 

I've tried your code above, sadly, the running time of cilk_for is still higher than normal for. Do you think the amount of work in my code can armotize the overhead of cilk_for?

I am using Intel® C++ Composer XE 2013 SP1 for Windows, Intel Core i3-3220. Build mode is release with /O3 flag.

What are the values of N_LOOP1 and N_LOOP2?  We need to know those to estimate how much work is in the loop and which levels of the memory hierarchy are being flogged.

Cita:

Arch D. Robison (Intel) escribió:

What are the values of N_LOOP1 and N_LOOP2?  We need to know those to estimate how much work is in the loop and which levels of the memory hierarchy are being flogged.

N_LOOP1 is about 300 and N_LOOP2 is about 50000. :)

Given that the code is simply advancing through memory, I wouldn't be surprised if you're hitting memory bandwidth issues. You can verify this by running two copies of the serial code simultaneously and seeing how they perform.

    - Barry

Deje un comentario

Por favor inicie sesión para agregar un comentario. ¿No es socio? Únase ya