Inner loops with OpenCL

Hello

I am new to OpenCL and want to parallelize some looping code that
does LU factorization. The loop structure is exactly as shown in the
code below:

for(int k = 0; k < N-1; k++)
{
    for(int i = k+1; i < N; i++)
        S[i*N + k] = S[i*N + k] / S[k*N + k];

    for(int j = k+1; j < N; j++)
        for(int i = k+1; i < N; i++)
            S[i*N + j] -= S[i*N + k] * S[k*N + j];
}

I have written a simple OpenCL kernel that uses individual work items (no grouping). It is as follows:

int IDx = get_global_id(0);
int IDy = get_global_id(1);

for(int k = 0; k < n-1; k++)
{
    barrier(CLK_GLOBAL_MEM_FENCE);

    if(IDy > k && IDx == k)
        matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];

    barrier(CLK_GLOBAL_MEM_FENCE);

    for(int j = k+1; j < n; j++)
    {
        if(IDy > k && IDx == j)
            matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];
    }
}

But I don't get correct results when compared to the serial code. This
is my own attempt at an OpenCL kernel, and I am still learning how the
data-parallel scheme in OpenCL works. Can you point out what I am doing
wrong in the kernel?

Anybody there ....

Hello everybody,

I have encountered exactly the same problem: a barrier within an inner
loop does not seem to guarantee synchronization among items of the same
work group. Does anybody have an explanation for this issue?

Thank you in advance and kind regards
Stefan H.,
Germany

Sorry for the late answer.
I will check this issue and return to you as soon as possible.
Can you give me the version of the OpenCL SDK you are using?

Thanks.

Hi Stefanh,

To be clear, a barrier only guarantees that all threads (work items in a work group) will stop at that point and wait for all memory operations to complete before continuing on. That does not mean memory accesses were not reordered before or after that call.

To be fair, I could see how the above problem could come from an overly eager parser unrolling loops.

Hi,

I went through the code and I think that the issue is in the implementation of the algorithm, which requires that all work groups complete their work before proceeding to the code after the barrier. However, please note that a barrier in OpenCL does not provide this mechanism (synchronization between work groups), as it only synchronizes work items in the same work group.

To validate this assumption please run the code with a single work group configuration (local size == global size).

Please let us know what you think about our analysis.

Regards,

Amjad

"I went through the code and I think that the issue is in the
implementation of the algorithm which requires that all work groups must
complete their work prior to proceeding to the code after the barrier."

This was my original thought, but I didn't have enough time to verify that it was occurring, and to be honest I wasn't 100% sure how the original algorithm was intended to function. I was going to recommend testing the code on the CPU, using an outer loop to simulate the random execution environment, to ensure it was correct. After you add the outer loop, you could write a get_global_id() function that returns a random ID which has not been used yet. I have used this in the past to debug issues with algorithms for the GPU, and I am sure it will work here.

Also, I recently saw a paper on a similar subject: semantically comparing OpenCL algorithms to their C counterparts to ensure equivalence. It was very interesting, as they ran their code through the Bullet physics engine's OpenCL implementation and found a number of previously unknown errors as well as compiler bugs. Obviously it wouldn't work when you make architectural trade-offs, but for simple porting of algorithms it could be of real benefit.
