Random memory read performance difference between GPU and CPU (I7-4770R)?

We are running simple code that does random reads and sequential writes (i.e. a gather operation) on both the CPU and the GPU part of the I7-4770R (separately, one at a time), and the GPU is 4x slower than the CPU. With sequential reads and writes, and even with random writes, performance is very similar, which suggests that both the internals of the chip and the memory controller let the GPU access DRAM as fast as the CPU does. However, we have no idea why random reads suffer a 4x performance penalty, and this limits our application's performance considerably. We would like to know the reason for this difference and whether there is a remedy for it.

Here are also the numbers from our experiments. The metric is execution time, so the lower the better.

 

Configuration                                       MAP      REDUCE   GATHER   SCATTER
Intel i-4770r IrisPro-16G mem-4 Cores-OpenMP-CPU    24.73    13.65    36.34    231.67
Intel i-4770r IrisPro-16G mem-40 EU-OpenCL-GPU      23.55    16.29    167.03   270.7
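For reference, the GPU/CPU ratios implied by these timings can be worked out directly. The sketch below is plain arithmetic on the posted numbers; the helper name `slowdown` is made up for illustration and is not part of the benchmark.

```c
#include <assert.h>

/* Ratio of GPU time to CPU time; values above 1.0 mean the GPU is slower. */
double slowdown(double gpu_time, double cpu_time)
{
    assert(cpu_time > 0.0);
    return gpu_time / cpu_time;
}
```

With the posted timings, `slowdown(167.03, 36.34)` is about 4.6 for gather, while map, reduce and scatter all stay between roughly 0.95 and 1.2.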


Hi Norbert,

Would it be possible to provide your benchmarks to us? If you do not want to post it in a public forum, you could send it as a private message.

Thanks!

 

Hi Robert,
 
Thank you for the help. Please see the details here:
 

For map:

- create an input and an output array of 32M integer elements each
- fill the input array with data
- walk the input array in sequence, assigning the value of each input element to the output element at the same position

int a[n], b[n];
fill_data(a);
for (i = 0; i < 32*1024*1024; ++i)
  b[i] = a[i];
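A compilable version of the map loop might look like the sketch below. It assumes heap allocation, since an array of 32M `int` is 128 MiB with a 4-byte `int` and will not fit on a typical stack, and it assumes an OpenMP `parallel for` on the CPU side. The names `map_copy` and `run_map` and the fill pattern are illustrative, not the benchmark's actual code.

```c
#include <assert.h>
#include <stdlib.h>

/* Map: sequential read from a, sequential write to b. */
void map_copy(const int *a, int *b, long n)
{
    assert(a != NULL && b != NULL && n >= 0);
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (long i = 0; i < n; ++i)
        b[i] = a[i];
}

/* Allocate both arrays on the heap, fill the input, run the map.
   Returns 0 on success, -1 if allocation fails or the copy is wrong. */
int run_map(long n)
{
    assert(n > 0);
    int *a = malloc((size_t)n * sizeof *a);
    int *b = malloc((size_t)n * sizeof *b);
    int status = (a != NULL && b != NULL) ? 0 : -1;

    if (status == 0) {
        for (long i = 0; i < n; ++i)   /* stand-in for fill_data(a) */
            a[i] = (int)i;
        map_copy(a, b, n);
        if (b[n - 1] != a[n - 1])      /* spot check of the last element */
            status = -1;
    }
    free(a);
    free(b);
    return status;
}
```

Calling `run_map(32L * 1024 * 1024)` would reproduce the 32M-element case; the `#ifdef` guard lets the same file build with or without OpenMP enabled.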

 

For gather:

- create an input, an output and an index array of 32M elements each
- fill the input array with data
- fill the index array with random indices into the input array
- walk the index array in sequence, using each random index value to gather from the input array for sequential assignment to the output

int a[n], b[n], index[n];
fill_data(a);
fill_random_index(index);  // fills with random values between 0 and 32M-1
for (i = 0; i < 32*1024*1024; ++i)
{
   idx = index[i];
   b[i] = a[idx];
}
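The body of `fill_random_index` is not shown, so the following is only a sketch of how the gather setup could be written. It assumes a xorshift32 generator rather than `rand()`, because `RAND_MAX` is only guaranteed to be at least 32767 and so `rand()` alone may not cover indices up to 32M-1. The extra `n` and `seed` parameters are additions for illustration, and the small modulo bias is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One xorshift32 step; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill index[0..n-1] with pseudo-random values in [0, n-1]. */
void fill_random_index(int *index, long n, uint32_t seed)
{
    assert(index != NULL && n > 0);
    uint32_t state = seed ? seed : 1u;  /* avoid the all-zero state */
    for (long i = 0; i < n; ++i)
        index[i] = (int)(xorshift32(&state) % (uint32_t)n);
}

/* Gather: random read from a, sequential write to b. */
void gather(const int *a, const int *index, int *b, long n)
{
    assert(a != NULL && index != NULL && b != NULL);
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (long i = 0; i < n; ++i)
        b[i] = a[index[i]];
}
```

Each iteration writes a distinct `b[i]`, so the loop can be parallelised without synchronisation; only the reads from `a` are scattered across memory.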

 

For scatter:

- similar to gather, but walk the index array in sequence and use each random index value to scatter to the output array from the sequentially read input

int a[n], b[n], index[n];
fill_data(a);
fill_random_index(index);  // fills with random values between 0 and 32M-1
for (i = 0; i < 32*1024*1024; ++i)
{
   idx = index[i];
   b[idx] = a[i];
}
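One property of scatter is worth spelling out: if the indices are drawn independently, some output slots are written more than once and others never, so the result of a parallel scatter depends on write order. The sketch below shows one way around that, using a random permutation (Fisher-Yates shuffle) as the index array. This is an alternative under stated assumptions, not what the benchmark is described as doing, and `fill_permutation_index` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One xorshift32 step; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill index[0..n-1] with a pseudo-random permutation of 0..n-1. */
void fill_permutation_index(int *index, long n, uint32_t seed)
{
    assert(index != NULL && n > 0);
    uint32_t state = seed ? seed : 1u;
    for (long i = 0; i < n; ++i)
        index[i] = (int)i;
    for (long i = n - 1; i > 0; --i) {          /* Fisher-Yates shuffle */
        long j = (long)(xorshift32(&state) % (uint32_t)(i + 1));
        int tmp = index[i];
        index[i] = index[j];
        index[j] = tmp;
    }
}

/* Scatter: sequential read from a, random write to b. */
void scatter(const int *a, const int *index, int *b, long n)
{
    assert(a != NULL && index != NULL && b != NULL);
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (long i = 0; i < n; ++i)
        b[index[i]] = a[i];
}
```

With a permutation every `b[index[i]]` is written exactly once, so the output is the same regardless of how the iterations are scheduled.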

 

On the GPU, each operation is a single OpenCL kernel; on the CPU, we use OpenMP across multiple cores.
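The kernels themselves are not shown here. As a rough sketch, a gather kernel with one work-item per output element could look like the OpenCL C source below, held as a host-side C string of the kind passed to `clCreateProgramWithSource`. The names `gather_kernel_src` and `gather_kernel_len` are illustrative, and all host setup (context, queue, buffers) is omitted.

```c
#include <assert.h>
#include <string.h>

/* OpenCL C source for a gather kernel (assumed shape, not the actual benchmark). */
static const char *gather_kernel_src =
    "__kernel void gather(__global const int *a,\n"
    "                     __global const int *index,\n"
    "                     __global int *b)\n"
    "{\n"
    "    size_t i = get_global_id(0);  /* one work-item per output element */\n"
    "    b[i] = a[index[i]];           /* random read, sequential write */\n"
    "}\n";

/* Length of the source, as needed for the lengths argument of
   clCreateProgramWithSource. */
size_t gather_kernel_len(void)
{
    size_t len = strlen(gather_kernel_src);
    assert(len > 0);  /* sanity check before handing the source to the runtime */
    return len;
}
```

The kernel would be launched with a global work size of 32M so that each work-item handles one element, mirroring one iteration of the CPU loop.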

Best regards,
Norbert

Norbert,

A couple of things:

1. What about reduce?

2. If you provide the actual code, it would speed things up quite a bit.

Thanks!
