Hello,
I have been testing Intel's OpenCL SDK for heterogeneous computing with the HD 2500 iGPU. I ran a few benchmarks to measure the memory bandwidth of both the CPU and iGPU devices. Here are the results:
---------------------------------------------------------------------------------------------------------------------------------
1. Memory Read [Single] : All threads read from a single physical address.
CPU - 70 GB/s; iGPU - ~5 GB/s
2. Memory Read [Linear] : Each thread reads from sequential memory addresses, indexed by its thread ID.
CPU - 50 GB/s; iGPU - 5.8 GB/s
3. Memory Read [Uncached] : The reads are offset so that cache thrashing is maximized.
CPU - 5.8 GB/s; iGPU - 4.5 GB/s
4. Memory Write [Linear] : Threads write to sequential memory addresses.
CPU - 60 GB/s; iGPU - 1.3 GB/s
---------------------------------------------------------------------------------------------------------------------------------
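The four access patterns above can be sketched roughly as follows. This is a plain C illustration of the indexing only, not the actual benchmark kernels: `gid` stands in for what `get_global_id(0)` would return in an OpenCL kernel, and the function names, the `stride` parameter, and the wrap-around at `n` are assumptions made for the example. The last helper shows how a GB/s figure is usually derived, assuming 1 GB = 1e9 bytes.

```c
#include <stddef.h>

/* Hypothetical index helpers; gid plays the role of get_global_id(0). */

/* 1. Single: every work item reads the same element. */
size_t index_single(size_t gid)
{
    (void)gid;          /* the work-item ID is deliberately ignored */
    return 0;
}

/* 2 and 4. Linear: work item i touches element i. */
size_t index_linear(size_t gid)
{
    return gid;
}

/* 3. Uncached: neighbouring work items land `stride` elements apart,
 * wrapping around a buffer of n elements. The stride value is an
 * assumption; a benchmark would pick it to defeat the caches. */
size_t index_strided(size_t gid, size_t stride, size_t n)
{
    return (gid * stride) % n;
}

/* Bandwidth in GB/s from bytes moved and elapsed seconds (1 GB = 1e9 bytes). */
double bandwidth_gbs(double bytes, double seconds)
{
    return bytes / seconds / 1e9;
}
```

With these helpers, a read benchmark would load `src[index_...(gid)]` in each work item and report `bandwidth_gbs(total_bytes, elapsed_seconds)`.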
Using a vec4 data type on the CPU gives the maximum bandwidth, which is also what the optimization guide recommends. On the GPU, however, I get the same bandwidth for all data types. I have a few questions:
a) How is the iGPU's shader core (EU) laid out? I know that it has 4 ALUs, but do they work on different threads (OpenCL threads, i.e. work items) or on only one thread, like the VLIW4 unit in previous AMD GPUs?
b) Why is the iGPU's access to global memory crippled compared to the CPU's? OK, the CPU has big caches, but doesn't IVB have an L1, L2, L3 hierarchy too? The iGPU bandwidth is nearly equal to PCIe transfer speeds; in that case I have much better options for CPU+GPU compute ;)
Btw, I also tested the iGPU's bandwidth to OpenCL shared memory (part of the L3 cache) and got around 20 GB/s. This seems okay.
c) What is the best way to share data between the CPU and GPU so as to get the maximum memory bandwidth?


