iGPU memory bandwidth on IVB

Priyadarshi

Hello,

I have been testing Intel's OpenCL SDK for heterogeneous computing with the HD 2500 iGPU. I ran a few benchmarks to test the memory bandwidth of both the CPU and iGPU devices. Here are the results:

---------------------------------------------------------------------------------------------------------------------------------

1. Memory Read [Single] : All threads read from a single physical address.

CPU - 70 GB/s; iGPU - ~5 GB/s

2. Memory Read [Linear] : Threads read sequential memory addresses according to their thread ID.

CPU - 50 GB/s; iGPU - 5.8 GB/s

3. Memory Read [Uncached] : The reads are offset so that cache thrashing is maximized.

CPU - 5.8 GB/s; iGPU - 4.5 GB/s

4. Memory Write [Linear] : Threads write to sequential memory addresses.

CPU - 60 GB/s; iGPU - 1.3 GB/s

---------------------------------------------------------------------------------------------------------------------------------
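The four tests above differ only in which address each work-item touches. A minimal Python sketch of the patterns and of the bandwidth figure follows. The original benchmark code was not posted, so the function names and the stride of 4096 elements are hypothetical, chosen only for illustration:

```python
def access_index(pattern, work_item_id, stride=4096):
    """Return the element index one work-item touches for a test pattern.

    `stride` is an assumed value for the "uncached" test; the real
    benchmark's offset is not given in the post.
    """
    if pattern == "single":
        # Every work-item reads the same element.
        return 0
    if pattern == "linear":
        # Work-item i touches element i (used for both read and write tests).
        return work_item_id
    if pattern == "uncached":
        # Neighbouring work-items are far apart, so each access lands on a
        # different cache line (the buffer must be large enough for this).
        return work_item_id * stride
    raise ValueError("unknown pattern: " + pattern)


def bandwidth_gb_per_s(bytes_moved, seconds):
    """Bandwidth in GB/s (1 GB = 1e9 bytes) for a timed transfer."""
    return bytes_moved / seconds / 1e9
```

For example, moving 5.8e9 bytes in one second corresponds to the 5.8 GB/s reported for the linear read on the iGPU.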

Using the vec4 data type on the CPU gives the maximum bandwidth, which is what the optimization guide recommends too. On the GPU, however, I get the same bandwidth for all data types. I have a few questions:

a) How is the iGPU's shader core (EU) laid out? I know that it has 4 ALUs, but do they work on different threads (OpenCL threads, i.e. work-items) or on only one thread, like the VLIW4 unit in previous AMD GPUs?

b) Why is the iGPU's access to global memory crippled compared to the CPU's? OK, the CPU has big caches, but doesn't IVB have an L1, L2, L3 hierarchy too? This is nearly equal to PCIe transfer speeds; in that case I have much better options for CPU+GPU compute ;)

Btw, I also tested its bandwidth to OpenCL shared memory (part of the L3 cache) and got around 20 GB/s. This seems okay.

c) What is the best way to share data between the CPU and GPU that gives the maximum memory bandwidth?

Raghu Muthyalampalli (Intel)

a) Each EU has several threads and each thread has several SIMD units. Work item mapping to underlying hardware happens in the following order:
- SIMD channels of one EU thread (say thread0, on EU0)
- if more threads needed, spread to adjacent EUs (thread0 on EU1, EU2 etc.)
- then to additional threads on EUs (thread1 on EU0, thread1 on EU1 etc)
So it is very important to pick the correct WG size (the best thing to do is to experiment with various WG sizes or use the analysis feature of the KernelBuilder). Please see the optimization guide for more information.
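The mapping order described above can be sketched in Python as follows. The SIMD width of 8 and the count of 16 EUs are illustrative assumptions rather than confirmed hardware figures, and `place_work_item` is a hypothetical helper, not part of any Intel API:

```python
from collections import namedtuple

Placement = namedtuple("Placement", ["eu", "thread_slot", "simd_channel"])


def place_work_item(work_item_id, simd_width=8, num_eus=16):
    """Map a work-item to (EU, EU thread slot, SIMD channel).

    Order follows the description above:
      1. fill the SIMD channels of one EU thread,
      2. then move to the same thread slot on the next EU,
      3. then start the next thread slot on the first EU.
    simd_width and num_eus are assumed values for illustration.
    """
    hw_thread = work_item_id // simd_width  # which EU thread overall
    return Placement(
        eu=hw_thread % num_eus,
        thread_slot=hw_thread // num_eus,
        simd_channel=work_item_id % simd_width,
    )
```

Under these assumptions, work-items 0 to 7 share thread 0 on EU0, work-item 8 starts thread 0 on EU1, and work-item 128 is the first to land on thread 1 of EU0. A work-group size that is not a multiple of the SIMD width would leave channels idle in this model, which is one reason the WG size matters.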

b) If multiple threads access the same cache line, the accesses can be serialized on processor graphics. It is better to move the data to shared local memory for better performance. Again, see the optimization guide.

c) This depends on what you are doing. Can you share your algorithm with us? Are you trying to access the same data from the CPU and GPU, or is the data being copied back and forth between the devices?

Thanks,
Raghu

Priyadarshi

Thanks for the information, but I don't think I understood you correctly here:

a) Is 'thread' a software or a hardware construct in the terminology you are using? This is what I got from Intel's open source HD Graphics programmer's manual: a thread is an instance of a kernel program executed on an EU.

This is the general notion of a software thread, which is equivalent to a work-item in OpenCL. So when you say that an EU has several threads and each thread has multiple SIMD units, I get confused. Just as a work-item is assigned to a processing element in OpenCL, a thread should run on a (single) SIMD unit.

One level up, a compute unit in OpenCL, which takes charge of one work-group, consists of multiple processing elements or SIMD units. Since Intel reports 16 compute units through the platform API and there are 16 EUs on the IVB GPU, is an EU basically a compute unit?

I am just trying to map the OpenCL device model to Intel's HD Graphics architecture, and it would be really helpful if it were explained in the same hierarchy. It would also help to know the number of processing elements in each compute unit on the IVB GPU.

b) Okay, multiple threads accessing the same cache line are serialized, but shouldn't the cache provide a higher memory bandwidth than RAM (global memory) for the GPU? If you look at my memory bandwidth results, I accessed memory through both cached and uncached paths, yet I get nearly the same bandwidth for both. That suggests the cache is not helping when the GPU device accesses data in global memory.
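To make the expectation concrete, here is a toy direct-mapped cache model in Python. It is only an illustration of why cached and cache-thrashing access patterns would normally differ; the line size, line count, and strides are assumptions and do not describe the actual IVB cache organization:

```python
def hit_rate(addresses, num_lines=64, line_bytes=64):
    """Fraction of accesses that hit in a toy direct-mapped cache.

    num_lines and line_bytes are assumed values for illustration only.
    """
    tags = [None] * num_lines  # which memory line each cache slot holds
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % num_lines
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line  # miss: fetch the line, evicting the old one
    return hits / len(addresses)


# Consecutive 4-byte reads: 16 reads share each 64-byte line.
linear = [4 * i for i in range(4096)]
# Reads 4096 bytes apart: every read maps to the same slot with a new line.
strided = [4096 * i for i in range(4096)]

print(hit_rate(linear))   # 0.9375
print(hit_rate(strided))  # 0.0
```

In this model the linear pattern hits 15 times out of 16 while the strided pattern never hits, so a working cache should show a clear bandwidth gap between the two tests.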

c) I am working on a ray-tracing algorithm where both the CPU and GPU work on a list of triangles. One device builds the bounding volume hierarchy while the other device uses that data structure for efficient path tracing of the scene. This could improve ray-tracing performance for dynamic scenes, where the data structure needs to be updated every frame, so having two devices with shared memory should help.
