work group with 1 work item using ~100 float8 vectors?

Will the Intel HD Graphics OpenCL compiler support work groups of a single work item whose kernel operates on float8 vectors?

Example:

__kernel
__attribute__((vec_type_hint(float8), reqd_work_group_size(1,1,1)))
void my_kernel(__global const float8* const restrict in, __global float8* const restrict out)
{
  ... // lots and lots of float8 vector registers
}

The goal is to occupy as many float8 registers as possible in a single work item. The kernel I'm designing can benefit from float4 swizzling ops, and I'm assuming float8 is the narrowest width that matches the 128x8 register file found in the Ivy Bridge and Haswell architectures.

Questions:

  • Does the HD Graphics OpenCL compiler support allocating as many as 128 registers on Ivy Bridge and Haswell?
  • If this isn't supported, why not?
  • If this isn't supported, what is the best work group size to acquire the most possible registers per work item?

Thanks, I'm very impressed with the HD Graphics architecture. The EUs and sub-slices appear to have *huge* amounts of resources compared to other low-power GPUs.

 


Just to be clear, my question is whether the compiler will map a vec_type_hint'ed float8 work group of 1 work item onto an EU's 128x8 general register file.

The reason why I ask is that I have a kernel that is very SIMD and not very SIMT and would map perfectly onto an EU thread and its 128x8 register file.

Thanks, I'd really like to get an answer!

IVB and HSW have 128 256-bit registers in the GRF, so a float8 should fit perfectly in each register. I don't think the compiler imposes any restriction on how many of these available registers a program can use. Also note that there are 128 registers per thread.

I have asked the experts for more details but if you see behavior otherwise, please do let us know.

Thanks,
Raghu

Thanks Raghu, that's great news.

128 registers per EU thread is stunning, and more HD Graphics OpenCL devs should be made aware of why this is useful!

A quick napkin calculation shows that the HD5x00 series has an immense amount of resources:

  • 256 KB of shared local memory -- 4 sub-slices x 64 KB
  • 1120 KB of registers -- 4 sub-slices x 10 EUs x 7 threads x 128x8 32-bit registers
  • 320 ALUs -- issuing a max of 1 or 2 ops per clock

This is very good and is even more resources than some entry-level discrete GPUs.
