I'm just curious: is there a particular reason why the maximum work-group size on my Core i7 920 is 1024? I can't currently think of any hardware-specific limit that explains that exact number.
More concretely, I've read in "Writing Optimal OpenCL Code with the Intel OpenCL SDK" that 64-128 bytes per work-group are suggested (for kernels without barriers). However, in my case 4096 bytes per work-group (1024 work-items, each operating on a 4-byte float) give the best performance, and I'm trying to explain why. The kernel is a simple filtered backprojection, with all output voxels calculated independently.
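To make the numbers concrete, here is a minimal sketch of the arithmetic I'm using. The helper names are hypothetical, and it assumes each work-item handles exactly one float, which is how I arrive at the byte counts.

```c
#include <stddef.h>

/* Hypothetical helper: bytes handled by one work-group,
   assuming every work-item handles exactly one element. */
static size_t bytes_per_work_group(size_t work_items, size_t bytes_per_item)
{
    return work_items * bytes_per_item;
}

/* Hypothetical helper: the inverse, i.e. how many work-items a given
   per-work-group byte budget corresponds to under the same assumption. */
static size_t work_items_for_bytes(size_t bytes, size_t bytes_per_item)
{
    return bytes / bytes_per_item;
}
```

With a 4-byte float, `bytes_per_work_group(1024, sizeof(float))` is 4096, while the suggested 64-128 bytes would correspond to only 16-32 work-items under the same assumption.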
Any hints as to what could explain the better performance with 4096 bytes instead of the suggested 64-128 bytes? What factors might have an influence?