This may seem a little out of context, but here is a thought:
Texas Instruments DSPs and many other performance-oriented (embedded) CPU designs allow the programmer to dedicate part of the CPU cache to be addressable as "local memory", which is much faster than global memory.
Memory bandwidth becomes a serious problem as the core count increases (especially for linear algebra, which depends heavily on memory bandwidth).
This "local memory" is also present on GPUs. Being able to configure 12 MB of the L3 cache on a Xeon CPU, out of a total of 24 MB, as "local memory" would feel like milk and honey as far as the eye can see, especially considering that in most of today's high-end CPU core designs this local memory is less than 128 KB.
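For comparison, this is roughly what can be done in software today to approximate local memory: stage small tiles into buffers and hope the cache keeps them resident. The sketch below is a minimal illustration in C, not a tuned kernel. The names `TILE` and `matmul_tiled` are made up for this example, and the tile size is an assumption rather than a value measured on any particular CPU. With real local memory the staging copies would land in storage that is guaranteed not to be evicted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 32  /* hypothetical tile edge: two 32x32 double tiles = 16 KB */

/* Computes C += A * B for n x n row-major matrices; n must be a multiple of TILE. */
void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    /* Stand-ins for "local memory": small buffers we can only hope stay cached. */
    double a_loc[TILE][TILE];
    double b_loc[TILE][TILE];

    assert(n % TILE == 0);

    for (size_t i0 = 0; i0 < n; i0 += TILE)
        for (size_t j0 = 0; j0 < n; j0 += TILE)
            for (size_t k0 = 0; k0 < n; k0 += TILE) {
                /* Stage one tile of A and one tile of B into the local buffers. */
                for (size_t r = 0; r < TILE; r++) {
                    memcpy(a_loc[r], &A[(i0 + r) * n + k0], sizeof a_loc[r]);
                    memcpy(b_loc[r], &B[(k0 + r) * n + j0], sizeof b_loc[r]);
                }
                /* Multiply the staged tiles and accumulate into C. */
                for (size_t i = 0; i < TILE; i++)
                    for (size_t k = 0; k < TILE; k++) {
                        double a = a_loc[i][k];
                        for (size_t j = 0; j < TILE; j++)
                            C[(i0 + i) * n + j0 + j] += a * b_loc[k][j];
                    }
            }
}
```

Nothing here stops the hardware from evicting `a_loc` or `b_loc` when other cores put pressure on the shared cache, and that is exactly the gap a configurable local memory would close.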
The feature request to Intel is therefore:
Please consider allowing the programmer to use a user-definable amount of L1, L2 or L3 cache as "local memory", and thus make the hardware platform reconfigurable for various types of loads. This would also enable the introduction of AVX v2 (doubling the register width), whose performance would scale nearly linearly with the number of cores, with much less dependence on global RAM.
OK, I know :) But we can dream, and maybe you can give it some thought.