With Larrabee being canned as a discrete GPU, I was wondering whether it makes sense to let the CPU actually take on the role of the GPU and high-throughput computing device.
Obviously power consumption is a big issue, but since the AVX encoding is specified to scale up to 1024-bit registers, SNB could execute such wide operations on its existing 256-bit execution units in four cycles (throughput). Since it's a single instruction, it only has to be fetched, decoded, and scheduled once, which takes a lot less power than four separate 256-bit instructions. Basically you get the benefit of in-order execution within an out-of-order architecture.
The only other thing missing before you could get rid of the IGP (and replace it with generic cores) is support for gather/scatter instructions. Since SNB already has two 128-bit load units, it seems possible to me to achieve a throughput of one 256-bit gather every cycle, or 1024-bit every four cycles. In my experience (as lead SwiftShader developer) this makes software texture sampling perfectly feasible, while also offering massive benefits in all other high-throughput tasks.
Basically you'd get Larrabee in a CPU socket, without compromising any single-threaded or scalar performance!