I have written a paper that explains programming for the Intel Xeon Phi coprocessor. The part that may surprise you is this: the paper focuses simply on parallel programming. Understanding how to restructure an application to expose more parallelism is critically important for getting the best performance on any device (processors, GPUs, or coprocessors). Advice for successful parallel programming can be summarized as “Program with lots of threads that use vectors with your preferred programming languages and parallelism models.” This restructuring will generally yield benefits on most general-purpose computing systems as well, a bonus that comes from the emphasis on common programming languages, models, and tools that span both processors and coprocessors. I refer to this bonus as the dual-transforming-tuning advantage, an advantage you would lose by switching to a CUDA- or OpenCL-based solution.
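To make "lots of threads that use vectors" concrete, here is a minimal sketch. It assumes C with OpenMP 4.0 as the parallelism model, which is only one of many possible choices; the function name `saxpy` and its parameters are illustrative and are not taken from the paper.

```c
#include <stddef.h>

/* Hypothetical kernel for illustration: y[i] = a * x[i] + y[i].
 * "restrict" promises the compiler that x and y do not overlap,
 * which makes the loop easier to vectorize. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* "parallel for" divides the iterations among threads;
     * "simd" asks the compiler to vectorize each thread's share.
     * Without OpenMP enabled, the pragma is ignored and the loop
     * simply runs serially with the same result. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The same source can be compiled for a processor or a coprocessor, so the effort spent exposing threads and vectors is not tied to one device.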
Intel Xeon Phi coprocessors are designed to extend the reach of applications that have demonstrated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems and fully exploit available processor vector capabilities or memory bandwidth. For such applications, the Intel Xeon Phi coprocessors offer additional power-efficient scaling, vector support, and local memory bandwidth, while maintaining the programmability and support associated with Intel Xeon processors.
In my paper, I explain more fully the implications of such high levels of parallelism and the work needed to expose that parallelism, work that benefits your application on processors as well.
I hope you find it useful.
In addition to this paper, there is a succinct document that explains the same concepts in a flowchart format and links to additional resources: http://software.intel.com/en-us/articles/is-the-intel-xeon-phi-coprocessor-right-for-me