Detecting Memory Bandwidth Saturation in Threaded Applications (PDF 23
Avoiding and Identifying False Sharing Among Threads (PDF 218KB)
A toolkit that presents 6 steps to increase performance through vectorization in your application.
Since Version 2013 Update 4, the VTune™ Amplifier performance profiler has
This paper details the implementation of out-of-order queues, an OpenCL™ construct that allows independent kernels to execute simultaneously whenever possible and thus keep all GPU resources fully utilized.