Is there any way to get the number of cache misses in my program without using VTune?
Maybe a compiler option like -profile-something?
Well, if you can find a tool to program the PMU counters (I don't have one; I used SEP while it was being distributed), then the _rdpmc intrinsic will let you query them from inside your program. Recall that only two counters can be active at once on KNC. Note also that while we can get decent L1 hit data for KNC, there is no reliable way to get L2 hit data. See http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding for more information.
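For what it's worth, here is a rough sketch of the "read the counter before and after" pattern described above. It is not a complete solution: it assumes some other tool has already programmed counter 0 or 1 with the miss event you care about and that user-mode counter reads are enabled, since the read instruction will otherwise fault. The names counter_reader, count_events, read_pmc, fake_reader, and fake_work are made up for illustration, and the USE_RDPMC macro is just a hypothetical switch so the sketch also builds on machines without PMU access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader type: returns the current value of PMU counter idx. */
typedef uint64_t (*counter_reader)(int idx);

#if defined(USE_RDPMC)
/* Real reader. Assumes the Intel compiler's _rdpmc intrinsic and that the
 * counter was programmed beforehand by some other tool. */
#include <immintrin.h>
static uint64_t read_pmc(int idx) { return (uint64_t)_rdpmc(idx); }
#endif

/* Stand-in reader for dry runs on machines where the PMU is not accessible. */
static uint64_t fake_counter = 0;
static uint64_t fake_reader(int idx) { (void)idx; return fake_counter; }

/* Stand-in workload: pretends to cause *arg events on the fake counter. */
static void fake_work(void *arg) { fake_counter += *(uint64_t *)arg; }

/* Run work(arg) and return how far counter idx advanced while it ran. */
static uint64_t count_events(counter_reader rd, int idx,
                             void (*work)(void *), void *arg)
{
    assert(rd != NULL && work != NULL);
    uint64_t start = rd(idx);
    work(arg);
    uint64_t end = rd(idx);
    return end - start;   /* unsigned subtraction of the two samples */
}
```

With USE_RDPMC defined you would pass read_pmc instead of fake_reader, and your own kernel instead of fake_work. Keep the measured region large enough that the cost of the two reads is negligible.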