Finding the right fit for your application on Intel® Xeon and Intel® Xeon Phi™ processors

Not all applications are created equal.  Some are chomping at the bit to harvest as much parallelism as a target platform can provide.  Those may be good candidates for running on an Intel® Xeon Phi™ coprocessor.  Other applications are scalar (not vectorized) and sequential (not threaded).  They won't even make full use of an Intel Xeon processor, much less an Intel Xeon Phi coprocessor.  Before moving to a highly parallel platform, the developer-tuner needs to expose enough parallelism to hit the limits of the Intel Xeon platform.  Once the demand for threads, vectors or memory bandwidth exceeds what an Intel Xeon processor can deliver, an Intel Xeon Phi coprocessor has the potential to provide further performance improvements.
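To make "scalar and sequential" versus "vectorized and threaded" concrete, here is a minimal sketch in C.  The kernel and the function names are hypothetical, and the use of OpenMP directives is my assumption for illustration; the same idea applies to other threading and vectorization models.  The first function runs on one thread and leaves vectorization to whatever the compiler can work out on its own.  The second tells the compiler that the iterations are independent, so they can be spread across threads and vector lanes.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: a single thread walks the whole array.  Whether the loop
   gets vectorized is left entirely to the compiler's own analysis. */
void scale_add_serial(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Same arithmetic with the parallelism exposed.  'restrict' promises that
   x and y do not overlap, and the OpenMP directive (an assumption of this
   sketch) asks for the iterations to be divided among threads and each
   thread's share to be vectorized.  Built without OpenMP support, the
   pragma is ignored and this behaves like the serial version. */
void scale_add_parallel(size_t n, float a,
                        const float *restrict x, float *restrict y)
{
    assert(x != y);  /* guard the no-overlap promise made by 'restrict' */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result; only the amount of parallelism visible to the compiler and runtime differs.  Comparing their run times on an Intel Xeon processor is one rough way to see whether the code has any appetite for more threads and wider vectors.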

Assessing whether an application holds promise for showing compelling performance with a given platform, or currently exposes enough parallelism to make ready use of it, is a challenge that faces many developers today.  An application may have potential, but that potential may not yet be fully realized.  And the capability of platforms to harvest that potential may change over time: an application may not be a good fit for the first implementation in a processor family, but a later generation of the same family may be able to offer compelling performance for the same code.

At ISC13, I'm giving a theater presentation and chalk talk seeking to address the following questions that we tend to have as application developers and tuners:

  • When would I need an Intel Xeon Phi coprocessor vs. an Intel Xeon processor?
  • How do I tell whether my application is a good fit for an Intel Xeon Phi coprocessor?
  • What should my expectations be for the speedup I can achieve?
  • What do I need to do to make the application sing on "extreme hardware"?
  • How do I develop a good intuition about this?

Here's a link to the ISC13 theater presentation and chalk talk.  Check back here for updates to that content.

I've also worked with my colleague Chao Mei to prepare a lab that works through these issues.  It goes step by step, with makefiles, reference solutions, VTune project files and even an answer key, so you can make use of it as a beginner.  The document for the lab is here, and the tar file with files used by the lab is here.  I strongly believe that we need to make our developer and analysis tools more powerful, effective and intuitive if we're to help motivate developers to do the hard work of parallelizing their applications, regardless of platform.

Have fun exposing and harvesting extreme parallelism!  You might also check out a related "right for me" blog.

I hope to see you at the theater presentation, Wed. June 19 at 1:20pm in the Intel booth at ISC, with a chalk talk to follow. 

CJ Newburn

Performance and Feature Architect, Intel


Comments



Great question. Sometimes it's easy to figure out, through code inspection, how many iterations there are at each of the loop levels that get collapsed. But the number of iterations may not be obvious, or even determinable, through static analysis. So we're looking into the most effective ways to determine the loop trip count empirically. Among the available options are Advisor, which uses user-inserted instrumentation, and the compiler, which can yield average counts with profile-guided optimization. We welcome feedback on folks' experience with the best ways to reveal those iteration counts, and on how much that helps in evaluating which target is the best fit and in tuning code.
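As a rough illustration of measuring trip counts empirically, here is a minimal hand-rolled sketch in C.  It is not how Advisor or profile-guided optimization work internally, and the struct, the function names and the ragged-row kernel are all hypothetical.  The inner loop bound comes from run-time data, so the trip count can't be read off the source; simple counters record how often each loop level executes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters for a two-level loop nest (not any tool's API). */
typedef struct {
    unsigned long outer_iters;  /* times the outer loop body ran          */
    unsigned long inner_iters;  /* times the inner loop body ran, summed  */
                                /* over all outer iterations              */
} trip_counts;

/* Sums a ragged array: row i holds row_len[i] consecutive values in vals.
   The inner trip count depends on data, so it is counted as the code runs. */
double sum_ragged(int nrows, const int *row_len, const double *vals,
                  trip_counts *tc)
{
    assert(nrows >= 0 && tc != NULL);
    double sum = 0.0;
    size_t k = 0;               /* running index into vals */
    tc->outer_iters = 0;
    tc->inner_iters = 0;
    for (int i = 0; i < nrows; i++) {
        tc->outer_iters++;
        for (int j = 0; j < row_len[i]; j++) {
            tc->inner_iters++;
            sum += vals[k++];
        }
    }
    return sum;
}

/* Average inner trip count per outer iteration (0 if the nest never ran). */
double avg_inner_trip_count(const trip_counts *tc)
{
    if (tc->outer_iters == 0)
        return 0.0;
    return (double)tc->inner_iters / (double)tc->outer_iters;
}
```

After a representative run, inner_iters gives the total amount of work the nest offers, and the average shows how much of it sits in the inner level.  A short average inner trip count suggests the inner loop alone won't fill wide vectors or many threads.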