Intel Xeon Phi Coprocessor June 2013 Developer Webinar Q&A Responses

Answers for the questions raised during the June session of our Introduction to High Performance Application Development for Intel® Xeon® and Intel® Xeon Phi™ processors class have been assembled.  There were some duplicates and other questions we couldn't decipher, either because of the wording or because of implied context that was not spelled out, and a couple where we just didn't find an answer.  We tried to address the rest, which appear below:

Q: Yesterday, you indicated that we would get PDFs of the slides, but I didn't receive them.  Is there something else that I need to do?
A: No, there is nothing you need to do.  We have to prepare the presentations and run our recorded broadcast through some scrubbing to get it ready to post.  The lecture recordings, along with the presentations, will be available by next Monday, July 22nd, at http://software.intel.com/en-us/articles/intel-xeon-phi-webinar.  The Q&A responses are available with this post.

Q: What is the meaning of * on Linux OS?
A: It's a footnote marker, usually tied to the footnote: "Other names and brands may be claimed as the property of others."  In other words, it’s a disclaimer for names that may be branded by other companies.

Q: Can you have multiple Xeon Phi coprocessors on a single server?
A: Yes, it is possible to have multiple Intel® Xeon Phi™ coprocessors on a single host system.    Work with your OEM to make sure the system is configured correctly, especially if you want to use a fast interconnect such as InfiniBand*.

Q: Ideally what are basic requirements to populate Xeon Phi cards
A: Adequate power, cooling and PCI-Express* connectivity are among the more critical requirements to support Intel Xeon Phi coprocessors.  Particular systems could require BIOS changes to enable the required support of memory-mapped I/O in addresses above 4 GB.  Work with your OEM to ensure your system has adequate support, but know that some older systems may not have adequate resources to support even one Intel Xeon Phi coprocessor.

Q: Does Phi have direct access to HDD? If not, is virtual shared memory the most feasible approach to access HDD retrieved data in memory?
A: The host file system can be mounted on the coprocessor using NFS; this is what you would do when running natively on the coprocessor. In offload mode, you can either use virtual shared memory with C++, or explicitly pass data between the host and the coprocessor using data offload directives, which are available for both C++ and Fortran.

Q: Xeon Phi use Host memory only, or it has its own local memory as well?
A: The coprocessor has its own memory, at least 6 GB and nominally 8 GB, which is separate from the host memory.  Data are marshaled across from host to coprocessor and back again (if needed).  Alternatively, the host and coprocessor can each operate simultaneously on their separate memories without interfering with each other.

Q: Can the MIC be used with older Xeon proc?
A: Technically, the Intel Xeon Phi coprocessor is simply an attached coprocessor.  It runs its own software stack but the hardware DOES have some basic physical requirements for power, cooling and memory-mapped I/O address space that may not be available on an older processor system, depending on how old.  So you might be able to run with older Intel Xeon platforms but we recommend using a fairly new model.  If you have a specific Intel Xeon processor model in mind, you can ask if that is supported on the Intel Many Integrated Core Architecture forum: http://software.intel.com/en-us/forums/intel-many-integrated-core.

Q: Do you mean global memory or shared?
A: For native execution, I'm referring to the shared memory on the coprocessor. GPUs employing CUDA* make a distinction between global and shared memory, which might indicate where this question is coming from.  There's no such distinction in memory on the Intel Xeon Phi coprocessor: all system memory on the coprocessor is equally available to all HW threads on the coprocessor; individual threads on a core can share data cached locally, as is true with most host memory hierarchies.

Q: What is the max size of the input supported by MIC for native execution?
A: All Intel Xeon Phi coprocessors have at least 6 GB of system memory (most have 8), which is shared among the virtual file system, file caching, and processor memory to hold programs and data. So roughly 6 GB is the practical ceiling on input that must be resident in memory at once.  However, native execution can use NFS to establish external file connections, and through buffering the coprocessor could handle much larger data streams.
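
The buffering idea can be illustrated with a small sketch. It is written in Python only to show the pattern, it is not coprocessor code, and the function name, the 64 MB default buffer and the byte-sum "computation" are all placeholders chosen for this example.

```python
import os
import tempfile


def process_in_chunks(path, chunk_bytes=64 * 1024 * 1024):
    """Stream a file through a fixed-size buffer.

    Returns (bytes_read, byte_sum). Only one chunk is held in memory at a
    time, so the file can be much larger than the available memory.
    """
    bytes_read = 0
    byte_sum = 0
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_bytes)
            if not chunk:
                break  # end of file
            byte_sum += sum(chunk)  # stand-in for the real per-chunk work
            bytes_read += len(chunk)
    return bytes_read, byte_sum


# Demo on a small temporary file; a real input would dwarf the buffer.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\x01" * 1000)
    demo_path = handle.name
try:
    result = process_in_chunks(demo_path, chunk_bytes=256)
finally:
    os.remove(demo_path)
print(result)
```

The same structure applies when the path points at an NFS-mounted host directory: memory use is bounded by the chunk size rather than by the file size.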

Q: What's minimum memory space occupied by MIC system
A: Each coprocessor has a minimum of 6 GB of memory (most have 8), which is shared for support of the virtual file system, file caching, and processor memory to hold programs and data.  On an idle coprocessor the reserved memory footprint is under 300 MB (as reported by "top"), but the virtual file system in this generically configured MPSS instance amounts to another 400 MB.  Top running on a sample coprocessor just now with no load shows just under 6 GB free.  However, the MPSS file image itself is only 80 MB and MPSS configuration options can replace much of the virtual file system with files mapped from the host, maximizing the memory available on the coprocessor for computation.

Q: When comparing the MIC with the Xeon processor how many cores does the Xeon have? Are they all being used in parallel?
A: Current Intel Xeon systems have 16 cores (2 sockets, 8 cores/socket). The host CPUs can be used in parallel with each other, and in parallel with the cores running on the coprocessor.

Q: Are you simply comparing the MIC against the performance of a single Xeon core running serially?
A: It was once common practice to measure multi-processor scaling against a single serial processor in order to inflate the scaling number.  It is still normal to measure threaded performance against equivalent work done serially, but only to estimate the serial vs parallel fraction in order to make an Amdahl's Law-based prediction about potential scalability.  For our benchmark performance tests, we normally compare the performance on one or two coprocessors to the same work performed on a dual-socket Intel Xeon host system.
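
To make the Amdahl's Law estimate concrete, here is a minimal sketch. The timings (100 s serial, 10 s on 16 threads) are invented for illustration and are not measurements from any of our benchmarks; the function names are ours.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Predicted speedup when `parallel_fraction` of the work scales over `workers`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def parallel_fraction_from_timings(serial_time, parallel_time, workers):
    """Solve Amdahl's Law for the parallel fraction from two measured timings."""
    speedup = serial_time / parallel_time
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


# Hypothetical timings: 100 s serially, 10 s with 16 threads.
p = parallel_fraction_from_timings(100.0, 10.0, 16)
print(f"estimated parallel fraction: {p:.3f}")
print(f"predicted speedup on 240 threads: {amdahl_speedup(p, 240):.1f}x")
```

The point of the exercise is that even a small serial fraction caps the achievable speedup, which is why the serial measurement is useful as an input to the model rather than as a baseline for bragging rights.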

Q: How does the job scheduler treat the Phi? Does it count it as one single node?
A: That really depends on the job scheduler.  You'll need to double-check if the job scheduler you're using supports Intel Xeon Phi coprocessors and how they are supported.  But generally, an Intel Xeon Phi coprocessor is treated as just another node, with a unique hostname and IP address.  So when requesting access to an Intel Xeon Phi card, you'll generally get access to both the host and the coprocessor.  The Intel Xeon Phi hostname will be listed individually in the hosts lists provided by the job scheduler.

Q: The performance metric is Gflops. How about comparing the wall clock time, including copy in matrix and copyout the results?
A: Floating-point operations per second is a valid metric for comparing current algorithm performance against the expected resource limits, but our performance comparisons often include the costs for transferring data to and from the coprocessor, depending on the workload.  Transfer costs are generally included in our offload benchmarks.
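
As a rough illustration of how transfer time changes the reported number, the sketch below compares kernel-only GFLOPS with GFLOPS measured against wall-clock time for a square matrix multiply. Everything numeric here is an assumption made for the example: the 6 GB/s transfer rate, the 1.5 s kernel time and the matrix size are placeholders, not measured values.

```python
def dgemm_flops(n):
    """Floating-point operations in an n x n by n x n matrix multiply (2 * n^3)."""
    return 2.0 * n ** 3


def copy_seconds(n, bytes_per_element=8, bandwidth_bytes_per_s=6e9):
    """Time to copy A and B in and C out. The bandwidth is an assumed figure."""
    return 3 * n * n * bytes_per_element / bandwidth_bytes_per_s


def effective_gflops(n, compute_seconds, transfer_seconds):
    """GFLOPS against total wall-clock time, including data movement."""
    return dgemm_flops(n) / (compute_seconds + transfer_seconds) / 1e9


n = 8192                  # hypothetical matrix dimension
kernel_time = 1.5         # hypothetical kernel-only time in seconds
kernel_only = effective_gflops(n, kernel_time, 0.0)
end_to_end = effective_gflops(n, kernel_time, copy_seconds(n))
print(f"kernel only: {kernel_only:.0f} GFLOPS, with transfers: {end_to_end:.0f} GFLOPS")
```

Because compute grows as n^3 while transfers grow as n^2, the gap between the two figures narrows as the matrices get larger.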

Q: Once you've used the Cilk Plus constructs in order to offload complex objects, can you mix it with OpenMP and TBB in the offloaded code?
A: Yes, you can use whichever parallel method you like on the coprocessor.  You may well need to protect data access between threads with locks, and such support might come from either the OpenMP or Intel Threading Building Blocks runtime libraries, or even the pthreads library.  However, mixing such models may require additional resources (certainly for maintaining separate thread pools for each of OpenMP, Intel TBB and Intel Cilk Plus).  Moreover, heavy intermixing of the various models may involve some additional overhead and could hurt maintainability, since you have to keep the various models and their roles straight.

Q: OK, so if I understand correctly you use Cilk Plus constructs in order to offload C++ objects such as a TBB task and then in the offloaded code simply use task in a TBB context.
A: Generally I would not expect such mixing between Intel TBB and Intel Cilk Plus.  Either use _Cilk_offload to migrate a Cilk-managed kernel to the coprocessor, running parallel constructs managed by the Intel Cilk runtime, or run an explicit offload that employs Intel TBB within an explicitly offloaded kernel.  Alternatively, you may be using Virtual Shared Memory to handle complex C++ objects of an existing application that you want to migrate whole to the coprocessor.  Access to VSM is made possible by constructs using the Cilk keyword, so you could just use those Cilk constructs to declare VSM objects and to launch an offloaded computation and then use Intel TBB to parallelize the coprocessor code, but you are certainly not limited to that.

Q: With a cluster scheduler in place, how can I execute symmetric programming model?
A: There are quite a few schedulers that already support Intel Xeon Phi coprocessors.  What they typically do is provide you full access to a node plus the coprocessor card(s) attached to it.  When running under a job scheduler with Intel MPI, simply don't specify a hosts file.  Intel MPI will interact with the job scheduler and grab the allocated node/card hostnames directly.  In your submission script, you simply call your "mpirun -n 4 ./exe" command.

Q: Since yesterday, they said that IDB was going away, to be replaced by the Eclipse Plugin, is there MPI debugging support in the Eclipse plugin?
A: That's a good point.  Intel MPI does support Eclipse.  Please check our Reference Manual for more details: http://software.intel.com/en-us/articles/intel-mpi-library-documentation.  Also remember that Intel Trace Analyzer and Collector can be used for debugging Intel MPI communications flows.

Q: What are the default values used by MKL if you don't set the native environment variables?
A: If we do not set the number of threads through the environment variable, Intel MKL will use all the available cores.

Q: The MKL performance comparison between Phi and Xeon E5-2680 uses matrix size as large as 26624. Does it mean the matrix is 26624 x 26624. Does the coprocessor have enough memory to hold such a large matrix?
A: Yes, the coprocessor typically has 8 GB of memory, so a 26624 x 26624 matrix fits: in double precision it occupies about 5.3 GiB (roughly 5.67 x 10^9 bytes).
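
The arithmetic is easy to check. This short sketch computes the footprint of a single double-precision matrix of that size; it says nothing about any additional workspace a particular benchmark may allocate.

```python
def matrix_bytes(rows, cols, bytes_per_element=8):
    """Storage for a dense matrix; 8 bytes per element is double precision."""
    return rows * cols * bytes_per_element


n = 26624
size = matrix_bytes(n, n)
print(f"{size} bytes = {size / 2**30:.2f} GiB = {size / 1e9:.2f} GB")
print("fits in 8 GiB:", size < 8 * 2**30)
```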

Q: Is FFT supported thru auto offload?
A: As of this time in Intel Math Kernel Library, no.

Q: On 1 MIC card is it better to spawn 60 ranks or 240 ranks?
A: You would like (ranks * threads) to be at least 120, otherwise you can't keep the execution units busy; 240 may be better. As explained yesterday, a single thread context can't initiate a new instruction every cycle, so you need at least two per core.  However, we usually find it advantageous to use a hybrid programming model, combining MPI with threading within the ranks.  So most users spawn between 1 and 8 MPI ranks and then use threading within each MPI rank to utilize all the cores.
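
To see which hybrid layouts satisfy that rule of thumb, here is a small enumeration sketch. It assumes 60 usable cores with up to 4 hardware thread contexts each (taken from the 60 and 240 in the question) and only keeps rank counts that divide the cores evenly; the function name and defaults are ours.

```python
def hybrid_layouts(cores=60, max_ranks=8, min_per_core=2, max_per_core=4):
    """List (ranks, threads_per_rank) pairs using 2 to 4 thread contexts per core."""
    layouts = []
    for ranks in range(1, max_ranks + 1):
        if cores % ranks:
            continue  # skip rank counts that would split cores unevenly
        cores_per_rank = cores // ranks
        for per_core in range(min_per_core, max_per_core + 1):
            layouts.append((ranks, cores_per_rank * per_core))
    return layouts


for ranks, threads in hybrid_layouts():
    print(f"{ranks} rank(s) x {threads} threads = {ranks * threads} contexts")
```

Every layout it prints lands between 120 and 240 total contexts; which one performs best is something to measure on your own application.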

Q: On a Phi cluster, with a job scheduler, how can I submit or execute symmetric programming model execution?
A: Intel MPI supports all major job schedulers.  Don't specify the host names since Intel MPI will work directly with the job scheduler to get that list (which can contain both Intel Xeon hosts and Intel Xeon Phi coprocessor cards).  When running, simply include the mpirun command line in your submission script (e.g. mpirun -n 4 ./exe).  With a lot of the job schedulers you can specifically request Intel Xeon Phi coprocessor nodes.  If you've requested that, make sure to also enable the I_MPI_MIC environment variable before mpirun, and the appropriate I_MPI_MIC_{PREFIX, POSTFIX} variable if NFS sharing is available.

Q: Can I run an MPI program (like HPL) in a native mode on a Phi card?
A: Yes, MPI programs can run in native mode on an Intel Xeon Phi coprocessor.

Q: Could you have say two MPI processes offloading to two separate halves of the Xeon Phi?
A: Yes, you could have two MPI processes on the Intel Xeon Phi coprocessor.  You could also have two MPI ranks on the Intel Xeon Phi coprocessor and then each rank could have 120 threads.    We don't recommend 200 MPI ranks on the Intel Xeon Phi coprocessor.


Q: But could you have two MPI processes on the host side, offloading to the same MIC card, but each using different halves of the MIC?
A: Yes, you can.  By default, MPI will not know anything about the offload regions.  That means when you offload from two (or more) separate ranks on the host, the offloaded code will be put on the same set of cores on the Intel Xeon Phi card - as you can imagine, that's not ideal.  It means you will have to be explicit about the processor/core affinity within the offload commands.  Starting with Update 2 of the Intel Compiler, handling affinity of offload regions has gotten a lot easier using the KMP_PLACE_THREADS environment variable.  For more info, check out this article: http://software.intel.com/en-us/blogs/2013/02/15/new-kmp-place-threads-openmp-affinity-variable-in-update-2-compiler.  Eventually the PLACE features of OpenMP 4.0 will provide an industry standard way of doing such affinity splits.
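
As a sketch of what "being explicit" might look like, the snippet below computes one non-overlapping core range per host rank and formats each as a KMP_PLACE_THREADS-style value. The `<cores>c,<threads>t,<offset>o` format is our reading of the linked article and should be checked against it; the 60 usable cores and 4 threads per core are assumptions, and the function name is hypothetical.

```python
def offload_partitions(host_ranks, cores=60, threads_per_core=4):
    """Return one '<cores>c,<threads>t,<offset>o' string per host rank.

    Each rank gets an equal, disjoint block of cores; the offset is the
    index of the first core in that rank's block. Format is an assumption.
    """
    cores_per_rank = cores // host_ranks
    return [
        f"{cores_per_rank}c,{threads_per_core}t,{rank * cores_per_rank}o"
        for rank in range(host_ranks)
    ]


for rank, value in enumerate(offload_partitions(2)):
    print(f"host rank {rank}: KMP_PLACE_THREADS={value}")
```

Each host rank would then need its own value in the environment seen by its offloaded code, so that the two offload regions no longer land on the same cores.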

Q: Can I run the MPI natively on the MIC card with MKL MP LINPACK code?
A: Yes, LINPACK has been ported to the coprocessor.  Please check the Intel MKL installation path for the package, or check the Intel MKL documentation site: http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation.

Q: How can I determine the correct balance between MPI ranks and OpenMP threads without trace analysing?
A: I like the Trace Analyzer and Collector due to its visual component.  But, again, an Intel Xeon Phi coprocessor card is simply another node on a cluster.  Let me ask you this: how would you determine the load balance between MPI and OpenMP if you just have a bunch of Intel Xeon nodes?  You should be able to use the same tools.  For example, if you run 'top' on the Intel Xeon processor to see how many of the cores are being used, you can use 'top' on the Intel Xeon Phi coprocessor as well.  Not ideal but it's available.  You can also get some basic statistics about your MPI usage via the I_MPI_STATS environment variable within Intel MPI, although that won't give you info on OpenMP threads.  For that, you can use VTune™ Amplifier XE, Intel's individual host processor performance tool.

Q: Can a node access a remote Xeon Phi over the network?
A: If you mean through something like MPI, then yes: you may run an MPI application where some MPI ranks are on the Intel Xeon Phi coprocessor and some MPI ranks are on different systems.

Q: The assigned IP of the card is part of installation? or used DHCP?
A: It is part of the configuration for setting up MPSS, and is set in the config files in /etc/sysconfig/mic.

Q: How do you compile offload code and execute?  Is it similar to CPU with OpenMP compiling?
A: As mentioned earlier, just the presence of offload pragmas or directives is sufficient to cause current Intel compilers to automatically generate code for both the host and the coprocessor, and link them as appropriate.  Then, when you run your program so compiled, the program will use the linked library code to detect the presence of the Intel Xeon Phi coprocessor on the host where the code is running, and will optionally employ the coprocessor to compute the result.  It does not even require the addition of a compiler command switch (like -openmp) in order to provide this capability: all you need is code enhanced with offload or VSM pragmas and an Intel compiler, and the coprocessor code will be available.

Q: How does the offload computations routed to MIC? Is there any service or daemon running?
A: Communications with the coprocessor(s) operate over the PCI-Express bus that they share.  Within the software stack there are specific layers of support that aid the communications process. SCIF (Symmetric Communications Interface) provides the low level process spawn and data transfer facilities.  The Coprocessor Object Interface (COI) provides support to language runtimes for marshaling data across to the coprocessor and back again and has a service visible on the coprocessor as coi_daemon.  VTune Amplifier XE also employs a server visible on the coprocessor when it is running: it's currently called "sep_mic_server3.10."

Q: Which C++ compilers supports offload directives?
A: The Intel C++ compiler supports offload pragmas and directives.  The new OpenMP 4.0 standard, an industry standard for parallel computation, specifies constructs for offload to coprocessor devices (target) and explicit vectorization (simd).  The Intel Compiler has been adopting OpenMP 4 changes in recent releases, with more to come.  We expect more compilers to adopt these OpenMP standards rather than support offloading through the specific directives Intel started with.  Intel has never been shy to add extensions where needed to support new features, and then refactor into standards like OpenMP.

Q: `offloads` notation is much similar to OpenACC. Does Intel Compiler support OpenACC on Xeon Phi too?
A: Intel participates in the OpenMP standards committee.  The OpenMP architecture review board is made of numerous industry and academic participants.  A goal of the OpenMP board was to adopt the best of the various offload techniques into their new OpenMP 4 specifications.   The Intel offload specifications as well as OpenACC specifications were considered by OpenMP board members.  The result of their work is the OpenMP 4 specifications, which were just ratified in 2013.  Intel is adopting and implementing the OpenMP 4 standards in addition to its current Intel offload specifications.   Intel supports the OpenMP standard specifications.

Q: Does the Xeon Phi support Open CL programming language?
A: OpenCL is available for Intel Xeon Phi coprocessors.  Here's a reference: http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor.

Q: Does OpenCL programming model take full advantage of the 512 registers, does it perform the appropriate vectorization?
A: Yes, Intel OpenCL supports vectorization for the 512 bit wide registers on the Intel Xeon Phi coprocessor.

Q: Intel compiler already support OpenMP 4.0?
A: The offload feature (and some, but not all, of the other features) will be supported in the next compiler version, currently in beta testing. The SIMD feature is already supported in the current compiler.

Q: Xeon Phi support both RHEL & SUSE Linux?
A: Either RHEL or SUSE Linux can be run on the Intel Xeon host processor. The Intel Manycore Platform Software Stack (Intel MPSS) that runs on the coprocessor itself is also Linux-based.

Q: For CoProcessor stripped Linux OS: is the Kernel limited to one/two cores of Xeon-Phi?
A: Not sure what you mean by "coprocessor stripped Linux OS" but Intel MPSS operates on all the HW threads of the coprocessor, as can easily be seen through common utilities running on the coprocessor.  Just try running "top" or "ps" on the coprocessor to see for yourself.

Q: Is Intel planning to provide models of MIC coprocessors for full system simulation in the Simics with backtracking search through the state space?
A: There are no plans at this time.

Q: Even if you compile with -align array64byte, do you have to still have to have assumed aligned for your do loops?
A: Even if you explicitly align a particular array, you still need to specify at the site where you use it that the array is aligned (the compiler is VERY cautious about such assumptions and aliasing may lead it to limit or avoid vectorization in order to be safe).  -align array64byte controls how the arrays are allocated. In the program unit where the array is allocated, the compiler should therefore know about the alignment. But in other program units, it may not, and you may need to tell it via a directive. Use -vec-report6 to see whether a loop was vectorized using aligned or unaligned data accesses.

Q: Could you use the vector intrinsics from C in Fortran by using the ISO binding in the newest fortran?
A: Yes, I believe that you can, but I haven't tried personally.

Q: Would Auto-Vectorized loop scale (variable number of cycles) in run-time?
A: Auto-vectorized loops can scale well provided that there is sufficient parallel computational work and a sufficient number of loop iterations (typically, at least two or three times the vector width).
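
For a sense of scale, the sketch below turns "two or three times the vector width" into iteration counts for the coprocessor's 512-bit registers. The helper names and the choice of a factor of 2 as the default threshold are ours, and the result is only a rule-of-thumb check, not what the compiler's cost model actually computes.

```python
VECTOR_BITS = 512  # register width on the Intel Xeon Phi coprocessor


def lanes(element_bits):
    """Number of elements that fit in one vector register."""
    return VECTOR_BITS // element_bits


def enough_iterations(trip_count, element_bits, factor=2):
    """Rule of thumb: at least `factor` full vectors' worth of iterations."""
    return trip_count >= factor * lanes(element_bits)


for name, bits in (("float", 32), ("double", 64)):
    width = lanes(bits)
    print(f"{name}: {width} lanes, so aim for at least {2 * width} to {3 * width} iterations")
```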

Q: Could one ask the compiler to vectorize only a small section of code such as a single loop? Are there disadvantages to vectorizing every possible loop in a large code instead of just hotspots?
A: In general, it’s fine to leave the vectorizer enabled for all code. That’s not the same as vectorizing every possible loop: the compiler will vectorize only if it thinks this is likely to result in better performance. In “cold” regions of the code, it’s unlikely to make much difference to performance whether or not a loop is vectorized.
You can enable or disable vectorization, like many other optimizations, at the source file level using command line switches. You can set the general optimization level at the function level within a source file, by using pragmas. The vectorizer is enabled at -O2 and disabled at -O1.
There is no simple way to turn on vectorization for a single loop, but if vectorization is enabled, there is a pragma that tries to enforce vectorization for an individual loop (#pragma simd, to be used with caution).  When vectorization is enabled (at -O2 and above), #pragma novector may be used to disable vectorization for an individual loop.

Q: So when your offloaded code is executing with the cilk+ model and encounters a data structure that doesn't exist on the MIC side, it issues an MMU page fault and then goes and fetches it from the host? Is this not a bottle neck? So is the cilk+ model slower than doing things explicitly?
A: Actually, codes using Virtual Shared Memory for data marshaling are guaranteed that those data will be marshaled across at the boundaries of the _Cilk_offload, using dirty-page tracking on each side to record which pages will need to be marshaled across to the other side.  Consequently, this works only for synchronous uses of _Cilk_offload (no _Cilk_spawn _Cilk_offload), and insofar as dirtied data are not pre-marshaled to the other side but instead wait until the _Cilk_offload call, there is the potential for some lost data transfer latency hiding (opportunity cost).  However, use of Virtual Shared Memory already anticipates a cost in data density: marshaled fields with contents other than arrays of POD (that is, contents other than those needed on the target for computation) statistically reduce the effective bandwidth of data transferred between the nodes.  The reason to choose VSM is not convenience but necessity.

Q: What integration/support for GDB will be there on windows environment?
A: With Intel Composer XE 2013 SP1 there will be a Microsoft* Visual Studio* 2012 debugger integration for Intel Xeon Phi coprocessors. This solution uses GDB for remote debugging the coprocessor but is transparently integrated and easy to use. This debugger integration also supports the Language Extensions for Offload (LEO) without the need for additional configuration of the debugger, and will support multiple coprocessor cards. It works for both C/C++ and Fortran.

Q: Could you clarify/compare the difference between MPI model (Phi as local node) and the Virtual Shared Memory, in terms of data transfer speed?
A: We don't have any public performance results but I recommend running the Intel MPI Benchmarks (open source kernels that test anything from basic Ping-Pong to collective operation communication: www.intel.com/go/imb) on a machine that has Intel Xeon Phi coprocessors.  Make sure to download the latest Intel MPI (4.1 Update 1) since that has default optimizations when running on the coprocessor.

Q: Does the NFS share become a performance bottleneck?
A: It can if you open too many files from the many available threads on the coprocessor.  Lots of things have to be balanced when fanning out to this many HW threads, such as reducing lock hold times, because the cost of contention multiplies with so many potential contenders.  Likewise, too many threads contending for bandwidth over a single PCI Express channel to copy separate files can become another bottleneck.  A better approach might be to employ some functional parallelism: focus a few cores on I/O to get the files into the coprocessor memory space, where the remaining cores can process them.

For more complete information about compiler optimizations, see our Optimization Notice.