Intel® Xeon Phi™ Coprocessor September 2013 Developer Webinar Q&A Responses

The third session of our High Performance Application Development for Intel® Xeon® and Intel® Xeon Phi™ processors class was held during the last week of September, and generated yet another list of questions.  We tried to answer all we could, though some were difficult to answer because of lost context ("what was that thing with slide 25?") or other issues.  Hopefully the answer you're waiting for is in the list below:

Final Questions and Answers

Q: Are these slides available?
A:
A version of them is available now at the Intel Developer Zone site: http://software.intel.com/mic-developer, under the Training tab and Webinars link.  There you'll find a link to an article archiving the presentations and slide decks that were collected during the June 2013 session of this training: http://software.intel.com/en-us/articles/intel-xeon-phi-webinar.  These presentations are very similar to the ones presented in September.

Unfortunately though, we have learned that technical difficulties in the recording process resulted in the loss of the September session recordings.  The presentation decks are mostly the same as those shared in June, but there were a few changes that we may provide as updates if we can get good recordings to pair them with.

Q: Would really like to practice concepts introduced in yesterday’s workshops before I begin to forget. Could you make those slides available soon? Thanks.
A:
We apologize for the time it has taken to get these questions and answers published.  Answers have to be researched, written and then verified before publication, and it is difficult to be both fast and thorough.  Hopefully by now you've independently discovered http://software.intel.com/mic-developer and the Training tab that resides in the middle of that page; clicking on that will expose a set of links, including one marked "Webinars."  Clicking on that will expose another link that shows an article where the slides and presentations can be accessed.  These materials are from an earlier version of these classes, but should be close enough to jog your memory.  Even easier, here's that last link: http://software.intel.com/en-us/articles/intel-xeon-phi-webinar.

Q: So, 4 hardware threads actually won't be able to run in parallel?
A:
In this first generation of the Intel® Many Integrated Core (Intel® MIC) Architecture there are up to 61 cores and 4 threads per core.  Each thread has its own set of registers and its own instruction pointer, but the threads on a core share that core's execution units.  These are in-order execution units, but both scalar and vector instructions run through pipeline stages.  The instructions of the hardware threads are overlapped and run concurrently, each making use of the various pipeline stages in its turn.

Q: Is there any advantage to placing threads on a processor (on the co-processor) nearer the PCI interface in order to reduce latency and communication costs on the ring on the co-processor?  I am thinking of a signal processing application where one of those processors may be handling the communication across IB and other mechanisms are handled on-board the co-processor.
A:
Whenever we've posed questions to the architects about topology-based strategies for improving performance, they have discouraged us.  They have taken great care to distribute resources around the ring to avoid congestion and maximize resource utilization.  When you say "processor" I'm not sure whether you mean the host processor or one of the coprocessor cores.
But more fundamentally, as far as IB traffic goes, it's an RDMA protocol that is going to drop data into a memory buffer, so traffic should be going straight from the PCI interface to one of the memory interfaces.  The cores are best used to drive the vector units and avoid serial execution in the management of a sequential protocol.  So their relation to the PCI interface seems at best remote, at worst obstructive.

Q: In real life (HPC) the matrices' size is big (hundreds of thousands of rows and columns for dense matrices and millions for sparse matrices) and it will not fit in 8 GB. What is your advice for such big data when using Intel Xeon Phi coprocessors?
A:
This is not a new problem.  There are many large problems that have been partitioned across an array of nodes and there are various paging schemes for dividing data among those nodes and communicating between them.  These have their various costs, but paging can be very painful and slow.  Whether we limit the available memory to 8 GB or 16 GB (available on the 7120P) or even double that, there will always be problems that don't fit.  Our advice is to design programs to take advantage of larger address spaces when available, but that can adapt to smaller address spaces, albeit with a cost in performance because of increased communications between nodes.
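
To put rough numbers on this, here is a back-of-the-envelope sketch in Python.  The helper names are ours and purely illustrative.  It assumes dense double-precision storage (8 bytes per element), treats "8 GB" as 8 GiB, and ignores the memory that the coprocessor OS, libraries and your own code also need, so real budgets will be smaller.

```python
import math

BYTES_PER_DOUBLE = 8  # assumption: dense double-precision storage
GIB = 1024 ** 3

def dense_matrix_bytes(rows, cols):
    """Memory needed to hold a dense rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE

def rows_per_block(cols, budget_bytes):
    """How many complete rows fit in the given memory budget."""
    return budget_bytes // (cols * BYTES_PER_DOUBLE)

def blocks_needed(rows, cols, budget_bytes):
    """Number of row blocks needed to pass the whole matrix through the budget."""
    per_block = rows_per_block(cols, budget_bytes)
    if per_block == 0:
        raise ValueError("a single row does not fit in the budget")
    return math.ceil(rows / per_block)

n = 100_000                           # "hundreds of thousands" of rows and columns
total = dense_matrix_bytes(n, n)      # 80,000,000,000 bytes
print(f"matrix needs {total / GIB:.1f} GiB")
print(f"row blocks for an 8 GiB budget: {blocks_needed(n, n, 8 * GIB)}")
```

Even the smallest case in the question is roughly ten times larger than one card's memory, which is why the partitioning across nodes (or across blocks on one node) has to be part of the program design.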

Q: Did you use vector operations’ optimization (AVX 2) on Intel Xeon (E5-2697) in your benchmarks?
A:
The three benchmark graphs shown in the Intel Math Kernel Library class used baseline data derived from the E5-2680 and E5-2697 processors, both of which support Intel Advanced Vector Extensions (Intel AVX), not Intel AVX2.

Q: target(mic) always seems to pick mic0 for me, leaving mic1 idle.  I can force mic1 with target(mic:1).  Is there a way to balance the load using target(mic)?
A:
The developer is responsible for handling load balancing between coprocessors. The offload runtime does not take care of this automatically. The APIs in offload.h are helpful for this. For example, one of the APIs returns the total number of available coprocessors in the system. You could then use this number to divide up the work between the coprocessors.  You can also manage multiple coprocessors using MPI.
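
As a minimal sketch of the "divide up the work" step, the Python below splits an iteration range into one contiguous chunk per coprocessor.  The function name and the device count are illustrative assumptions.  In real offload code the count would come from the offload runtime (we believe the call in offload.h is _Offload_number_of_devices(), but check your compiler documentation), and each chunk would be sent to its own target(mic:N).

```python
def split_range(n_items, n_devices):
    """Return one (start, length) chunk per device, as evenly sized as possible."""
    if n_devices <= 0:
        raise ValueError("need at least one device")
    base, extra = divmod(n_items, n_devices)
    chunks = []
    start = 0
    for dev in range(n_devices):
        # The first `extra` devices take one additional item each.
        length = base + (1 if dev < extra else 0)
        chunks.append((start, length))
        start += length
    return chunks

# Hypothetical system with two coprocessors, mic0 and mic1.
for dev, (start, length) in enumerate(split_range(10, 2)):
    print(f"mic:{dev} gets start={start}, length={length}")
```

An even split like this assumes the coprocessors are equally fast and equally idle; if they are not, smaller chunks handed out dynamically would balance better.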

Q: Do all of these executables and libraries that have been referenced occupy memory on the co-processor?
A:
Yes. For a native application this can be problematic, especially if you have very large libraries. The coprocessor OS is a flavor of Linux. You can remedy this problem by mounting a share directly to the coprocessor and housing your shared libraries there. For offload it is not as big of an issue.

Q: If we want to run on a specific MIC card, we can specify it in the target. If the target MIC doesn't exist, does the process throw an error or get terminated with an exception?
A:
The developer has some control over the behavior in this case. We provide a set of offload APIs that you can use to determine the number of available targets and based on that information you can offload computations accordingly. If an offload fails you can decide to terminate the application or to continue execution on the host. Offload statements can include status clauses which indicate the success or failure of a given offload code section.

Q: Does a[m:n] mean array a from location m to location n or does it mean starting at location m and continuing for n more locations?
A:
It is the latter of those, and a little bit more.  The syntax is [start:length:stride] with the latter two parameters optional.
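
The difference from a start:end convention is easy to see by comparison.  The Python sketch below uses a helper of our own (section_indices is not part of any Intel API) to list the indices that a [start:length:stride] section selects, and contrasts that with Python's own slice, which happens to use the other reading of a[m:n].

```python
def section_indices(start, length, stride=1):
    """Indices selected by an array section written [start:length:stride]."""
    return [start + i * stride for i in range(length)]

a = list(range(100, 120))   # a[0] = 100, a[1] = 101, ...

# Array section a[2:5]: start at index 2 and take 5 elements.
section = [a[i] for i in section_indices(2, 5)]

# Python slice a[2:5]: start at index 2 and stop before index 5 (3 elements).
python_slice = a[2:5]

print(section)                    # five elements, indices 2..6
print(python_slice)               # three elements, indices 2..4
print(section_indices(0, 4, 3))   # every third index, four of them
```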

Q: Are you planning to support non-Intel C/C++ compilers (e.g. gcc)?
A:
Intel encourages compiler writers to enable the Intel® Xeon Phi™ Coprocessor, providing architectural details and optimization guides specifically for their benefit.  Intel does not support third-party compilers directly, but it does include a copy of gcc as part of the Intel MPSS distribution (nominally at /usr/linux-k1om-4.7/bin in Intel MPSS 2.x environments).  This is a basic compiler for rebuilding the kernel, and so does not have enhancements to support, for example, the vector instructions.

Q: Will the Xeon Phi support Java? The FAQ on Intel's site says "Not yet." Is there a timeline for this?
A:
No timeline yet. Typically we'll update the FAQ or post an announcement to the user forum when the status changes.

Q: Are there any plans to add OpenACC support to the Intel compilers to target the Phi co-processors?
A:
We work with the industry and academic OpenMP community.  We were pleased that the OpenMP* committee completed specifications addressing the issue of program section offload.  We feel the OpenMP specifications, reviewed and ratified by its many members, provide the best solution for our developer community in conforming to an industry standard.  Intel is working to support this recent OpenMP specification change and we feel this will meet developer needs.   At SC'12 a third-party vendor announced plans to support OpenACC on Intel Xeon Phi coprocessors, but we've heard little about it since.

Q: What about OpenCL as an Offload programming model?
A:
OpenCL is available and supported on Intel Xeon Phi coprocessors.  For more information, take a look at this:  http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor.

Q: Is there a way to get annotated listings?
A:
Annotated listings?  Like from the compiler?  You can get all kinds of verbosity out of the compilers, including heavily annotated assembly listings, vectorization and optimization reports.
Q: Thanks.  I am thinking of *source* listings not assembly...   I'm thinking of a listing with the vectorization/parallelization comments commingled with the source code in a single listing (like is available on Cray compilers).
A:
Are you referring to so-called "Loopmark" listings?  The answer is no. However, we are working to improve our opt/vec reports to make them more readable and this is one of the features we are considering. I'm glad to hear you ask for it as it strengthens arguments in favor of this! Would you be so kind as to submit this request to our user forum or to Intel Premier Support?

Q: How does virtual shared memory work with multiple coprocessors and one host in the same program?
A:
There should be nothing in the semantics of the Virtual Shared Memory protocol preventing its operation when using multiple coprocessors.   Both offload forms can target specific coprocessors, providing the impetus for VSM page forwarding.  I have not experimented with such a configuration, nor can I think of any benchmarks that might provide such an answer, though, so this may be a frustrating conclusion to your question.

Q: I build software with gcc, using MKL (without xeon phis). If a user's machine has xeon phi will my software automatically take advantage of it?

A:
With a current version of Intel Math Kernel Library on a host that has an Intel Xeon Phi coprocessor, all you need to do to use the coprocessor is set an environment variable, MKL_MIC_ENABLE=1, to enable Automatic Offload (AO).  No code changes or even a build cycle are required.  This probably won't give you all the performance that Intel MKL on the coprocessor can provide for any particular application, but any performance you gain should come with very little effort.  AO applies only to a subset of Intel MKL functions, and only when the data volumes are large.  For more details, see http://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf.
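
To illustrate that this is purely an environment setting, here is a small Python sketch of a launcher that adds MKL_MIC_ENABLE=1 to the environment of an unmodified program.  The function name is ours, and the child process is only a stand-in that echoes the variable back.  In practice the command would be your existing MKL-linked executable.

```python
import os
import subprocess
import sys

def run_with_automatic_offload(command):
    """Run `command` with MKL_MIC_ENABLE=1 added to a copy of the environment."""
    env = dict(os.environ, MKL_MIC_ENABLE="1")
    return subprocess.run(command, env=env, capture_output=True, text=True)

# Stand-in for an MKL-linked program: a child process that prints the variable.
child = [sys.executable, "-c",
         "import os; print(os.environ.get('MKL_MIC_ENABLE'))"]
result = run_with_automatic_offload(child)
print("child saw MKL_MIC_ENABLE =", result.stdout.strip())
```

Setting the variable in your shell or job script before launching the program has the same effect; the launcher only makes the point that the program itself is untouched.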

Q: Will MKL automatically use all coprocessors available?
A:
Using Intel Math Kernel Library's "Automatic Offload" mechanism, its runtime system will take advantage of any Intel brand coprocessors available, and schedule work to take advantage of available resources.

Q: Most systems these days have two CPUs per node.  How do things compare in that case?
A:
Is this question in the context of the Intel Math Kernel Library (Intel MKL) performance graphs?  All the performance graphs in the Intel MKL session today already compare a single coprocessor to a pair of host-side CPUs.  The details of these comparisons are compressed in the hyper-fine text that papers the bottom half of those graphs.  If that is the question, the answer is that the comparison is exactly the one those graphs show.

Q: CNR = ?
A:
CNR stands for Conditional Numerical Reproducibility.  For more details, see http://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr

Q: For fortran how do we enable MKL_MIC_ENABLE as a function call?  I get a compiler error.
A:
The runtime library has a Fortran function, mkl_mic_enable() (Fortran include file: mkl.fi) which should enable coprocessor access from Fortran.  If you are having such technical problems you can always turn to the Intel Developer Zone forums, including one dedicated to issues on the Intel Xeon Phi coprocessor (http://software.intel.com/en-us/forums/intel-many-integrated-core), or one for Fortran issues on Linux and MacOS (http://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x) and yet another dedicated to issues with Intel Math Kernel Library (http://software.intel.com/en-us/forums/intel-math-kernel-library).

Q: Are these performance statistics reporting on Math library functions?  And what percentage of the Phi coproc do they actually use in performing the benchmark?  (Is the Phi completely utilized / "busy", or does it still have further capacity?)
A:
Yes, performance benchmarks on some Intel Math Kernel Library functions are listed here: http://software.intel.com/en-us/intel-mkl#pid-12768-1295. Utilization of the Intel Xeon Phi coprocessor during performance benchmarking depends on the functions to be benchmarked. For example, if we define efficiency to be the percentage of the theoretical peak performance that a function can utilize, then LINPACK running on a single Intel Xeon Phi coprocessor can achieve ~75% efficiency. But FFTs running on a single coprocessor card will have a much lower efficiency, because they are memory-bound operations.
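
The efficiency definition above is simple arithmetic, sketched here in Python.  The hardware figures are our assumption for illustration only (a 61-core part at 1.238 GHz doing 16 double-precision flops per cycle per core); the answer above does not say which coprocessor model the ~75% figure refers to.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x flops per cycle per core."""
    return cores * ghz * flops_per_cycle

def efficiency(achieved_gflops, peak):
    """Fraction of theoretical peak that a benchmark actually achieves."""
    return achieved_gflops / peak

# Assumed example part, not taken from the benchmark data.
peak = peak_gflops(cores=61, ghz=1.238, flops_per_cycle=16)
achieved_at_75 = 0.75 * peak

print(f"assumed peak: {peak:.1f} GFLOPS")
print(f"75% of that:  {achieved_at_75:.1f} GFLOPS")
```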

Q: Hello, I'd like to know if triangular solvers in BLAS2 are supported for parallel implementation on Phi co-processors?
A:
Yes, the triangular solver, TRSM, has been parallelized.

Q: The matrix size is in terms of dimensions of the matrix or megabytes?
A:
The matrix sizes presented in the Intel Math Kernel Library class are specified by matrix dimension.

Q: Is root access needed to install the extra MKL components later?
A:
If you installed with supervisory access initially, then yes, you’ll need the same access to install extra Intel Math Kernel Library components later. Alternatively, you can install Intel MKL under ordinary user access, which will allow extra component installs at a later time without any special rights.
Q: The issue is that by making the installer optionally install parts by default, it makes it likely that an overworked sysadmin will accidentally leave out needed components for the users of a shared compute resource.  Can the "skip installing some components" be a pro-active choice, rather than the default?  Is the savings in space/complexity by installing only parts of MKL really that significant?
A:
The point of "skipping installing some components" is to save space, as not every customer needs every component.  Since this is a new feature, customers should take a little care while installing and explicitly select the components they need.  Otherwise only the default components will be installed.

Q: We do not see any activity on the MIC using micsmc with MKL sequential with our dgemm and other test programs.  Will sequential be supported by MKL AO in the future?
A:
The main purpose of Intel's Many Integrated Core architecture is to provide parallel components to handle large numeric computations, so the basic assumption is that users want threading.  Intel Math Kernel Library sequential by design does not use Automatic Offload; instead you'd normally use Intel MKL Sequential to provide math kernel functions within individual threads that have already been invoked under some parallel construct, where you don't want additional parallelization. Please let us know why AO support for sequential Intel MKL is important to you. We may have better solutions.

Q: Do I need to tune MKL or will it always determine the best configuration? If so, what happens if I have other programs running simultaneously? What if I have a MIC that is intended to be dedicated to another function? Is there a way to exclude its use from MKL?
A:
Intel Math Kernel Library has several Automatic Offload controls to manage the offload devices to be used as well as to synchronize coprocessor access between Intel MKL AO and other offloads being handled by the Intel Compiler.  You can restrict Intel MKL use at the process level by using the environment variable that enables it, discussed elsewhere in this Q&A.  Also, see the Intel MKL documentation for details.

Q: Does the Library Link Line Advisor show what DLLs need to be distributed with an application?
A:
The Link Line Advisor only shows what shared libraries you need when building your application to resolve external function linkages. To decide which shared libraries are needed to distribute with your application, check the contents of the redist\<intel64|ia32>\mkl directory.

Q: Comparison with Xeon - is it with Sandybridge?
A:
Yes, the Intel Xeon processor comparison is using the Intel microarchitecture codename Sandy Bridge in these foils.

Q: In the example on slide 16, what does "-n" refer to?  The number of MPI processes started up on the one MIC that is specified?  Also, can the job be launched from the MIC?
Q: I have the answer to my first question.  Two processes are started on the MIC.
A:
As you discerned from the discussion that followed your question, the -n argument specifies the number of MPI processes to launch.  As slide 16 suggests, the MPI program, which in this example has already been compiled for native execution and copied to the coprocessor, can be run on the coprocessor by logging into the coprocessor and executing it.  However, the "job" (running two copies of the MPI program in two MPI processes) must be initiated from a node that has a copy of mpirun, which is not currently available on the coprocessor.

Q: Is it possible to run pthread code directly on mic to use all the cores (without mpi)?
A:
Yes, Intel offers a full selection of parallelization libraries, including pthreads and OpenMP* as well as Intel Threading Building Blocks and Intel Cilk Plus, all available to utilize all available cores.  This doesn't cover communications to and from the coprocessor.  You can utilize some low level libraries supplied by Intel and used for example by the offload support in the runtime system.  Or you can roll your own, including hooks into Intel Manycore Platform Software Stack, the Linux kernel running on the coprocessor, whose open source availability is shared in another answer in this collection.

Q: Do users have to have root access to the coprocessor in order to use it effectively?  If so, that will not bode well in secure environments, e.g., at DoD sites.
A:
No.  Some supervisory access may be needed to install the software and set up the infrastructure, but beyond that setup, the user should not need to know root passwords.  Instead, that setup can create standard ssh keys and construct user home directories accessible to the coprocessor that enable ordinary user access.  This was shown in one of the early examples in the architecture introduction lecture, using "ssh mic0 top": once ssh-keygen has established credentials, the user can run on the coprocessor without providing any special access credentials or passwords.

Q: How can *users* who do not have root access "install" missing components?  It is not users who can do this but rather system administrators, right?
A:
It is in the nature of modern operating systems to have access controls that restrict normal users to a subset of the full capabilities, for the usual safety and security concerns.  The same is true of the coprocessor.  As an ordinary user, you can no more easily write to /lib64 in the coprocessor's virtual file system than you could to the /lib64 directory on the host.  However, user accounts can be established on the coprocessor to enable ordinary user access in native mode, and the configurations set up for offload access permit host-side executions that employ any coprocessors without requiring special access.

Q: Does KNC OS micro-kernel support MSR/CPUID kernel modules?  Can we install any utilities on KNC OS image if we want to run some related code on KNC card?
A:
You can look for yourself.  Intel Manycore Platform Software Stack is available as an open-source distribution.  For details, see http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss.  Those interested in building the kernel and flavoring it how they choose should be aware that Intel MPSS is migrating to the Yocto Linux build environment.  More details on this are available here:  http://software.intel.com/en-us/articles/intelr-mpss-transition-to-yocto-faq.

Q: Since there is no paging device on the co-processor will the application crash if it runs out of memory?
Q: My question was about a valid allocation that runs out of virtual memory because there is no paging support
A:
Yes, it will. For applications running on the coprocessor only, the application will seg fault. For an offloaded application, the offload section will seg fault. The programmer can decide either to terminate the application entirely or to continue execution on the host. While it is true there is no paging device on the coprocessor, that doesn't mean there's no paging support, just no page caching.  Available physical pages are mapped to satisfy virtual memory allocation demands on the coprocessor.
Q: If the host doesn't access dirty pages in a shared region, will there be copies sent to the host for those pages?
A:
VSM paging is dirty-page driven rather than demand driven.  The transfers occur at the point of offload/return, not at the point of data demand.  Therefore, all the dirty pages should be copied rather than just the ones the host (or coprocessor) might use in the future.  If you plan to use coprocessor memory to hold changes that need not be communicated back to the host, it would probably be more efficient to use locally allocated memory rather than Virtual Shared Memory.
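
The difference between dirty-page-driven and demand-driven transfer can be shown with a toy model.  The Python class below is entirely hypothetical and has nothing to do with the real runtime's data structures.  It only mimics the rule described above: every page written on the coprocessor is copied at the return from offload, whether or not the host ever reads it.

```python
class SharedRegion:
    """Toy model of a shared region synchronized by dirty pages (illustrative only)."""

    def __init__(self, n_pages):
        self.coprocessor = [0] * n_pages   # coprocessor-side copy of each page
        self.host = [0] * n_pages          # host-side copy of each page
        self.dirty = set()                 # pages written since the last sync

    def write_on_coprocessor(self, page, value):
        self.coprocessor[page] = value
        self.dirty.add(page)

    def return_from_offload(self):
        """Copy every dirty page to the host and return how many pages moved."""
        moved = len(self.dirty)
        for page in self.dirty:
            self.host[page] = self.coprocessor[page]
        self.dirty.clear()
        return moved

region = SharedRegion(n_pages=8)
for page in range(6):
    region.write_on_coprocessor(page, page + 1)

moved = region.return_from_offload()
# Suppose the host only ever reads page 0: all six dirty pages moved anyway.
print(f"pages transferred: {moved}, host reads page 0 -> {region.host[0]}")
```

In this model, scratch data kept outside the shared region would never enter the dirty set, which is the saving the answer points to when it recommends locally allocated memory.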

For more information about compiler optimizations, see our Optimization Notice.