Intel® Xeon Phi™ Coprocessor February Developer Webinar Q&A Responses

Response to our February sessions of the Intel® Xeon® and Xeon Phi™ Introduction to High Performance Application Development for Multicore and Manycore live webinar was gratifying and overwhelming, but we have finally worked through all of your questions.  Some of the questions required context that was lost with the transcript, and some were only partially formed, of special interest, or duplicates.  We gathered all the questions of general interest from the webinar and farmed them out to our experts for more complete answers.  We then assembled that list and sorted it by category, coalescing the questions from both days.  We hope this answers all of your questions.

Q: Why doesn't this presentation cover Intel TBB-related optimizations?
A: There are only so many topics we can cover in six hours, as we warned at the beginning.  If you have more questions, you can send them to the forum.  And we will be adjusting the content for future sessions of this webinar.
Q: Will these presentations be made available to attendees?
A: Both the presentations and video captures with voice-overs have been posted to http://software.intel.com/en-us/articles/intel-xeon-phi-training-m-core 
Q: Are you planning to publish those books [on parallel programming and programming for the Intel Xeon Phi coprocessor] in a digital edition such as Kindle ebooks?
A: We anticipate that digital editions will be available soon. We've been given hints it might be as early as April.
Q: If I want to bind a thread to a particular MIC core, where do I find the documentation on how to do that?
A: Techniques and API calls for doing this are documented in the compiler reference manuals.  For OpenMP, explicit binding of threads to cores can be done via the environment variable KMP_AFFINITY, or the new KMP_PLACE_THREADS (preferred). That's true for both Intel Xeon processors and Intel Xeon Phi coprocessors.  Check out the compiler documentation for details on OpenMP binding.  For users of pthreads, the usual pthreads API is available.
Q: Can we get the questions and answers also distributed?
A: A selection of the questions and complete answers will be posted to the Intel Xeon Phi coprocessor forum, http://software.intel.com/en-us/forums/ (and this is it).
Q: Will there be any possibility to run a test drive of some of our applications on MIC?
A: There are several options we might suggest:
Option 1: Purchase or loan a system/coprocessor from an approved OEM: See our where-to-buy guide on Intel.com/XeonPhi and the list of available systems on http://www.intel.com/content/dam/www/public/us/en/documents/sales-briefs/xeon-phi-coprocessors-where-to-buy.pdf
Option 2: Request NSF allocation through TACC or NICS for academia. If you or your customer is eligible, visit https://www.xsede.org/ to submit an allocation application.
Option 3: Other options: Colfax: http://www.colfax-intl.com/xeonphi/ and SGI: www.sgi.com/phi
Q: I am a student and I am interested in obtaining a Phi card for my workstation to gain expertise.  Is there a way for an individual to purchase one direct from Intel (also, is there a student pricing option)?  Also, will the education version of Parallel Studio XE 2013 work with Phi?
A: Intel Xeon Phi coprocessor cards are available in systems from various OEMs, however not all workstations have the support required to enable the coprocessor.  Please check with your OEM about requirements and options.
Q: Regarding profiling, is there a version of gprof that supports Phi?
A: Though the Intel compiler generates instrumentation for gprof, such instrumentation is not available for Intel MIC Architecture.  We recommend the use of VTune Amplifier XE.
Q: Can some instructions occur conditionally (like warp divergence), and if so how would that be shown in Amplifier XE?
A: We don't speak of warps of threads in the parlance of the Intel Xeon Phi coprocessor, but we do have SIMD instructions operating on vector registers that have a similar structure and conditional execution via masking.  The current VTune Amplifier has a metric called Vectorization Intensity, the ratio of the number of VPU elements active to the number of VPU operations executed, which shows how fully populated the vector operations were.  For example, the ideal vectorization intensity for the current coprocessor executing single-precision floating point instructions is 16, meaning that every VPU operation generated results for all 16 slots in the vector registers.  Learn more about useful events at http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
Q: Is the VTune GUI available for Linux?
A: Yes, the Linux distribution of VTune Amplifier XE 2013 has both a GUI app (amplxe-gui) and a command line app (amplxe-cl).
Q: When can we expect support for an encryption instruction set in Phi, something similar to AES-NI in standard Xeon?
A: Intel does not divulge future architecture plans.
Q: So the L2 cache can be seen as a NUMA like shared cache or as a distributed cache?
A: The L2 unified cache is a distributed cache, with a slice associated with each core and a distributed tag directory to check requests and minimize snoop traffic on the ring.
Q: Any info about CAS (compare-and-swap) atomicity for 32-bit/64-bit/128-bit?
Will it lead to a cache-line hot spot if all the hardware threads execute, say:
   barrier();
   CAS();

A: The Intel Many Integrated Core (Intel MIC) Architecture supports CMPXCHG, CMPXCHG8B, but not CMPXCHG16B. Normally the entire cache line would need to be read/written, which could produce a contentious cache line as suggested by the question.
Q: Do the MIC cores and Xeon cores have independent hi-res clocks? If all are sampling the timer register, is that a bottleneck?
A: Each has its own "hi-res" clock, called the Time Stamp Counter or TSC, which can be used for timing/sequencing, though the TSCs of the host and coprocessor run at different frequencies.  Reading the TSC takes no longer than a register transfer.  However, the usual timing facilities available in the Linux API use a relatively lo-res system clock that also incurs a system call, and we have seen cases where aggressive reference to the system clock, especially if done in individual worker threads, can be the source of both delay and thread serialization.
Q: Load balancing cost seems high for heterogeneous models - how much could be gained, in a relative sense?  In other words, why not run on coprocessor only?
A:  The host processor still offers a lot of compute power that would be wasted were it to be left idle in all cases.  But it really does depend on the cases.  Different programs present different load-balancing challenges.  That's why Intel offers the flexibility of multiple models to be able to select the model that works best over a broader selection of applications.  In some cases, host+offload might be best (say a code that has a lot of branchiness in places but also has embedded pure compute kernels that could take advantage of the coprocessor's features--the branchy code would run best on the host, then offload the kernels to the coprocessor).  Or there's a whole lot of some particular operation that you want to spread across a cluster, you might find a balance point in distributing MPI ranks that matches the rough amount of work done by the corresponding processors.  Or there may be programs so demanding of both computation and bandwidth that the effort to link host and coprocessor only slows things down.  All are possible in the gamut of suitable problems.
Q: Can we combine use of SSE on the host processor (for 8 bit and 16 bit operations) and still be able to use Phi for 32 bit operations?
If I have the following:
      int8_t a[32], b[32], c[32];
      for (i=0; i < 32; i++)
           c[i] = a[i] + b[i];
will the above for loop be able to take advantage of the Phi co-processor vector operations?

A: The vector instructions currently support only 32-bit and (a few) 64-bit integer operations, not 8- or 16-bit ones (unlike SSE on Xeon). See the instruction manual at http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf
Q: The floating point instructions are not bit-for-bit the same as a Xeon CPU ; what is the maximum difference between a Xeon Phi and a regular Xeon?  If I multiply two numbers on a Xeon and compare the result to the multiplication on the Phi, what is the maximum difference?
A: The individual instructions are IEEE-compliant on both platforms. Use of FMA on coprocessor but separate multiply and add on the host might lead to a difference in the last bit. If optimizations are performed differently on the coprocessor and host, this might lead to similar small differences, likewise for math functions. But all of these can be amplified if there are large cancellations in your application. If your application contains reductions that run in parallel, variations due to this are likely to be greater than any arising directly from differences between coprocessor and host. 
Q: generic question about the memory speed of the Xeon Phi: in one of the first slides, a memory speed of 5.5 GB/s was mentioned; this is almost the same as I get when going from host-to-GPU with a Tesla M2070 ; how does this compare?
A: The slide had a typographical error.  The correct answer is that the memory interface supports 5.5 GT/s (Giga Transactions/ Second).  With 8 memory controllers each having 2 32-bit channels, the total bandwidth available from the memory is just over 327 GB/s.  The limiting factor is ring performance so a practical bandwidth limit may be closer to 200 GB/s.
Q: So is there a limit of 2 GB on the size of entities allocated by _mm_malloc? That would make it less useful.
A: No, allocation of large objects is supported.
Q: In which Phi's memory is the stack allocated ? GDDR5  or memory associated to each processor ?
A: All memory directly on the coprocessor is supplied from the GDDR5 memories, including the stacks for each hardware thread.
Q: Coprocessor memory copy: can I copy a sparse array with regular structure?
A: As long as the regular structure for sparse arrays contains no pointers, you should be fine.  The coprocessor does have vector gather-scatter to handle the sparse matrices in compressed form.
Q: Is it possible to memcopy from one Phi to another without using the system memory, using virtual addressing?
A: No, in the virtual shared memory model, each coprocessor is synchronized with the host memory, there is no direct synchronization that bypasses the host.  However, Intel MPI may sometimes use direct communication between coprocessors for short messages.
Q: Is there a Xeon Phi "pop count" instruction to count the number of set bits in a word?
A: Yes, there is a scalar POPCNT instruction. See http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf, page 632.
Q: Question:  single-source for multi or many-core is a good idea, but what about architecture specific tuning e.g. loop-level modifications for vectorisation, cache blocking, and tuning for memory subsystem e.g. NUMA, prefetching - is it possible to have a single 'optimal' source for both multi- and many-core targets?
A: There are specializations that you can use (different environment variables, host-only and Intel Xeon Phi-only code sections) to customize execution at runtime, but the broader application you try to support, the hairier that code is likely to get.  The hooks are there, so how pretty it all looks will depend on you.
Q: Is the assembly code for the MIC the same as for the Xeon (or for the old P54 core) ?
A: No, they are not the same though certainly similar.  Many of the basic features are similar but the VPU instruction codings currently are different on the coprocessor.
Q: For the typical compressed row storage format for sparse matrices in a matrix vector product is there a scatter/gather function.
A: Intel MIC architecture supports vector scatter and gather instructions and vector gather prefetch instructions.
Q: Do all the cores on the MIC have to be executing the same instruction on each cycle?
A: Each thread is independent of the others in each core, though given the advantages of cache sharing, it likely will improve performance to have the local threads operate in conjunction on the same data, divide-and-conquer or some such.
Q: Does Phi support integer SIMD operations?
A: Yes.  Intel Xeon Phi coprocessors have 32 512-bit vector registers, and the vector instruction set includes integer SIMD operations (on 32-bit, and a few 64-bit, elements).
Q: Do the SIMD integer instructions include logical shifts (left or right)?
A: Yes, there are vector logical shifts left and right and vector arithmetic shift right instructions, which operate on 32-bit integer elements of the vector.
Q: Does the Xeon Phi support the SSE instruction set?
A: No, the various Intel Streaming SIMD Extensions (Intel SSE) are not supported on Intel MIC Architecture.  Neither is the MMX instruction set.
Q: What is the bandwidth of L1 and L2 cache? 1 cache-line (64-bytes)/cycle? Half-duplex/full-duplex?
A: Cache line size is 64 bytes and the L1 data cache has a load-to-use latency of 1 cycle but vector instructions have different latencies than scalar instructions.  The L2 unified cache has a raw latency of 11 clocks but may be longer depending on where on the ring the hit occurs.
Q: Can you run a VM on a phi coprocessor?
A: Intel Xeon Phi coprocessors support a Virtual Machine Manager using the direct assignment model, where the VMM directly assigns the coprocessor device to a particular VM.
Q: Does only the Intel compiler 2013 support Xeon Phi?
A: Intel Composer XE 2013 (both C/C++ and Fortran) supports Intel Xeon Phi coprocessors. There is a gcc compiler that generates code for part of the instruction set, but it is of limited capability, intended primarily for building the coprocessor operating system.
Q: What is the difference between Xeon HT threads and Phi threads?
A: Both Intel Hyper-Threading Technology and the hardware threading on Intel Xeon Phi coprocessors share common execution units.   There are significant differences in the implementations though.  More of the thread context is replicated per thread in the coprocessor.  Also, the current generation Intel Xeon Phi coprocessors have in order execution units while current generation Intel Xeon processors have out of order execution.  They also have different memory subsystems.  The differences in latency due to the memory subsystems as well as the differences in scheduling instructions within the core all vary the efficiency.    Generally speaking we find that tuning software to run effectively on Intel Xeon Phi coprocessors improves performance on Intel Xeon platforms as well (this is generally because the tuned software makes better use of SIMD instructions on both platforms).
Q: Will loops with mixed data types be vectorized?
A: Possibly.  The Intel MIC architecture supports vector scatter-gather, and a new BKM just went up on the website describing how to write C/C++ code that the compiler will recognize to take advantage of such instructions.  So there are cases where Array-of-Structures organized data can be "cherry-picked" to form vectors using vector gather and scatter instructions to improve performance.  However that may require touching a lot of cache lines for each vector, so better still for maximizing performance to have Structures-of-Arrays so that the vectors will be compact in memory and can be fetched in a single cache cycle.
Q: Is virtual shared memory 32-bit or 64-bit? How do we map memory between the host (> 32 GB) and the MIC (< 8 GB)?
A: Both host and coprocessors support 64-bit wide data access.  For Virtual Shared Memory, not all of the host memory is mapped, and not even all of the coprocessor memory is so mapped; rather, a window is established where named variables are allocated when they are declared with the _Cilk_shared attribute.  Those variables will be allocated in identical places within this virtual memory window so that references between them match.
Q: When using virtual shared memory, is there a copy of the data in the MIC memory (8 GB) that is kept in sync, or do the pointers refer to memory on the host?
A: The data is kept in sync on host and coprocessor, each with its own copy, on a page-by-page basis.  The sync happens at the offload statements.
Q: In terms of performance, will there be degradation associated with using Virtual Shared Memory?  I.e., if I were to code an algorithm once using the coprocessor offload method and a second time using VSM, will the performance be different?
A: As with many things Intel Xeon Phi, it depends on the algorithm you're trying to run.  And in particular to the data complexity that needs to be passed to the coprocessor.  VSM basically works as a dirty page handler: when host or coprocessor are active, writes to the VSM-attributed variables are tracked on a page-by-page basis.  Whenever an offload starts or stops, the dirty pages are exchanged.  So which wins, VSM or explicit data offload, will depend on whether it takes more work to copy the dirty pages implicitly or copy just the arrays or array sections explicitly.  There are cases where either can be the winner, however a careful use of explicit offload data management will generally do better than letting VSM handle it, particularly where moving large chunks of data is involved.  It will, of course, require more work.
Q: Regarding L2: Can there be multiple copies of the same data in different tiles?
A: The coherency protocols follow the usual MESI rules: S stands for Shared, and multiple caches can contain the same data in read-only Shared state, but as soon as any of the readers tries to write the line, it's marked as Modified and the other shared copies are sent invalidation notices that the data they hold are no longer current.
Q: Is there compiler option to just tell icc to align all local variables to 64 byte boundaries?
A: Not currently. There can be more limitations on C than on Fortran, due to requirements of gcc compatibility. That would be a good feature request for Intel Premier Support...
Q: What's the alignment default?
A: If unspecified, data alignment reverts to the gcc default of 4-byte alignment.  This can be overridden by various alignment options, depending on the application.
Q: How does automatic parallelization work on Xeon Phi?  Does it show better performance?
A: Automatic parallelization by the compiler using the -parallel switch is a lot less powerful than explicit parallel programming using OpenMP, which should be preferred. However, automatic offload of functions such as DGEMM from the Intel Math Kernel Library (MKL) works well and gives better performance than running on the host for large matrices.
Q: Can a cuda programming model be used on xeon phi?
A: No, CUDA is not supported on the Intel Xeon Phi coprocessor. We support standard languages.  CUDA is a proprietary language/model from NVIDIA. Update: we plan to support the OpenMP 4.0 standard that is currently being finalized. The 13.1 compiler contains support for the proposed TARGET feature as described in Technical Report 1, and this will be updated as necessary
Q: Do you have a CUDA_VISIBLE_DEVICES like variable to select MICs with the scheduler?
A: There is nothing quite like CUDA_VISIBLE_DEVICES in the support for Intel Xeon Phi coprocessors.  The CUDA environment variable acts as a mask to select from the available devices within the scheduler that handles dispatch of computation. Support for the Intel Xeon Phi coprocessor provides the ability to explicitly select from a set of available devices, but there is no underlying work dispatch scheduler so no ability to schedule work for distribution across multiple devices.  For that we would use MPI.
Q: There is little documentation or example material from Intel on the use of #pragma simd to vectorize outer loops. Is there a reason for this, e.g., portability or some other Intel concern?
A: We have a number of array notation examples at: http://cilkplus.org/tutorial-array-notation.  We will look into adding #pragma simd examples. 
Q: How do you compile mic arch using gcc?
A: gcc does not currently support compiling general user applications for Intel MIC architecture. There is a limited version of gcc that is used for the OS.
Q: Does offload do memory management (copy to card and copy back to main memory) automatically?
A: Yes, but with conditions.  If you use an offload pragma or directive but don't specify modifiers on data transfer, all named arrays and scalars in lexical scope will be automatically copied over and back; the IN, OUT and NOCOPY modifiers can limit the data transferred at each offload.  With Virtual Shared Memory, only the data that have been declared _Cilk_shared and have been modified by host or target will be marked for copy at the next offload/return, though Virtual Shared Memory copies at the granularity of a page at a time.
Q: What happens in an offload section of code if you do not specify IN/OUT/INOUT etc for data that is accessed in the offloaded section of code?
A: All named arrays and scalars in lexical scope will be copied over and back.
Q: Are the offloaded-data restrictions fundamental, or will they be removed during future development?
A: It depends which restrictions you are thinking of. The restriction that data offloaded by pragma must be bitwise copyable (no embedded pointers) is unlikely to be removed in the foreseeable future. That restriction does not apply to implicit offload using keywords (virtual shared memory) in C/C++.
Q: Does the compiler give any output regarding what is being copied back and forth?
A: You can enable reporting by setting the OFFLOAD_REPORT environment variable: 1 selects reporting of the time taken, 2 adds the amount of data transferred, and 3 adds additional details.
Q: OpenMP pragmas do not usefully "understand" C++ object lifetimes. Do  the MIC pragmas understand C++ object lifetimes?
A: While OpenMP has limitations in handling the C++ object model, lifetimes are simply an aspect of scope, and OpenMP takes advantage of scope to predetermine sharing.  OpenMP* on the Intel Xeon Phi coprocessors follows the same OpenMP specifications used on standard workstation systems.  Intel Cilk Plus offers a simple, easy-to-use threading model that is knowledgeable about C++ constructs; you may find it more suitable than OpenMP given your concerns.  Intel Threading Building Blocks, of course, is also available on Intel Xeon Phi coprocessor based systems.
Q: Is it possible to compile code for JUST the MIC? (so no host-version will be created)
A: Yes, the -mmic switch in the Intel compilers requests just the cross-compilation for the coprocessor, and is one way native applications are compiled (another is the limited gcc compiler, used for building the OS and various support libraries).
Q: In #pragma simd, why is it not possible to declare an array private, e.g., #pragma simd private(x) where we have double x[N]?  Is support planned for private arrays inside simd blocks in the future?
A: Good question. It's technically more difficult, but being worked on.  Update: This has been implemented in the latest compiler version, 13.1, now available for download from the Intel Registration Center as Intel Composer XE 2013 update 2.
Q: Does MIC support Unified Parallel C?
A: Intel does not itself provide a UPC implementation on either Intel Xeon processors or Intel MIC Architecture.
Q: Can the coprocessor access data on the host file system?  Or does the data have to be transferred to the host memory first and then copied to the coprocessor for further processing?
A: You can mount a host or external file system on the coprocessor using NFS, then read from and write to it from there.  Nothing is mounted by default, but there is documentation provided on how you can configure such mounts.
Q: At this time I have all my binaries and libraries exported to the Intel Phi using NFS, but I would like to know if other kinds of file systems can be used (like Lustre or GPFS).
A: It is possible to mount an external Lustre file system so that it is visible on the coprocessor.  For details, see http://software.intel.com/en-us/articles/configuring-intel-xeon-phi-coprocessors-inside-a-cluster
Q: Is there a way to enforce alignment for Fortran allocatable arrays?
A: Yes. The !DIR$ ATTRIBUTES ALIGN  directive can be used with allocatable arrays. Allocatable arrays are also aligned by the -align array64byte switch.
Q: I am curious: will Intel Cilk-style syntax be available in Fortran in the future?
A: We are constantly looking at what makes the best sense for support of Intel Xeon Phi coprocessors in standard languages and propose features accordingly to the standards bodies.  We do not have any information we can share with you at this time in regard to Cilk-style syntax in Fortran.
Q: Is the CoArray Fortran standard supported under any of the programming model described in this presentation?
A: CAF is in development for Intel Xeon Phi coprocessors.  
Q: Is there a way to access individual lanes of vector registers from Fortran/C (via directives, for example)?
A: In C, you can use intrinsics with masks to address individual lanes of the vector data types. (See the compiler user guide). There are also shuffle intrinsics. There is no equivalent for Fortran.
Q: Is virtual shared memory supported for Fortran?
A: Not at this time.
Q: Can I use it as a Linux coprocessor to host general applications (say, in the future), or is it for now only for OpenMP, Intel TBB, etc.?
A: It will run general applications, albeit slowly, and it misses some library support (e.g., X Windows), but it really is designed as a special-purpose coprocessor to offload numerical computation or to serve as an MPI node in a cluster.  It will run single-threaded or low thread-count code, but not very efficiently.
Q: Will Intel IPP support offloading to Xeon Phi?
A: The Intel Integrated Performance Primitives library currently does not take advantage of available Intel Xeon Phi coprocessors. Intel IPP has a broad range of primitives, and we are looking for input from customers to help us understand which of those primitives are of most importance to them. So please, let us know what Intel IPP functions you are interested in having run on the Intel Xeon Phi coprocessor.
Q: Does OpenCV with IPP need changes to utilize the 512-bit vector unit?
A: OpenCV is working on the Intel Xeon Phi coprocessor here at Intel. It is compiled for the offload model so workloads can migrate from coprocessor to the host and back. But there is little vectorization at this point so performance is not near the speed of the host. But because both host and coprocessor are working, overall throughput can be better with the coprocessor present. The key thing is that the Intel Xeon Phi coprocessor is able to run all the OpenCV algorithms and now the challenge is to optimize the important ones with compiler directives.
The Intel SSE instructions present in OpenCV are not present on Intel MIC architecture. Intel IPP is not supported on the coprocessor so that option is out. Algorithms with floating point should benefit from the Intel compiler.
Q: Is there a Java option for coding?
A: Not yet.
Q: With CAO, can the user control the number of threads and their placement; and if so, how?
A: Yes. Users can control number of threads and thread affinity by setting necessary OpenMP environment-variables on the coprocessor. These can be set on the host side and be passed to the coprocessor side at runtime. Users can also call Intel MKL support functions (for example, mkl_set_num_threads) inside an offload region.
Q: Can a developer change the default offload limits such that they can force the MKL to offload always?
A: This was answered at the end of the presentation. In the case of CAO (using pragmas) there are not size limits placed on offloading. In the case of AO there is a size threshold, and it cannot be changed.
Q: Will a developer be able to write code for the host only, but link against the Xeon Phi supported MKL to get a performance gain?  Basically, can a host-application developer completely ignore the coprocessor and only let the MKL library worry about it?
A: This is the case that automatic offload is for. But automatic offload covers only a small set of Intel MKL functions today: GEMM, TRMM, TRSM, SYMM, LU, QR, CHOLESKY. For all other Intel MKL functions, users need to offload them using pragmas. 
Q: When you say MKL 11.0, how can I find the version of MKL I am using?
A: Intel MKL provides API calls for checking version numbers: mkl_get_version() and mkl_get_version_string().
Q: Can several applications which use AO coexist peacefully on one host? Will there be any harmful overlap of computations?
A: If these AO functions are dealing with different data then there isn't overlap in the computation. However, there may be an oversubscription problem, where multiple AO function calls compete for the resources on the coprocessor. This is an area we are working on right now.
Q: What are the link libraries for MKL automatic offload?
A: No special libraries are needed for Intel MKL automatic offload. Applications are built the same way they are built for running on CPU. For compiler assisted offload and native execution the situation is different. Please use the online “Intel MKL Link Line Advisor”  (http://software.intel.com/sites/products/mkl/) to get recommendations on what libraries are needed and link line options.
Q: Can I run several MKL routines (each not big enough for the whole Xeon Phi) in parallel without oversubscription?
A: Yes, it is possible. You can set mkl_num_threads for each of the Intel MKL routines, and you need to carefully set thread affinity for each Intel MKL call so that they use different cores on the coprocessor in parallel. Refer to the Intel Composer XE documentation for information on how to set thread affinity using OpenMP environment variables or service functions.
Q: Will MKL automatically offload all operations to the MIC if it is present, or do we need to include MKL calls within an !$offload region?
A: Automatic offload is only available for a few BLAS level-3 functions and a few LAPACK functions: GEMM, TRMM, TRSM, SYMM, LU, QR, and Cholesky. For all other Intel MKL calls they need to be called inside an offload region.
Q: Is MKL tuned for multiple MICs per node in a cluster?
A: Some Intel MKL functions are tuned to use multiple coprocessors per node in a cluster. One benchmark that can benefit from this is HPL LINPACK. Please contact us if you are interested in more information about this.
Q: How many CPUs on the Phi were used for the LU Factorization example?
A: For the LU example, there are 61 cores (4 threads each) on the Intel Xeon Phi coprocessor. 
Q: Is the performance with Phi the same when the multiplied matrix ends up back on the host?
A: The Intel MKL performance charts we show are for native execution on the coprocessor. That is, matrices are not copied back and forth between the host CPU and the coprocessor.
Q: [Regarding an MKL performance graph] Impressive graph, but how were the E5 2680s configured? HyperThreading? TurboBoost?
A: Hyper-threading and TurboBoost are turned off on the E5 processor. This is our typical recommendation when benchmarking MKL linear algebra functions
Q: When using MKL in native mode, once the matrix size surpasses a certain threshold I lose the connection to the Xeon Phi, and the libraries I had stored on the Xeon Phi are erased.
A: It sounds like the coprocessor hung or was restarted.  If so, then yes, you'll have to start from the beginning. Your code is stored in memory on the coprocessor, and when the coprocessor restarts it does not retain the memory contents.
Q: Do MPI processes on the Xeon Phi boards run under a particular user ID?  Could (or should) multiple users run MPI processes on a single Phi?
A: Yes, each job you run is associated with you as a user.  But beware: MPI itself is not aware whether a core is already running a process, so if you're planning on sharing the Intel Xeon Phi card with other users (or even running multiple jobs simultaneously), look into enabling pinning domains for each job, or invest in a job scheduler.
Q: Are nonblocking communications supported in MPI?
A: Yes, indeed.  We fully support the MPI-2.2 standard (which includes all the extensions).  MPI-3.0 is coming later this year.
Q: If MPI runs on the host and on a MIC card with the -np 2 option, will all cores on the MIC card be used, or only one?
A: The -np 2 option will only start 2 ranks, and both will be placed on the host. Intel MPI first fills the cores available to it in the current node before moving to the next one. You can override this behavior with the -perhost option: mpirun -perhost 1 -n 2 will run 1 rank on the host and 1 rank on the Intel Xeon Phi card (if you have both listed in your hosts file).
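A minimal command sketch of the placement behavior described above; the host name and binary are hypothetical, and this assumes Intel MPI with MPSS-configured coprocessor hostnames. This is a command/config fragment, not something runnable without the hardware.

```shell
# Hypothetical hosts file: the host node first, then its coprocessor.
printf 'myhost\nmyhost-mic0\n' > mpi_hosts

# Default placement fills the host's cores first, so both ranks land there:
mpirun -f mpi_hosts -np 2 ./hello_mpi

# -perhost 1 places one rank per hosts-file entry instead:
# one rank on the host, one on mic0.
mpirun -f mpi_hosts -perhost 1 -n 2 ./hello_mpi
```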
Q: Is MPI the same on both the Xeon and Xeon Phi?
A: Yes, the MPI is the same, and Intel Xeon Phi coprocessors are supported with the Intel MPI Library 4.1.
Q: Is virtual DAPL or OFA mode recommended for MPI processes on host + MIC even when there is no InfiniBand hardware on the host?
A: You should be able to do that if you have OFED installed. The latest OFED ships a software RDMA layer ported to Intel Xeon Phi coprocessors. I would recommend taking a look at their website: www.openfabrics.org.
Q: I've used MPI in a batch environment (Torque/PBS mostly). Does the MIC support this?
A: To the best of my knowledge, that is currently not possible. As mentioned before, the usual mode of reservation right now is: you reserve the host node and get all Intel Xeon Phi cards attached to that node for the use of a single user. But, again, that would depend on how the batch scheduler implements running on the card. Since you referenced Torque, I know Torque and Moab from Cluster Resources/Adaptive Computing support Intel Xeon Phi coprocessors (http://docs.adaptivecomputing.com/mwm/Content/topics/accelerators/mics.html).
Q: How can I tell if the Traceanalyzer is installed on my system? Running "locate itacvars.sh" doesn't find anything...
A: You can no longer evaluate the Intel Trace Analyzer and Collector by itself since it's not a stand-alone product. You would need to register for an evaluation of the Intel Cluster Studio or Cluster Studio XE (www.intel.com/go/clustertools). No Mac version is available, but you do not have to run the GUI remotely: you collect all the trace data by linking with the extra library, and you can transfer the created *.stf files to a local machine to do the analysis in the GUI. If you have any particular questions on how to do that, just post to our forums at http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/.
Q: Is Open MPI support on the roadmap?
A: This is not something that depends on us, but mostly on how quickly the Open MPI community implements support for Xeon Phi. At this time, Intel has made our software stack, as well as RDMA support for the Xeon Phi, available as open source. If you're interested in what's currently supported, check out http://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm.
Q: When phi coprocessors are communicating and are located on different nodes, do they have to engage the host to transfer the data or can they do it directly?
A: Not necessarily.  We have a couple of ways within both the Intel® MPSS stack and the Intel MPI Library to allow direct communication between the coprocessors, some involving the host memory, others not.  For example, the CCL technology (which stands for Coprocessor Communication Link) implements an RDMA device which allows 2 coprocessors, for example, to directly talk to each other without having to do extra copying to the host.  That’s already part of the current Intel MPSS 2.1 stack.  More info can be found in Section 2.2.9 “Intel Xeon Phi™ Coprocessor Software Stack for MPI Applications” in the Intel Xeon Phi Coprocessor System Software Developers Guide (http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide).  We also have some strategies that do have to involve copying messages out to host first (e.g. OFED over SCIF, others).  Both options have pros and cons: some offer better latency while others increase the bandwidth.  The Intel MPI Library generally tries to determine the best option for a certain situation although we have settings where you can go one way or another exclusively.
Q: Are there available results for scaling using multiple boxes using a hybrid programming model?
A: We have quite a few white papers on how different applications take advantage of the the Intel Xeon Phi architecture.  Check out the "Case Studies" tab on the Intel Xeon Phi Software Developer Community page (http://software.intel.com/en-us/mic-developer).  A lot of those applications do employ the hybrid method of execution (MPI + threading).
Q: Performance of existing MPI code in coprocessor-only mode was really poor for the codes we tested. Have you seen examples where you get acceptable performance for MPI (without using OpenMP or MIC-specific optimization)? Is MPI in coprocessor-only mode a recommended programming model for MIC?
A: It's hard to recommend just one model, mainly because it would differ for each application domain. I've seen a lot more examples where doing both MPI and threading certainly helps. But if your application is not memory-bound (since you get a lot less memory per core on the Intel Xeon Phi coprocessor), I can see you taking advantage of a native run. If you're interested in how specific applications behave on the Intel Xeon Phi coprocessor, you can take a look at the "Case Studies" tab on the Intel Xeon Phi Software Developer Community page (http://software.intel.com/en-us/mic-developer).
Q: Is there any whitepaper on MPI performance in the different modes?
A: Nothing on MPI performance at this time.  It's too dependent on the application profile.
Q: Any performance metrics available for two Phis, or one Phi and one regular Xeon, over InfiniBand through the one-sided communication interface?
A: Nothing in regards to one-sided communication yet. In fact, I would wait until the new MPI-3.0 one-sided support comes to fruition (to be supported by Intel MPI later this year/early next year), since that offers a lot of performance improvements within the standard itself.
Q: Will that run a separate MPI rank on each core of the host and each core of the MIC?
A: You can run a rank per core, but it might not be the most efficient, given the memory and bandwidth constraints of the Xeon Phi.
Q: Does the Intel MPI Library support RMA on the Xeon Phi? Are there performance numbers available for it?
A: No current performance data.  RMA on the Intel Xeon Phi coprocessor support with the Intel MPI Library is provided via the OFED layer in two major ways: the Coprocessor Communication Link (CCL) technology, or through OFED over SCIF.  For more information, check out section 2.2.9 "Intel® Xeon Phi™ coprocessor Software Stack for MPI applications" in the Intel Xeon Phi Coprocessor System Software Developers Guide (http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-system-software-developers-guide).
Q: Does trace analyser work with MPI processes created with MPI_Comm_spawn(...) as opposed to mpirun at session start-up?
A: Unfortunately, we don't support the MPI-2 extensions with the Intel Trace Analyzer and Collector.  At best, it'll record the MPI_Comm_spawn call but will not be able to detect any of the children processes created.
Q: mpirun -f mpi_hosts.. from the host doesn't work on the Xeon Phi system I'm using - it requests a password but then doesn't accept it. Running mpirun directly on the MIC card is fine. Any ideas how to get around the password issues?
A: Have you set up Intel MPSS to enable user accounts?  If you're getting password prompts, I suspect you are running into account permission problems rather than anything associated with MPI.  Be sure that passwordless ssh is setup both on the card and host.  If when you try to ssh from host to card you get prompted to enter your password, you might have to copy your ssh keys to the card as shown in the presentation today.  Typically this will require system administration rights to set up the virtual user account and copy ssh keys into it.
Q: One of the drawbacks of GPU parallel programming is the overhead of transferring arrays between CPU and GPU memory. Does using Xeon Phi coprocessors also require moving memory between host and coprocessor?
A: Yes, the model of a computation device connected to the host processor/memory complex over a peripheral interface such as PCI Express imposes some tradeoffs that affect both GPUs and coprocessors. If you recognize that the 16 channels of GDDR5 memory on the Intel Xeon Phi coprocessor represent 350 GB/s of theoretical bandwidth, whereas PCI Express 3.0 bandwidth on a 32x link is around 31.5 GB/s, you begin to see the problem. It is a cost you can amortize for some algorithms by reusing the transferred data as much as you can, reordering calculations to take advantage of the data that are available, and making use of the cache hierarchy on the Intel Xeon Phi coprocessor to gain further performance advantages even over using local memory.
Q: Phi has to go through Xeon to get to IB fabric?
A: No.  There's an RDMA device available and depending on host architecture, the coprocessor can get to the IB directly over the PCI Express bus, without having to go through the host.
Q: I believe only certain InfiniBand HCAs are supported for RDMA from the MIC. Is there a particular card that has been tested more, or is better supported by Intel?
A: Not really. We try to be as hardware-independent as we can, as long as the hardware vendor is using the standard HCA interfaces (e.g., DAPL). One thing I would recommend is using the OFED software stack for the Intel Xeon Phi cards; the latest build includes an RPM ported to the Intel MIC architecture. OFED is open source and works on the majority of InfiniBand cards out there.
Q: What are your plans on enabling the usage of Mellanox software? For example the Fabric Collective Accelerator (FCA)
A: No plans today. This would be something we need to discuss with Mellanox, including what support they can provide for Intel Xeon Phi coprocessors on their end. It's good for us to know that people are interested, though, so thanks for that.
Q: How much memory of the 8G does the OS take on the Phi's?
A: The kernel itself is pretty small, but the Virtual File System configuration lets users preload various libraries in addition to the minimal set constructed by Intel MPSS. You can shrink the available coprocessor memory by loading up the VFS, or you can NFS-mount those resources and take a smaller footprint. Running top on an idle coprocessor using Intel MPSS 2.1.4982-15 with a nearly default inventory of files in the VFS showed about 225 MB used out of 8 GB, with the rest available for applications.
Q: shmat and shmget are now part of the Linux kernel. Can I use those calls on the MICs? Does fork work on the MICs?
A: Intel MPSS 2.1.4982-15 uses the Linux 2.6.38.8 kernel and /dev/shm is defined, and fork does work.
Q: Is the Xeon Phi supported on the Mac?
A: Apple MacOS X is not a currently supported host for the Intel Xeon Phi coprocessor.
Q: When will support for free Linux distributions (like Debian and/or Ubuntu) be available on the host side? So far only two commercial distributions, RHEL and SLES, are supported.
A: There are no plans to support other Linux variants at this time.  However, CentOS is known to work and others may work even though Intel at this time does not support them.
Q: Can I explicitly call fork on the coprocessor and control my own threading outside of OMP? And can I explicitly share memory ala shared memory segments?
A: The Intel Xeon Phi coprocessor runs a standard Linux. POSIX threads (pthreads) are fully supported, so yes, you may create your own threads outside of OpenMP*, and they will work just fine on Intel Xeon Phi coprocessor cards. We find that threading libraries such as OpenMP*, Intel Cilk Plus, and Intel Threading Building Blocks provide efficient thread pools for most applications, but you are not limited to these threading models.
Q: sched_setaffinity - does it work on the MIC cores?
A: Yes, the thread affinity calls in the pthreads implementation for the Intel MIC architecture do work. You can also set affinities with the other available programming models.
Q: What are the restrictions when programming for Xeon Phi? Is Xeon Phi supported on Windows? Is it possible to use the Microsoft compiler?
A: Support for Intel Xeon Phi coprocessors hosted on Microsoft Windows will come this year.  At the current time Microsoft Visual Studio is not supported for Intel Xeon Phi coprocessors--the OS on the coprocessor will remain the Linux-based Intel MPSS--but you can use the Intel C++ Compiler, which can  easily be integrated into your MSVS IDE, for actual code generation.
Q: sudo to do an scp? I don't understand.
A: The coprocessor OS is a real OS, and /lib64 is typically read-only for regular users. The example using sudo was to log in as root on the coprocessor in order to overwrite a protected directory. For normal users with home directories on the coprocessor file system, non-root ssh is the normal activity.
Q: The sudo seems to be applied on the HOST side; the "root@mic" part would allow access to the /lib64 dir. Or is this version of scp heavily modified to transfer sudo-rights to the MIC ?
A: In order to run with root access on the coprocessor, you need to be logged into the coprocessor as root.  The easiest way to do that, given that ssh IDs have been placed in /root/.ssh on both host and coprocessor to enable password-less login, is to initiate the ssh task on the host as root.
Q: Is root required on the card to log in?
A: As long as your Intel MPSS environment has been configured to support normal user access on the coprocessor (by setting up accounts, home directories and ssh IDs in the coprocessor virtual file system), you can log into the coprocessor using ordinary user accounts.  Root access will be required in order to set up this configuration.
Q: Does an existing product go in just about any PCI system?  I'm not sure if they require special cooling.
A: We work closely with our OEM partners to validate systems in which the Intel Xeon Phi coprocessor will work. There are special requirements for cooling and minimum requirements for the PCI Express address space to support the Intel Xeon Phi coprocessor. Please see our software.intel.com/mic-developer pages for links to Intel Xeon Phi products from our OEM partners.
Q: If I have two MICs per node, are they MIC:0 and MIC:1?
A: Intel MPSS installation predefines IP addresses and names in /etc/hosts on the host and coprocessor to facilitate communications. The default names provided for the coprocessors start with "mic0" and go up from there. Likewise, in the pragma or directive offload statements, the explicit numbering for coprocessors starts at 0, e.g., "#pragma offload target(mic:0)" to refer to the first coprocessor in the system.
Q: remark just made: the filter pane can get too big with all the coprocessor threads; how does this fit with the remark that the xeon phi is not yet (fully) supported?
A: The two concepts are disjoint. Both timeline and grid views show lots of data tagged with information captured when the event samples are taken. Thus events can be tied to cores and, statistically, to threads, via the Event Based Sampling method that is supported on Intel Xeon Phi coprocessors. Filtering can help focus on particular data of interest and thus cope with the large number of HW threads. There are other collectors supported on the host processor, though, whose availability is nascent or nonexistent on the coprocessor (e.g., a partially functional interface for itt_notify is available in Update 5). For example, both the Locks and Waits and the Concurrency analyses require instrumentation not yet available on the coprocessor.
Q: How can I resolve the VMLINUX kernel section in VTune so that I can see the methods that are called in VMLINUX?
A: Hot spot event traces in VTune Amplifier can report a lot of counts in VMLINUX. The maps that can be used to identify functions from the captured RVAs are located in /lib/firmware/mic on the Intel MPSS-installed host. Add this directory to the list of search libraries in the project you are using for Intel Xeon Phi coprocessor collections to enable VTune Amplifier to resolve those symbols.
Q: Does VTune work with VxWorks?
A: VTune Amplifier XE does work with VxWorks (see http://windriver.com/products/linux/performance-studio-for-intel-architecture/), but VxWorks is not currently supported for Intel Xeon Phi coprocessors.