Hi, does anyone have results from using MPI to test the host <-> MIC bandwidth? I tried it on my machine and the bandwidth is quite low (~0.4 GB/sec). I just send data from the host to the MIC card using a blocking function and measure the time. The download speed test in the SHOC benchmark can reach up to 10 GB/sec. Any idea why the bandwidth is so low with MPI? Thanks a lot!
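One thing worth checking with a timing like the one described above is the message size, since a single blocking send pays a fixed per-message cost on top of the transfer itself. The sketch below uses the common latency-plus-bandwidth cost model to show how much that matters. The function name `effective_bandwidth`, the 10 microsecond latency, and the 6 GB/s peak are illustrative assumptions, not measurements of any MPI fabric or of this system.

```python
def effective_bandwidth(nbytes, latency_s, peak_bytes_per_s):
    """Effective bandwidth (bytes/s) of one blocking transfer.

    Simple cost model: time = latency + nbytes / peak bandwidth.
    """
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    transfer_time = latency_s + nbytes / peak_bytes_per_s
    return nbytes / transfer_time


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    latency = 10e-6  # assumed fixed cost per message, in seconds
    peak = 6e9       # assumed peak link bandwidth, in bytes per second
    for size in (4 * 1024, 1024 * 1024, 256 * 1024 * 1024):
        bw = effective_bandwidth(size, latency, peak)
        print(f"{size:>12d} bytes: {bw / 1e9:.2f} GB/s")
```

Under this model small messages are dominated by latency, so a bandwidth figure is only comparable to another benchmark when both use similarly large messages and average over several repetitions.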
Intel® Many Integrated Core Architecture (Intel MIC Architecture)
Xeon Phi compatibility with Dell workstation
I am trying to buy an Intel Xeon Phi 5110P card for a research project. It is very difficult to find information on the compatibility of the Xeon Phi with specific workstations. I am interested in the Dell Precision T5600 workstation and I am trying to find out whether the Xeon Phi is compatible with it. Although Dell appears in Intel's "Where to buy" list for Xeon Phi, I have not been able to find this information from Dell (although it is possible to configure the T5600 online with an Nvidia Tesla K20C).
Regarding building a native application for Intel Xeon Phi Coprocessors
I am very interested in building a native application for Intel Xeon Phi Coprocessor. As you know, the embedded Linux operating system runs on the Intel Xeon Phi coprocessors. My question is as follows:
Complex Division Performance Issue
I have noticed a performance issue with complex division on the MIC. Dividing two complex numbers with the division operator is about 22x slower than explicitly coding the operation using the complex conjugate (see attached source file). I passed the -fcode-asm flag to the ifort compiler to dump the assembly code and noticed an unexpected difference. In the former case a call is made to an SVML subroutine named __svml_cdiv8, but in the latter the code is inlined. On the CPU, inlined code is always used (meaning no calls to the external VML library).
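For readers without the attachment, the conjugate form of the division can be written out as follows. This is a minimal sketch of the textbook identity a / b = a * conj(b) / |b|^2, not the poster's Fortran source; the function name `divide_by_conjugate` is made up for illustration.

```python
def divide_by_conjugate(a, b):
    """Divide complex a by complex b using a / b = a * conj(b) / |b|^2."""
    # |b|^2 is real, so the complex division reduces to one complex
    # multiplication followed by two real divisions.
    denom = b.real * b.real + b.imag * b.imag
    if denom == 0.0:
        raise ZeroDivisionError("complex division by zero")
    num = a * b.conjugate()
    return complex(num.real / denom, num.imag / denom)


if __name__ == "__main__":
    a = 1 + 2j
    b = 3 - 4j
    print(divide_by_conjugate(a, b))
    print(a / b)  # built-in division, for comparison
```

Note that this direct form can overflow or underflow in `denom` when the components of `b` are very large or very small, which is one reason library division routines often add scaling steps. Whether that explains the slowdown reported here is not something this sketch can confirm.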
Able to use fabric dapl but not ofa
I'm able to use I_MPI_FABRICS=dapl but not I_MPI_FABRICS=ofa on my system.
For example, I'm using IMB to test the performance with this command:
mpiexec.hydra -genv I_MPI_FABRICS=shm:tcp -n 1 -host bio-xinyi ~/tmp/imb/imb/3.2.4/src/IMB-MPI1 -off_cache 12,64 -npmin 64 -msglog 24:28 -time 10 -mem 1 PingPong Exchange : -n 1 -host mic0 /tmp/IMB-MPI1.mic
When using I_MPI_FABRICS=ofa, it shows:
Randomly slower cores
Hi,
I experience a severe performance imbalance in our Xeon Phi (5110P, latest MPSS): a few (1-3) random CPU cores are 10-20% slower than all the other cores. I created a minimal example which demonstrates this (see below).
observations:
Forcing AO with MKL?
I have a large, numerically intensive C++ program that is a heavy user of Intel MKL 11.0.2 (it makes a lot of use of zgemm, for example).
I am experimenting with AO without changing the source code, but I can't seem to get the MIC to "kick in" for typical, large problems. My first thought was to check that AO is working in a test problem, so I wrote a simple program that uses zgemm and the same link options and environment as the main program.
Confusion about Windows OS used on MIC host
I just read the Intel Xeon Phi Coprocessor Developer's Quick Start Guide for Windows Host, and was told that
"The operating system (OS) supported and validated on the host are: Windows 7 Enterprise SP1 and Windows Server 2008 R2 SP1".
But Windows 7 Professional is installed on my computer. My question is as follows:
Is it mandatory to have Windows Enterprise installed as the OS?
Xeon Phi and Xen 4.2.1
I am trying to use the Xeon Phi card with a Xen guest VM (HVM guest) with PCI passthrough (supported by Intel's VT-d). The following is my setup:
SuperMicro SYS-1017GR-TF (has the X9SRG-F motherboard which is known to support the Phi with the latest BIOS)
Dom0: Debian Wheezy 64-bit
HVM Guest: CentOS 6.3 with stock kernel (2.6.32-279 x86_64), Xeon Phi is passed through with the Xen pciback driver
VTune does not show L2 statistics
After running the VTune on Xeon Phi tutorial given here (http://software.intel.com/en-us/blogs/2013/01/08/ready-to-run-applications-from-multicore-platform-onto-intel-xeon-phi-coprocessor), we were unable to obtain a non-zero figure for the L2 cache hit/miss rates in the results, despite trying all of the amplxe-cl analysis types for the Xeon Phi {knc-bandwidth, knc-general-exploration, knc-lightweight-hotspots}.
