Prominent features of the Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.2

The Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.2 was released on March 17, 2014. This page lists the prominent features in this release.

Improvements in Intel® Coprocessor Offload Infrastructure (Intel® COI)

  • Two new COI buffer APIs, COIBufferWriteEx() and COIBufferCopyEx(), have been introduced in this release. Each API allows the caller to specify the destination process for the buffer, which makes the destination buffer valid exclusively in the specified process. The necessary DMA operations run in the background to gather the data from its various regions into that process. Please see /usr/include/intel-coi/source/COIBuffer_source.h for details.
  • A new API called COIProcessSetCacheSize() allows the caller to tune the caching algorithm and how aggressively it consumes memory. A more aggressive cache size setting lets the application gain additional performance at the cost of a higher memory footprint. Please see /usr/include/intel-coi/source/COIBuffer_source.h for details.
  • COI performance enhancements: In addition to various bug fixes, several performance enhancements were made to the COIBufferAddRef and COIBufferReleaseRef functions. For manually created reference counts, these APIs are now significantly faster for most users.

Intel® Xeon Phi™ coprocessor Linux* kernel performance improvements

File IO performance improvements: The performance of the file IO system calls (e.g. read/write/readv/writev) on the Intel® Xeon Phi™ coprocessor has been improved. The file systems involved are relatively simple: the work consists mostly of allocating pages (page cache) in memory, copying data to and from user buffers, and freeing page cache pages. Even so, two main factors limit file IO performance on the Intel® Xeon Phi™ coprocessor:

  • The lower single threaded performance of the Intel® Xeon Phi™ coprocessor relative to the Intel® Xeon® processor - simple in-order cores running at much lower frequency.
  • Small caches - the impact of cache pollution due to memory copy/clear page type operations in the inner loops of the read/write system calls has a dramatic effect on performance.

This Intel® MPSS release attempts to improve the performance of system calls for reading and writing files on tmpfs and ramfs mount points. In addition to a set of kernel configuration parameters that enable the optimization (ON by default in the release), the following kernel command line options provide additional control to enable or disable the read and write optimizations:

  • vfs_read_optimization - on/off. If not specified, it is off by default. When on, it enables read side optimizations for files in the above file systems.
  • vfs_write_optimization - on/off. If not specified, it is off by default. When on, it enables write side optimizations for files in the above file systems.

As an example, to enable read optimizations, add vfs_read_optimization to the ExtraCommandLine as follows:

  1. Edit /etc/sysconfig/mic/default.conf
  2. Append "vfs_read_optimization=on" to the ExtraCommandLine
  3. Restart the mpss service
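
As a rough illustration of step 2 above, the following Python sketch appends an option to the ExtraCommandLine entry of a configuration file's text. It assumes the entry has the form ExtraCommandLine "<options>" on a single line; that syntax, the helper name append_kernel_option, and the sample "highres=off" value are assumptions made for illustration, so check your own default.conf before relying on it.

```python
import re

# Assumed (not confirmed by this page) layout of the entry in default.conf:
#   ExtraCommandLine "<space-separated kernel options>"
_EXTRA_RE = re.compile(r'^(ExtraCommandLine[ \t]+")([^"]*)(")[ \t]*$', re.MULTILINE)


def append_kernel_option(conf_text, option):
    """Return conf_text with `option` present in the ExtraCommandLine value."""

    def add_option(match):
        options = match.group(2).split()
        if option not in options:  # keep the edit idempotent
            options.append(option)
        return match.group(1) + " ".join(options) + match.group(3)

    new_text, count = _EXTRA_RE.subn(add_option, conf_text)
    if count == 0:
        # No existing entry found: add one using the assumed syntax.
        new_text = conf_text.rstrip("\n") + '\nExtraCommandLine "' + option + '"\n'
    return new_text


# Hypothetical file content, used only to demonstrate the helper.
sample = 'ExtraCommandLine "highres=off"\n'
print(append_kernel_option(sample, "vfs_read_optimization=on"))
```

The helper only transforms text; writing the file back and restarting the mpss service (step 3) are left to the administrator.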

Additional details can be found here: http://software.intel.com/en-us/blogs/2014/01/07/improving-file-io-performance-on-intel-xeon-phi

mmap() performance improvements: The standard mmap(2) with the MAP_POPULATE flag provides a way to pre-fault pages, which results in improved performance in HPC applications. mmap(2) calls into the page allocator, which uses spinlocks that can become heavily contended in some scenarios, for example when multiple MPI ranks call mmap(2) during initialization to allocate memory. By tweaking ticket spinlocks and preventing threads from constantly reading cache lines to check ticket values, the scalability of mmap(2) is dramatically improved on the Intel® Xeon Phi™ coprocessor.

This enhancement is enabled via a kernel configuration option called SPINLOCK_SCALABLE (enabled by default). It is recommended for better locking performance on heavily multi-threaded systems such as the Intel® Xeon Phi™ coprocessor. When many threads are contending on a single lock, this option limits the number of threads that spin on the cache line to a maximum defined by SPINLOCK_QUEUE_LENGTH. The other threads delay execution for a period of time (in cycles) that depends on their distance from the current ticket value, i.e. (ticket - SPINLOCK_QUEUE_LENGTH) * SPINLOCK_QUEUE_DELAY.
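
The delay formula above can be worked through with a small Python sketch. It reads "ticket" as the waiter's distance from the ticket currently being served, which is an interpretation of the text rather than the kernel source; the function name and the values 4 and 100 are hypothetical, since this page does not state the actual settings of SPINLOCK_QUEUE_LENGTH and SPINLOCK_QUEUE_DELAY.

```python
def spin_delay_cycles(ticket_distance, queue_length, queue_delay):
    """Cycles a waiter backs off before it starts spinning on the lock's cache line.

    ticket_distance: how far the waiter's ticket is from the one being served
    queue_length:    stands in for SPINLOCK_QUEUE_LENGTH (value is an assumption)
    queue_delay:     stands in for SPINLOCK_QUEUE_DELAY (value is an assumption)
    """
    if ticket_distance <= queue_length:
        return 0  # near the head of the queue: spin directly on the cache line
    return (ticket_distance - queue_length) * queue_delay


# Hypothetical settings: at most 4 spinners, 100 cycles of delay per position.
for distance in (2, 4, 5, 10):
    print(distance, spin_delay_cycles(distance, queue_length=4, queue_delay=100))
```

Waiters far from the head of the queue back off longer, so fewer threads read the contended cache line at any one time.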

Further performance improvements for mmap(2) with the MAP_POPULATE flag are achieved by applying the alternate movq-based clear_page, which provides higher throughput, especially for large allocation sizes.
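
For readers unfamiliar with MAP_POPULATE, here is a minimal Python sketch of a pre-faulted anonymous mapping. It is a generic Linux illustration, not code from Intel® MPSS: it assumes a Unix host and Python 3.10 or later (where the mmap module exposes MAP_POPULATE) and falls back to an ordinary mapping elsewhere. The function name is hypothetical.

```python
import mmap


def allocate_prefaulted(length):
    """Create an anonymous private mapping, pre-faulting its pages when possible."""
    # MAP_POPULATE is Linux-specific; use 0 (no extra flag) if it is unavailable.
    populate = getattr(mmap, "MAP_POPULATE", 0)
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | populate
    return mmap.mmap(-1, length, flags=flags)


buf = allocate_prefaulted(16 * mmap.PAGESIZE)
buf[:5] = b"hello"  # pages are already resident, so no page fault is taken here
print(len(buf), bytes(buf[:5]))
buf.close()
```

With MAP_POPULATE the cost of faulting pages in is paid inside the mmap call, which is why contention in the page allocator shows up when many ranks allocate at once.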

Support for Automatic Process Grouping in the Intel® Xeon Phi™ coprocessor Linux* kernel

Automatic process grouping (a.k.a. the "patch that does wonders"), a feature available in the 2.6.38.8 kernel used on the coprocessor, is enabled by default in this release. This patch changes how the process scheduler assigns shares of CPU time to each process. The feature can be disabled via a kernel command line option called noautogroup, which can be passed to the kernel running on the Intel® Xeon Phi™ coprocessor via the ExtraCommandLine option in the default.conf file (see above for an example describing command line options).

SCSI RDMA Protocol initiator on the Intel® Xeon Phi™ coprocessor

This feature allows the Intel® Xeon Phi™ coprocessor to act as a SCSI RDMA Protocol (SRP) initiator. With this feature the Intel® Xeon Phi™ coprocessor can access remote SRP targets on the network. An SRP target is a SCSI device on another computer. The SRP implementation requires an InfiniBand HCA. After installing the OFED RPMs on the host, see /usr/share/doc/ofed-driver-2.6.32-279.el6.x86_64-3.2/srp-phi.txt for the full instructions.

Intel® Xeon Phi™ Products Reliability Monitor

The Reliability Monitor is designed to monitor the overall health of compute nodes at the cluster level. It runs on the head node or management node. The Reliability Monitor works closely with a RAS agent, such as micrasd, that runs on each compute node to collect data such as uncorrectable errors or crash symptoms, which are then logged on the management node.

Intel® MPSS 3.2 now supports Microsoft Windows* 8.1 OS on the host

Support to map multiple pages via scif_mmap()

While scif_mmap() continues to support the same semantics as in previous Intel® MPSS releases (for both Linux* and Windows*), this release adds support for mapping multiple pages via scif_mmap() if the host OS is Windows* 8 or later, i.e. the length parameter can be in the range 4KB to 4GB - 4KB. In previous releases, scif_mmap() had the limitation that the length parameter had to be equal to one page. For details on scif_mmap() please refer to /usr/include/scif.h (this assumes that you have installed the libscif dev headers RPM).
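
The length range quoted above works out as follows. This is only arithmetic on the stated bounds, sketched in Python; the requirement that the length be a whole number of 4KB pages is an assumption made here for illustration and should be checked against scif.h. The function name is hypothetical.

```python
KB = 1024
GB = 1024 ** 3

PAGE_SIZE = 4 * KB            # one page, the only length accepted before this release
MAX_LENGTH = 4 * GB - 4 * KB  # upper bound quoted for Windows* 8 or later hosts


def is_valid_multi_page_length(length):
    """Check a scif_mmap() length against the range stated in the text."""
    in_range = PAGE_SIZE <= length <= MAX_LENGTH
    page_multiple = length % PAGE_SIZE == 0  # assumption, not stated on this page
    return in_range and page_multiple


print(MAX_LENGTH)               # largest length in bytes
print(MAX_LENGTH // PAGE_SIZE)  # the same bound expressed in 4KB pages
```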

Windows* Bridging and Routing support:

The Intel® MPSS drivers and utilities provide support for IP networking over PCIe to all the Intel® Xeon Phi™ coprocessors plugged into the system.

  • Bridging: Windows* provides functionality for creating a software bridge that connects two or more networks so that they can communicate. Network packets received on any of these bridged networks are passed unchanged to the bridge. The Intel® MPSS provides two forms of bridging.
    • Internal Bridging – provides the ability for the coprocessors to communicate with each other as well as with the host.
    • External Bridging – provides the ability for the coprocessors to communicate with the physical network interface of the host system. This will be the desired configuration in clusters. Please refer to the MPSS User’s Guide-Windows* on the Intel® MPSS download page for more details on how to set up bridging.
  • Routing: Windows* provides functionality to route network traffic between two or more networks. IP routing is disabled by default in Windows*. Manual steps are required to set up IP routing for the Intel® Xeon Phi™ coprocessors. Please refer to the MPSS User’s Guide-Windows* for more details on how to set up routing.

Support for LDAP on the Intel® Xeon Phi™ coprocessor

The Intel® MPSS 3.2 release supports the Lightweight Directory Access Protocol (LDAP) on the Intel® Xeon Phi™ coprocessor. This allows administrators to set up authentication and resource limits via this mechanism. The article linked below illustrates all the steps needed to set up LDAP support on the MPSS host, set up a static bridge, and configure Intel® Xeon Phi™ coprocessor cards to allow SSH login to the card by an LDAP user (see the micctrl options for LDAP support below). In addition, it explains how to configure the Intel® Xeon Phi™ coprocessor for LDAP users of Intel® Coprocessor Offload Infrastructure (Intel® COI) applications, with optional authentication of the offload user. Additional information can be found here:

http://software.intel.com/en-us/articles/setting-up-ldap-support-for-intel-xeon-phi-coprocessors

Network File System Version 4 (NFSv4) support in the Intel® Xeon Phi™ Linux* kernel

The Linux* kernel running on the Intel® Xeon Phi™ coprocessor now supports the NFSv4 client, which enables features like file locking (the mounting and locking features have been incorporated into the V4 protocol) and automounting via autofs daemon running on the coprocessor. Please refer to the MPSS User’s Guide for details at http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#lx32rel

Mount Helpers Enabled: The CONFIG_FEATURE_MOUNT_HELPERS option was enabled in this release which supports the use of helpers with mount (e.g. via mount.nfs4).

Authentication and limits checks

In addition to support for LDAP, Intel® MPSS now supports HostBasedAuthentication. Host-based authentication is a feature that allows any user on a trusted host to log into another host (with the same username) on which this feature is enabled. Note that while per-user ssh public keys can achieve similar effects, maintaining the per-user keys could be an unwanted administrative overhead. A few files need to be modified to configure this in an environment that includes Intel® Xeon Phi™ coprocessor cards. Please refer to the MPSS User’s Guide for details at http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#lx32rel

PAM support in sshd and processing of limits.conf enabled by default:

All Intel® MPSS 3.x releases have compiled the SSH daemon with PAM support, but in the 3.2 release that support is enabled by default. As a result, policies set in PAM's configuration now affect SSH sessions as well (they previously did not). The 3.2 release also includes and enables the pam_limits module, allowing session resource limits to be configured with limits.conf(5).

Removed requirement for a “filelist” for the coprocessor OS image on Linux* hosts

The MicDir and CommonDir configuration parameters (in the .conf files in /etc/sysconfig/mic) previously required a "filelist" file to be present on the host file system. The purpose of this file was to define the owner and permissions of every file present in these directories. Since this served no real purpose on Linux* hosts, where the owner and permissions are already available from the source file, the requirement has been removed. Removing it simplifies the process of adding new files to the coprocessor’s file system.

Intel® True Scale enhancements

Earlier versions of Intel® MPSS supported Intel® True Scale fabric both in “offload mode” where an MPI application runs on the host while offloading computations to the coprocessor as well as “native mode” where the application can use MPI directly on the Intel® Xeon Phi™ coprocessor (i.e. peer-to-peer support). In this release support has been added for Red Hat* Enterprise Linux* (RHEL) 6.4, 6.5, and SUSE* Linux* Enterprise Server (SLES) 11 SP3.

In addition, Send DMA (SDMA) support was added for processes running on the Intel® Xeon Phi™ coprocessor using Intel® True Scale HCAs. This feature improves native mode bandwidth for such processes. The feature is enabled by default in this release, and users can prevent the Performance Scaled Messaging (PSM) user library from using it via the PSM_SDMA environment variable. While unlikely, problems can show up in the form of processes not starting up and/or hanging. In the very rare case of an unhandled issue in the driver, a stack trace might be printed to the system console. However, this should not affect the stability of the system or the Intel® Xeon Phi™ coprocessors.

Tools and Compilers

  • Cross-compiled Linux* software provided in the k1om tarball - The mpss-3.2-k1om.tar tarball available on the Intel® MPSS download page contains binary RPMs for around three hundred Linux* software packages. Included among them are system daemons like cronie, rpcbind, and xinetd; performance and debugging tools like gperf, lsof, perf, and strace; utilities like bzip2, curl, rsync, sudo, and tar; scripting languages like awk, perl, and python; and development tools including a GCC toolchain, autotools, bison, cmake, flex, git, make, patch, and subversion. If zypper has been appropriately configured, the command "zypper install task-mpss-toolchain" will install a complete set of essential development tools.
  • Compatibility with x86-64 executables through software emulation - The mpss-3.2-k1om.tar tarball contains the binfmt-qemu RPM which provides support for running standard x86-64 executables on the Intel® Xeon Phi™ coprocessor. This compatibility works by using QEMU to emulate the x86-64 instruction set entirely in software, so it comes at the price of extremely high overhead. Consequently, it is useful only in scenarios where performance is not at all a concern, such as when the executable in question is used interactively. In those scenarios, one can sidestep the need to cross-compile the software in question.
  • Support for vtables in the compiler - MYO added a non-coherent arena to support vtables in the Intel® Composer XE compiler. This feature enables _Cilk_shared C++ objects with the same virtual address on the Windows* host and on Linux* on the device, including support for C++ virtual functions.

micctrl improvements and new features

  • micctrl supports a way to specify NFS options via the --addnfs command, which adds an NFS mount to the Intel® Xeon Phi™ coprocessor. The syntax is as follows:

                micctrl --addnfs=<NFS_export> --dir=<mount dir> [--server=<server>] [-o <option1>,<option2>, . . .] [mic card list]

The --addnfs option adds an NFS mount entry to the Intel® Xeon Phi™ coprocessor card’s /etc/fstab file. “NFS_export” specifies the NFS export in the traditional <server>:<export> format, and --dir specifies the mount directory. The server parameter may still be used for backward compatibility with previous versions of micctrl and will prepend its argument to the specified export. If the server is not specified, the IP address of the host is placed in the server field. In this release, the NFS mount options for the fstab entry may also be specified by appending a list of comma-separated options to the ‘-o’ option. If no options are specified, just the “defaults” mount option is used.
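
To make the syntax above concrete, the following Python sketch assembles an argument list for micctrl --addnfs from its parts. It only builds the command and does not run it. The helper name addnfs_command and every value in the example (server address, export path, mount directory, mount options, card name) are hypothetical.

```python
def addnfs_command(export, mount_dir, server=None, options=(), cards=()):
    """Build an argv list following the --addnfs syntax shown above."""
    argv = ["micctrl", "--addnfs=" + export, "--dir=" + mount_dir]
    if server:
        # Kept for backward compatibility; micctrl prepends it to the export.
        argv.append("--server=" + server)
    if options:
        # Mount options are passed as one comma-separated list after -o.
        argv += ["-o", ",".join(options)]
    argv += list(cards)  # optional list of cards to apply the change to
    return argv


# Hypothetical example values.
cmd = addnfs_command("192.0.2.10:/srv/share", "/mnt/share",
                     options=["ro", "nolock"], cards=["mic0"])
print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to a process launcher.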

  • micctrl -b checks for valid uOS image/initrd files - The MPSS daemon and the micctrl utility option -b (for boot) now check the validity of the kernel and initrd images before attempting to boot the Intel® Xeon Phi™ coprocessor. The checks performed include verifying the correct image type and a machine_type of “k1om”. The initrd is also checked and must be a compressed CPIO archive.
  • micctrl supports a new option for user configuration - The user configuration in the --initdefaults option has changed. A new option called --users determines how the /etc/passwd and /etc/shadow files are populated. This option has four settings:
    • If the option is set to “nochange” or is not provided, the default behavior is to leave the /etc/passwd file on the coprocessor untouched. If this file does not exist, the option defaults to “overlay”.
    • If the option is set to “overlay”, the content of /etc/passwd is replaced by the minimum users plus the users found in the host's /etc/passwd file.
    • If the option is set to “merge”, the /etc/passwd file on the host is checked for any new users to be merged into the card.
    • If the option is set to “none”, only the following users are created: root, ssh, micuser, nobody and nfsnobody. No other users are included.

If, after the initial configuration, the system administrator wishes to change this, a new micctrl --userupdate option is available. It is called with the "none", "overlay" or "merge" options. If "none" or "overlay" is specified, the current users are removed and the users are recreated as specified above. If the "merge" option is specified, the host's /etc/passwd file is checked for any users not currently in the card's user list and they are merged in.

At any time a new user can be added using the micctrl --useradd option. In MPSS 3.2, the --useradd option will also push the creation of the new user to the running MIC card if it is in the online state. There is a similar option called --userdel. Please refer to the MPSS User’s Guide for details at http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#lx32rel
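
The four --users settings described above can be modeled with a short Python sketch. It mirrors the wording of this page rather than the micctrl source: it tracks user names only, ignores /etc/shadow, and treats "merge" on a card with no existing user list as starting from an empty list, which is an assumption. The function and parameter names are hypothetical.

```python
# Minimal user set listed in the text for the "none" setting.
MINIMAL_USERS = ["root", "ssh", "micuser", "nobody", "nfsnobody"]


def resolve_users(mode, host_users, card_users=None):
    """Return the card's user list after applying a --users setting.

    card_users is None when /etc/passwd does not yet exist on the card.
    """
    if mode is None or mode == "nochange":
        if card_users is not None:
            return list(card_users)   # leave the existing file untouched
        mode = "overlay"              # no file yet: fall back to overlay
    if mode == "none":
        return list(MINIMAL_USERS)
    if mode == "overlay":
        base = list(MINIMAL_USERS)    # existing card users are discarded
    elif mode == "merge":
        base = list(card_users or []) # existing card users are kept
    else:
        raise ValueError("unknown --users setting: " + repr(mode))
    # Add host users that are not already present, preserving order.
    return base + [user for user in host_users if user not in base]


host = ["root", "alice"]  # hypothetical host users
print(resolve_users("overlay", host))
print(resolve_users("merge", host, card_users=["root", "bob"]))
```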

  • micctrl supports configuring LDAP as follows:

micctrl --ldap=<server> --base=<domain> [Mic Card List]

The --ldap option configures the Intel® Xeon Phi™ coprocessor for LDAP user authentication. The server parameter specifies the address of an existing LDAP server for the LDAP client on the coprocessor to connect to. This server parameter can also be set to the keyword disable in order to turn off LDAP user authentication and remove any existing LDAP server configuration. The base parameter specifies the base domain name for your LDAP server. Three LDAP RPMs (nss-ldap, openldap, libldap) for the Intel® Xeon Phi™ coprocessor are required to successfully run the command. Additionally, the Intel® Xeon Phi™ coprocessor must be configured to be used with an external bridge in order for it to reach a remote LDAP server address.

Driver Boot Message Cleanup on failure

In the past, the host driver would continue to print the “failure message” (see below) even after the Intel® Xeon Phi™ coprocessor had failed to boot. Now, the function in the host driver that waits for the coprocessor to boot has been fixed to recognize that the boot has failed and to stop printing the message.

...
Waiting for MIC 0 boot 295
Waiting for MIC 0 boot 300

Timeout booting MIC, check your installation
Waiting for MIC 0 boot 305...
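
The corrected behavior can be sketched as a polling loop that stops reporting once the timeout is reached. This Python model is illustrative only and is not the driver's code; the 300-second timeout and 5-second interval are inferred from the log excerpt above, and the function name is hypothetical. Time is simulated so the sketch runs instantly.

```python
def boot_wait_messages(booted_at, timeout=300, interval=5, card=0):
    """Return the log lines printed while waiting for a card to boot.

    booted_at: simulated time in seconds at which the card comes online,
               or None if it never boots.
    """
    lines = []
    for elapsed in range(interval, timeout + interval, interval):
        if booted_at is not None and booted_at <= elapsed:
            return lines  # card is online: nothing more to report
        lines.append("Waiting for MIC %d boot %d" % (card, elapsed))
    # Timeout reached: report the failure once and stop waiting.
    lines.append("Timeout booting MIC, check your installation")
    return lines


for line in boot_wait_messages(booted_at=None)[-2:]:
    print(line)
```

The key point is that the timeout message is the last line emitted; no further "Waiting" lines follow it.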

Documentation

To help understand and troubleshoot issues encountered when updating the flash or SMC firmware on the Intel® Xeon Phi™ coprocessor, the following FAQ document describes all the possible failure scenarios and the remedies for the failure. The document can be found here: 

http://software.intel.com/en-us/forums/topic/494772

Similarly, the following document describes the error/log messages reported by the MIC RAS daemon running on the host (if installed). It covers CRITICAL and ERROR messages only as INFO and WARNING messages do not need immediate action.  The document can be found here: 

http://software.intel.com/en-us/forums/topic/494773

Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

For details on compiler optimization, see our Optimization Notice.