by Robert J. Woodruff
Software Engineering Manager
Get acquainted with the InfiniBand* software architecture and how support for it has been added to Linux*.
Kernel-mode support for the InfiniBand* architecture has been added to Linux*, and support for user-mode access to the InfiniBand architecture is nearly complete. This article describes the InfiniBand software architecture and implementation for Linux being developed by the OpenIB.org alliance and other open source projects.
InfiniBand architecture defines a high-bandwidth, low-latency clustering interconnect used for high-performance computing (HPC) and enterprise data center class applications. It is an industry standard developed by the InfiniBand Trade Association*.
Since the release of the specification, Intel has been active in developing Linux open source software for InfiniBand architecture, starting with an initial project on SourceForge* and recently joining forces with the OpenIB.org alliance*.
The open source code for InfiniBand architecture has matured to the point where portions of it are included in the base Linux kernel. Other modules are under development and will be completed and submitted to Linux in the near future.
This article provides a high-level description of the OpenIB.org InfiniBand software architecture [1] and implementation, plus links to related open source projects that use the OpenIB.org code.
[1] The OpenIB.org Developer’s Workshop proceedings – http://www.openib.org/archives.htm*.
The Linux InfiniBand code consists of a set of kernel modules and associated user-mode shared libraries.
Kernel Level InfiniBand Modules
The kernel code divides logically into three layers: the host channel adapter (HCA) driver(s), the core InfiniBand modules, and the upper level protocols. The core InfiniBand modules comprise the kernel-level mid-layer for InfiniBand devices. The mid-layer allows access to multiple HCAs and provides a common set of shared services, including the following:
- User-level Access Modules – The user-level access modules implement the necessary proxying mechanisms to allow access to InfiniBand hardware from user-mode applications.
- The mid-layer provides the following functions:
- Communications Manager (CM) – The CM provides the services needed to allow clients to establish connections.
- SA Client – The SA (Subnet Administrator) client provides functions that allow clients to communicate with the subnet administrator. The SA contains important information, such as path records, that is needed to establish connections.
- SMA – The Subnet Manager Agent responds to subnet management packets that allow the subnet manager to query and configure the devices on each host.
- PMA – The performance management agent responds to management packets that allow retrieval of the hardware performance counters.
- MAD services – Management Datagram (MAD) services provide a set of interfaces that allow clients to access the special InfiniBand queue pairs, 0 and 1.
- GSI – The General Services Interface (GSI) allows clients to send and receive management packets on special QP 1.
- Queue pair (QP) redirection allows an upper level management protocol that would normally share access to special QP 1 to redirect that traffic to a dedicated QP. This is done for upper level management protocols that are bandwidth intensive.
- SMI – The Subnet Management Interface (SMI) allows clients to send and receive packets on special QP 0. This is typically used by the subnet manager.
- Verbs – The mid-layer provides access to the InfiniBand verbs supplied by the HCA driver. The InfiniBand architecture specification defines the verbs. A verb is a semantic description of a function that must be provided. The mid-layer translates these semantic descriptions into a set of Linux kernel application programming interfaces (APIs).
- The mid-layer is also responsible for resource tracking, reference counting, and resource cleanup in the event of an abnormal program termination or in the event a client closes the interface without releasing all of the allocated resources.
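The split between the two special queue pairs, and the effect of QP redirection, can be pictured with a small model. This is only a sketch under stated assumptions: the class name `MadDispatcher`, its methods, and the management class strings are hypothetical and are not the kernel MAD interface. The only facts taken from the list above are that subnet management traffic uses special QP 0, general services traffic uses special QP 1, and a bandwidth-intensive protocol can be moved to a dedicated QP.

```python
from dataclasses import dataclass, field

SMI_QP = 0  # special QP 0: subnet management packets
GSI_QP = 1  # special QP 1: general services management packets


@dataclass
class MadDispatcher:
    """Toy model of MAD routing; hypothetical, not the kernel interface."""

    # management class name -> dedicated QP number for redirected traffic
    redirects: dict = field(default_factory=dict)

    def redirect(self, mgmt_class: str, dedicated_qp: int) -> None:
        """Move one management class off the shared QP 1."""
        if dedicated_qp in (SMI_QP, GSI_QP):
            raise ValueError("redirection target must be a non-special QP")
        self.redirects[mgmt_class] = dedicated_qp

    def target_qp(self, mgmt_class: str, is_subnet_mgmt: bool) -> int:
        """Return the QP that traffic of this management class uses."""
        if is_subnet_mgmt:
            return SMI_QP
        # Classes without a redirection entry keep sharing QP 1.
        return self.redirects.get(mgmt_class, GSI_QP)
```

In this model a class that has not been redirected keeps sharing QP 1, so redirecting one busy protocol does not change where other general services traffic goes.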
The lowest layer of the kernel-level InfiniBand stack consists of the HCA driver(s). Each HCA device requires an HCA-specific driver that registers with the mid-layer and provides the InfiniBand verbs.
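The relationship between driver registration and the mid-layer's resource cleanup might look roughly like the following. It is a minimal sketch, assuming a hypothetical `MidLayer` class and a hypothetical `FakeHcaDriver` that hands out integer handles; none of these names come from the OpenIB.org code. It shows a driver registering its verbs with the mid-layer, and the mid-layer releasing whatever a client still holds when the client closes the interface.

```python
class MidLayer:
    """Toy mid-layer: a driver registry plus per-client resource tracking."""

    def __init__(self):
        self.drivers = {}  # device name -> driver object providing verbs
        self.owned = {}    # client name -> list of (device, handle)

    def register_driver(self, device, driver):
        self.drivers[device] = driver

    def open(self, client):
        self.owned[client] = []

    def create_qp(self, client, device):
        # Forward the verb to the HCA driver and remember who owns the result.
        handle = self.drivers[device].create_qp()
        self.owned[client].append((device, handle))
        return handle

    def destroy_qp(self, client, device, handle):
        self.owned[client].remove((device, handle))
        self.drivers[device].destroy_qp(handle)

    def close(self, client):
        """Release whatever the client left behind; return how many were leaked."""
        leaked = self.owned.pop(client)
        for device, handle in reversed(leaked):
            self.drivers[device].destroy_qp(handle)
        return len(leaked)


class FakeHcaDriver:
    """Hypothetical driver that hands out integer QP handles."""

    def __init__(self):
        self.live = set()
        self.next_handle = 1

    def create_qp(self):
        handle = self.next_handle
        self.next_handle += 1
        self.live.add(handle)
        return handle

    def destroy_qp(self, handle):
        self.live.remove(handle)
```

Because every allocation passes through the mid-layer, the same cleanup path can serve both an orderly close and an abnormal termination.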
Kernel-Level Upper Level Protocols
This section describes the upper level protocol (ULP) drivers envisioned for InfiniBand. Some of these drivers are being developed by the OpenIB.org alliance, and others are being developed by other open source projects, as noted below.
IP over InfiniBand (IPoIB) Driver
The IP over IB driver supports tunneling of Internet Protocol (IP) packets over InfiniBand hardware. The driver is implemented as a standard Linux network driver, and this allows any application or kernel driver that uses standard Linux network services to use the InfiniBand transport without modification.
However, to attain full performance and take advantage of some of the advanced features of the InfiniBand architecture, application developers may want to use the Sockets Direct Protocol (SDP) or the Direct Access Programming Library (DAPL) API.
Sockets Direct Protocol (SDP) Driver
The sockets direct protocol driver provides a high-performance interface for standard Linux socket applications and provides a boost in performance by bypassing the software TCP/IP stack.
TCP can be bypassed in this model because InfiniBand hardware provides a reliable transport, which makes the reliability provided by the TCP software protocol redundant. SDP is implemented as a separate network address family. For example, TCP/IP provides the AF_INET address family and SDP provides the AF_SDP (27) address family. To allow standard sockets applications to use SDP without modification, SDP provides a preloaded library that traps the libc sockets calls destined for AF_INET and redirects them to AF_SDP.
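The address family substitution performed by the preloaded library can be illustrated in a few lines. This is a sketch of the decision only, not the library itself: the function name `translate_family` is hypothetical, and the rule that only stream sockets are redirected is an assumption made for the example. The value 27 for AF_SDP is the one given in the text.

```python
import socket

AF_SDP = 27  # address family number given in the text above


def translate_family(family: int, sock_type: int) -> int:
    """Return the family a preloaded shim would request (hypothetical helper)."""
    # Assumption: only AF_INET stream sockets are redirected to SDP.
    if family == socket.AF_INET and sock_type == socket.SOCK_STREAM:
        return AF_SDP
    # Everything else is passed through unchanged.
    return family
```

An application that calls `socket(AF_INET, SOCK_STREAM)` would, under this model, end up with an AF_SDP socket without any change to its source code.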
The kernel-level DAPL (kDAPL) driver provides a kernel-level interface to a remote direct memory access (RDMA) API defined by the DAT Collaborative*. DAPL and kDAPL were defined to allow clients to access the low-level advanced features of several RDMA-enabled fabrics including iWARP, InfiniBand hardware, and Myrinet*. These low-level features, which are common to all of these RDMA fabrics, include queue pair based I/O semantics, RDMA read, RDMA write, and immediate data, in addition to the send/receive-based I/O semantics that are available from the sockets interfaces. Thus, kDAPL is the likely choice for kernel drivers that want to implement an RDMA-enabled upper level protocol (ULP) that can run on any RDMA-enabled fabric.
ULPs that want the lowest level of access directly to the InfiniBand primitives can call the core APIs directly.
SCSI RDMA Protocol (SRP) Driver
SCSI RDMA Protocol (SRP) was defined by the ANSI T10 committee to provide block storage capabilities for the InfiniBand architecture. SRP is a protocol that tunnels SCSI request packets over InfiniBand hardware using this industry-standard wire protocol. This allows one host driver to use storage target devices from various storage hardware vendors.
The SRP driver plugs into Linux using the SCSI mid-layer. Thus, to the upper layer Linux file systems and user applications that use those file systems, the SRP devices appear as any other locally attached storage device, even though they can be physically located anywhere on the fabric.
iSer – iSCSI over RDMA
iSCSI over RDMA (iSer) is a storage protocol that was originally designed for Ethernet by the RDMA Consortium*.
Some of the members of OpenIB.org adopted this protocol and extended it for use on InfiniBand hardware. They have contributed this code to OpenIB.org and are actively working on porting it to use the OpenIB.org stack. The iSer code is being developed using the kDAPL library so that it is able to use other RDMA fabrics in addition to the InfiniBand fabric.
Lustre File System/Portals Driver
Lustre* is an open source project that has developed a clustered file system for Linux.
Lustre runs over various interconnects through a transport abstraction layer called Portals, developed by Sandia National Laboratories*. Portals has been ported to the original first-generation InfiniBand stack*, and it is expected that this code will eventually be ported to use the OpenIB.org (second-generation) InfiniBand stack.
NFS-R – NFS over RDMA
Network File System (NFS) over RDMA is a protocol being developed by the Internet Engineering Task Force (IETF)*.
This effort is extending NFS to take advantage of the RDMA features of the InfiniBand architecture and other RDMA enabled fabrics. Since this effort is targeting kDAPL as its interface to the fabric, the kDAPL work being developed by OpenIB.org will allow NFS-R to use the InfiniBand fabric. An open source project is underway to develop an NFS-R client at http://sourceforge.net/projects/nfs-rdma/*.
User-level InfiniBand Services
The InfiniBand architecture provides the unique attribute of allowing user-space processes to have direct access to the hardware registers. This allows applications to send and receive messages and perform RDMA operations without kernel involvement.
User-Level Core Services
To allow access from user-space, the OpenIB.org stack contains shared libraries that provide interfaces to applications and user-space upper level protocols. These include user-mode verbs, which are almost semantically identical to the kernel mode verbs, user-mode connection management, SA query and MAD services.
Some of the interfaces trap to the kernel, where the kernel-mode InfiniBand core modules proxy the service. These include open/close functions, resource allocation and tracking, memory registration, and completion event signaling. However, for the speed-path operations that send and receive data, the user-space HCA library accesses the hardware registers and the send and receive queues directly through memory-mapped I/O space and kernel memory.
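The practical effect of this split is that the number of kernel transitions does not grow with the number of messages sent. The sketch below makes that concrete under stated assumptions: the operation names are hypothetical labels for the categories listed above, not function names from the user-mode libraries, and `count_kernel_traps` is an illustrative helper.

```python
# Hypothetical labels for the operation categories described in the text.
KERNEL_PROXIED = frozenset({
    "open", "close", "allocate_resource",
    "register_memory", "wait_completion_event",
})
USER_DIRECT = frozenset({"post_send", "post_receive"})


def count_kernel_traps(operations):
    """Count how many operations in a workload would trap to the kernel."""
    traps = 0
    for op in operations:
        if op in KERNEL_PROXIED:
            traps += 1
        elif op not in USER_DIRECT:
            raise ValueError(f"unknown operation: {op}")
    return traps
```

For a workload that opens the device, allocates a resource, registers memory, posts a thousand sends, and closes, only the four setup and teardown steps count as traps in this model.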
User-level DAPL (uDAPL) provides the user-space interface to the RDMA API defined by the DAT Collaborative*. DAPL and uDAPL were defined to allow clients to access the low-level advanced features of several RDMA-enabled fabrics including iWARP, the InfiniBand fabric, and Myrinet. The low-level features that are common to all RDMA fabrics include queue pair based I/O semantics, RDMA read, RDMA write, and immediate data, in addition to the send/receive-based I/O semantics that are available from the sockets interfaces. Thus, uDAPL is the likely choice for developers who want to write an RDMA application that can run on any RDMA-enabled fabric.
Applications that want the lowest level of access directly to the InfiniBand primitives can call the InfiniBand core APIs directly.
Message Passing Interface (MPI)
MPI is a communications message-passing library. MPI is based upon an industry specification that can be found at http://www-unix.mcs.anl.gov/mpi*. Several efforts, both commercial and open source, are underway to provide MPI libraries that run on the OpenIB.org stack. These include the Open MPI project, http://www.open-mpi.org*, the OSU MPI project, http://www.cse.ohio-state.edu*, plus a commercial implementation of MPICH2 from Intel.
OpenSM and Other Management Applications
Residing above the user-mode verbs is the subnet manager OpenSM. In the InfiniBand architecture, the subnet manager is needed to “bring up” the fabric. It discovers all of the nodes on the fabric and assigns the local identifiers (LIDs) in the HCAs. The LIDs are used as part of the address to remote nodes. The OpenSM also sets up the routing tables in the switches to support routing packets between nodes.
The OpenSM application also contains the subnet administrator (SA). Applications send queries to the SA to find out the path records for remote nodes, which are needed to establish connections between endpoints on the fabric.
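The two roles described here, assigning LIDs during fabric bring-up and answering path queries afterwards, can be sketched as follows. This is a toy model, not OpenSM: the function names, the sequential assignment starting at 1, and the two-field path record are all assumptions made for illustration. A real path record carries more information than the source and destination LIDs shown here.

```python
from itertools import count


def assign_lids(nodes, first_lid=1):
    """Give each discovered node a distinct LID, in discovery order."""
    return {node: lid for node, lid in zip(nodes, count(first_lid))}


def path_record(lids, src, dst):
    """Answer an SA-style query with the LIDs needed to address dst from src."""
    return {"slid": lids[src], "dlid": lids[dst]}
```

A client that wants to connect from one host to another would first ask for the path record and then hand the returned addressing information to the connection setup step.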
Support for the InfiniBand architecture has finally arrived in the mainline Linux kernel. Additional drivers and libraries for user-mode support are nearly complete. These are available from www.openib.org*, will be submitted to the mainline kernel in the near future, and should then start to be included in Linux distributions.
Call to Action
If you are a developer and interested in contributing to the InfiniBand architecture open source project, visit www.openib.org* and subscribe to the e-mail list, email@example.com. The code is available for free download from the Subversion repository, http://www.openfabrics.org/resources.htm*. The process for contributing is like that of other open source kernel development projects, such as the linux-kernel e-mail list (lkml). Simply create code patches and submit them to the list for discussion and inclusion. The maintainers are: Sean Hefty, firstname.lastname@example.org, Roland Dreier, email@example.com, and Hal Rosenstock, firstname.lastname@example.org.
About the Author
Bob Woodruff is a Software Engineering Manager at Intel. Prior to his position at Intel, he worked for Cray Research and Floating Point Systems. He has 24 years of industry experience developing and managing the development of operating system software for Linux*, Microsoft Windows*, Unix*, and various other proprietary operating systems. He has been involved with development of the InfiniBand architecture from its inception and with Linux open source development for the last several years, starting the original InfiniBand architecture open source project on SourceForge, http://infiniband.sourceforge.net*, and now with the www.openib.org* project.
- The OpenIB.org Developer’s Workshop proceedings – http://www.openib.org*
- InfiniBand* Trade Association – http://www.infinibandta.org*
- OpenIB.org Alliance – www.openib.org*
- Original InfiniBand* open source project – http://infiniband.sourceforge.net*
- DAT Collaborative – http://www.datcollaborative.org*
- MPI specification – http://www-unix.mcs.anl.gov/mpi*
- MPI open source projects – http://www.open-mpi.org*, http://www.cse.ohio-state.edu*
- Intel MPI commercial product – http://www.intel.com/cd/software/products/asmo-na/eng/308295.htm
- RDMA Consortium – http://www.rdmaconsortium.org*
- Interconnect Software Consortium (ICSC) – http://www.opengroup.org/icsc/*