by TW Burger
If you have a lot of numbers to crunch, a lot of users to support and/or a very little budget, you can use a DOS modified version of Linux* with surplus Intel hardware to create a very impressive system. This paper examines the mechanisms involved and the Open Source development platforms available to create parallel computing solutions.
Distributed Operating Systems
A distributed operating system (DOS) allows the creation, implementation and completion of programs (a single task or a number of different tasks running concurrently – often referred to as parallel processing) that can utilize a network of general purpose and often heterogeneous computers to accomplish a task. This network of computers forms a single multi-computer or cluster. Unlike a parallel processor or SMP (symmetric multiprocessor) system that has common RAM for UMA (uniform memory access) and has other shared resources, a cluster-based system’s individual computers (nodes) have no direct access to the resources of other nodes. Therefore, although conceptually similar in the functions required of an SMP operating system, the cluster OS has additional requirements to provide communication between cluster nodes which may be dissimilar in hardware and local operating systems. This is a heterogeneous cluster.
Distributed Operating Systems (DOS) have the following benefits:
- Speed - A process can be run faster by being divided into subtasks (threads) that are run on two or more nodes. Also, more or dependant processes can be run at the same time. Low powered machines can be combined to reach supercomputer performance by having each process a small portion of the task.
- Fail Safe Operation - Node failure can be compensated by others in the cluster improving program reliability.
- Remote Data Capture - Remote data acquisition systems, like point of sale, require a distributed system to work correctly.
- Off the Shelf - DOS can use cheap, generic and even dissimilar hardware.
- Scalability - DOS can be easily reshaped, expanded and contracted in scope as needs change.
- Cost – Clusters offer a very high performance to price ratio.
A DOS needs to supply these basic services:
- The ability to run processes on remote computers efficiently (distributed scheduling).
- The ability to provide data required for processes and get results from processes.
- The ability to provide resources required for processes.
- Synchronization of processes so process interaction is possible.
A DOS may also provide process migration if required. But, this can cause high overhead and complications in the DOS design.
A DOS is used in the control of clusters. These networked collections of personal computers and workstations are often referred to as a scalable computing cluster (SCC), isolated from a general purpose network of workstations (NOW). A NOW is used by multiple users. An SCC tends to consist of homogenous machines for ease of control and speed, while a NOW can be heterogeneous in th e mix of PCs and workstations. Unlike massively parallel processors (MPP) which typically allow a single user per partition (logical collection of processors), these topographies generally allow multiple users and time sharing. A cluster can emulate an MPP to solve problems using similar parallel programming methods.
Linux as a Basis for Building a Distributed Operating System
Being UNIX-based, Linux has an inherent capacity to be developed into a DOS. Linux Inter-Process Communication (IPC) provides a set of mechanisms for data exchange data and synchronization. These include:
- Shared memory - when one process writes some information in a shared memory, other processes can immediately use it.
- IPC Messages - structures for data exchange between processes using system calls.
- Semaphores – for process synchronization.
Linux is the Preferred Basis for DOS
Synchronicity is often the basis of scientific advancement. At about the same point in time these factors came into play:
- Linus Torvalds' Linux is made available to developers. It’s free, flexible, open source and inexpensive.
- Powerful desktop computers that can run Linux are plentiful and low cost.
- Easy, cheap and standardized networking becomes available
- Demand for distributed operating systems grows at an unparallel rate.
These factors have caused Linux to be adopted as a basis of many distributed operating systems. Because Linux is free with open source code, universities and research labs can experiment with and create application programmer interfaces (APIs, although, strictly speaking many of these tools were system programming tools). These efforts addressed the need for distributed operating systems that were becoming evident in the late 1980s for distributed, as well as massively parallel systems, to be developed.
Once the tools were available several Linux based distributed systems were built. A large number of these are based on a design called a Beowulf cluster.
Mechanisms for Distributed Versions of Linux
Mechanisms for Distributed Versions of Linux
Distributed systems built with Linux are generally based on UNIX scheduling and process migration, and inter-process communication.
Linux DOS scheduling systems are a subject unto themselves and are discussed in the paper: The Story of Distributed Scheduling on Linux. Distributed inter-process communication allows the coordination of and data transfer between tasks running on DOS clusters. There are several versions which are discussed here.
Distributed Inter-process Communication (DIPC)
DIPC is relatively new technology (1997) developed by Mohsen Sharifi and Kamran Karimi of the Computer Engineering Department of the Iran University of Science and Technology.
DIPC extends the Linux IPC (inter-process communications) for distributed process control by replacing it with a system using Distribu ted Shared Memory (DSM) using multiple-readers/single-writer protocol. This method allows standard Linux programs to run in a DOS modified version of Linux without modification to the program’s source code. DIPC is a programming tool, not a DOS. DIPC lets you use System V IPC shared memory and semaphores and message queues transparently across a cluster. DIPC only provides mechanisms for distributed programming. The policies outlining how a program is made to run parallel or where a process should run are determined by the programmer or the end user.
DIPC uses a single-writer/multiple-readers shared memory protocol. Processes on different computers that want to read a shared memory get sent that data. If a machine wants to perform a write on the same data, DIPC forces all the computers that read the shared memory to stop, and then allows the write to be performed. If needed, DIPC will also transfer the data to the computer that wants to become a writer.
The motivation for DIPC was that current distributed programming models required detailed involvement of the programmer in the process of transferring data over the network as with PVM software (see below). There was need for a simpler distributed programming model, usable by more programmers.
Linux IPC processes specify numerical keys to gain access to an IPC structure and can then use these structures to communicate. A key is normally unique to one computer. DIPC makes the IPC keys global so the key is known on any machine. Processes on different computers can communicate with each other the same way they did in a single machine.
Information about the DIPC version of the IPC keys in use is kept by a DIPC process called the referee. The referee is the DIPC name server. Each cluster has only one referee. Having the same referee places computers in the same cluster. All other processes in the cluster refer to this one to find out if a key is in use. The referee makes sure that only one computer at a time will attempt to create an IPC structure with a given key value. The DIPC concept of the referee as a central control entity simplifies design and implementation. However, the need to continuously communicate with the referee machine adds overhead and would create an increasing bottleneck as cluster configurations get bigger.
To simultaneously run the same program, that will use the same IPC keys (like utilities), in all the computers in the cluster, at the same time could create interference. To prevent any unwanted interactions the programs are required to be modified and recompiled using a flag in the IPC structures to declare it as DIPC. The flag is the only requirement to create a DIPC distributed program. Other than this flag to keep DIPC compatible with older programs, the system is totally transparent to programmers.
Local Area Multi-computer (LAM) is an extension of the MPI (message passing interface) communication standard. It is a system developed for heterogeneous computers on a network. LAM utilizes a dedicated cluster or existing network infrastructure to create one parallel computer to solve one problem. It includes a persistent run-time environment for parallel programs providing users with several debugging and monitoring tools .
The LAM team is comprised of students and faculty at Indiana University and has been ongoing for fifteen years.
A message passing interface (MPI) program consists of parallel autonomous processes, executing their own code, in a Multiple Instruction Multiple Data (MIMD) or Single Instruction Multiple Data (SIMD) style. MPI does not specify the execution model for each process. MPI research at the MCS Division at Argonne National Laboratory is funded principally by the Mathematical, Information, and Computational Sciences Division, Office of Advanced Scientific Computing Research, Office of Science of the U.S. Department of Energy.
MPI was designed by the MPI Forum (a collection of programmers and end users) independent of any specific implementation. Rather than a true DOS MPI is a library for writing application programs supporting heterogeneous computing of objects
The architecture that uses MPI is typically distributed or clustered. Each process usually executes in its own address space, but shared-memory implementations are possible. Processes communicate using MPI communication primitive calls. MPI uses point to point message passing and global operations based on a user defined group of processes. MPI also provides inter-application communications
MPI is designed to be used to run one job at a time so it does not provide a scheduling mechanism to specify the initial allocation of processes or binding to physical processors. MPI also does not provide dynamic creation or deletion of processes during execution, so the total number of processes is fixed.
PVM (parallel virtual machine) is a software package that permits a heterogeneous collection of Linux, UNIX and NT networked computers to be used as a single parallel computer. The software is very portable and is used on laptops, workstations and CRAY machines.
PVM is a product of the Computer Science and Mathematics Division of Oak Ridge National Laboratory. It is the source of basic and applied research in high-performance computing, applied mathematics, and intelligent systems.
PVM and MPI Compared
PVM is the existing de facto standard for distributed computing and MPI is being promoted as the message passing standard. MPI is expected to be faster within a large multiprocessor because it has many more point to point and collective communication options than PVM. MPI also has the ability to specify a logical communication topology.
PVM is better when applications will be run over heterogeneous networks having good interoperability between different hosts. PVM allows the development of fault tolerant applications that can survive host or task failures. The PVM model is built around the virtual machine concept - a set of heterogeneous hosts connected by a network that appears logically to the user as a single large parallel computer. This is not present in the MPI model. The virtual machine concept provides the basis for heterogeneity, portability, and encapsulation of functions that constitute PVM.
In PVM, parallel tasks exchange data using simple message passing constructs. The PVM interface is simple to use and understand. Portability was considered much more important than performance for two reasons: communication across standar d networks and the Internet is slow and the research was focused on scaling, fault tolerance, and heterogeneity of the virtual machine rather than process speed. PVM 3.0 was released in 1993 with a completely new API. The new design was required to enable a PVM application to run across a virtual machine composed of multiple large multiprocessors. PVM has continuously evolved to allow the addition third party debuggers and resource managers, transparent utilization of shared memory and support of ATM (asynchronous transfer mode) networks to move data between clusters of shared memory multiprocessors.
In contrast to PVM, which is a continuing research project, MPI was a committee specification derived from meetings in 19931994 of high performance computing experts to avoid duplicating efforts in building proprietary message passing APIs by Massively Parallel Processor (MPP) vendors. MPI allows writing a portable parallel application by standardizing message passing specifications. High performance is the focus of the MPI design. MPI contains the following main features:
- A large set of point to point communication routines and communication routines for communication among groups of processes.
- A communication context that provides support for the design of safe (unlikely to cause problems in other processes) parallel software libraries.
- The ability to specify communication topologies.
- The ability to create derived data types that describe messages of noncontiguous data.
The original MPI (MPI-1) based applications are not portable across a network of workstations because there was no standard method to start MPI tasks on separate hosts. Different MPI implementations used different methods. The new MPI2 specification corrects this problem and adds additional communication functions:
- MPISPAWN functions to start both MPI and non MPI processes.
- One-sided communication functions, such as put and get, that produce less overhead and a more natural interface to many applications.
- Non-blocking collective communication functions
- Language bindings for C++.
Still missing in MPI is the ability to create fault tolerant applications, interoperability among different MPI implementations, and dynamic determination of available resources.
The MPI interface was developed to encompass all message passing constructs and features of various MPPs and networked clusters so that programs would execute on each type of system. An MPI program written for a specific architecture can be copied, compiled and executed to another without modification.
PVM has this level of portability, but is also interoperable. PVM executables can also communicate with each other and run cooperatively across any set of different architectures. MPI would need to check the destination of every message and determine if the destination task is on the same host or on some other host. If it is on another host, using that vendor's MPI implementation, the message cannot be converted into a format that can be understood by the other MPI version.
MPI's lack of platform interoperability is due to the design emphasis on performance. Message passing performance requires that the library is optimized using native hardware code. PVM sacrifices performance in fa vor of the flexibility to communicate across architectures. When communicating locally or to another host of identical architecture, PVM uses native communication functions. When communicating to a different architecture, PVM uses the standard network communication functions. The PVM library must determine from the destination of each message whether to use the native or network communication, adding overhead to handling each message.
PVM and MPI also differ in their approach to language interoperability. PVM allows languages with differing function interfaces and scalar types to pass PVM messages. The MPI standard, for the sake of speed, does not allow this.
Distributed Operating Systems Based on Linux
The Beowulf Cluster
A Beowulf Cluster is a COTS (commodity off the shelf) based system developed by Thomas Sterling and Don Becker at the Goddard Space Flight Center in Greenbelt, Maryland in 1994. It is a distributed parallel super computer (DPSC). In the taxonomy of parallel computers, it is a hybrid of MPP (Massively Parallel Processors, like the nCube, or Cray T3D) and NOWs (Networks of Workstations). Beowulf shares logical topology with MPP systems, but MPPs are usually larger and have a faster interconnect network than a Beowulf cluster using parallel processors with specialized, dedicated connections and not networks PCs. Programs that do not require fine-grain (very frequent) inter-process communication can run effectively on Beowulf clusters. NOW style programming is usually concerned with using unused processor capacity on an installed base of connected workstations. The concern is load balancing problems and large communication latency. A program that runs on a NOW will run at least as well on a cluster. A Beowulf Cluster differs from a NOW in that the nodes in the cluster are dedicated to the cluster. The performance of an individual node is not subject to external factors, easing load balancing problems. The network cluster is isolated from external networks so the cluster network load is determined only by the application(s) being run on the cluster helping, eliminate unpredictable latency.
Beowulf is not a piece of software or a package but rather a method of connecting computers to construct a parallel system. There are standard pieces of software and hardware commonly found in a Beowulf cluster construction, although not all of them are essential. These may include:
- MPICH (portable MPI)
The Beowulf cluster can also use Ethernet channel-bonding (a patch for the Linux kernel that allows bonding multiple Ethernet interfaces into a faster virtual Ethernet interface) and the global PID (process identification) space patch for the Linux kernel. The PID patch allows the user to see all the processes that are running in the cluster.
Many universities have built versions of the Beowulf cluster and modified the basic system for research purposes. The Duke University Physics Department has constructed Brahma; a Linux based Distributed Parallel Supercomputer. The hardware configuration of Brahma is made up with Intel® Pentium® processor-based PCs linked with 100BaseT Ethernet. ; Brahma is not a true Beowulf cluster in that some of the CPUs are linked to a standard network. This is due to practical considerations, as these are powerful workstations that are required for other uses. It does show that Beowulf can be made tolerant of external workloads and network traffic. Finely grained code synchronization tasks are confined to a dedicated subset of faster machines so that network traffic interference is not an issue.
The parallelization of code on Brahma is accomplished via the PVM and MPI libraries. Future Brahma research will likely be in experimentation with the relatively new DIPC threaded models, and work on custom communications methodologies to optimize performance on multiple processor systems for IPC-bound code. Programming is done in C, C++ and FORTRAN with GNU programming utilities.
The Current Uses and Future of DOS
As with all emerging technologies, the eventual establishment of a standard will be required to entrench this standard into mainstream use. A set of universal standards is being established for distributed computing, and with interest from major hardware and software vendors growing this should only be a matter of time. Organizations like the Linux Clusters institute are working to standardize high performance computing using Linux clusters.
Linux clusters have been making headway into government and commercial applications. The IBM Linux Cluster was chosen by the National Center for Supercomputing Applications. The configuration consists of one cluster with 512 IBM xSeries x300 thin servers, each with two Intel Pentium III processors running Red Hat Linux*. A second cluster of160 IBM IntelliStation 6894 Z* Pro workstations, each with two Intel Itanium® processors is also used. The NCSA Linux cluster is capable of a peak performance of two teraflops. This triples the capability available to the national community of researchers and industrial partners who compute at NCSA.
Daimler Chrysler deployed an IBM Linux cluster for crash test simulation on a cluster that supports up to 512 nodes of IBM IntelliStation M Pro 6850 workstation powered by dual Intel® Xeon® processors operating at 2.2GHz.
Linux and PC-based distributed operating systems are currently mostly experimental projects and/or research tools of government, universities and large corporations. There is no official standard for a Linux based DOS, although MPI and PVM are popular tools and are considered de facto standards.
Red Hat and IBM both provide commercial solutions with IBM Linux Cluster* and Red Hat Cluster Suite. The predecessor - Red Hat Extreme Linux*, aka The Beowulf Project, was a collaboration between Red Hat, Inc. and the NASA Goddard Space Flight Center and is no longer an official Red Hat project.
Numerous resources are available for gathering detailed information about Linux based DOS. Documentation and software are available for free download, though the task of setting up a DOS will be far from plug and play.
Practical considerations include:
- Where do you get the computers to network into a cluster?
- How much power will they use?
- Can I stand the noise of all of those fans?
- Do I really need to simulate the Earth’s weather for the next 50 years?
About the Author
Thomas Wolfgang Burger is the owner of Thomas Wolfgang Burger Consulting. He has been a consultant, instructor, analyst and applications developer since 1978. He can be reached at firstname.lastname@example.org
Beowulf Job Manager (The), http://hendrix.imm.dtu.dk/software/jobd/*
Beowulf Project, http://www.beowulf.org/*
DIPC: The Linux Way of Distributed Programming, by Mohsen Sharifi and Kamran Karimi, Linux Journal #57 January 2000, http://www.linuxjournal.com/article/2417*
Distributed lock manager, Linux IBM, http://www.ibm.com/developerworks/opensource/?dwzone=opensource*
High Performance Computing/Supercomputing, Barry Kaplan, Electronic Commerce System, Internet Division, IBM Canada Ltd., http://www.cfdsc.ca/bulletins/07/0704.html *
Implementation of Distributed Process Management Protocol Server on Linux, Vikas Agarwal, March, 2000, http://www.cse.iitk.ac.in/research/mtech1998/9811126.html*
LAM - MPI http://www.lam-mpi.org/*
Linux Cluster HOWTO, Ram Samudrala (email@example.com), http://www.ram.org/computing/linux/linux_cluster.html.*
Linux Clustering information Center, http://www.lcic.org/*.
Open Distributed Systems, Jon Crowcroft
Parallel Virtual File System (The), http://parlweb.parl.clemson.edu/pvfs/index.html*
Review of Operating Systems, http://tunes.org/Review/OSes.html*