by John Sharp, Content Master Ltd
Specialized software creates huge, aggregate virtual computers from dispersed machines.
Grid computing allows supercomputer-class problems to be addressed with networks of commodity computers. The Grid combines distributed data and resources into a single transparent namespace, providing seamless, scalable access to wide-area distributed resources in an unobtrusive and robust manner. Using Grid computing protocols, geographically dispersed machines can collaborate, pooling their resources to perform complex tasks.
The connectivity of the Internet allows computers located anywhere on the planet to participate in the same Grid. Grids allow resources to be shared and selected for appropriate tasks, and the results of those tasks are then aggregated. The resources used are not necessarily restricted to raw CPU cycles; they could be data storage or other specialized machinery.
A Grid can combine an extensive range of heterogeneous hardware. Establishing a Grid is a non-trivial task, however, due to the range of technologies available and the need to respect the autonomy and privacy of resource owners. There is also a high likelihood of resource failure somewhere in the Grid. A well-designed Grid must address these issues.
This paper describes the high-level architecture of a typical Grid and how the resources interact. It also defines how Grid computing differs from other distributed computer architectures such as clusters.
What is a Grid?
Advances in computer technology, including ever more powerful hardware and increasingly sophisticated software, have made it possible to apply computers to solving a wide range of complex problems in the fields of science, engineering, and business. Examples include performing molecular modeling for drug design, brain activity analysis, many calculations in the realm of high-energy physics, and the SETI project (the Search for Extraterrestrial Intelligence).
Many problems remain beyond the capabilities of the current generation of supercomputers, however. Furthermore, these problems often require access to resources that are rarely found on a single computer. Grid computing provides one type of solution to these issues.
The original purpose behind Grid computing was to link together supercomputers spread across wide distances, but the aims have since moved beyond this scope. The term Grid was coined by analogy with the electrical power grid, which supplies consistent, dependable, and transparent access to electricity. Grid computing is intended to provide an equally consistent, dependable, and transparent collection of computing resources.
A Grid comprises a network of resources, each of which operates autonomously under local control while collaborating and communicating with the others. In this respect, a Grid differs from other architectures such as a cluster, where distributed resources are typically owned and managed through a centralized resource management and scheduling system: all users of a cluster connect through that system, which allocates resources to tasks.
Grids can be constructed using entire clusters as nodes in the Grid, together with other localized low-level middleware systems. Grids can additionally make use of other distributed paradigms; the Globus OGSI* (Open Grid Services Infrastructure) is based on Web services, for example.
Grid Infrastructure Requirements
A key precept of the Grid paradigm is that Grids should be transparent and seamless. Users, applications, and services should be able to view the Grid as a single (albeit gargantuan) virtual computer. Grid architectures are based on resource brokers, resolvers, and other pieces of Grid middleware that perform resource discovery, scheduling, and processing of jobs.
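The discovery and scheduling role of a resource broker can be illustrated with a minimal Python sketch. The class names, fields, and the "most free CPUs" selection rule are assumptions made for illustration; they are not taken from Globus or any other Grid toolkit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """A resource advertised to the broker (all fields are illustrative)."""
    name: str
    free_cpus: int
    free_storage_gb: int


@dataclass
class Job:
    """A job description submitted by a user or application."""
    name: str
    cpus: int
    storage_gb: int


class ResourceBroker:
    """Hypothetical broker: owners advertise resources, users submit jobs."""

    def __init__(self) -> None:
        self._resources = []

    def advertise(self, resource: Resource) -> None:
        """Resource discovery, simplified: owners register what they offer."""
        self._resources.append(resource)

    def schedule(self, job: Job) -> Optional[str]:
        """Pick the candidate with the most free CPUs and reserve capacity on it.

        Returns the chosen resource name, or None if nothing can run the job.
        """
        candidates = [
            r for r in self._resources
            if r.free_cpus >= job.cpus and r.free_storage_gb >= job.storage_gb
        ]
        if not candidates:
            return None
        chosen = max(candidates, key=lambda r: r.free_cpus)
        chosen.free_cpus -= job.cpus
        chosen.free_storage_gb -= job.storage_gb
        return chosen.name


broker = ResourceBroker()
broker.advertise(Resource("cluster-a", free_cpus=64, free_storage_gb=500))
broker.advertise(Resource("workstation-b", free_cpus=4, free_storage_gb=50))

first = broker.schedule(Job("molecular-model", cpus=32, storage_gb=100))
second = broker.schedule(Job("big-analysis", cpus=48, storage_gb=100))
```

The caller never names a machine. It describes what the job needs and the broker decides where it runs, which is what lets users treat the Grid as a single virtual computer.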
In order to maintain the seamless nature of a Grid, any architecture must consider a number of issues, including the following:
- The need to respect the local autonomy of the various administrative domains that comprise the Grid. The systems linked together will be managed by local administrators who must be allowed to implement their own security policies and protect their own resources as they see fit.
- The heterogeneity of the hardware. The computing resources in a Grid will inevitably span a variety of machine types and platforms.
- An appreciation of the dynamic nature of the Grid. Computers may join or leave the Grid at any time. The architecture implemented by the Grid must be scalable, supporting anything from a small number of nodes to thousands of computers, without imposing an overhead that degrades performance.
- The importance of resilience. In any network, the chance that at least one node has failed increases as more nodes are added to the system. In a network involving many thousands of computers, it is likely that at least some computers will be offline. The Grid must be able to adapt dynamically, maintaining an up-to-date catalog of available resources.
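One common way to keep a catalog of available resources current is to have nodes report in periodically and to drop any node that falls silent. The sketch below assumes that approach. The class name, the 30-second timeout, and the explicit `now` parameter (used so the example is deterministic) are all illustrative choices.

```python
import time
from typing import Optional


class ResourceCatalog:
    """Tracks which nodes are currently believed to be available."""

    def __init__(self, timeout_s: float = 30.0) -> None:
        self.timeout_s = timeout_s
        self._last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: Optional[float] = None) -> None:
        """Record that a node is alive; a first heartbeat is also a join."""
        self._last_seen[node] = time.monotonic() if now is None else now

    def leave(self, node: str) -> None:
        """A node may withdraw from the Grid at any time."""
        self._last_seen.pop(node, None)

    def available(self, now: Optional[float] = None) -> list:
        """Return nodes heard from within the timeout, dropping stale entries."""
        now = time.monotonic() if now is None else now
        stale = [n for n, t in self._last_seen.items() if now - t > self.timeout_s]
        for node in stale:
            del self._last_seen[node]
        return sorted(self._last_seen)


catalog = ResourceCatalog(timeout_s=30.0)
catalog.heartbeat("node-1", now=0.0)
catalog.heartbeat("node-2", now=0.0)
catalog.heartbeat("node-3", now=0.0)
catalog.heartbeat("node-1", now=25.0)   # node-1 keeps reporting in
catalog.leave("node-3")                 # node-3 leaves voluntarily
live = catalog.available(now=40.0)      # node-2 has been silent for 40 s
```

Here only `node-1` remains in the catalog: `node-3` left explicitly and `node-2` exceeded the timeout. Joins, departures, and failures are all handled through the same small interface.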
A Grid must be non-intrusive to applications, services, and users not making use of it. There should be no observable degradation in service to local users accessing a computer that is also part of a Grid. This goal can be accomplished by careful scheduling of Grid tasks and by ensuring that those tasks execute at a suitably low priority.
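On Unix-like systems, one simple way to run a Grid task at low priority is to raise the niceness of the worker process before it starts work. The sketch below assumes a dedicated worker process per task. The helper names are hypothetical, and `os.nice` is not available on every platform, so the code falls back to doing nothing.

```python
import os


def lower_priority(increment: int = 10) -> int:
    """Raise this process's niceness so local work is scheduled first.

    Returns the new niceness, or -1 where os.nice is unavailable.
    """
    if not hasattr(os, "nice"):
        return -1
    return os.nice(increment)


def run_grid_task(task, *args):
    """Run a Grid task in the current (worker) process at reduced priority."""
    lower_priority()
    return task(*args)


result = run_grid_task(sum, [1, 2, 3])
```

Because a process cannot normally lower its niceness again without elevated privileges, this is best done in a worker process that exists only to run Grid tasks.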
Types of Grid
Many Grid implementations are oriented toward supplying specific types of resources. Grids can be categorized according to these resources. The most common types of Grids are Computational Grids, Data Grids, and Application Grids:
- Computational Grids provide resources for executing tasks, using spare CPU cycles on networked computers. Grid tasks are often scheduled to run as background tasks, to be performed when no higher priority local jobs are being executed. The World Wide Grid (WWG) and NSF TeraGrid are examples of this model.
- Data Grids provide secure access to, and management of, large distributed datasets. A data Grid typically implements replication and catalog services, giving the illusion that the entire dataset is actually held on a single piece of data storage. The data is usually processed using a computational Grid.
- Application Grids extend the notions of computational and data Grids to provide transparent access to remote libraries and applications. In many instances, they can be implemented using Web services acting as facades for remote services in conjunction with UDDI (Universal Description, Discovery, and Integration), providing location transparency.
Other types of Grid are available; Knowledge Grids, for example, provide services that use information to help solve particular problems using specific algorithms. This is essentially a high-level form of computational Grid, where the logic is advertised and provided by the Grid itself rather than a client application.
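The replication and catalog services of a data Grid, described above, can be pictured as a mapping from one logical file name to several physical copies. The following is a minimal sketch under that assumption. The class, the `lfn:` naming prefix, and the example hosts are hypothetical.

```python
class ReplicaCatalog:
    """Maps a logical file name to the physical locations holding a copy."""

    def __init__(self) -> None:
        self._replicas = {}  # logical name -> list of physical URLs

    def register(self, logical_name: str, physical_url: str) -> None:
        """Record that a copy of the logical file exists at physical_url."""
        locations = self._replicas.setdefault(logical_name, [])
        if physical_url not in locations:
            locations.append(physical_url)

    def locate(self, logical_name: str, preferred_site: str = "") -> str:
        """Return one replica, favouring a URL that mentions the preferred site."""
        locations = self._replicas.get(logical_name)
        if not locations:
            raise KeyError(f"no replica registered for {logical_name!r}")
        for url in locations:
            if preferred_site and preferred_site in url:
                return url
        return locations[0]


# Hypothetical hosts and paths, for illustration only.
replicas = ReplicaCatalog()
replicas.register("lfn:run42.dat", "gsiftp://storage.site-a.example/data/run42.dat")
replicas.register("lfn:run42.dat", "gsiftp://storage.site-b.example/mirror/run42.dat")

near = replicas.locate("lfn:run42.dat", preferred_site="site-b")
default = replicas.locate("lfn:run42.dat")
```

Applications refer only to the logical name, so the dataset appears to live in one place even though the catalog may hand back a different physical copy each time.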
Grid Components and Services
A Grid must be designed to provide services that hide the underlying differences between the computers in the network and present a single, unified view of the entire scheme:
- Communications. A Grid can comprise a variety of network technologies of varying quality, and it can implement many different protocols. The communications infrastructure provided by a Grid must be robust enough to handle and resolve communications failures between nodes, and it must support protocols that can transmit many diverse types of data in a reliable manner. Many of these features are inherent in existing Internet protocols. Grid-specific protocols such as GridFTP (based on the standard File Transfer Protocol) are available that can transfer data across a Grid in a reliable manner. GARP (the Grid Area Routing Protocol) is a resilient protocol that provides timely information about the state of resources throughout the Grid.
- Authentication and Authorization. In any networked environment, security is a complex issue. With Grids, that complexity is particularly acute. Grid security must interoperate with local security systems. Many Grid implementations take advantage of widely adopted, proven technologies, such as Kerberos and public-key encryption.
- Naming Services and Location Transparency. Resources must be identifiable and locatable. A single uniform namespace that spans the entire Grid is essential. A Grid-wide directory service such as Grid Index Information Services (GIIS) can combine views from multiple local catalogs, usually based on standard protocols such as LDAP (Lightweight Directory Access Protocol).
- Distributed File System. Distributed applications executing on a Grid need access to data held in files that may be spread across a large number of computers. A distributed file system provides a single view of the data storage available throughout the Grid and makes the physical location of files transparent to applications accessing those files.
- Resource Management. Different networked applications generate varying traffic flows, with periods of high and low latency. A Grid must provide a sufficient quality of service to cater to these differing rates and to ensure resource availability whenever possible. From a user's perspective, resource management should be transparent. The Globus project provides GARA* (the Globus Architecture for Reservation and Allocation), allowing advance reservation and end-to-end management of the quality of service for Grid resources. The Grid Resource Allocation Manager* (GRAM) provides an interface to operating system-specific scheduling facilities. GARA and GRAM both make use of HTTP and TCP/IP to transmit data.
- Fault Tolerance. It is vital that Grids provide tools for monitoring, maintaining, and reconfiguring resources. These tools can be used to implement transparent failover in the event that a particular resource becomes unavailable.
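Transparent failover, the last item in the list above, amounts to retrying a task on another resource when the first choice is unavailable. The sketch below assumes each resource is represented by a callable that raises `ConnectionError` when it cannot be reached. The function and node names are illustrative only.

```python
def run_with_failover(task, resources):
    """Try each resource in turn; return (name, result) from the first success."""
    errors = []
    for name, execute in resources:
        try:
            return name, execute(task)
        except ConnectionError as exc:
            # Remember why this resource failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all resources failed: " + "; ".join(errors))


def offline_node(task):
    """Stand-in for a resource that has dropped out of the Grid."""
    raise ConnectionError("node unreachable")


def healthy_node(task):
    """Stand-in for a working resource; the 'task' is just a list to sum."""
    return sum(task)


used, result = run_with_failover(
    [1, 2, 3],
    [("node-1", offline_node), ("node-2", healthy_node)],
)
```

The caller receives a result without needing to know that `node-1` was tried first and failed. A production system would also feed such failures back into its catalog of available resources.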
The facilities of a Grid should be easily accessible to users and administrators. It is common, therefore, to provide graphical interfaces that allow users to submit jobs and monitor tasks as they are executed by the Grid. The Internet supplies an ideal framework for providing access to remote services, due to the connectivity available and the portable nature of the interfaces that can be generated.
Grid portals are Web sites comprising components that allow a user to submit tasks to a Grid and view results. Toolkits, such as NPACI GridPort*, are available for building Grid portals.
Applications developed to take advantage of a Grid can be built using Grid-enabled tools and technologies. A Grid should provide the interfaces, libraries, utilities, and programming APIs to support the development effort required. Common tools and libraries for building Grid applications include High Performance C++ (HPC++) and the Message Passing Interface (MPI).
HPC++ is a set of C++ tools and libraries, developed by the HPC++ Consortium, that supports a portable model for parallel programming in C++. MPI is a portable specification that supports message passing across a range of environments. MPI supports many different platforms, including highly parallel multi-processor computers, tightly connected clusters, and loosely connected heterogeneous networks. MPI has language bindings for C, C++, and Fortran.
A Grid is an architecture for integrating networks of standard computers and specialized hardware into a large virtual computer with the resources required to handle highly complex calculations. A Grid can make resources of an unprecedented size available, while providing substantial economies of scale. Grid technologies being developed provide an infrastructure that makes the location, hardware type, and operating system of computers participating in a Grid transparent.
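The message-passing style that MPI standardizes can be illustrated without MPI itself. The sketch below is an in-process analogy that uses only Python's standard library: threads stand in for ranks and queues stand in for messages. It is not MPI code, and the function names are made up for the example.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive a chunk of numbers, compute a partial sum, send it back."""
    chunk = inbox.get()
    outbox.put((rank, sum(x * x for x in chunk)))


def sum_of_squares(numbers, n_workers=4):
    """Scatter the input across workers, then gather and combine the partials."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()
    # Scatter: each worker gets every n-th element, starting at its rank.
    for rank, inbox in enumerate(inboxes):
        inbox.put(numbers[rank::n_workers])
    # Gather: collect exactly one partial result per worker.
    partials = dict(results.get() for _ in range(n_workers))
    for t in threads:
        t.join()
    return sum(partials.values())


total = sum_of_squares(list(range(10)))
```

The workers share no data structures other than the queues, so all coordination happens through explicit messages. That is the property that lets the same program structure span separate machines when a real MPI implementation carries the messages.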
Grids have to address a number of technical issues, implementing a single, seamless view of the computing resources available while allowing those same resources to be controlled and secured by local administrators.
The Internet allows networks to be built from large numbers of computers, which provide geographically dispersed users with access to high-performance hardware. Technologies based on the World Wide Web and Web services can be used to provide the foundation services of a Grid, which additional Grid-specific tools and protocols can then exploit.
The following resources provide additional details about various aspects of Grid computing:
- The World Wide Grid*
- NSF TeraGrid*
- The Globus Alliance*
- High-Performance C++*
- The Message Passing Interface Forum*