Last Modified On: August 14, 2009 12:10 PM PDT
by Andrew Binstock
For years, universities and research institutions have been trying to develop supercomputing power that could scale incrementally and inexpensively. Their quest led them to consider a distributed computing model, called grid computing, in which multiple sites are networked together with high-speed interconnects. Using the high-bandwidth network backbone and specialized middleware, the computing resources of each site are then merged into one large computing grid. By aggregating computing power in this way, problems that involve very large data sets can be attacked. Grids are particularly good at solving problems like simulations of nuclear tests, modeling the behaviors of drug interactions, testing aerodynamics in a virtual wind tunnel, and the like.
In recent years, initial applications of grid computing have made their way into mainstream business computing. For example, IBM is using grids to perform data mining. Movie studios such as Pixar use grids to render special effects. And small-scale grids are also showing up in the consumer space, curiously, in hosting Web-based multiplayer games such as those hosted by butterfly.net. As players log in, the hosting companies progressively enable more nodes in the grid, and so the horsepower needed to deliver real-time interaction and no-hiccup graphics scales smoothly with the number of players.
While the widespread use of grid computing by IT sites is still viewed as several years away, due primarily to the lack of management tools for this model, the necessary infrastructure is evolving quickly: standard tools such as the Globus Toolkit are emerging, as are standards themselves (for example, the Open Grid Services Architecture). Market-leading vendors such as IBM and Hewlett-Packard have announced active support. And so has Intel… in a big way.
The world's largest manufacturer of semiconductors has emerged as a key player in the development of grids by virtue of the favorable ratio of computing power to dollar cost its processors provide. Since grids depend on the ability to put together low-cost clusters that can easily be assembled and placed on the grid, machines with attractive price-performance fit the model particularly well.
In a focused quest to use the grid model to build a distributed supercomputer that could handle the biggest research problems in the US, the National Science Foundation granted $53 million to a group of computer companies to build the Distributed Terascale Facility (DTF), a massive grid using IBM computers running Linux* on Intel processors. This grant was then broadened to develop an Extended Terascale Facility (ETF). The two projects are colloquially referred to jointly as TeraGrid, a reference to the grid's ability to perform trillions of floating-point operations per second (teraflops) and store trillions of bytes of data (terabytes).
The TeraGrid, as currently funded, is expected to be located in four research centers. The principal processing center will be hosted at the National Center for Supercomputing Applications (NCSA), which is part of the University of Illinois at Urbana-Champaign. When completed in 2003, the NCSA portion will host 2,024 Intel Itanium 2 processors configured as 512 nodes, which together are expected to deliver 8 teraflops of processing power. Supporting this processing power will be 250 TB of online storage. A sister center at the Argonne National Laboratory in Argonne, Illinois, will use 384 Itanium 2 processors in 96 nodes supported by 125 TB of storage. These two centers will be networked to two centers in Southern California. The larger of these California sites will be hosted at the San Diego Supercomputer Center (operated by the University of California at San Diego) in La Jolla. It will offer 3 teraflops of horsepower via 768 Itanium 2 processors arrayed in 192 nodes and supported by 250 TB of storage. A companion site in Pasadena, Calif., will be hosted at the California Institute of Technology (Caltech). Its configuration will be similar to the one at Argonne. All told, these four sites are expected to deliver nearly 14 teraflops of computing power supported by 750 TB of storage. As such, TeraGrid will be the world's largest and fastest distributed infrastructure for open scientific research, and the most powerful supercomputer in North America.
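The per-site figures above can be cross-checked with a few lines of arithmetic. The following is only a sketch: the `Site` record and helper names are hypothetical, and the Caltech row is an assumption, since the text says only that its configuration is similar to Argonne's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    processors: int
    nodes: int
    storage_tb: int


# Figures as quoted in the article. The Caltech row is an ASSUMPTION:
# the article only says its configuration is "similar" to Argonne's.
SITES = [
    Site("NCSA", 2024, 512, 250),
    Site("Argonne", 384, 96, 125),
    Site("SDSC", 768, 192, 250),
    Site("Caltech (assumed)", 384, 96, 125),
]


def total_storage_tb(sites):
    """Sum the online storage across all sites, in terabytes."""
    return sum(s.storage_tb for s in sites)


def processors_per_node(site):
    """Average number of processors in each node at a site."""
    return site.processors / site.nodes


if __name__ == "__main__":
    for s in SITES:
        print(f"{s.name:18s} {s.processors:5d} CPUs / {s.nodes:3d} nodes "
              f"= {processors_per_node(s):.2f} per node")
    print("Total storage (TB):", total_storage_tb(SITES))
```

Under that assumption the storage adds up to the 750 TB quoted for the whole grid. Argonne and San Diego work out to exactly four processors per node, while the NCSA figures as quoted come to just under four.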
These four sites will be networked using a backbone of multiple 10 Gb/sec. optical fibers that support dense wavelength-division multiplexing (DWDM), a technique that enables considerable optical capacity by permitting multiple wavelengths to run simultaneously on one fiber. This high-bandwidth backbone runs from Los Angeles to Chicago. Smaller-capacity segments will run from the hubs to each of the individual sites. This layout, in which the backbone does not attach to any site, cannot be avoided because no 10 Gb/sec. fibers currently run to any of the sites. By the end of 2002, a map of the TeraGrid network will look like Figure 1. The backbone links use Qwest's optical network, while local segments are provided by various companies and resources, including the state-funded I-Wire project in the Chicago area and a fiber campus loop at the University of Illinois. The backbone is expected to carry 4 DWDM wavelengths on the fibers, achieving multiple pipes of 10 Gigabit Ethernet.
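To give a feel for what four wavelengths buy, here is a rough back-of-the-envelope calculation. It assumes each wavelength carries one 10-gigabit-per-second Ethernet channel and ignores protocol overhead, so the result is an idealized upper bound rather than a measured figure; the function names are illustrative only.

```python
WAVELENGTHS = 4            # DWDM wavelengths on the backbone, per the article
GBITS_PER_WAVELENGTH = 10  # ASSUMPTION: one 10 Gb/s channel per wavelength


def backbone_gbps(wavelengths=WAVELENGTHS, gbps_each=GBITS_PER_WAVELENGTH):
    """Aggregate backbone capacity in gigabits per second."""
    return wavelengths * gbps_each


def ideal_transfer_seconds(terabytes, gbps):
    """Time to move a data set at full line rate, with no protocol overhead."""
    bits = terabytes * 1e12 * 8  # decimal terabytes to bits
    return bits / (gbps * 1e9)


if __name__ == "__main__":
    capacity = backbone_gbps()
    print("Aggregate capacity (Gb/s):", capacity)
    print("Seconds to move 1 TB:", ideal_transfer_seconds(1, capacity))
```

On these assumptions the backbone offers 40 Gb/s in aggregate, and a 1 TB data set would take on the order of 200 seconds to cross it at best. That is small next to hundreds of terabytes of storage, which is why data placement matters so much in a grid of this kind.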
The differing levels of processing power at each site are a function of the different work each site will perform: Caltech will perform data mining, San Diego will explore database technologies, Argonne will handle data visualization, and NCSA, with the bulk of the processing power, will handle the computation-intensive work.
The individual nodes are built by IBM using both 32-bit and 64-bit Intel processors. Because one of the key specifications of TeraGrid was the use of open-source operating systems, Red Hat's off-the-shelf Linux* (v. 7.2) was chosen to run the machines. Myrinet*, a high-speed cluster interconnect from Arcadia, Calif.-based Myricom, carries the traffic that coordinates computing functions between nodes.
When researchers want to use the TeraGrid systems, they will simply log in or FTP into an interactive node. This node hooks them to a management node that performs job control. The management node in turn distributes the processing across local or remote nodes depending on the availability of resources and the nature of the computation. The storage is shared across the systems and placed where it is most appropriate. Questions of locality will be paramount in data placement so as to minimize the transportation of very large data sets back and forth across the network backbone. Figure 2 shows a rough diagram of how the clusters are used.
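The placement decision described above can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not TeraGrid's actual job-control software: the `Cluster` and `Job` records, the `place` function, and the ranking rule (prefer a cluster that already holds the data, then the local cluster) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    free_nodes: int
    datasets: set = field(default_factory=set)


@dataclass
class Job:
    name: str
    nodes_needed: int
    dataset: str


def place(job, local, remotes):
    """Choose a cluster for a job, or return None if nothing has capacity.

    Hypothetical policy: among clusters with enough free nodes, prefer one
    that already stores the job's data set (so the data need not cross the
    backbone), and break ties in favor of the local cluster.
    """
    candidates = [c for c in [local, *remotes]
                  if c.free_nodes >= job.nodes_needed]
    if not candidates:
        return None  # no capacity anywhere; the job would have to wait

    def rank(cluster):
        # False sorts before True, so "has the data" and "is local" win.
        return (job.dataset not in cluster.datasets, cluster is not local)

    best = min(candidates, key=rank)
    best.free_nodes -= job.nodes_needed  # reserve the nodes
    return best


if __name__ == "__main__":
    # Node counts and data set names here are made up for illustration.
    ncsa = Cluster("NCSA", free_nodes=4, datasets={"climate"})
    sdsc = Cluster("SDSC", free_nodes=64, datasets={"genome"})
    chosen = place(Job("align", 2, "genome"), local=ncsa, remotes=[sdsc])
    print(chosen.name)
```

In this sketch a job submitted at one site runs remotely when the remote cluster already holds its data, and falls back to whichever cluster has free nodes when the preferred one is full. A real scheduler would also weigh the nature of the computation, which this example leaves out.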
The TeraGrid network is designed for easy expandability. Because it is built from clusters and off-the-shelf components, the sponsors of the project expect that many nodes will be added over the coming years. Small single nodes as well as entire computing facilities at other campuses are under active consideration. At the moment, nodes are planned at the Pittsburgh Supercomputing Center in Pennsylvania (a collaboration of Carnegie Mellon University and the University of Pittsburgh), as well as on separate segments that arise from partnerships with other organizations, both commercial and purely academic.
To get the TeraGrid project underway quickly, existing systems were pressed into service. IBM x330 rack-mounted systems using Pentium® III Xeon® processors were brought in to form initial clusters for testing of the TeraGrid software and the network.
The use of the 32-bit Pentium III Xeon processors in this original configuration is by no means a stop-gap deployment. Consider that the machine in question runs 512 of the chips in dual-processing nodes. As such, this system is capable of delivering 1 teraflop, which is sufficient to place it in the top 75 systems worldwide in terms of processing power, as of mid-2002. It will become a permanent part of the TeraGrid.
A second cluster consisting of 160 of the original Itanium processors will supply an additional 1 teraflop. These systems foreshadow the arrival of large nodes built on Itanium 2 processors. Those nodes, with hundreds or even thousands of processors, are currently in final beta test and should begin deployment in late 2002, at which time TeraGrid should rev up quickly towards the specifications discussed previously.
At the Argonne National Laboratory site, 96 Pentium 4 processor-based IBM IntelliStation workstations are being installed to serve as smaller-scale nodes that provide access to the TeraGrid and perform some local computing.
Paralleling this effort is extensive testing of the network and backbone fiber configuration with the goal of going live with much of the optical bandwidth before year end.
From its initial announcement, the TeraGrid project relied on Intel's 32- and 64-bit architectures for the processing power. The original TeraGrid specification called for Itanium 2-based systems to do most of the heavy lifting.
This processor makes considerable sense for projects like TeraGrid because of several salient factors:
Between the performance, scalability, compatibility, and favorable price-performance ratio, the Itanium 2 processor seems the natural choice to meet TeraGrid's goals and implementation requirements.
I am indebted to Charlie Catlett of the Argonne National Laboratory, whose work was the inspiration for the illustrations, and to Mike Pflugmacher of NCSA for his kind recitals of the current designs and project status.
