Scale large computation problems out to multiple machines in a virtualized environment. Problems of this type are a staple of high-performance computing.
Use grid computing, which we call “Scale-Out Virtualization.” Grid computing technology has undergone a significant evolution over the past three or four years: the grid has been gradually moving from its High-Performance Computing (HPC) roots in university and government labs to more "mainstream" enterprise applications such as financial models and graphical rendering for motion pictures. We call grids applied in this domain "enterprise grids." Most enterprise grids run on server clusters within the data center.
The concept of enterprise grids is essentially server virtualization turned inside out, and it represents the next step in workload disintermediation (decoupling applications from the physical platforms that run them): server virtualization allows multiple logical servers to run on one physical server, each running one application. Conversely, in an enterprise grid environment, more than one server (a node, in grid parlance) can be applied to a single application. We call this "Scale-Out Virtualization."
Enterprise grids of various sizes are being deployed in different areas. Grids with 8-64 nodes are common for Computer Aided Design (CAD) and Electronic Design Automation (EDA) applications. Larger grids with up to 256 nodes are common in financial services, oil exploration, and pharmaceuticals.
Some of these problems are characterized as "embarrassingly parallel": they can be partitioned so that the computation and associated data sets for each part are isolated to individual nodes, with very little communication between nodes. Monte Carlo simulation for investment portfolio analysis is an example of such a problem (see the sketch below). Today's commodity servers and networks (100 Mb/s or Gigabit Ethernet) can be used to solve these problems on large grids.
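The sketch below illustrates why such a workload partitions so cleanly: each node draws its own independent samples, and only the small per-node result arrays need to be combined. It is a minimal Python example, not taken from any production system; the portfolio parameters are illustrative assumptions, and a local process pool stands in for the grid nodes.

```python
# Minimal sketch of an "embarrassingly parallel" Monte Carlo portfolio
# simulation. Each worker (standing in for a grid node) generates its own
# independent scenarios, so there is no inter-node communication.
import numpy as np
from multiprocessing import Pool

MEAN_RETURNS = np.array([0.07, 0.05, 0.03])   # assumed annual asset returns
COV = np.diag([0.04, 0.02, 0.01])             # assumed covariance matrix
WEIGHTS = np.array([0.5, 0.3, 0.2])           # assumed portfolio weights

def simulate(args):
    """One node's share of the work: draw its own samples independently."""
    n_samples, seed = args
    rng = np.random.default_rng(seed)
    scenarios = rng.multivariate_normal(MEAN_RETURNS, COV, size=n_samples)
    return scenarios @ WEIGHTS                # portfolio return per scenario

if __name__ == "__main__":
    n_nodes, samples_per_node = 4, 250_000
    with Pool(n_nodes) as pool:
        parts = pool.map(simulate, [(samples_per_node, s) for s in range(n_nodes)])
    returns = np.concatenate(parts)
    print(f"mean return {returns.mean():.4f}, 5% VaR {np.percentile(returns, 5):.4f}")
```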
Other problems may not be so easily partitioned, due to the need to move data between nodes or from memory to the CPU on a single node. EDA applications are examples of such problems. Messages can be "bundled" to minimize the penalty of high network latency, as sketched after this paragraph; this would require rewriting some of the applications. Alternatively, expensive interconnect technologies may be required.
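The following sketch shows one way message bundling might look: many small messages are queued locally and sent as a single batch, so the per-message latency penalty is paid once per bundle rather than once per message. The batch size, JSON framing, and the socket pair used in the demo are assumptions for illustration, not details of any particular EDA application.

```python
# Illustrative sketch of message bundling over a high-latency link.
import json
import socket

BATCH_SIZE = 64  # assumed bundle size

class BundlingSender:
    def __init__(self, sock: socket.socket):
        self.sock = sock
        self.pending = []

    def send(self, msg: dict):
        """Queue a small message; ship the whole bundle once it is full."""
        self.pending.append(msg)
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        """One length-prefixed network send carries the entire bundle."""
        if self.pending:
            payload = json.dumps(self.pending).encode()
            self.sock.sendall(len(payload).to_bytes(4, "big") + payload)
            self.pending.clear()

if __name__ == "__main__":
    a, b = socket.socketpair()        # stands in for a node-to-node link
    sender = BundlingSender(a)
    for i in range(200):
        sender.send({"task": i, "result": i * i})
    sender.flush()                    # push the final partial bundle
```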
Based on data from several enterprise problems, we have derived a heuristic we call the Rule of 10: the degradation in latency and bandwidth between two consecutive layers in the hierarchy should be no worse than a factor of 10, for all layers in a grid. The layers are the on-CPU cache, main memory, disks, and network. For "embarrassingly parallel" applications this rule may not apply, and for applications with significant data movement the factor may have to be as low as 6. However, the Rule of 10 seems applicable in many cases.
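A back-of-the-envelope way to apply the rule is to walk the hierarchy and compare each pair of consecutive layers. The sketch below does this for a cache/memory/network hierarchy; the per-layer latency and bandwidth figures are illustrative assumptions only, not measurements.

```python
# Minimal sketch of the Rule of 10 as a check over consecutive layers.
LAYERS = [
    # (name, latency in seconds, bandwidth in bytes per second) -- assumed values
    ("on-CPU cache", 10e-9,  50e9),
    ("main memory",  100e-9, 6.4e9),
    ("GbE network",  100e-6, 100e6),
]

def rule_of_10(layers, max_factor=10):
    """Print the latency/bandwidth degradation between consecutive layers."""
    for (hi_name, hi_lat, hi_bw), (lo_name, lo_lat, lo_bw) in zip(layers, layers[1:]):
        lat_factor = lo_lat / hi_lat
        bw_factor = hi_bw / lo_bw
        verdict = "ok" if max(lat_factor, bw_factor) <= max_factor else "violated"
        print(f"{hi_name} -> {lo_name}: latency x{lat_factor:.0f}, "
              f"bandwidth x{bw_factor:.0f} ({verdict})")

rule_of_10(LAYERS)
```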
Let us consider a cluster of servers connected by Gigabit Ethernet as an example. The effective bandwidth of Gigabit Ethernet is about 100 MB/s. At the adjacent layer in the hierarchy, main memory, bandwidths of 3.2 to 6.4 GB/s are typical today in commodity servers, so bandwidth is degraded by a factor of 32 to 64. The degradation is even worse for latency: with memory latencies on the order of 100 to 150 nanoseconds and Ethernet latencies on the order of 100 microseconds, the degradation factor is 700-1000. Hence, solutions with significant network communication need to be rewritten to bundle messages, or the Ethernet has to be replaced by a lower-latency interconnect such as InfiniBand Architecture (IBA).
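The same arithmetic can be reproduced directly from the numbers quoted above; note that the ~100 microsecond Ethernet latency is our assumed round figure, chosen to be consistent with the stated 700-1000x factor.

```python
# Degradation factors for the Gigabit Ethernet example above.
GBE_BANDWIDTH = 100e6              # ~100 MB/s effective GbE bandwidth (from text)
MEM_BANDWIDTH = (3.2e9, 6.4e9)     # 3.2-6.4 GB/s commodity-server memory (from text)
MEM_LATENCY = (150e-9, 100e-9)     # 100-150 ns memory latency (from text)
NET_LATENCY = 100e-6               # ~100 us Ethernet latency (assumption)

bw_low, bw_high = (b / GBE_BANDWIDTH for b in MEM_BANDWIDTH)
lat_low, lat_high = (NET_LATENCY / l for l in MEM_LATENCY)
print(f"bandwidth degradation: {bw_low:.0f}x to {bw_high:.0f}x")    # 32x to 64x
print(f"latency degradation:   {lat_low:.0f}x to {lat_high:.0f}x")  # ~700x to 1000x
```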
There are ways to work around the Rule of 10. Future Intel platforms, for example, are expected to offer on-CPU caches comparable in size to today's main memory, reducing the need for memory accesses and hence, for many solutions, the importance of memory access latency.