Partner Newsletter Q1 2011: Intel Cluster Ready Articles

UC Irvine Standardizes GreenPlanet Cluster with Intel® Cluster Ready

Like many large, shared academic HPC clusters, the 380-server GreenPlanet cluster at the University of California, Irvine (UCI) School of Physical Sciences grew over time and became more difficult to manage.
“Over the past couple years, we’ve added anywhere from 2 to 50 nodes at a time, and we’ve probably had 30 research groups involved in buying 15 different configurations of nodes for the system,” recalls Dr. Nathan Crawford, Chemistry Modeling Facility Director for UCI. “Each time we added nodes, we tried to suggest a standard configuration, but the reality is that as the system got bigger, it got more varied.”
That variability created reliability problems and made them harder to resolve when they occurred. “It was harder to keep track of all the differences in spreadsheets or in our heads, and we were spending too much time tracking down configuration issues or trying to figure out what was in what,” Crawford says. “We needed an automated system that could record what the nodes should be and help us enforce proper configuration.”
Research for a Green Planet
The GreenPlanet team decided to bring its Dell cluster into compliance with the Intel Cluster Ready specification and use Intel® Cluster Checker 1.6 to evaluate it. The heterogeneous cluster is based on Dell PowerEdge* servers with several generations of Intel® Xeon® processors and QLogic TrueScale* InfiniBand* interconnect.
The system supports a wide range of HPC applications, including research conducted by Nobel Prize-winning scientists. The latest sets of nodes added to the cluster support UCI researchers who are leading work in the ATLAS experiment at CERN’s Large Hadron Collider. The sub-clusters, which are owned by different faculty members and departments, can be used individually, or the entire cluster can be dedicated to a single resource-intensive task. The ATLAS nodes will be managed as a sub-cluster.
“The Intel Cluster Ready program allows us to have the first-time, every time, all-the-time response that we’ve really wanted to have,” says Ronald D. Hubbard, Executive Director of the GreenPlanet HPC Consortium in the UCI School of Physical Sciences. “We know exactly what we’re doing on a day-to-day basis. When we expand, it’s wonderful to be in a position where you can plug and play and not put yourself in peril with all the other jobs that are on the site at the same time.”
Smooth Process, Including a CentOS to CERN Scientific Linux Transition
Crawford and Chad Cantwell of UCI’s Physical Sciences Computational Support Group worked with Jeremy Siadal, a senior technical consultant with the Intel Cluster Ready program, to survey the system and create the cluster configuration file. They started by inventorying their existing hardware configurations and testing the Intel Cluster Ready software and processes on a four-node test cluster. UCI also upgraded several elements of the system, moving from Clustercorp Rocks+ 5.1 to Rocks+ 5.3 and from CentOS 5.3 to CERN Scientific Linux 5.5.
“UCI is one of the first large Intel Cluster Ready sites using Rocks+ with Scientific Linux. With the work Clustercorp has done to integrate Intel Cluster Ready into Rocks+, UCI didn’t have to do anything special to make this work,” says Siadal. “It’s nothing more than an extra checkbox at install-time.”
UCI also restructured the layout of the InfiniBand fabric and deployed the QLogic InfiniBand Fabric Suite* (IFS 6.0), including OpenFabrics Enterprise Distribution (OFED 1.5.2) networking software. Noncompliant system elements were identified with early Intel Cluster Checker runs and flagged in the configuration files.
The process was “about as smooth as we could hope for,” notes Crawford. The system was offline for just three days over the university’s winter break.
More Time for Science, More Jobs Running to Completion
The GreenPlanet team immediately began seeing value from the Intel Cluster Ready standard architecture and using Intel Cluster Checker for ongoing verification and performance checks.
“The process of getting certified cleaned up a lot of underlying issues,” Crawford says. “Already the system is much more stable. Intel Cluster Checker helped us track down strangely misconfigured nodes, identify memory issues caused by some flaky DIMMs that one professor had bought, and narrow down performance issues when our QDR nodes weren’t using the libraries properly.”
The system is now more standardized and variations are well-documented. “The config file that you create as you implement Intel Cluster Ready contains all the little weirdnesses of the cluster, so that can become a master document to fall back on,” says Crawford. “You can keep track of what the system is at any time in relation to how it should be. Basically, Intel Cluster Ready enforces proper configuration and running of the cluster.”
The result is a more stable, reliable compute resource and more productive scientists. “Intel Cluster Checker will allow us to maintain good conformity and make sure new nodes will fit into the system and play well with others,” Crawford says. “Any random bugs you’d normally get from slightly different library versions or BIOS settings won’t be there, which will eliminate issues with large parallel jobs failing due to random configurations. We’ll have more jobs running to completion, and scientists can spend less time trying to debug underlying hardware and software problems.”
For more, read the UCI Dell Case Study, or watch the video.

Participants in the UCI Intel Cluster Ready project included Ronald D. Hubbard, Nathan Crawford and Chad Cantwell of UCI; and Eric Schwenkel and Jeremy Siadal of Intel.

