Tata-CRL Case Study

Case Study

Intel® Connects Cables

Computational Research Laboratories

High-performance computing

02/11/08

 

"Almost all our reliability problems went away when we went with the Intel optical cables." - Ashrut Ambastha, EKA Architect, Computational Research Laboratories

EKA:  Building the 4th Fastest Supercomputer in 30 Days

One of the world's fastest supercomputers uses Intel® Connects Cables to achieve 120 Teraflop performance [1]

On November 13, 2007, Computational Research Laboratories' (CRL) EKA HPC system broke into the elite top-ten list of the world's fastest supercomputers. CRL, located in Pune, India, is a subsidiary of Tata Sons Limited [2]. The Tata-CRL supercomputer, currently ranked fourth in the world on the Top500* list [3] announced at the Supercomputing '07 Conference, is named EKA after the Sanskrit term for "One". Using an octagon architecture built from volume cluster components rather than proprietary ones, EKA has a peak performance of 180 Teraflops and sustained performance of nearly 120 Teraflops, also making it the fastest system in Asia.

When designing and building EKA, CRL faced several large challenges, including limited space, sustained performance, and reliability. To resolve these issues, CRL chose Intel® Connects Cables, high-performance optical-fiber InfiniBand* cables that deliver double data rate (DDR) performance of 20 Gbps along with an extremely low bit error rate (BER) of 10⁻¹⁵ [4]. With lengths up to 100m, Intel Connects Cables helped CRL fit EKA into the allocated space. With 20 Gbps DDR rates, the cables helped EKA achieve sustained performance of nearly 120 Teraflops. With a bit error rate of 10⁻¹⁵, Intel Connects Cables helped the CRL cluster achieve reliability and begin running successful LINPACK tests within just 20 days [5].

 

Challenge

  • Space: Fit maximum computing power into the allocated space.
  • Sustained Performance: Design a robust HPC cluster that can deliver Top 10 performance in excess of 100 Teraflops.
  • Reliability: Build a system that performs reliably and completes LINPACK runs.

Solutions

  • Space: Use 20m Intel® Connects Cables to enable a near-circular architecture that saves space and improves performance.
  • Performance: Use Intel Connects Cables to deliver double data rate (DDR) performance of 20 Gbps at up to 100m lengths.
  • Reliability: Lower the bit error rate and eliminate dead links or links with marginal performance with Intel Connects Cables (10⁻¹⁵ BER).
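To put the 10⁻¹⁵ BER figure in perspective, the minimal sketch below estimates how many bit errors one link direction would be expected to see if it ran at the quoted 20 Gbps for 8 hours. The function name is hypothetical, and the 10⁻¹² comparison value is an assumption chosen purely for illustration; it is not a measured figure for CRL's copper cables.

```python
def expected_bit_errors(rate_bps: float, ber: float, hours: float) -> float:
    """Expected number of bit errors on one link direction kept fully busy."""
    bits_sent = rate_bps * hours * 3600  # total bits transferred in the window
    return bits_sent * ber

RATE_BPS = 20e9    # DDR rate quoted in the case study
RUN_HOURS = 8      # illustrative run length

# BER quoted for Intel Connects Cables
optical = expected_bit_errors(RATE_BPS, 1e-15, RUN_HOURS)
# Assumed comparison BER (illustration only, not a CRL measurement)
comparison = expected_bit_errors(RATE_BPS, 1e-12, RUN_HOURS)

print(f"Expected errors per link at 1e-15 BER: {optical:.3f}")
print(f"Expected errors per link at 1e-12 BER: {comparison:.0f}")
```

Under these assumptions a single saturated link would average less than one bit error per run at 10⁻¹⁵, so even a thousandfold worse error rate changes the picture substantially once it is multiplied across hundreds of links.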

Assessing the situation

Put a big system in a small room in a very short time

The CRL architects needed to meet several key requirements in designing and building EKA. The goals for EKA were to reach a sustained performance of at least 100 Teraflops and be one of the top 10 supercomputers in the world. They also needed to fit a system with 1822 nodes and 3644 Intel® Xeon® 5365 Quad Core 3 GHz processors into a 21m by 16m room and get it running LINPACK tests successfully within one month.

To address these formidable challenges, the EKA team worked closely with HP and the Intel India team, choosing a near-circular architecture for both space and performance reasons. To reduce cost and speed up assembly, the system was built from standardized HP ProLiant* BL460c blade components and centrally placed Voltaire switches, initially wired with active copper InfiniBand* cables.

Make EKA perform reliably to complete LINPACK runs

CRL then found that the 15m active copper cables initially used to connect the 288-port Voltaire switches to the servers were not reliable enough to complete the LINPACK runs. Massive problems with BER and symbol errors resulted in underperforming and dead links, which eventually aborted the runs. To hit the Top 10, CRL needed EKA to perform reliably for 8 hours or more to complete the LINPACK benchmarks.

"Basically, we were getting only 50-70% network performance due to erroneous links," explains Ashrut Ambastha, EKA Architect, Computational Research Laboratories. "The degraded performance is one of the problems we continue to have with the remaining active copper cables."

Time was running out for CRL as the deadline for Top500 submissions loomed closer.  Having tried active copper cables, Ashrut Ambastha now looked to optical fiber Intel Connects Cables for a solution.

 

 

Delivering the solution

Impressed by the performance of optical cables in some of their smaller clusters, EKA architects swapped out 340 of the active copper cables, replacing them with Intel® Connects Cables as the connections to the 288-port Voltaire switches. Using 20m cable runs, CRL architects were able to create a non-blocking node-to-node architecture that took maximum advantage of their available space. According to Ashrut Ambastha, about 37% of the system is wired with Intel® Connects Cables (see Figure 1) and the remainder with copper.

Although the design and concept of EKA took 9 months, the supercomputer build and LINPACK runs were completed in a record 1 month. Only 20 days after the hardware arrived on site, the system was successfully running benchmarks at full scale. EKA is the first time a near-circular architecture built from standard, volume cluster components has been tried on this scale, and it is one of the first implementations of optical InfiniBand cables in a supercomputer. For CRL, the proof was in the reliability of EKA once optical Intel Connects Cables replaced the active copper cables.

Dramatic improvements in reliability

Prior to the use of Intel Connects Cables, CRL was unable to complete LINPACK runs because of reliability problems with the copper cables. For example, when LINPACK tests were run with the 15m active copper interconnects, EKA's uptime was only 4-1/2 to 7-1/2 hours before failure [6]. The high symbol error rates seen with copper simply would not allow bandwidth tests with long data packets.

"Sometimes we'd be almost finished, and at the last minute, we'd have a link drop. It was quite frustrating," says Ashrut. "The optical cables really helped solve this problem. With optical, we ran full LINPACK on one switch for more than 8 hours, and there were no symbol errors in the entire system. That's the reliability we needed to get us to 119.6 Teraflops."

"Better than 99.9% of the problems went away when we went with optical cables instead of copper." -Ashrut Ambastha


Figure 1. EKA supercomputer with copper and optical cables. Intel® Connects Cables (the yellow/orange cables in photo) take up less space, are more flexible, and allowed CRL to eliminate 1/3 of their cable trays and bend cables more easily into available spaces.

 

Inside the EKA System

EKA is currently the 4th fastest supercomputer in the world, with a peak performance of 180 Teraflops and a sustained performance of 120 (119.6 actual) Teraflops.

EKA was built using a dense data-center layout, novel network routing, and a Clos architecture. Its parallel processing library technologies were developed by CRL scientists. In addition to various lengths of copper cables, the system uses 340 Intel® Connects Cables to connect 1822 dual-socket compute nodes and the supporting 288-port InfiniBand switches at double data rates (DDR). The system has two topologies, and can be switched to use the topology appropriate for the application being run.
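As a rough check on the headline numbers, the sketch below derives the processor count from the quoted node count and computes the ratio of sustained to peak performance. It uses only figures stated in this case study; the function and constant names are hypothetical, and the result is an approximation because the 180 Teraflops peak is itself a rounded figure.

```python
PEAK_TFLOPS = 180.0        # peak performance quoted in the case study
SUSTAINED_TFLOPS = 119.6   # sustained LINPACK result quoted in the case study
NODES = 1822               # compute nodes quoted in the case study
SOCKETS_PER_NODE = 2       # nodes are described as dual-socket

def linpack_efficiency(sustained_tflops: float, peak_tflops: float) -> float:
    """Fraction of peak performance delivered on the benchmark."""
    return sustained_tflops / peak_tflops

processors = NODES * SOCKETS_PER_NODE
efficiency = linpack_efficiency(SUSTAINED_TFLOPS, PEAK_TFLOPS)

print(f"Processors: {processors}")
print(f"Sustained/peak efficiency: {efficiency:.1%}")
```

The processor count agrees with the 3644 processors cited earlier, and the ratio works out to roughly two thirds of the quoted peak.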

 

Component       EKA used
Servers         HP ProLiant* BL460c* server blades with Intel® Xeon® X5365 quad-core processors, for a total of 14,240 processor cores
Switch Module   InfiniBand 4X DDR Mellanox switch module for the HP c-Class BladeSystem*
Switches        Voltaire Grid Director* ISR2012 IB 288-port 4X DDR switches
HCAs            Mellanox ConnectX*
Cabinets        HP BladeSystem* c-Class c7000 enclosures
Storage         HP SFS20* 80 terabyte parallel file system storage
Interconnects   340 InfiniBand DDR Intel® Connects Cables plus additional copper interconnects
Software        HP XC* system software; Intel compilers and math libraries

 

Success for EKA

Once the system was up and running reliably, EKA achieved an initial sustained performance of 93 Teraflops. CRL then enlisted the expertise of Intel software engineers in India and Russia to recompile, tweak, and optimize the machine. Performance steadily improved from 93 to 105, 108, and ultimately 119.6 Teraflops for the Top500 submission [7].
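The size of each tuning step is easier to see as a percentage. The short sketch below computes the gain between successive results and the overall gain, using only the four Teraflops figures quoted above; the variable and function names are hypothetical.

```python
# Sustained results in Teraflops, in the order quoted in the case study
results_tflops = [93.0, 105.0, 108.0, 119.6]

def percent_gain(before: float, after: float) -> float:
    """Relative improvement from one result to the next, in percent."""
    return (after - before) / before * 100.0

# Gain for each consecutive pair of results
step_gains = [
    percent_gain(before, after)
    for before, after in zip(results_tflops, results_tflops[1:])
]
total_gain = percent_gain(results_tflops[0], results_tflops[-1])

for (before, after), gain in zip(zip(results_tflops, results_tflops[1:]), step_gains):
    print(f"{before:g} -> {after:g} Teraflops: +{gain:.1f}%")
print(f"Overall: +{total_gain:.1f}%")
```

On these figures the software tuning added a little under 29% on top of the initial 93 Teraflops result.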

"We used 340 of the Intel® Connects Cables on the cluster and the DDR performance at up to 20m with an extremely clean signal and no dropped packets was a key component in CRL achieving the performance we did," says CRL's Seetha Rama Krishna, EKA Program Manager.  "We experimented with other plain and active copper cables but it was the performance, longer length, and very low BER of the Intel Connects Cables that in the end helped ensure our placement on the TOP500 list."

 

Next steps for EKA

CRL intends to use the EKA system in government scientific research and product development for Tata, as well as to provide services to US customers. Applications that will run on EKA include automotive simulations, animations, computational aerodynamics, pharmaceutical research, and nanophotonics modeling, as well as earthquake and tsunami modeling.

However, the 120 teraflop system is only the first step -- a "demonstrator" machine -- towards petascale performance for CRL.  After seeing the EKA cluster performance with Intel Connects Cables, Ashrut is looking forward to building CRL's next, much larger supercomputer using primarily optical cables.

"What we have learned from this machine is that optics are the way to go," explains Ashrut.  "Once we start building the bigger machine, we'd like to eliminate copper altogether because of the improvements we're seeing with optical -- better BER, better airflow, and lower volume."

Summary

CRL is excited about the possibilities for building reliable supercomputers that take advantage of optical cables to explore new topologies. With challenging goals ahead, CRL expects the longer, reliable optical cables to make it easier to scale cluster size and applications, and improve overall performance further, ensuring a place among the top supercomputers in the world.

For more information about Intel Connects Cables, visit www.intelconnects.com.

For more information about EKA, CRL and its services, visit  www.crlindia.com.

 

Computational Research Laboratories

Computational Research Laboratories (CRL), in Pune, India, was incorporated as a fully-owned subsidiary of Tata Sons Limited, with a mandate to achieve global leadership in the area of high-performance computing systems.  With an elite team of 80 researchers and scientists covering application software, system architecture, system software and hardware design, CRL not only builds world-class and globally competitive supercomputer systems, but also delivers application-level scalability.

 

1 Performance as submitted for Top500 list was 119.6 Teraflops.  Source: Computational Research Labs (CRL).

2 Source: CRL

3 Source:  www.top500.org

4 Source: Intel Internal testing

5 Source: CRL

6 Source: CRL

7 Source: CRL

This document is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

Copyright © 2008 Intel Corporation. All rights reserved.

Intel, the Intel logo, Intel. Leap ahead., Intel. Leap ahead. logo, and Intel Core are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.
