Determining Whether an HPC Cluster Should Use 32-Bit or 64-Bit Processors


Challenge

Determine whether a High-Performance Computing (HPC) cluster should be based on 32-bit or 64-bit machines. HPC clusters are one form of parallel computing hardware, but not the only one. Multiprocessor computers are also parallel hardware and include both shared-memory systems (also called symmetric multiprocessor, or SMP, systems) and massively parallel processing (MPP) systems, such as the Intel® Paragon and the ASCI Red TFLOPS computer.

Parallel hardware can also be implemented within the processor itself, in the form of instruction-level parallelism (ILP). The Intel® Itanium® processor implements ILP in its Explicitly Parallel Instruction Computing (EPIC) architecture. A large multi-stage pipeline and out-of-order execution process with nine execution units enable up to eight instruction retirements per clock. The Itanium processor handles heavy floating-point workloads with two floating-point units that each complete two calculations per clock, for a total of four floating-point calculations per clock.
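
The per-clock floating-point figure is a simple product, and a peak rate per second follows once a clock frequency is chosen. The sketch below works through that arithmetic. The helper name `peak_flops` is hypothetical, and the 1 GHz clock is an assumed value for illustration only, not a figure from this article.

```python
def peak_flops(fp_units, calcs_per_unit_per_clock, clock_hz):
    """Theoretical peak floating-point calculations per second."""
    per_clock = fp_units * calcs_per_unit_per_clock
    return per_clock * clock_hz

# Figures from the text: two floating-point units, two calculations each per clock.
per_clock = 2 * 2

# ASSUMPTION: a 1 GHz clock, chosen only to make the arithmetic concrete.
assumed_clock_hz = 1_000_000_000
peak = peak_flops(2, 2, assumed_clock_hz)

print(per_clock)  # calculations per clock
print(peak)       # theoretical peak calculations per second at the assumed clock
```

This is a theoretical ceiling; sustained application performance is normally lower because of memory stalls and instruction mix.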

Parallel computers are classified in terms of the method used to organize instruction streams and data streams. Single Instruction Multiple Data (SIMD) parallel computers process a single stream of instructions acting on multiple streams of data. Multiple Instruction Multiple Data (MIMD) parallel computers process multiple streams of instructions acting on multiple streams of data. While SIMD was the method used in early supercomputing, it has not been successful as a means to organize large-scale computers with many microprocessors.
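
The difference between the two classes can be modeled in a few lines. This is only a conceptual sketch: the functions `simd` and `mimd` are hypothetical names, and plain Python runs them sequentially, so the code shows how instructions and data are organized rather than any real parallel speedup.

```python
def simd(instruction, data_streams):
    """SIMD model: one instruction stream applied to every data stream."""
    return [instruction(x) for x in data_streams]

def mimd(instruction_streams, data_streams):
    """MIMD model: each processor pairs its own instruction stream with its own data."""
    return [instr(x) for instr, x in zip(instruction_streams, data_streams)]

data = [1, 2, 3, 4]

# SIMD: the same "double it" instruction acts on all four data items.
simd_result = simd(lambda x: x * 2, data)

# MIMD: four different instructions, one per data item.
mimd_result = mimd(
    [lambda x: x * 2, lambda x: x + 10, abs, lambda x: -x],
    data,
)

print(simd_result)
print(mimd_result)
```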

However, SIMD has been very successful as a method of increasing instruction-level parallelism in Intel® microprocessor architectures. Both the Intel® NetBurst™ microarchitecture of the Intel® Xeon® processor and the EPIC architecture of the Intel Itanium processor use SIMD to improve processor performance and throughput.


Solution

Base the decision on the kinds of applications you are going to run on the cluster. Will the applications include larger-than-32-bit floating-point calculations, depend on massive data sets, or require the fastest possible turnaround times? Will the applications and their data fit within a 32-bit addressable memory space (at most 4 GB)? How constrained is the budget?
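
The memory question can be checked with a quick calculation: 32 address bits can reference at most 2^32 bytes (4 GiB). The sketch below is a minimal illustration under that assumption. The function name and the `reserved_bytes` parameter are hypothetical, and the limit an operating system actually gives a single process is typically lower than the raw 4 GiB figure.

```python
ADDRESS_SPACE_32_BIT = 2 ** 32  # bytes reachable with 32 address bits (4 GiB)
GIB = 2 ** 30

def fits_in_32_bit_address_space(working_set_bytes, reserved_bytes=0):
    """Return True if the working set plus any reserved region fits in 2**32 bytes.

    reserved_bytes is a hypothetical allowance for address space the
    operating system keeps for itself.
    """
    return working_set_bytes + reserved_bytes <= ADDRESS_SPACE_32_BIT

print(fits_in_32_bit_address_space(3 * GIB))                        # fits
print(fits_in_32_bit_address_space(6 * GIB))                        # does not fit
print(fits_in_32_bit_address_space(3 * GIB, reserved_bytes=2 * GIB))  # does not fit
```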

Answering these questions helps you determine whether nodes should be based on 64-bit Intel® Itanium® processors, 32-bit Intel® Xeon® processors, or Intel Xeon processors MP. An Intel Xeon processor or Intel Xeon processor MP can be used for high-performance clusters when:

  • You need high-performance, 32-bit computing.
  • 32 bits of addressable memory is adequate.
  • Price-performance is critical.
  • Applications contain mostly integer (versus floating-point) calculations.
  • The application code can benefit from Streaming SIMD Extensions (i.e., it contains small, repetitive loops that operate on sequential arrays of integers).
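
The kind of loop described in the last item, a small repetitive loop over sequential integer arrays, can be pictured as follows. This is a conceptual model only: Python does not emit SIMD instructions, the function names are hypothetical, and the lane count assumes a 128-bit register holding four 32-bit integers.

```python
LANES = 4  # ASSUMPTION: a 128-bit register holding four 32-bit integers

def add_arrays_packed(a, b):
    """Add two equal-length integer lists, LANES adjacent elements per step."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    out = []
    for i in range(0, len(a), LANES):
        # One modeled "packed add": the same operation on up to LANES elements.
        # The final chunk may be shorter when len(a) is not a multiple of LANES.
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
    return out

def packed_steps(n):
    """Number of packed operations needed for n elements (ceiling division)."""
    return -(-n // LANES)

a = list(range(10))
b = [100] * 10
print(add_arrays_packed(a, b))
print(packed_steps(len(a)), "packed steps instead of", len(a), "scalar steps")
```

Loops of this shape, with independent iterations and contiguous data, are the ones a vectorizing compiler can usually map onto SIMD instructions.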

Intel Xeon processor-based platforms offer cost-effective, high-performance computing for HPC applications. With the Intel NetBurst microarchitecture of the Intel Xeon processor, applications can take advantage of a deep, 20-stage pipeline that increases throughput from instruction-level parallelism (ILP) by retiring up to three instructions per clock. For applications that depend on very large data sets or need shorter turnaround times, Intel Xeon processors MP in multiprocessor platforms provide an integrated three-level on-chip cache hierarchy, including an integrated level-three (iL3) cache, for fast access to data. Most high-performance processors use only two levels of cache. With the iL3 cache of the Intel Xeon processor MP, more data can be stored close to the execution units, resulting in higher system throughput and shorter turnaround times.

An Intel Itanium processor-based cluster is recommended when:

  • The problem requires enormous floating-point performance.
  • The problem is very large and relies on very large data sets.
  • You need massive amounts of addressable memory (64-bit addressability).
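
The two checklists above can be condensed into a small decision helper. This is a rough sketch, not an Intel sizing tool: the function and parameter names are hypothetical, treating any single 64-bit criterion as decisive is an assumption, and budget is deliberately left out of the model.

```python
def recommend_node_type(needs_heavy_floating_point,
                        needs_very_large_data_sets,
                        needs_more_than_32_bit_addressing):
    """Suggest a node type from the article's checklist.

    ASSUMPTION: any one of the three 64-bit criteria is enough to
    suggest Itanium; the article does not say how to weigh them.
    """
    if (needs_heavy_floating_point
            or needs_very_large_data_sets
            or needs_more_than_32_bit_addressing):
        return "Intel Itanium (64-bit)"
    return "Intel Xeon or Xeon MP (32-bit)"

# Integer-heavy code whose data fits in a 32-bit address space.
print(recommend_node_type(False, False, False))

# Floating-point-heavy simulation that needs more than 32-bit addressing.
print(recommend_node_type(True, False, True))
```

In practice the budget question from the start of this section still applies, so a result from a helper like this is a starting point for evaluation rather than a final answer.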

The Intel Itanium processor provides two separate floating-point execution units for fast processing of large, complex numeric calculations. Each floating-point unit can execute two calculations per clock. The processor is built around a 6.4 GB/s system bus and offers on-die cache configurations of up to 3 MB, which keep large sets of critical data close to the execution units for very fast access. The Itanium processor is based on the EPIC architecture, which increases throughput from instruction-level parallelism and allows up to eight instructions to be retired per clock. Sophisticated branch prediction and predication further increase application efficiency and performance.

To gain the highest performance from Intel® architecture-based nodes, Intel provides compilers developed specifically for its processors and designed to enable applications to take advantage of the processor microarchitecture features. Intel also offers performance-tuning tools, such as the Intel® VTune™ Performance Analyzer, and optimized libraries, including the Intel® Math Kernel Library.


Source

Building High-Performance Computing Clusters with Intel® Architecture, Part 1

 

