performance

Optimizing Software Applications for NUMA: Part 5 (of 7)

3.2. Data Placement Using Implicit Memory Allocation Policies

In the simple case, many operating systems transparently provide support for NUMA-friendly data placement. When a single-threaded application allocates memory, the operating system simply assigns memory pages to the physical memory associated with the requesting thread’s node (CPU package), ensuring that the memory is local to the thread and that access performance is optimal.
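As a rough illustration of the implicit policy described above, the sketch below allocates a buffer and initializes it from the same thread that later reads it. It assumes the operating system applies a local ("first-touch") placement policy, as Linux does by default. The function names `allocate_and_touch` and `sum` are hypothetical and are not taken from the article.

```cpp
#include <cstddef>
#include <memory>

// Allocate n doubles and write to every element from the calling thread.
// Assumption: under a first-touch policy, the pages are typically backed by
// physical memory on the node where the writing thread is running.
std::unique_ptr<double[]> allocate_and_touch(std::size_t n, double value) {
    // new double[n] leaves the elements uninitialized; for large buffers the
    // OS usually does not commit physical pages until they are first written.
    std::unique_ptr<double[]> data(new double[n]);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = value;  // first touch happens here, in this thread
    }
    return data;
}

// Read the buffer back; if this runs on the same node as the thread that
// touched the pages, the accesses are expected to be local.
double sum(const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Calling `allocate_and_touch(n, 1.0)` and then `sum(buffer.get(), n)` from one thread keeps allocation, first write, and later reads on the same node, provided the scheduler does not migrate the thread in between.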

Optimizing Software Applications for NUMA: Part 1 (of 7)

1. The Basics of NUMA

NUMA, or Non-Uniform Memory Access, is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system. Perhaps the best way to understand NUMA is to compare it with its cousin UMA, or Uniform Memory Access.

In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:

Improve Performance on 64-Bit Architecture of Applications with Many Small Functions


Challenge

Improve application performance in programs that contain many frequently used small to medium-sized functions. This characteristic is very common in object-oriented C++ programs that implement accessor methods.
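For illustration only, the kind of code described above might look like the following sketch. The `Point` class and the `sum_x` function are hypothetical and do not come from the article, and the teaser does not state which solution the article recommends. Member functions defined inside the class body are implicitly inline in C++, which allows (but does not oblige) the compiler to remove the call overhead of such accessors.

```cpp
#include <cstddef>

// A small class with typical accessor methods (hypothetical example).
class Point {
public:
    Point(double x, double y) : x_(x), y_(y) {}

    // Defined in the class body, so these are implicitly inline.
    double x() const { return x_; }
    double y() const { return y_; }
    void set_x(double x) { x_ = x; }

private:
    double x_;
    double y_;
};

// A hot loop that calls an accessor once per element. If the call is not
// inlined, the function-call overhead can dominate the actual work.
double sum_x(const Point* points, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += points[i].x();
    }
    return total;
}
```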

  • itanium
  • performance
  • How-To
  • Intel® Itanium® Processors
  • Threading Fortran applications for parallel performance on multi-core systems

    Advice and background information are given on typical issues that may arise when threading an application using the Intel Fortran Compiler and other software tools, whether using OpenMP, automatic parallelization, or threaded libraries.
  • Linux*
  • Microsoft Windows* (XP, Vista, 7)
  • Apple Mac OS X*
  • Fortran
  • Intel® Fortran Compiler
  • performance
  • multi-core
  • OpenMP*
  • Optimization
  • Threading
  • Quantify the Penalty of Branch Misprediction on 64-Bit Architecture


    Challenge

    Determine the performance penalty associated with the misprediction of a conditional branch on a processor based on 64-bit Intel® architecture. A separate item, How to Identify Branch Misprediction on 64-Bit Intel® Architecture, shows how to identify stalls due to branch misprediction.


    Solution

    Use a simple loop as shown in the following code:

  • itanium
  • performance
  • How-To
  • Intel® Itanium® Processors
  • Instruction Latencies in Assembly Code for 64-Bit Intel® Architecture


    Challenge

    Optimize assembly-language code for the Itanium® processor family in terms of instruction latencies. The latency of an instruction is the time that elapses from when the instruction is issued until its results can be used. For most simple integer math operations, like "add r32=r33,r34", the latency is a single cycle, so the results of many operations can be used in the very next set of parallel instructions. This is generally not true for floating-point operations or loads from memory.

  • itanium
  • performance
  • How-To