Intel® Itanium® Processors

Code Timing and Profiling for Linux on 64-Bit Intel® Architecture


Challenge

Measure the time a program and its functions take to execute as part of the diagnosis phase of performance optimization. Such measurements are extremely valuable as a simple means to become familiar with how an application behaves during execution.


Solution

Use either the Linux time command or the clock function in the C library, and enable profiling when compiling the application. The time command is used as follows:

prompt> time 

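It reports three values: the elapsed wall-clock time (real), the CPU time spent in user mode (user), and the CPU time spent in the kernel on behalf of the process (sys).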
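As a rough illustration of the clock approach, the following C sketch times a single function call. The workload sum_to and the helper cpu_seconds are hypothetical names introduced for this example; only clock and CLOCKS_PER_SEC come from the C library. Note that clock reports processor time used by the program, not wall-clock time.

```c
#include <time.h>

/* Hypothetical workload, for illustration only: sums the integers 0..n-1. */
long sum_to(long n)
{
    volatile long total = 0;   /* volatile discourages the compiler from removing the loop */
    for (long i = 0; i < n; i++)
        total += i;
    return total;
}

/* Hypothetical helper: calls work(arg), stores its return value in *result,
   and returns the processor time the call consumed, in seconds. */
double cpu_seconds(long (*work)(long), long arg, long *result)
{
    clock_t start = clock();
    *result = work(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A caller would pass the function of interest and print the returned value, for example cpu_seconds(sum_to, 100000000L, &r). Short calls can fall below the resolution of clock, so timing a loop of repeated calls and dividing is usually more informative.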

  • Analyze Memory Accesses on 64-Bit Intel Architecture


    Challenge

    Determine which memory accesses cause the EXE pipeline stalls recorded by the BE_EXE_Bubble counter. This counter accumulates stall cycles in the EXE stage of the pipeline and captures most memory-access stall cycles. These stalls occur mostly because the data being loaded into registers is not yet ready for consumption by the functional units.
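    To make the idea of a load whose result is not ready concrete, here is a minimal C sketch contrasting two access patterns. The struct node type and both function names are hypothetical and not taken from the article. Whether either loop actually stalls on a given machine depends on cache behavior and on how the compiler schedules the loads.

```c
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
struct node {
    long value;
    struct node *next;
};

/* Pointer chasing: the address of each load comes from the previous load,
   so the next iteration cannot begin until p->next has arrived. */
long sum_list(const struct node *p)
{
    long total = 0;
    while (p != NULL) {
        total += p->value;
        p = p->next;
    }
    return total;
}

/* Array traversal: every address is computable in advance, which leaves the
   compiler and hardware free to issue loads ahead of their use. */
long sum_array(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

    Both functions compute the same kind of sum; the difference lies only in how early the load addresses are known, which is the property that tends to separate stalled loads from overlapped ones.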

    These dependency stalls break down into two sources:

  • Register-Stack Engine Stalls on 64-Bit Architecture


    Challenge

    Identify the source of stall cycles due to invocation of the Register Stack Engine (RSE). There are 96 general registers used for the register stacks. A deep call stack or a call stack through functions with heavy register needs can exceed this resource. Such situations require the RSE to spill the values stored in these registers for higher levels of a call chain to a backing store. The RSE then recovers the values as the call stack is unwound. This occurs automatically as the need arises.
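    The arithmetic behind this can be sketched in C under a deliberately simplified model: every function in the call chain is assumed to allocate the same number of stacked registers, and the only limit considered is the 96 registers mentioned above. The function names and the uniform-frame assumption are hypothetical; real RSE behavior is more involved.

```c
/* Number of stacked general registers, as stated in the text. */
#define STACKED_REGS 96

/* Simplified model: how many nested frames of regs_per_frame registers fit
   before the RSE would have to spill. Returns -1 for a non-positive size. */
int frames_before_spill(int regs_per_frame)
{
    if (regs_per_frame <= 0)
        return -1;
    return STACKED_REGS / regs_per_frame;
}

/* Simplified model: how many registers must live in the backing store
   when the call chain is depth frames deep. */
int regs_spilled(int depth, int regs_per_frame)
{
    int needed = depth * regs_per_frame;
    return needed > STACKED_REGS ? needed - STACKED_REGS : 0;
}
```

    Under this model, frames of 8 registers fit 12 deep without spilling, while a depth of 20 would push 64 registers to the backing store. Frames with heavier register needs reach the limit at correspondingly shallower depths.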

  • Resolve Address Conflicts on 64-Bit Architecture


    Challenge

    Resolve address conflicts that cause a significant number of stall cycles. Cache misses occur when data is not in the desired cache and data retrieval requires access to a slower cache, memory, or even disk.

    Address conflicts are more common than most people realize and can be as costly as cache misses. Address conflicts are caused by technical details of the cache and cache access hardware; they are therefore more difficult to understand, though sometimes much easier to avoid.
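    The excerpt does not say which kind of address conflict it has in mind. As one common example, the C sketch below shows how two different addresses can compete for the same set of a set-associative cache. The line size and set count in the usage note are assumed values for illustration, not figures for any particular processor, and the function names are hypothetical.

```c
#include <stdint.h>

/* Set index of an address in a hypothetical set-associative cache with
   line_bytes bytes per line and num_sets sets (both assumed non-zero). */
uint64_t set_index(uint64_t addr, uint64_t line_bytes, uint64_t num_sets)
{
    return (addr / line_bytes) % num_sets;
}

/* Returns 1 when two addresses map to the same set and therefore compete
   for the same few cache ways, 0 otherwise. */
int same_set(uint64_t a, uint64_t b, uint64_t line_bytes, uint64_t num_sets)
{
    return set_index(a, line_bytes, num_sets) == set_index(b, line_bytes, num_sets);
}
```

    With an assumed 64-byte line and 256 sets, addresses 0 and 16384 land in the same set while addresses 0 and 64 do not. This is why arrays whose starting addresses differ by a large power of two can conflict, and why padding one of them is sometimes enough to avoid the problem.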

  • Tests of Efficient Implementation of Madd Algorithms on an Itanium®-based System

     

    Introduction

    By Joe Bissell, University of Delaware
    Gary Zoppetti, University of Delaware, and
    Walter Triebel, Fairleigh Dickinson University

    Integer matrix multiplication is a common procedure. It is used generically to explore options for optimizing nested loops, and specifically in computations that range from multimedia algorithms, such as JPEG image decoding, to applications as diverse as oceanographic analysis. The essence of this algorithm is the assignment statement:
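    The assignment statement itself is not reproduced in this excerpt. As a generic illustration rather than the authors' code, the textbook form of integer matrix multiplication can be sketched in C as a triple nested loop around a single multiply-add assignment. The function name matmul and the flat row-major layout are assumptions made for this example.

```c
/* Textbook n-by-n integer matrix multiply, c = a * b.
   Matrices are stored row-major in flat arrays of n * n elements;
   c must not overlap a or b. */
void matmul(int n, const int *a, const int *b, int *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            c[i * n + j] = 0;
            for (int k = 0; k < n; k++) {
                /* The multiply-add at the heart of the algorithm. */
                c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
}
```

    The innermost statement is the usual target for loop optimizations such as reordering, unrolling, and blocking, since it executes n cubed times.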
