Debugging HPC Applications

The Intel Debugger is capable of working with serial, thread parallel (via PThreads and OpenMP) and cluster parallel (via MPI-1) applications.

I'd like to get a feel for how the community at large approaches debugging HPC applications. I'm curious to know:

  • What types of bugs do you encounter most often?
  • What types of bugs do you find most difficult to locate and fix?
  • What methods do you use to accomplish this?
  • Which debuggers do you tend to use (and why)?
  • Which debuggers do you prefer to use (and why)?
  • Which debugger features do you find yourself using most often?
  • Which new debugger features would you consider to be essential?



I work with cluster/MPI code. Most of my debugging has been done with a set of scripts I built on top of gdb. 90% or more of what I do involves using this as a simple segfault locator: the program being debugged just runs until it crashes, then I trace back & examine variables to try to find out why.

So one thing that's important to me is to be able to examine any variable, even if the compiler has somehow optimized it away. It's really annoying to try to look at something you suspect might be the cause of a problem, only to be told that there's no such variable name.

I don't find either gdb or the Intel debugger at all useful in locating problems that don't cause segfaults or otherwise crash the program, because of the line-oriented interface. (And the GUI displays I've seen built on top of gdb etc. are worse, if anything. Too much cute, too little work.) What I'd like is something similar to the DOS CodeView debugger of the late '80s: full-screen, text-oriented, with well thought out displays.

I've not used the Intel debugger much yet, basically because of a lack of usable documentation. All there is is HTML (which is difficult to sit down and read), and many areas seem to be missing any coverage.

One particular item, though, would be a way to display the contents of the XMM registers as floating point, not just hex.

Thanks James, that's great feedback.

I'll forward the information regarding the GUI, the documentation and the xmm registers to the engineering team.

I'll also take a closer look at the CodeView debugger to learn something more about its interface design.

In the meantime, IDB is being enhanced to do more optimized code debugging (including some initial support for register variables, split lifetime variables and inlined functions). You're likely to see this in the 8.1 final release. You're also likely to see these areas being enhanced (and some probable additions along the lines of semantic stepping through optimized code) in future releases.

In addition, the MPI support in IDB lets you control application instances spread across nodes in a cluster from a single debugger session. It uses an aggregation network to concentrate the program output in a way that allows for a nice clean command line experience (while giving you access to possible variations in that output). It does so with near linear performance on large cluster configurations (tested up to 2500 nodes).

If you have an opportunity, please give it a try and let me know what you think.

Thanks again.

-- Gordon

I recommend TotalView as a debugging tool; it is a really good tool for debugging MPI programs. It has a GUI and many other advanced features.

For debugging MPI programs, the most important thing is locating the problem; that's the key.




I too believe Etnus TotalView to be the Gold Standard in debugger technology in general, and for debugging cluster parallel applications in particular.

There are actually several good tools available for debugging clustered codes. For instance, the Intel Cluster Tools (i.e. Intel Trace Collector and Intel Trace Analyzer) can be used together as an effective debugger on clustered codes. Both IDB and Streamline DDT are also viable debugger options. IDB is provided without cost on the Intel Compiler kits and provides a nice clean command line interface (with an aggregation network for concentrating program and debugger output) for clustered codes. Streamline DDT is a premium debugger that provides more graphical features while coming in at a lower cost per seat than Etnus TotalView.

I recommend trying each to determine which ones best meet your needs (and your budget).

Has anyone here seen anything along the lines of static analysis tools for finding potential MPI related problems earlier in the application development cycle?


-- Gordon

gasaladi wrote:

There are actually several good tools available for debugging clustered codes. For instance, the Intel Cluster Tools (i.e. Intel Trace Collector and Intel Trace Analyzer) can be used together as an effective debugger on clustered codes.

Gordon -

Actually, the Intel tools that you mention here are performance analyzers. They can be used, to some small degree, as debuggers if you are unsure where messages might have originated from or gone to. However, that activity would require such painstaking steps that I would recommend using 'printf' to trace message traffic, instead. TotalView has the ability to track message queues that is really quite useful for this type of debugging.
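For anyone who wants to try the printf approach mentioned above, here is a minimal sketch of a trace helper in C. The function names `format_msg_trace` and `trace_msg` and the record layout are made up for illustration; they are not part of MPI or of any Intel tool. The rank, peer, tag and byte count are plain parameters, so the sketch assumes you obtain them yourself (for example from MPI_Comm_rank and the arguments of your send/receive calls).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format one trace record for a message event.
 * dir is a label such as "send" or "recv"; peer is the other rank.
 * Returns what snprintf returns (length of the full record). */
int format_msg_trace(char *buf, size_t len, int rank, const char *dir,
                     int peer, int tag, size_t bytes)
{
    assert(buf != NULL && dir != NULL);
    return snprintf(buf, len, "[rank %d] %s peer=%d tag=%d bytes=%zu",
                    rank, dir, peer, tag, bytes);
}

/* Hypothetical helper: emit the record on stderr. stderr is not fully
 * buffered, so records are more likely to survive a crash than stdout. */
void trace_msg(int rank, const char *dir, int peer, int tag, size_t bytes)
{
    char line[128];
    format_msg_trace(line, sizeof line, rank, dir, peer, tag, bytes);
    fprintf(stderr, "%s\n", line);
}
```

You would call something like trace_msg(rank, "send", dest, tag, nbytes) just before each send and the matching "recv" form after each receive, then sort the combined output by rank or tag to see where a message went missing.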

I've tried the IDB to track down a problem I was having with an MPI code (though the error was not related to MPI, as far as I could tell). I started by launching four separate instances of the debugger and then attaching to the MPI processes at an artificial pause inserted into the code for just this purpose. I did not have much luck. Your message seemed to imply that there could be some easier method for using IDB in such a situation. If there is, could you briefly share what that usage would be? Do all the processes need to be on the same node or can they be spread out across a cluster?
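For readers unfamiliar with the "artificial pause" trick described above, a minimal sketch in C might look like the following. The helper name `wait_for_debugger` and the environment variable WAIT_FOR_DEBUGGER are assumptions made up for this example, not something from the original code or from any debugger; the rank is passed in as a plain integer so the sketch does not depend on a particular MPI library.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: spin until a debugger attaches and clears `hold`.
 * Active only when the made-up variable WAIT_FOR_DEBUGGER is set, so
 * normal runs are unaffected. Returns 1 if it waited, 0 otherwise. */
int wait_for_debugger(int rank)
{
    /* volatile: the flag is re-read on every iteration, so a debugger
     * can change it from outside the program. */
    volatile int hold = 1;
    char host[256];

    assert(rank >= 0);
    if (getenv("WAIT_FOR_DEBUGGER") == NULL)
        return 0;

    if (gethostname(host, sizeof host) != 0)
        snprintf(host, sizeof host, "unknown-host");
    host[sizeof host - 1] = '\0';

    /* Tell the user which process on which node to attach to. */
    fprintf(stderr, "rank %d: pid %ld on %s waiting for debugger\n",
            rank, (long)getpid(), host);

    while (hold)
        sleep(1);

    return 1;
}
```

The idea is to call this once early in the program (for an MPI code, after the rank is known), log in to the reported node, attach with something like gdb -p <pid>, move up to the wait_for_debugger frame, set hold to 0, and continue. It works whether the processes share a node or are spread across a cluster, but it does mean one debugger session per process.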


Clay, see Chapter 19 of the Intel Debugger manual for how to do this very simply. Essentially, you just do your own equivalent of:

mpirun -dbg=idb -np N [other mpich options] application [application arguments] [--idbopt idb options]

James, it appears that later versions of MPICH now include an alternate launcher called mpigdb that provides an MPI framework to make debugging cluster parallel applications possible with gdb. See the related Argonne MPICH pages for specific details.

It's not as extensive as the support we offer with IDB, but it's close enough to be interesting.