A Quick Peek Under the Hood of Intel® Parallel Inspector (Memory Checking)

Finding the cause of errors in multithreaded applications can be quite challenging. Intel® Parallel Inspector, an Intel® Parallel Studio tool, is a proactive bug finder that helps you detect threading and memory errors in multithreaded applications and perform root-cause analysis on them.

Intel® Parallel Inspector enables C and C++ application developers to:

  • Locate a large variety of memory and resource problems, including leaks, buffer overrun errors and pointer problems;
  • Detect and predict thread-related deadlocks, data races and other synchronization problems;
  • Detect potential security issues in parallel applications;
  • Rapidly sort errors by size, frequency and type to identify and prioritize critical problems.

In this article I'll focus on the technologies used for checking memory issues and will show what is happening under the hood. Intel® Parallel Inspector tracks all memory allocations and accesses using a binary instrumentation tool called Pin.  Pin is a dynamic instrumentation system provided by Intel (http://www.pintool.org), which allows C/C++ code to be injected into the areas of interest in a running executable. The injected code is then used to observe the behavior of the program. Intel® Parallel Inspector injects appropriate code into the application to check memory and threading errors.

 

insp-1.JPG

 

Intel® Parallel Inspector uses Pin in different configurations to provide four levels of analysis, each with different settings and different overhead. The first three analysis levels target memory problems occurring on the heap, while the fourth level can additionally analyze memory problems on the stack.


Intel® Parallel Inspector Memory Analysis Levels

insp-4.JPG


1. The first analysis level helps to find out whether the application has any memory leaks. A memory leak (please see the "Leak Detection" section for more information) occurs when a block of memory is allocated and never released. The call stack depth is set to 1, which means that Parallel Inspector keeps track of the running function and its caller. The allocators and deallocators of interest are:

 

Platform        Memory Allocator                                    Matching Deallocator

C++ language    new operator                                        delete operator
                new[] operator                                      delete[] operator

C language      malloc(), calloc(), or realloc() functions          free() function

Windows* API    Windows* dynamic memory functions such as           Appropriate functions
                VirtualAlloc(), GlobalAlloc() or LocalAlloc()

Example:

void func()
{
  char *pStr = (char*) malloc(512);  // allocated here...
  return;                            // ...but never freed: the 512 bytes are leaked
}

 

2. The second analysis level detects if the application has invalid memory accesses, invalid deallocations and mismatched allocations/deallocations.

Invalid memory accesses occur when a read or write instruction references memory that is logically or physically invalid.

Example:

void func()
{
  char *pStr = (char *) malloc(512);
  free(pStr);
  pStr[0] = 'a';  // invalid write: the block has already been freed
  return;
}


At this level, invalid partial memory accesses can also be identified. An invalid partial access occurs when a read instruction references a block of memory (2 bytes or more) where part of the block is logically invalid.

Example:

void func()

{
  int *pArray1 = (int*)malloc(10*sizeof(int));
  int *pArray2 = (int*)malloc(9*sizeof(int));
  memset(pArray2, 1, 9*sizeof(int));
  memcpy(pArray1, pArray2, 10*sizeof(int));  // Can result in invalid partial read
}


3. The third analysis level is similar to the second level, except that the call stack depth is increased from 1 to 12 and the enhanced dangling pointer check is enabled. Dangling pointers are pointers that refer to data that no longer exists. When a deallocation occurs, Intel Parallel Inspector delays it so that the memory is not available for reallocation (it can't be returned by another allocation request). Any reference to that memory after the deallocation is therefore guaranteed to be an invalid reference through a dangling pointer. This technique consumes resources, so Intel Parallel Inspector must eventually return the address space for future allocation, which limits the window within which such dangling references can be detected.
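To make the delayed-deallocation idea more concrete, here is a minimal sketch of a quarantine queue. It is only an illustration under my own assumptions, not Intel Parallel Inspector's code: the names quarantine_free, is_quarantined, quarantine_size and the capacity QUARANTINE_SLOTS are hypothetical, and a real tool would more likely bound the quarantine by bytes than by block count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical illustration only: not Intel Parallel Inspector's code. */
#define QUARANTINE_SLOTS 4   /* assumed capacity of the quarantine */

static void  *quarantine[QUARANTINE_SLOTS]; /* ring buffer of freed-but-held blocks */
static size_t q_head  = 0;                  /* index of the oldest held block */
static size_t q_count = 0;                  /* number of blocks currently held */

/* Number of blocks currently held back from the real allocator. */
size_t quarantine_size(void)
{
    return q_count;
}

/* Returns 1 if ptr is still held in quarantine, which means an access
   through it can be reported as a dangling reference with certainty. */
int is_quarantined(const void *ptr)
{
    for (size_t i = 0; i < q_count; i++) {
        if (quarantine[(q_head + i) % QUARANTINE_SLOTS] == ptr)
            return 1;
    }
    return 0;
}

/* Stand-in for free(): hold the block instead of releasing it. Only when
   the quarantine is full is the oldest block really freed, which is the
   point where its address space becomes reusable again. */
void quarantine_free(void *ptr)
{
    if (ptr == NULL)
        return;
    if (q_count == QUARANTINE_SLOTS) {
        free(quarantine[q_head]);
        q_head = (q_head + 1) % QUARANTINE_SLOTS;
        q_count--;
    }
    quarantine[(q_head + q_count) % QUARANTINE_SLOTS] = ptr;
    q_count++;
}
```

The fixed capacity is what produces the limitation described above: once a block has been pushed out of the queue, a later access through a stale pointer can no longer be distinguished from an access to whatever was allocated in its place.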

4. The fourth analysis level tries to find all memory problems by increasing the call stack depth to 32, enabling the enhanced dangling pointer check, including the system libraries in the analysis, and analyzing memory problems on the stack. Stack analysis is enabled only at this level.

Examples:

void func()

{
  int a;
  int b = a * 4; // uninitialized read of stack variable a
}

 

void stackUnderrun()
{
  char array[10];
  strcpy(array, "my string");
  int len = strlen(array) - 1;
  while (array[len] != 'Z') // Will read from below the stack pointer
      len--;
}

Please note that for analysis levels 2-4, if an application overloads or changes the semantics of the standard runtime allocation routines, the behavior of Intel® Parallel Inspector is undefined and could possibly lead to a crash.

As we already know, Intel Parallel Inspector tracks all memory allocations and accesses while the application is running. To support the analysis levels described above, it relies on two technologies, leak detection and memory checking, each of which uses Pin in a different way. The figure below illustrates how the instrumentation technology used by Pin works.

insp-2.JPG
Note: Please see http://en.wikipedia.org/wiki/Trampoline_(computers) for a definition of "trampoline".


Leak Detection

During the first-level analysis, the Leak Detection Tool (LDT) stores the parameters of allocated objects until they are explicitly deallocated or until the application terminates. Using Pin (more specifically, PinProbes) technology, LDT instruments the general C/C++/OS memory management functions and receives a notification whenever the application calls one of them. Inside each instrumented function, LDT unwinds the current thread's call stack to record the call sequence that led to the application state at the time of the call. After the application finishes, the results are written to a file.

1. When the application starts, Pin injects its module into the application's process before the first instruction of the application has been executed. It also redirects library loading to Pin.

2. After the application starts executing, binaries are loaded into the process space and LDT instruments the memory management functions mentioned earlier.

3. When the application calls an instrumented function, the call is redirected to the Pin trampoline, where the relevant information, such as the parameters, the size of the object and its address, is captured and stored in an internal buffer of allocated objects.

4. When the application finishes, all objects that were not explicitly deallocated are considered memory leaks.
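The bookkeeping behind steps 3 and 4 can be sketched with plain wrapper functions. This is a simplified illustration under my own assumptions, not LDT's implementation: tracked_malloc, tracked_free, leaked_bytes and the fixed-size table are hypothetical names, the wrappers are called explicitly instead of being reached through a Pin trampoline, and a string stands in for the recorded call stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical illustration only: a fixed-size table of live allocations. */
#define MAX_TRACKED 1024

typedef struct {
    void       *addr;  /* address returned by the allocator, NULL if the slot is free */
    size_t      size;  /* requested size in bytes */
    const char *site;  /* stand-in for the call stack recorded at allocation time */
} AllocRecord;

static AllocRecord live[MAX_TRACKED];

/* Plays the role of the instrumented malloc(): allocate, then record. */
void *tracked_malloc(size_t size, const char *site)
{
    void *p = malloc(size);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (live[i].addr == NULL) {
            live[i].addr = p;
            live[i].size = size;
            live[i].site = site;
            break;
        }
    }
    return p;  /* if the table is full, the block is simply left untracked */
}

/* Plays the role of the instrumented free(): drop the record, then free. */
void tracked_free(void *p)
{
    if (p == NULL)
        return;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (live[i].addr == p) {
            live[i].addr = NULL;
            break;
        }
    }
    free(p);
}

/* At program exit, every record still in the table is a leak candidate.
   Returns the total number of bytes that were never deallocated. */
size_t leaked_bytes(void)
{
    size_t total = 0;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (live[i].addr != NULL)
            total += live[i].size;
    }
    return total;
}
```

With this sketch, the leak example from the first analysis level would correspond to calling tracked_malloc(512, "func") without a matching tracked_free, after which leaked_bytes() reports 512.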

Additionally, the concept of reachable blocks is used to limit what is reported as a leak. If an allocated block is reachable through pointers starting from global memory, Parallel Inspector doesn't report it as a leak. These tend to be the blocks that live for the lifetime of the application, and deallocating them at program exit is treated as "a waste of time".
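One common way to approximate reachability is a conservative scan: treat every pointer-sized word in the root region as a possible pointer, mark the blocks it points into, and then repeat the scan inside the marked blocks. The sketch below illustrates that idea only. I am not claiming it is how Parallel Inspector implements the check; the Block type and the functions mark_from and count_unreachable are hypothetical, and the code assumes that a pointer has the same size and representation as uintptr_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record for one allocation that was never deallocated. */
typedef struct {
    void  *addr;       /* start of the block */
    size_t size;       /* size of the block in bytes */
    int    reachable;  /* set to 1 once some scanned word points into the block */
} Block;

/* Scan [start, start + len) one pointer-sized word at a time and mark every
   block that a word points into. Returns the number of newly marked blocks. */
static int mark_from(const void *start, size_t len, Block *blocks, size_t n)
{
    int newly_marked = 0;
    const unsigned char *bytes = start;
    for (size_t off = 0; off + sizeof(uintptr_t) <= len; off += sizeof(uintptr_t)) {
        uintptr_t word;
        memcpy(&word, bytes + off, sizeof word);
        for (size_t i = 0; i < n; i++) {
            uintptr_t lo = (uintptr_t)blocks[i].addr;
            if (!blocks[i].reachable && word >= lo && word < lo + blocks[i].size) {
                blocks[i].reachable = 1;
                newly_marked++;
            }
        }
    }
    return newly_marked;
}

/* Mark everything reachable from the root region, following pointers stored
   inside reachable blocks, and return how many blocks stay unreachable. */
size_t count_unreachable(const void *roots, size_t roots_len, Block *blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        blocks[i].reachable = 0;

    int changed = mark_from(roots, roots_len, blocks, n);
    while (changed) {
        changed = 0;
        for (size_t i = 0; i < n; i++) {
            if (blocks[i].reachable)
                changed += mark_from(blocks[i].addr, blocks[i].size, blocks, n);
        }
    }

    size_t unreachable = 0;
    for (size_t i = 0; i < n; i++) {
        if (!blocks[i].reachable)
            unreachable++;
    }
    return unreachable;
}
```

Only the blocks counted by count_unreachable would be reported as leaks in this model; a block that a global still points to, directly or through other blocks, is left out of the report.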


Memory Checking

Like the leak detection technology, the memory checking technology uses Pin to instrument memory management functions and to collect information on their invocations and usage. It is used during analysis levels 2 to 4. In addition to detecting leaks, the memory checker tool can identify uninitialized and invalid accesses, dangling pointer issues and memory problems on the stack.

To detect memory problems, the memory checker tool creates a bitmap representation of the application's memory using page table entries. It uses one bit per byte of memory in the process's address space to indicate whether that byte is valid, and a second bit per byte to indicate whether that byte is initialized.
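The two-bits-per-byte scheme can be sketched over a toy address space. This is an illustration under my own assumptions rather than the tool's actual data structure: SHADOW_BYTES, shadow_mark, shadow_check_read and the AccessResult values are hypothetical names, and the toy "addresses" are simply offsets from 0 to SHADOW_BYTES - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_BYTES 4096  /* size of the toy address space (assumption) */

static uint8_t valid_bits[SHADOW_BYTES / 8];  /* 1 bit per byte: is it valid? */
static uint8_t init_bits[SHADOW_BYTES / 8];   /* 1 bit per byte: is it initialized? */

static void set_bit(uint8_t *map, size_t addr, int on)
{
    if (on)
        map[addr / 8] |= (uint8_t)(1u << (addr % 8));
    else
        map[addr / 8] &= (uint8_t)~(1u << (addr % 8));
}

static int get_bit(const uint8_t *map, size_t addr)
{
    return (map[addr / 8] >> (addr % 8)) & 1;
}

/* Update both bitmaps for a range, e.g. on allocation, store or deallocation. */
void shadow_mark(size_t addr, size_t len, int valid, int initialized)
{
    assert(addr + len <= SHADOW_BYTES);
    for (size_t a = addr; a < addr + len; a++) {
        set_bit(valid_bits, a, valid);
        set_bit(init_bits, a, initialized);
    }
}

typedef enum {
    ACCESS_OK,
    ACCESS_INVALID,            /* every byte read is invalid */
    ACCESS_PARTIALLY_INVALID,  /* only some of the bytes read are invalid */
    ACCESS_UNINITIALIZED       /* valid memory that was never written */
} AccessResult;

/* Classify a read of len bytes starting at addr by consulting the bitmaps. */
AccessResult shadow_check_read(size_t addr, size_t len)
{
    size_t invalid = 0, uninit = 0;
    assert(len > 0 && addr + len <= SHADOW_BYTES);
    for (size_t a = addr; a < addr + len; a++) {
        if (!get_bit(valid_bits, a))
            invalid++;
        else if (!get_bit(init_bits, a))
            uninit++;
    }
    if (invalid == len) return ACCESS_INVALID;
    if (invalid > 0)    return ACCESS_PARTIALLY_INVALID;
    if (uninit > 0)     return ACCESS_UNINITIALIZED;
    return ACCESS_OK;
}
```

In this model, the memcpy example from the second analysis level reads 40 bytes from a 36-byte block (assuming a 4-byte int), so the check returns ACCESS_PARTIALLY_INVALID: the first 36 bytes are valid and the last 4 are not.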


Stack Analysis

Stack memory can be used for local variables, local dynamic allocations (e.g., alloca), passing parameters, saving and restoring registers, and storing addresses for call/return handling. Initially, the memory that the stack will occupy is marked as invalid and uninitialized. When each thread starts, the initial stack pointer value is used to mark the bitmap for the initial stack region as valid and initialized. Subsequent changes to the stack pointer are monitored, and the bitmaps representing the changing regions are updated.

For the stack regions set up for local variables, the bitmap representing the memory is marked as valid and uninitialized when the region is reserved, which happens when the stack is expanded. When the stack region is released again, it is marked as invalid and uninitialized. The region used for parameters needs to be marked as valid and initialized. For the stack regions used for everything else, the bitmap remains unchanged, so reads and writes to these regions are flagged as errors. This might occur, for example, if a function underruns an array.

When a call instruction is seen, the region between the lowest previously tracked stack pointer value and the current stack pointer value is marked as valid and initialized. This represents the parameters passed to the function called.

When a return instruction is seen, the memory region between the lowest tracked stack pointer value for the current frame and the lowest tracked stack pointer value for the prior frame is marked as invalid and uninitialized.
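The stack rules above can be modeled with a small state machine over a toy stack that grows toward lower addresses. This is a deliberately simplified sketch under my own assumptions: it keeps one state value per byte instead of two bitmaps, it models parameter passing as an explicit push rather than tracking the lowest stack pointer seen at a call instruction, and all names (on_push_params, on_reserve_locals, on_release, on_write, state_at, stack_pointer) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_SIZE 256  /* size of the toy stack in bytes (assumption) */

typedef enum { STK_INVALID = 0, STK_VALID_UNINIT, STK_VALID_INIT } ByteState;

static ByteState shadow[STACK_SIZE];  /* all bytes start out invalid */
static size_t    sp = STACK_SIZE;     /* toy stack pointer, grows toward 0 */

static void mark(size_t lo, size_t hi, ByteState s)
{
    for (size_t a = lo; a < hi; a++)
        shadow[a] = s;
}

size_t stack_pointer(void) { return sp; }

ByteState state_at(size_t addr)
{
    assert(addr < STACK_SIZE);
    return shadow[addr];
}

/* Parameters passed to a callee are valid and already initialized. */
void on_push_params(size_t bytes)
{
    assert(bytes <= sp);
    sp -= bytes;
    mark(sp, sp + bytes, STK_VALID_INIT);
}

/* Expanding the stack for locals makes the region valid but uninitialized. */
void on_reserve_locals(size_t bytes)
{
    assert(bytes <= sp);
    sp -= bytes;
    mark(sp, sp + bytes, STK_VALID_UNINIT);
}

/* Shrinking the stack (frame teardown on return) invalidates the region. */
void on_release(size_t bytes)
{
    assert(sp + bytes <= STACK_SIZE);
    mark(sp, sp + bytes, STK_INVALID);
    sp += bytes;
}

/* A store to valid stack memory initializes the bytes it touches. */
void on_write(size_t addr, size_t len)
{
    assert(addr + len <= STACK_SIZE);
    for (size_t a = addr; a < addr + len; a++) {
        if (shadow[a] == STK_VALID_UNINIT)
            shadow[a] = STK_VALID_INIT;
    }
}
```

In this model, the uninitialized read of the local variable a in the fourth-level example shows up as a read of a byte in state STK_VALID_UNINIT, and the stackUnderrun loop eventually touches a byte in state STK_INVALID below the current stack pointer.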

The bitmap representation is also used for uninitialized memory accesses and invalid memory accesses.



In this article I have tried to explain briefly how Intel Parallel Inspector detects memory errors and which technologies it employs. Intel Parallel Studio can help you identify hard-to-find memory errors (and, of course, threading errors, which were not covered in this article) and increase the quality of your code.


Some of the Relevant Microsoft* Windows APIs

AllocateUserPhysicalPages, FreeUserPhysicalPages, MapUserPhysicalPages, MapUserPhysicalPagesScatter, HeapAlloc, HeapCompact, HeapCreate, HeapDestroy, HeapFree, HeapLock, HeapQueryInformation, HeapReAlloc, VirtualAlloc, VirtualAllocEx, VirtualAllocExNuma, VirtualFree, VirtualFreeEx, GlobalAlloc, LocalAlloc, GlobalFree, LocalFree, GlobalLock, LocalLock, GlobalReAlloc, LocalReAlloc, GlobalUnlock, LocalUnlock, _calloc_dbg, _expand, _expand_dbg, _free_dbg, _malloc_dbg, _realloc_dbg, calloc, free, malloc, realloc, new, new[], delete, delete[]


by Levent Akyil