• 2020
  • 06/18/2020
  • Public Content

False Sharing

This recipe explores profiling a memory-bound
application using the General Exploration and Memory Access analyses of the Intel® VTune™ Amplifier.
Content expert: Dmitry Ryabtsev
General Exploration analysis is renamed to Microarchitecture Exploration analysis starting with Intel VTune Amplifier 2019.


This section lists the hardware and software tools used for the performance analysis scenario.
  • Application:
    . The
    sample package is available with the product in the
    directory and at
  • Performance analysis tools:
    • Intel VTune Amplifier 2018: General Exploration, Memory Access analysis
    • For
      downloads and product support, visit .
    • All the Cookbook recipes are scalable and can be applied to Intel VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
    • Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler starting with its version for Intel® oneAPI Base Toolkit (Beta). You can still use a standalone version of the VTune Profiler, or its versions integrated into Intel Parallel Studio XE or Intel System Studio.
  • Operating system: Linux*, Ubuntu* 16.04 64-bit
  • CPU: Intel® Core™ i7-6700K processor

Run General Exploration Analysis

To have a high-level understanding of potential performance bottlenecks for the sample, start with the General Exploration analysis provided by the VTune Amplifier:
  1. Click the New Project button on the toolbar and specify a name for the new project, for example:
  2. In the Analysis Target window, select the local host target system type for the host-based analysis.
  3. Select the Launch Application target type and specify an application for analysis on the right.
  4. Click the Choose Analysis button on the right, select Microarchitecture Analysis > General Exploration, and start the analysis.
VTune Amplifier launches the application, collects data, and finalizes the data collection result, resolving the symbol information that is required for successful source analysis.

Identify a Bottleneck

Start with the Summary view, which provides application-level statistics per hardware metric.
For performance analysis, it is recommended to create a baseline against which to measure your future optimizations. In this case, use the application Elapsed Time as your baseline.
A brief analysis of the summary metrics shows that the application is mostly bound by contested memory accesses.

Find a Contended Data Structure

A high value for the Contested Accesses metric prompts you to dig deeper and run the Memory Access analysis with the Analyze dynamic memory objects option enabled. This analysis helps you find out which data structure's accesses caused the contention.
From the Summary view, you see that a memory allocation data object in file stddefines.h at line 52 introduced the highest latency to the application execution. The allocation is small, only 512 bytes, so it should fit fully into the L1 cache. For more details, click this object in the table to switch to the Bottom-up view.
The average access latency to this object is 59 cycles, which is very high for an object small enough to reside in the L1 cache. This can be the source of the contested accesses performance problem.
Expand the stddefines.h:52 (512B) memory object in the grid to view the allocation stack. Double-click the allocation stack to go deeper to the Source view that highlights the line where the object is allocated:

    typedef struct {
        pthread_t tid;
        POINT_T *points;
        int num_elems;
        long long SX;
        long long SY;
        long long SXX;
        long long SYY;
        long long SXY;
    } lreg_args;
The thread code accessing the lreg_args array looks like this:

    // ADD Up RESULTS
    for (i = 0; i < args->num_elems; i++)
    {
        //Compute SX, SY, SYY, SXX, SXY
        args->SX  += args->points[i].x;
        args->SXX += args->points[i].x*args->points[i].x;
        args->SY  += args->points[i].y;
        args->SYY += args->points[i].y*args->points[i].y;
        args->SXY += args->points[i].x*args->points[i].y;
    }

Each thread independently updates its own element of the array on every loop iteration, a pattern that suggests false sharing.
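The access pattern described above can be reduced to a small sketch. The names slot_t, worker_arg_t, worker, and run_workers are hypothetical and do not come from the sample; the sketch only assumes POSIX threads. Each worker writes exclusively to its own array element, so there is no data race and the result is correct, yet the elements are adjacent in memory and several of them can land in the same 64-byte cache line.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread slot: 8 bytes each, so up to eight
   neighboring slots can occupy one 64-byte cache line. */
typedef struct {
    long long sum;
} slot_t;

typedef struct {
    slot_t *slot;      /* this thread's own element of the shared array */
    long long iters;   /* number of updates to perform */
} worker_arg_t;

static void *worker(void *p) {
    worker_arg_t *a = p;
    /* Every iteration updates memory that only this thread uses,
       but that memory may share a cache line with other slots. */
    for (long long i = 0; i < a->iters; i++)
        a->slot->sum += 1;
    return NULL;
}

/* Starts nthreads workers, each incrementing its own slot iters times.
   Returns the sum over all slots, or -1 on allocation/thread failure. */
long long run_workers(int nthreads, long long iters) {
    slot_t *slots = calloc((size_t)nthreads, sizeof *slots);
    worker_arg_t *args = malloc((size_t)nthreads * sizeof *args);
    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    long long total = -1;

    if (slots && args && tids) {
        int started = 0;
        for (; started < nthreads; started++) {
            args[started].slot = &slots[started];
            args[started].iters = iters;
            if (pthread_create(&tids[started], NULL, worker,
                               &args[started]) != 0)
                break;
        }
        for (int i = 0; i < started; i++)
            pthread_join(tids[i], NULL);
        if (started == nthreads) {
            total = 0;
            for (int i = 0; i < nthreads; i++)
                total += slots[i].sum;
        }
    }
    free(tids);
    free(args);
    free(slots);
    return total;
}
```

Calling run_workers(4, 100000) yields 400000 regardless of how the slots are laid out; false sharing affects only how long the loop takes, not what it computes, which is why it tends to show up in a profiler rather than in test results.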
The size of the lreg_args structure in the sample is 64 bytes, which matches the cacheline size. But when you allocate an array of these structures, there is no guarantee that the array starts on a 64-byte boundary. As a result, array elements may cross cacheline boundaries, so that neighboring elements updated by different threads end up in the same cacheline. This triggers an unintended contention issue: false sharing.
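The layout arithmetic behind this can be checked with a short sketch. The helper names lines_touched and shares_line_with_next are hypothetical, and the POINT_T definition is an assumption, since the recipe does not show it; only the lreg_args field list is taken from the recipe. The example base addresses are illustrative: a general-purpose malloc only promises alignment suitable for fundamental types (commonly 16 bytes on 64-bit Linux), so a base such as 0x1010 is plausible.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u

/* Assumption: POINT_T is not shown in the recipe; a two-byte
   point is used here only so that the structure compiles. */
typedef struct { char x; char y; } POINT_T;

/* Field list as shown in the recipe. */
typedef struct {
    pthread_t tid;
    POINT_T *points;
    int num_elems;
    long long SX;
    long long SY;
    long long SXX;
    long long SYY;
    long long SXY;
} lreg_args;

/* Number of cache lines covered by element `index` of an array of
   elem_size-byte elements that starts at address `base`. */
unsigned lines_touched(uintptr_t base, size_t elem_size, size_t index) {
    uintptr_t first = base + index * elem_size;
    uintptr_t last = first + elem_size - 1;
    return (unsigned)(last / CACHE_LINE - first / CACHE_LINE) + 1u;
}

/* Nonzero if the last byte of element `index` and the first byte of
   element `index + 1` fall into the same cache line. */
int shares_line_with_next(uintptr_t base, size_t elem_size, size_t index) {
    uintptr_t first_of_next = base + (index + 1) * elem_size;
    uintptr_t last_of_this = first_of_next - 1;
    return last_of_this / CACHE_LINE == first_of_next / CACHE_LINE;
}
```

With a 64-byte element, a base of 0x1000 gives lines_touched(...) == 1 and no sharing with the next element, while a base of 0x1010 makes every element straddle two lines and share one of them with its neighbor. On a typical 64-bit Linux build, sizeof(lreg_args) is 64, matching the figure quoted in the recipe.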

Fix False Sharing Issue

To fix this false sharing problem, switch to an allocation function that returns memory aligned on a 64-byte boundary.
Re-compiling and re-running the application analysis with
provides the following result:
The Elapsed Time is now 0.5 seconds, a significant improvement over the original 3 seconds. The Memory Bound bottleneck is gone, and the false sharing performance issue is fixed.
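A minimal sketch of a cache-line-aligned allocation follows. It uses the C11 function aligned_alloc as a stand-in and is not necessarily the call used in the sample; the helper names alloc_cache_aligned and is_cache_aligned are hypothetical. Alignment alone removes the straddling only when the element size is itself a multiple of 64 bytes, as is the case for the 64-byte structure in this recipe.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64u

/* Allocates a zero-filled array of `count` elements of `elem_size`
   bytes that starts on a 64-byte boundary. Release it with free().
   Returns NULL on invalid arguments, overflow, or allocation failure. */
void *alloc_cache_aligned(size_t count, size_t elem_size) {
    if (count == 0 || elem_size == 0 || count > SIZE_MAX / elem_size)
        return NULL;

    size_t bytes = count * elem_size;
    if (bytes > SIZE_MAX - (CACHE_LINE - 1))
        return NULL;

    /* aligned_alloc expects the size to be a multiple of the
       alignment, so round the request up to whole cache lines. */
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;

    void *p = aligned_alloc(CACHE_LINE, rounded);
    if (p != NULL)
        memset(p, 0, rounded);
    return p;
}

/* Nonzero if `p` lies on a cache-line boundary. */
int is_cache_aligned(const void *p) {
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```

For example, alloc_cache_aligned(8, 64) returns a 512-byte block in which each 64-byte element occupies exactly one cache line, so threads that each update a single element no longer invalidate each other's lines.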
To discuss this recipe, visit the developer forum.
