<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Sun, 12 Feb 2012 02:44:33 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/intel-vtune-performance-analyzer-for-windows-kb/type/tips-and-techniques/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/intel-vtune-performance-analyzer-for-windows-kb/type/tips-and-techniques/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Identify Long Latency Instruction Impacts</title>
      <description><![CDATA[ Long latency instructions such as division and square-root can introduce stalls during the execution of an application. The Intel® VTune<sup>TM</sup> Amplifier XE performance profiler can help software developers analyze their application to identify algorithmic and microarchitectural performance issues. The VTune<sup>TM</sup> Amplifier XE uses the processor's Performance Monitoring Unit (PMU) to sample processor events and can be used to statistically sample the number of computational operations.<br /><br />The VTune<sup>TM</sup> Amplifier XE can help identify where such operations are taking place and if these operations are contributing to stall cycles during the execution.<br /><br /> <br /> 
<table border="1" cellpadding="0" cellspacing="0">
<tbody >
<tr >
<td  rowspan="2" width="158">Intel® Core<sup>TM</sup> 2 processor family   (Intel®  Core<sup>TM</sup> 2 Duo/Quad, etc)<br /><br /></td>
<td  width="161"><b>DIV</b><br /><br /></td>
<td  valign="top" width="415">Counts the number of divide operations   executed. This includes integer divides, floating point divides and   square-root operations executed.<br /></td>
</tr>
<tr >
<td  width="161"><b>CYCLES_DIV_BUSY</b><br /><br /></td>
<td  valign="top" width="415">Counts the number of cycles the   divider is busy executing divide or square root operations. The divide can be   integer, X87 or Streaming SIMD Extensions (SSE). The square root operation   can be either X87 or SSE.<br /></td>
</tr>
<tr >
<td  rowspan="2" width="158">Intel® Core<sup>TM</sup> architecture (Intel®   Core<sup>TM</sup> i7, i5, i3; a.k.a Nehalem)<br /><br /></td>
<td  width="161"><b>ARITH.DIV</b><br /><br /></td>
<td  valign="top" width="415">Counts number of divide or square root   operations. The divide can be integer, X87 or Streaming SIMD Extensions   (SSE). The square root operation can be either X87 or SSE.<br /></td>
</tr>
<tr >
<td  width="161"><b>ARITH.CYCLES_DIV_BUSY</b><br /><br /></td>
<td  valign="top" width="415">Counts the number of   cycles the divider is busy executing divide or square root operations. The   divide can be integer, X87 or Streaming SIMD Extensions (SSE). The square   root operation can be either X87 or SSE.<br /></td>
</tr>
<tr >
<td  rowspan="2" width="158">2<sup>nd</sup> Generation Intel® Core<sup>TM</sup> architecture (a.k.a SandyBridge)<br /><br /></td>
<td  width="161"><b>ARITH.FP_DIV</b><br /><br /></td>
<td  valign="top" width="415">Counts the number of divide operations executed.</td>
</tr>
<tr >
<td  width="161"><b>ARITH.FPU_DIV_ACTIVE</b><br /><br /></td>
<td  valign="top" width="415">Counts the cycles when the divider is busy executing divide or square-root operations. A constant of 6 additional cycles should be added per operation.<br /></td>
</tr>
</tbody>
</table>
<br />
<div ><b><span >Table 1:</span></b> PMU events used to count divide and square-root operations and the cycles during which the divider is busy.<br /></div>
<br /> <b><span ><br />Example:</span></b> Let's consider the N Body problem for this exercise. The <i>N-body</i> problem predicts the motion of a group of celestial objects that interact with each other gravitationally. The sample application proceeds over time steps and in each step computes the net force on every body and updates its position, acceleration and velocity accordingly. This implementation requires <i>O(N<sup>2</sup>)</i> operations in each iteration.<br /><br />
<pre name="code" class="cpp">runSerialBodies()
{
...
// Run the simulation over a fixed range of time steps
for (double s = 0.; s &lt; STEPLIMIT; s += TIMESTEP)
{

  // Compute the accelerations of the bodies
  for (i = 0; i &lt; n - 1; ++i)
  {
   for (j = i + 1; j &lt; n; ++j)
   {

     // compute the distance between them
     double dx = body[i].pos[0]-body[j].pos[0];
     double dy = body[i].pos[1]-body[j].pos[1];
     double dz = body[i].pos[2]-body[j].pos[2];

     double distsq = dx*dx + dy*dy + dz*dz;

     if (distsq &lt; MINDIST) distsq = MINDIST;

     double dist = sqrt(distsq);

     // compute the unit vector from j to i
     double ud[3];
     ud[0] = dx / dist;
     ud[1] = dy / dist;
     ud[2] = dz / dist;

     // F = G*mi*mj/distsq, but F = ma, so ai = G*mj/distsq
     double Gdivd = GFORCE/distsq;
     double ai = Gdivd*body[j].mass;
     double aj = Gdivd*body[i].mass;
     // apply acceleration components using unit vectors
     for (int k = 0; k &lt; 3; ++k)
     {
        body[j].acc[k] += aj*ud[k];
        body[i].acc[k] -= ai*ud[k];
     }
   }
  }

  // Apply acceleration and advance bodies
  for (i = 0; i &lt; n; ++i)
  {
    for (j = 0; j &lt; 3; ++j)
    {
       body[i].vel[j] += body[i].acc[j] * TIMESTEP;
       body[i].pos[j] += body[i].vel[j] * TIMESTEP;
       body[i].acc[j] = 0.;
    }
  }

}
...
}</pre>
<br /><br /> Analyzing the sample code on an Intel® Core<sup>TM</sup> i7 (x980) based system (3.33GHz, 6 cores, Hyper-Threading enabled) with VTune<sup>TM</sup> Amplifier XE reveals the following:<br /><br /> <br /> <img src="http://software.intel.com/file/35268" title="fig1.png" alt="fig1.png" height="458" width="775" /><br /><br /> <b><span >Figure 1:</span></b> Analysis of the sample code. 85% of the clockticks, 89.3% of the uops dispatched and 83.4% of the dispatch stalls occur in this code segment. See the VTune<sup>TM</sup> Amplifier XE help file for more information on events such as UOPS_DISPATCHED.<br /><br /> <br /> One way to optimize the code is to replace the three divisions by dist with a single reciprocal followed by three multiplications, as shown below.<br /><br />
<pre name="code" class="cpp">// compute the unit vector from j to i
double ud[3];
ud[0] = dx / dist;
ud[1] = dy / dist;
ud[2] = dz / dist;
</pre>
<br />to<br />
<pre name="code" class="cpp">// compute the unit vector from j to i
double ud[3];
double dd = 1.0 / dist;
ud[0] = dx * dd;
ud[1] = dy * dd;
ud[2] = dz * dd; </pre>
<br /><br /> The optimized code consumes 4,668 million fewer clockticks and reduces the dispatch stalls from 7,448 million cycles to 3,020 million cycles.<br /><br /><br /> <img src="http://software.intel.com/file/35269" title="fig2.png" alt="fig2.png" height="335" width="646" /><br /><br /> <b><span >Figure 2:</span></b> Comparison of the original and optimized versions.<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/identify-long-latency-instruction-impacts/</link>
      <pubDate>Sat, 02 Apr 2011 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/identify-long-latency-instruction-impacts/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/identify-long-latency-instruction-impacts/</guid>
      <category>Tools</category>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
      <category>Intel® VTune™ Amplifier XE Knowledge Base</category>
    </item>
    <item>
      <title>Estimating FLOPS using Event Based Sampling (EBS)</title>
      <description><![CDATA[ <p>The FLOPS (or flops or flop/s) is an acronym for <b>fl</b>oating point <b>op</b>erations per <b>s</b>econd and is a measure heavily used in high performance computing. The FLOPS is a common way of measuring the performance and computational capabilities of a given microprocessor.</p>
<p>In this article, you will find out how hardware based Event Based Sampling (EBS) technology can help developers estimate the floating point operations per second executed by their applications. FLOPS will refer to 32 bit and 64 bit floating point operations and the operations will be either addition or multiplication (computational).</p>
<p>The Intel® VTune<sup>TM</sup> Amplifier XE is a performance analysis tool that helps software developers analyze their applications to identify algorithmic and microarchitectural performance issues. The VTune<sup>TM</sup> Amplifier XE uses the processor's Performance Monitoring Unit (PMU) to sample processor events, and some of these events can be used to statistically sample the number of computational floating point operations at execution.</p>
<p ><img title="fig1.png" alt="fig1.png" src="http://origin-software.intel.com/file/34526" /><br />Figure 1: Scalar processing vs. SIMD (<span >S</span>ingle <span >I</span>nstruction <span >M</span>ultiple <span >D</span>ata) processing</p>
<p > </p>
<p ><b><span ><img title="fig2.png" alt="fig2.png" src="http://origin-software.intel.com/file/34527" /> </span></b></p>
<p >Figure 2: Intel® Architecture integer, floating point, MMX and SSE (Streaming SIMD Extensions) registers.</p>
<p >Note: The figure doesn't show the latest AVX extension and registers.</p>
<p>As Figures 1 and 2 demonstrate, floating point operations can be performed on legacy x87 registers or on SSE registers, depending on how the compiler generates the code. If the floating point instructions are executed on SSE registers, they can be either scalar or packed operations. Table 1 (below) gives the PMU event names that can be used to statistically estimate the computational floating point operations executed by the hardware. Keep in mind that, because of the speculative nature of the architecture, not all executed instructions (which these events count) are retired. These events can therefore overcount.</p>
<p> </p>
<table border="1" cellpadding="0" cellspacing="0">
<tbody >
<tr >
<td rowspan="2"  valign="top" width="150">
<p ><b><span >Processor Generation</span></b></p>
</td>
<td colspan="3"  valign="top" width="549">
<p ><b><span >Processor Event Names</span></b></p>
</td>
</tr>
<tr >
<td  valign="top" width="153">
<p ><b>FP  operations using legacy x87 </b></p>
</td>
<td colspan="2"  valign="top" width="396">
<p ><b>FP operations using SIMD</b></p>
</td>
</tr>
<tr >
<td rowspan="4"  valign="top" width="150">
<p>Intel® Core<sup>TM</sup> 2 processor family (Intel®  Core<sup>TM</sup> 2 Duo/Quad, etc)</p>
<p> </p>
</td>
<td rowspan="4"  valign="top" width="153">
<p >X87_OPS_RETIRED.ANY</p>
</td>
<td  valign="top" width="123">
<p>Packed 64bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.PACKED_DOUBLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Packed 32bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.PACKED_SINGLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 64bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 32bit</p>
</td>
<td  valign="top" width="273">
<p>SIMD_COMP_INST_RETIRED.SCALAR_SINGLE</p>
</td>
</tr>
<tr >
<td rowspan="4"  valign="top" width="150">
<p>Intel® Core<sup>TM</sup> architecture (Intel® Core<sup>TM</sup> i7, i5, i3; a.k.a Nehalem)</p>
</td>
<td rowspan="4"  valign="top" width="153">
<p >FP_COMP_OPS_EXE.X87</p>
</td>
<td  valign="top" width="123">
<p>Packed 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Packed 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_FP_SCALAR</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_FP_SCALAR</p>
</td>
</tr>
<tr >
<td rowspan="4"  valign="top" width="150">
<p>2<sup>nd</sup> Generation Intel® Core<sup>TM</sup> architecture (a.k.a SandyBridge)</p>
</td>
<td rowspan="4"  valign="top" width="153">
<div >FP_COMP_OPS_EXE.X87<br /></div>
</td>
<td  valign="top" width="123">
<p>Packed 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Packed 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_PACKED_SINGLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 64bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE</p>
</td>
</tr>
<tr >
<td  valign="top" width="123">
<p>Scalar 32bit</p>
</td>
<td  valign="top" width="273">
<p>FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE</p>
</td>
</tr>
</tbody>
</table>
<p ><b><span >Table 1:</span></b> PMU events used to count computational floating point operations at execution.</p>
<p ><b><span >Note:</span></b> This table does not include the event names used to sample AVX floating point operations.</p>
<p>The VTune<sup>TM</sup> Amplifier XE can collect any one of these events, or all of them at the same time, to estimate the FLOPS achieved by an application. To measure the elapsed time, the CPU_CLK_UNHALTED (a.k.a. clockticks) event can be used. If the processor frequency is constant during the measurement period, the clockticks count can be converted to elapsed wall clock time. Keep in mind that the CPU_CLK_UNHALTED event name may vary with the processor architecture.</p>
<p>Alternatively, CPU_CLK_UNHALTED.REF, which counts the number of reference cycles and is not affected by thread frequency changes, can be used. The difference between the reference clocktick event and clocktick event is that even if a thread enters the halt state (by running the HLT instruction), the reference clocktick event continues to count as if the thread is continuously running at the maximum frequency.</p>
<p><b><span >Estimating FLOPS </span></b></p>
<p>The FLOPS formula can be given as follows:</p>
<blockquote >
<p><b>FLOPS </b>= ((number of FP ops / clock) * number of total computational FP ops) / Elapsed Time</p>
<p><b>Elapsed Time = </b>CPU_CLK_UNHALTED / Processor-Frequency / Number-of-Cores<br />Note: Only the cores with a non-zero CPU_CLK_UNHALTED event count should be considered in this formula.</p>
</blockquote>
<p>To demonstrate how EBS technology can be used to estimate the FLOPS, a simple multi-threaded matrix multiplication will be used. Each thread in the thread pool executes the following code.</p>
<p> </p>
<blockquote >
<p>double a[NUM][NUM];</p>
<p>double b[NUM][NUM];</p>
<p>double c[NUM][NUM];</p>
<p>...</p>
<p>slice = (unsigned int) tid;</p>
<p>from  = (slice * NUM) / NUM_THREADS;</p>
<p>to    = ((slice + 1) * NUM) / NUM_THREADS;</p>
<p> </p>
<p>for(i = from; i &lt; to; i++) {</p>
<p >for(j = 0; j&lt; NUM; j++) {</p>
<p >for(k = 0; k &lt; NUM; k++) {</p>
<p >// 2 fp ops / iteration: 1 add, 1 multiply<br />c[i][j] += a[i][k] * b[k][j];</p>
<p >}</p>
<p >}</p>
<p>}</p>
<p>...</p>
</blockquote>
<p> </p>
<p>The application also reports its own FLOPS measurement, obtained by dividing the total number of FP operations (2 per iteration * NUM * NUM * NUM) by the elapsed time. The elapsed time includes only the matrix multiplication and excludes the initialization and thread creation overhead.</p>
<p>To collect samples only for the relevant code section, the <i>__itt_pause()</i> (pauses the collection) and <i>__itt_resume()</i> (resumes the collection) APIs are used. Please refer to the VTune<sup>TM</sup> Amplifier XE documentation on how to use the user APIs.</p>
<p>VTune<sup>TM</sup> Amplifier XE can be configured as follows on an Intel® Core<sup>TM</sup> i7 (x980) based system (3.33GHz, 6 cores, Hyper-Threading enabled):</p>
<p> </p>
<p ><img title="fig3.png" alt="fig3.png" src="http://origin-software.intel.com/file/34636" /></p>
<p > </p>
<p><br /><b><span >Using x87 Registers</span></b></p>
<p>The sample application is compiled in release mode (optimization level set to Ox) on a Windows* system using Visual Studio.</p>
<p>The application reports the following when analyzed under VTune<sup>TM</sup> Amplifier XE.</p>
<p><img title="fig4.png" alt="fig4.png" src="http://origin-software.intel.com/file/34529" /></p>
<p><br />The results below give us insight on how the compiler generated the code.  In this run, we can clearly see that we only collected samples on FP operations using x87.</p>
<p ><img title="fig5.png" alt="fig5.png" src="http://origin-software.intel.com/file/34530" /></p>
<p>If we plug the numbers into the formula:</p>
<blockquote>
<p><b>MFLOPS Formula</b> = FP_COMP_OPS_EXE.FP<b> </b>/ 1x10<sup>6</sup> / Elapsed Time</p>
<p><b>Elapsed time</b> = CPU_CLK_UNHALTED.THREAD / Processor-Frequency / Number-of-Cores</p>
</blockquote>
<p> </p>
<blockquote>
<p>Elapsed Time = 607,652,000,000.00 / 3.33 x 10<sup>9 </sup>/ 12 = 15.206 secs</p>
<p>MFLOPS = 18,470,000,000.00 / 1x10<sup>6</sup> / 15.206 secs = <b><span >1,214.652 MFLOPS</span></b></p>
</blockquote>
<p> </p>
<p><b><span >Using SSE registers</span></b></p>
<p>Now, let's look at the same application when SSE registers are used.  If we compile the application using Intel® compiler version 12.0, we see the following results under the VTune<sup>TM</sup> Amplifier XE.</p>
<p><img title="fig6.png" alt="fig6.png" src="http://origin-software.intel.com/file/34531" /></p>
<p ><img title="fig7.png" alt="fig7.png" src="http://origin-software.intel.com/file/34532" /></p>
<p><br /><br />One thing you will notice right away in the new result displayed is the difference in the function names where the samples are happening.  In the earlier example, we were getting the samples in matrixMultiply function, but now we see the samples in threadPool function.  This is due to inlining (for more information: <a href="http://en.wikipedia.org/wiki/Inline_expansion">http://en.wikipedia.org/wiki/Inline_expansion</a>). Drilling down into the threadPool makes this clear.</p>
<p ><img title="fig8.png" alt="fig8.png" src="http://origin-software.intel.com/file/34533" /></p>
<p> </p>
<p>We multiply the FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION event count by <b>2</b> because <b>each packed instruction performs two double precision floating point operations</b> on a 128 bit XMM register. For packed single precision floating point operations, the event count needs to be multiplied by 4.</p>
<blockquote>
<p><b>MFLOPS Formula</b> = 2 * FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION / 1x10<sup>6 </sup>/ Elapsed Time</p>
<p><b>Elapsed time</b> = CPU_CLK_UNHALTED.THREAD / Processor-Frequency / Number-of-Cores</p>
</blockquote>
<blockquote>
<p>Elapsed time = (66,178,000,000 / 3.33 x10<sup>9</sup> / 12 ) =  1.656 secs</p>
<p>MFLOPS = 2 * FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION / 1 x 10<sup>6 </sup>/ 1.656 secs =  <b><span >11,053.140 MFLOPS</span></b></p>
</blockquote>
<p> </p> ]]></description>
      <link>http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/</link>
      <pubDate>Fri, 04 Feb 2011 15:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/</guid>
      <category>Parallel Programming</category>
      <category>Intel Software Network communities</category>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Custom Performance Counters, Visual Studio*, and the VTune Performance Analyzer</title>
      <description><![CDATA[ <br />
<div id="art_pre_template"><b>Problem : </b><br />If you have a Visual Studio solution where one of the projects is a performance counter DLL, after installing the VTune analyzer and integrating it into Visual Studio, you may find that any attempt to build the DLL fails.<br /><br /><b>Environment : </b><br />Windows*<br /><br /><b>Root Cause : </b><br />The VTune Performance Analyzer, upon start up, launches a helper process, 'vtunecca.exe'.  This helper process collects the list of performance objects and counters available in the system for configuration in preparation for Counter Monitor data collection.  This procedure of loading the performance objects and counters causes all custom performance counter DLLs to be loaded and locked into that process.  Consequently, when an attempt is made to build a custom performance counter DLL, the build fails because the DLL is open for access and cannot be overwritten.<br /><br /><b>Resolution : </b><br />There are a couple of workarounds for this known issue:<br /><ol>
<li>Un-integrate the VTune analyzer from Visual Studio.  Since the VTune analyzer has a standalone graphic interface, you can un-integrate from Visual Studio and use the standalone GUI to collect data.<br />To un-integrate: in the Control Panel, go to <b>Add/Remove Programs</b> and select the <b>VTune Performance Analyzer</b> and then press the <b>Change</b> button.  When the dialog opens, select <b>Modify</b> and step through the dialogs pressing <b>Next</b> until you get to the dialog that allows you to uncheck the <b>integrate with Visual Studio</b> option.</li>
<li>Rename the vtunecca.exe file prior to starting Visual Studio.  You can find the vtunecca.exe file in the C:\Program Files\Intel\VTune\Analyzer\Bin directory.  Of course, you cannot use the Counter Monitor while the name of this file is changed, but you will be able to build the DLL and test it using other means.<br />Simply rename the file back to the original vtunecca.exe and the Counter Monitor feature will begin working again, after restarting the VTune analyzer.</li>
</ol></div> ]]></description>
      <link>http://software.intel.com/en-us/articles/custom-performance-counters-visual-studio-and-the-vtune-performance-analyzer/</link>
      <pubDate>Sun, 15 Aug 2010 21:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/custom-performance-counters-visual-studio-and-the-vtune-performance-analyzer/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/custom-performance-counters-visual-studio-and-the-vtune-performance-analyzer/</guid>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Intel® FLEXlm* License Borrowing Capability</title>
      <description><![CDATA[ <strong>Overview<br /></strong><br />This feature allows users to “borrow” a license seat from the license host server for a limited time, disconnect from the network and use the borrowed license even with no connection to the license server. This is very useful when you want to use the software offline.<br /><br /><strong>Required Information</strong><br /><br />To use the license borrow functionality for Intel floating product licenses, customers need to ensure they have the following items:<br /><br />1) A build of the Intel® License Manager for FLEXlm* (for the desired OS) that supports the Borrow capability: <br /><br />a) Users need to make sure that they are using a build of Intel FLEXlm* License Manager which supports borrowing and early return of borrowed licenses. <br /><br />b) We recommend the customer download and install one of the free license manager servers available at the following website link:<br /><a href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-flexlm-license-servers/">http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-flexlm-license-servers/</a><br /><br />2) A license file which has the keyword BORROW in it: <br /><br />a) Licenses for Intel® Software Development Products with version 2011 have the BORROW feature with a BORROW period of 7 days enabled by default.  
<br /><br />b) To obtain a borrow-enabled license for a multi-seat floating product license with a version older than 2011, please submit an issue to Intel® Premier Support at https://premier.intel.com/ under product 'Download, Licensing and Registration'.<br />     <br /><br /><strong>Steps for starting the FLEXlm* License Manager (Server) for using the Borrow feature<br /></strong><br />(1) Start the license server using the new borrow-enabled license file(s).<br /><br />(2) Check the log for the Intel® FLEXlm* License Server and make sure that it does not complain about the BORROW keyword in the license file. <br /><br />     • By default, the log file location on Windows* is: %ProgramFiles%\Common Files\Intel\FLEXlm <br />     • By default, the log file location on Linux* and Mac OS* X is the directory where the FLEXlm server has been installed<br /><br />     A typical log file when FLEXlm Server has started successfully looks like the following:<br /><br />     14:00:31 (lmgrd) US Patents 5,390,297 and 5,671,412.<br />     14:00:31 (lmgrd) World Wide Web: http://www.macrovision.com<br />     14:00:31 (lmgrd) License file(s): server.lic<br />     14:00:31 (lmgrd) lmgrd tcp-port 28518<br />     14:00:31 (lmgrd) Starting vendor daemons ...<br />     14:00:31 (lmgrd) Started INTEL (internet tcp_port 35860 pid 9309)<br />     14:00:31 (INTEL) FLEXlm version 9.23<br />     14:00:31 (INTEL) Server started on LicenseServer for:<br />     14:00:31 (INTEL) I3F97C15E (consisting of:   ArBBL<br />     14:00:31 (INTEL) CCompL    DbgL    FCompL<br />     14:00:31 (INTEL) MKernL    PerfAnl    PerfPrimL<br />     14:00:31 (INTEL) StaticAnlL    ThreadAnlGui    ThreadBB)<br /><br /><br /><strong>Client System Setup for the Borrow feature (Application Setup)<br /></strong><br />NOTE: The term “Client” refers to the application that uses Intel FLEXlm floating license seat check-out and check-in. 
<br /><br />1) Download the lmutil for your operating system and architecture from http://www.globes.com/support/fnp_utilities_download.htm. <br /><br />2) If you are not able to download the lmutil from the website above, work with your Intel Support team contact (or submit an issue to Premier Support at https://premier.intel.com/) for access to lmutil, and have the information about your Operating System, OS Version and Architecture (IA-32, Intel® 64 and/or IA-64 [Intel® Itanium®]) ready.<br /><br />3) Verify that no license seats for Intel product components (e.g., Compiler Professional Edition, VTune, etc.) are currently borrowed by running the following command. If you see any information that indicates one or more features/components were borrowed, then borrowing has already been enabled for those licensed features/components. Here is an example of output when no borrowing is enabled.<br /><br />lmutil lmborrow -status<br /><br />Example:<br />lmutil lmborrow -status<br />lmutil - Copyright (c) 1989-2009 by Macrovision Corporation. All rights reserved.<br /><br /><br />4) Configure the borrow duration and FLEXlm feature to be borrowed:<br /><br />lmutil lmborrow INTEL dd-mmm-yyyy:[time] &lt;featurename&gt; -c &lt;serverlicense file&gt;<br /><br />Example:<br />lmutil lmborrow INTEL 06-Oct-2011 CCompL -c server.lic<br /><br />where <br />server.lic is the license file which was used to start the server. Note that license borrowing will fail if the license file on the client side is different from the one which was used to start the server.<br /><br />The command above borrows a “featurename” called CCompL (Intel® C++ Compiler for Linux*) from the vendor INTEL until 6th Oct 2011 using the license file server.lic<br /><br />NOTE:  The time specified on the lmborrow command line is the end date/time of the period the user plans to borrow for, which must be &lt;= 168 hours, the maximum borrow period. 
If the user wants to borrow the license seat for only 1 or 2 days, the corresponding date/time for that period should be set. <br />Users cannot borrow a license seat for more than the 168 hour limit that is set in the license file and in the license server logic. If an extended borrow time is required, please submit an issue to Intel® Premier Support at https://premier.intel.com/ under product 'Download, Licensing and Registration' providing a justification of why you need to extend the borrow time beyond 7 days.<br /><br />After running this command, the customer should see the following:<br /><br />lmutil - Copyright (c) 1989-2004 by Macrovision Corporation. All rights reserved.<br />Setting LM_BORROW=3-oct-2011:INTEL:06-oct-2011:CCompL<br /><br /><br />5) If the above steps are successful, you are now ready to borrow a seat by running the client application (e.g., Composer XE, VTune Amplifier XE, Inspector XE, etc.). The FLEXlm feature will be borrowed when you run the client/application and a successful check-out happens. As soon as the first license is checked out, the server log file will confirm the borrowed feature with the following message:<br /><br />14:35:14 (INTEL) OUT: "I3F97C15E" User1@Host1<br />14:35:14 (INTEL) OUT: "CCompL" User1@Host1<br /><br />Note that there are no corresponding IN entries in the server log. This differs from a normal check-out, where the two OUT entries in the server log file are followed by two matching IN entries after the application exits.<br /><br /><br />6) Verify that the FLEXlm product feature was really borrowed by running the following command on the client system: <br /><br />lmutil lmborrow -status<br /><br />Example for borrowing a seat for the Intel C++ Compiler for Linux*:<br /><br />lmutil lmborrow -status<br />lmutil - Copyright (c) 1989-2004 by Macrovision Corporation. 
All rights reserved.<br /><br />Vendor Feature Expiration<br />______ ________ __________<br /><br />INTEL CCompL 6-Oct-11 23:59<br /><br />NOTE: Before the Borrow period expires, the product will always get the license from the local storage for the Borrowed license seat when it tries to check out a license. The Client system on which the product is used does not need to be attached to the FLEXlm* license host server.<br /><br /><br />7) Disconnect the client system from the server network. You can now use the software application with the borrowed license.<br /><br />NOTE: After the Borrow period expires, the product will no longer be able to check out the license from local storage. Instead, the client system must be “attached” to the FLEXlm* license host server to check out a product license seat.<br /><br /><br />8) Run the following command to return a borrowed license: <br /><br />lmutil lmborrow -return -c server.lic featurename<br /><br />Example:<br />lmutil lmborrow -return -c server.lic CCompL<br />lmutil - Copyright (c) 1989-2009 by Macrovision Corporation. All rights reserved.<br /><br />On the FLEXlm server side, you will see the following message in the log file for the borrowed feature which was returned. This message is different from that of a normal check-in.<br /><br />14:40:17 (INTEL) REMOVING User1@Host1:/dev/pts/0 from CCompL by administrator request.<br /><br />14:40:17 (INTEL) IN: "CCompL" User1@Host1 (USER_REMOVED)<br />14:40:17 (INTEL) IN: "I3F97C15E" User1@Host1 (USER_REMOVED)<br /><br /><br />9) Run the following command to verify that the license was returned successfully back to the server: <br /><br />lmutil lmborrow -status<br /><br />Example:<br />./lmutil lmborrow -status<br />lmutil - Copyright (c) 1989-2009 by Macrovision Corporation. 
All rights reserved.<br /><br />Note: If you try to return a license which has not been borrowed, you will see a message like this:<br /><br />./lmutil lmborrow -return -c server.lic CCompL<br />lmutil - Copyright (c) 1989-2009 by Macrovision Corporation. All rights reserved.<br />Error: CCompL not currently borrowed.<br /><br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/intel-flexlm-license-borrowing-capability/</link>
      <pubDate>Fri, 06 Aug 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/intel-flexlm-license-borrowing-capability/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/intel-flexlm-license-borrowing-capability/</guid>
      <category>Intel® TBB</category>
      <category>Intel® C++ Compiler for Linux* Knowledge Base</category>
      <category>Intel® C++ Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® C++ Compiler for Windows* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Linux* Knowledge Base</category>
      <category>Intel® Fortran Compiler for Mac OS X* Knowledge Base</category>
      <category>Intel® Integrated Performance Primitives Knowledge Base</category>
      <category>Intel® License Manager for FLEXlm* Knowledge Base</category>
      <category>Intel® Math Kernel Library Knowledge Base</category>
      <category>Intel® Threading Building Blocks Knowledge Base</category>
      <category>Intel® Visual Fortran Compiler for Windows* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>How do you generate a core dump file to help diagnosing software crash?</title>
      <description><![CDATA[ <p><span class="sectionBodyText">Sometimes a user encounters a software crash that the developer cannot reproduce. If the user can generate a core dump file and attach it to a new issue submitted at </span><a href="https://premier.intel.com/" class="sectionBodyText">https://premier.intel.com</a><span class="sectionBodyText">, it will help the developer diagnose and fix the bug.</span></p>
<p class="sectionBodyText"><strong>Windows*</strong></p>
<ol type="1">
<li class="sectionBodyText">Right-click on "My Computer", then click "Properties"</li>
<li class="sectionBodyText">Click on the "Advanced" tab</li>
<li class="sectionBodyText">Under "Startup and Recovery", click "Settings"</li>
<li class="sectionBodyText">Under "Write debugging information", select "Small memory dump (64KB)"</li>
<li class="sectionBodyText">Keep the default directory "C:\Windows\Minidump" for "Small dump directory:"</li>
<li class="sectionBodyText">Click the "OK" button.</li>
<li class="sectionBodyText">Use regedit.exe to verify that the value named "CrashDumpEnabled" under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl is 0x3</li>
<li class="sectionBodyText">Restart the machine</li>
<li class="sectionBodyText">If a software crash happens, a file named <strong>Minixxxx-01.dmp</strong> will be generated</li>
</ol>
<p class="sectionBodyText">Note: developers usually use WinDbg to analyze the dump file and determine what caused the failure.</p>
<p class="sectionBodyText"> </p>
<p class="sectionBodyText"><strong>Linux*</strong></p>
<p>1.   Use "ulimit -c" to check whether core file generation is enabled. If the result is zero, it is disabled and no core file will be generated.</p>
<ol start="2" type="1">
<li class="sectionBodyText">Use "ulimit -c filesize" to enable core file generation (filesize is in KB), or use "ulimit -c unlimited". A filesize that is too small may truncate the core file, in which case gdb will report an error.</li>
<li class="sectionBodyText">If a software crash occurs, a core dump file (such as <strong>core.xxxx</strong>) will be generated</li>
<li class="sectionBodyText">To disable core dump generation again, use "ulimit -c 0"</li>
</ol>
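<p class="sectionBodyText">The limit that "ulimit -c" reports can also be read and raised from inside a process. The following is only a minimal sketch, assuming Python 3 on Linux; the function names are illustrative and are not part of any Intel tool. It uses the standard <code>resource</code> module, which expresses the limit in bytes rather than in the shell's block units.</p>

```python
import resource


def core_limit_status():
    """Describe the soft core-file limit, the value that "ulimit -c" reports."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft == 0:
        return "disabled"
    if soft == resource.RLIM_INFINITY:
        return "unlimited"
    return f"limited to {soft} bytes"


def enable_core_dumps():
    """Raise the soft limit to the hard limit; going higher needs privileges."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return core_limit_status()


print("before:", core_limit_status())
print("after: ", enable_core_dumps())
```

<p class="sectionBodyText">As with "ulimit" in a shell, the change applies only to the calling process and the processes it starts afterwards. If the hard limit is itself zero, core dumps stay disabled.</p>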
<p class="sectionBodyText">Note: the developer will run "gdb -c corefile problematic-executable" and then use the "bt" command in gdb to view the call stack and see what happened.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-do-you-generate-a-core-dump-file-to-help-diagnosing-software-crash/</link>
      <pubDate>Wed, 14 Jul 2010 06:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-do-you-generate-a-core-dump-file-to-help-diagnosing-software-crash/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-do-you-generate-a-core-dump-file-to-help-diagnosing-software-crash/</guid>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Get error message “Module is an instrumented module” when using Call graph data collector</title>
      <description><![CDATA[ <p><strong>Symptom</strong>:</p>
<p>When the user runs Call graph to statically instrument an EXE or some DLLs, an error message such as "<strong>Error in module ###.dll - Module is an instrumented module</strong>" appears.</p>
<p> </p>
<p><strong>Cause</strong>:</p>
<p>Call graph first saves copies of the user's original EXE/DLLs to its cache directory, then places the instrumented EXE/DLLs in their original directories. After data collection, it copies the original modules from the cache directory back to their original directories and keeps the instrumented modules in the cache directory. The next time Call graph runs, the user's modules are not instrumented again (assuming the source code has not changed and the modules have not been rebuilt); the instrumented copies are taken from the cache to save time.</p>
<p>This problem usually occurs when Call graph was aborted abnormally by the user: the original modules may be corrupted, or may not have been restored to their original directories. This interferes with the next Call graph run.<br /> </p>
<p><strong>Solution</strong>:</p>
<ol type="1">
<li>Empty the Call graph cache directories.</li>
<li>Then rebuild all affected modules.</li>
</ol>
<p><em>[DISCLAIMER: The information on this web site is intended for hardware system manufacturers and software developers. Intel does not warrant the accuracy, completeness or utility of any information on this site. Intel may make changes to the information or the site at any time without notice. Intel makes no commitment to update the information at this site. ALL INFORMATION PROVIDED ON THIS WEBSITE IS PROVIDED "as is" without any express, implied, or statutory warranty of any kind including but not limited to warranties of merchantability, non-infringement of intellectual property, or fitness for any particular purpose. Independent companies manufacture the third-party products that are mentioned on this site. Intel is not responsible for the quality or performance of third-party products and makes no representation or warranty regarding such products. The third-party supplier remains solely responsible for the design, manufacture, sale and functionality of its products. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.]</em></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/get-error-message-module-is-an-instrumented-module-when-using-call-graph-data-collector/</link>
      <pubDate>Tue, 09 Mar 2010 08:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/get-error-message-module-is-an-instrumented-module-when-using-call-graph-data-collector/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/get-error-message-module-is-an-instrumented-module-when-using-call-graph-data-collector/</guid>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Using ActivityController to selectively profile your big program</title>
      <description><![CDATA[ <p>A user may run sampling data collection on a large application but be interested in only a limited area of the code, so collecting data for the whole run is unnecessary. The VTune(TM) Performance Analyzer user interface provides a "Start with data collection paused" option, so the user can start a data collection session in paused mode and then use the "Pause/Resume Activity" button to resume data collection while the activity is running.</p>
<p>However, there is a better way: ActivityController provides more powerful functions, such as Stop/Cancel/Pause/Resume Activity, so the user doesn't need to click buttons in the VTune(TM) Analyzer graphical user interface.</p>
<p><img src="http://software.intel.com/file/25025" alt="ActivityController1.bmp" title="ActivityController1.bmp" /><br /><strong>Please note that if no activity is running, ActivityController will report "You are not currently running any Activities!"</strong></p>
<p>ActivityController works with any kind of data collection, provided the user has already started an activity.</p>
<p>Here are two examples of running VTune(TM) Analyzer, one from the graphical user interface and one from the command line. Please see their outputs after using ActivityController.</p>
<p>1) VTune(TM) Analyzer starts in paused mode</p>
<p><img src="http://software.intel.com/file/25026" alt="ActivityController2.bmp" title="ActivityController2.bmp" /><br /><br /><img src="http://software.intel.com/file/25027" alt="ActivityController3.bmp" title="ActivityController3.bmp" /><br /><br />2) VTL command runs as a long sampling session<br /><img src="http://software.intel.com/file/25028" alt="ActivityController4.bmp" title="ActivityController4.bmp" /></p>
 ]]></description>
      <link>http://software.intel.com/en-us/articles/using-activitycontroller-to-selectively-profile-your-big-program/</link>
      <pubDate>Wed, 03 Feb 2010 08:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-activitycontroller-to-selectively-profile-your-big-program/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-activitycontroller-to-selectively-profile-your-big-program/</guid>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Application Crashes When Attempting Call Graph Profiling</title>
      <description><![CDATA[ 
<p><b>Application Crashes or No Data Collected When Attempting Call Graph Profiling</b></p>
We have seen cases where a user's application will crash when the VTune analyzer launches the application for call graph profiling.  Another symptom of this problem is the <b>No results were generated</b> message from the VTune analyzer after the application under test completes.  Because call graph profiling relies on binary instrumentation, which injects code into the user application, instrumenting certain Microsoft* runtime libraries can sometimes cause the application to crash.  Symptoms vary, but you may see the "unhandled exception" message and the "Just-In-Time Debugger" prompt from Visual Studio*.  <br /><br /> <img src="http://software.intel.com/file/23442" title="JIT+debugger.JPG" alt="JIT+debugger.JPG" /><br /><br /> This is followed by the VTune analyzer "No results were generated for this run" message:<br /><br /> <img src="http://software.intel.com/file/23443" title="no_results.JPG" alt="no_results.JPG" /><br /><br /> Reducing the instrumentation level of the Microsoft runtime libraries will often enable you to work around this problem.  Follow these steps to apply the workaround:<br /><br /> 1)      Right-click on the Activity in the Tuning Browser and select <b>Modify Collectors...</b>:<br /><br /> <img src="http://software.intel.com/file/23444" title="modifycollector.JPG" alt="modifycollector.JPG" /><br /><br /> 2)      After the <b>Configure Call Graph</b> dialog appears, scroll the list of modules until you locate the msvcr<i>nn</i>[d].dll and msvcrt.dll modules.  The <i>nn</i> represents the version of Visual Studio that you are using.  In our example, Visual Studio 2005 was used and, therefore, <i>nn</i> = "80", as in 8.0.  If you are using Visual Studio 2008, the module would be msvcr90.dll or msvcr90d.dll, where the '<i>d</i>' represents the debug version of the DLL.<br /><br /> <img src="http://software.intel.com/file/23445" title="CollectorConfigDlg.JPG" alt="CollectorConfigDlg.JPG" /><br /><br /> Note that color-coding is used in the module list.  
An explanation of the colors follows, and is available in the online help, as well.<br /><br /> 
<ul>
<li><span >Gray</span>: Modules you added to the project from the Application/Module Profile Configuration dialog box, or from the Call Graph Configuration Wizard.</li>
<li><span >Blue</span>: Modules added during call graph instrumentation as dependencies of the selected modules.</li>
<li>White: Modules you added via Add button, or added during run-time instrumentation.</li>
</ul>
<br /> 3)      Click on the <b>Instrumentation Level</b> cell for the <b>msvcr80.dll</b>, for example, and select <b>Minimal</b>:<br /><br /> <img src="http://software.intel.com/file/23446" title="CollectorConfigDlg2.JPG" alt="CollectorConfigDlg2.JPG" /><br /><br /> 4)      Press <b>Apply</b> and then <b>Instrument Now</b>:<br /><br /> <img src="http://software.intel.com/file/23447" title="CollectorConfigDlg3b.JPG" alt="CollectorConfigDlg3b.JPG" /><br /><br /> 5)      Now <b>OK</b> out of all dialogs and re-run the activity.<br /><br /> In this example, simply reducing the instrumentation level of msvcr80.dll resolved the problem and results were successfully collected.  The effect of reducing the instrumentation level on the runtime libraries is that no information regarding calls into those modules will be available in the call graph data.  If this information is critical for your tuning activity, you can try to use the Custom Instrumentation (see <b>Functions...</b> button) to deselect functions until the call graph activity succeeds.  Using a binary search technique, you would narrow down which function or functions fail when they are instrumented and select to not instrument them.  Follow these steps to accomplish this task:<br /><br /> <br /><br /> 1)      Press <b>Functions...</b> for the module of concern.  You will see a dialog box such as this:<br /> <br /><img src="http://software.intel.com/file/23451" title="functions1.JPG" alt="functions1.JPG" /><br /><br /> 2)      Now, using the scroll bar move to the middle of the list and select that line:<br /> <br /><img src="http://software.intel.com/file/23452" title="functions2.JPG" alt="functions2.JPG" /><br /><br /> 3)      Press Shift-Ctrl-End to select from this line to the bottom of the list.  The horizontal scroll bar will move all the way to the right.  
Simply grab it and move it back to the left:<br /> <br /><img src="http://software.intel.com/file/23453" title="functions3.JPG" alt="functions3.JPG" /><br /><br /> 4)      Press the <b>Uncheck</b> button:<br /> <br /><img src="http://software.intel.com/file/23454" title="functions4.JPG" alt="functions4.JPG" /><br /><br /> 5)      Press <b>OK</b> and then <b>Apply</b> and <b>Instrument Now</b>:<br /> <br /><img src="http://software.intel.com/file/23455" title="functions5.JPG" alt="functions5.JPG" /><br /><br /> 6)      Notice that <b>Instrumentation Level</b> now shows <b>Custom</b> for the selected module.<br /><br /> 7)      <b>OK</b> out of all dialogs and re-run the Activity.<br /><br /> If the application fails again, modify the function selection so that the top half of the functions is selected instead, and re-run the activity.  By repeating this halving, you can narrow down which function(s) fail when they are instrumented.<br /><br /> An alternative to the VTune analyzer's Call Graph feature is the <a href="http://software.intel.com/en-us/intel-parallel-amplifier/">Intel® Parallel Amplifier</a>'s Hotspot analysis.  Parallel Amplifier uses a new technology to collect periodic samples, similar to "Clockticks" in the VTune analyzer, with complete call stacks, so that the call path to the hotspot is visible without instrumentation.  If you have a valid license for the VTune analyzer that includes support through April 1, 2010, you can download, install, and use Parallel Amplifier without any additional cost.<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/articles/application-crashes-when-attempting-call-graph-profiling/</link>
      <pubDate>Fri, 30 Oct 2009 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/application-crashes-when-attempting-call-graph-profiling/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/application-crashes-when-attempting-call-graph-profiling/</guid>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>Using Intel® VTune™ Performance Analyzer to Optimize Software for the Intel(R) Core(TM) i7 Processor Family</title>
      <description><![CDATA[ <p>This presentation describes a process for software optimization on the Intel(R) Core(TM) i7 Processor Family, including Core i7 processors and Xeon(R) 5500 series processors.<br />The presentation lists typical issues and how to diagnose them using Intel(R) VTune Performance Analyzer.<br />Users of Intel(R) Performance Tuning Utility may also find this presentation useful.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-for-the-intelr-coretm-i7-processor-family/</link>
      <pubDate>Wed, 30 Sep 2009 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-for-the-intelr-coretm-i7-processor-family/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-for-the-intelr-coretm-i7-processor-family/</guid>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
    <item>
      <title>How Do I Measure Memory Bandwidth on an Intel® Core™ i7 or Xeon® 5500 Series Platform Using Intel® VTune™ Performance Analyzer?</title>
      <description><![CDATA[ <p><strong>Updated!  2/3/2011<br /><br /></strong>The new Intel® Core™ i7 and Xeon® 5500 series processors have a different architecture than previous processors, particularly when it comes to the uncore. The “uncore” is the part of the processor that is external to all the individual cores- for example, on the Core™ i7, there are 4 cores, and these share an L3 cache and a memory interface. The L3 and memory interface are considered uncore. VTune™ Performance Analyzer does not support the sampling of events that are triggered in the uncore of the processor.<br /><br />However, due to popular demand, we have created and documented a way for VTune analyzer users who have Core i7 or Xeon 5500 series processor-based platforms to measure memory bandwidth. This is not measurable by default since the events needed are in the uncore. Here is the process to enable bandwidth measurement using the program sep. Sep is a utility that provides the sampling functionality used by VTune analyzer and Intel® Performance Tuning Utility (PTU). <br /><br />Note that with this method the bandwidth events are <i>counted</i> using <i>time-based sampling</i>, not the <i>event-based sampling</i> that VTune analyzer normally uses. This means that you can determine a bandwidth for your whole system over a designated time range, but you won’t be able to see how much of the bandwidth used came from various functions/processes/modules. You can only see the total bandwidth for the system. Please adjust your application testing accordingly by running only the target application while measuring bandwidth.<br /><br />1. Download the <a href="http://software.intel.com/en-us/articles/download-intel-performance-tuning-utility-32-update-1/">Intel® Performance Tuning Utility 3.2 update 1</a>. The version of sep needed for this method is only available in this release of PTU. 
If you have a 32-bit operating system, get the IA-32 version, and if you have a 64-bit operating system, get the Intel® 64 version. PTU is available for both Windows* and Linux*.<br />2. Uncompress the package and follow the instructions in INSTALL.txt to install PTU. Make sure to install the sampling driver!<br />3. Download the appropriate Uncore Measurement package and uncompress it into a directory of your choice. To download the package, go to <a href="http://premier.intel.com">http://premier.intel.com</a>, log in, and select File Downloads from the menu on the left. Select either VTune™ Performance Analyzer for Linux* or VTune™ Performance Analyzer for Windows* and click Display File List. The package will be named lin_measurebw.tar.gz for Linux* or win_measurebw.zip for Windows*.<br />4. Run the bandwidth measurement script (<b>uncore.bat</b> for Windows*, <b>uncore.sh</b> for Linux*) from the uncore directory. This script sets up the environment needed to measure bandwidth, and then uses sep to measure it. It is important that you measure bandwidth using this script to avoid unstable configuration changes to your VTune™ analyzer or PTU installations! If you run this script from a command prompt (instead of double-clicking), close the command window afterwards.<br />5. Once the bandwidth measurement script has finished executing, open the <b>bandwidth.txt</b> file in the same directory. This file contains the results of bandwidth measurement, and will be overwritten each time you run the bandwidth measurement script. See the <i>Interpreting Bandwidth</i> section below to analyze the data.<br />6. Now that PTU is installed, you may use it for your sampling needs, or you can use VTune analyzer. PTU will be the current active sampling technology on your system after executing these instructions. 
You will need to follow <a href="http://software.intel.com/en-us/articles/how-to-configure-vtune-and-ptu-on-the-same-system/">these instructions</a> for switching between using PTU and VTune analyzer for sampling. <br /><br /><b>Interpreting Bandwidth<br /></b><br />This method measures bandwidth from each processor’s uncore memory controller to memory. It will include memory reads, memory writes, I/O, and writebacks from L3 to memory. It does not include traffic from cache-to-cache transfers between sockets. <br /><br />Using this method, your <b>bandwidth.txt</b> results file will contain results in this format:<br /><br /><span class="sectionBodyText"><i>Version Info: Sampling Enabling Product version: 2.9.devbuild (private) built on Mar 18 2009 02:53:25 P:Intel(R) Xeon(R) Processor 5500 series M:10 S:4<br /><br />UNC_IMC_WRITES.FULL.ANY 14,650,441,461 50,459 50,458 50,481 50,458 15,737 15,741 15,741 15,740<br />UNC_IMC_NORMAL_READS.ANY 14,650,441,476 196,626 196,618 196,679 196,515 36,071 36,071 36,072 36,072<br />----------<br /><br />5.00s real 0.468s user 39.531s system 38.796s idle<br /></i><br /></span>Bandwidth from reads and writes is measured separately, and each processor socket is measured separately. In the above output file, the first line of values measures writes to memory and the second line of values measures reads to memory. Each line of output will show a series of event values separated by spaces. The first value after each event is a timestamp (14,650,441,461 &amp; 14,650,441,476 in this example). The following values will be the counts of 64-byte transfers on the memory bus for each core. <br /><br />It is important to realize that for current Core i7 or Xeon 5500 series processors, there are 4 cores on each socket, all sharing the same uncore. So, you will see 4 values for each socket, but really these are all measuring the same uncore bandwidth. 
For example, for UNC_IMC_WRITES.FULL.ANY in the example above, the first 4 values after the timestamp are all close to 50,460. <b>They are really all measuring the same bandwidth from socket 0 to its memory, and so should be averaged, not summed!</b> The output above was measured on a dual-socket Xeon 5500 series platform with Intel® Hyper-Threading Technology disabled. There are 8 values for each event – 4 for one socket, and 4 for the other. If Hyper-Threading Technology had been enabled, there would be 8 values per socket, and those 8 should be averaged to get one bandwidth number for each socket. <br /><br />The number of values you see will correspond to the number of hardware threads on your system. The order in which the values appear may be different on Windows* and on Linux*. For Windows*, usually all the values for one processor socket will appear together. For example, on a dual-socket Windows* platform with Intel® Hyper-Threading Technology enabled, the values may be in the order &lt;Socket 0, Core 0, Hyperthread 0&gt;, &lt;S0, C1, H0&gt;, &lt;S0, C2, H0&gt;, &lt;S0, C3, H0&gt;, &lt;S0, C0, H1&gt;, &lt;S0, C1, H1&gt;, &lt;S0, C2, H1&gt;, &lt;S0, C3, H1&gt;, &lt;S1, C0, H0&gt;, etc, giving you 8 values total for each physical socket. On Linux* the way in which the threads and cores are enumerated varies according to the distribution. You can refer to the /proc/cpuinfo file for your platform to see the way the physical sockets are mapped – for each processor in /proc/cpuinfo, look at the “physical id”. The physical id indicates the socket number. This can help you identify how the values in bandwidth.txt correspond to physical sockets (the values in bandwidth.txt will be in the same order as the processors in the /proc/cpuinfo file). In all cases – just remember that for a particular bandwidth event, you should be seeing roughly the same quantities from cores and hardware threads on the same socket. 
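<br /><br />As a rough illustration of this per-socket averaging, here is a Python sketch that applies it to the sample output shown earlier and then uses the article's formula (sum of the per-socket averages of writes and reads, times 64 bytes, divided by the seconds measured). The function names are hypothetical, the number of sockets is supplied by hand, and the sketch assumes that the values for each socket appear next to each other, which may not hold on every Linux* distribution.<br /><br />

```python
def socket_averages(values, sockets):
    """Average the per-thread counts of one event for each socket.

    Assumes the counts belonging to one socket are contiguous in `values`.
    """
    per_socket = len(values) // sockets
    return [
        sum(values[i * per_socket:(i + 1) * per_socket]) / per_socket
        for i in range(sockets)
    ]


def bandwidth_gb_per_s(writes, reads, sockets, seconds):
    """Sum the per-socket averages, convert 64-byte transfers to GB, divide by time."""
    transfers = sum(socket_averages(writes, sockets)) + sum(socket_averages(reads, sockets))
    return transfers * 64 * 1.0e-9 / seconds


# Counts copied from the sample bandwidth.txt above (timestamps omitted).
writes = [50459, 50458, 50481, 50458, 15737, 15741, 15741, 15740]
reads = [196626, 196618, 196679, 196515, 36071, 36071, 36072, 36072]

bw = bandwidth_gb_per_s(writes, reads, sockets=2, seconds=5.0)
print(f"{bw:.4f} GB/s")
```

For the sample data this gives a value that rounds to .004 GB/s, the figure quoted for the idle system.<br /><br />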
If you have a dual-socket platform with Hyper-Threading Technology enabled, then half of the values for each bandwidth event will be for each socket. Approximately half should be around the same quantity, and the other half should be a different quantity. On a single-socket platform simply average all the values.<br /><br />Finally, near the bottom of each result file you will see the time spent sampling – 5 seconds in the example above.<br /><br />To compute total system bandwidth, use this formula:<br /><br /><i>Bandwidth (GB/s) = ((average of UNC_IMC_WRITES.FULL.ANY for each socket + average of UNC_IMC_NORMAL_READS.ANY for each socket) * 64 * 1.0e-9) / seconds measured<br /></i><br />The per-socket averages are summed across all sockets. For the example above, bandwidth is ((50,464 (writes on socket 1) + 15,740 (writes on socket 2) + 196,610 (reads on socket 1) + 36,072 (reads on socket 2)) * 64 * 1.0e-9) / 5 = .004 GB/s. This bandwidth was measured on an idle system.<br /><br /><b>Final Notes<br /></b><br />This method can be used to measure total system bandwidth on Core i7 and Xeon 5500 series processor-based platforms. It will not work with any other processors. We also do not recommend using sep for any other sampling – VTune analyzer and PTU have much friendlier user interfaces for collecting and interpreting data. At this time, these events (needed for bandwidth measurement) are the only uncore events we are making available.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/how-do-i-measure-memory-bandwidth-on-an-intel-core-i7-or-xeon-5500-series-platform-using-intel-vtune-performance-analyzer/</link>
      <pubDate>Tue, 22 Sep 2009 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/how-do-i-measure-memory-bandwidth-on-an-intel-core-i7-or-xeon-5500-series-platform-using-intel-vtune-performance-analyzer/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/how-do-i-measure-memory-bandwidth-on-an-intel-core-i7-or-xeon-5500-series-platform-using-intel-vtune-performance-analyzer/</guid>
      <category>Intel® VTune™ Performance Analyzer for Linux* Knowledge Base</category>
      <category>Intel® VTune™ Performance Analyzer for Windows* Knowledge Base</category>
    </item>
  </channel></rss>
