<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Tue, 24 Nov 2009 18:01:07 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/forums/intel-threading-building-blocks/feed" rel="self" type="application/rss+xml" />
    <title>Intel Software Network - <![CDATA[ Intel® Threading Building Blocks ]]> feed</title>
    <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Ideas for improving performance</title>
      <description><![CDATA[ The software architecture looks like this:<br /><br />TASK1 (Gbps packet input) --- TASK2 (pre-processing) --- TASK3A, TASK3B, TASK3C, ... (all of them processing engines) --- TASK4 (output)<br /><br />My questions:<br /><br />1. Packets arrive continuously, so I don't think it is necessary to keep the cache hot. Is that right? (I also don't know how to do that: all packets are allocated in TASK1 and each task runs on a different CPU, unlike the tbb::pipeline model.)<br /><br />2. Each input packet is copied to a buffer (allocated from a memory pool with malloc). Is it necessary to replace malloc with cache_aligned_allocator&lt;&gt;? (Again, the contents of the buffer change quickly and frequently.) ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/70093/</link>
      <pubDate>Mon, 23 Nov 2009 21:56:09 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/70093/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Tuning Advice given by Intel VTune for tbb.dll and tbbmalloc.dll</title>
      <description><![CDATA[ <br />I sampled my application with the Intel VTune profiler; the results are below. <br /><br />My application consists of an application.dll and a Test.exe, both of which use tbb.dll and tbbmalloc.dll extensively. I use the task interface, concurrent_queue, and concurrent_hash_map from tbb.dll and cache_aligned_allocator from tbbmalloc.dll. I am using the tbb22_20090809oss version.<br /><br />Here are the results the Intel Tuning Assistant gave:<br /><i><b>Process/Module Summary (Process: test.exe, Module: msvcr80.dll, RVA: 0x23ed-0x504a7) </b><br />CPU_CLK_UNHALTED.CORE: 18,501,600,000 <br /><br />Time Statistics <br /> CPU_CLK_UNHALTED.CORE: 18,501,600,000 events<br /> Processor Time: 7.73 sec <br /> Accounts for 44.48% (workload) <br /><br />
<b>Process/Module Summary (Process: test.exe, Module: test.exe, RVA: 0x10b0-0x7f06) </b><br />CPU_CLK_UNHALTED.CORE: 18,400,800,000 <br /><br />Time Statistics <br /> CPU_CLK_UNHALTED.CORE: 18,400,800,000 events<br /> Processor Time: 7.69 sec<br /> Accounts for 44.24% (workload) <br /><br />
<b>Process/Module Summary (Process: test.exe, Module: application.dll, RVA: 0x1017-0x488a) </b><br />CPU_CLK_UNHALTED.CORE: 225,600,000 <br /><br />Time Statistics <br /> CPU_CLK_UNHALTED.CORE: 225,600,000 events<br /> Processor Time: 0.094 sec <br /> Accounts for 0.54% (workload) <br /><br />Other Possible Problems <br /> CPI (Cycles Per retired Instruction) is poor: 1.92 clockticks per instructions retired<br /><br />
<b>Process/Module Summary (Process: test.exe, Module: tbb.dll, RVA: 0x91d0-0x1de61) </b><br />CPU_CLK_UNHALTED.CORE: 1,156,800,000 <br /><br />Time Statistics <br /> CPU_CLK_UNHALTED.CORE: 1,156,800,000 events<br /> Processor Time: 0.48 sec <br /> Accounts for 2.78% (workload) <br /><br />Other Possible Problems <br /> CPI (Cycles Per retired Instruction) is poor: 2.8 clockticks per instructions retired<br /><br />
<b>Process/Module Summary (Process: test.exe, Module: tbbmalloc.dll, RVA: 0x1904-0x4139) </b><br />CPU_CLK_UNHALTED.CORE: 206,400,000 <br /><br />Time Statistics <br /> CPU_CLK_UNHALTED.CORE: 206,400,000 events<br /> Processor Time: 0.086 sec <br /> Accounts for 0.5% (workload) <br /><br />Other Possible Problems<br />Branch mispredictions impact performance: 15.29 % cycles spent in branch misprediction recovery<br /><br /> Advice: <br /> Use the precise events to focus on instructions of interest. <br /> Eliminate branches <br /> Use constants rather than variables or parameters <br /> Improve branch predictability. <br /> Compile with the Interprocedural Optimizations (IPO) switch <br /> Compile with the Profile-guided Basic-block Optimization. <br /> Consider assembly-level branch-prediction tuning. <br /> Measure events required to compute advanced event ratios. <br /><br /> CPI (Cycles Per retired Instruction) is poor: 3.07 clockticks per instructions retired<br /> Advice: <br /> Measure events required to compute advanced event ratios. <br /><br /> Many L2 cache demand misses: 0.0081 L2 cache demand misses per instruction retired<br /> Advice: <br /> Use the precise events to focus on instructions of interest. <br /> Improve data locality, if possible. <br /> Consume data in chunks that fit in the L2 cache. <br /> Better exploit the hardware prefetchers. <br /> Use software prefetching. <br /><br /> Many L2 data cache misses: 0.022 L2 cache misses per instruction retired<br /> Advice: <br /> Use the precise events to focus on instructions of interest. <br /> Improve data locality, if possible. <br /> Consume data in chunks that fit in the L2 cache. <br /> Better exploit the hardware prefetchers. <br /><br /> Many TLB misses: 7.31 % cycles spent on TLB misses<br /> Advice: Measure events required to understand the type of TLB misses. 
<br /></i><br />As the results show, the Intel VTune profiler has little advice for my application.dll and Test.exe, but it gives a lot of advice for tbbmalloc.dll and tbb.dll. Does this have anything to do with how I use TBB?<br /><br />What is CPI (cycles per retired instruction)? What value is considered not poor?<br /><br />Why does it report such a large branch misprediction impact in tbbmalloc.dll? Are the values for L2 cache demand misses, L2 data cache misses, and TLB misses acceptable? Also, are L2 cache demand misses and L2 data cache misses related to each other?<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/70082/</link>
      <pubDate>Mon, 23 Nov 2009 05:44:32 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/70082/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Race Condition in ParallelMerge?</title>
      <description><![CDATA[ <p>Attached is a parallel merge sort that is substantially faster than TBB's parallel_sort; it uses parallel_reduce.  The speedup comes from using Intel's IPP to do the sort.  ParallelMerge (provided with TBB) merges the sorted data from each thread using parallel_reduce.  <br /><br />The sort always works with range size = height / (# of processors), which is probably the most efficient setting for the tile size.  However, when testing smaller range sizes, the stack is corrupted in ParallelMerge.  Parallel Studio reports a race condition at the location of the crash.  (Parallel Studio has been patched with Update 2, the latest.)<br /><br />To reproduce the problem, set TileSize = 1 in the attached code.  It fails on my 8-way Core i7, so TileSize = Height / 8 will work fine.  The problem does not occur if TBB is initialized with only 1 thread (single-threaded mode).  It can also be circumvented by making ParallelMerge's Is_Divisible method always return false (but this leaves a lot of the speedup on the table).<br /><br />If anyone could help resolve this problem, it would probably benefit TBB or ParallelMerge.  I don't believe the problem is in my code, but I am willing to test any suggestions. </p> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/70047/</link>
      <pubDate>Fri, 20 Nov 2009 10:19:17 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/70047/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>parallel_reduce problem</title>
      <description><![CDATA[ I know I must be doing something really dumb, but I have not been able to figure out what.  I'm basically just doing the simple reduction example in the TBB book, but using the IPP sum routine (sum of all pixels in an Ipp16u plane).  Sounds simple enough, but it looks like join() is not being called enough times.  Do I ever need to call join() explicitly, or does the system always call it?<br /><br />
<pre name="code" class="cpp">class Sum
{
public:  // Methods
    Sum(Img *img) : m_img(img), m_sum(0) {}
    void operator () (const tbb::blocked_range&lt;int&gt; &amp;range)
    {
        IppiSize sz;
        Ipp16u *pSrc = (Ipp16u*)m_img-&gt;getPixel(0, range.begin(), 0);
        I32 step = m_img-&gt;getStep(0);
        sz.width = m_img-&gt;getWidth(0);
        sz.height = range.size();

        if (ippStsNoErr != ippiSum_16u_C1R(pSrc, step, sz, &amp;m_sum))
            throw std::runtime_error("ippiSum_16u_C1R failed!\n");
        printf("Sum for %d %d = %0.1f\n", range.begin(), range.end(), m_sum);
    }
    Sum(Sum &amp;x, tbb::split) : m_img(x.m_img), m_sum(0) {}
    void join(const Sum &amp;y) { printf("%0.1f = %0.1f + %0.1f\n", m_sum + y.m_sum, m_sum, y.m_sum); m_sum += y.m_sum; }
    F64 getSum() { return m_sum; }

private: // Attributes
    Img *m_img;
    F64 m_sum;
};
</pre>
<br /><br />If I call operator() directly it works fine.  As soon as I put it inside a parallel_reduce() I get the wrong (smaller) answer.  Looking at the diagnostic prints in my code, it looks like all sub-regions are computed correctly, but not all of them end up in join() calls.<br /><br />Peter<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69996/</link>
      <pubDate>Thu, 19 Nov 2009 05:21:00 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69996/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Help with writing a parallel application for solving a three dimensional parabolic heat equation</title>
      <description><![CDATA[ I am working on my final-year research project. I am supposed to write a parallel system for solving parabolic partial differential equations; the three-dimensional heat equation is my main focus. I need to do this using OpenMP, and I want to do everything in Intel Parallel Studio. Can anyone give me a serial version of this problem, at least the necessary loops, and advise on where to parallelize?
<div>Thanks (Mathew)</div> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69988/</link>
      <pubDate>Wed, 18 Nov 2009 19:41:03 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69988/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>performance of tbb for data parallel applications</title>
      <description><![CDATA[ Hi everyone,<br />I am new to TBB and am trying to evaluate it on an Intel 64 architecture. I have a few questions.<br />1. Does TBB achieve good performance (linear speedup) only for divide-and-conquer problems, or is it also possible for data-parallel applications? I ask because I got poor performance for data-parallel applications, and I used only parallel_for() to parallelize them.<br />2. I also observed that the single-node execution times of the TBB applications are much lower than the others, yet the speedup achieved on 8 nodes is not linear.<br />3. Is there an implicit synchronization barrier for the threads in parallel_for() tasks, or is it the programmer's responsibility to synchronize? ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69986/</link>
      <pubDate>Wed, 18 Nov 2009 18:03:05 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69986/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>TBB on linux segfaulting</title>
      <description><![CDATA[ I have written a cross-platform program that uses either TBB or OpenMP (not at the same time!). The code is cross-platform in every respect, but I initially developed it in Visual Studio.<br /><br />The program is a framework for developing n-body models.<br /><br />On Windows it works perfectly: it compiles with no errors or warnings and runs as expected. On Linux (Ubuntu 64-bit) it compiles cleanly, with just one warning that wouldn't affect the TBB code, but when I run it, it segfaults as soon as it tries to begin the first parallel_for.<br />I tried using OpenMP, and the code ran perfectly. Note that the OpenMP and TBB code do not interfere with each other; using them at the same time isn't possible.<br /><br />Since the parallel code is in my fourth-order Runge-Kutta integrator, I commented out the first parallel_for block (replacing it with equivalent serial code) to see whether that block had a bug in it, but the program then just segfaults on the next parallel_for it encounters. I deduce from this, and from the fact that the same code runs perfectly on Windows, that it is the act of using TBB that causes the segfault, not the code itself.<br /><br />On Linux I compiled it using GCC, with the project managed by Code::Blocks. I've linked tbb and initialised TBB with a call to<br />tbb::task_scheduler_init init;<br /><br />Have I missed something about using TBB on Linux?<br /><br />I haven't posted lots of code because I think the issue is one of setting up TBB rather than a problem in my code. I can post it if needed, but the project is rather large, so it would be a fair bit to go through.<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69972/</link>
      <pubDate>Wed, 18 Nov 2009 09:10:54 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69972/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>TBB on Visual Studio 2008</title>
      <description><![CDATA[ I get the following error message when trying to build the examples shipped with TBB:
<div><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">"1&gt;Copying DLLs and PDBs<br />1&gt;$C:\Program Files\tbb22\ia32\vc9\bin\tbb.dll <br />1&gt;The filename, directory name, or volume label syntax is incorrect.",</span></div>
<div><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">C:\Program Files\tbb22 is my installation directory,<br /></span></div>
<div><span style="font-family: Verdana, Arial, Helvetica, sans-serif;">I have added the path C:\Program Files\tbb22\ia32\vc9\lib to the library directories, and tbb.lib to the linker's additional properties.</span></div>
<div>Could someone please point out the problem or give me a link on how to install TBB for VS 2008? I have followed the guide in the release notes, or as instructed at this link:</div>
<div>http://software.intel.com/en-us/blogs/2008/07/07/get-tbb-going-by-a-single-click/</div>
<div>but, as other users have pointed out, there is no "Use Intel TBB" option in the project submenu.</div>
<div><br /></div>
<div>thanks</div> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69945/</link>
      <pubDate>Tue, 17 Nov 2009 08:47:26 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69945/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>TBB and Arrandale</title>
      <description><![CDATA[ Intel has announced the roadmap for the Arrandale chip, which combines a CPU and a GPU. Will TBB enable us to utilize this new chip in the future? ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69910/</link>
      <pubDate>Mon, 16 Nov 2009 03:03:13 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69910/</guid>
      <category>ISN General</category>
    </item>
    <item>
      <title>Pipeline scheduling tasks</title>
      <description><![CDATA[ Hi,<br /><br />I have a problem with pipeline.<br />Say I have 4 threads, 4 tasks, and 3 pipeline stages:<br /> emit (serial), worker (not serial), and join (serial).<br /><br />When I run the code, I get the following output:<br /><br />processing with 4 files (4 threads)<br />emit __cdb_tbench-0<br />emit __cdb_tbench-1<br />worker: __cdb_tbench-0<br />emit __cdb_tbench-2<br />emit __cdb_tbench-3<br />worker: __cdb_tbench-0 done<br />joining : __cdb_tbench-0<br />worker: __cdb_tbench-1<br />joining : __cdb_tbench-0 done: 10000000<br />worker: __cdb_tbench-1 done<br />joining : __cdb_tbench-1<br />worker: __cdb_tbench-2<br />joining : __cdb_tbench-1 done: 10000000<br />worker: __cdb_tbench-2 done<br />joining : __cdb_tbench-2<br />worker: __cdb_tbench-3<br />joining : __cdb_tbench-2 done: 10000000<br />worker: __cdb_tbench-3 done<br />joining : __cdb_tbench-3<br />joining : __cdb_tbench-3 done: 10000000<br />Done.<br /><br />The worker stage takes nearly 1 sec; join is faster.<br /><br />It looks like the stages are interleaved, but worker() is never called in parallel.<br />Can you advise why?<br /><br />The code looks like:<br /><br /> template&lt;class T, bool serial, void* (T::*callback)(void*)&gt;<br /> struct Stage : public tbb::filter {<br /> T&amp; ticket;<br /> Stage(T&amp; ticketIn) : tbb::filter(serial), ticket(ticketIn) {}<br /> virtual ~Stage() throw() {}<br /> virtual void* operator()(void* item)<br /> { <br /> return ((&amp;ticket)-&gt;*callback)(item);<br /> }<br />....<br /> tbb::pipeline pl;<br /> internal::Stage&lt;Ticket, true, &amp;Ticket::emit&gt; s1(ticket);<br /> internal::Stage&lt;Ticket, false, &amp;Ticket::work&gt; s2(ticket);<br /> internal::Stage&lt;Ticket, true, &amp;Ticket::join&gt; s3(ticket);<br /><br /> pl.add_filter(s1);<br /> pl.add_filter(s2);<br /> pl.add_filter(s3);<br /> pl.run(data.size());<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69904/</link>
      <pubDate>Mon, 16 Nov 2009 00:01:20 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/69904/</guid>
      <category>ISN General</category>
    </item>
  </channel></rss>