<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Dmitriy Vyukov</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/dmitriy-vyukov/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>1024cores: All about lock-free, concurrency, multicore and parallelism</title>
		<link>http://software.intel.com/en-us/blogs/2011/01/05/1024cores-all-about-lock-free-concurrency-multicore-and-parallelism/</link>
		<comments>http://software.intel.com/en-us/blogs/2011/01/05/1024cores-all-about-lock-free-concurrency-multicore-and-parallelism/#comments</comments>
		<pubDate>Wed, 05 Jan 2011 22:06:58 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Academic]]></category>
		<category><![CDATA[Game Development]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Performance and Optimization]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[lock-free]]></category>
		<category><![CDATA[multi-core]]></category>
		<category><![CDATA[multithreading]]></category>
		<category><![CDATA[Parallel Computing]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2011/01/05/1024cores-all-about-lock-free-concurrency-multicore-and-parallelism/</guid>
		<description><![CDATA[http://www.1024cores.net - all about lock-free algorithms, concurrency, multithreading, multicore, parallel computations and related topics]]></description>
			<content:encoded><![CDATA[<p>It finally happened! I've launched a new web-site devoted to lock-free, wait-free and just scalable synchronization algorithms, multicore, concurrency, parallel computations, scalability-oriented architecture, patterns and anti-patterns, threading technologies and libraries and related topics.</p>
<p>Welcome to <a href="http://www.1024cores.net">1024cores</a>!</p>
<p>Currently there are some materials on <a href="http://www.1024cores.net/home/lock-free-algorithms/introduction">fundamentals of synchronization algorithms</a>, articles on some practical problems (<a href="http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem">reader-writer problem</a>, <a href="http://www.1024cores.net/home/lock-free-algorithms/queues">producer-consumer queues</a>, <a href="http://www.1024cores.net/home/lock-free-algorithms/lazy-concurrent-initialization">lazy concurrent initialization</a>), some materials on <a href="http://www.1024cores.net/home/scalable-architecture/introduction">scalable architecture</a>, a collection of my <a href="http://www.1024cores.net/home/parallel-computing">write-ups for Intel Threading Challenge 2009/2010</a> and some other, less developed sections. However, the site is basically 10 days old and much more is coming, so I encourage you to subscribe to the <a href="http://feeds.feedburner.com/1024cores">RSS</a> and/or follow the <a href="http://blog.1024cores.net/">blog</a>.</p>
<p>Stay tuned and keep threading!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2011/01/05/1024cores-all-about-lock-free-concurrency-multicore-and-parallelism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Don&#039;t Hinder Concurrency!</title>
		<link>http://software.intel.com/en-us/blogs/2010/03/05/dont-hinder-concurrency/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/03/05/dont-hinder-concurrency/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 21:29:40 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[synchronization]]></category>
		<category><![CDATA[thread-local storage]]></category>
		<category><![CDATA[threading]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/03/05/dont-hinder-concurrency/</guid>
		<description><![CDATA[I've just read the article "Use Thread-local Storage to Reduce Synchronization" (from the "Intel Guide for Developing Multithreaded Applications") and here is my take on that.]]></description>
			<content:encoded><![CDATA[
<p>I've just read the article <span style="color: #000080;"><span style="text-decoration: underline;"><a href="http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/" target="_blank">Use Thread-local Storage to Reduce Synchronization</a></span></span> from the <span style="color: #000080;"><span style="text-decoration: underline;"><a href="http://software.intel.com/en-us/articles/intel-guide-for-developing-multithreaded-applications/" target="_blank">Intel Guide for Developing Multithreaded Applications</a></span></span> and here is my take on it.</p>
<p>Indeed. If threads are a tool <strong>for</strong> <strong>concurrency</strong>, synchronization is a tool for <strong>suppressing</strong> <strong>concurrency</strong>. Any form of synchronization (whether locks, semaphores or atomic operations) hinders concurrency. It requires a distributed system to reach strong global consensus, and consensus in a distributed system can't be cheap. Period.</p>
<p>So the best design of a concurrent system tries to reduce synchronization to a bare minimum (total elimination of synchronization is impossible, otherwise the system will break up into several independent systems). There are various techniques for reducing the need for synchronization: partitioning, privatization, replication, amortization.</p>
<p>The idea of <strong>partitioning</strong> is to split the whole data set into several mostly independent partitions and bind a worker thread to each of them. There must also be a partitioning function that maps an external key to the partition where the data resides. All requests are then routed directly to the thread that is bound to the required partition. As a result, a worker thread works with its partition's data without any synchronization.</p>
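<p>As a rough illustration of the routing step, here is a minimal sketch of a partitioning function. The names partition_of and route, the use of string keys and the use of std::hash are all assumptions made for this example; a real system would deliver each request to a queue drained by the owning worker thread rather than to a plain vector.</p>

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: maps an external key to the index of the partition
// (and thus of the worker thread) that owns the data for this key.
inline std::size_t partition_of(std::string const& key, std::size_t partition_count)
{
    return std::hash<std::string>()(key) % partition_count;
}

// Hypothetical router: appends each key to the inbox of its owning partition.
// In a real system every inbox would be a queue drained by exactly one worker
// thread, so the partition's data is only ever touched by that thread.
inline std::vector<std::vector<std::string> > route(
    std::vector<std::string> const& keys, std::size_t partition_count)
{
    std::vector<std::vector<std::string> > inboxes(partition_count);
    for (std::size_t i = 0; i != keys.size(); i += 1)
    {
        std::size_t const p = partition_of(keys[i], partition_count);
        inboxes[p].push_back(keys[i]);
    }
    return inboxes;
}
```

<p>The important property is that the same key always maps to the same partition, so all requests for a given piece of data end up at the same thread.</p>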
<p><strong>Privatization</strong> is a special case of partitioning with a single partition, i.e. the whole data set is handed over to a single thread which can work with it without any synchronization. The downside of this technique is that the single thread can become a bottleneck: other threads concurrently running on other cores can overwhelm it with requests.</p>
<p>The idea of <strong>replication</strong> is to have several independent replicas of a data set and to propagate updates between replicas explicitly via messages. Data in replicas can be temporarily inconsistent; however, a lot of systems can tolerate some inconsistency.</p>
<p><strong>Amortization</strong> is usually based on some form of <strong>thread-local data</strong> (placed either on a thread's stack or in compiler/OS-provided storage). The idea is simple: we collect some updates in thread-local storage and then apply them later in batches. That's what we saw in the article. The main advantage of amortization based on thread-local storage is its simplicity. Indeed, you do not need to reorganize your data, route requests to particular threads based on data placement, cope with inconsistencies, etc. So, if it's applicable, it's the first thing you should consider.</p>
<p>Well, there is too much I could say about these things... way too much to fit into this blog. But what I want to communicate is that you should consider these techniques a starting point rather than a final destination; they are primitive tools for reducing synchronization in your concurrency toolbox. Choose the best tool for a particular situation, combine them, adapt them.</p>
<p lang="en-US">Now a few comments directly on the article.</p>
<p lang="en-US">
<p style="padding-left: 30px;"><span style="color: #333333;"><span style="font-family: Tahoma;"><span style="font-size: x-small;"><em>This solution trades synchronization per event for synchronization per thread. Performance will improve if the number of events is much larger than the number of threads.</em></span></span></span><em> </em></p>
<p lang="en-US">
<p lang="en-US">I would not agree here; there is no such tradeoff involved. If a thread has not collected any events in its thread-local storage, it simply does not access the centralized data at all. The additional overhead is a single 'if' statement per thread, which is negligible in the context of inter-thread work distribution. This technique does not increase the total number of events.</p>
<p lang="en-US">
<p style="padding-left: 30px;"><span style="font-family: Tahoma;"><span style="font-size: x-small;"><em>An additional advantage of using thread-local storage during time-</em></span></span><span style="color: #333333;"><span style="font-family: Tahoma;"><span style="font-size: x-small;"><em>critical portions of the program is that the data may stay live in a processor’s cache longer than shared data, if the processors do not share a data cache. When the same address exists in the data cache of several processors and is written by one of them, it must be invalidated in the caches of all other processors, causing it to be re-fetched from memory when the other processors access it. But thread-local data will never be written by any other processors than the one it is local to and will therefore be more likely to remain in the cache of its processor.</em></span></span></span></p>
<p lang="en-US">In general this is very true. Indeed, thread-local data reduces the amount of inter-core communication, thus reducing the amount of costly cache-coherence traffic.</p>
<p lang="en-US">But this has little to do with shared caches: even if cores share an L3 cache, data will still be transferred between their L1 caches (L1 caches are not shared between cores on most current processors). So I would recommend just ignoring the part on shared caches. Prefer thread-local data and you are on the safe side with any current or future architecture.</p>
<p>There is another important consideration with regard to shared L2/L3 caches (which are featured on many current processors), and this consideration is <strong>against</strong> thread-local data. Consider the following situation. A moderate-size shared object is frequently accessed for reading, but infrequently for writing. If it is split into thread-local parts (which usually implies an increase in total size), it may no longer fit into the shared L2/L3 cache, so threads will constantly evict each other's data from the cache. However, if the object is implemented as a single centralized entity, it fits into the cache, so threads will work with cached data without evictions.</p>
<p lang="en-US">So, the tradeoff frequently involved is reduction of synchronization versus increase of total working set. Which to prefer is highly dependent on the situation.</p>
<p lang="en-US">
<p style="padding-left: 30px;"><span style="font-family: Tahoma;"><span style="font-size: x-small;"><em>One must be careful about the trade-offs involved in this technique. The technique</em></span></span><span style="color: #333333;"><em> </em></span><span style="color: #333333;"><span style="font-family: Tahoma;"><span style="font-size: x-small;"><em>does not remove the need for synchronization, but only moves the synchronization from a time-critical section of the code to a non-time-critical section of the code. </em></span></span></span></p>
<p lang="en-US">Well, I would say that the main point of the technique is reducing synchronization rather than moving it from one part of the code to another. Amortization via thread-local storage need not involve any movement of synchronization at all. The technique can be applied in two forms: a single final aggregation or periodic aggregations. The latter does not move synchronization anywhere, yet still reduces synchronization overhead. It has the additional benefit that a separate monitoring thread can periodically fetch and output intermediate results.</p>
<p lang="en-US">Consider, for example, the following program:</p>
<p lang="en-US">
<pre><span style="color: #0000ff;"><span style="font-size: x-small;">long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> total_event_count;</span></span></pre>
<pre><span style="color: #0000ff;"><span style="font-size: x-small;">__declspec</span></span><span style="color: #000000;"><span style="font-size: x-small;">(</span></span><span style="color: #0000ff;"><span style="font-size: x-small;">thread</span></span><span style="color: #000000;"><span style="font-size: x-small;">) </span></span><span style="color: #0000ff;"><span style="font-size: x-small;">long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> thread_event_count; </span></span><span style="color: #008000;"><span style="font-size: x-small;">// thread-local cache</span></span></pre>
<pre><span style="color: #0000ff;"><span style="font-size: x-small;">void</span></span><span style="color: #000000;"><span style="font-size: x-small;"> thread_function(size_t begin, size_t end)</span></span></pre>
<pre><span style="color: #000000;"><span style="font-size: x-small;">{</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;">for</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (size_t i = begin; i != end; i += 1)</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">{</span></span></pre>
<pre><span style="color: #000000;">  </span><span style="color: #0000ff;"><span style="font-size: x-small;">if</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (predicate(i))</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;"> {</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">  thread_event_count += 1;</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #008000;"><span style="font-size: x-small;">  // if we have cached enough events,</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #008000;"><span style="font-size: x-small;">  // transfer them to global shared variable</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;">  if</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (thread_event_count == THRESHOLD)</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">  {</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">   _InterlockedExchangeAdd(&amp;total_event_count, thread_event_count);</span></span></pre>
<pre><span style="color: #000000;">    </span><span style="color: #000000;"><span style="font-size: x-small;">thread_event_count = 0;</span></span></pre>
<pre><span style="color: #000000;">   </span><span style="color: #000000;"><span style="font-size: x-small;">}</span></span></pre>
<pre><span style="color: #000000;">  </span><span style="color: #000000;"><span style="font-size: x-small;">}</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">}</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #008000;"><span style="font-size: x-small;">// transfer the remainder of locally cached events</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;">if</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (thread_event_count)</span></span></pre>
<pre><span style="color: #000000;">  </span><span style="color: #000000;"><span style="font-size: x-small;">_InterlockedExchangeAdd(&amp;total_event_count, thread_event_count);</span></span></pre>
<pre><span style="color: #000000;"><span style="font-size: x-small;">}</span></span></pre>
<p lang="en-US">
<p>In the above example the synchronization is not moved to another point, but it is still reduced by a factor of THRESHOLD. A separate monitoring thread can periodically read and output the total_event_count variable, and there is a guarantee that total_event_count does not lag behind the real number of discovered events by more than NUMBER_OF_THREAD * THRESHOLD.</p>
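<p>For compilers without __declspec(thread) and _InterlockedExchangeAdd, the same periodic aggregation can be sketched in portable C++11. This is only a sketch under assumptions: std::atomic and thread_local stand in for the compiler-specific facilities, and the THRESHOLD value and the predicate body are placeholders invented for the example.</p>

```cpp
#include <atomic>
#include <cstddef>

// Assumed for illustration: flush the local count every THRESHOLD events.
static const long THRESHOLD = 64;

std::atomic<long> total_event_count(0);
thread_local long thread_event_count = 0; // thread-local cache

// Placeholder predicate (an assumption of this sketch):
// every third index counts as an "event".
inline bool predicate(std::size_t i)
{
    return i % 3 == 0;
}

void thread_function(std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i != end; i += 1)
    {
        if (predicate(i))
        {
            thread_event_count += 1;
            // one atomic read-modify-write per THRESHOLD events
            if (thread_event_count == THRESHOLD)
            {
                total_event_count.fetch_add(thread_event_count, std::memory_order_relaxed);
                thread_event_count = 0;
            }
        }
    }
    // transfer the remainder of locally cached events
    if (thread_event_count)
    {
        total_event_count.fetch_add(thread_event_count, std::memory_order_relaxed);
        thread_event_count = 0;
    }
}
```

<p>Relaxed ordering is used here on the assumption that the counter is purely statistical and no other data is published through it; if readers must observe other writes made before the increment, a stronger ordering would be needed.</p>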
<p lang="en-US">
<p lang="en-US">Note that thread-local data may actually be shared between threads; there is nothing preventing this. The way a variable is declared is orthogonal to its “shared-ness”. The address of a variable declared as __declspec(thread)/__thread/omp threadprivate/pthread_key_create()/TlsAlloc() can be passed to another thread, and thus the variable becomes shared. Conversely, a plain global variable may only ever be accessed by a single thread, which makes it effectively local to that thread.</p>
<p lang="en-US">
<p lang="en-US">Also note that you can get a flavor of thread-local data with a plain global array indexed by a unique thread index. This technique is less dependent on a particular compiler/OS, and makes sharing of thread-local data much easier (infrequent sharing is not dangerous and is anyway necessary in any real-world program). Here is a simple example:</p>
<p lang="en-US">
<pre><span style="color: #008000;"><span style="font-size: x-small;">// array of "thread-local" data</span></span></pre>
<pre><span style="color: #0000ff;"><span style="font-size: x-small;">long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> </span></span><span style="color: #0000ff;"><span style="font-size: x-small;">volatile</span></span><span style="color: #000000;"><span style="font-size: x-small;"> event_counts [MAX_THREAD_COUNT] = {};</span></span></pre>
<pre><span style="color: #008000;"><span style="font-size: x-small;">// sequence used to generate unique thread indexes</span></span></pre>
<pre><span style="color: #0000ff;"><span style="font-size: x-small;">long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> </span></span><span style="color: #0000ff;"><span style="font-size: x-small;">volatile</span></span><span style="color: #000000;"><span style="font-size: x-small;"> thread_sequence = 0;</span></span></pre>
<pre><span style="color: #008000;"><span style="font-size: x-small;">// worker thread routine</span></span></pre>
<pre><span style="color: #0000ff;"><span style="font-size: x-small;">void</span></span><span style="color: #000000;"><span style="font-size: x-small;"> worker_thread(size_t begin, size_t end)</span></span></pre>
<pre><span style="color: #000000;"><span style="font-size: x-small;">{</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #008000;"><span style="font-size: x-small;">// obtain unique thread index</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;">long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> my_idx = _InterlockedIncrement(&amp;thread_sequence) - 1;</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;">for</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (size_t i = begin; i != end; i += 1)</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">{</span></span></pre>
<pre><span style="color: #000000;">  </span><span style="color: #0000ff;"><span style="font-size: x-small;">if</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (predicate(i))</span></span></pre>
<pre><span style="color: #000000;">   </span><span style="color: #000000;"><span style="font-size: x-small;">event_counts[my_idx] += 1;</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">} </span></span></pre>
<pre><span style="color: #000000;"><span style="font-size: x-small;">}</span></span></pre>
<pre><span style="color: #008000;"><span style="font-size: x-small;">// monitoring thread routine</span></span></pre>
<pre><span style="color: #0000ff;"><span style="font-size: x-small;">void</span></span><span style="color: #000000;"><span style="font-size: x-small;"> monitor_thread()</span></span></pre>
<pre><span style="color: #000000;"><span style="font-size: x-small;">{</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;">while</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (termination_condition == </span></span><span style="color: #0000ff;"><span style="font-size: x-small;">false</span></span><span style="color: #000000;"><span style="font-size: x-small;">)</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">{</span></span></pre>
<pre><span style="color: #000000;">  </span><span style="color: #008000;"><span style="font-size: x-small;">// obtain current thread count</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;"> long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> thread_count = thread_sequence;</span></span></pre>
<pre><span style="color: #000000;">  </span><span style="color: #0000ff;"><span style="font-size: x-small;">long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> sum = 0;</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #0000ff;"><span style="font-size: x-small;"> for</span></span><span style="color: #000000;"><span style="font-size: x-small;"> (</span></span><span style="color: #0000ff;"><span style="font-size: x-small;">long</span></span><span style="color: #000000;"><span style="font-size: x-small;"> i = 0; i != thread_count; i += 1)</span></span></pre>
<pre><span style="color: #000000;">   </span><span style="color: #000000;"><span style="font-size: x-small;">sum += event_counts[i];</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;"> printf(</span></span><span style="color: #a31515;"><span style="font-size: x-small;">"event count: %u\n"</span></span><span style="color: #000000;"><span style="font-size: x-small;">, (</span></span><span style="color: #0000ff;"><span style="font-size: x-small;">unsigned</span></span><span style="color: #000000;"><span style="font-size: x-small;">)sum);</span></span></pre>
<pre><span style="color: #000000;">  </span><span style="color: #000000;"><span style="font-size: x-small;">Sleep(1000);</span></span></pre>
<pre><span style="color: #000000;"> </span><span style="color: #000000;"><span style="font-size: x-small;">}</span></span></pre>
<pre><span style="color: #000000;"><span style="font-size: x-small;">}</span></span></pre>
<p lang="en-US">
<p>However, be aware that the above example contains a nasty instance of <strong>false-sharing</strong> which kills performance. You can read about how to cope with it in the article <span style="color: #000080;"><span style="text-decoration: underline;"><a href="http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/" target="_blank">Avoiding and Identifying False Sharing Among Threads</a></span></span>.</p>
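<p>One common way to avoid the false sharing is to pad every per-thread counter to its own cache line. The sketch below assumes a 64-byte cache line and a C++11 compiler with alignas; padded_counter_t and both constants are names and values chosen for this example only.</p>

```cpp
#include <cstddef>

// Assumed cache line size; 64 bytes is common on current x86 processors,
// but it is an assumption of this sketch, not a guarantee.
static const std::size_t CACHE_LINE_SIZE = 64;
static const std::size_t MAX_THREAD_COUNT = 64; // assumed upper bound

// Each counter is aligned to, and therefore padded out to, a full cache line,
// so counters that belong to different threads never share a line.
struct alignas(CACHE_LINE_SIZE) padded_counter_t
{
    long volatile value;
};

// array of "thread-local" data, one cache line per thread
padded_counter_t event_counts[MAX_THREAD_COUNT] = {};
```

<p>A worker thread then updates event_counts[my_idx].value, and the monitoring thread sums the value fields. The price is memory: every counter now occupies a full cache line.</p>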
<p lang="en-US">
<p lang="en-US">Keep threading!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/03/05/dont-hinder-concurrency/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parallelization And Optimization of The Line Segment Intersection Problem</title>
		<link>http://software.intel.com/en-us/blogs/2009/08/12/parallelization-and-optimization-of-the-line-segment-intersection-problem/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/08/12/parallelization-and-optimization-of-the-line-segment-intersection-problem/#comments</comments>
		<pubDate>Wed, 12 Aug 2009 16:19:32 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[fork-join]]></category>
		<category><![CDATA[line segment intersection]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[parallelization]]></category>
		<category><![CDATA[SSE]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/08/12/parallelization-and-optimization-of-the-line-segment-intersection-problem/</guid>
		<description><![CDATA[Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers representing the two (x,y,z) endpoints. ]]></description>
			<content:encoded><![CDATA[
<p><strong>Line Segment Intersection Problem</strong></p>
<p><strong>1. Problem Statement</strong></p>
<p>Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers representing the two (x,y,z) endpoints.</p>
<p><strong>2. Single-threaded Implementation</strong></p>
<p><strong>2.1. Algorithm outline</strong></p>
<p>As a base algorithm I choose simple exhaustive enumeration of segment pairs:</p>
<pre>void solve_segment_intersection(std::vector&lt;segment_t&gt; const&amp; segments, std::vector&lt;intersection_t&gt;&amp; results)</pre>
<pre>{</pre>
<pre>    for (size_t i = 0; i != segments.size(); i += 1)</pre>
<pre>    {</pre>
<pre>        for (size_t j = i + 1; j != segments.size(); j += 1)</pre>
<pre>        {</pre>
<pre>            if (is_intersect(segments[i], segments[j]))</pre>
<pre>                results.push_back(intersection_t(segments[i], segments[j]));</pre>
<pre>        }</pre>
<pre>    }</pre>
<pre>}</pre>
<p>The computational complexity of this algorithm is O(N^2) (about N^2 / 2 pair tests). That is quite high, and there are known algorithms with complexity O((N+K)*logN) [1] and even O(N*logN + K) [2] (K is the number of intersections). However, I decide not to implement these theoretically more efficient algorithms because of their significantly higher implementation complexity and the fact that they are much harder to parallelize efficiently and less amenable to low-level optimizations such as vectorization, branch elimination, cache-conscious memory access patterns, etc.</p>
<p>In order to reduce the amount of work I sort the segments by X coordinate and then bound the enumeration. If segments are sorted by X1 coordinate (it's assumed that X1 &lt;= X2), then once we encounter a segment whose X1 is greater than the X2 coordinate of the current segment, we can skip all subsequent segments (they do not overlap the current segment along the X axis):</p>
<pre>void solve_segment_intersection(std::vector&lt;segment_t&gt;&amp; segments, std::vector&lt;intersection_t&gt;&amp; results)
{
<strong>    std::sort(segments.begin(), segments.end(), sort_by_x1);
</strong>
    for (size_t i = 0; i != segments.size(); i += 1)
    {
        for (size_t j = i + 1; j != segments.size(); j += 1)
        {
<strong>            if (segments[j].x1 &gt; segments[i].x2) break;
</strong>
            if (is_intersect(segments[i], segments[j]))
                results.push_back(intersection_t(segments[i], segments[j]));
        }
    }
}</pre>
<p>Now the computational complexity is reduced to O(N*M), where M depends on the input data (roughly, how many segments overlap a given segment along the X axis). So in the best case the enumeration is now O(N) (if no segments overlap), not counting the O(N*logN) sort. In the worst case the complexity is still O(N^2).</p>
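<p>To get a feeling for M on a particular input, one can count how many pairs survive the X-overlap pruning. The following is a minimal sketch; x_extent_t and count_candidate_pairs are names invented for this example, and the comparator is only one plausible definition of sort_by_x1.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal stand-in for the segment type; only the X extent matters here.
struct x_extent_t
{
    int x1, x2; // x1 <= x2 is assumed
};

// One plausible definition of the comparator (an assumption of this sketch).
inline bool sort_by_x1(x_extent_t const& a, x_extent_t const& b)
{
    return a.x1 < b.x1;
}

// Counts the pairs that survive the X-overlap pruning, i.e. the number of
// times the precise predicate would have to be evaluated.
std::size_t count_candidate_pairs(std::vector<x_extent_t> segments)
{
    std::sort(segments.begin(), segments.end(), sort_by_x1);
    std::size_t candidates = 0;
    for (std::size_t i = 0; i != segments.size(); i += 1)
    {
        for (std::size_t j = i + 1; j != segments.size(); j += 1)
        {
            if (segments[j].x1 > segments[i].x2)
                break; // all later segments start even further to the right
            candidates += 1;
        }
    }
    return candidates;
}
```

<p>Dividing the result by the number of segments gives a rough estimate of M for that input.</p>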
<p><strong>2.2. Bounding Box</strong></p>
<p>Precise calculation of the is_intersect predicate is computationally expensive (it contains a considerable number of multiplications, conditional branches, etc). The bounding box test is a simple optimization that cheaply rejects evidently non-intersecting segments. It is based on the following observation: if 2 segments intersect in 3D space, then their projections onto each coordinate axis overlap too. Thus, if the projections onto at least one axis do not overlap, then the segments themselves do not intersect in 3D space:</p>
<pre>If ((s1.x2 &lt; s2.x1) or (s2.x2 &lt; s1.x1) or</pre>
<pre>    (s1.y2 &lt; s2.y1) or (s2.y2 &lt; s1.y1) or</pre>
<pre>     (s1.z2 &lt; s2.z1) or (s2.z2 &lt; s1.z1))

then segments s1 and s2 are not intersecting.
(it's assumed that x1 &lt;= x2, y1 &lt;= y2, z1 &lt;= z2)</pre>
<p>Experimentation shows that for randomly generated data the bounding box technique rejects a significant share of evidently non-intersecting segments.</p>
<p>Here is the code:</p>
<pre>void solve_segment_intersection(std::vector&lt;segment_t&gt;&amp; segments, std::vector&lt;intersection_t&gt;&amp; results)</pre>
<pre>{</pre>
<pre>    std::sort(segments.begin(), segments.end(), sort_by_x1);</pre>
<pre>    for (size_t i = 0; i != segments.size(); i += 1)</pre>
<pre>    {</pre>
<pre>        for (size_t j = i + 1; j != segments.size(); j += 1)</pre>
<pre>        {</pre>
<pre>            if (segments[j].x1 &gt; segments[i].x2)</pre>
<pre>                break;</pre>
<pre><strong>            int max_of_mins_y = std::max(segments[i].y1, segments[j].y1);</strong></pre>
<pre><strong>            int min_of_maxs_y = std::min(segments[i].y2, segments[j].y2);</strong></pre>
<pre><strong>            if (max_of_mins_y &gt; min_of_maxs_y)</strong></pre>
<pre><strong>                // evidently no intersection</strong></pre>
<pre><strong>                continue;</strong></pre>
<pre><strong>            int max_of_mins_z = std::max(segments[i].z1, segments[j].z1);</strong></pre>
<pre><strong>            int min_of_maxs_z = std::min(segments[i].z2, segments[j].z2);</strong></pre>
<pre><strong>            if (max_of_mins_z &gt; min_of_maxs_z)</strong></pre>
<pre><strong>                // evidently no intersection</strong></pre>
<pre><strong>                continue;</strong></pre>
<pre>            // only now calculate the precise predicate</pre>
<pre>            if (is_intersect(segments[i], segments[j]))</pre>
<pre>                results.push_back(intersection_t(segments[i], segments[j]));</pre>
<pre>        }</pre>
<pre>    }</pre>
<pre>}</pre>
<p><strong>2.3. SSE To The Rescue</strong></p>
<p>Bounding box verification can be further optimized with SSE vector operations [3]. The first gain comes from the fact that SSE vector operations can process up to 4 pairs of 32-bit integers at a time. The second gain comes from the fact that the SSE extensions contain operations that find the minimum/maximum of 2 values without conditional branching, in a single machine instruction.</p>
<p>Note that in order to apply vector operations, the data structures have to be converted from an AoS (array of structures) representation to an SoA (structure of arrays) representation, i.e. the following straightforward representation of an array of segments:</p>
<pre>struct segment_t</pre>
<pre>{</pre>
<pre>    int x1, x2, y1, y2, z1, z2;</pre>
<pre>};</pre>
<pre>typedef std::vector&lt;segment_t&gt; segments_t;</pre>
<p>has to be converted to the following SoA representation:</p>
<pre>struct segments_t
{
    std::vector&lt;int&gt; x1, x2, y1, y2, z1, z2;
};</pre>
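<p>The conversion itself is a single pass over the array. A minimal sketch follows; the name to_soa is hypothetical, and the SoA struct is called segments_soa_t here only so that it does not clash with the AoS typedef above.</p>

```cpp
#include <cstddef>
#include <vector>

// AoS element, as in the article.
struct segment_t
{
    int x1, x2, y1, y2, z1, z2;
};

// SoA layout; renamed (hypothetically) to avoid clashing with the AoS typedef.
struct segments_soa_t
{
    std::vector<int> x1, x2, y1, y2, z1, z2;
};

// Copies every coordinate into its own contiguous array, preserving order,
// so that element k of each array belongs to segments[k].
segments_soa_t to_soa(std::vector<segment_t> const& segments)
{
    segments_soa_t soa;
    std::size_t const n = segments.size();
    soa.x1.reserve(n); soa.x2.reserve(n);
    soa.y1.reserve(n); soa.y2.reserve(n);
    soa.z1.reserve(n); soa.z2.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
    {
        segment_t const& s = segments[i];
        soa.x1.push_back(s.x1); soa.x2.push_back(s.x2);
        soa.y1.push_back(s.y1); soa.y2.push_back(s.y2);
        soa.z1.push_back(s.z1); soa.z2.push_back(s.z2);
    }
    return soa;
}
```

<p>The conversion has to run after sorting, so that index k in the SoA arrays and index k in the sorted AoS array refer to the same segment.</p>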
<p>After this conversion we can use SSE vector operations. Here is slightly simplified code (it uses Intel C++ compiler intrinsics):</p>
<pre>void solve_segment_intersection(std::vector&lt;segment_t&gt;&amp; segments, std::vector&lt;intersection_t&gt;&amp; results)
{
    std::sort(segments.begin(), segments.end(), sort_by_x1);
    segments_t soa_segments; // population of soa_segments is omitted

    for (size_t i = 0; i != segments.size(); i += 1)
    {
        // load y and z coords of first segment
        __m128i s1_min_y_v = _mm_set1_epi32(soa_segments.y1[i]);
        __m128i s1_max_y_v = _mm_set1_epi32(soa_segments.y2[i]);
        __m128i s1_min_z_v = _mm_set1_epi32(soa_segments.z1[i]);
        __m128i s1_max_z_v = _mm_set1_epi32(soa_segments.z2[i]);

        for (size_t j = i + 1; j != segments.size(); <strong>j += 4) // note that increment is 4</strong>
        {
            if (segments[j].x1 &gt; segments[i].x2)
                break;

            // load y and z coords of the next 4 candidate segments
            // (unaligned loads: j is not necessarily a multiple of 4)
            __m128i s2_min_y_v = _mm_loadu_si128((__m128i*)&amp;soa_segments.y1[j]);
            __m128i s2_max_y_v = _mm_loadu_si128((__m128i*)&amp;soa_segments.y2[j]);
            __m128i s2_min_z_v = _mm_loadu_si128((__m128i*)&amp;soa_segments.z1[j]);
            __m128i s2_max_z_v = _mm_loadu_si128((__m128i*)&amp;soa_segments.z2[j]);

            // find bounding box projection to y axis
            __m128i max_of_mins_y = _mm_max_epi32(s1_min_y_v, s2_min_y_v);
            __m128i min_of_maxs_y = _mm_min_epi32(s1_max_y_v, s2_max_y_v);

            // find bounding box projection to z axis
            __m128i max_of_mins_z = _mm_max_epi32(s1_min_z_v, s2_min_z_v);
            __m128i min_of_maxs_z = _mm_min_epi32(s1_max_z_v, s2_max_z_v);

            // check whether segments overlap by y and z coords
            __m128i cmp_y = _mm_cmpgt_epi32(max_of_mins_y, min_of_maxs_y);
            __m128i cmp_z = _mm_cmpgt_epi32(max_of_mins_z, min_of_maxs_z);

            // aggregate results for y and z axes
            __m128i cmp_yz = _mm_or_si128(cmp_y, cmp_z);

            if (_mm_test_all_ones(cmp_yz))
                // none of these 4 segments can intersect segment i
                continue;

            if (0 == _mm_extract_epi32(cmp_yz, 0))
                // bounding box says that these segments are possibly intersecting
                // so make precise verification
                if (is_intersect(segments[i], segments[j+0]))
                    results.push_back(intersection_t(segments[i], segments[j+0]));

            // analogously for other segments
            if (0 == _mm_extract_epi32(cmp_yz, 1) &amp;&amp; is_intersect(segments[i], segments[j+1]))
                results.push_back(intersection_t(segments[i], segments[j+1]));
            if (0 == _mm_extract_epi32(cmp_yz, 2) &amp;&amp; is_intersect(segments[i], segments[j+2]))
                results.push_back(intersection_t(segments[i], segments[j+2]));
            if (0 == _mm_extract_epi32(cmp_yz, 3) &amp;&amp; is_intersect(segments[i], segments[j+3]))
                results.push_back(intersection_t(segments[i], segments[j+3]));
        }
    }
}</pre>
<p>Note that there is only a handful of lightweight instructions (no multiplications, divisions or branches) before the final _mm_test_all_ones() test, which discards 4 segments at a time (assumed to be the common case for randomly generated data). Only if that test fails do we consider each pair of segments individually, and the is_intersect predicate is computed only for pairs that the bounding box test reports as possibly intersecting.</p>
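<p>The same 4-at-a-time rejection can be sketched with SSE2 instructions only (_mm_max_epi32 and _mm_test_all_ones belong to SSE4.1). This is a variant, not the code above: it compares the extents directly and returns the per-lane result as a 4-bit mask. The function name reject_mask4 and its parameter layout are assumptions made for this example.</p>

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Tests one reference segment (y1..y2, z1..z2) against 4 candidates whose
// coordinates are stored in SoA arrays of at least 4 ints each.
// Bit k of the result is set when candidate k is rejected, i.e. its y or z
// extent does not overlap the reference segment's extent.
int reject_mask4(int y1, int y2, int z1, int z2,
                 int const* c_y1, int const* c_y2,
                 int const* c_z1, int const* c_z2)
{
    __m128i r_y1 = _mm_set1_epi32(y1);
    __m128i r_y2 = _mm_set1_epi32(y2);
    __m128i r_z1 = _mm_set1_epi32(z1);
    __m128i r_z2 = _mm_set1_epi32(z2);

    // unaligned loads, so the arrays need no special alignment
    __m128i cy1 = _mm_loadu_si128(reinterpret_cast<__m128i const*>(c_y1));
    __m128i cy2 = _mm_loadu_si128(reinterpret_cast<__m128i const*>(c_y2));
    __m128i cz1 = _mm_loadu_si128(reinterpret_cast<__m128i const*>(c_z1));
    __m128i cz2 = _mm_loadu_si128(reinterpret_cast<__m128i const*>(c_z2));

    // disjoint on an axis: candidate starts after the reference ends,
    // or the reference starts after the candidate ends
    __m128i dis_y = _mm_or_si128(_mm_cmpgt_epi32(cy1, r_y2),
                                 _mm_cmpgt_epi32(r_y1, cy2));
    __m128i dis_z = _mm_or_si128(_mm_cmpgt_epi32(cz1, r_z2),
                                 _mm_cmpgt_epi32(r_z1, cz2));
    __m128i dis = _mm_or_si128(dis_y, dis_z);

    // collect the sign bit of every 32-bit lane into bits 0..3
    return _mm_movemask_ps(_mm_castsi128_ps(dis));
}
```

<p>A result of 0xF means all 4 candidates are rejected at once; otherwise only the candidates whose bit is clear need the precise predicate.</p>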
<p><strong>3. Parallelization</strong></p>
<p><strong>3.1. Parallelization Outline</strong></p>
<p>As a tool for parallelization I chose the Intel Threading Building Blocks (TBB) library, which provides a handy and flexible abstraction of lightweight tasks. Here is the high-level scheme of parallelization (each rectangle represents a task; rectangles on the same horizontal level may be executed in parallel):</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/08/line.jpg"><img class="alignnone size-full wp-image-8647" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/08/line.jpg" alt="" width="462" height="627" /></a></p>
<p><strong>3.2. Start Phase</strong></p>
<p>In the start phase I analyze the command line parameters, read the number of segments, decide on the number of worker threads and start the input tasks.</p>
<p>The number of worker threads is crucial for performance. Too many threads introduce unjustified latency for small inputs because of the overheads of starting/stopping worker threads, work distribution, aggregation of results and synchronization. I ran a number of tests and set the minimum number of segments per worker thread to 1000, i.e. if there are fewer than 1000 segments all work is performed by a single thread, if there are 1000-2000 segments all work is performed by 2 worker threads, and so on. I chose 1000 based on the following observation: execution time for 1000 segments on 1 thread is roughly equal to that on 2 threads, i.e. at this point the parallelization speedup starts to outweigh the parallelization overheads.</p>
<p>The maximum number of worker threads is bounded by the number of available execution units (processors, cores, HT threads). A higher number of threads is pointless for CPU-bound tasks (as opposed to IO-bound tasks), because it only adds context switching overhead.</p>
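<p>One possible reading of this rule as code is shown below. The function name choose_worker_count is hypothetical, and the "+ 1" follows the worked example in the text (fewer than 1000 segments: 1 thread, 1000-2000 segments: 2 threads), which is an interpretation rather than the original implementation.</p>

```cpp
#include <algorithm>
#include <cstddef>

// Threshold taken from the text.
std::size_t const min_segments_per_thread = 1000;

// hardware_threads is the number of available execution units; it could come
// from std::thread::hardware_concurrency(), which may return 0 when unknown,
// hence the clamp to at least 1.
std::size_t choose_worker_count(std::size_t segment_count,
                                std::size_t hardware_threads)
{
    std::size_t const wanted = segment_count / min_segments_per_thread + 1;
    std::size_t const capped = std::min(wanted, hardware_threads);
    return std::max<std::size_t>(1, capped);
}
```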
<p><strong>3.3. Input Phase</strong></p>
<p>Each input task is supplied with its own piece of the input file (the input file is partitioned equally between tasks). First, the input task finds the segment description boundary nearest to the beginning of its piece (using the '\r' and '\n' symbols as markers). Then it parses the input file until the end of the piece, stores the segment descriptions in an array and collects some statistics.</p>
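<p>The boundary search can be sketched as follows, assuming the whole input is available as one in-memory buffer and every segment description occupies one line. The names next_record_start and chunk_range are hypothetical. Each raw split point is moved forward to the start of the next line, so every line ends up in exactly one piece.</p>

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Moves pos forward to the first character of the next record, unless pos
// already sits at the start of a record (or at the start of the buffer).
std::size_t next_record_start(std::string const& data, std::size_t pos)
{
    if (pos == 0)
        return 0;
    // advance until the previous character is a line terminator
    while (pos < data.size() && data[pos - 1] != '\n' && data[pos - 1] != '\r')
        ++pos;
    // skip the rest of the terminator (e.g. the '\n' of "\r\n")
    while (pos < data.size() && (data[pos] == '\n' || data[pos] == '\r'))
        ++pos;
    return pos;
}

// Half-open byte range [begin, end) of the records owned by piece `index`
// out of `count` equally sized pieces.
std::pair<std::size_t, std::size_t> chunk_range(std::string const& data,
                                                std::size_t index,
                                                std::size_t count)
{
    std::size_t const raw_begin = data.size() * index / count;
    std::size_t const raw_end = data.size() * (index + 1) / count;
    return std::make_pair(next_record_start(data, raw_begin),
                          next_record_start(data, raw_end));
}
```

<p>Because a piece's end is computed with the same function as the next piece's begin, adjacent pieces never overlap and never leave a gap; a piece may be empty when a single line spans it entirely.</p>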
<p><strong>3.4. Sort Phase</strong></p>
<p>In the sort phase I just use TBB's standard tbb::parallel_sort() algorithm. As will be seen in the Performance section, it achieves linear scalability and provides single-threaded performance similar to that of the std::sort() algorithm. So there is no need to reinvent the wheel.</p>
<p><strong>3.5. Intersection Phase</strong></p>
<p>In order to parallelize intersection testing, segment pairs must be partitioned into independent groups. Since we have 2 nested loops (the outer "i" loop and the inner "j" loop) and all iterations of both loops are independent (no data dependencies between them), we may choose either of them as the source of partitioning. However, the golden rule of parallelization says:</p>
<p><strong><em>Choose highest possible level for parallelization.</em></strong></p>
<p>Parallelization at the highest possible level tends to keep threads out of each other's way and gives each thread a bigger piece of independent work, thus reducing work distribution and synchronization overheads.</p>
<p>Parallelization of the outer loop may be infeasible for some reasons, for example if there are too few iterations (fewer than the number of threads), or if there are data dependencies between iterations (calculations in iteration i+1 depend on the results of iteration i). However, that is not the case for our problem, so I chose to partition the outer "i" loop.</p>
<p>Initially I created one task per worker thread and divided the iteration space equally among them; however, it turned out that this way the amount of work per task (thread) may be quite unbalanced. If segments are sorted by X coordinate in ascending order, then the first segment potentially has to be tested against all other segments, while the last segment does not have to be tested at all.</p>
<p>In order to overcome this I create a larger number of tasks, which is determined by the following formula:</p>
<p><strong>number_of_tasks = min(number_of_segments / min_number_of_segments_per_task, number_of_worker_threads * surplus_factor);</strong></p>
<p>min_number_of_segments_per_task is set to 1000 (see section 3.2 above).</p>
<p>surplus_factor is set to 64 and determines the number of tasks per worker thread required to achieve sufficiently good load balancing.</p>
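<p>The formula translates directly into code. In the sketch below the constants carry the values given in the text; the clamp to at least one task is an addition of this example, so that very small inputs still produce a task.</p>

```cpp
#include <algorithm>
#include <cstddef>

std::size_t const min_number_of_segments_per_task = 1000;
std::size_t const surplus_factor = 64;

// number_of_tasks = min(segments / min_per_task, threads * surplus_factor),
// clamped to at least 1 (the clamp is not part of the original formula).
std::size_t number_of_tasks(std::size_t number_of_segments,
                            std::size_t number_of_worker_threads)
{
    std::size_t const by_size =
        number_of_segments / min_number_of_segments_per_task;
    std::size_t const by_threads = number_of_worker_threads * surplus_factor;
    return std::max<std::size_t>(1, std::min(by_size, by_threads));
}
```

<p>For the two data sets measured in the Performance section and 2 worker threads, this gives 30 tasks for 30'000 segments and 128 tasks for 400'000 segments.</p>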
<p><strong>3.6. Output Phase</strong></p>
<p>I decided not to parallelize the output phase because the number of intersections is expected to be small.</p>
<p>However, if the number of intersections is expected to be very large, the following parallelization strategy can be applied. A memory mapping of sufficient size is opened for the output file. A private buffer able to hold, let's say, 1000 intersections is allocated for every worker thread. Worker threads collect results in these private buffers. When a thread's buffer becomes full, the thread reserves a range of the output file with a single atomic increment of the file_end_marker variable, and then flushes its private buffer to the file. When a thread finishes its work, it has to flush its partially full private buffer to the file too. This way each thread outputs its own hot-in-cache results, and output is evenly distributed and parallelized.</p>
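<p>The reservation step of this strategy can be sketched with std::atomic. In the example below a std::vector&lt;char&gt; stands in for the memory-mapped output file, and the class name output_region is hypothetical; only the single fetch_add on file_end_marker_ is shared between threads.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for a memory-mapped output file of fixed, sufficient size.
class output_region
{
public:
    explicit output_region(std::size_t capacity)
        : bytes_(capacity), file_end_marker_(0)
    {
    }

    // Reserves `size` bytes with one atomic add, then copies the caller's
    // private buffer into the reserved range. Returns the reserved offset.
    // Ranges handed to different threads never overlap.
    std::size_t flush(char const* buffer, std::size_t size)
    {
        std::size_t const offset =
            file_end_marker_.fetch_add(size, std::memory_order_relaxed);
        assert(offset + size <= bytes_.size()); // mapping must be large enough
        std::memcpy(bytes_.data() + offset, buffer, size);
        return offset;
    }

    // Number of bytes reserved so far.
    std::size_t used() const
    {
        return file_end_marker_.load(std::memory_order_relaxed);
    }

    char const* data() const { return bytes_.data(); }

private:
    std::vector<char> bytes_;
    std::atomic<std::size_t> file_end_marker_;
};
```

<p>With this scheme the order of records in the file depends on which thread reserves first, so it fits only when the output order does not matter.</p>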
<p><strong>4. Performance</strong></p>
<p>I used 2 different data generation algorithms for performance testing. The first algorithm generates large segments in a large coordinate domain. Here are the results I got on an Intel Core2 Duo P9500 (2.5GHz, 6MB L2 cache, 1GHz memory bus) for 30'000 segments with coordinates randomly generated in the range 0 ... 10^6 (number of intersections is 0):</p>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="277" valign="top"><strong>Phase</strong></td>
<td width="277" valign="top"><strong>Single-threaded (ms)</strong></td>
<td width="277" valign="top"><strong>Multi-threaded (ms)</strong></td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Input</td>
<td width="277" valign="top">2.54</td>
<td width="277" valign="top">1.38</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Sort</td>
<td width="277" valign="top">5.4</td>
<td width="277" valign="top">2.73</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Intersect</td>
<td width="277" valign="top">5873</td>
<td width="277" valign="top">3085</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Output</td>
<td width="277" valign="top">0.3</td>
<td width="277" valign="top">0.3</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Total</td>
<td width="277" valign="top">5920</td>
<td width="277" valign="top">3100</td>
<td width="0" height="23"> </td>
</tr>
</tbody>
</table>
<p>The second algorithm generates small segments in a small coordinate domain. Here are the results I got for 400'000 segments of maximum length 40 (in each direction), with all coordinates in the range 0...400 (~13K intersections):</p>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="277" valign="top"><strong>Phase</strong></td>
<td width="277" valign="top"><strong>Single-threaded (ms)</strong></td>
<td width="277" valign="top"><strong>Multi-threaded (ms)</strong></td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Input</td>
<td width="277" valign="top">30</td>
<td width="277" valign="top">16</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Sort</td>
<td width="277" valign="top">89</td>
<td width="277" valign="top">47</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Intersect</td>
<td width="277" valign="top">8611</td>
<td width="277" valign="top">4450</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Output</td>
<td width="277" valign="top">6.2</td>
<td width="277" valign="top">6.2</td>
<td width="0" height="23"> </td>
</tr>
<tr>
<td width="277" valign="top">Total</td>
<td width="277" valign="top">8750</td>
<td width="277" valign="top">4530</td>
<td width="0" height="23"> </td>
</tr>
</tbody>
</table>
<p>As one can see, the parallelized phases achieve nearly linear speedup. There is some fixed deviation from linear scaling, probably caused by thread management, work distribution and synchronization overheads. The execution time of the non-parallelized output phase is negligible. So I consider the parallelization successful on the whole.</p>
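<p>The totals in the two tables can be turned into speedup and efficiency figures with a few lines of arithmetic. The helper names below are hypothetical; the core count of 2 follows from the Core2 Duo used for the measurements.</p>

```cpp
#include <cassert>

// Speedup = single-threaded time / multi-threaded time.
double speedup(double single_threaded_ms, double multi_threaded_ms)
{
    assert(multi_threaded_ms > 0.0);
    return single_threaded_ms / multi_threaded_ms;
}

// Parallel efficiency = speedup / number of cores (1.0 means perfectly linear).
double efficiency(double single_threaded_ms, double multi_threaded_ms,
                  unsigned cores)
{
    assert(cores > 0);
    return speedup(single_threaded_ms, multi_threaded_ms) / cores;
}
```

<p>For the totals above, speedup(5920, 3100) is about 1.91 and speedup(8750, 4530) is about 1.93, i.e. roughly 95-97% efficiency on 2 cores.</p>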
<p><strong>References</strong></p>
<p>[1] Bentley-Ottmann algorithm. <a href="http://en.wikipedia.org/wiki/Bentley%E2%80%93Ottmann_algorithm">http://en.wikipedia.org/wiki/Bentley%E2%80%93Ottmann_algorithm</a></p>
<p>[2] Linear-Time Algorithms for Geometric Graphs with Sublinearly Many Crossings. <a href="http://www.siam.org/proceedings/soda/2009/SODA09_018_eppsteind.pdf">http://www.siam.org/proceedings/soda/2009/SODA09_018_eppsteind.pdf</a></p>
<p>[3] <a href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/08/12/parallelization-and-optimization-of-the-line-segment-intersection-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multi-producer/multi-consumer SEH-based queue</title>
		<link>http://software.intel.com/en-us/blogs/2009/08/11/multi-producermulti-consumer-seh-based-queue/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/08/11/multi-producermulti-consumer-seh-based-queue/#comments</comments>
		<pubDate>Tue, 11 Aug 2009 16:51:24 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[lock-free]]></category>
		<category><![CDATA[producer-consumer]]></category>
		<category><![CDATA[queue]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/08/11/multi-producermulti-consumer-seh-based-queue/</guid>
		<description><![CDATA[Novel mostly lock-free algorithm for multi-producer/multi-consumer queue]]></description>
			<content:encoded><![CDATA[<p>Here is my multi-producer/single-consumer queue:<br />
<a href="http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201">http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201</a><br />
The interesting part of the algorithm is an XCHG-based producer part.</p>
<p>As Chris Thomasson correctly noted, the XCHG-based producer part can be combined with the well-known CAS-based consumer part in order to get multi-producer/multi-consumer (MPMC) queue:<br />
<a href="http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/053e322ea90e4ad5">http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/053e322ea90e4ad5</a></p>
<p>In general, one may combine different producer and consumer parts into a single queue provided that the queue structure stays the same. For example, it's possible to combine the XCHG-based producer part with the consumer part from the 2-LOCK queue algorithm. The resulting 1-LOCK/1-XCHG queue will have quite appealing characteristics (wait-free producers, 1 spin-lock acquisition for consumers, no need for ABA prevention or safe memory reclamation).</p>
<p>But what I'm going to show is a bit more interesting: a novel consumer part for an MPMC queue.<br />
The problem with the classical CAS-based MPMC queue is that both producers and consumers may touch potentially freed/reused memory; moreover, producers may write to that memory. That's why it requires safe memory reclamation (SMR), and in general SMR is quite problematic in a non-managed non-kernel environment (C/C++).<br />
The XCHG-based producer part gracefully avoids touching freed/reused memory. So now the problem is with the consumer part only, but note that consumers only read from freed/reused memory (they never write to it). The key point of the proposed algorithm is handling reads from reused memory with a failing CAS, and handling reads from freed memory with an SEH/signal handler.<br />
Main characteristics of the algorithm:<br />
- intrusive<br />
- producers: 1 XCHG, wait-free<br />
- consumers: 1 CAS on common path, mostly lock-free (***)<br />
- producers and consumers do not contend with each other (until queue is empty)<br />
- no need for safe memory reclamation</p>
<p>(***) requires additional comments. There is a small window of inconsistency (1 machine instruction long) for producers. If a producer is preempted there, it may (or may not) block consumers (other producers are still wait-free). If a producer is terminated there, it will cause a system-wide stall. Given the length of the window, the probability of these events may be considered negligible in most situations.</p>
<p>The algorithm requires double-word CAS (for pointer + ABA counter). On 64-bit systems it may be reduced to single-word (64-bit) CAS with a pointer packing technique. For example, on Intel64/Windows any aligned pointer may be packed into 39 bits, which leaves room for a 25-bit ABA counter.</p>
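<p>To make the window easier to see, here is a portable sketch of just the producer side, written with C++11 atomics instead of the MSVC intrinsics. It is an illustration under assumptions, not the queue itself: the class name xchg_producer_list and the first() accessor are hypothetical, and there is no consumer logic.</p>

```cpp
#include <atomic>
#include <cassert>

struct node_t
{
    std::atomic<node_t*> next{nullptr};
};

// Producer half only: nodes are appended with a single atomic exchange.
class xchg_producer_list
{
public:
    xchg_producer_list() : head_(nullptr), tail_(&head_) {}

    void enqueue(node_t* node)
    {
        assert(node);
        node->next.store(nullptr, std::memory_order_relaxed);
        // Step 1: swing the tail to the new node's link slot (the XCHG).
        std::atomic<node_t*>* prev =
            tail_.exchange(&node->next, std::memory_order_acq_rel);
        // <--- window of inconsistency: the node is the tail already,
        //      but it is not yet reachable from the head
        // Step 2: link the previous slot (head or last node) to the new node.
        prev->store(node, std::memory_order_release);
    }

    // First node currently reachable from the head, or nullptr.
    node_t* first() const { return head_.load(std::memory_order_acquire); }

private:
    std::atomic<node_t*> head_;
    // Points at the link slot to fill next: &head_ or &last_node->next.
    std::atomic<std::atomic<node_t*>*> tail_;
};
```

<p>A producer that stops between step 1 and step 2 leaves every node enqueued after it unreachable from the head, which is exactly the situation in which consumers have to wait.</p>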
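<p>The packing can be sketched as follows. The exact scheme is not spelled out above, so this example assumes, purely for illustration, that addresses are below 2^43 and 16-byte aligned, which leaves 39 significant bits; addresses are passed as plain 64-bit integers and all names are hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>

// Assumptions of this sketch (not taken from the article):
// addresses fit in 43 bits and the low 4 bits are always zero.
unsigned const alignment_bits = 4;
unsigned const pointer_bits = 39;
unsigned const counter_bits = 25;
std::uint64_t const pointer_mask = (std::uint64_t(1) << pointer_bits) - 1;
std::uint64_t const counter_mask = (std::uint64_t(1) << counter_bits) - 1;

// Packs an address and an ABA counter into one 64-bit word;
// the counter wraps around at 2^25.
std::uint64_t pack(std::uint64_t address, std::uint32_t counter)
{
    assert((address & ((std::uint64_t(1) << alignment_bits) - 1)) == 0);
    assert((address >> alignment_bits) <= pointer_mask);
    std::uint64_t const wrapped = counter & counter_mask;
    return (address >> alignment_bits) | (wrapped << pointer_bits);
}

std::uint64_t unpack_address(std::uint64_t word)
{
    return (word & pointer_mask) << alignment_bits;
}

std::uint32_t unpack_counter(std::uint64_t word)
{
    return static_cast<std::uint32_t>(word >> pointer_bits);
}
```

<p>With both fields in one word, the head can be updated by an ordinary 64-bit CAS instead of a double-word CAS.</p>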
<p>OK, here we go:</p>
<pre>/*  Multi-producer/multi-consumer queue</pre>
<pre> *  2009, Dmitriy V'yukov</pre>
<pre> *  Distributed under the terms of the GNU General Public License</pre>
<pre> *  as published by the Free Software Foundation,</pre>
<pre> *  either version 3 of the License,</pre>
<pre> *  or (at your option) any later version.</pre>
<pre> *  See: http://www.gnu.org/licenses</pre>
<pre> */</pre>
<pre>// 32-bit, Windows, MSVC</pre>
<pre>#include &lt;windows.h&gt;</pre>
<pre>#include &lt;intrin.h&gt;</pre>
<pre>class mpmc_queue</pre>
<pre>{</pre>
<pre>public:</pre>
<pre>    struct node_t</pre>
<pre>    {</pre>
<pre>        node_t* volatile        next_;</pre>
<pre>    };</pre>
<pre>    mpmc_queue()</pre>
<pre>    {</pre>
<pre>        head_.ptr_ = 0;</pre>
<pre>        head_.cnt_ = 0;</pre>
<pre>        tail_ = &amp;head_.ptr_;</pre>
<pre>    }</pre>
<pre>    ~mpmc_queue()</pre>
<pre>    {</pre>
<pre>        ASSERT(head_.ptr_ == 0);</pre>
<pre>        ASSERT(tail_ == &amp;head_.ptr_);</pre>
<pre>    }</pre>
<pre>    void enqueue(node_t* node)</pre>
<pre>    {</pre>
<pre>        ASSERT(node);</pre>
<pre>        node-&gt;next_ = 0;</pre>
<pre>        node_t** prev = (node_t**)</pre>
<pre>            _InterlockedExchange((long*)&amp;tail_, (long)node);</pre>
<pre>        ASSERT(prev);</pre>
<pre>        // &lt;--- the window of inconsistency is HERE (***)</pre>
<pre>        prev[0] = node;</pre>
<pre>    }</pre>
<pre>    node_t* dequeue()</pre>
<pre>    {</pre>
<pre>        unsigned retry_count = 0;</pre>
<pre>        retry:</pre>
<pre>        __try</pre>
<pre>        {</pre>
<pre>            head_t h;</pre>
<pre>            h.ptr_= head_.ptr_;</pre>
<pre>            h.cnt_ = head_.cnt_;</pre>
<pre>            for (;;)</pre>
<pre>            {</pre>
<pre>                node_t* n = h.ptr_;</pre>
<pre>                if (n == 0)</pre>
<pre>                    return 0;</pre>
<pre>                if (n-&gt;next_)</pre>
<pre>                {</pre>
<pre>                    head_t xchg = {n-&gt;next_, h.cnt_ + 1};</pre>
<pre>                    __int64 prev_raw =</pre>
<pre>                        _InterlockedCompareExchange64</pre>
<pre>                            (&amp;head_.whole_, xchg.whole_, h.whole_);</pre>
<pre>                    head_t prev = *(head_t*)&amp;prev_raw;</pre>
<pre>                    if (*(__int64*)&amp;prev == *(__int64*)&amp;h)</pre>
<pre>                        return n;</pre>
<pre>                    h.ptr_ = prev.ptr_;</pre>
<pre>                    h.cnt_ = prev.cnt_;</pre>
<pre>                }</pre>
<pre>                else</pre>
<pre>                {</pre>
<pre>                    node_t* t = (node_t*)tail_;</pre>
<pre>                    if (n != t)</pre>
<pre>                    {</pre>
<pre>                        // spinning here may only be caused</pre>
<pre>                        // by producer preempted in (***)</pre>
<pre>                        SwitchToThread();</pre>
<pre>                        h.ptr_= head_.ptr_;</pre>
<pre>                        h.cnt_ = head_.cnt_;</pre>
<pre>                        continue;</pre>
<pre>                    }</pre>
<pre>                    head_t xchg = {0, h.cnt_ + 1};</pre>
<pre>                    head_t prev;</pre>
<pre>                    prev.whole_ = _InterlockedCompareExchange64</pre>
<pre>                        (&amp;head_.whole_, xchg.whole_, h.whole_);</pre>
<pre>                    if (prev.whole_ == h.whole_)</pre>
<pre>                    {</pre>
<pre>                        node_t* prev_tail = (node_t*)</pre>
<pre>                         _InterlockedCompareExchange</pre>
<pre>                         ((long*)&amp;tail_, (long)&amp;head_.ptr_, (long)n);</pre>
<pre>                        if (prev_tail == n)</pre>
<pre>                            return n;</pre>
<pre>                        // spinning here may only be caused</pre>
<pre>                        // by producer preempted in (***)</pre>
<pre>                        while (n-&gt;next_ == 0)</pre>
<pre>                            SwitchToThread();</pre>
<pre>                        head_.ptr_ = n-&gt;next_;</pre>
<pre>                        return n;</pre>
<pre>                    }</pre>
<pre>                    h.ptr_ = prev.ptr_;</pre>
<pre>                    h.cnt_ = prev.cnt_;</pre>
<pre>                }</pre>
<pre>            }</pre>
<pre>        }</pre>
<pre>        __except ((GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION</pre>
<pre>                &amp;&amp; ++retry_count &lt; 64*1024) ?</pre>
<pre>                EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)</pre>
<pre>        {</pre>
<pre>            goto retry;</pre>
<pre>        }</pre>
<pre>    }</pre>
<pre>private:</pre>
<pre>    union head_t</pre>
<pre>    {</pre>
<pre>        struct</pre>
<pre>        {</pre>
<pre>            node_t*             ptr_;</pre>
<pre>            unsigned            cnt_;</pre>
<pre>        };</pre>
<pre>        __int64                 whole_;</pre>
<pre>    };</pre>
<pre>    head_t volatile             head_;</pre>
<pre>    char                        pad_ [64];</pre>
<pre>    node_t* volatile* volatile  tail_;</pre>
<pre>    mpmc_queue(mpmc_queue const&amp;);</pre>
<pre>    mpmc_queue&amp; operator = (mpmc_queue const&amp;);</pre>
<pre>};</pre>
<p>And here is a small test:</p>
<pre>/*  Multi-producer/multi-consumer queue</pre>
<pre> *  2009, Dmitriy V'yukov</pre>
<pre> *  Distributed under the terms of the GNU General Public License</pre>
<pre> *  as published by the Free Software Foundation,</pre>
<pre> *  either version 3 of the License,</pre>
<pre> *  or (at your option) any later version.</pre>
<pre> *  See: http://www.gnu.org/licenses</pre>
<pre> */</pre>
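<pre>#include &lt;process.h&gt;  // _beginthreadex</pre>
<pre>#include &lt;cstdlib&gt;    // srand, rand</pre>
<pre>#include &lt;ctime&gt;      // time</pre>
<pre>#include &lt;iostream&gt;   // std::cout</pre>
<pre>size_t const thread_count = 8;</pre>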
<pre>size_t const batch_size = 32;</pre>
<pre>size_t const iter_count = 400000;</pre>
<pre>bool volatile g_start = 0;</pre>
<pre>struct my_node : mpmc_queue::node_t</pre>
<pre>{</pre>
<pre>    int data;</pre>
<pre>    char pad [64];</pre>
<pre>};</pre>
<pre>unsigned __stdcall thread_func(void* ctx)</pre>
<pre>{</pre>
<pre>    mpmc_queue&amp; queue = *(mpmc_queue*)ctx;</pre>
<pre>    srand((unsigned)time(0) + GetCurrentThreadId());</pre>
<pre>    size_t pause = rand() % 1000;</pre>
<pre>    my_node* node_cache [batch_size];</pre>
<pre>    for (size_t i = 0; i != batch_size; i += 1)</pre>
<pre>    {</pre>
<pre>        node_cache[i] = new my_node;</pre>
<pre>        node_cache[i]-&gt;data = i;</pre>
<pre>    }</pre>
<pre>    while (g_start == 0)</pre>
<pre>        SwitchToThread();</pre>
<pre>    for (size_t i = 0; i != pause; i += 1)</pre>
<pre>        _mm_pause();</pre>
<pre>    for (int iter = 0; iter != iter_count; ++iter)</pre>
<pre>    {</pre>
<pre>        for (size_t i = 0; i != batch_size; i += 1)</pre>
<pre>        {</pre>
<pre>            queue.enqueue(node_cache[i]);</pre>
<pre>        }</pre>
<pre>        for (size_t i = 0; i != batch_size; i += 1)</pre>
<pre>        {</pre>
<pre>            for (;;)</pre>
<pre>            {</pre>
<pre>                my_node* node = (my_node*)queue.dequeue();</pre>
<pre>                if (node)</pre>
<pre>                {</pre>
<pre>                    node_cache[i] = node;</pre>
<pre>                    break;</pre>
<pre>                }</pre>
<pre>                SwitchToThread();</pre>
<pre>            }</pre>
<pre>        }</pre>
<pre>    }</pre>
<pre>    return 0;</pre>
<pre>}</pre>
<pre>int main()</pre>
<pre>{</pre>
<pre>    mpmc_queue queue;</pre>
<pre>    HANDLE threads [thread_count];</pre>
<pre>    for (int i = 0; i != thread_count; ++i)</pre>
<pre>    {</pre>
<pre>        threads[i] = (HANDLE)_beginthreadex</pre>
<pre>              (0, 0, thread_func, &amp;queue, 0, 0);</pre>
<pre>    }</pre>
<pre>    Sleep(1);</pre>
<pre>    unsigned __int64 start = __rdtsc();</pre>
<pre>    g_start = 1;</pre>
<pre>    WaitForMultipleObjects(thread_count, threads, 1, INFINITE);</pre>
<pre>    unsigned __int64 end = __rdtsc();</pre>
<pre>    unsigned __int64 time = end - start;</pre>
<pre>    std::cout &lt;&lt; "cycles/op=" &lt;&lt; time /</pre>
<pre>        (batch_size * iter_count * 2 * thread_count)</pre>
<pre>        &lt;&lt; std::endl;</pre>
<pre>}</pre>
<p>Here you may download complete Microsoft Visual Studio solution:<br />
<a href="http://software.intel.com/file/21465">mpmc_seh_queue.zip</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/08/11/multi-producermulti-consumer-seh-based-queue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Another Sorts of Sorts</title>
		<link>http://software.intel.com/en-us/blogs/2009/05/06/another-sorts-of-sorts/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/05/06/another-sorts-of-sorts/#comments</comments>
		<pubDate>Wed, 06 May 2009 15:46:06 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[multi-threading]]></category>
		<category><![CDATA[parallelization]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/05/06/another-sorts-of-sorts/</guid>
		<description><![CDATA[Asaf Shelly posted interesting blog regarding first problem (radix sort) of the Intel Threading Contest 2009: All Sorts of Sorts There is also active discussion going in the comments. Since I had mentioned some aspects of my submission, I decided to post my write-up here (I've checked up with Contest Rules, luckily Intel leave me [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://software.intel.com/en-us/profile/326676/" target="_blank">Asaf Shelly</a> posted an interesting blog entry regarding the first problem (radix sort) of the Intel Threading Contest 2009:<br />
<a href="http://software.intel.com/en-us/blogs/2009/04/27/all-sorts-of-sorts/" target="_blank">All Sorts of Sorts</a><br />
There is also an active discussion going on in the comments. Since I had mentioned some aspects of my submission, I decided to post my write-up here (I've checked the Contest Rules; luckily Intel leaves me enough rights for this :) ). So here it goes:</p>
<p style="0in;" lang="en-US">
<p style="0in;" lang="en-US"><strong>Radix Sort</strong></p>
<p style="0in;"><span lang="en-US">Radix sort is a sorting algorithm that sorts integers by processing individual digits. Because integers can represent strings of characters and specially formatted floating point numbers, radix sort is not limited to integers. </span><span lang="en-US">Most digital computers internally represent all of their data as electronic representations of binary numbers, so processing the digits of integer representations by groups of binary digit representations is most convenient. Two classifications of radix sorts are least significant digit (LSD) radix sorts and most significant digit (MSD) radix sorts. LSD radix sorts process the integer representations starting from the least significant digit and move towards the most significant digit. MSD radix sorts work the other way around. MSD sorting algorithm has particular application to parallel computing, as each of the subdivisions can be sorted independently of the rest.</span></p>
<p style="0in;" lang="en-US">Radix sort is not a comparison-based sort, so the theoretical lower bound of O(N log N) does not apply. The computational complexity of radix sort is O(NK), where N is the number of values and K is the number of subdivisions. This complexity holds for the worst, best and average cases. Space complexity is O(NK).</p>
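<p>To make the LSD variant and the O(NK) cost concrete, here is a minimal sketch for 32-bit unsigned keys processed one byte at a time (K = 4 passes). It is an illustration only and not the contest submission; the function name lsd_radix_sort is hypothetical.</p>

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort: 4 stable counting-sort passes, least significant byte first.
void lsd_radix_sort(std::vector<std::uint32_t>& data)
{
    std::vector<std::uint32_t> buffer(data.size());
    for (int pass = 0; pass != 4; ++pass)
    {
        int const shift = pass * 8;
        // offsets[d + 1] first counts the keys whose current digit is d
        std::array<std::size_t, 257> offsets{};
        for (std::uint32_t v : data)
            ++offsets[((v >> shift) & 0xFF) + 1];
        // prefix sum: offsets[d] becomes the start of bucket d
        for (std::size_t i = 1; i != offsets.size(); ++i)
            offsets[i] += offsets[i - 1];
        // stable scatter into the buffer, bucket by bucket
        for (std::uint32_t v : data)
        {
            std::size_t const digit = (v >> shift) & 0xFF;
            buffer[offsets[digit]] = v;
            ++offsets[digit];
        }
        data.swap(buffer);
    }
}
```

<p>Every pass touches all N keys once, which gives the O(NK) running time; each pass has to be stable, otherwise the order established by the previous passes would be lost.</p>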
<p style="0in;" lang="en-US">
<p style="0in;" lang="en-US"><strong>Single-Threaded Implementation</strong></p>
<p style="0in;" lang="en-US">Naïve single-threaded implementation of MSD radix sort is quite straightforward:</p>
<pre><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef </span></span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>unsigned</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>char</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> byte;</span></span></span>
<span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef </span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>std::vector&lt;std::vector&lt;byte&gt; &gt; data_t;</span></span></span>
<span style="Courier New,monospace;"><span style="x-small;"><span>size_t </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>const</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> byte_values = 256;</span></span></span>
<span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>void</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> radix_sort(data_t&amp; data, size_t position = 0)</span></span></span>
<span style="Courier New,monospace;"><span style="x-small;">{</span></span>
    <span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>// recursion stop conditions</span></span></span></span>
    <span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>if </span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>(data.size() &lt;= 1 || position == data[0].size())</span></span></span>
        <span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>return</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>;</span></span></span>
    std::vector&lt;data_t&gt; <span style="Courier New,monospace;"><span style="x-small;">radix (byte_values);</span></span>
    <span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>// radix split</span></span></span></span>
    <span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>for </span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>(size_t i = 0; i != data.size(); ++i)</span></span></span>
    <span style="Courier New,monospace;"><span style="x-small;">{</span></span>
        <span style="Courier New,monospace;"><span style="x-small;">size_t idx = data[i][position];</span></span>
        <span style="Courier New,monospace;"><span style="x-small;">radix[idx].push_back(data[i]);</span></span>
    <span style="Courier New,monospace;"><span style="x-small;">}</span></span>
    <span style="Courier New,monospace;"><span style="x-small;">size_t out_pos = 0;</span></span>
    <span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>for</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (size_t i = 0; i != byte_values; ++i)</span></span></span>
    <span style="Courier New,monospace;"><span style="x-small;">{</span></span>
        <span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>// recursive sort of lesser significant digits</span></span></span></span>
        <span style="Courier New,monospace;"><span style="x-small;">radix_sort(radix[i], position + 1);</span></span>
        <span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>// copyback</span></span></span></span>
        <span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>for</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (size_t j = 0; j != radix[i].size(); ++j, ++out_pos)</span></span></span>
        <span style="Courier New,monospace;"><span style="x-small;">{</span></span>
            <span style="Courier New,monospace;"><span style="x-small;">data[out_pos] = radix[i][j];</span></span>
        <span style="Courier New,monospace;"><span style="x-small;">}</span></span>
    <span style="Courier New,monospace;"><span style="x-small;">}</span></span>
<span style="Courier New,monospace;"><span style="x-small;">}</span></span></pre>
<p style="0in;" lang="en-US"><strong>Parallelization</strong></p>
<p style="0in;" lang="en-US">I use two types of parallelization. The first is parallelization of the radix split itself (intra-radix parallelization), which is especially useful for the initial radix split on the most significant digit. The input data is divided into several parts (fork), and each processor picks up a part and performs a radix split on it (parallel processing). When all parts have been split, the partial radix arrays are aggregated (join) and passed to the next level of recursion. This parallelization may also help with sorting data that is not randomly distributed.</p>
<p style="0in;" lang="en-US">The second type is parallelization at the inter-radix level: a single processor completely sorts a whole array at the lower levels of recursion. This parallelization helps mitigate thread synchronization overheads.</p>
<p style="0in;" lang="en-US">Parallelization is guided at run time: threads prefer inter-radix parallelization, but if some threads run out of work they help other threads at the intra-radix level.</p>
<p style="0in;" lang="en-US">When the size of the input array falls below some threshold, the thread switches to single-threaded mode, i.e. it spawns no further sub-tasks (this also helps mitigate synchronization overheads).</p>
<p style="0in;" lang="en-US">Here is the pseudo-code of the parallel algorithm:</p>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> radix_desc</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">{</span></span></pre>
<pre style="0in;"><span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>  // partial results</span></span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  std::vector&lt;data_t&gt;     radix [thread_count];</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  size_t                  radix_pending_count;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  size_t                  position;</span></span></pre>
<pre style="0in;"><span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>  //...</span></span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">};</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> radix_task</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">{</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  data_t                  input;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  radix_desc&amp;             desc;</span></span></pre>
<pre style="0in;"><span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>  //...</span></span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>  void</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> execute()</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  {</span></span></pre>
<pre style="0in;"><span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>    // partial radix split</span></span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>    for</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (size_t i = 0; i != input.size(); ++i)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    {</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      size_t idx = input[i][desc.position];</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      desc.radix[thread_id][idx].push_back(input[i]);</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    }</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>    if</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (0 == atomic_decrement(desc.radix_pending_count))</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    {</span></span></pre>
<pre style="0in;"><span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>      // spawn sub-tasks</span></span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>      for</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (size_t i = 0; i != byte_values; ++i)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      {</span></span></pre>
<pre style="0in;"><span style="#008000;"><span style="Courier New,monospace;"><span style="x-small;"><span>        // aggregate partial results</span></span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">        data_t result;</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>        for</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (size_t j = 0; j != thread_count; ++j)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">        {</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">          result.insert(result.end(), desc.radix[j][i].begin(), desc.radix[j][i].end());</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">        }</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;"><span>        radix_desc* sub_desc = </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>new</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> radix_desc (...);</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">        spawn_some_subtasks(*sub_desc, result);</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      }</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    }</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  }</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">};</span></span></pre>
<p style="0in;" lang="en-US"><strong>Scheduling</strong></p>
<p style="0in;" lang="en-US">I've implemented a custom task-based scheduler on top of the Win32 threading API. In the main it is similar to a classical Cilk-style work-stealing scheduler, though I've made some improvements to it. In particular, I've added system-topology awareness, hyper-threading (HT) awareness, affinity awareness, a batch-spawn capability and manual task-depth control. All worker threads are strictly bound to EUs (execution units), and stealing is conducted based on the “distance” between EUs: a worker thread tries to steal from neighboring threads first, and only then from threads running on a different NUMA node (system-topology awareness). This allows data in the processors' shared L3 cache to be reused efficiently.</p>
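<p style="0in;" lang="en-US">The distance-based choice of a steal victim can be illustrated with a minimal sketch. It is not the scheduler's actual code: the function name steal_order, the node_of table (worker index to NUMA node) and the two-level notion of distance (same node first, other nodes afterwards) are assumptions made for the example.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the indices of all workers other than
// `self`, ordered so that workers on the same NUMA node come first.
// node_of[i] is the NUMA node that worker i is bound to.
std::vector<size_t> steal_order(std::vector<size_t> const& node_of, size_t self)
{
    std::vector<size_t> victims;
    for (size_t i = 0; i != node_of.size(); ++i)
    {
        if (i != self)
            victims.push_back(i);
    }
    // Near victims (same node) are moved to the front; the relative
    // order inside each group is preserved.
    std::stable_partition(victims.begin(), victims.end(),
        [&node_of, self](size_t v) { return node_of[v] == node_of[self]; });
    return victims;
}
```

<p style="0in;" lang="en-US">A worker would compute this order once at start-up and walk it whenever its own deque is empty, so remote NUMA nodes are only touched when all nearby deques are empty as well.</p>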
<p style="0in;" lang="en-US">Sibling HT threads share a single work-stealing deque (HT awareness), which keeps them as close to each other as possible in terms of working sets. The resources of a single core (L1D cache, L1 DTLB, etc.) cannot accommodate two distinct radix sorts, so HT awareness lets sibling HT threads work on a single radix sort, so to speak. Assume the first HT thread completes a radix split and spawns a bunch of sub-tasks. It then picks up one sub-task to process while the second HT thread picks up another, and the data for that other sub-task is already in the core's L1D cache (as well as in its L1 DTLB).</p>
<p style="0in;" lang="en-US">The scheduler is able to support task affinity, though I didn't have enough time to exploit the feature.</p>
<p style="0in;" lang="en-US">When a thread completes a radix split it submits up to 96 sub-tasks (the number of US-ASCII codes from 32 through 127, i.e. the range that contains the printable characters). The scheduler allows all of these tasks to be submitted in a single enqueue operation, which reduces synchronization overheads to some degree.</p>
<p style="0in;" lang="en-US">When a thread submits new tasks to the scheduler, it explicitly passes the so-called task depth as a parameter. Task depth relates to the task's level in the work DAG. When a thread pops a task from its own work-stealing deque, it picks the task with the highest available level (the smallest piece of work); when a thread steals a task from a remote work-stealing deque, it picks the task with the lowest available level (the biggest piece of work). This reduces the number of steal operations.</p>
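<p style="0in;" lang="en-US">The depth rule can be modelled with a small sketch. This is a single-threaded model only, not the real work-stealing deque: the names task, depth_pool, pop_local and steal are made up for the example, and a real scheduler needs synchronization between the owner and the thieves.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

// Hypothetical task record: `depth` is the level in the work DAG,
// `id` only identifies the task in this example.
struct task
{
    size_t depth;
    int id;
};

// Single-threaded model of one worker's task pool, grouped by depth.
class depth_pool
{
public:
    void push(task const& t) { levels_[t.depth].push_back(t); }

    bool empty() const { return levels_.empty(); }

    // Owner side: take a task with the highest depth (smallest piece of work).
    bool pop_local(task& out)
    {
        if (levels_.empty())
            return false;
        take(std::prev(levels_.end()), out);
        return true;
    }

    // Thief side: take a task with the lowest depth (biggest piece of work).
    bool steal(task& out)
    {
        if (levels_.empty())
            return false;
        take(levels_.begin(), out);
        return true;
    }

private:
    typedef std::map<size_t, std::vector<task> > levels_t;

    // Removes the most recently pushed task of the given level and
    // drops the level once it becomes empty.
    void take(levels_t::iterator level, task& out)
    {
        out = level->second.back();
        level->second.pop_back();
        if (level->second.empty())
            levels_.erase(level);
    }

    levels_t levels_;
};
```

<p style="0in;" lang="en-US">Because a thief takes the shallowest task, one successful steal hands over a large subtree of work, which is why fewer steal operations are needed overall.</p>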
<p style="0in;" lang="en-US">Regarding Threading Building Blocks: another possibility would be to use TBB's task scheduler. Using TBB would not affect the main logic of the program in any way, because it supports exactly the same task concept. On the one hand, TBB would reduce the amount of code to write (no need to implement a scheduler manually). On the other hand, TBB's scheduler is not system-topology aware, is not HT aware, does not provide a batch-spawn capability, and does not provide manual control over task depths (not relevant without HT awareness). TBB's scheduler is affinity aware to some degree: it supports task affinities but does not support thread affinities. TBB's scheduler also has somewhat higher task spawn/consume overheads: some 600 cycles versus some 200 cycles for my scheduler (on my hardware). Since the contest is about raw performance, I decided to implement my own scheduler.</p>
<p style="0in;" lang="en-US"><strong>Single-threaded Optimizations</strong></p>
<p style="0in;" lang="en-US">Avoiding copyback. A naïve radix sort implementation makes K (the number of digits) copies of the whole data set in the copyback phase. To eliminate those copies I use the following optimization. At startup I allocate an array for the sorted data:</p>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> output_cell</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">{</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>  int</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> count_;</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  uint32_t*   data_;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">};</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;"><span>size_t </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>const</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> output_size = 96*128*128; </span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;"><span>output_cell* g_output = </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>new</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> output_cell [output_size];</span></span></span></pre>
<p style="0in;">The 3 most significant digits of the value determine the index in that array:</p>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">size_t output_index(uint64_t val)</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">{</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  byte* v = (byte*)&amp;val;</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>  return</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> ((size_t)v[3]) | ((size_t)v[2] &lt;&lt; 7) | (((size_t)v[1] - 32) &lt;&lt; 14);</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">}</span></span></pre>
<p style="0in;">The 4 least significant digits of the value are stored in the inner array:</p>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>void</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> store_result(uint64_t val, size_t position)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">{</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  size_t idx = output_index(val);</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  uint32_t v = (uint32_t)(val &gt;&gt; 32);</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  g_output[idx].data_[position] = v;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">}</span></span></pre>
<p style="0in;" lang="en-US">This way all copies of the data in the copyback phase are eliminated: sorted data is placed directly in its final destination.</p>
<p style="0in;" lang="en-US">Counting sort. When values have been reduced to 2 bytes (by the 5 previous radix splits), I use counting sort (which is a special case of radix sort with a special intermediate representation of the values). Counting sort has the same computational complexity as radix sort, but it has lower space complexity and can be implemented more efficiently. Since I expect very few values to be sorted with counting sort at a time (i.e. the counter array will be very sparse), I add a bitmask to speed up the search over the counter array.</p>
<p style="0in;" lang="en-US">Pseudo-code of the counting sort:</p>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>void</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> counting_sort(uint16_t* begin, uint16_t* end, uint32_t* output, uint32_t prefix)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">{</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  uint32_t counter [256*256] = {};</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  bitmask_t bitmask;</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>  for</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (uint16_t* pos = begin; pos != end; pos += 1)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  {</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    uint16_t v = pos[0];</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    counter[v] += 1;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    bitmask.set_bit(v);</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  }</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>  for</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (uint16_t v; bitmask.get_and_reset_bit(v);)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  {</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>    do</span></span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    {</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      uint32_t val = prefix;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      val |= v;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      output[0] = val;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">      output += 1;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">    }</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>    while</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> (--counter[v]);</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  }</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">}</span></span></pre>
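<p style="0in;"><span style="Courier New,monospace;"><span style="x-small;"><span lang="en-US">bitmask_t::get_and_reset_bit() </span></span></span><span lang="en-US">is implemented with the BSF instruction (the _BitScanForward64() intrinsic). Without the bitmask, the output pass has to scan all 65536 counters no matter how few values are present, so the sort costs on the order of N + 65536 operations; with the bitmask, the output pass visits only the values that actually occur, and the whole sort costs on the order of 2*N operations.</span></p>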
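<p style="0in;" lang="en-US">One possible shape of such a bitmask is sketched below. It is an illustration rather than the original class: the word layout, the forward-only cursor_ member, the helper lowest_set_bit and the __builtin_ctzll fallback for GCC/Clang are assumptions. The sketch also assumes that all set_bit() calls happen before the first get_and_reset_bit() call, as they do in counting_sort above.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the lowest set bit of a non-zero 64-bit word.
inline unsigned lowest_set_bit(uint64_t word)
{
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward64(&idx, word);   // compiles to BSF on x64
    return (unsigned)idx;
#else
    return (unsigned)__builtin_ctzll(word);   // GCC/Clang equivalent
#endif
}

// One bit per possible 16-bit value: 65536 bits = 1024 words.
class bitmask_t
{
public:
    bitmask_t() : words_(), cursor_(0) {}

    void set_bit(uint16_t v)
    {
        words_[v >> 6] |= uint64_t(1) << (v & 63);
    }

    // Stores the smallest remaining value in v and clears its bit;
    // returns false once no bits are left.
    bool get_and_reset_bit(uint16_t& v)
    {
        for (; cursor_ != word_count; ++cursor_)
        {
            uint64_t& w = words_[cursor_];
            if (w != 0)
            {
                unsigned const bit = lowest_set_bit(w);
                w &= w - 1;   // clear the lowest set bit
                v = (uint16_t)(cursor_ * 64 + bit);
                return true;
            }
        }
        return false;
    }

private:
    static size_t const word_count = 65536 / 64;
    uint64_t words_[word_count];
    size_t cursor_;   // words before the cursor are known to be empty
};
```

<p style="0in;" lang="en-US">Values come out in ascending order, so the output loop of the counting sort produces sorted data directly. Note that this simple version still walks over empty 64-bit words; a second-level summary mask would be one way to skip them.</p>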
<p style="0in;" lang="en-US">Counting sort is not parallelized in my implementation. Since the input data is uniformly distributed, I expect this not to affect performance. It is, though, a possible further optimization that would allow better handling of data that is not randomly distributed.</p>
<p style="0in;" lang="en-US">Template code generation. I make heavy use of C++ template programming to allow efficient code generation. A value is represented by the following class:</p>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;size_t digits_t&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout;</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout&lt;7&gt; {</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> uint64_t value_t;};</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout&lt;6&gt; {</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> uint64_t value_t;};</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout&lt;5&gt; {</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> uint64_t value_t;};</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout&lt;4&gt; {</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> uint32_t value_t;};</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout&lt;3&gt; {</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> uint32_t value_t;};</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout&lt;2&gt; {</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typedef</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> uint16_t value_t;};</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>template</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&lt;size_t digits_t&gt;</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>struct</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> value</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">{</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>  typedef</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>typename</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> data_layout&lt;digits_t&gt;::value_t value_t;</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  value_t val;</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;"><span>  value&amp; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>operator</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> = (value&lt;digits_t + 1&gt; </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>const</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>&amp; r)</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  {</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;"><span>    val = (value_t)(r.val &gt;&gt; (8 * (</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>sizeof</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>(r) - </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>sizeof</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>(*</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>this</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>))));</span></span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>    return</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> *</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>this</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>;</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  }</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>  char</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> prefix() </span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>const</span></span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  {</span></span></pre>
<pre style="0in;"><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>    return</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span> ((</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>char</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>*)&amp;val)[</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>sizeof</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>(*</span></span></span><span style="#0000ff;"><span style="Courier New,monospace;"><span style="x-small;"><span>this</span></span></span></span><span style="Courier New,monospace;"><span style="x-small;"><span>) - digits_t];</span></span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">  }</span></span></pre>
<pre style="0in;"><span style="Courier New,monospace;"><span style="x-small;">};</span></span></pre>
<p style="0in;" lang="en-US">All functions and classes related to radix sorting are likewise templates parameterized by the number of digits, and they adapt to the particular value layout, the location of the radix prefix within the value, etc.</p>
<p style="0in;" lang="en-US">The radix task is also a template parameterized by is_single_threaded and is_parent_single_threaded. When is_single_threaded==true, the task allocates its subtasks on the stack and executes them directly. When is_parent_single_threaded==true, the task skips atomic counting of pending siblings: because the parent allocates the subtasks on the stack and runs them itself, they have all completed by the time the parent completes.</p>
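<p style="0in;" lang="en-US">A minimal sketch of how such a compile-time flag could remove the atomic counting. The name sibling_counter and its members are hypothetical and only illustrate the idea; this is not the code from the actual implementation:</p>

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: tracks how many sibling subtasks are still pending.
// Primary template: the parent runs multi-threaded, so siblings may finish
// on different threads and the count has to be atomic.
template <bool is_parent_single_threaded>
struct sibling_counter {
    explicit sibling_counter(int n) : pending(n) { assert(n > 0); }

    // Returns true for the sibling that finishes last.
    bool complete_one() {
        return pending.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    std::atomic<int> pending;
};

// Specialization: the parent executes its stack-allocated subtasks directly,
// so all of them are done when the parent returns and nothing is counted.
template <>
struct sibling_counter<true> {
    explicit sibling_counter(int) {}

    // Never reports "last": the parent itself knows when its children are done.
    bool complete_one() { return false; }
};
```

<p style="0in;" lang="en-US">Because the flag is a template parameter, the single-threaded variant contains no atomic operation at all; the choice costs nothing at run time.</p>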
<p style="0in;" lang="en-US">
<p style="0in;"><span lang="en-US">Memory allocation. Efficient memory allocation is crucial for both single-threaded and multi-threaded performance of the implementation (the standard Windows allocator uses a single mutex, which significantly reduces scalability). I implemented a distributed region memory allocator: there is a pool of 2 MB pages per NUMA node; a thread privatizes a page from that pool and then uses region allocation within the page. When the page is exhausted, the thread privatizes another page, and so on. No memory is returned to the OS during the radix sort, though some memory is reused internally. I also implemented a simple caching memory allocator for objects of a particular size, based on a per-thread LIFO freelist. When an object is freed, it is pushed onto the freelist; when an object must be allocated, it is popped from the freelist.</span></p>
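<p style="0in;" lang="en-US">The caching allocator idea can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the class name freelist_cache is hypothetical, and a cache miss falls back to malloc here, whereas the implementation described above would take memory from the region allocator:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Caching allocator for objects of one fixed size (hypothetical sketch).
// Each thread is expected to own its own instance, so the freelist itself
// needs no synchronization.
template <std::size_t object_size>
class freelist_cache {
    // A freed block is reused as a list node, so it must be able to hold a pointer.
    struct node { node* next; };
    static_assert(object_size >= sizeof(node), "object too small to link");

public:
    freelist_cache() : head_(nullptr) {}
    freelist_cache(const freelist_cache&) = delete;
    freelist_cache& operator=(const freelist_cache&) = delete;

    ~freelist_cache() {
        // Hand the cached blocks back to the underlying allocator.
        while (head_) {
            node* n = head_;
            head_ = n->next;
            std::free(n);
        }
    }

    void* allocate() {
        if (head_) {                      // pop the most recently freed block
            node* n = head_;
            head_ = n->next;
            return n;
        }
        return std::malloc(object_size);  // cache miss (assumed fallback)
    }

    void deallocate(void* p) {
        assert(p != nullptr);
        head_ = ::new (p) node{head_};    // push the block onto the freelist
    }

private:
    node* head_;
};
```

<p style="0in;" lang="en-US">The LIFO order means the block handed out next is the one freed most recently, which is the one most likely to still be in the cache.</p>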
<p style="0in;" lang="en-US">
<p style="0in;" lang="en-US"><strong>Tools</strong></p>
<p style="0in;" lang="en-US">I considered the Microsoft Visual C++ (MSVC) and Intel C++ (ICC) compilers. In 32-bit mode ICC showed an impressive 30% speedup over MSVC (even more with profile-guided optimization). In 64-bit mode, however, ICC showed a surprising 20% slowdown (with the maximum possible optimizations turned on, including /QxHost, /Qunroll, etc.); profile-guided optimization improved the situation somewhat, but ICC was still behind MSVC. I didn't have time to investigate the problem, so I decided to use MSVC for the final submission.</p>
<p style="0in;" lang="en-US">As a profiler I used AMD CodeAnalyst, a simple profiler that makes it easy to capture and analyze a profile of the program. Profiling was crucial for the single-threaded optimizations. It also allowed me to verify that the profile of the multi-threaded version is essentially identical to that of the single-threaded version, and that synchronization and scheduling overheads amount to no more than a few percent; both are good signs of successful parallelization. Another option would have been Intel PTU. It is somewhat more complicated, but it can capture processor performance events, which is crucial for single-threaded optimization (for example, it would show what causes excessive pipeline stalls: L1D cache misses or L1 DTLB misses).</p>
<p style="0in;" lang="en-US">Another great tool I used is Windows Task Manager. It allowed me to track virtual memory consumption, CPU utilization, working set size and the number of page faults. The goals were to keep virtual memory consumption within the expected bounds (~1.5 * input data size in my case), to reach 100% CPU utilization in the parallel phase, and to have 0 page faults (i.e. working set == virtual memory).</p>
<p style="0in;" lang="en-US">
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/05/06/another-sorts-of-sorts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Don&#039;t rely on memory barriers for synchronization... Only if you don&#039;t aware of Relacy Race Detector!</title>
		<link>http://software.intel.com/en-us/blogs/2009/03/03/dont-rely-on-memory-barriers-for-synchronization-only-if-you-dont-aware-of-relacy-race-detector/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/03/03/dont-rely-on-memory-barriers-for-synchronization-only-if-you-dont-aware-of-relacy-race-detector/#comments</comments>
		<pubDate>Tue, 03 Mar 2009 18:23:26 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[data races]]></category>
		<category><![CDATA[Deadlock]]></category>
		<category><![CDATA[livelock]]></category>
		<category><![CDATA[lock-free]]></category>
		<category><![CDATA[memory model]]></category>
		<category><![CDATA[multi-threading]]></category>
		<category><![CDATA[synchronization]]></category>
		<category><![CDATA[verification]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/03/03/dont-rely-on-memory-barriers-for-synchronization-only-if-you-dont-aware-of-relacy-race-detector/</guid>
		<description><![CDATA[Multithreading is hard. Implementation of synchronization primitives is even harder. And most advanced synchronization primitives which exploit relaxed memory models are brain damaging.  So don't rely on memory barriers for synchronization. Now you will think "Ah, you are one of those folks, who constantly saying to us - don't do this, don't do that, it's [...]]]></description>
			<content:encoded><![CDATA[<p style="justify;">Multithreading is hard. Implementing synchronization primitives is even harder. And the most advanced synchronization primitives, which exploit relaxed memory models, are brain damaging.  So don't rely on memory barriers for synchronization. Now you will think "Ah, you are one of those folks who are constantly telling us: don't do this, don't do that, it's too difficult, it's too dangerous, you will fail anyway". Nope, I am not. I am saying exactly the opposite: do rely on memory barriers for synchronization... sometimes, because smart fine-grained synchronization can mean an orders-of-magnitude performance difference and can decide whether scaling is possible at all (again, sometimes, of course). But it's true that low-level synchronization primitives and memory ordering issues are hard. So some time ago I developed a tool called <a href="http://groups.google.com/group/relacy">Relacy Race Detector</a> to help developers (actually to help me, but it doesn't matter) with these issues.</p>
<p style="justify;">Relacy Race Detector (RRD) is a synchronization algorithm verifier for relaxed memory models (formally, a stateless dynamic verifier, if you are interested). Physically it's a header-only C++ library, but it can test not only C++ algorithms but also Java, .NET/CLI, x86, PowerPC, SPARC, etc. (more on this later). Now let's look at what working with RRD looks like for the end user.</p>
<p style="justify;">First of all, you implement the synchronization algorithm (mutex, concurrent data structure, etc.) that is the subject of verification. Then you write one or several unit tests for the algorithm. Then you write several lines of code to start execution of the unit test. Now you can compile and start the program, and RRD will take care of efficient execution of the unit test, i.e. it will execute zillions of different thread interleavings. During the execution of each interleaving it continuously performs a number of built-in checks: data races, deadlocks, livelocks, accesses to freed memory, double frees, memory leaks, accesses to uninitialized variables and incorrect API usage (e.g. a recursive lock on a non-recursive mutex), as well as verification of user-specified asserts and invariants. When (if) RRD detects an error, it outputs a detailed execution history that leads to the error (the history includes such things as instances of the ABA problem, reorderings of memory accesses, and thread blocking/unblocking). With such a history in hand, locating the error is rarely a problem.</p>
<p style="justify;">The large number of built-in checks means that in many cases you can get exhaustive verification without specifying any user asserts/invariants; for example, most mutual exclusion algorithms can be verified without user asserts. However, if you want to verify the FIFO order of messages provided by a producer-consumer queue, you will have to code this check manually.</p>
<p style="justify;">Regarding verification of Java/.NET programs: initially I targeted only the C++0x memory model and atomics API. But it turned out that the C++0x memory model is so relaxed and the atomics API is so general that I was easily able to map other memory models onto it. So RRD now includes a set of thin wrappers for Java atomics and volatiles, .NET interlocked operations and volatiles, as well as POSIX and Win32 synchronization primitives. Thus, yes, you still have to code in C++, but you can verify your algorithm implementation as if it were a Java or .NET implementation.</p>
<p style="justify;">RRD contains 3 different schedulers: a random scheduler, a full search scheduler and a context bound scheduler. Each represents a different compromise between verification speed and completeness of the verification. All schedulers are fair, i.e. they support verification of formally non-terminating programs. I don't want to go into more detail about the schedulers here, but you can find more information on the Relacy Race Detector <a href="http://groups.google.com/group/relacy">web site</a>.</p>
<p style="justify;">Now let's look at a simple example which mimics a basic spin-mutex:</p>
<pre>#include &lt;relacy/relacy.hpp&gt;
// main RRD namespace is 'rl', also note required instrumentation in the form of '($)'

struct mutex {
  rl::atomic&lt;int&gt; lock_;
  mutex() {
    lock_($) = 0;
  }
  void lock() {
    while (lock_($).exchange(1, rl::memory_order_acquire) == 1)
      rl::yield($, 1);
  }
  void unlock() {
    lock_($).store(0, rl::memory_order_release);
  }
};

// unit-test ('2' means number of threads)
struct mutex_test : rl::test_suite&lt;mutex_test, 2&gt; {
  mutex mtx;
  rl::var&lt;int&gt; data;
  // executed in single thread before main thread function
  void before() {
    data($) = 0;
  }
  // main thread function
  void thread(unsigned /*thread_index*/) {
    mtx.lock();
    data($) += 1;
    mtx.unlock();
  }
  // executed in single thread after main thread function
  void after() {
    RL_ASSERT(data($) == 2);
  }
};

int main() {
  rl::simulate&lt;mutex_test&gt;();
}</pre>
<p>If we run it, the test completes successfully. Now let's introduce some simple bugs into the code and see how RRD reacts to them. I replaced rl::memory_order_release with rl::memory_order_relaxed in mutex::unlock() and ran the program. The output is (I've omitted the per-thread execution history and some irrelevant details for brevity; the [first number] in the history is the global operation index, the second number is the thread index):</p>
<pre>struct mutex_test
 <strong>DATA RACE</strong> (data race detected)
 iteration: 1</pre>
<pre>execution history:
 [0] 1: [CTOR BEGIN]
 [1] 1: &lt;0034BEE0&gt; atomic store, value=0, (prev value=0), order=seq_cst, in mutex::mutex, mutex.cpp(7)
 [2] 1: &lt;0034BF04&gt; store, value=0, in mutex_test::mutex_test, mutex.cpp(22)
 [3] 1: [CTOR END]
 [4] 1: &lt;0034BEE0&gt; exchange , prev=0, op=1, new=1, order=acquire, in mutex::lock, mutex.cpp(10)
 [5] 1: &lt;0034BF04&gt; load, value=0, in mutex_test::thread, mutex.cpp(26)
 [6] 1: <strong>&lt;0034BF04&gt; store, value=1, in mutex_test::thread, mutex.cpp(26)
</strong> [7] 1: &lt;0034BEE0&gt; atomic store, value=0, (prev value=1), order=relaxed, in mutex::unlock, mutex.cpp(14)
 [8] 0: &lt;0034BEE0&gt; exchange , prev=0, op=1, new=1, order=acquire, in mutex::lock, mutex.cpp(10)
 [9] 0: <strong>&lt;0034BF04&gt; load, value=0, in mutex_test::thread, mutex.cpp(26)</strong>
 [10] 0: <strong>DATA RACE</strong> (data race detected), in mutex_test::thread, mutex.cpp(26)</pre>
<p>We see that RRD easily detected a data race on the variable mutex_test::data (more precisely, this is a type 2 data race, which means that the conflicting data accesses are not adjacent in the execution history, although they still conflict).<br />
Now let's make the following change in the test:</p>
<pre>struct mutex_test : rl::test_suite&lt;mutex_test, 2&gt; {
  mutex mtx;
  <strong>rl::atomic&lt;int&gt;</strong> data;
  void before() {
    data($) = 0;
  }
  void thread(unsigned /*thread_index*/) {
    mtx.lock();
<strong>    int tmp = data($).load(rl::memory_order_relaxed);
    data($).store(tmp + 1, rl::memory_order_relaxed);
</strong>    mtx.unlock();
  }
  void after() {
    RL_ASSERT(data($) == 2);
  }
};</pre>
<p>Now mutex_test::data is declared as 'rl::atomic&lt;int&gt;' (so no more data races are possible on it), and I've slightly changed the way I increment the variable. Here is the output (remember, unlock operation is still done with memory_order_relaxed):</p>
<pre>struct mutex_test
<strong>USER ASSERT FAILED (assertion: data($) == 2)
</strong>iteration: 5</pre>
<pre>execution history:
[0] 1: [CTOR BEGIN]
[1] 1: &lt;0034AD10&gt; atomic store, value=0, (prev value=0), order=seq_cst, in mutex::mutex, mutex.cpp(7)
[2] 1: &lt;0034AD34&gt; atomic store, value=0, (prev value=0), order=seq_cst, in mutex_test::mutex_test, mutex.cpp(22)
[3] 1: [CTOR END]
[4] 1: &lt;0034AD10&gt; exchange , prev=0, op=1, new=1, order=acquire, in mutex::lock, mutex.cpp(10)
[5] 0: &lt;0034AD10&gt; exchange , prev=1, op=1, new=1, order=acquire, in mutex::lock, mutex.cpp(10)
[6] 0: yield(1), in mutex::lock, mutex.cpp(11)
[7] 1: &lt;0034AD34&gt; atomic load, value=0, order=relaxed, in mutex_test::thread, mutex.cpp(26)
[8] 1: &lt;0034AD34&gt; atomic store, value=1, (prev value=0), order=relaxed, in mutex_test::thread, mutex.cpp(27)
[9] 0: &lt;0034AD10&gt; exchange , prev=1, op=1, new=1, order=acquire, in mutex::lock, mutex.cpp(10)
[10] 0: yield(1), in mutex::lock, mutex.cpp(11)
[11] 1: &lt;0034AD10&gt; atomic store, value=0, (prev value=1), order=relaxed, in mutex::unlock, mutex.cpp(14)
[12] 0: &lt;0034AD10&gt; exchange , prev=0, op=1, new=1, order=acquire, in mutex::lock, mutex.cpp(10)
[13] 0: <strong>&lt;0034AD34&gt; atomic load, value=0 [NOT CURRENT], order=relaxed, in mutex_test::thread, mutex.cpp(26)</strong>
[14] 0: &lt;0034AD34&gt; atomic store, value=1, (prev value=1), order=relaxed, in mutex_test::thread, mutex.cpp(27)
[15] 0: &lt;0034AD10&gt; atomic store, value=0, (prev value=1), order=relaxed, in mutex::unlock, mutex.cpp(14)
[16] 0: [AFTER BEGIN]
[17] 0: <strong>&lt;0034AD34&gt; atomic load, value=1, order=seq_cst, in mutex_test::after, mutex.cpp(31)</strong>
[18] 0: <strong>USER ASSERT FAILED (assertion: data($) == 2)</strong>, in mutex_test::after, mutex.cpp(31)</pre>
<p>We see that on the fifth iteration RRD detected that thread #0 loaded a stale (not current) value from the 'data' variable, so the final value of the variable is 1 instead of 2.<br />
Ok, I hope you now have a rough idea of what Relacy Race Detector does. In conclusion, I want to describe some of the features that I am thinking of incorporating into future releases of RRD: thread local storage (with POSIX and Win32 wrappers), UNIX signals and hardware interrupts, partial-order reductions and parallelization of the run-time, persistent checkpointing of the simulation, performance simulation, and detection of dead code. That said, current development of RRD is largely demand-driven, so I will be happy to hear your comments, suggestions and feedback.</p>
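<p>For reference, here is a sketch of the same spin-mutex in plain C++11 atomics, with the release ordering in unlock() that the experiments above show to be necessary. It carries no RRD instrumentation; std::this_thread::yield stands in for rl::yield, and try_lock and locked_increment are additions made only for this illustration:</p>

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Plain C++11 counterpart of the example mutex, without RRD instrumentation.
class spin_mutex {
public:
    void lock() {
        // acquire: accesses in the critical section cannot move above the lock
        while (locked_.exchange(1, std::memory_order_acquire) == 1)
            std::this_thread::yield();
    }

    // Added for illustration: a single attempt, true if the lock was taken.
    bool try_lock() {
        return locked_.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() {
        // release: publishes the critical section's writes to the next owner;
        // weakening this to relaxed is the bug demonstrated above
        locked_.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> locked_{0};
};

// lock()/unlock() are enough for std::lock_guard to work with the class.
inline void locked_increment(spin_mutex& mtx, int& data) {
    std::lock_guard<spin_mutex> guard(mtx);
    data += 1;
}
```

<p>With acquire on the exchange and release on the store, each unlock() synchronizes with the next successful lock(), so a plain int protected by the mutex is accessed without a data race.</p>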
<p>The main Relacy Race Detector web site is:<br />
<a href="http://groups.google.com/group/relacy">http://groups.google.com/group/relacy</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/03/03/dont-rely-on-memory-barriers-for-synchronization-only-if-you-dont-aware-of-relacy-race-detector/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New interesting application of Transactional Memory for single-threading</title>
		<link>http://software.intel.com/en-us/blogs/2008/10/28/new-interesting-application-of-transactional-memory-for-single-threading/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/10/28/new-interesting-application-of-transactional-memory-for-single-threading/#comments</comments>
		<pubDate>Tue, 28 Oct 2008 17:18:41 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[atomicity]]></category>
		<category><![CDATA[transactional memory]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/10/28/new-interesting-application-of-transactional-memory-for-single-threading/</guid>
		<description><![CDATA[While dealing with Transactional Memory I realize new interesting application of Transactional Memory for SINGLE threaded applications. Atomicity guarantees provided by TM can be useful not only for multi-threaded environment, but also for single-threaded environment. Well, it's actually not astonishing, nevertheless I didn't hear anything similar in all that hype around TM. Assume we have [...]]]></description>
			<content:encoded><![CDATA[<p>While working with Transactional Memory I realized there is an interesting new application of Transactional Memory for SINGLE-threaded applications. The atomicity guarantees provided by TM can be useful not only in a multi-threaded environment but also in a single-threaded one. Well, it's not actually astonishing; nevertheless I haven't heard anything similar in all the hype around TM.</p>
<p>Assume we have a complicated operation which involves non-trivial modifications of several objects/containers. Assume that an exception can be thrown by the memory allocator, by the copy constructor of some object, or just by application logic. To provide strong exception safety in such a situation we have to manually write code that undoes all those modifications. This can be a non-trivial, error-prone task.</p>
<p>TM already has all the machinery necessary for cancelling arbitrary operations. TM doesn't care about the complexity of the operations or the number of objects/containers involved; it can just instantly cancel anything that happens inside an atomic block. Why not use it?</p>
<p>So the general recipe for strong exception safety with the help of TM is: wrap an arbitrary operation in an atomic block, and in case of an error just abort the transaction. Hocus-pocus! Your operation obtains strong exception safety at once, without a single line of rollback code!</p>
<p>This applies equally to exceptions and to error codes/return values; it makes no difference.</p>
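<p>For comparison, this is roughly what the manual route looks like for a single container in standard C++: do the work on a private copy and commit with a swap that cannot throw. It is only a sketch; sort_strong and flaky_int are hypothetical names introduced for the illustration, and the approach has to be redesigned for every operation that touches more than one object:</p>

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <vector>

// Element whose copy assignment throws once a countdown reaches zero
// (illustrative stand-in for an object that "sometimes" fails).
struct flaky_int {
    static int countdown;  // <= 0 means "never throw"
    int v;

    flaky_int(int v = 0) : v(v) {}
    flaky_int(const flaky_int&) = default;
    flaky_int& operator=(const flaky_int& r) {
        if (countdown > 0 && --countdown == 0)
            throw std::runtime_error("assignment failed");
        v = r.v;
        return *this;
    }
    bool operator<(const flaky_int& r) const { return v < r.v; }
};
int flaky_int::countdown = 0;

// Strong guarantee by hand: sort a private copy, then commit with swap,
// which does not throw for std::vector with the default allocator.
template <typename T>
void sort_strong(std::vector<T>& v) {
    std::vector<T> tmp(v);               // may throw; v is untouched
    std::sort(tmp.begin(), tmp.end());   // may throw; only tmp is damaged
    assert(std::is_sorted(tmp.begin(), tmp.end()));
    v.swap(tmp);                         // commit
}
```

<p>If an assignment throws in the middle of the sort, only the temporary copy is left in an intermediate state and the caller's vector is unchanged.</p>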
<p>Here is something which I was actually able to model with Intel C++ STM Compiler:</p>
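<pre>#include &lt;algorithm&gt;</pre>
<pre>#include &lt;cstdlib&gt;</pre>
<pre>#include &lt;iostream&gt;</pre>
<pre>#include &lt;iterator&gt;</pre>
<pre>#include &lt;vector&gt;</pre>
<pre>// object which "sometimes" throws an exception on copy assignment</pre>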
<pre>bool fail_bad_int = false;</pre>
<pre>int fail_bad_int_step = 8;</pre>
<pre>__declspec(tm_callable)</pre>
<pre>struct bad_int</pre>
<pre>{</pre>
<pre>    bad_int(int v = 0)</pre>
<pre>        : v(v)</pre>
<pre>    {}</pre>
<pre>    bad_int&amp; operator = (bad_int r)</pre>
<pre>    {</pre>
<pre>        if (fail_bad_int &amp;&amp; 0 == --fail_bad_int_step)</pre>
<pre>            throw 0;</pre>
<pre>        v = r.v;</pre>
<pre>        return *this;</pre>
<pre>    }</pre>
<pre>    operator int () const</pre>
<pre>    {</pre>
<pre>        return v;</pre>
<pre>    }</pre>
<pre>    bool operator &lt; (bad_int r) const</pre>
<pre>    {</pre>
<pre>        return v &lt; r.v;</pre>
<pre>    }</pre>
<pre>    int v;</pre>
<pre>};</pre>
<pre>// transactional sort function</pre>
<pre>template&lt;typename T&gt;</pre>
<pre>__declspec(tm_callable)</pre>
<pre>void sort(T* begin, T* end)</pre>
<pre>{</pre>
<pre>    T temp;</pre>
<pre>    size_t n = end - begin;</pre>
<pre>    if (n &lt; 2)</pre>
<pre>        return;</pre>
<pre>    bool swapped = false;</pre>
<pre>    do</pre>
<pre>    {</pre>
<pre>        swapped = false;</pre>
<pre>        n -= 1;</pre>
<pre>        for (size_t i = 0; i != n; ++i)</pre>
<pre>        {</pre>
<pre>            if (begin[i + 1] &lt; begin[i])</pre>
<pre>            {</pre>
<pre>                temp = begin[i];</pre>
<pre>                begin[i] = begin[i + 1];</pre>
<pre>                begin[i + 1] = temp;</pre>
<pre>                swapped = true;</pre>
<pre>            }</pre>
<pre>        }</pre>
<pre>    }</pre>
<pre>    while (swapped);</pre>
<pre>}</pre>
<pre>int main()</pre>
<pre>{</pre>
<pre>    std::vector&lt;bad_int&gt; x;</pre>
<pre>    std::generate_n(std::back_inserter(x), 10, rand);</pre>
<pre>    std::copy(x.begin(), x.end(),</pre>
<pre>        std::ostream_iterator&lt;int&gt;(std::cout, " \t"));</pre>
<pre>    std::cout &lt;&lt; std::endl;</pre>
<pre>    fail_bad_int = true;</pre>
<pre>    bad_int* begin = &amp;*x.begin();</pre>
<pre>    bad_int* end = begin + x.size();</pre>
<pre>    __tm_atomic</pre>
<pre>    {</pre>
<pre>        try</pre>
<pre>        {</pre>
<pre>            sort(begin, end);</pre>
<pre>        }</pre>
<pre>        catch (...)</pre>
<pre>        {</pre>
<pre>            __tm_abort;</pre>
<pre>        }</pre>
<pre>    }</pre>
<pre>    std::copy(x.begin(), x.end(),</pre>
<pre>        std::ostream_iterator&lt;int&gt;(std::cout, " \t"));</pre>
<pre>    std::cout &lt;&lt; std::endl;</pre>
<pre>}</pre>
<p>Output:</p>
<blockquote><p>41 18467 6334 26500 19169 15724 11478 29358 26962 24464<br />
41 18467 6334 26500 19169 15724 11478 29358 26962 24464</p></blockquote>
<p>And if I comment __tm_abort statement out, then output is:</p>
<blockquote><p>41 18467 6334 26500 19169 15724 11478 29358 26962 24464<br />
41 6334 18467 19169 26500 26500 11478 29358 26962 24464</p></blockquote>
<p>Notice that in the latter case the array is left in an intermediate state (26500 appears twice and 15724 has been lost), while in the former case the array is back in its initial state.</p>
<p>One has to understand, though, that this method can incur substantial run-time overheads (in both space and time) even if no exception is thrown. On the other hand, it's extremely simple and not error-prone (i.e. you cannot forget to cancel something or cancel it incorrectly). So this approach can be used on cold paths or in the initial stages of development.</p>
<p>I see that the Intel C++ STM Compiler has an execution mode called 'serial atomic', which is used for serialized transactions that contain abort statements. Serial atomic mode can be used to speed up such single-threaded usage of TM, i.e. the TM run-time only has to log writes, with no need for synchronization, quiescence, verification, etc.</p>
<p>What do you think? Is it useful?</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/10/28/new-interesting-application-of-transactional-memory-for-single-threading/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eliminate False Sharing? Wrong!</title>
		<link>http://software.intel.com/en-us/blogs/2008/10/09/eliminate-false-sharing-wrong/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/10/09/eliminate-false-sharing-wrong/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 01:29:09 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[cache]]></category>
		<category><![CDATA[data layout]]></category>
		<category><![CDATA[false-sharing]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/10/09/eliminate-false-sharing-wrong/</guid>
		<description><![CDATA[Entry in Parallel Programming with .NET blog "Most Common Performance Issues in Parallel Programs" and recent article in MSDN ".NET Matters: False Sharing" have attracted my attention. Basically they both suggest to eliminate false sharing. Wrong! Wrong! Wrong! It's not the whole truth, so to say. So if authors were under oath in the virtual [...]]]></description>
			<content:encoded><![CDATA[<p>An entry in the Parallel Programming with .NET blog, "<a href="http://blogs.msdn.com/pfxteam/archive/2008/08/12/8849984.aspx">Most Common Performance Issues in Parallel Programs</a>", and a recent MSDN article, "<a href="http://msdn.microsoft.com/en-us/magazine/cc872851.aspx">.NET Matters: False Sharing</a>", have attracted my attention. Basically they both suggest eliminating false sharing. Wrong! Wrong! Wrong! It's not the whole truth, so to say. So if the authors were under oath in the virtual IT court, they would have to be arrested. Fortunately they weren't under oath :)</p>
<p>The first thing one has to say in that context is:<br />
1. <strong>Eliminate sharing</strong>. Period. Not false sharing, just sharing. It's sharing that carries huge performance penalties. It's sharing that turns the linear scalability of your application into super-linear degradation. And believe me, hardware has no means of distinguishing false sharing from true sharing. It can't penalize only false sharing and handle true sharing without any performance penalty.</p>
<p>The second thing one has to say in that context is:<br />
2. Put things that must be close to each other... <strong>close to each other</strong>. Consider the following situation. To complete some operation, a thread has to update variable X and variable Y. If the variables are located far from each other (on different cache lines), the thread has to load 2 cache lines (from main memory or from another processor's cache) instead of the 1 it would need if the variables were close to each other. Effectively this situation can be considered the same as false sharing, because the thread places unnecessary work on the interconnects, thus degrading performance and scalability.</p>
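<p>A minimal C++11 sketch of point 2, assuming a 64-byte cache line (common on x86, but an assumption) and using hypothetical names (hot_pair, update, same_cache_line): the two variables that are always updated together are grouped into one aligned structure, so a single cache line transfer serves the whole operation:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common on x86 but not universal.
constexpr std::size_t line_size = 64;

// True if both addresses fall into the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / line_size ==
           reinterpret_cast<std::uintptr_t>(b) / line_size;
}

// X and Y are always updated together by one operation, so they are
// grouped and aligned so that both land on the same cache line.
struct alignas(line_size) hot_pair {
    long x;
    long y;
};
static_assert(sizeof(hot_pair) == line_size, "one cache line per pair");

// The operation touches a single cache line instead of two.
inline void update(hot_pair& p) {
    assert(same_cache_line(&p.x, &p.y));  // debug-time layout check
    p.x += 1;
    p.y += 1;
}
```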
<p>Points 1 and 2 can be aggregated as:</p>
<p>1+2. <strong>Do pay attention to data layout</strong>. This was important in the 60's. This is even more important in the multicore era.</p>
<p>Only after that one can also add:</p>
<p>3. Sometimes sharing can show up when you are not expecting it, i.e. <strong>false sharing</strong>. It is important to eliminate false sharing too... et cetera... [insert here the contents of the <a href="http://msdn.microsoft.com/en-us/magazine/cc872851.aspx">False Sharing</a> article]</p>
<p>If one says <strong>only </strong>point 3, well, it's basically senseless. And sometimes it can even hurt.</p>
<p>Let's consider a simple example:</p>
<blockquote>
<pre>long volatile g_operation_count = 0;</pre>
<pre>void collect_statistics() {</pre>
<pre>  InterlockedIncrement(&amp;g_operation_count);
}</pre>
</blockquote>
<p>What does a naive programmer think about it? <em>Hmmm... Let's see... I use "fast" non-blocking interlocked operations. Good!... Hmmm... False sharing. Let's see... Hmmm... There is no false sharing here. Good! So my program fully conforms to the recommendations of the experts.</em></p>
<p>Rubbish! It's a dead-slow, completely non-scalable program.</p>
<p>Now let's apply consistent rules to the example. First of all we have to do something like this:</p>
<blockquote>
<pre style="30px;">long volatile g_operation_count [MAX_THREAD_COUNT] = {};</pre>
<pre style="30px;">void collect_statistics() {</pre>
<pre style="30px;">  InterlockedIncrement(&amp;g_operation_count[get_current_thread_id()]);
}</pre>
</blockquote>
<p>It's a good distributed design. When we need the aggregate number of operations, we just sum up all the per-thread counters.</p>
<p>Only at this point can we remember about false sharing and put the final touches on the code:</p>
<blockquote>
<pre style="30px;">struct counter_t {
  long volatile count;
  char pad [CACHE_LINE_SIZE - sizeof(long)];
};
counter_t g_operation_count [MAX_THREAD_COUNT] = {};</pre>
<pre style="30px;">void collect_statistics() {</pre>
<pre>  InterlockedIncrement(&amp;g_operation_count[get_current_thread_id()].count);
}</pre>
</blockquote>
<p>Ok, this distributed version is both fast and scalable. It has linear scalability and can be up to 100x faster than the original version on modern multi-core hardware.</p>
<p>So, points 1+2 are the general rule, while point 3 is just a refinement of them.</p>
<p>Why don't people tell the whole truth? I don't know. I don't believe that the authors aren't aware of the problem. Maybe they think that it's obvious. Practice shows that it's not...</p>
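<p>The same design can be sketched in portable C++11, including the aggregation step. The constants (a 64-byte cache line, at most 8 threads) and the function total_operations are assumptions made for this illustration, and the thread index is passed in explicitly instead of calling get_current_thread_id():</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t cache_line_size = 64;   // assumed, typical for x86
constexpr std::size_t max_thread_count = 8;   // assumed upper bound

// One counter per thread, each alone on its own cache line.
struct alignas(cache_line_size) padded_counter {
    std::atomic<long> count{0};
};
static_assert(sizeof(padded_counter) == cache_line_size,
              "exactly one cache line per counter");

padded_counter g_operation_count[max_thread_count];

// Called by the thread with index thread_index; touches only its own line.
void collect_statistics(std::size_t thread_index) {
    assert(thread_index < max_thread_count);
    g_operation_count[thread_index].count.fetch_add(
        1, std::memory_order_relaxed);
}

// Aggregate on demand by summing all per-thread counters.
long total_operations() {
    long sum = 0;
    for (const padded_counter& c : g_operation_count)
        sum += c.count.load(std::memory_order_relaxed);
    return sum;
}
```

<p>The total is only a snapshot: threads may keep incrementing their counters while the sum is being computed, which is usually acceptable for statistics.</p>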
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/10/09/eliminate-false-sharing-wrong/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hello, Intel Software Network!</title>
		<link>http://software.intel.com/en-us/blogs/2008/10/09/hello-intel-software-network/</link>
		<comments>http://software.intel.com/en-us/blogs/2008/10/09/hello-intel-software-network/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 01:27:35 +0000</pubDate>
		<dc:creator>Dmitriy Vyukov</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[atomic-free]]></category>
		<category><![CDATA[lock-free]]></category>
		<category><![CDATA[obstruction-free]]></category>
		<category><![CDATA[wait-free]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2008/10/09/hello-intel-software-network/</guid>
		<description><![CDATA[Hi, I'm Dmitriy V'jukov, and welcome to my blog. I'm going to devote this blog to multi-threading, multi-core, synchronization algorithms and all other concerned things. Development of lock-free, wait-free, obstruction-free, atomic-free synchronization algorithms is my hobby. Some of my designs you can see here. Also, I've developed a tool called Relacy Race Detector, which can [...]]]></description>
			<content:encoded><![CDATA[<p>Hi, I'm Dmitriy V'jukov, and welcome to my blog.</p>
<p>I'm going to devote this blog to multi-threading, multi-core, synchronization algorithms and all related topics. Developing lock-free, wait-free, obstruction-free and atomic-free synchronization algorithms is my hobby. You can see some of my designs <a href="http://groups.google.com/group/lock-free" target="_blank">here</a>.</p>
<p>Also, I've developed a tool called <a href="http://groups.google.com/group/relacy" target="_blank">Relacy Race Detector</a>, which can be of help to developers of synchronization algorithms.</p>
<p>Stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2008/10/09/hello-intel-software-network/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

