English | 中文 | Русский | Français
2,590 Posts served
8,335 Conversations started
Entry in Parallel Programming with .NET blog "Most Common Performance Issues in Parallel Programs" and recent article in MSDN ".NET Matters: False Sharing" have attracted my attention. Basically they both suggest to eliminate false sharing. Wrong! Wrong! Wrong! It's not the whole truth, so to say. So if authors were under oath in the virtual IT court, they would have to be arrested. Fortunately they weren't under oath :)
The first thing one has to say in that context is:
1. Eliminate sharing. Period. Not false sharing, just sharing. It's sharing that has huge performance penalties. It's sharing that changes linear scalability of your application to super-linear degradation. And believe me, hardware has no means to distinguish false sharing from true sharing. It can't penalize only false sharing, and handle true sharing without any performance penalties.
Second thing one has to say in that context is:
2. Put things that must be close to each other... close to each other. Assume following situation. In order to complete some operation thread has to update variable X and variable Y. If variables are situated far from each other (on different cache lines), then thread has to load (from main memory, or from other processor's cache) 2 cache lines instead of 1 (if variables are situated close to each other). Effectively this situation can be considered the same as false-sharing, because thread places unnecessary work on interconnects, thus degrading performance and scalability.
Points 1 and 2 can be aggregated as:
1+2. Do pay attention to data layout. This was important in the 60's. This is even more important in the multicore era.
Only after that one can also add:
3. Sometimes sharing can show up when you are not expecting it, i.e false sharing. This is important to eliminate false sharing too... etceteras... [insert here contents of False Sharing article]
If one says only point 3, well, it's basically senseless. And sometimes it can even hurt.
Let's consider simple example:
long volatile g_operation_count = 0;void collect_statistics() {InterlockedIncrement(&g_operation_count); }
What does naive programmer think about it? Hmmm... Let's see... I use "fast" non-blocking interlocked operations. Good!... Hmmm... False sharing. Let's see... Hmmm... Here is no false sharing. Good! So my program fully conforms to recommendations of experts.
Rubbish! It's a dead-slow, completely non-scalable program.
Now let's apply consistent rules to the example. First of all we have to do something like this:
long volatile g_operation_count [MAX_THREAD_COUNT] = {};void collect_statistics() {InterlockedIncrement(&g_operation_count[get_current_thread_id()]); }
It's good distributed design. When we need aggregate number of operations we just sum up all thread local counters.
Only at this point we can remember about false-sharing and put the final touches to the code:
struct counter_t { long volatile count; char pad [CACHE_LINE_SIZE - sizeof(long)]; } counter_t g_operation_count [MAX_THREAD_COUNT] = {};void collect_statistics() {InterlockedIncrement(&g_operation_count[get_current_thread_id()].count); }
Ok, this distributed version is also fast and scalable. It has linear scalability and can be faster up to 100x on modern multi-core hardware as compared with original version.
So, point 1+2 is a kind of general rule, while point 3 is just a refinement to them.
Why people don't say the whole truth? I don't know. I don't beleive that authors don't aware of the problem. Maybe they think that it's obvious. The practice shows that it's not...
| October 10, 2008 5:33 AM PDT
Dmitriy Vyukov
|
Christoph, you are perfectly right. I was thinking about something else. About *per-processor* counters. Per-processor counters are still linearly scalable, although have quite bigger overheads per operation. The main advantage of per-processors counters is that they can be used when threads are dynamically created and destroyed: counter_t g_operation_count [PROCESSOR_COUNT] = {}; // probably will be allocated dynamically void collect_statistics() { InterlockedIncrement(&g_operation_count[get_current_processor_id( )].count); } |
| October 10, 2008 5:43 AM PDT
Dmitriy Vyukov
|
2Christoph: wrt volatile. If one uses only Interlocked* functions to access variable, then, yes, volatile can be removed. Although if one reads variable with plain loads, which can be the case wrt 'statistics', it's better to declare variable as volatile to prevent 'compiler caching'. Also MSVC C/C++ compiler makes volatile variables much more powerful than it's required by ISO C/C++ standard. It actually makes them synchronization primitives. Much like volatile in C#. I.e. volatile store is store-release, and volatile load is load-acquire (both on compiler and hardware level). |
| October 14, 2008 8:46 AM PDT
Arch Robison (Intel) |
I liked the blog. I would have titled it "Eliminate True Sharing Too!" since from just the title, some readers are going to infer that false sharing is a good thing. (which it is sometimes - that would make an interesting blog.) Using load-acquire store-release semantics on platforms like Itanium and PowerPC will punish performance, because on that platform volatile implies acquire-load store-release, and those operations are not cheap. The ISO C++0x "memory_order_relaxed" atomics are really the best answer for the statistics example if the statistics sum has to be computed asynchronously with the updates. If the sum only needs to be collected after the threads are done running, then the memory fences implied by the thread synchronization should suffice. |
| October 17, 2008 2:34 PM PDT
Dmitriy Vyukov
|
Re: I would have titled it "Eliminate True Sharing Too!" since from just the title, some readers are going to infer that false sharing is a good thing. This is not intended for readers who read only titles :) On the other hand such title arouses some interest, so to say. I am not constrained with any obligations or marketing, so not going to be indulgent, I am going to push things to the extreme. No regret! No remorse! No hostages! I am not going to say things like "ok, you can start with eliminating only false-sharing" (implying: "at least it will not require you to redesign your slack application"), or "[if all other alternatives are too hard for you, then] you can try to use Transactional Memory, it has some scalability [not very good, but at least your application will stop buzzing]". There is quantum satis of such articles. |
| October 17, 2008 2:35 PM PDT
Dmitriy Vyukov
| I hope you will interpret above with due portion of sense of humor ;) |
| October 18, 2008 11:55 AM PDT
Dmitriy Vyukov
|
2Arch wrt memory fences. I totally agree. I just didn't want to dig that deep straight away :) Memory fences are indeed expensive. The interesting thing is that sometimes it's possible to eliminate even inherently required fences from algorithms with the help of some algorithmic tricks. For example I've developed rw-lock algorithm in which readers execute no fences at all, not only no acquire/release fences, but also no store-load fences. Provided moderate write workload this rw-lock slaughters traditional rw-locks. The performance difference I observed on quad-core was up to 300x. By eliminating all memory fences I was able to achieve overhead per read_lock/read_unlock less than 10 cycles. Memory fences are indeed expensive! |
| October 26, 2008 9:46 AM PDT
jimdempseyatthecove
|
The algorithm chosen has to meet the requirements. In the multi-threaded statistics counter example (first posted in this blog) the inference from interpreting the design is that at any moment during the collection of statistics by multiple threads (in addition to after all threads complete) that any thread can view the approximation of the then current count for each thread. Note, this is an approximation because one or more threads, in the process of accumulating the counts may have reached the point of determination that a count should be made but is also just prior to executing the interlocked increment. i.e. although the responsible thread knows there is an additional statistic, it has not yet published the updated count. If the purpose of the algorithm is the produce the tallies in parallel as fast as possible and without regard to progress monitoring, but only with regard to sum total value when done, then the statistic counters would be best made local (registered if possible) to thread. Then use InterlockedAdd to SumTotal or InterlockedExchange to ThreadTotal[n].count (e.g. initialized to -1 and summed when not -1 at end). I fully agree with Dmitriy's sentiments. If there is no requirement for sharing, then eliminate sharing. Now for eliminating sharing, if the algorithm is complex enough to use stack local pointers and counters, the algorithm could be written to make effective use of cache line sharing as well as avoid false sharing. i.e. pack the most frequently used variables into as few cache lines as possible and to assure that the cache lines used do not false share with cache lines used by the other threads. The simplest way to do this would be through the use of the aligned static array of structures as Dmitriy layed out above and use the non-Interlocked increments to thread private members during tally collection, then using the InterlockedExchange to global published tally (in seperate cache line) at the end (or periodically during run). Keeping the puplished partial tally seperate from the working tally can reduce the frequency of cache eviction/memory bus load. If you publish at end of tally then you might only experience thread number number of evictions. If you publish every n ticks of the tally counters then you experience ~TotalTicks/n number of evictions (TotalTicks/NumberOfThreads/n * NumberOfThreads). Progress monitoring can be expensive. Jim Dempsey |

Christoph Bartoschek
++g_operation_count[get_current_thread_id()].count
Another question is why should one make count volatile? Normally volatile does not add any thread-safety.