Multicore programming with different multi-threading approaches

By Michael J Huelskoetter (90 posts) on July 9, 2010 at 4:20 am

A few days ago I visited a multicore developers conference in Munich where Dr. Mario Deilmann from Intel held his keynote speech regarding the different multi-threading approaches like Winthreads, POSIX threads, OpenMP, Intel TBB and task oriented concepts. His first conclusion must really be taken into account when you talk about parallel programming: Don't think in threads but in tasks as thread-oriented multi-threading is hard to learn, doesn't scale properly and causes a lot of trouble!

But, in accordance to Mario, OpenMP is a good first approach when talking about parallelism as it offers some easy-to-learn methods which help to parallelize your serial code. An even better concept is task oriented multi-threading, though, as Microsoft is offering now with the .NET 4 framework and Visual Studio 2010. In .NET 4 there are functions, classes and methods which help to parallelize serial code.

But what we learnt by Mario at the Multicore Developers Day in Munich as well is the fact, that there are new approaches when you want to multi-thread your code. One new concept is Cilk which will be part of the next Intel C++ Compiler. Cilk is completely task oriented and you don't need to rewrite your serial code. There are only two main methods - namely spawn() and for() which you add to your code. The rest will be done in the background with the help of a very intelligent scheduler which distributes the threads to the CPU resources as best as possible.

So, if you want to learn more about Mario's keynote and the event itself you should visit this link where you find a summary of the developers conference. But please be aware : these two blog posts are in German (as it is a German event :-)

Categories: Events, Parallel Programming
Tags: ,

For more complete information about compiler optimizations, see our Optimization Notice.

Comments (11)

July 9, 2010 6:01 AM PDT

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,894
Black Belt
> Don't think in threads but in tasks as thread-oriented multi-threading is hard to learn, doesn't scale properly and causes a lot of trouble!

And how does task-oriented multi-threading principally differs from thread-oriented multi-threading? Except somehow larger overheads.
Note that all modern task-oriented libraries still have overheads unacceptable for straightforward implementation of some problems:
http://software.intel.com/en-us/forums/showthread.php?t=75972&o=d&s=lr
So, as far as I see, they are basically equivalent (syntax differences aside).
July 9, 2010 6:03 AM PDT

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,894
Black Belt
> doesn't scale properly

That's just not true. Raw threads are the most efficient and scalable model. Since your tasking library is built on top of threads, sorry, you can be at most as scalable as threads in the very best case.
July 9, 2010 6:04 AM PDT

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,894
Black Belt
> One new concept is Cilk

Perhaps you meant Cilk++ which is a completely different thing
July 9, 2010 6:07 AM PDT

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,894
Black Belt
> There are only two main methods - namely spawn() and for()

Perhaps you meant *three* main *keywords* - cilk_spawn, cilk_sync and cilk_for (w/o parentheses)
July 9, 2010 6:08 AM PDT

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,894
Black Belt
> The rest will be done in the background with the help of a very intelligent scheduler which distributes the threads to the CPU resources as best as possible.

The scheduler is completely unintelligent random scheduler, and it distributes tasks, not threads.
July 10, 2010 2:46 AM PDT


Kevin Cameron
Most of the current techniques are based on what people started doing on servers in the 90's before the multicore chips arrived (which implement the same architecture on single chips). It just doesn't scale.

If you want techniques that scale to thousands of cores (or machines), taking a look at how hardware is designed - thousands of processing blocks running in parallel.
July 10, 2010 2:49 PM PDT

Patrick Viry
Patrick ViryTotal Points:
15
Registered User
It has been widely acknowledged for years that threads are a hardware-level concept that is ill-suited as a practical programming abstraction. See e.g. "The Problem with Threads", a 2006 Berkeley Technical Report by Edward A. Lee, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf .

Mario couldn't mention Ateji PX since it has just been announced this week. Ateji PX is an extension to Java that provides a compositional approach to parallel programming, unlike threads, OpenMP or Cilk. Compositional means easier to scale (just compose), easier to understand and debug (just decompose), and easier to learn (it's closer to our intuition of what parallelism "is").

You'll find a gentle introduction to Ateji PX on HPCwire at http://www.hpcwire.com/features/French-Firm-Brews-Parallel-J..... 38769.html , and more technical whitepapers at http://www.ateji.com/px/ .
July 11, 2010 1:48 PM PDT

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,894
Black Belt
> See e.g. "The Problem with Threads", a 2006 Berkeley Technical Report by Edward A. Lee, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf .

Everything he said equally applies to TBB. Just #define tbb::task std::thread and you can transparently switch task and threads. They are the same. You spawn tasks, and you spawn threads; you join tasks, and you join threads.
July 11, 2010 1:49 PM PDT

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,894
Black Belt
> Ateji PX is an extension to Java that provides a compositional approach to parallel programming, unlike threads, OpenMP or Cilk.

In what way is not Cilk++ compositional?
July 13, 2010 2:08 PM PDT

bjoernknafla
bjoernknaflaTotal Points:
185
Registered User
Well, there is a difference between threads and tasks: creating tasks is typically faster than creating threads. Intel's Carbon project even simulated task queues on the chip with nice results.

Also: as tasks are typically mapped onto a number of threads you control the number of threads and therefore don't oversubscribe the available hardware threads (which might make a performance difference based on the workload you put on the threads).

Because of the oversubscription problem scaling performance via threads doesn't scale - while tuning the size and number of tasks to put on a fixed number of threads often is scalable.

In the end the problem and the hardware at hand decide which technique to use for your solution. As most of the mentioned workshops are targeted at beginners in parallelism the message to go for tasks seems a better fit than telling them to use threads. Especially as most job pool implementations offer primitives that guide you in programming in continuations instead of blocking - which shields beginners and non-experts from having to cope with memory models and explicit locking.

Concluding: the big advantages of TBB and Cilk are, that they guide (or even enforce?) how to program in parallel - less rope to hang then going down to the thread-level (though it's still possible to block or to create data races...).
July 13, 2010 3:57 PM PDT

Patrick Viry
Patrick ViryTotal Points:
15
Registered User
When designing Ateji PX, I purposedly didn't want to enter into a discussion about the meaning of what is a thread, what is a task, what is a process. These terms are overused and not well defined. Instead, I introduced the notion of "branch", which is defined on a clear and formal foundation, so we know what we're talking about.

When talking about threads, the common understanding is that a thread consists in a handful of processor register values, that can be restored upon request. This is definitely a hardware concept.

From a programmer's point of view, there is a huge difference between launching a thread/task/process and composing parallel branches -- hint: the verbs are important. "lauching" or "spawning" (the wording used in Cilk and Erlang) means creating an entity that will live its own life, on which you have no control anymore once it is started. "composing" (as in Occam and Ateji PX) means that you compose statements or blocks of statements in parallel, exactly like you can already compose statements or blocks in sequence.

The difference ? You can reason about composition, so code is easier to understand and development tools are able to help the programmer. The behavior of the composition is the composition of both behaviors. When there's a problem, you debug the code by decomposing it. It just becomes so much easier to write parallel code that works, whether memory is shared or not.

Trackbacks (2)


Leave a comment  

To obtain technical support, please go to Software Support.
Name (required)*

Email (required; will not be displayed on this page)*

Your URL (optional)


Comment*