Parallel Implementation Methods with Intel® Parallel Composer Webinar Q&A

Q&A from Webcast: The webinar "Parallel Implementation Methods with Intel® Parallel Composer" was presented by Ganesh Rao on March 31st, 2009, as part of our technical webinar series on multithreading tools and techniques. The following questions were selected from those generated during the webcast and may be useful to other developers as a reference.

Q: SSE2 is a standard and it is supported by AMD. Why is it set as "Intel Processor Specific"?
A: The Intel® C++ compiler can generate code that runs on any processor supporting the SSE2 instruction set using /QaxSSE2, or code optimized specifically for Intel processors with SSE2 support using /QxSSE2. The latter enables additional optimizations and performs a CPU check when the application is executed.

Q: Do you need to #include <omp.h> to be able to use #pragma omp?
A: No, but you need to include omp.h if you want to call OpenMP API functions such as omp_get_num_threads(),
omp_get_thread_num(), etc.
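A minimal sketch of the distinction: the pragma below compiles without omp.h, but the two omp_get_* calls need the header's declarations:

    #include <cstdio>
    #include <omp.h>   // needed only for the omp_* API calls below

    int main() {
        // The pragma itself compiles without omp.h:
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();        // requires omp.h
            int nthreads = omp_get_num_threads();  // requires omp.h
            std::printf("thread %d of %d\n", tid, nthreads);
        }
        return 0;
    }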

Q: If I am using the OpenMP DLLs, will concurrent OpenMP applications (processes) compete for CPU resources the same way as when I use the static OpenMP runtime?
A: Testing shows there is practically no performance advantage to linking multiple processes with the static OpenMP runtime as opposed to linking with the dynamic runtime (which uses DLLs and is the default). In either linking scenario, you have to be careful not to oversubscribe the machine when running multiple OpenMP processes. With multiple independent OpenMP processes running on a host, the OpenMP runtime library execution mode should be 'throughput' (environment variable KMP_LIBRARY), which is the default. On the other hand, if you have a dedicated host, the execution mode should be 'turnaround', which minimizes the execution time of a single OpenMP process.
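As a sketch of how the mode might be selected: the usual route is to set KMP_LIBRARY in the shell before launching the process, since the Intel OpenMP runtime reads it when it initializes. Setting it from inside the process, as below, is an assumption that should hold only if no OpenMP construct or API call has executed yet (Windows CRT shown):

    #include <stdlib.h>  // _putenv (Windows CRT)

    int main() {
        // KMP_LIBRARY is read at OpenMP runtime initialization, which
        // happens at the first OpenMP construct or API call, so it must
        // be set before that point. 'throughput' is the default.
        _putenv("KMP_LIBRARY=turnaround");  // dedicated-host mode

        #pragma omp parallel
        {
            // ... workload ...
        }
        return 0;
    }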

Q: How is binary compatibility between the Intel compiler and the Visual Studio compiler maintained when VS only supports OpenMP 2.5?
A: If you want to use any new features in OpenMP 3.0, you have to use the Intel® Parallel Composer, but if you only use features from OpenMP 2.5, you can use Visual C++ 2005, Visual C++ 2008, or the Intel Parallel Composer. Please refer to the related knowledge base article for more information.
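A rough sketch of the version split (hypothetical function names):

    void works_with_openmp_25(int* a, int n) {
        // "parallel for" is an OpenMP 2.5 construct; it builds with
        // Visual C++ 2005/2008 as well as with Intel Parallel Composer.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= 2;
    }

    void needs_openmp_30(int* a, int n) {
        // "task" is new in OpenMP 3.0; at the time of this webinar it
        // required the Intel compiler.
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < n; ++i) {
            #pragma omp task firstprivate(i)
            a[i] *= 2;
        }
    }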

Q: Maybe it's better to use 4 threads for 2 cores (i.e., number of threads = number of cores * 2)?
A: Probably not, unless the machine supports hyperthreading and hyperthreading is enabled in the BIOS; otherwise you will oversubscribe the machine. In general, you shouldn't give any OpenMP process more than the number of the machine's logical threads (number-of-processors * number-of-cores/processor * number-of-threads/core), and in fact you might find it better to limit the total number of OpenMP threads to something less. Testing shows that the performance penalty from oversubscribing the machine can be severe. It is generally OK to use all the machine's logical threads for one OpenMP process (and this is the default unless you explicitly change the number by setting OMP_NUM_THREADS or by calling omp_set_num_threads()), but depending on your host usage scenario, you should set the environment for 'throughput' (multiple OpenMP processes running) or 'turnaround' (a single, dedicated OpenMP process).
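A small sketch of querying the logical processor count and capping the team size with the standard API:

    #include <cstdio>
    #include <omp.h>

    int main() {
        // omp_get_num_procs() reports the machine's logical processors;
        // by default a parallel region uses that many threads.
        int logical = omp_get_num_procs();
        std::printf("logical processors: %d\n", logical);

        // Cap the team size explicitly; use something smaller than
        // 'logical' if other OpenMP processes share the host.
        omp_set_num_threads(logical);

        #pragma omp parallel
        {
            #pragma omp single
            std::printf("team size: %d\n", omp_get_num_threads());
        }
        return 0;
    }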

Q: What is the difference between using omp parallel task versus using omp parallel sections calling a task?
A: Calling a task within a section just adds overhead, and the tasks cannot be coordinated or synchronized because each parallel section is independent of the others. OpenMP 3.0 tasking is more flexible and efficient than parallel sections. With parallel sections, there is no way to coordinate the task in each section, so it is not possible to determine whether one section will execute before another, regardless of which section comes first in the program source. The task directive, on the other hand, can take an "if" clause to cause the task to be executed immediately or deferred; a task can be "hard wired" to a thread (called "tied"), or it can be "untied", which allows any available thread in the thread pool to start executing it. Tasking has much better performance and scalability for nested-parallel and recursive algorithms than parallel sections: there is far more overhead in creating and destroying nested parallel regions (the parallel-sections approach) than in executing tasks that are all created by a single parallel region containing a task directive. You can control the total number of threads (OMP_NUM_THREADS) with tasking, whereas with nested parallel regions (sections), each newly created region gets OMP_NUM_THREADS new threads, which can easily oversubscribe the host.
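To make the tasking advantage concrete, here is a minimal recursive sketch (a conventional task-based Fibonacci, not taken from the webinar): a single parallel region creates one thread team, and every task, however deeply nested, is executed by threads from that team, so the total thread count stays at OMP_NUM_THREADS:

    #include <cstdio>

    long fib(int n) {
        if (n < 2) return n;
        long x, y;
        // The "if" clause runs small subproblems immediately,
        // avoiding tasking overhead near the leaves.
        #pragma omp task shared(x) if(n > 20)
        x = fib(n - 1);
        #pragma omp task shared(y) if(n > 20)
        y = fib(n - 2);
        #pragma omp taskwait   // synchronize the two child tasks
        return x + y;
    }

    int main() {
        long result;
        #pragma omp parallel      // one region, one thread team
        {
            #pragma omp single    // one thread spawns the root of the task tree
            result = fib(30);
        }
        std::printf("fib(30) = %ld\n", result);
        return 0;
    }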

Q: Do you have a favorite textbook about OpenMP that you would recommend?
A: Please look at the "Related Information" section in the product user guide, where associated Intel documents are mentioned. A good book to look at is "Using OpenMP: Portable Shared Memory Parallel Programming" by Barbara Chapman, Gabriele Jost, and Ruud van der Pas.


