stall in parallel_for

Hello,

I am using TBB 4.1 update 3 (Linux) on a workstation with 4 Intel Xeon sockets (X7560, 32 cores and 64 threads in total, using Intel compiler 12.0.4) and testing with 4 MPI processes---each configured to use 8 threads (task_scheduler_init init(8)). Each process has three task groups, and each task group invokes parallel_for at multiple levels.

If I use the debug version of the TBB libraries, I am seeing the following assertion failures (mostly the first one, and somewhat less frequently the second one):

Assertion my_global_top_priority >= a.my_top_priority failed on line 530 of file ../../src/tbb/market.cpp

Assertion prev_level.workers_requested >= 0 && new_level.workers_requested >= 0 failed on line 520 of file ../../src/tbb/market.cpp

If I use the production version of the TBB libraries, I am seeing a stall inside parallel_for (this happens roughly once every hour or two... though it is very irregular; the above assertion failures occur more frequently...)

In one case I see two processes out of the total four with two stalling threads each (four in total...). If I print a few GDB outputs for one of the two problematic processes (both have 10 threads; the remaining two, without a stalling thread, have 8 and 10 threads, respectively):

(gdb) info threads
  10 Thread 0x2afbd94d6940 (LWP 11682)  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
* 9 Thread 0x2afbd98d7940 (LWP 11684)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
  8 Thread 0x2afbd9cd8940 (LWP 11690)  0x00002afbb3e5280e in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb43dfe00, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:253
  7 Thread 0x2afbda0d9940 (LWP 11694)  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
  6 Thread 0x2afbda4da940 (LWP 11695)  0x00002afbb3e52810 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb42cfe00, completion_ref_count=@0x0, return_if_no_work=5, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:253
  5 Thread 0x2afbda8db940 (LWP 11699)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <= stalling
  4 Thread 0x2afbdacdc940 (LWP 11705)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <= waiting for a MPI message
  3 Thread 0x2afbdd4e9940 (LWP 11714)  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (
    this=0x2afbb45a3480, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:173
  2 Thread 0x2afbdd8ea940 (LWP 11715)  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00, $5=<value optimized out>)
    at ../../src/tbb/scheduler.cpp:854
  1 Thread 0x2afbb41f5f60 (LWP 11678)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <=stalling

(gdb) thread 1
[Switching to thread 1 (Thread 0x2afbb41f5f60 (LWP 11678))]#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
#1  0x00002afbb3e529b4 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a4a00,
    completion_ref_count=@0x0, return_if_no_work=6, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:261
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb45a4a00,
    parent=..., child=0x6, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#4  0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#5  0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#6  0x0000000000626e6c in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $1=<value optimized out>,
    $2=<value optimized out>, $3=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#7  0x0000000000633066 in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $6=<value optimized out>, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#8  0x00000000005fb993 in operator() (this=0x2afbdbe1ac58, r=..., $F2=<value optimized out>, $F3=<value optimized out>) at mech_intrct.cpp:295
#9  0x0000000000627637 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run_body (
    this=0x2afbdbe1ac40, r=..., $7=<value optimized out>, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:110
#10 0x000000000061d883 in tbb::interface6::internal::partition_type_base<tbb::interface6::internal::auto_partition_type>::execute (
    this=0x2afbdbe1ac68, start=..., range=..., $5=<value optimized out>, $0=<value optimized out>, $1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/partitioner.h:259
#11 0x0000000000626ff4 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::execute (
    this=0x2afbdbe1ac40, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:116
#12 0x00002afbb3e515ba in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb45a4a00,
    parent=..., child=0x6, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:440
#13 0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#14 0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#15 0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#16 0x0000000000627540 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $4=<value optimized out>,
    $5=<value optimized out>, $6=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#17 0x000000000063309e in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $8=<value optimized out>, $9=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#18 0x0000000000604e69 in Sim::computeMechIntrct (this=0x2afbb459fc80, $1=<value optimized out>) at mech_intrct.cpp:293
#19 0x00000000004b041f in operator() (this=0x2afbdc0b4248, $7=<value optimized out>) at run.cpp:174
#20 0x00000000004e0a5c in tbb::internal::function_task<lambda []>::execute (this=0x2afbdc0b4240, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task_group.h:79
#21 0x00002afbb3e51f50 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::wait_for_all (this=0x2afbb45a4a00, parent=...,
    child=0x6, $J3=<value optimized out>, $J4=<value optimized out>, $J5=<value optimized out>) at ../../src/tbb/custom_scheduler.h:81
#22 0x00000000004dfd8f in tbb::task::wait_for_all (this=0x2afbb45c7a40, $=0=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:704
#23 0x00000000004e040f in tbb::internal::task_group_base::wait (this=0x7fff0f9681d8, $3=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task_group.h:157
#24 0x00000000004b6f6b in Sim::run (this=0x2afbb459fc80, $=<value optimized out>) at run.cpp:176
#25 0x000000000042b58f in biocellion (xmlFile="fhcrc.xml") at sim.cpp:181
#26 0x000000000042722a in main (argc=2, ap_args=0x7fff0f96a218) at biocellion.cpp:30

(gdb) thread 5
[Switching to thread 5 (Thread 0x2afbda8db940 (LWP 11699))]#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
#1  0x00002afbb3e529b4 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb42b7e00,
    completion_ref_count=@0x0, return_if_no_work=2, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:261
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb42b7e00,
    parent=..., child=0x2, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb42b7e00, first=..., next=@0x2,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#4  0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb42b7e00, first=..., next=@0x2,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#5  0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#6  0x0000000000626e6c in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $1=<value optimized out>,
    $2=<value optimized out>, $3=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#7  0x0000000000633066 in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $6=<value optimized out>, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#8  0x00000000005fb993 in operator() (this=0x2afbdc8add58, r=..., $F2=<value optimized out>, $F3=<value optimized out>) at mech_intrct.cpp:295
#9  0x0000000000627637 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run_body (
    this=0x2afbdc8add40, r=..., $7=<value optimized out>, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:110
#10 0x000000000061d883 in tbb::interface6::internal::partition_type_base<tbb::interface6::internal::auto_partition_type>::execute (
    this=0x2afbdc8add68, start=..., range=..., $5=<value optimized out>, $0=<value optimized out>, $1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/partitioner.h:259
#11 0x0000000000626ff4 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::execute (
    this=0x2afbdc8add40, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:116
#12 0x00002afbb3e515ba in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb42b7e00,
    parent=..., child=0x2, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:440
#13 0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbb42b7e00, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#14 0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbb42b7e00, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#15 0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbb42b7e00, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#16 0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb42b7e00, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#17 0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#18 0x0000003810ad4fad in clone () from /lib64/libc.so.6

For a few seemingly normal threads:

(gdb) thread 2
[Switching to thread 2 (Thread 0x2afbdd8ea940 (LWP 11715))]#0  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00,
    $5=<value optimized out>) at ../../src/tbb/scheduler.cpp:854
854    ../../src/tbb/scheduler.cpp: No such file or directory.
    in ../../src/tbb/scheduler.cpp
(gdb) where
#0  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00, $5=<value optimized out>) at ../../src/tbb/scheduler.cpp:854
#1  0x00002afbb3e52726 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbdcc57e00,
    completion_ref_count=@0x2afbb45a3a00, return_if_no_work=7, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:193
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbdcc57e00,
    parent=..., child=0x7, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbdcc57e00, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#4  0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbdcc57e00, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#5  0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbdcc57e00, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#6  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbdcc57e00, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#7  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003810ad4fad in clone () from /lib64/libc.so.6

(gdb) thread 3
[Switching to thread 3 (Thread 0x2afbdd4e9940 (LWP 11714))]#0  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a3480, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:173
173    ../../src/tbb/custom_scheduler.h: No such file or directory.
    in ../../src/tbb/custom_scheduler.h
(gdb) where
#0  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a3480, completion_ref_count=@0x0,
    return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:173
#1  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbdcc6fe00,
    parent=..., child=0x1, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#2  0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbb45a3480, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#3  0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbb45a3480, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#4  0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbb45a3480, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#5  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb45a3480, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#6  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#7  0x0000003810ad4fad in clone () from /lib64/libc.so.6

(gdb) thread 7
[Switching to thread 7 (Thread 0x2afbda0d9940 (LWP 11694))]#0  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
#1  0x00002afbb3e4a8e2 in tbb::internal::rml::private_worker::run (this=0x2afbb4557dac, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:281
#2  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb4557dac, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#3  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003810ad4fad in clone () from /lib64/libc.so.6

Any clues???

I have also seen, several times, threads ending up inside the bool arena::is_out_of_work() function...

Thank you very much,

If you don't use priorities yourself, try again with __TBB_TASK_PRIORITY=0 (somewhere in tbb_config.h).
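A hedged sketch of what that might look like (the exact guard and location in tbb_config.h vary between TBB versions, so treat this as illustrative only):

/* Illustrative only: force the task-priority machinery off, then rebuild both
   the TBB library and the application, since the macro is also seen by the
   public headers. The exact spot and guard in tbb_config.h may differ. */
#define __TBB_TASK_PRIORITY 0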

Hello, what MPI version do you use?

Can't it be related to the post http://software.intel.com/en-us/forums/topic/392226?

--Vladimir

I got the same problem... The only way I found to fix the issue is not to use context priorities at all.

http://software.intel.com/en-us/forums/topic/278901

Quote:

Raf Schietekat wrote:

If you don't use priorities yourself, try again with __TBB_TASK_PRIORITY=0 (somewhere in tbb_config.h).

Thank you for the suggestion.

I will try, but I am using priorities (task::self().set_group_priority()) and have a reason to do this. I assigned low, normal, and high priorities to three different task groups. It also seems like this problem happens only when there is more than one process per shared-memory node.
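For context, a minimal sketch of the pattern described above (the do*Work functions are hypothetical placeholders, not the real application code; each of them would invoke parallel_for at multiple levels):

/* Minimal sketch of the usage pattern described above; doLowWork(),
   doNormalWork() and doHighWork() are hypothetical placeholders. */
#include <tbb/tbb.h>

void doLowWork();     // hypothetical
void doNormalWork();  // hypothetical
void doHighWork();    // hypothetical

void runOneTimeStep() {
    tbb::task_group lowGroup, normalGroup, highGroup;
    lowGroup.run(    []{ tbb::task::self().set_group_priority( tbb::priority_low );    doLowWork();    } );
    normalGroup.run( []{ tbb::task::self().set_group_priority( tbb::priority_normal ); doNormalWork(); } );
    highGroup.run(   []{ tbb::task::self().set_group_priority( tbb::priority_high );   doHighWork();   } );
    lowGroup.wait(); normalGroup.wait(); highGroup.wait();
}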

Quote:

Vladimir Polin (Intel) wrote:

Hello, what MPI version do you use?

Can't it be related to the post http://software.intel.com/en-us/forums/topic/392226?

--Vladimir

I am using MPICH2 (1.4.1p1). The blocking thread (waiting for an MPI message) in the GDB output is normal... The problem I am seeing is a stall inside parallel_for even though there are many idle threads.

Thanks!!!

Quote:

Michel Lemay wrote:

I got the same problem... The only way I found to fix the issue is not to use context priorities at all.

http://software.intel.com/en-us/forums/topic/278901

Thank you, and it seems like this is related to priorities... I am assigning low, normal, and high priorities to three different task groups, and each group invokes parallel_for at multiple levels. But it seems like this problem occurs only when there is more than one process per node in my case (I am using 4.1 update 3). Have you ever seen this with just one process per node?

A more general question I have is... is there any sort of resource sharing or communication, inside the TBB library, among different processes using TBB?

Thanks,

The suggestion was just to try and help pinpoint the problem: how about Vladimir's question?

I did do some research about a problem with priorities, but there's a lot more that I haven't figured out yet.

Quote:

Raf Schietekat wrote:

The suggestion was just to try and help pinpoint the problem: how about Vladimir's question?

I did do some research about a problem with priorities, but there's a lot more that I haven't figured out yet.

It seems like my replies to other comments are still sleeping in the queue :-(

I commented out all task::self().set_group_priority calls in my code and I haven't observed an assertion failure or stall yet. It seems like this is somewhat relevant to priorities.

I don't think MPI is an issue here. I am using MPICH2 and I am not seeing anything strange related to MPI communication in GDB outputs.

Thanks,

Quote:

Seunghwa Kang wrote:

Quote:

Michel Lemay wrote:

I got the same problem... The only way I found to fix the issue is not to use context priorities at all.

http://software.intel.com/en-us/forums/topic/278901

Thank you, and it seems like this is related to priorities... I am assigning low, normal, and high priorities to three different task groups, and each group invokes parallel_for at multiple levels. But it seems like this problem occurs only when there is more than one process per node in my case (I am using 4.1 update 3). Have you ever seen this with just one process per node?

A more general question I have is... is there any sort of resource sharing or communication, inside the TBB library, among different processes using TBB?

Thanks,

We run only one process with TBB scheduling. This process contains tens of threads with varying priorities. And such threads would assign similar priorities to the context passed to parallel algorithms; i.e., low-priority threads (background processing) would typically launch low-priority parallel loops and tasks. Conversely, a high-priority thread (e.g. one waiting for user input) would launch high-priority tasks.

I saw this issue on at least two machines, with AMD and Intel processors, with 32 cores and under heavy utilization. However, I've never been able to create a simple piece of code mimicking this problem and send it to the TBB team for review.

Kang,

From your first description of your application, (in summary) you partitioned your MPI application into separate sockets, each running 8 threads in TBB... without oversubscription of the system. The system has 64 HW threads, 8 MPI slices, each TBB'd to 8 threads.

From reading other forum posts, I seem to recall that the newest TBB library has added a "feature" whereby, if a system is running multiple TBB applications, each assuming it has the full complement of hardware threads, each TBB application throttles down its number of worker threads. This code is new (it may have bugs) and may cause issues with MPI, especially across synchronization. If possible, try to disable this feature.

Jim Dempsey

www.quickthreadprogramming.com

I am still having this issue with TBB 4.1 update 4.

The stall is quite difficult to reproduce, but the following assertion fails pretty frequently even with the simple code I pasted below. My wild guess is that this assertion failure is a necessary condition for the stall, and I hope fixing this will also fix the stall issue.

Assertion my_global_top_priority >= a.my_top_priority failed on line 506 of file ../../src/tbb/market.cpp

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <omp.h>
#include <tbb/tbb.h>
using namespace tbb;
#define ENABLE_PRIORITY 1
#define NUM_TIME_STEPS 100000
void myFunc( void );
int main( int argc, char* ap_args[] ) {
    /* initialize TBB */
    task_scheduler_init init( 64 );
    /* start time-stepping */
    /* this loop cannot be parallelized */
    for( int i = 0 ; i < NUM_TIME_STEPS ; ) {
#if ENABLE_PRIORITY
        task::self().set_group_priority( priority_normal );
#endif
        {
            task_group myGroup;
            myGroup.run( []{ myFunc(); } );
            myGroup.wait();
        }
        i++;
        std::cout << i << "/" << NUM_TIME_STEPS << std::endl;
    }
    std::cout << "Simulation finished." << std::endl;
    return 0;
}
void myFunc( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_high );
#endif
    for( int i = 0 ; i < 10 ; i++ ) {
        /* make a copy of the agent and grid data at the beginning of this sub-step */
        parallel_for( blocked_range<int> ( 0, 38 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
            parallel_for( blocked_range<int> ( 0, 8 ), [&]( const blocked_range<int>& r2 ) {
            for( int jj = r2.begin() ; jj < r2.end() ; jj++ ) {
            }
            } );
        }
        } );
    }
    return;
}

I ran the code on a Linux system with 32 cores and 64 hardware threads.

I build the executable by typing

icpc -std=c++0x -g -Wall -I /home/install/tbb41u4/include -I ../include -c test.cpp -o test.o
icpc -std=c++0x -static-intel  -o test test.o -openmp-link static -openmp -L /home/install/tbb41u4/lib/intel64/gcc4.1 -ltbb_debug -ltbbmalloc_proxy_debug -ltbbmalloc_debug

Hello, great reproducer!

I was able to reproduce the assertion failure on a Windows machine.
Could you submit it via our contribution page? Then we can add it to our unit testing.

--Vladimir

Wow! This bug has been hiding and creeping for so long! I'm glad someone finally reproduced it with a simple piece of code!

Good job M Kang!

Just want to report that this assertion is still failing with TBB 4.2 and the stalling issue persists.

I hope this gets fixed sometime soon.

I am unable to directly test the code. Can you try this:

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <omp.h>
#include <tbb/tbb.h>
using namespace tbb;
#define ENABLE_PRIORITY 1
#define NUM_TIME_STEPS 100000
void myFunc( void );
int main( int argc, char* ap_args[] ) {
    /* initialize TBB */
    task_scheduler_init init( 64 );
    /* start time-stepping */
    /* this loop cannot be parallelized */
    for( int i = 0 ; i < NUM_TIME_STEPS ; ) {
#if ENABLE_PRIORITY
        task::self().set_group_priority( priority_normal );
#endif
        {
            task_group myGroup;
            myGroup.run( []{ myFunc(); } );
#if ENABLE_PRIORITY
            task::self().set_group_priority( priority_high );
#endif
            myGroup.wait();
#if ENABLE_PRIORITY
            task::self().set_group_priority( priority_normal );
#endif
        }
        i++;
        std::cout << i << "/" << NUM_TIME_STEPS << std::endl;
    }
    std::cout << "Simulation finished." << std::endl;
    return 0;
}
void myFunc( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_high );
#endif
    for( int i = 0 ; i < 10 ; i++ ) {
        /* make a copy of the agent and grid data at the beginning of this sub-step */
        parallel_for( blocked_range<int> ( 0, 38 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
            parallel_for( blocked_range<int> ( 0, 8 ), [&]( const blocked_range<int>& r2 ) {
            for( int jj = r2.begin() ; jj < r2.end() ; jj++ ) {
            }
            } );
        }
        } );
    }
    return;
}

Jim Dempsey

www.quickthreadprogramming.com

One thing that's immediately striking, and I've been here before but I may have abandoned it at the time (I just don't remember at this moment), is how "volatile intptr_t* tbb::internal::generic_scheduler::my_ref_top_priority" can reference either "volatile intptr_t tbb::internal::arena_base::my_top_priority" (master) or "intptr_t tbb::internal::market::my_global_top_priority" (worker): perhaps the non-volatile market member variable will work, perhaps not, but I think that it should all be tbb::atomic<intptr_t>, at least because a volatile is only a poor man's atomic anyway (silently relies on cooperation from both compiler and hardware for elementary atomic behaviour, nonportably implied memory semantics on some compilers but none by the Standard, no RMW operations), but also because there's no help from the compiler to avoid such confusion between volatile and non-volatile (which may temporarily reside in registers).

I have no idea at this time whether this might be the cause of the problem or just a red herring, but I would get rid of the volatile even if only as a matter of principle. Actually all uses of volatile in TBB, or at least those that are not burdened with backward compatibility.
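Purely to illustrate the kind of change being argued for (hypothetical stand-ins, not the actual TBB source):

/* Hypothetical stand-ins, not the real TBB classes: the point is only that
   tbb::atomic<> gives defined atomicity and memory ordering, while a plain or
   volatile integer does not. */
#include <stdint.h>
#include <tbb/atomic.h>

struct market_like {                                // cf. tbb::internal::market
    tbb::atomic<intptr_t> my_global_top_priority;   // instead of plain intptr_t
};
struct arena_like {                                 // cf. tbb::internal::arena_base
    tbb::atomic<intptr_t> my_top_priority;          // instead of volatile intptr_t
};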

(Added) It would seem that the market's my_global_top_priority is protected by my_arenas_list_mutex for writing, but, strictly speaking, although the new value should now not be hidden indefinitely, it's not quite the same as a read-write mutex, especially if there's a breach caused by the alias. Luckily my_ref_top_priority's referent can be made const, so that's not an issue.

(Added) Does the problem occur, e.g., on both Linux and Windows, or only on Linux? If so, it would mean that the code relies on nonportable memory semantics associated with the use of "volatile". Otherwise it's something else, or that and something else.

(Added) TODO: check that update_global_top_priority() is only called by a thread that holds a lock on my_arenas_list_mutex. These things should be documented...

Looking further (and I may add to this posting)...

Hello Jim,

The code you attached also fails.

I am getting

Assertion my_global_top_priority >= a.my_top_priority failed on line 536 of file ../../src/tbb/market.cpp
Abort

Thanks,

Quote:

jimdempseyatthecove wrote:

I am unable to directly test the code. Can you try this:

#include <stdio.h> #include <stdlib.h> #include <iostream> #include <string> #include <omp.h> #include <tbb/tbb.h> using namespace tbb; #define ENABLE_PRIORITY 1 #define NUM_TIME_STEPS 100000 void myFunc( void ); int main( int argc, char* ap_args[] ) {     /* initialize TBB */     task_scheduler_init init( 64 );     /* start time-stepping */     /* this loop cannot be parallelized */     for( int i = 0 ; i < NUM_TIME_STEPS ; ) { #if ENABLE_PRIORITY         task::self().set_group_priority( priority_normal ); #endif         {             task_group myGroup;             myGroup.run( []{ myFunc(); } ); #if ENABLE_PRIORITY             task::self().set_group_priority( priority_high ); #endif             myGroup.wait(); #if ENABLE_PRIORITY             task::self().set_group_priority( priority_normal ); #endif         }         i++;         std::cout << i << "/" << NUM_TIME_STEPS << std::endl;     }     std::cout << "Simulation finished." << std::endl;     return 0; } void myFunc( void ) { #if ENABLE_PRIORITY     task::self().set_group_priority( priority_high ); #endif     for( int i = 0 ; i < 10 ; i++ ) {         /* make a copy of the agent and grid data at the beginning of this sub-step */         parallel_for( blocked_range<int> ( 0, 38 ), [&]( const blocked_range<int>& r ) {         for( int ii = r.begin() ; ii < r.end() ; ii++ ) {             parallel_for( blocked_range<int> ( 0, 8 ), [&]( const blocked_range<int>& r2 ) {             for( int jj = r2.begin() ; jj < r2.end() ; jj++ ) {             }             } );         }         } );     }     return; }

Jim Dempsey

I have tried this only on Linux, so one possibility is that this problem might be solved for Windows only.

Thanks,

Quote:

Raf Schietekat wrote:

One thing that's immediately striking, and I've been here before but I may have abandoned it at the time (I just don't remember at this moment), is how "volatile intptr_t* tbb::internal::generic_scheduler::my_ref_top_priority" can reference either "volatile intptr_t tbb::internal::arena_base::my_top_priority" (master) or "intptr_t tbb::internal::market::my_global_top_priority" (worker): perhaps the non-volatile market member variable will work, perhaps not, but I think that it should all be tbb::atomic<intptr_t>, at least because a volatile is only a poor man's atomic anyway (silently relies on cooperation from both compiler and hardware for elementary atomic behaviour, nonportably implied memory semantics on some compilers but none by the Standard, no RMW operations), but also because there's no help from the compiler to avoid such confusion between volatile and non-volatile (which may temporarily reside in registers).

I have no idea at this time whether this might be the cause of the problem or just a red herring, but I would get rid of the volatile even if only as a matter of principle. Actually all uses of volatile in TBB, or at least those that are not burdened with backward compatibility.

(Added) It would seem that the market's my_global_top_priority is protected by my_arenas_list_mutex for writing, but, strictly speaking, although the new value should now not be hidden indefinitely, it's not quite the same as a read-write mutex, especially if there's a breach caused by the alias. Luckily my_ref_top_priority's referent can be made const, so that's not an issue.

(Added) Does the problem occur, e.g., on both Linux and Windows, or only on Linux? If so, it would mean that the code relies on nonportable memory semantics associated with the use of "volatile". Otherwise it's something else, or that and something else.

(Added) TODO: check that update_global_top_priority() is only called by a thread that holds a lock on my_arenas_list_mutex. These things should be documented...

Looking further (and I may add to this posting)...

Try this:

remove my edits (go back to your posted code)
remove the main thread's messing with priority (leave as-is from init time)
Add a chunksize argument to your two inner parallel_for's such that the product of the two partitionings is less than the total number of threads in the thread pool. This is not a fix but simply a diagnostic.

(0, 38, 13)     - 3 tasks (threads)
(0, 8, 4)        - 2 tasks (threads) x 3 tasks == 6 of the 8 in the init.

 See if the code locks up.
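For clarity, a sketch of that diagnostic change against the reproducer above (the grainsize is the third blocked_range argument; the exact task count still depends on the partitioner, so treat the figures above as approximate):

/* Diagnostic sketch only, not a fix: the three-argument blocked_range
   constructor takes a grainsize that limits how finely the range is split. */
parallel_for( blocked_range<int> ( 0, 38, 13 ), [&]( const blocked_range<int>& r ) {
    for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
        parallel_for( blocked_range<int> ( 0, 8, 4 ), [&]( const blocked_range<int>& r2 ) {
            for( int jj = r2.begin() ; jj < r2.end() ; jj++ ) {
            }
        } );
    }
} );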

------------------------

It is unclear as to what your expectations are with changing the priorities. These priorities are not system thread priorities. Instead these are task level priorities.

Jim Dempsey

www.quickthreadprogramming.com

This still fails.

"It is unclear as to what your expectations are with changing the priorities. These priorities are not system thread priorities. Instead these are task level priorities."

The code I attached is just a reproducer. What I have in the real code is three task groups doing different work.

Quote:

jimdempseyatthecove wrote:

Try this:

remove my edits (go back to your posted code)
remove the main thread's messing with priority (leave as-is from init time)
Add a chunksize argument to your two inner parallel_for's such that the product of the two partitionings is less than the total number of threads in the thread pool. This is not a fix but simply a diagnostic.

(0, 38, 13)     - 3 tasks (threads)
(0, 8, 4)        - 2 tasks (threads) x 3 tasks == 6 of the 8 in the init.

 See if the code locks up.

------------------------

It is unclear as to what your expectations are with changing the priorities. These priorities are not system thread priorities. Instead these are task level priorities.

Jim Dempsey

Quote:

Raf Schietekat wrote:

"volatile intptr_t* tbb::internal::generic_scheduler::my_ref_top_priority" can reference either "volatile intptr_t tbb::internal::arena_base::my_top_priority" (master) or "intptr_t tbb::internal::market::my_global_top_priority" (worker)

I forgot one: "intptr_t tbb::task_group_context::my_priority". This may be concurrently updated, which by itself probably also requires this to be an atomic variable instead of an ordinary one.

Quote:

Raf Schietekat wrote:

(Added) Does the problem occur, e.g., on both Linux and Windows, or only on Linux? If so, it would mean that the code relies on nonportable memory semantics associated with the use of "volatile". Otherwise it's something else, or that and something else.

Apparently Vladimir Polin had already reproduced it on Windows.

(Added) There is also a potential issue that set_priority() is not serialised, i.e., it might happen that, with A ancestor of B ancestor of C, A has new priority a, B has new priority b, but C also has new priority a (instead of b). This is from inspecting the source code (not confirmed); I don't know whether it is relevant here.

(Added) This is seriously going to take longer than five minutes to figure out...

This is still failing with 4.2 update 1. The reproducer code is still producing the same assertion failure. I hope this gets fixed sometime soon.

Sorry, but not in the next release or two. It might be a case of a bad assertion only, i.e., the logic behind task priorities probably works correctly. And thanks for the reminder; it will eventually get into our focus.

Quote:

Anton Malakhov (Intel) wrote:

Sorry, but not in the next release or two. It might be a case of a bad assertion only, i.e., the logic behind task priorities probably works correctly. And thanks for the reminder; it will eventually get into our focus.

The stall issue is still there, so it's unlikely that this is just a bad assertion. I have also observed the __TBB_ASSERT failure inside assert_market_valid() in market.h, and another assertion failure I don't remember. These happen significantly less frequently. It is pretty likely there is a data synchronization issue when priority is involved.

If you execute the code below with more than one MPI process, you will see the stall issue pretty frequently (approx. once a minute on my system), but I haven't observed the stall problem once I set ENABLE_PRIORITY to 0.

Another problem is that CPU usage goes to max (e.g. assuming there are 64 hardware threads and two MPI processes, the CPU utilization for each process reaches 3200% even though I set the number of threads to 8).

I hope this can help you to fix the problem.

#include <assert.h>
#include <iostream>
#include <vector>
#include <mpi.h>
#include <tbb/tbb.h>
using namespace std;
using namespace tbb;
#define ENABLE_PRIORITY 1
const int NUM_THREADS = 8;
const int MAIN_LOOP_CNT = 10000;
int g_rank;
MPI_Comm g_mpiCommWorldDefault;
MPI_Comm g_mpiCommWorldLow;
MPI_Comm g_mpiCommWorldHigh;
void computeLowPrior( void );
void updateHighPrior( void );
int main( int argc, char* ap_args[] ) {
    task_group highPriorGroup;
    task_group lowPriorGroup;
    int level;
    int ret;
    /* initialize MPI */
    ret = MPI_Init_thread( NULL, NULL, MPI_THREAD_MULTIPLE, &level );
    assert( ret == MPI_SUCCESS );
    assert( level == MPI_THREAD_MULTIPLE );
    ret = MPI_Comm_dup( MPI_COMM_WORLD, &g_mpiCommWorldDefault );
    assert( ret == MPI_SUCCESS );
    ret = MPI_Comm_dup( MPI_COMM_WORLD, &g_mpiCommWorldLow );
    assert( ret == MPI_SUCCESS );
    ret = MPI_Comm_dup( MPI_COMM_WORLD, &g_mpiCommWorldHigh );
    assert( ret == MPI_SUCCESS );
    ret = MPI_Comm_rank( g_mpiCommWorldDefault, &g_rank );
    assert( ret == MPI_SUCCESS );
    /* initialize TBB */
    task_scheduler_init init( NUM_THREADS );
    /* main computation */
    for( int i = 0 ; i < MAIN_LOOP_CNT ; i++ ) {
#if ENABLE_PRIORITY
        task::self().set_group_priority( priority_normal );
#endif
        highPriorGroup.run( []{ updateHighPrior(); } );
        lowPriorGroup.run( []{ computeLowPrior(); } );
        lowPriorGroup.wait();
        highPriorGroup.wait();
        ret = MPI_Barrier( g_mpiCommWorldDefault );
        assert( ret == MPI_SUCCESS );
        if( g_rank == 0 ) {
            cout << "loop " << i << " finished." << endl;
        }
    }
    return 0;
}
void computeLowPrior( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_low );
#endif
    int ret;
    /* compute */
    parallel_for( blocked_range<int> ( 0, 10000, 1 ), [&]( const blocked_range<int>& r ) {
    for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
    }
    } );
    /* communicate */
    ret = MPI_Barrier( g_mpiCommWorldLow );
    assert( ret == MPI_SUCCESS );
    /* compute */
    parallel_for( blocked_range<int> ( 0, 10000, 1 ), [&]( const blocked_range<int>& r ) {
    for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
    }
    } );
    return;
}
void updateHighPrior( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_high );
#endif
    for( int i = 0 ; i < 1000 ; i++ ) {
        int ret;
        /* compute */
        parallel_for( blocked_range<int> ( 0, 10000, 1 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
        }
        } );
        /* communicate */
        ret = MPI_Barrier( g_mpiCommWorldHigh );
        assert( ret == MPI_SUCCESS );
    }
    return;
}

Thanks, it really helps to understand the issue.

First of all, I'd admit there is something non-optimal in the task priorities implementation.

But let's set that aside for a second, because this reproducer clearly shows that task priority is not the culprit... but rather a catalyst for the problem in your code.

It seems like nobody paid enough attention to the fact that your problem statement involves 4 MPI processes... and you use MPI barriers from inside TBB tasks. It should be rule #1, written in capital letters, in the Reference: make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing). Or, similar to what the Cilk guys say about TLS: friends, tell friends DO NOT USE BARRIERS WITH TBB.

Basically, you expect either mandatory parallelism or that two different TBB invocations would behave the same way with respect to the sequence of code execution. That is not the case, and it is thus perfectly detected by these barriers.

So, I think your usage model should be fixed for better conformance with TBB application design rules. Another issue with the code: unfortunately, the constructor of task_group provokes default initialization of TBB, so the task_scheduler_init concurrency was not respected (I've filed a bug report for this).
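A hedged illustration of that initialization-order point, based on the reproducer above rather than on any documented guarantee: the first scheduler initialization in a thread wins, so a task_scheduler_init constructed after a task_group has already auto-initialized the scheduler does not change the thread count.

/* Sketch only; names (NUM_THREADS, the task_group objects) come from the
   reproducer above. Problematic order, as in the reproducer:
       task_group highPriorGroup;                // scheduler auto-initialized here
       ...
       task_scheduler_init init( NUM_THREADS );  // too late, concurrency ignored
   Order that respects the requested concurrency: */
task_scheduler_init init( NUM_THREADS );          // initialize the scheduler first
task_group highPriorGroup;                        // now created inside the requested pool
task_group lowPriorGroup;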

As for what's wrong with TBB, I've modified your reproducer to remove MPI and to set a barrier between "computeLowPrior" and "updateHighPrior". It halts because the former is stuck waiting for the low-priority parallel_for, which is blocked because other threads hunt after higher-priority tasks. This does not seem to contradict what TBB promises, though it looks non-optimal.

Quote:

Anton Malakhov (Intel) wrote:

Thanks, it really helps to understand the issue.

First of all, I'd admit there is something non-optimal in the task priorities implementation.

But let's set that aside for a second, because this reproducer clearly shows that task priority is not the culprit... but rather a catalyst for the problem in your code.

It seems like nobody paid enough attention to the fact that your problem statement involves 4 MPI processes... and you use MPI barriers from inside TBB tasks. It should be rule #1, written in capital letters, in the Reference: make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing). Or, similar to what the Cilk guys say about TLS: friends, tell friends DO NOT USE BARRIERS WITH TBB.

Basically, you expect either mandatory parallelism or that two different TBB invocations would behave the same way with respect to the sequence of code execution. That is not the case, and it is thus perfectly detected by these barriers.

So, I think your usage model should be fixed for better conformance with TBB application design rules. Another issue with the code: unfortunately, the constructor of task_group provokes default initialization of TBB, so the task_scheduler_init concurrency was not respected (I've filed a bug report for this).

As for what's wrong with TBB, I've modified your reproducer to remove MPI and to set a barrier between "computeLowPrior" and "updateHighPrior". It halts because the former is stuck waiting for the low-priority parallel_for, which is blocked because other threads hunt after higher-priority tasks. This does not seem to contradict what TBB promises, though it looks non-optimal.

Thank you very much for the comments, and I have several follow-up questions,

"make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing)"

Is this a correctness requirement or a performance requirement? I understand the performance impact of invoking a barrier inside a TBB thread, and, at least if I disable priority and there are two or more TBB threads, there is no deadlock issue. Does TBB have its own correctness requirement in this regard?

" It halts because the former stuck waiting for low-priority parallel_for which is blocked because other threads hunt after higher priority tasks. It seems like not contradicting to what TBB promises though look non-optimal."

I don't fully understand this. One TBB worker thread may block on a barrier and become unavailable, but other TBB threads should do work if there is available work, yet it seems like this is not what happens. Are there additional caveats regarding this? My understanding is that if one thread blocks, only that thread should become unavailable for work, instead of making the whole thread pool unavailable, but it seems like this is not the way TBB behaves.

Thank you very much,

A drop in performance because of blocking can go all the way to zero, so it's generally too risky to mess with that even if a mere drop in performance is still acceptable to your application (mistakes like wrong location of task_scheduler_init, changes in runtime environment with default task_scheduler_init, composition with other code, maintenance issues, ...).

I'm surprised too that the scheduler literally "postpones execution of tasks with lower priority until all higher priority task are executed" (and I think that should be "higher-priority tasks"). What's the reason for that?

But I thought there already was a reproducer without MPI barriers, in #12?

Hello,

you can use synchronization API.

The shared queue does not guarantee precise first-in first-out behavior. Tasks that use a 3rd-party barrier might be executed by the same thread in a different order, so it can cause a deadlock. Look at the Scheduling Algorithm chapter of the Reference manual.

--Vladimir

Quote:

Seunghwa Kang wrote:

Quote:

"make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing)"

Is this a correctness requirement or a performance requirement? I understand the performance impact of invoking a barrier inside a TBB thread, and, at least if I disable priority and there are two or more TBB threads, there is no deadlock issue. Does TBB have its own correctness requirement in this regard?

It is a correctness requirement; please see the caution in the Task Scheduler section of the Reference manual (http://software.intel.com/en-us/node/468188):

CAUTION

There is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads. For example, given a single worker thread, the scheduler creates no actual parallelism. For example, it is generally unsafe to use tasks in a producer consumer relationship, because there is no guarantee that the consumer runs at all while the producer is running.

So, any relations between TBB tasks must be expressed to the scheduler in TBB terms.

Quote:

Seunghwa Kang wrote:

Quote:

" It halts because the former stuck waiting for low-priority parallel_for which is blocked because other threads hunt after higher priority tasks. It seems like not contradicting to what TBB promises though look non-optimal."

I don't fully understand this. One TBB worker thread may block on a barrier and become unavailable, but other TBB threads should do work if there is available work, yet it seems like this is not what happens. Are there additional caveats regarding this? My understanding is that if one thread blocks, only that thread should become unavailable for work, instead of making the whole thread pool unavailable, but it seems like this is not the way TBB behaves.

The current task priority implementation is done in such a way that until all higher-priority work is finished, lower-priority work will not be executed. TBB tasks are not supposed to be used for long blocking waits.

More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it. For the given use case it could be worked around by, e.g., copying the last '#if __TBB_TASK_PRIORITY' section from the arena::process() method of arena.cpp into the end of the local_wait_for_all() method of custom_scheduler.h (with corresponding inter-class corrections like removing 's.' from some members and adding 'my_arena->' to others). But it does not solve the root-cause of the issue and probably not likely to happen in production release.

Quote:

Vladimir Polin (Intel) wrote:

Hello,

you can use synchronization API.

The shared queue does not guarantee precise first-in first-out behavior. Tasks that use a 3rd-party barrier might be executed by the same thread in a different order, so it can cause a deadlock. Look at the Scheduling Algorithm chapter of the Reference manual.

--Vladimir

Thank you very much for the answer but I don't understand this.

The MPI_Barrier is a barrier between two (or more) different MPI processes, and I am not sure how TBB synchronization functions can achieve synchronization across two different MPI processes.

And I am not sure what the difference is between invoking MPI_Barrier and putting in some sleep function or intensive loops. As far as I understand, TBB does not know whether I call MPI_Barrier or not, and an MPI_Barrier invocation will just cause the thread to block for some time, and I don't understand "Tasks that use a 3rd-party barrier might be executed by the same thread in a different order, so it can cause a deadlock." In my understanding, nothing gets queued by invoking MPI_Barrier.

 

Quote:

Anton Malakhov (Intel) wrote:

More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it. For the given use case it could be worked around by, e.g., copying the last '#if __TBB_TASK_PRIORITY' section from the arena::process() method of arena.cpp into the end of the local_wait_for_all() method of custom_scheduler.h (with corresponding inter-class corrections like removing 's.' from some members and adding 'my_arena->' to others). But it does not solve the root-cause of the issue and probably not likely to happen in production release.

This aligns with my experience and the debugger output.

"But it does not solve the root-cause of the issue."

I assume the root cause is violating "There is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads. For example, given a single worker thread, the scheduler creates no actual parallelism. For example, it is generally unsafe to use tasks in a producer consumer relationship, because there is no guarantee that the consumer runs at all while the producer is running."

But I am curious about the rationale behind this. Clearly if there is only one worker thread, there can be a deadlock, and I am explicitly taking care of this. If the number of worker threads is smaller than the number of potentially blocking threads, I don't run those in parallel. And for most current and future microprocessors (e.g. Xeon Phi) with many cores, I am not sure this is something that really needs to be enforced. And as I remember, there is a TBB example for using priority saying, a high priority task waits for user inputs, and a low priority task performs background computation, but this also violates the requirement.

and from the same page, "they should generally avoid making calls that might block for long periods, because meanwhile that thread is precluded from servicing other tasks."

and this is something I am fully aware of.

To work around this, I can spawn pthreads instead of task groups and invoke parallel_for inside the spawned pthreads (so as to block only within the spawned pthread, not inside TBB tasks), but I am not sure I really should add this additional step. And will this fix the issue you described ("More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it.")?
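A minimal sketch of that workaround (std::thread is used here for brevity and is an assumption about the toolchain; pthread_create would serve the same purpose; updateHighPrior/computeLowPrior are the functions from the reproducer above). Each application thread gets its own implicit arena, so the blocking MPI call ties up only that thread:

/* Minimal sketch of the workaround described above. */
#include <thread>

std::thread highThread( []{ updateHighPrior(); } );  // parallel_for + MPI_Barrier inside
std::thread lowThread(  []{ computeLowPrior(); } );
lowThread.join();
highThread.join();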

"and probably not likely to happen in production release."

and what is the production release? Code compiled without the debug option? The program behavior does not change much whether I use the debug version or the release version in my experience. Or are you talking about the commercial version of TBB? As far as I am aware, there is no difference between the free and commercial versions of TBB.

Thank you very much,

Quote:

Raf Schietekat wrote:

I'm surprised too that the scheduler literally "postpones execution of tasks with lower priority until all higher priority task are executed" (and I think that should be "higher-priority tasks"). What's the reason for that?

Quote:

Anton Malakhov (Intel) wrote:

But it does not solve the root-cause of the issue and probably not likely to happen in production release.

Still wondering? Why shouldn't this simply be the last scheduling option instead of just wasting time? If priorities are set at the task_group_context level, why should all the workers in the current arena suddenly be blinded to lower-priority ones even after verifying that there is nothing else to do? Would it cause some form of priority inversion perhaps?

Quote:

Raf Schietekat wrote:

But I thought there already was a reproducer without MPI barriers, in #12?

Still wondering? I see that Vladimir acknowledged it in #13, Michel chimed in in #14, so that's when I started my (idle?) speculation, and Anton threw cold water on any hope for a quick solution in #24. Is it a different issue or what?

Quote:

Seunghwa Kang wrote:

But I am curious about the rationale behind this. Clearly if there is only one worker thread, there can be a deadlock, and I am explicitly taking care of this. If the number of worker threads is smaller than the number of potentially blocking threads, I don't run those in parallel. And for most current and future microprocessors (e.g. Xeon Phi) with many cores, I am not sure this is something that really needs to be enforced. And as I remember, there is a TBB example for using priority saying, a high priority task waits for user inputs, and a low priority task performs background computation, but this also violates the requirement.

How are you "explicitly taking care of this"? Maybe just not in #25?

I don't think that not allowing a program to require concurrency is just a matter of availability of resources, but rather a valuable principle for debugging and probably compositing. (My only problem is with the wasted resources by ignoring lower-priority work.)

What example is that? Does it imply concurrent execution of background computation?

Quote:

Seunghwa Kang wrote:

To work around this, I can spawn pthreads instead of task groups and invoke parallel_for inside the spawned pthreads (so as to block only within the spawned pthread, not inside TBB tasks), but I am not sure I really should add this additional step. And will this fix the issue you described ("More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it.")?

Each application thread gets its own arena, and there is no catastrophic interference between arenas because only fully unemployed worker threads migrate between them (although becoming unemployed might take a while). So there's one example where available parallelism might well be lower than physical parallelism, but that shouldn't be a problem if the program is according to TBB design rules. It may or may not be a problem for you that this also means that priorities don't work across arenas except to help decide where unemployed workers should migrate to, if my understanding is still correct there ("Though masters with lower priority tasks may be left without workers, the master threads are never stalled themselves.").

Quote:

Seunghwa Kang wrote:

and what is the production release? Code compiled without the debug option? The program behavior does not change much whether I use the debug version or the release version in my experience. Or are you talking about the commercial version of TBB? As long as I am aware of, there is no difference between the free and commercial versions of TBB.

I think this was about private experiment vs. downloadable release (even non-stable), not debug vs. release.

Quote:

Seunghwa Kang wrote:

Quote:

"But it does not solve the root-cause of the issue."

I assume the root cause is violating "There is no guarantee that potentially parallel tasks actually execute in parallel..."

But I am curious about the rationale behind this.

Raf is right. The rationale for such optional parallelism is composability... a worker thread can, at any moment, join a parallel execution or leave it after completing a task (at the outer level).

Quote:

Seunghwa Kang wrote:

To work around this, I can spawn pthreads instead of task groups and invoke parallel_for inside the spawned pthreads (so as to block only within the spawned pthread, not inside TBB tasks), but I am not sure I really should add this additional step. And will this fix the issue you described ("More specifically...

It will work, since the tasks of different parallel_for's will be isolated and no stealing will be possible between them. Please also consider using the tbb::task_arena [CPF] feature, as it allows creating the same isolated arenas without creating an additional thread.
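A hedged sketch of the task_arena suggestion; task_arena was a community-preview (CPF) feature at the time, so the preview macro, header, and method set shown here are assumptions that may differ from the actual preview:

/* Hedged sketch only: task_arena was a community preview feature, so the
   macro/header below are assumptions. Each arena has its own task pool, so
   tasks are not stolen across arenas. Names come from the reproducer above. */
#define TBB_PREVIEW_TASK_ARENA 1
#include <tbb/task_arena.h>

tbb::task_arena lowArena( NUM_THREADS );
tbb::task_arena highArena( NUM_THREADS );

// Used from the two task_group bodies instead of calling the functions directly:
//     highPriorGroup.run( [&]{ highArena.execute( []{ updateHighPrior(); } ); } );
//     lowPriorGroup.run(  [&]{ lowArena.execute(  []{ computeLowPrior();  } ); } );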

Quote:

Seunghwa Kang wrote:

"and probably not likely to happen in production release."

and what is the production release? 

Raf is right; I suggested a private workaround for the TBB library which is not likely to happen in the official version (until we are 100% sure how to do it).

Quote:

Raf Schietekat wrote:

Still wondering? Why shouldn't this simply be the last scheduling option instead of just wasting time? If priorities are set at the task_group_context level, why should all the workers in the current arena suddenly be blinded to lower-priority ones even after verifying that there is nothing else to do? Would it cause some form of priority inversion perhaps?

We admit this inefficiency, as I said. It was done with medium-to-short task sizes in mind. my_offloaded_tasks is a private member of the dispatcher and cannot be made visible to others until the owner makes it so itself. However, taking lower-priority tasks should somehow be limited by high-priority tasks anyway; otherwise, priorities will not work as intended. I'm thinking about an implementation of the is_out_of_work() method which will be able to grab offloaded tasks if they are blocked in workers busy with other work.

Quote:

Raf Schietekat wrote:

Still wondering? I see that Vladimir acknowledged it in #13, Michel chimed in in #14, so that's when I started my (idle?) speculation, and Anton threw cold water on any hope for a quick solution in #24. Is it a different issue or what?

It turned out to be two separate issues. The actual problem was the deadlock. The assertion failure was suspected as a culprit, but they are not connected.

Thanks for answering the rest of the questions.

Quote:

Raf Schietekat wrote:

How are you "explicitly taking care of this"? Maybe just not in #25?

I don't think that not allowing a program to require concurrency is just a matter of availability of resources, but rather a valuable principle for debugging and probably compositing. (My only problem is with the wasted resources by ignoring lower-priority work.)

What example is that? Does it imply concurrent execution of background computation?

1. How are you "explicitly taking care of this"? Maybe just not in #25?

if( #_threads < 2 ) {
    updateHighPrior();
    computeLowPrior();
}
else {
    highPriorGroup.run( []{ updateHighPrior(); } );
    lowPriorGroup.run( []{ computeLowPrior(); } );
    lowPriorGroup.wait();
    highPriorGroup.wait();
}

2. Composability clearly matters, but I think the composability in a main program and the composability issue in a library program need to be considered separately. For library programs to be used by arbitrary users, avoiding any potential composability issues could be highly desirable, but for a program with more limited use cases, forcing a program to block in only a single-threaded mode can cause more problems. And for TBB to be used in a broader context, blocking should not cause unexpected side effects besides just making the blocking thread unavailable (e.g. worker threads just idling and not doing any work even when there are available worker threads and available work). And it seems like the suggested update will fix this issue.

3. To explain a bit about the application: this is a simulation program, and each MPI process is responsible for sub-partitions of the entire simulation domain. The program runs on a cluster computer with multiple nodes, and there are data exchanges at the sub-partition boundaries. The barriers in the sample code are actually data exchanges in the real program. And forcing data exchanges to occur in only one thread can limit algorithm design in many high-performance computing applications.

First of all a correction: I shouldn't have used "compositing" instead of "composing".

I don't fully understand the previous posting (other than the formatting mishaps), or even the need for priorities in this program, but even if that lower-priority work were still available for second-hand stealing (or whatever that would be called) it's still stretching the main purpose of TBB (which is efficient execution of CPU-bound work while transparently adapting to available parallelism) if you're doing things that require a certain level of concurrency. Sometimes it works, sometimes it becomes... complicated.

It does not seem to be in the cards for TBB to also become a self-supported reactive-programming toolkit. If you want to use a synchronous API for MPI, you should probably do that with plain application threads anyway. Otherwise it depends on your needs whether you should combine TBB with something else or use another toolkit instead, I think.

(Added 2013-12-02) Note that the "something else" above could be either rolling your own solution with an application thread to handle all the asynchronous stuff or using another toolkit for that, together with TBB. You can then hook into TBB by using a continuation and a dummy child to simulate the blocking without actually blocking (spawn the child to execute the continuation). I know it's tempting to second-guess the prohibition against blocking by using platform-specific assumptions or cheating with more threads specified in task_scheduler_init, and maybe composability is not a good-enough reason against that in non-library code, and you may get away with it some or even a lot of the time, but what you also get is new and exciting opportunities to get into trouble.
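A rough sketch of the continuation-plus-dummy-child pattern mentioned above, using the classic tbb::task API of that era; MyTask and register_async_completion() are hypothetical stand-ins for the application's task and for however it learns that the asynchronous event (e.g. an MPI exchange) has finished:

/* Rough sketch, classic tbb::task API; MyTask and register_async_completion()
   are hypothetical stand-ins, not an existing API. */
#include <functional>
#include <tbb/tbb.h>

void register_async_completion( std::function<void()> );   // hypothetical hook

class Continuation : public tbb::task {
public:
    tbb::task* execute() {              // runs only after the dummy child completes
        /* ... work that logically follows the asynchronous event ... */
        return NULL;
    }
};

class MyTask : public tbb::task {       // hypothetical task that used to block
public:
    tbb::task* execute() {
        Continuation& c = *new( allocate_continuation() ) Continuation;
        c.set_ref_count( 1 );           // one outstanding (dummy) child
        tbb::empty_task* dummy = new( c.allocate_child() ) tbb::empty_task;
        // when the asynchronous event fires, running the dummy child releases
        // the continuation; enqueue() is safe from a non-TBB callback thread
        register_async_completion( [dummy]{ tbb::task::enqueue( *dummy ); } );
        return NULL;                    // return without blocking a worker thread
    }
};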
