Mavericks (OS X 10.9) performance issue with TBB?

In our app testing, we are seeing 25-50% more CPU utilization under OS X 10.9 than under OS X 10.8.5, which under heavy load is causing our app to fail due to missed time-based events.  Using the Instruments Time Profiler on the application under both OSes, we are seeing a large part of this is due to more time being spent down in the OS in the sched_yield call, which is being called from the tbb code.  Here is the OS X 10.8.5 trace:

Running Time        Self  # Self  Self %  Symbol Name
10982.0ms  45.2%     0.0       0       0  thread_start
10982.0ms  45.2%     0.0       0       0  _pthread_start
 5191.0ms  21.4%     6.0       6       0  tbb::internal::rml::private_worker::thread_routine(void*)
 4782.0ms  19.7%   153.0     153     0.6  tbb::internal::market::process(rml::job&)
 4420.0ms  18.2%    44.0      44     0.1  tbb::internal::arena::process(tbb::internal::generic_scheduler&)
 2689.0ms  11.0%  1927.0    1927     7.9  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task(long&, bool)
  687.0ms   2.8%    43.0      43     0.1  sched_yield
  644.0ms   2.6%   644.0     644     2.6  swtch_pri

Here is the OS X 10.9 trace:

Running Time        Self  # Self  Self %  Symbol Name
14104.0ms  54.2%     0.0       0       0  thread_start
14104.0ms  54.2%     0.0       0       0  _pthread_start
14104.0ms  54.2%     0.0       0       0  _pthread_body
 8769.0ms  33.7%     0.0       0       0  tbb::internal::rml::private_worker::thread_routine(void*)
 8751.0ms  33.6%    12.0      12       0  tbb::internal::market::process(rml::job&)
 8730.0ms  33.5%    31.0      31     0.1  tbb::internal::arena::process(tbb::internal::generic_scheduler&)
 7732.0ms  29.7%   959.0     959     3.6  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task(long&, bool)
 6767.0ms  26.0%    70.0      70     0.2  sched_yield
 6697.0ms  25.7%  6697.0    6697    25.7  swtch_pri

So in 10.9, swtch_pri accounts for 25.7% of the sampled time, versus 2.6% in 10.8.5, which is about 23 percentage points more. We've contacted Apple about this as well, but is this issue familiar to anybody?
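
To check whether the yield call itself behaves differently between the two OS versions, independent of TBB, a small timing loop along these lines could be run on both boots. This is only a sketch under assumptions: `YieldTiming` and `time_sched_yield` are names made up for illustration, and the only system call used is the POSIX `sched_yield` from `<sched.h>`. On an otherwise idle machine the call will probably return almost immediately, so the numbers are likely to be meaningful only when other threads are competing for the cores.

```cpp
#include <chrono>
#include <sched.h>  // sched_yield (POSIX)

// Hypothetical helper type: aggregate timing for a batch of sched_yield calls.
struct YieldTiming {
    int calls;        // number of sched_yield calls made
    double total_us;  // summed wall-clock time of all calls, in microseconds
    double max_us;    // slowest single call, in microseconds
};

// Hypothetical helper: call sched_yield `calls` times and time each call
// with a monotonic clock.
inline YieldTiming time_sched_yield(int calls) {
    using clock = std::chrono::steady_clock;
    YieldTiming t{0, 0.0, 0.0};
    for (int i = 0; i < calls; ++i) {
        const auto start = clock::now();
        sched_yield();
        const std::chrono::duration<double, std::micro> elapsed =
            clock::now() - start;
        t.total_us += elapsed.count();
        if (elapsed.count() > t.max_us) t.max_us = elapsed.count();
        ++t.calls;
    }
    return t;
}
```

Comparing `total_us / calls` and `max_us` from the same binary under 10.8.5 and 10.9, with the same background load, would show whether the extra time is per call or comes from more calls being made.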

Thanks,

Jeremy

Vladimir Polin (Intel)'s picture

Is this the same hardware?

It looks like in the second case there is less work to do (more threads or more powerful processors?), so the workers are looking for work more often than in the first case.

--Vladimir

Vladimir,

Thanks for responding.  Same hardware - Mid 2012 MacBook Pro (Retina) with 2.3GHz quad i7, and same software test being run, just switching back and forth between OSes on the same laptop.  So the amount of work should be equivalent with just the OS being changed.

Jeremy

Vladimir Polin (Intel)'s picture

Do you run it on battery or connected to the AC adapter? It might be caused by Timer Coaliasing introduced in 10.9.

--Vladimir

Raf Schietekat's picture

Why would timer coalescing (sic!) behave differently based on the current power source (unless so instructed through the new API)?

I think that opting out by checking Prevent App Nap on the application is a better way to determine this.

Based on the information from various sources (including Apple: "Timer Coalescing minimizes the amount of system maintenance and background work that is performed while your Mac is running on battery power."), it looks like timer coalescing is only enabled when the machine is running on battery power. Also, it seems to me that timer coalescing is independent from App Nap (although they both work with the timers), so switching that off is a good experiment, but I am not sure it will solve the issue if it is related to timer coalescing.

Vladimir Polin (Intel)'s picture

Quote:

Raf Schietekat wrote:

Why would timer coalescing (sic!) behave differently based on the current power source (unless so instructed through the new API)?

It looks like I was not clear enough. I asked specifically about batteries :)

App Nap might also be the cause.

Raf Schietekat's picture

Hmm, I got my information from a very thorough review, which presents Timer Coalescing as part of App Nap, but now I see that's not how Apple has it. Sorry for not having gone straight to the source instead.

I am aware that coalescing is presented as especially relevant when the system is operating from a finite energy source, but I still see no clear indication that it is disabled when the system is being powered from an external power supply, nor do I see why it should.

(2013-11-02 Edited for clarity)

You are correct. Although Apple clearly state that coalescing is turned on when running on batteries, I have not found (and I tried looking) a clear declaration that it is turned off when running on AC.

One reason (I can think of) to turn it off could be that they are a bit worried that their claims that messing around with the timers has no real effect on the application are not completely accurate under all circumstances :-).

jimdempseyatthecove's picture

Raf,

Is there a TBB setting equivalent to OpenMP's KMP_BLOCKTIME? Essentially, extend the task-steal hunt time before issuing sched_yield? Although this will not eliminate the O.P.'s issue of latency on sched_yield (apparently due to timer coalescing), it may reduce the frequency of sched_yield calls or eliminate them (at the expense of unproductive CPU cycles).

Jim Dempsey

www.quickthreadprogramming.com
Raf Schietekat's picture

Well, we still don't know for sure whether and how timer coalescing might be involved in this, or at least I don't, and even less whether it would do any good to keep spinning longer (regardless of power consumption). Are these frequent short yields or less frequent long yields? Is another thread of the application less polite and hogging CPU time? If timer coalescing is causing burstiness, how would this affect the application? Is work spawned or enqueued, and could it be that enqueued tasks would be picked up quicker than spawned tasks are stolen?

But it could always be fun to try and increase the factor 2 in "const int failure_threshold = 2*int(n);" at src/tbb/custom_scheduler.h:269 (or thereabouts), perhaps just to linger until after the end of the current hustle and bustle of coalesced activity (if that is what's going on). If it is increased significantly, and CPU load increases too much, perhaps a slightly higher value for PauseTime might help.

Hey, why does the scheduler pause between stealing attempts? Shouldn't it only pause between stealing rounds, after it has tried a statistically significant fraction of possible victims? Should it also keep track of which possible victims were already probed as part of the same round and use that to increase the fraction to 100%? Or should the pause just gradually increase for a similar effect?

The MacBook Pro was on AC power.  We did try turning off App Nap by checking on the "Prevent App Nap" checkbox in the "Get Info" window for the application, and noticed no difference in performance.  That would have been too easy.  :)

Jeremy

Raf Schietekat's picture

Jim's suggestion was about the time until blocking, but I only addressed the time until yield. For completeness, the time until blocking is controlled by: "const int yield_threshold = 100;" (same file). But I'd suggest experimenting with the factor 2 first.
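
To make the two thresholds concrete, here is a simplified sketch of how an idle worker might move from spinning to yielding to blocking. It is not TBB's scheduler code. Only the names `failure_threshold` and `yield_threshold` and the values 2 and 100 come from the lines quoted above; `Phase`, `IdlePolicy`, `IdleStats`, `idle_until_work`, and `make_policy` are hypothetical, and `n` is assumed to be a worker count.

```cpp
#include <thread>  // std::this_thread::yield

// Hypothetical illustration only; this is NOT TBB's scheduler code.
enum class Phase { Spin, Yield, Block };

struct IdlePolicy {
    int failure_threshold;  // failed steal attempts before yielding starts
    int yield_threshold;    // additional failed attempts before blocking

    // Decide what to do after `failures` consecutive failed steal attempts.
    Phase phase_for(int failures) const {
        if (failures < failure_threshold) return Phase::Spin;
        if (failures < failure_threshold + yield_threshold) return Phase::Yield;
        return Phase::Block;
    }
};

struct IdleStats {
    int spins = 0;         // failed attempts answered by spinning
    int yields = 0;        // failed attempts answered by yielding the CPU
    bool blocked = false;  // true if the loop gave up and would go to sleep
};

// Keep calling try_steal until it succeeds or the policy says to block.
template <typename TrySteal>
IdleStats idle_until_work(const IdlePolicy& policy, TrySteal try_steal) {
    IdleStats stats;
    int failures = 0;
    while (!try_steal()) {
        ++failures;
        switch (policy.phase_for(failures)) {
        case Phase::Spin:
            ++stats.spins;  // a real scheduler would issue a short pause here
            break;
        case Phase::Yield:
            ++stats.yields;
            std::this_thread::yield();
            break;
        case Phase::Block:
            stats.blocked = true;  // a real worker would now sleep until woken
            return stats;
        }
    }
    return stats;
}

// Build a policy for n workers with a tunable multiplier (2 in the quoted line).
inline IdlePolicy make_policy(int n, int factor, int yield_threshold = 100) {
    return IdlePolicy{factor * n, yield_threshold};
}
```

With `make_policy(4, 2)` and a steal that never succeeds, this loop spins 7 times, yields 100 times, and then blocks. Raising the factor to 32 only lengthens the spin phase (127 spins); the number of yields before blocking stays the same, which is why the two constants are worth changing separately.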

The suggestion about App Nap was when I thought that timer coalescing was part of it.

jimdempseyatthecove's picture

Not that this matters to Jeremy, but sched_yield on Windows has a peculiar, counterintuitive characteristic. At least it was this way a few years ago. At that time, sched_yield would yield only to threads (of any application) that had been context-switched out while computing (in other words, preempted). You might say, that's what I want. Well, excluded from the list of candidates to yield to are threads whose wait has completed (e.g. I/O completion or WaitForSingleObject) but which have not yet begun to execute. This had the nasty habit of starving I/O-bound threads when you have a full complement, or over-subscription, of threads that are compute-bound and issuing sched_yield. The solution in this case was to use Sleep(0) instead.

The point of mentioning this Windows peculiarity with respect to OS X is that well-meaning programmers sometimes adopt the same or similar strategies. In this respect, sched_yield on OS X may have a feature with unintended consequences (for Jeremy).

Jim Dempsey 

www.quickthreadprogramming.com
Raf Schietekat's picture

I'm reading that timer coalescing is already established on Linux and Windows (true?), so, if it is causing these problems, it isn't likely to be a TBB issue, hence no reason to experiment with longer spinning after all.

Well, good night, and good luck.

(2013-11-09 Added) And by that (from the movie with the same name) I meant that I've tried to correct and answer some things (mixed with a mistake and some random thoughts), but now I'm out of ideas.

Raf Schietekat's picture

Before I do the upgrade myself (on the same hardware), I'm interested for my own sake as well to know whether this is something with the application or with TBB running on OS X (how do you switch back and forth between O.S. versions, BTW, and don't you then also lose any other data on the machine?). The evidence doesn't seem to point to timer coalescing at the level of TBB (otherwise this would probably have been reported before for Linux and Windows), and I find nothing about sched_yield and Mavericks (despite Jim's account about Windows, that seems rather unlikely to me). TBB is apparently using a lot of CPU time, but, except for a bug in OS X, that should only happen if there's nothing else going on (as already suggested by Vladimir): could it be that the application is polling for events with a timer instead of reacting?

While looking at the scheduler code, I was also wondering why TBB is not eating its own cooking for the task queue. Is that just because concurrent_queue is newer and has not yet been substituted for the simpler task_stream?

Raf,

We are in the process of building a version of the app that uses straight pthreads instead of TBB to help isolate the problem.  We have been playing with the failure_threshold, all the way up to 32x, and haven't noticed any discernible difference in behavior.

It's pretty easy to configure the system to be dual-boot, by creating a new partition using Disk Utility.  You can do this non-destructively by creating a new partition out of free space on your drive.  Then download the Mavericks installer to your boot volume.  By default it will want to upgrade your boot volume, but redirect it to install on your new empty partition.  We found that we needed about 40GB for the OS, our application, and Instruments traces.  Then you use the "Startup Disk" control panel to bounce back and forth.

Thanks,

Jeremy

Vladimir Polin (Intel)'s picture
Raf Schietekat's picture

I just saw that the "Unity guys" have boiled their problem down to an OpenGL issue, so it's unrelated.

I don't suppose that this has ceased to be an issue for TBB with 10.9.2?