CPU not used at 100%

Hi everyone,

I wonder whether there is a bug in the TBB library. Let me explain:

I have a long computation to run over 1,000,000 objects. For this, I use the parallel_for paradigm like this:

size_t size = 1000000;
tbb::parallel_for(tbb::blocked_range<size_t>(0, size),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i=r.begin(); i != r.end(); ++i)
            {
                //long computation here
            }
        }); // closes the lambda and the parallel_for call

When I launch the program, sometimes all 8 of my cores run at 100%, which is great!

But sometimes my threads lose CPU utilization partway through (I am attaching a chart of my CPU usage, which should make this clearer).

Can you explain why? Thanks :)

PS: I use Windows 10 with an Intel i7 (8 cores)

 

Attachment: cpu_usage.png (image/png, 19.77 KB)

Hi,

What sort of "long computation" is performed? Is it pure math or does the algorithm perform some IO operations (read files, sockets and so on)? Do you use some synchronizations, third-party library calls and/or OS API inside the computations?

Regards,
Alex

Hi,

It is pure math operations: no IO, no sockets, and no API calls. There is no synchronization either. It is strange because sometimes all the cores are busy and sometimes they are not...

Any ideas?

EDIT: I tried changing the partitioner, but got the same result. Maybe I made a mistake, because I am not an expert in this.

 

 

Could you provide a complete reproducer and some related details about your environment and use case, please?

Regards,
Alex

Quote:

Alex (Intel) wrote:

Could you provide a complete reproducer and some related details about your environment and use case, please?

Yes, I wrote a small example that reproduces the problem:

   size_t nbCombiFast = 100000;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, nbCombiFast),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i=r.begin(); i != r.end(); ++i)
            {
                std::vector<float> v;
                float r = 0;
                for (size_t o=0; o<10000000; ++o)
                {
                    r += o*sin(o) - cos(o);
                }
                v.push_back(r); // to avoid compiler optimizations...
            }
        }
    );

My environment is:

  • MinGW v5.1 (GCC 5.1 (tdm-1), thread model = posix)
  • Compiler options: Release mode with -O3 for optimizations
  • Windows 10 with an Intel i7-6820HQ @ 2.70 GHz
  • I don't remember the TBB version, but the folder I compiled from was named "tbb2017_20161128oss". There are no prebuilt binaries for the MinGW compiler, so I compiled the TBB library for MinGW myself (maybe the problem is here?)

When I launch this example, sometimes all my cores run at 100% and sometimes it takes a long time to reach 100%...

Thanks

size_t nbCombiFast = 100000;
std::atomic<int64_t> hack(0); // added to eliminate the heap critical section (needs <atomic>)
 tbb::parallel_for(tbb::blocked_range<size_t>(0, nbCombiFast),
     [&](const tbb::blocked_range<size_t>& r) {
         for (size_t i=r.begin(); i != r.end(); ++i)
         {
// remove    std::vector<float> v;
             float r = 0;
             for (size_t o=0; o<10000000; ++o)
             {
                 r += o*sin(o) - cos(o);
             }
// remove    v.push_back(r); // to avoid compiler optimizations...
             hack += (int64_t)r; // non-critical section
         }
     }
 );
if (hack) std::cout << "Won't print"; // avoid meaningless code elimination

If the above eliminates the symptom, this indicates an adverse interaction with the heap.

Note: if you have the TBB scalable allocator activated, your former code may have experienced an excessive number of "first touch" delays as slabs are allocated, touched for initialization, and touched again as each vector grows (in other words, when an allocated node changes size).

To confirm this, place your original code into a function, then call this function twice, with a 10 second sleep function between calls. I expect that your first call will exhibit the existing chart, and the second call will produce the expected chart.

Jim Dempsey

Same problem with your hack atomic variable in place of my std::vector... I am posting the code just to be sure there are no errors.

void func()
{
    tbb::atomic<int64_t> hack = 0;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 100),
    [&](const tbb::blocked_range<size_t>& r) {
        for (size_t i=r.begin(); i != r.end(); ++i)
        {
            float r = 0;
            for (size_t o=0; o<10000000; ++o)
            {
                r += o*sin(o) - cos(o);
            }
            hack += (int64_t)r;
        }
    });
    if (hack) std::cout << "Youpi !";
}

int main()
{
    //tbb::task_scheduler_init init(8);
    auto d = std::chrono::system_clock::now();
    func();
    auto e = std::chrono::system_clock::now();
    auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(e - d).count();
    std::cout << "Milli = " << millis << std::endl;
    Sleep(10000);

    auto d2 = std::chrono::system_clock::now();
    func();
    auto e2 = std::chrono::system_clock::now();
    auto millis2 = std::chrono::duration_cast<std::chrono::milliseconds>(e2 - d2).count();
    std::cout << "Milli = " << millis2 << std::endl;

    return 0;
}

I also ran your second test, but it does not confirm the theory: sometimes the first call is faster than the second, and sometimes the second is faster than the first. How can I tell whether I am using the TBB scalable allocator?

I am attaching a picture of my CPU usage while the program runs. At the start, the CPU is at full speed (100%), and then, I don't know why, the speed drops... and then increases again. It does not have time to reach 100% again because the computation finishes first.

I really don't understand the problem. Do you see the same issue with your CPU?

Thanks !

Is the picture shown for the provided example? Where is the "Sleep(10000)" period (low CPU utilization)? It should span about 2-2.5 rectangles on the X-axis. What are the typical "Milli = " times for the first and second runs in the example?

Regards,
Alex

Hi,

No, the graph shows only the first call because there is no room for the full benchmark in the graph...
Typically, func() takes between 21,000 ms (best case, when the CPU is always at full speed) and 37,000 ms (worst case). The average is about 24,000 ms.

I am attaching another graph with the full benchmark (first and second call), with 10 s between the calls. The first call takes 23363 ms and the second 36261 ms, a big difference in this case.

If I run the benchmark 3 more times, I get:

  • 1st call = 22512 ms / 2nd call = 32042 ms
  • 1st call = 37233 ms / 2nd call = 36134 ms (no luck on this run!)
  • 1st call = 30162 ms / 2nd call = 35332 ms

Thanks

Attachment: bug_tbb2.png (image/png, 26.24 KB)

Alex,

After looking at the chart in #10, I will make an educated guess at what might cause the symptom.

TBB, like virtually all well-written multi-threading systems with a thread pool, tries to be nice to other processes on the system. To this end, each thread suspends itself when it has been unable to find work for some period of time. This suspension is typically a timed wait on a condition variable (pthread or std::condition_variable) or, on Windows, on an event (WaitForSingleObject). The symptom for the second call is indicative of the TBB thread management code NOT signaling the condition variable or event for the other thread(s) when work becomes available. In other words, the additional threads do not run immediately, but rather start up after the timer expires.

Note that this omission includes the situation where the main thread properly notifies one of the waiting threads, but that thread fails to notify the correct other threads. Potentially this could be the result of the second thread notifying itself (or another running thread) as opposed to notifying a waiting thread. In other words, do not simply look at what the main thread does; look deeper at what the woken-up threads do.

Jim Dempsey
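
To make the guess above concrete, here is a minimal sketch in standard C++ (not TBB code) of what a missed notification costs when the worker sleeps in a timed wait. The helper name wake_latency_ms and its parameters are hypothetical; the sketch only assumes that the worker parks in std::condition_variable::wait_for, which may or may not match what TBB does internally.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper, not TBB code: returns how long (in ms) a sleeping
// worker spends in its timed wait before it notices newly published work.
long long wake_latency_ms(bool producer_notifies, std::chrono::milliseconds timeout)
{
    assert(timeout.count() > 0);
    std::mutex m;
    std::condition_variable cv;
    bool work_ready = false;                  // protected by m
    std::atomic<bool> worker_waiting(false);
    long long waited_ms = 0;

    std::thread worker([&] {
        std::unique_lock<std::mutex> lock(m);
        worker_waiting = true;                // set while holding m, just before waiting
        const auto start = std::chrono::steady_clock::now();
        // Timed wait: returns on a notification or, failing that, on timeout.
        cv.wait_for(lock, timeout, [&] { return work_ready; });
        waited_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::steady_clock::now() - start).count();
    });

    while (!worker_waiting) std::this_thread::yield();
    {
        // The worker holds m until wait_for releases it, so this lock is
        // acquired only once the worker is parked inside the timed wait.
        std::lock_guard<std::mutex> lock(m);
        work_ready = true;
    }
    if (producer_notifies) cv.notify_one();   // the step suspected to be missing
    worker.join();
    return waited_ms;
}
```

With a 300 ms timeout, wake_latency_ms(true, ...) should come back after a few milliseconds at most, while wake_latency_ms(false, ...) should take roughly the whole timeout. A delay of that shape on several workers would look like the slow ramp-up in the chart.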

Is there any way to avoid that? It is really bad for my application, which needs full performance.

I do not want to write an entire thread pool for my application...

This is something that needs to be fixed inside TBB. About all you can do now is keep your threads busy. Until a fix comes in, a crude hack is to schedule (number of threads in the pool - 1) low-priority tasks, each of which sleeps 1 ms (or less) and checks a program termination flag (which you set at the end); if termination is not indicated, the task re-schedules itself on the low-priority queue and then exits.

Presumably, when your program is running "hot", these tasks will not get dequeued. Only when a thread enters stealing mode with nothing else to do will it take one of these tasks. You will need to leave yourself a reminder to remove this when the TBB library gets fixed.

Jim Dempsey

Quote:

jimdempseyatthecove wrote:

This is something that needs to be fixed inside TBB. About all you can do now is to keep your threads busy. Until a fix comes in, as a hack (crude hack), is to schedule number of threads in the pool -1 number of low priority tasks that sleeps 1ms (or shorter), checks a program termination flag (you set this at end), if termination not indicated the task schedules itself on the low priority queue then exits.

I am sorry, but I don't understand your solution. Could you provide a short piece of code?

Quote:

jimdempseyatthecove wrote:

You will need to post a reminder to remove this when the TBB library gets fixed.

I am looking forward to seeing this fix :)

Thanks

Note, untested code, crude hack, you are welcome to improve on this


#if defined(USE_LowPrioritySpinTask)
class MyLowPrioritySpinTask : public tbb::task {
    static bool terminateFlag = false;
    static atomic<int> terminated = 0;
    /*override*/ tbb::task* execute() {
        if(terminateFlag)
        {
           ++terminated;
           return NULL;
        }
        tbb::task::enqueue(this, tbb::priority_t::low); // re-queue our task at low priority
        return NULL
    }
  public:
    void terminate() { terminateFlag = true; }
};
#endif

...
#if defined(USE_LowPrioritySpinTask)
int nSpinnerTasks = YourTBBworkerPoolSizeYouDetermineThis() - 1;
vector<MyLowPrioritySpinTask *> vMyLowPrioritySpinTasks;
for(int i=0; i<nSpinnerTasks; ++i)
{
    MyLowPrioritySpinTask * t = new (tbb::task::allocate_root()) MyLowPrioritySpinTask ();
    tbb::task::enqueue(*t, tbb::priority_t::low);
    vMyLowPrioritySpinTasks.push_back(t);
}
#endif

doYourProgramHere();

#if defined(USE_LowPrioritySpinTask)
MyLowPrioritySpinTasks::terminate();
while(MyLowPrioritySpinTasks::terminated < vMyLowPrioritySpinTasks.size())
  mm_pause();
for(int i=0; i<vMyLowPrioritySpinTasks.size(); ++i)
    delete vMyLowPrioritySpinTasks[i];
#endif

*** Caution: the above, as coded, will keep your program 100% active all the time. It is up to you to expand upon this to meet your needs. As an example, you could place a timed sleep at the top of execute(). This will introduce a latency of about half the sleep time in getting your threads going again. The sleep could be made conditional on how much time elapsed between entries: virtually no time means no other work needs to be done, so perform the sleep (remember to re-capture the time/ticks following the sleep).

Jim Dempsey
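
The conditional sleep described in the caution above could be sketched like this. It is standard C++ only and equally untested against TBB; SpinnerThrottle, idle_threshold and sleep_time are hypothetical names, and the threshold values are left for the caller to tune.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of TBB): decides whether a re-queued spinner
// should sleep, based on the time elapsed since its previous entry.
class SpinnerThrottle {
public:
    using clock_type = std::chrono::steady_clock;

    SpinnerThrottle(std::chrono::microseconds idle_threshold,
                    std::chrono::microseconds sleep_time)
        : idle_threshold_(idle_threshold),
          sleep_time_(sleep_time),
          last_entry_(clock_type::now())
    {
        assert(sleep_time.count() >= 0);
    }

    // A very short gap between entries means no other work ran in between.
    bool should_sleep(clock_type::duration gap) const
    {
        return gap < idle_threshold_;
    }

    // Call at the top of the spinner body. Returns true if it slept.
    bool on_entry()
    {
        const bool idle = should_sleep(clock_type::now() - last_entry_);
        if (idle) std::this_thread::sleep_for(sleep_time_);
        last_entry_ = clock_type::now();   // re-capture the time after the sleep
        return idle;
    }

private:
    std::chrono::microseconds idle_threshold_;
    std::chrono::microseconds sleep_time_;
    clock_type::time_point last_entry_;
};
```

The idea would be to keep one SpinnerThrottle per spinner and call on_entry() first thing in the spinner body, so that an idle pool sleeps instead of spinning at 100% while a busy pool pays no sleep latency.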

I will try your solution later. Just a few questions:

1) mm_pause() does not seem to exist in TBB. Do you mean the _mm_pause() function?

2) When you write doYourProgramHere(), I think I have to put this there?

 tbb::parallel_for(tbb::blocked_range<size_t>(0, nbCombiFast),
     [&](const tbb::blocked_range<size_t>& r) {
         for (size_t i=r.begin(); i != r.end(); ++i)
         {
             std::vector<float> v;
             float r = 0;
             for (size_t o=0; o<10000000; ++o)
             {
                 r += o*sin(o) - cos(o);
             }
             v.push_back(r); // to avoid compiler optimizations...
         }
     }
 );

3) And does YourTBBworkerPoolSizeYouDetermineThis() mean nbCombiFast (i.e. 100,000 elements) in my example?

4) Wouldn't it be faster if I re-enqueued all my tasks with a high priority instead of a low one?

5) You said "Caution, the above, as coded, will make your program 100% active all the time."

At first sight, after:

while(MyLowPrioritySpinTasks::terminated < vMyLowPrioritySpinTasks.size())
	mm_pause();

my CPU should no longer be active at 100%, should it? All the tasks will be finished by then. I need CPU usage to drop back down after the big computation in my program.

Thank you very much

 

1) Yes, _mm_pause(). As I said, untested code (may contain typos, missing code, etc.)
2) Yes. main() { init TBB, launch spinners, your code here, terminate spinners }
3) No, this means the number of hardware threads you establish for the TBB thread pool. Your i7-6820HQ has 4 cores and 8 threads, and you would typically use a TBB thread pool of 8 hardware threads. However, there are circumstances where you may want to use fewer or more hardware threads. There may be a TBB function that returns the number of hardware threads used by the TBB thread pool.
4) No.
5) Yes. _mm_pause() is an instruction that relieves instruction cycles (and power, L1 ICache activity) but does not suspend the thread from execution. The thread will be in the run state until it exits the loop (or the O/S preempts the software thread).

The 100% is for the duration from "launch spinners" through "terminate spinners". Please observe that while I showed launching at the start of the program and terminating at the end, you can improve upon this by launching and terminating around the specific sections of your code that exhibit the unnecessary startup delays.

Jim Dempsey
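
On the pool-size question in 3), here is a minimal sketch, assuming the pool is sized to the number of hardware threads. spinner_count and assumed_pool_size are hypothetical names, not TBB functions. In TBB releases of that era, tbb::task_scheduler_init::default_num_threads() reports the default pool size; plain standard C++ is used here so that the sketch does not depend on TBB.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: number of spinner tasks for a pool of `pool_size`
// threads, leaving one thread (the main thread) without a spinner.
unsigned spinner_count(unsigned pool_size)
{
    return pool_size > 1 ? pool_size - 1 : 0;
}

// Assumption: the pool has one thread per hardware thread.
// hardware_concurrency() may return 0 when the count is unknown,
// in which case this falls back to a single thread.
unsigned assumed_pool_size()
{
    const unsigned n = std::thread::hardware_concurrency();
    const unsigned size = n != 0 ? n : 1;
    assert(size >= 1);
    return size;
}
```

For the i7-6820HQ discussed here (4 cores, 8 hardware threads), spinner_count(8) gives 7 spinners, which matches the "pool size - 1" in the hack.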

 

I think I made a mistake while implementing your solution, because I get a crash:

  • In DEBUG mode: the application crashes within TBB with the error "pure virtual method called. Terminate called without an active exception."
  • In RELEASE mode: the application crashes after the "END" and before the "return 0;"

The whole code:

#include <iostream>
#include <tbb/tbb.h>
#include <chrono>

#define USE_LowPrioritySpinTask

#if defined(USE_LowPrioritySpinTask)
class MyLowPrioritySpinTask : public tbb::task {
    static bool terminateFlag;
    /*override*/ tbb::task* execute() {
        if(terminateFlag)
        {
           ++terminated;
           return NULL;
        }
        tbb::task::enqueue(*this, tbb::priority_t::priority_low); // re-queue our task at low priority
        return NULL;
    }
  public:
    static tbb::atomic<int> terminated;
    static void terminate() { terminateFlag = true; }
};

bool MyLowPrioritySpinTask::terminateFlag = false;
tbb::atomic<int> MyLowPrioritySpinTask::terminated(0);

#endif

void func()
{
    tbb::atomic<int64_t> hack = 0;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 100),
    [&](const tbb::blocked_range<size_t>& r) {
        for (size_t i=r.begin(); i != r.end(); ++i)
        {
            float r = 0;
            for (size_t o=0; o<10000000; ++o)
            {
                r += o*sin(o) - cos(o);
            }
            hack += (int64_t)r;
        }
    });
    if (hack) std::cout << "Youpi !";
}

int main()
{
    unsigned int nbThread = 8;

    tbb::task_scheduler_init init(nbThread);

#if defined(USE_LowPrioritySpinTask)
    int nSpinnerTasks = nbThread - 1;
    std::vector<MyLowPrioritySpinTask *> vMyLowPrioritySpinTasks;
    for(int i=0; i<nSpinnerTasks; ++i)
    {
        MyLowPrioritySpinTask * t = new (tbb::task::allocate_root()) MyLowPrioritySpinTask ();
        tbb::task::enqueue(*t, tbb::priority_t::priority_low);
        vMyLowPrioritySpinTasks.push_back(t);
    }
#endif

    // Big Computation here
    auto d = std::chrono::system_clock::now();
    func();
    auto e = std::chrono::system_clock::now();
    auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(e - d).count();
    std::cout << "Milli = " << millis << std::endl;

#if defined(USE_LowPrioritySpinTask)
    MyLowPrioritySpinTask::terminate();
    while(MyLowPrioritySpinTask::terminated < vMyLowPrioritySpinTasks.size())
        _mm_pause();

    std::cout << "DELETING spinners..." << std::endl;
    for(int i=0; i<vMyLowPrioritySpinTasks.size(); ++i)
        delete vMyLowPrioritySpinTasks[i];
#endif

    std::cout << "END" << std::endl;

    return 0;
}

Maybe something is wrong ?

Thanks
 

I tried the code and made several derivations, each resulting in an assert. Then I tried using recycle_as_safe_continuation; this removed the assert. However, when the task was recycled, it went onto the normal priority queue (as opposed to keeping the priority it had). This is much harder than it first appears.

I haven't experimented with task_arena, but I suspect this will have the same symptom (the continued task having higher than low priority).

Does someone else on this forum have any suggestions? (other than wait for a fix)

Jim Dempsey

I will wait for a fix; please let me know when and where it becomes available.

If I have enough time, maybe I will write a custom thread pool for my application, but I really don't want to do this...

Does anyone on this forum have a better idea?

Thanks, especially to Jim Dempsey for helping me.

Quote:

jimdempseyatthecove wrote:

After looking at the chart in #10 I will make an educated guess at what might cause the symptom.

TBB, like virtually all well written multi-threading (w/ thread pool) system, tries to be nice to other processes on the system. To address this, when a (each) thread, after some period of time is unable to find work, it suspends itself. This suspension is typically performed on timed wait on a condition variable (pthread and std::thread/condition_variable, or Windows WaitForSingleEvent). The symptom for the second call is indicative of the TBB thread management code of .NOT. signaling the condition_variable or event for the other thread(s) when work becomes available. IOW the additional threads are not run immediately, but rather startup after the timer expires.

Note, this omission includes the situation whereby the main thread properly notifies one of the waiting threads, but that thread fails to properly notify the correct other threads. Potentially this could be the result of the second thread notifying itself (or other running thread) as opposed to notifying a waiting thread. IOW do not simply look at what the main thread does, but look deeper at what the woken-up threads do.

The TBB runtime keeps a list of sleeping threads and notifies only the threads that are in this list. The logic is OS-agnostic and pretty simple (get from the list: private_server.cpp#L384-L385, notify the threads: private_server.cpp#L393-L394). Therefore, if the guess is correct, the discussed issue should be reproducible on any OS and any machine. I failed to reproduce the issue on several Windows-based machines. Even if the problem is on the TBB side, it is not so simple and is caused by a specific environment (we certainly have quite extensive coverage in our testing).

Regards,
Alex
 

Alex,

From the little I know about TBB, TBB (I assume) recently switched to using C++11 std::thread and, more importantly, std::condition_variable. std::condition_variable has both notify_one() and notify_all() (whereas the former pthread did not have notify_all). Using non-TBB threading, I have experienced similar problems, for which I suspect that notify_all() is not doing its job and may be waking up one, or fewer than all, of the waiting threads, leaving the others to wake up when the timed wait expires. Note that this is an unfounded suspicion on my part, since the TBB scheduler is opaque.

"The logic is OS agnostic and pretty simple" would not be immune to this symptom should TBB be using std::condition_variable::notify_all().

Does TBB use notify_all?

Jim Dempsey
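
The notify_all() suspicion can be checked outside TBB with a small diagnostic along these lines. This is a sketch in standard C++ only; woken_promptly is a hypothetical name, and "promptly" is arbitrarily taken to mean within half of the timed-wait timeout.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical diagnostic, not TBB code: parks `n_waiters` threads in a timed
// wait, publishes work, calls notify_all() once, and returns how many waiters
// resumed within half the timeout of the work being published.
int woken_promptly(int n_waiters, std::chrono::milliseconds timeout)
{
    using clock_type = std::chrono::steady_clock;
    assert(n_waiters > 0);
    std::mutex m;
    std::condition_variable cv;
    bool go = false;                       // protected by m
    int parked = 0;                        // protected by m
    int prompt = 0;                        // protected by m
    clock_type::time_point published{};    // protected by m

    std::vector<std::thread> waiters;
    for (int i = 0; i < n_waiters; ++i) {
        waiters.emplace_back([&] {
            std::unique_lock<std::mutex> lock(m);
            ++parked;
            cv.wait_for(lock, timeout, [&] { return go; });
            // A waiter that only resumed on timeout fails this latency check.
            if (go && clock_type::now() - published < timeout / 2)
                ++prompt;
        });
    }

    // Publish only after every waiter has registered. Each waiter holds m
    // until its wait releases it, so all of them are parked at this point.
    for (;;) {
        {
            std::lock_guard<std::mutex> lock(m);
            if (parked == n_waiters) {
                go = true;
                published = clock_type::now();
                break;
            }
        }
        std::this_thread::yield();
    }
    cv.notify_all();
    for (auto& t : waiters) t.join();
    return prompt;
}
```

If woken_promptly(8, std::chrono::milliseconds(2000)) returned fewer than 8 on the affected machine and toolchain, that would support the notify_all() theory; a result of 8 would point elsewhere.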

Just a message to bump this topic. Could it perhaps be pinned?

I am looking forward to seeing a fix for this problem because I really need threads in my application. Otherwise, I will code my own thread pool.

Thanks a lot

Hi Diedler,

We cannot confirm any issue with CPU usage in TBB. As Alex mentioned above, we tried to reproduce it on several machines and saw no CPU usage problem. I personally tried the attached code on my Win10 laptop (2 cores, 4 threads) and did not see the behavior you described.

The code is derived from yours in message #8 above. I also added a variant that uses C++11 threads instead of TBB (see the USE_TBB macro). I used VS2015 for compilation. Please try it in your environment and see if there is any difference in behavior between TBB and C++11 threads.

Updated: also tried with TDM MinGW-w64 5.1 - no issues either.
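
The attachment itself is not reproduced in this thread. As a rough idea of what a plain C++11 threads variant of the reproducer might look like, here is a sketch; it is not the attached code, and heavy_item and run_with_std_threads are hypothetical names. It splits the range into one fixed chunk per thread, whereas tbb::parallel_for balances the work dynamically.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Same per-item arithmetic as the reproducer posted in this thread.
std::int64_t heavy_item(std::size_t inner_iters)
{
    float r = 0;
    for (std::size_t o = 0; o < inner_iters; ++o) {
        const double x = static_cast<double>(o);
        r += x * std::sin(x) - std::cos(x);
    }
    return static_cast<std::int64_t>(r);
}

// Hypothetical plain-thread counterpart of the tbb::parallel_for loop:
// splits [0, n_items) into one contiguous chunk per thread.
std::int64_t run_with_std_threads(std::size_t n_items, unsigned n_threads,
                                  std::size_t inner_iters)
{
    assert(n_threads > 0);
    std::atomic<std::int64_t> total(0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (n_items + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(n_items, t * chunk);
        const std::size_t end = std::min(n_items, begin + chunk);
        pool.emplace_back([&total, begin, end, inner_iters] {
            for (std::size_t i = begin; i != end; ++i)
                total += heavy_item(inner_iters);  // same role as `hack` above
        });
    }
    for (auto& th : pool) th.join();
    return total.load();
}
```

Timing run_with_std_threads(100, 8, 10000000) next to the TBB version on the same machine is the comparison suggested above: if both show the same drop, the scheduler is unlikely to be the cause.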

Hi Alexey,

I tested your program on my laptop and got the same issue. But I have an idea: I noticed that the CPU clock drops randomly during the benchmark. That explains the CPU usage graph in message #8. Maybe I have to disable an option in the BIOS, but I don't know which one...

I also tried on another laptop with an Intel Core i7-7700HQ @ 2.80 GHz and there is no problem! The CPU clock does not drop, and all the settings are at their defaults.

So you are right, it is not an issue in the TBB library. Do you have any ideas?

Thanks

 

I'd suggest looking for options related to Intel(R) Turbo Boost Technology. Also find a tool that shows the CPU temperature: as far as I understand, modern power-saving technologies may automatically drop the CPU frequency if it becomes too hot.

Alexey,

It would seem odd that the MS Performance Monitor would count clock ticks during thread runtime, and then use that against the estimated Turbo-Boost speed. But maybe that is what they do. To me, CPU performance is:

The ratio of all threads busy time to all threads idle time.

not

The ratio of all threads at maximum Turbo-Boost time to all threads at idle time.

Though I can see how the two can get goofed up. Note, there used to be a similar issue a while back when Turbo Boost (anti-Turbo down throttle) was introduced. The fix was to use a fixed timer (memory bus clock if I recall). Maybe this quirk came back for that platform/version of Windows.

Jim Dempsey
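
The clock-drop explanation can also be checked from inside the program, without relying on how a monitoring tool computes its percentage. The sketch below times identical chunks of arithmetic on one thread; time_fixed_chunks and slowdown_ratio are hypothetical names, and a high ratio can come from OS scheduling as well as from frequency changes, so treat it as a hint only.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Times `n_samples` identical chunks of arithmetic on the calling thread
// and returns the duration of each chunk in microseconds.
std::vector<long long> time_fixed_chunks(std::size_t n_samples,
                                         std::size_t iters_per_chunk)
{
    assert(iters_per_chunk > 0);
    std::vector<long long> micros;
    micros.reserve(n_samples);
    volatile double sink = 0;   // keeps the compiler from removing the loop
    for (std::size_t s = 0; s < n_samples; ++s) {
        const auto start = std::chrono::steady_clock::now();
        double r = 0;
        for (std::size_t o = 0; o < iters_per_chunk; ++o) {
            const double x = static_cast<double>(o);
            r += x * std::sin(x) - std::cos(x);
        }
        sink = sink + r;
        micros.push_back(std::chrono::duration_cast<std::chrono::microseconds>(
                             std::chrono::steady_clock::now() - start).count());
    }
    return micros;
}

// Ratio of the slowest to the fastest sample. Values well above 1 suggest
// that the effective speed of the core changed during the run.
double slowdown_ratio(const std::vector<long long>& micros)
{
    if (micros.empty()) return 1.0;
    const auto mm = std::minmax_element(micros.begin(), micros.end());
    const long long fastest = std::max<long long>(*mm.first, 1);  // avoid division by zero
    return static_cast<double>(*mm.second) / static_cast<double>(fastest);
}
```

Running time_fixed_chunks over a period as long as the benchmark and printing slowdown_ratio on both laptops would show whether identical work really takes longer at some moments on the i7-6820HQ machine.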
