TBB tasks -- donation work

TBB tasks -- donation work

Hello,

I have one of those embarrasingly parallel applications that never really is embarrasingly parallel and hit a bit of a weird scheduling issue I don't know how to solve.

 

The basic workflow is that there are many tasks (one per pixel) that run independently. However some of the data they work on is produced lazily. At that point, I want all tasks to hop in and lend their hand as much as possible, so everyone can carry on. This LazyTask is mostly serial, but has some parallel components to it (parallel_for, really). And my problem is that when I run it in isolation, that is, I just run the task from the master thread, I get the expected speedup (see WorksFast in the example). However when I run it such that inside the main parallel_for (over each pixel) I create a task_group which contains just this lazy task and then wait on it, it becomes really slow (see SlowerThanExpected). Actually almost 2x slower than if I don't try to parallelize the lazy task at all (see WantToSpeedUp).

Looking at htop, I can see that when I run the lazy task in isolation, it runs clearly single threaded expect where extra threads are needed. However, in the task_group inside parallel_for scenario, all threads are running all the time (probably hitting the scheduler for tasks that are mostly not there). Is there any good way out of this mess?

I understand that the "WantToSpeedUp" cannot possibly utilize the parallel_for that is inside the LazyTask, since I've blocked all the other tasks. I thought that the 2x slowdown could be at least partially fault of turboboost not kicking in when all the other threads are spinning madly, but I tried to check the frequency in both cases and there isn't that much of a different. I run on 2x 12 core (+ HT, so 48 threads altogether), LazyTask takes about 45s single threaded with 10s in the for loop. When in "WorksFast", it the loop runs in 1.5s and the overall time is 33s. When in "WantToSpeedUp", it runs the same as single threaded. When SlowerThanExpected, the LazyTask takes about 74s altogether, even though the parallel_for does its job and takes only 1.5s. So it seems that the constant pressure of the tasks on the scheduling system is the culprit here, but maybe my pattern is just really bad and can be done in a better way? (The LazyTask cannot be run before the main loop, as there are many different such lazy tasks, all are memory hungry and not all are needed each time).

Thanks,
Tomas

Sample code -- EDIT: I removed the sample code because it contained bugs. The code in the next post is an actual working example.

publicaciones de 11 / 0 nuevos
Último envío
Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.

Just to make things easier and to isolate away possible weirdness in our codebase, I wrote an actual minimal example (as opposed to the pseudocode above).

 

#include <tbb/parallel_sort.h>
#include <tbb/task_group.h>
#include <chrono>
#include <cstdlib>
#include <iostream>


void LazyTask(const char* testName)
{
    size_t runsize = 100000000; // slow -- ~20s
//    size_t runsize = 10000000; // fast -- ~2s

    using time_point = std::chrono::high_resolution_clock::time_point;
    time_point total_start = std::chrono::high_resolution_clock::now();

    std::cout << "Starting " << testName << std::endl;
    srand(777);
    std::vector<float> data;
    for(size_t i = 0; i < runsize; ++i)
        data.push_back(float(rand()) / RAND_MAX);
    for(size_t i = 0; i < runsize; ++i)
        data[i] += (float(rand()) / RAND_MAX);
    for(size_t i = 0; i < runsize; ++i)
        data[i] += (float(rand()) / RAND_MAX);
    time_point sorting_start = std::chrono::high_resolution_clock::now();
    tbb::parallel_sort(data.begin(), data.end());
    time_point sorting_stop = std::chrono::high_resolution_clock::now();
    for(size_t i = 0; i < runsize; ++i)
        data[i] += (float(rand()) / RAND_MAX);
    for(size_t i = 0; i < runsize; ++i)
        data[i] += (float(rand()) / RAND_MAX);
    for(size_t i = 0; i < runsize; ++i)
        data[i] += (float(rand()) / RAND_MAX);
    time_point total_stop = std::chrono::high_resolution_clock::now();
    std::cout << testName <<
                 "; total: " << (std::chrono::duration_cast<std::chrono::milliseconds>(total_stop - total_start).count() / 1000.f) <<
                 "; sorting: " << (std::chrono::duration_cast<std::chrono::milliseconds>(sorting_stop - sorting_start).count() / 1000.f) <<
                 std::endl;
}

void preBuild()
{
    LazyTask("prebuild");
}

void singleThreaded()
{
    bool is_done   = false;
    bool is_acting = false;
    std::mutex mutex;
    std::condition_variable cond;

    tbb::parallel_for<int>(0, 8192,
                            [&](int i)
    {
        std::unique_lock<std::mutex> lock(mutex);
        if(is_done) return;
        if(!is_acting)
        {
            is_acting = true;
            lock.unlock();
            LazyTask("singleThreaded");
            lock.lock();
            is_done = true;
            cond.notify_all();
        }
        else
        {
            while(!is_done)
                cond.wait(lock);
        }
    });
}

void cooperating()
{
    bool is_done   = false;
    bool is_acting = false;
    std::mutex mutex;
    tbb::task_arena arena;
    tbb::task_group group;

    tbb::parallel_for<int>(0, 8192,
                            [&](int i)
    {
        std::unique_lock<std::mutex> lock(mutex);
        if(is_done) return;
        if(!is_acting)
        {
            is_acting = true;
            lock.unlock();
arena.execute([&](){group.run_and_wait([&](){LazyTask("cooperating");});});
            lock.lock();
            is_done = true;
        }
        else
        {
            lock.unlock();
            arena.execute([&](){group.wait();});
        }
    });
}


void test()
{
    using time_point = std::chrono::high_resolution_clock::time_point;
    time_point start, stop;

    start = std::chrono::high_resolution_clock::now();
    preBuild();
    stop = std::chrono::high_resolution_clock::now();
    std::cout << "preBuild -- total: " << (std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() / 1000.f) << std::endl;

    start = std::chrono::high_resolution_clock::now();
    singleThreaded();
    stop = std::chrono::high_resolution_clock::now();
    std::cout << "preBuild -- total: " << (std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() / 1000.f) << std::endl;

    start = std::chrono::high_resolution_clock::now();
    cooperating();
    stop = std::chrono::high_resolution_clock::now();
    std::cout << "preBuild -- total: " << (std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() / 1000.f) << std::endl;
}

/*
One of the outputs from 48 thread machine

Starting prebuild
prebuild; total: 13.403; sorting: 1.422
preBuild -- total: 13.434
Starting singleThreaded
singleThreaded; total: 23.571; sorting: 11.95
preBuild -- total: 23.608
Starting cooperating
cooperating; total: 17.05; sorting: 1.478
preBuild -- total: 17.1
*/

 

I was able to reproduce the issue. Thank you for the reproducer. I'll come back when I have some results.

Regards,
Alex

It seems that std::rand performance depends on if you ever used threads or not. Let’s consider the following example:

#include <chrono>
#include <thread>
#include <iostream>

void test() {
    for (int trail = 0; trail < 3; ++trail) {
        std::chrono::high_resolution_clock::time_point t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 100 * 1000 * 1000; ++i) (void)std::rand();
        std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();

        std::cout << "Trail " << trail.
            << ": time = " << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() / 1000.f
            << std::endl;
    }
}

int main(int argc, char* argv[]) {
    std::cout << "Before thread creation: " << std::endl;
    test();

    std::thread([] {}).join();

    std::cout << "After thread creation: " << std::endl;
    test();

    return 0;
}

We run the same function twice: before thread usage and after. The performance differs in 2 times on my machine:

Before thread creation:
Trail 0: time = 0.631
Trail 1: time = 0.613
Trail 2: time = 0.615
After thread creation:
Trail 0: time = 1.411
Trail 1: time = 1.412
Trail 2: time = 1.41

Regards,
Alex

I cannot re-test this week, but from memory, that wasn't the case.
(and definitely isn't the case for the actual problem, not the toy example)

How the provided example relates to the real code? Because it depends on std::rand performance difference. Do you use std::rand in the production code?

Regards,
Alex

Sorry, must have somehow missed the question.

And it doesn't. The sequential code we run doesn't use rand() at all, but it is really hard to come up with a relevant simple example.
(What it does is a pretty complicated analysis of subdivision surfaces represented by half-edges, so a lot of pointer walking and general memory traffic).

I can confirm that running in isolation, I can see the performance difference. Beats me why rand() would be sensitive to this, it is just a simple LFSR (I am using gcc 6.4).

 

Either way, when the threads are started, the minimal code is fairly symptomatic of what's going on. The TBB threads are "busy waiting" for some work to appear which seems to lower the performance of the one thread that does have work.
I can try to put together an example that would use some other work (memcopy of large chunks of data comes to mind, or going through a large array of ints and doing sum of every 2nd, 3rd, and 4th number, which could mimic the pointer hopping a bit), if you think it would show something interesting, but I cannot give the actual code as a minimal example, as it is 20+ KLOC.

TBB never puts the waiting thread to sleep. Unfortunately, fixing this issue we will affect overall TBB performance, so it is a difficult trade-off either slowdown the whole TBB and put waiting thread to sleep or keep busy waiting but minimize TBB overheads for other cases. Can you estimate how much performance impact do you have in a real use case because of busy wait?

Sure.

The case where things behave well (i.e., when I run it outside another parallel_for), the whole process finishes in 33 seconds, 1.5s of that is on the parallel sort. This is on 2x 12 core (+ HT).
When I run it within the parallel for, the overall time is 74s, the parallel sort still takes only 1.5s
(When I run it just single threaded, it takes 45s and the sort takes 10 seconds).

So the default behavior would be greatly helped by a TBB algorithm, if there was no contention, but with the contention it is actually better to run single threaded than multithreaded.
If this cannot be solved within TBB, we will probably build a second thread pool for the more "coarse" scheduling and oversubscribe, but that's a fairly undesired behavior, as the application is supposed to mostly keep within a user-specificed number of threads.

The rand function is a serializing function. IOW has critical section. When run serially, the adverse interaction is less than when run in parallel, also the aggravation increases as you increase the number of threads. For test cases, you should construct your test case to have as similar code as  your actual code (or use test data together with your actual code). Your test observation in #4 seems to indicate that rand is capable of detecting if the process has ever created an additional thread. If not, then it runs the variant that does not have this critical section, on the other hand, if (when) an additional thread has been created, then this (apparently) sets a one time flag to cause rand to use the variant with the critical section.

Jim Dempsey

 

Unfortunately the real code is really complex and long. I didn't realize rand() had this implications. The slowdown can be shown on other scenarios, and the bottomline is that the threads are actively trying to steal work while there is none to be had, at least disabling turbo boost for the single-threaded portions. And since there is no easy way to turn that behavior around (without major implications all over the actual scheduling), the best course of action is to use a different scheduling for these large-and-mostly-single-threaded parts of the process. Would that be correct?

Deje un comentario

Por favor inicie sesión para agregar un comentario. ¿No es socio? Únase ya