TBB Task Scheduler: Integrating Data Tasks for Data Parallelism (need help)

TBB Task Scheduler: Integrating Data Tasks for Data Parallelism (need help)

AJ's picture

I have spent the past semester researching how to combine data parallelism with TBB, so that TBB can offset data tasks to a GPU when available. My work is nearly complete, and now I need to ask the TBB community for some help. My approach is to separate data parallelism from task parallism, which involves the creation of a new type of task: the data task. For the purposes of my work, a data task represents an operation which can utilize data parallelism. For example, multiplying a large vector by a scalar, or a matrix by a scalar.

There are two work pools in my design. There is the task pool which remains completely controlled by TBB. I also have added a single shared data task pool, from which the GPU will consume data tasks. If a CPU has no work to be done, that is normal TBB tasks to consume, it will steal data tasks from the data task pool. Ordinary TBB tasks can only be executed on the CPU, whereas data tasks may be executed on either the GPU or CPU. A data task is very simple, it only provides an virtual execute() function for now.

I would like to integrate data task with the TBB task scheduler in a simple way. Data tasks can be run from within TBB tasks. I would like to allocate a task that spawns that data tasks, putting them into the data task pool. Rather than blocking until the data task is complete, I would like for the worker thead (which ran the tbb::task that spawned data tasks) to continue to process other TBB tasks as normal, and steal tasks as required. When all data tasks are complete, the allocated parent task is ready to be re-executed.

So if there wasn't a mixture of data tasks and tbb::tasks, I would allocate a parent, spawn children, and wait for all to complete. This problem is slightly more complex, because some of these children are of a different type, and are placed into a data task pool instead of the task pool used by TBB. If a program utilizes a large amount of data parallelism which the GPU can help with, TBB worker threads should steal data tasks for themselves too.

Thanks for your input, and as always I'm on #tbb on irc.freenode.net for a chat.

AJ

42 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.
jimdempseyatthecove's picture

AJ,

I wish to apologize for taking so long to get back with you on this.

Let me preface this by saying that it looks as if you have made good headway in converting TBB to your purposes (combined GPGPU and TBB) and this post is not intended to disuade you from taking that route.

I took the time to adapt a sample program in a programmer's tool I'm working on to do just what you asked to do :). Unfortunately it is not a commercial application just yet (looking for alpha testers).

The test platform is an Intel Q6600 Windows XP x64 with MS VS 2005 C++
I have the Intel C++ compiler too, but have not run the test using that just yet.

For the GPU I have one AMD FireStream 9170 running Brook+/CAL 1.2.1

The test program was taken from thier double_precision_matmult.br sample program. The original program illustrate and compared the performance difference between a double precision matrix multiply between the GPU and a single thread process.

The modifications I made were to add parallel programming to the CPU portion of the code. And to add (combine) the execution of the parallel CPU code with the GPU.
i.e. Parallel CPU paralleled with GPU.

To perform this in a manner to simulate an application the sample program was modified further whereby the number of iteration loops was written to be reinterpreted as the number of Objects to create. Each Object contains three matricies (two input and one output). So the simultion test run loops through performing a matrix multiplaction on a list of Objects.

The tuning parameters were set to use 100 objects, each with matrix size of 256 x 256. Varying the size of the matrix greatly affects the relative performance. On the GPU side, the smaller the matrix the higher the proportion of overhead to start and stop the GPU. On the CPU side, much larger matricies exceed the size of the L2 cache. The matrix sizes were chosen such that it would not overly diminish the performance of either GPU or CPU.

Here are the relative performance figures:

Calculate performance with GPU 4.862223 Gflops
Calculate performance with CPU 0.824801 Gflops
Calculate performance with parallel CPUs 2.655731 Gflops
Calculate performance with GPU and parallel CPUs 6.217411 Gflops

GPU 5.895x that of 1 core
GPU 1.831x that of 4 cores
4 cores 3.22x faster than 1 core
GPU + 4 cores 1.279x faster than GPU

The following file is the C++ side of the sample application showing the threading technique. The Brook+ side is not included since this is not pertinant to the "how to implement data parallelism into a multi-threaded application".

Jim Dempsey

// DoQtInit.cpp
#include 
#include 
#include 

#include "common.h"
#include "QuickThread.h"
#include "parallel_for.h"

using namespace qt;
extern struct infoRec cmd;

int do_main(void* context);

int ObjectAllocationError = 0;

// Setup Affinity bitmasks for qtControls
// Firestream Brook+/CAL has the quirk that
// the first thread to touch the GPU owns the GPU.
//
// qtControl for GPU
qtControl qtControlGPU;
// qtControl for CPUs
qtControl qtControlCPU;
// qtControl for CPU exclusive of GPU
qtControl qtControlCPUxGPU;

// Setup_qtControls() called by the main thread
// which will be an arbitrary thread from the thread pool.
// Figure out what thread we are and setup the qtControl's
void	Setup_qtControls()
{
	// Setup qtControlGPU to be accessed only by current thread
	qtControlGPU.Affinities.SelectAffinities(_L0_);
	// Setup qtControlCPUxGPU to be accessed by all threads
	// (but enqueued by my thread)
	// First set to all threads
	qtControlCPUxGPU.Affinities.SelectAffinities(_M3_);
	// Now exclude current thread on AffinityDeQueue
	qtControlCPUxGPU.AffinityDeQueue &= ~qtControlGPU.AffinityQueue;
	// And set the AffinityQueue to any but the current thread
	qtControlCPUxGPU.AffinityQueue = qtControlCPUxGPU.AffinityDeQueue;
}

struct Object
{
    double* inputA;
    double* inputB;
    double* output;
    unsigned int Width;
    unsigned int Height;
	Object()
	{
	  inputA = NULL;
	  inputB = NULL;
	  output = NULL;
	  if(!ObjectAllocationError)
	  {
        Height = cmd.Height;
        Width = cmd.Width;
	    /////////////////////////////////////////////////////////////////////////
        // Allocate memory 
	    /////////////////////////////////////////////////////////////////////////
        inputA = allocate_mat_d(Height, Width);
        inputB = allocate_mat_d(Width, Height);
        output = allocate_mat_d(Height, Height);
        if (!inputA || !inputB || !output)
		  ObjectAllocationError = -1;
	  }
	};
	~Object()
	{
	  if(inputA) free(inputA);
	  if(inputB) free(inputB);
	  if(output) free(output);
	};
};

Object* ObjectList = NULL;

double* inputA(int iObj)
{
	return ObjectList[iObj].inputA;
}

double* inputB(int iObj)
{
	return ObjectList[iObj].inputB;
}
double* output(int iObj)
{
	return ObjectList[iObj].output;
}
unsigned int Width(int iObj)
{
	return ObjectList[iObj].Width;
}
unsigned int Height(int iObj)
{
	return ObjectList[iObj].Height;
}

int AllocateObjects()
{
	ObjectList = new Object[cmd.Iterations];
	return ObjectAllocationError;
}
void DeleteObjects()
{
    if (ObjectList)
		delete[] ObjectList;
}

void FillObjects()
{
	for (unsigned int i = 0; i < cmd.Iterations; ++i)
	{
      fill_mat_d(
		  ObjectList[i].inputA,
		  ObjectList[i].Height,
		  ObjectList[i].Width,
		  1, RANDOM);
      fill_mat_d(
		  ObjectList[i].inputB,
		  ObjectList[i].Width,
		  ObjectList[i].Height,
		  1, RANDOM);
	}
}

// stub called from Brook+ to initialize
// and start up QuickThread (using defaults
int DoQtInit()
{
	qtInit	qtInit;
	return qtInit.QueueMain(do_main, NULL);
}

// 1 core matmult
void matmultCPU(double* a, double* b, double* c, int m, int k, int n)
{
    int y;
    for (y = 0; y < m; y++)
    {
        int x;
        for (x = 0; x < n; x++)
        {
            double temp = 0;
            int z;
            for (z = 0; z < k; z++)
            {
                temp += a[y * k + z] * b[z * n + x];
            }

            c[y * n + x] = temp;
        }
    }
}

// Partial matmult stripe
void  matmultCPUparallel_y(
	intptr_t yBegin, intptr_t yEnd,
	double  *a, double  *b, double  *c, int  k, int  n)
{
    intptr_t  x, y;
	for(y = yBegin; y < yEnd; ++ y)
	{
		for (x = 0; x < n; x++)
		{
		  double  temp = 0;
		  int  z;
		  for (z = 0; z < k; z++)
		  {
			temp += a[y * k + z] * b[z * n + x];
		  }
		  c[y * n + x] = temp;
		}
	}
}
#define USE_CPUparallel_L2
#ifdef USE_CPUparallel_L2
// perform blocking parallel_for across all cores sharing L2
void  matmultCPUparallel(
	double  *a, double  *b, double  *c, int  m, int  k, int  n)
{
	// schedule on our thread's L2
	parallel_for(_L2_, matmultCPUparallel_y, 0, m, a, b, c, k, n);
}
#else
// perform blocking parallel_for across all cores
void  matmultCPUparallel(
	double  *a, double  *b, double  *c, int  m, int  k, int  n)
{
	parallel_for(matmultCPUparallel_y, 0, m, a, b, c, k, n);
}
#endif
// perform non-blocking parallel_for
// on all cores except the core driving the GPU
void  matmultCPUxGPUparallel(
	double  *a, double  *b, double  *c, int  m, int  k, int  n)
{
	parallel_for(qtControlCPUxGPU, matmultCPUparallel_y, 0, m, a, b, c, k, n);
	qtControlCPUxGPU.WaitTillDone();
}

// 1 core pick an Object and then matmult
void  DoObjectCPU(intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
    matmultCPU(Obj.inputA, Obj.inputB, Obj.output, Obj.Height, Obj.Width, Obj.Height);
}

// all cores pick an Object and then matmult
void DoObjectCPUparallel(intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
    matmultCPUparallel(Obj.inputA, Obj.inputB, Obj.output, Obj.Height, Obj.Width, Obj.Height);
}

// all cores except core driving GPU pick an Object and then matmult
void DoObjectCPUxGPUparallel(intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
    matmultCPUxGPUparallel(Obj.inputA, Obj.inputB, Obj.output, Obj.Height, Obj.Width, Obj.Height);
}

// template for Brook+ routine to perform GPU matmult
void  double_precision_simple_matmult (const float  Width,
		::brook::stream A,
		::brook::stream B,
		::brook::stream result);

// pick and object and perform matmult on GPU
void DoObjectGPUstream(
	::brook::stream& A, ::brook::stream& B, ::brook::stream& C, intptr_t iObj)
{
	Object& Obj = ObjectList[iObj];
	/////////////////////////////////////////////////////////////////////////
    // Brook code block
	/////////////////////////////////////////////////////////////////////////
    {
		streamRead(A, Obj.inputA);
		streamRead(B, Obj.inputB);

		// Run the brook program 
		double_precision_simple_matmult(
			(float)Obj.Width,
			A,
			B,
			C);
		// Write data back from stream 
		streamWrite(C, Obj.output);
    }
}

// Hack, save copy of Brook+ stream pointers
::brook::stream* pStreamA;
::brook::stream* pStreamB;
::brook::stream* pStreamC;


void DoObjectGPUstreams(void* A, void* B, void* C)
{
	pStreamA = (::brook::stream*)A;
	pStreamB = (::brook::stream*)B;
	pStreamC = (::brook::stream*)C;
}

// pick an object and pass on to GPU for matmult
void DoObjectGPU(intptr_t i)
{
	DoObjectGPUstream(*pStreamA,*pStreamB,*pStreamC,i);
}

// Run through all objects and pass on to GPU
void	DoObjectsGPU()
{
    for (intptr_t i = 0; i < cmd.Iterations; ++i)
    {
		DoObjectGPU(i);
    }
}

// Run through all objects and pass on to CPU (1 core)
void	DoObjectsCPU()
{
    for (intptr_t i = 0; i < cmd.Iterations; ++i)
    {
		DoObjectCPU(i);
    }
}

#ifdef USE_CPUparallel_L2
// Run through all objects and pass on to CPU (n cores in parallel)
// This is run without the GPU so all cores are available
void	DoObjectsCPUparallel()
{
	// Create 1 or more instances of DoObjectCPUparallel
	// split up across (for each) L2 cache
	parallel_for_each(_OneEach_L2_, 0, cmd.Iterations, DoObjectCPUparallel);
}
#else
// Run through all objects and pass on to CPU (n cores in parallel)
// This is run without the GPU so all cores are available
void	DoObjectsCPUparallel()
{
    for (intptr_t i = 0; i < cmd.Iterations; ++i)
    {
		DoObjectCPUparallel(i);
    }
}
#endif
// asynchronous "task" performing GPU matmult on object list
// in competition with CPU in parallel.
void	DoObjectsGPU_each(qt::parallel_for_iterator* p_itr)
{
	// Next perform iterations for our thread
	// The parallel CPU tasks will take some of the iterations
	// Because the GPU is faster than CPU we want to make sure
	// that the GPU finishes up last. To do this we make sure
	// the GPU iteration space reserves some object numbers in
	// advance of calculation. Depending on array sizes the
	// optimal number will vary.
	// Get the GPU first object number
	intptr_t i=p_itr->next();
	// reserve next iteration
	intptr_t iNext1=p_itr->next();
	// reserve next iteration
	intptr_t iNext2=p_itr->next();
	// while not done
	while(i < p_itr->end())
	{
		DoObjectGPU(i);			// process an object on the GPU
		i = iNext1;				// percolate the reserved object numbers
		iNext1 = iNext2;
		iNext2=p_itr->next();	// reserve one more object number
	}
}

// Perform the parallel CPU matmults on the object list
// in competition with the GPU. Use only the CPUs that are
// not the one driving the GPU.
void	DoObjectsCPUxGPU_each(qt::parallel_for_iterator* p_itr)
{
	// Next perform iterations for our thread
	// reserve the last few for the GPU
	for(intptr_t i=p_itr->next(); i+4 < p_itr->end(); i=p_itr->next())
	{
		DoObjectCPUxGPUparallel(i);
	}
}

// Fire up the tests that perform the parallel CPU
// in parallel with GPU working on one iteration space
// of the object list. 
//
void	DoObjectsGPUCPUparallel()
{
	// create an iterator across the number of objects
	// Note, cmd.Iterations was used to specify number of objects.
	parallel_for_iterator	itr(0,cmd.Iterations);
	// make local qtControl
	qtControl	qtControl;
	// local qtControl to assume affinity of qtControlCPUxGPU
	// which will run on any of the CPUs that is not the one
	// designated to drive the GPU
	qtControl.Affinities = qtControlCPUxGPU.Affinities;
	// queue up the parallel CPU task twice
	qtControl.QueueWork(DoObjectsCPUxGPU_each, &itr);
	qtControl.QueueWork(DoObjectsCPUxGPU_each, &itr);
	// Instead of queuing DoObjectsGPU_each as a task
	// we can avoid the overhead and call it directly
	DoObjectsGPU_each(&itr);
}

Blog: The Parallel Void

www.quickthreadprogramming.com
robert.jay.gould's picture
Quoting - AJ

I have spent the past semester researching how to combine data parallelism with TBB, so that TBB can offset data tasks to a GPU when available. My work is nearly complete, and now I need to ask the TBB community for some help. My approach is to separate data parallelism from task parallism, which involves the creation of a new type of task: the data task. For the purposes of my work, a data task represents an operation which can utilize data parallelism. For example, multiplying a large vector by a scalar, or a matrix by a scalar.

There are two work pools in my design. There is the task pool which remains completely controlled by TBB. I also have added a single shared data task pool, from which the GPU will consume data tasks. If a CPU has no work to be done, that is normal TBB tasks to consume, it will steal data tasks from the data task pool. Ordinary TBB tasks can only be executed on the CPU, whereas data tasks may be executed on either the GPU or CPU. A data task is very simple, it only provides an virtual execute() function for now.

I would like to integrate data task with the TBB task scheduler in a simple way. Data tasks can be run from within TBB tasks. I would like to allocate a task that spawns that data tasks, putting them into the data task pool. Rather than blocking until the data task is complete, I would like for the worker thead (which ran the tbb::task that spawned data tasks) to continue to process other TBB tasks as normal, and steal tasks as required. When all data tasks are complete, the allocated parent task is ready to be re-executed.

So if there wasn't a mixture of data tasks and tbb::tasks, I would allocate a parent, spawn children, and wait for all to complete. This problem is slightly more complex, because some of these children are of a different type, and are placed into a data task pool instead of the task pool used by TBB. If a program utilizes a large amount of data parallelism which the GPU can help with, TBB worker threads should steal data tasks for themselves too.

Thanks for your input, and as always I'm on #tbb on irc.freenode.net for a chat.

AJ

How about having a tbb::task that has the "task" to get a steal a data-task and process it. This was you keep the actual tasks pools separate, so data-task pool implementation can be kept simple. Then what I'd do is have place this steal-task at the top of the tbb::task pool, so when all normal tasks run out this steal task get run. And for the sake of if I'd make this steal-task respawn itself (recycling as per the manual) so its always on the stack.

Sorry if my diagram below is confusing, but something like this:

tbb::tasks: (steal) <- child <-child <-child [CPU]

V

data tasks: task, task, task, task, task [GPU/CPU]

AJ's picture

I will likely change the internal's of TBB's scheduler for the actual theft of data tasks. So a worker thread will try to steal tasks from other threads, and if that doesn't work I'll hit the data task pool. My problem is more so that I want to allocate a task that represents the data parallelism, and once that task has been allocated the worker thread should enter into a work-stealing mode to try to steal TBB tasks from other threads. Otherwise it hits the data task pool for data tasks. That's what I'm thinking.

I kinda want to have a way of forcing a worker-thread to go into task-stealing-mode, and then back to normal once the parent of the data task is finished. Maybe I should just add something like that to the scheduler interface, at which point I'd need help with where to put the logic in the code :-)

robert.jay.gould's picture
Quoting - AJ

I will likely change the internal's of TBB's scheduler for the actual theft of data tasks. So a worker thread will try to steal tasks from other threads, and if that doesn't work I'll hit the data task pool. My problem is more so that I want to allocate a task that represents the data parallelism, and once that task has been allocated the worker thread should enter into a work-stealing mode to try to steal TBB tasks from other threads. Otherwise it hits the data task pool for data tasks. That's what I'm thinking.

I kinda want to have a way of forcing a worker-thread to go into task-stealing-mode, and then back to normal once the parent of the data task is finished. Maybe I should just add something like that to the scheduler interface, at which point I'd need help with where to put the logic in the code :-)

Although I haven't touched the corresponding API at all, this situation sounds similar to the mailboxes interface for tbb::tasks. Maybe mailboxing is enough to get the job done, or at least a good choice for an entry point to extending the scheduler. But like I said, I haven't looked into mailboxes API much so it's just a guess. Maybe someone more familiar with mailboxing can give some feedback on this.

Raf Schietekat's picture
I think that the road of least resistance may be to include some code somewhere related to label "fail" in task.cpp, i.e., let a worker execute a GPU-oriented task as a last resort before becoming idle, because this does not seem to be something that can be easily if at all achieved using the existing API or thread affinity (mailboxes) or "changing modes".
AJ's picture

Right, that's my intention. Before a worker thread yields because there is no work to be done, it will grab some GPU tasks.

The problem is this... it's possible for a TBB task to have data task children. Data tasks will always be leaf nodes, after all if a data task has a tbb::task as a child, it's a tbb::task by definition (violates what a data parallel task is). Suppose someone writes this code for a parallel_for:

struct ParallelWorker

{

void operator()(Range& x) const

{

// let's assume X, Y, U are defined

X = matrix_multiply(X, Y);

Vector Z = matrix_multiply(X, U);

}

};

So when this runs, each matrix_multiply() spawns a bunch of data parallel tasks, which go into the data task pool. However, ParallelWorker's body is now blocked... it can't proceed until each matrix_multiply() operation completes. This is no different than nested parallelism. Rather than having the worker thread block, I'd like to implement each matrix_multiply in such a way that it lets the worker thread steal tasks. It would steal from other task pools, and as a last resort as you said, get data tasks from the GPU. When the matrix_multiply operation is completed, the parent should be ready to be resumed. This corresponds in concept to having a parent task spawn a number of children, but these children have all been stolen so the worker thread now has to search for work.

Hopefully this all makes sense.

Dmitry Vyukov's picture
Quoting - AJ

So when this runs, each matrix_multiply() spawns a bunch of data parallel tasks, which go into the data task pool. However, ParallelWorker's body is now blocked... it can't proceed until each matrix_multiply() operation completes. This is no different than nested parallelism. Rather than having the worker thread block, I'd like to implement each matrix_multiply in such a way that it lets the worker thread steal tasks. It would steal from other task pools, and as a last resort as you said, get data tasks from the GPU. When the matrix_multiply operation is completed, the parent should be ready to be resumed. This corresponds in concept to having a parent task spawn a number of children, but these children have all been stolen so the worker thread now has to search for work.

If I am not missing something, you can just spawn a number of data tasks with spawn_root_and_wait() (there is a overload that accepts a list of tasks), all other will be provided by current TBB scheduler - current thread will be blocked until all child tasks will be completed, current thread will be stealing from other threads.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
AJ's picture

Right, that would mean that only CPUs are consuming data parallel tasks. In this case, the GPU is available so I would like for the GPU to consume these tasks. Thus the need for a separate data task pool, and handling of data tasks.

Dmitry Vyukov's picture
Quoting - AJ

Right, that would mean that only CPUs are consuming data parallel tasks. In this case, the GPU is available so I would like for the GPU to consume these tasks. Thus the need for a separate data task pool, and handling of data tasks.

Oh, I forgot to mention that you still have to incorporate 2 task pools, and incorporate some priority stealing for CPU threads.

I meant that only the following will be solved automatically:

"The problem is this... it's possible for a TBB task to have data task children. [...] I'd like to implement each matrix_multiply in such a way that it lets the worker thread steal tasks. It would steal from other task pools, and as a last resort as you said, get data tasks from the GPU. When the matrix_multiply operation is completed, the parent should be ready to be resumed."

So the complete solution is:

- create additional task pool for data tasks

- incorporate priority stealing for CPU threads (normal tasks first)

- spawn data tasks with spawn_root_and_wait() (while thread is waiting for these tasks to complete it will be stealing other tasks - preferably normal tasks)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
AJ's picture

This solution sounds like it adds a lot more complexity, although I'm not familiar enough with the TBB scheduler to know just how complex the implementation would actually be. What about the approach suggested by Raf? Have worker threads steal from the GPU as a last resort?

My thought is to create a data_task, which has within it a method get_tbb_task() or something, which will return a TBB compatible task representing that data task. This way the GPU handles data_task objects itself, and TBB will have to call a method to convert a data_task into something it can handle. So it will steal a data task, then create TBB tasks by calling data_task.get_tbb_task().

Adding priority sounds like a complex solution, rather than only stealing from the GPU as an absolute last resort to do useful work. I like your suggestion of doing a spawn and wait. How can I incorporate a spawn and wait with a different type of task? Rather than having actual TBB tasks enter the pool, I'd like for the thread to try to call its work theft functions... stealing tasks from other worker threads, or from the GPU as a last resort. This way both TBB tasks and GPU tasks can run concurrently on the same system.

I had thought that perhaps I could come up with a solution using empty tasks... put the worker thread into the position that it can't do any more useful work without theft, at which point it will try to steal work from other CPUs... and finally hit the GPU for work.

Raf Schietekat's picture

"The problem is this... it's possible for a TBB task to have data task children." Just an idea, and I don't know if this will work because a lot about TBB is still fuzzy to me, but if the parent allocates an equal number of empty_tasks in TBB as it creates "data tasks" outside of TBB, and accounts for them in the ref_count but doesn't actually spawn them before calling wait_for_all(), a watcher tbb_thread could perhaps follow up each "data task" completion by spawning an additional task of the parent, whose only action would be to destroy the associated empty_task. That way the only change to TBB itself would be the one described earlier.

AJ's picture
Quoting - Raf Schietekat

"The problem is this... it's possible for a TBB task to have data task children." Just an idea, and I don't know if this will work because a lot about TBB is still fuzzy to me, but if the parent allocates an equal number of empty_tasks in TBB as it creates "data tasks" outside of TBB, and accounts for them in the ref_count but doesn't actually spawn them before calling wait_for_all(), a watcher tbb_thread could perhaps follow up each "data task" completion by spawning an additional task of the parent, whose only action would be to destroy the associated empty_task. That way the only change to TBB itself would be the one described earlier.

I'm not sure what would happen if I allocate empty_tasks to count toward the ref_count but don't spwan them before I wait... what should happen?

jimdempseyatthecove's picture

AJ,

Permit me to rephrase your request then you can tell me if this is what you are thinking.

You want to determine if a GPU is present and if present you set a flag indicating it is available.

When GPU is available you want to seperate the various tasks into two classes:
CPU only and GPU prefered

The CPU only are straitforward tasks

The GPU preferred tasks are in a queue that can be executed by either GPU or CPU. The selection chriteria might be when the GPU task list has n pending requests redirect next tasks to CPU. Or if your task requests to GPU contain an estimated processing weight and as GPU tasks are enqueued the estimated weight is added to a tally and when a GPU task ends the estimated weight is removed from the tally then when the tally exceeds a threshold the CPU route steals GPU tasks.

Also, since you may have one or more GPUs you may want multiple GPU processing tasks.

Is this what you had in mind?

Also, I tried to figure out the irc.freenode.net and could not find the #tbb area.

Jim Dempsey

Blog: The Parallel Void

www.quickthreadprogramming.com
AJ's picture

I'm glad you took a shot at joining freenode. You have to join the #tbb room... I'm not sure what irc client you tried, but you can run "/join #tbb" to join the TBB chat room.

What I'm thinking is to have two separate task pools, one for the CPU, and one for the GPU as you mentioned. The CPU task pool is just ordinary TBB, but the GPU task pool holds a special type of task: the data_task. This type of task may be converted to a TBB task.

So you have touched on what I have in mind. Integrating this approach with TBB is my current task.

AJ's picture

Here's a guide for getitng onto irc.

http://freenode.net/using_the_network.shtml

Raf Schietekat's picture

"I'm not sure what would happen if I allocate empty_tasks to count toward the ref_count but don't spwan them before I wait... what should happen?" I'm not sure either, but it's what I would try first if I didn't have to be 100% sure in advance about whether it would work or about its reliability (please don't use this to control something like a nuclear reactor). I leave it to you to test it first, which you can do before changing TBB itself or doing anything with the GPU(s), and be prepared to have to revise it later on. But maybe I have to take the question literally: the idea is that TBB thinks that some other thread is working on those threads, and that it just goes out stealing like it normally does, until at some point it no longer thinks that and can continue with the parent, while the GPU-oriented tasks are always preferentially handled by the GPU. I see no reason why you couldn't combine this with Jim's idea of using a (tunable?) threshold for stealing work from the GPU.

AJ's picture
Quoting - Raf Schietekat

But maybe I have to take the question literally: the idea is that TBB thinks that some other thread is working on those threads, and that it just goes out stealing like it normally does, until at some point it no longer thinks that and can continue with the parent, while the GPU-oriented tasks are always preferentially handled by the GPU. I see no reason why you couldn't combine this with Jim's idea of using a (tunable?) threshold for stealing work from the GPU.

This is exactly what I want to do... the question is: how?

Raf Schietekat's picture

"This is exactly what I want to do... the question is: how?" Well, just as I said, you pretend that you've allocated and spawned N tasks, even though you've only allocated them in parallel with "spawning" an equal number of GPU tasks outside of TBB, so you set the ref_count to N+1, allocate N empty_task children, then you wait_for_all() and hope that TBB will take the bait. The GPU stuff has to be handled by whatever means available, and then the dedicated tbb_thread will at some point have to detect that the GPU task has completed, and decrement the ref_count of its TBB "parent" (some housekeeping will be required) in the way I described. You can try this first without any actual GPU work (just a mock-up to validate the TBB side of this approach), then add the GPU work, and later you can modify the scheduler to allow GPU-task stealing by TBB, optionally with a threshold.

AJ's picture
Quoting - Raf Schietekat

"This is exactly what I want to do... the question is: how?" Well, just as I said, you pretend that you've allocated and spawned N tasks, even though you've only allocated them in parallel with "spawning" an equal number of GPU tasks outside of TBB, so you set the ref_count to N+1, allocate N empty_task children, then you wait_for_all() and hope that TBB will take the bait. The GPU stuff has to be handled by whatever means available, and then the dedicated tbb_thread will at some point have to detect that the GPU task has completed, and decrement the ref_count of its TBB "parent" (some housekeeping will be required) in the way I described. You can try this first without any actual GPU work (just a mock-up to validate the TBB side of this approach), then add the GPU work, and later you can modify the scheduler to allow GPU-task stealing by TBB, optionally with a threshold.

Alright, thanks. I'll try this tonight.

Raf Schietekat's picture

"Alright, thanks. I'll try this tonight." Note that the additional-child indirection may very well not be required, but I got spooked somehow about calling destroy() directly from an unrelated thread, and it may play a role if the parent isn't already executing. Anyway, I hope it does the trick, and that I haven't forgotten anything actually relevant instead.

AJ's picture

WOHOO! It works... (the concept). Here's the code illustrating the concept.

#include
#include
#include

#include

class Terminator
{
public:
void operator()(tbb::task* t)
{
std::cout << "Sleeping..." << std::endl;
tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(10.0) );

std::cout << "Set ref_count = 0" << std::endl;
t->set_ref_count(t->ref_count() - 1);
};
};

class Waiter : public tbb::task
{
public:
tbb::task* execute()
{
// Create a task that won't return, because ref_count is too high
tbb::empty_task& child = *new (allocate_child()) tbb::empty_task();
set_ref_count(3);

// Spawn a thread to set ref_count = 0 later

Terminator terminator;
tbb::tbb_thread thread(terminator, this);
spawn_and_wait_for_all(child);

std::cout << "Done Waiting." << std::endl;
return 0;
}
};

int main()
{
tbb::task_scheduler_init init;

// Allocate a parent task
Waiter& parent = *new (tbb::task::allocate_root()) Waiter();
tbb::task::spawn_root_and_wait(parent);

return 0;
}

Raf Schietekat's picture

"It works..." Not quite what I suggested, though! I think you've been lucky, even though it seems quite predictable what's happening: empty_task is probably immediately executed on the same thread (an obvious implementation of spawn_and_wait_for_all()), letting the ref_count drop to 2. Then "t->set_ref_count(t->ref_count() - 1);" only works because it is the only thread manipulating the ref_count (set_ref_count() is not thread safe), letting it drop to 1, which is spawn_and_wait_for_all()'s cue to return. empty_task is only a diversion here (it would be equivalent to forget all about it and call wait_for_all() instead); in my suggestion it is a tool to decrement the ref_count atomically, as suggested in the introduction to "8.3.3 Explicit task Destruction" in "Reference Manual (Open Source).pdf", from an additional child (destroy() is a non-static method, so it's not even a choice anyway).

Raf Schietekat's picture

A bit more explicitly: "because it is the only thread manipulating" -> "because, thanks to the most likely synchronous execution by spawn_and_wait_for_all() vs. the long delay in Terminator, when it executes it is almost certainly the only thread still manipulating". I think that even the appearance of a race should be avoided.

AJ's picture

Okay, here it is again without the empty_task.

#include
#include
#include

#include

class Terminator
{
public:
void operator()(tbb::task* t)
{
std::cout << "Sleeping..." << std::endl;
tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(10.0) );

std::cout << "Set ref_count = 0" << std::endl;
t->set_ref_count(t->ref_count() - 1);
};
};

class Waiter : public tbb::task
{
public:
tbb::task* execute()
{
set_ref_count(2);

Terminator terminator;
tbb::tbb_thread thread(terminator, this);
wait_for_all();

std::cout << "Done Waiting." << std::endl;
return 0;
}
};

int main()
{
tbb::task_scheduler_init init;

// Allocate a parent task
Waiter& parent = *new (tbb::task::allocate_root()) Waiter();
tbb::task::spawn_root_and_wait(parent);

return 0;
}

Raf Schietekat's picture

Sure, but you should primarily have a more disciplined way of decrementing ref_count, as described. This code may happen to work (if you only want to have one GPU task per parent!), but set_ref_count() is not meant to be thread safe.

AJ's picture

If I wrap set_ref_count() in a spin_lock will you be happy?

Raf Schietekat's picture

I would be very disappointed...

AJ's picture

Actually I might just modify the TBB source to make set_ref_count atomic... do you think I'd get a performance hit vs. an external spin_lock in my code on ref_count access?

Raf Schietekat's picture

I'm curious: why this insistence not to use what is available?

AJ's picture

I'm not entirely sure what you mean, or are suggesting... if ref_count() and set_ref_count are not thread safe, I need to either 1) Make them thread safe or 2) Lock the task before calling them.

Raf Schietekat's picture
Quoting - AJ

I'm not entirely sure what you mean, or are suggesting... if ref_count() and set_ref_count are not thread safe, I need to either 1) Make them thread safe or 2) Lock the task before calling them.

No, you should 0) destroy a stand-in empty_task from within an additional task as described and motivated earlier. You had an empty_task, but it was immediately thrown away by spawn_and_wait_for_all(), then you removed all traces of it; instead, you should have kept the allocation but changed spawn_and_wait_for_all() to just wait_for_all(), resulting in an semi-excess ref_count (that was the speculative side of my proposal). Instead of directly manipulating ref_count when the GPU work has completed, you should allocate_additional_child_of() a task that will destroy() the stand-in empty_task that was waiting in the wings; in the structure describing the GPU work you should keep a reference to the "parent" for use in allocate_additional_child_of() and a reference to the empty_task to set the target for destroy().

In the "Additions to atomic" patch I have made ref_count atomic because I was suspicious of supposedly atomic operations on non-atomic (and even non-volatile!) variables (and with reason, as it turned out, even if it was not exactly this that failed), but even there I would not recommend directly manipulating the ref_count, not even with an atomic decrement operation, because it serves no purpose to second-guess the existing API if an equivalent official alternative is readily available. There is much that needs to be changed about TBB (especially now that its leadership has declared its intentions for retrograde progress), but otherwise it seems better to leave well enough alone.

Raf Schietekat's picture

For some more clarity (hopefully):

I wrote "I would be very disappointed..." That was of course an intentionally silly response to "AJ"'s "will you be happy", in case anyone missed that.

"AJ" wrote "if ref_count() and set_ref_count are not thread safe, I need to either 1) Make them thread safe or 2) Lock the task before calling them." That is invalid, because it is not officially clear that ref_count is merely a task-local property that may change asynchronously. It's not (maybe the Reference Manual should make this clear...), and thread safety is not a single indivisible concept. In this case, just see the documentation for destroy(task&), which insists on making clear how it behaves differently, with regard to ref_count, from, e.g., a last child returning from execute() (my counterexample).

My intention was to offer a necessarily hacked solution with absolutely minimal assumptions, for maximum chance of success, and, despite a moment of confusion on my side, all elements are required. There's no use in taking more risks than this.

Anyway, that's all from my side: I can only explain, but of course you're free to draw your own conclusions.

AJ's picture

Thank you for the detailed response. I have to go through the TBB scheduler code again. I'll simplify my question.

I have a single function with prototype: bool steal_data_task();

This function will try to steal a data task, and run it on the CPU. It will return true if a task was stolen and executed, hinting that there might be more work. False otherwise.

I want to insert this single call into the TBB scheduler code, so that the worker threads will attempt to steal work from the data task pool via this function call.

I also realized that I needed to write a function void wakeup_all_workers(); that will wake up any sleeping workers, so that when I add data tasks, the workers can start stealing tasks.

Any input on where the steal_data_task() call should go is appreciated. I have to trace through the scheduler code a few more times to actually understand it.

Raf Schietekat's picture

"I also realized that I needed to write a function void wakeup_all_workers(); that will wake up any sleeping workers, so that when I add data tasks, the workers can start stealing tasks."

See GenericScheduler::mark_pool_full(), but, while you may need it to run full throttle and/or to get any result if the GPU/CPU task was added from an independent thread while TBB was sound asleep, you still have to add steal_data_task() near "fail:" first.

Dmitry Vyukov's picture
Quoting - jimdempseyatthecove

Here are the relative performance figures:

Calculate performance with GPU 4.862223 Gflops
Calculate performance with CPU 0.824801 Gflops
Calculate performance with parallel CPUs 2.655731 Gflops
Calculate performance with GPU and parallel CPUs 6.217411 Gflops

Jim, this example indeed shows the power and flexibility of your library.

How many threads did you start in parallel CPU+GPU example? 4 or 5? Are calls to Brook+/CAL blocking? If so then I think you must start 5 threads. From provided numbers it's possible that only 3 CPU threads was running.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
jimdempseyatthecove's picture
Quoting - Dmitriy Vyukov

Jim, this example indeed shows the power and flexibility of your library.

How many threads did you start in parallel CPU+GPU example? 4 or 5? Are calls to Brook+/CAL blocking? If so then I think you must start 5 threads. From provided numbers it's possible that only 3 CPU threads was running.

Dmitriy,

I am still a newbie on Brook+ GPGPU programming. The CAL (underlaying Brook+) has completion routine capability. The Brook+ coding (from my limited working with it) appears to wait implicitly. I am assuming at this time it is either in a compute loop looking at a shared memory variable and as to if this loop contains mm_wait or SwitchToThread I do not know. Therefor the coding model I chose was (on Q6600 with 4 HW threads) to launch 4 affinitized compute threads with one of the threads designated as the GPU control thread. Therefore, the CPU portion of the parallel CPU in parallel with GPU was using only 3 threads.

The sample program has a #define variable that can be used to specify if the concurrent CPU portion of the parallel programmiing was to include or exclude the GPU designated thread.As Irecallthis #define did not extend the number of compute bound threads by 1 (to 5). I believe better results will come from making the GPU controlling thread an I/O class thread. Some experimentation needs to be done to determine best juxtapositioning of threads. Also tweaking the Brook+ may be needed to use SwitchToThread would be in order. I wanted to get sample code outfor AJ to review and comment on.

I haven't used CUDA as I do not have an nVidia adapter on my test systems.

Jim Dempsey

Blog: The Parallel Void

www.quickthreadprogramming.com
robert.jay.gould's picture
Quoting - jimdempseyatthecove

I haven't used CUDA as I do not have an nVidia adapter on my test systems.

Jim Dempsey

Although it won't really help measure real performance the CUDA SDK comes with an emulator, that simulates the parallel processing and expected gains, etc...

If you are really curious, you could try that out before commiting to buy an nVidia card or not.

jimdempseyatthecove's picture
Quoting - robert.jay.gould Quoting - jimdempseyatthecove

I haven't used CUDA as I do not have an nVidia adapter on my test systems.

Jim Dempsey

Although it won't really help measure real performance the CUDA SDK comes with an emulator, that simulates the parallel processing and expected gains, etc...

If you are really curious, you could try that out before commiting to buy an nVidia card or not.

Thanks for the tip. The CUDA emulator might help with initial testing. My main development system (Q6600) has only one PCIx16 slot and I'd rather not be popping cards in and out. Also the ATI and nVidia might not live well together even if I had dual PCIx16 slots.

Jim Dempsey

Blog: The Parallel Void

www.quickthreadprogramming.com
it.d.balajigmail.com's picture
Hi,

I am currently looking at this work. I am just curious about what happened to this work? How much of it got completed? Could you please direct me to publications, if any, based on this work?

Thanks,
Balaji

Quoting - AJ

I have spent the past semester researching how to combine data parallelism with TBB, so that TBB can offset data tasks to a GPU when available. My work is nearly complete, and now I need to ask the TBB community for some help. My approach is to separate data parallelism from task parallism, which involves the creation of a new type of task: the data task. For the purposes of my work, a data task represents an operation which can utilize data parallelism. For example, multiplying a large vector by a scalar, or a matrix by a scalar.

There are two work pools in my design. There is the task pool which remains completely controlled by TBB. I also have added a single shared data task pool, from which the GPU will consume data tasks. If a CPU has no work to be done, that is normal TBB tasks to consume, it will steal data tasks from the data task pool. Ordinary TBB tasks can only be executed on the CPU, whereas data tasks may be executed on either the GPU or CPU. A data task is very simple, it only provides an virtual execute() function for now.

I would like to integrate data task with the TBB task scheduler in a simple way. Data tasks can be run from within TBB tasks. I would like to allocate a task that spawns that data tasks, putting them into the data task pool. Rather than blocking until the data task is complete, I would like for the worker thead (which ran the tbb::task that spawned data tasks) to continue to process other TBB tasks as normal, and steal tasks as required. When all data tasks are complete, the allocated parent task is ready to be re-executed.

So if there wasn't a mixture of data tasks and tbb::tasks, I would allocate a parent, spawn children, and wait for all to complete. This problem is slightly more complex, because some of these children are of a different type, and are placed into a data task pool instead of the task pool used by TBB. If a program utilizes a large amount of data parallelism which the GPU can help with, TBB worker threads should steal data tasks for themselves too.

Thanks for your input, and as always I'm on #tbb on irc.freenode.net for a chat.

AJ

jimdempseyatthecove's picture

Blaji,

I do not know how far AJ has progressed on his work in integrating CUDA with TBB. AJ is quite competent and may have successfuly done this (i.e. is bug free or at least produces usable code while using documented coding practices).

I have successfully integrated the ATI Brook+ into test applications using my threading tool QuickThread. I have not attempted this with CUDA (haven't downloaded the SDK and do not have a recent nVidia card). I do not see why the CUDA integration could not be integrated as easily as I have done with Brook+

On the Brook+ and AMD FireStream card I have, and at least on the version of the SDK I have, the CPU <-> Kernal code worked fine as long as the CPU thread remained constant. In my case I also found it wanted to be the main thread of the application as well. This is not a problem in coding in QuickThread as the tasks can easily be directed at one or a subset of threads. The test application I ran was a matrix multiply operation performed row-by-row in the outer loop and where the rows are picked by availability of CPU thread and GPU. i.e. some rows handled by GPU and others by cores on CPU in first come first serve manner.

You might be able to ge AJ's email address and communicate with him directly.
My email is (squish out the spaces and change 'at' to @)

jim _ dempsey at ameritech . net

Jim

Blog: The Parallel Void

www.quickthreadprogramming.com
AJ's picture

I'm still around, but I check messages very infrequently nowadays. So far as "what happened" to my research, it's still ongoing. As research code, none of it is ready for production usage. I have begun to implement real, production code that draws on my experiences with past research versions. However this is slow going, and I doubt anything will be ready for another year. I have had some big problems with deciding what work should be offset to the GPU and what should be offset to the CPU. There are also a lot of technologies involved, and hence many possible approaches to integrating GPUs with TBB.

Feel free to drop by IRC and discuss my work, I'm there a lot of the time... on #tbb on irc.freenode.net.

Research publications will be coming soon, I'm under a lot of pressure to start to publish some of the preliminary results. When these publications are ready I'll post a link on the forum :-)

Login to leave a comment.