cpu performance w/a simple gather utility

cpu performance w/a simple gather utility

I have an i7 965 cpu running on win7 and using tbb 3.0.
The attached program, is designed to gather into a (clumn format) matrix elements that live in different buffers
(in this case a bigger matrix).
I am trying to see how many processing units (out of 8?) are engaged in the calculation. From the task mgr
I see at most 4.

using std::vector;
#include "tbbparallel_for.h"
#include "tbbparallel_reduce.h"
#include "tbbblocked_range.h"
#include "tbbblocked_range2d.h"

// function to gather data into a single matrix as columns
template< typename _M, typename _T>
void gather_data_into_matrix_columns( _M & m, vector<_T*> const & dataStart, vector const & dataStride ){

	struct column_gatherer{
		_M & m_;
		vector<_T*> const & dataStart_;
		vector const & dataStride_;
		column_gatherer( _M & m, vector<_T*> const & dataStart, vector const & dataStride )
			:m_(m), dataStart_(dataStart), dataStride_(dataStride) 
		void operator()( const tbb::blocked_range2d< size_t > & r ) const {
			size_t rowStart = r.rows().begin();
			size_t rowEnd = r.rows().end();
			size_t colStart = r.cols().begin();
			size_t colEnd = r.cols().end();

			for ( size_t colIdx = colStart; colIdx != colEnd; ++colIdx ){
				size_t stride = dataStride_[colIdx];
				_T const * ptr = dataStart_[colIdx] + rowStart * stride;
				for ( size_t rowIdx = rowStart; rowIdx != rowEnd; ++rowIdx ){
					m_( rowIdx, colIdx ) = *ptr;
					ptr += stride;

	static const size_t GRAIN_SIZE_ROW = 10;
	static const size_t GRAIN_SIZE_COLUMN = 10;

	column_gatherer cg( m, dataStart, dataStride );
				0, m.size1(), GRAIN_SIZE_ROW, 
				0, m.size2(), GRAIN_SIZE_COLUMN
			tbb::auto_partitioner() );

using std::cout;
using std::endl;

using boost::numeric::ublas::matrix;

bool test_GatherScatter(){
	const size_t nRows = 80;
	const size_t nCols = 20; 
	matrix< double, boost::numeric::ublas::column_major> m(nRows,nCols);

	const size_t nRowsBig = 800;
	const size_t nColsBig = 100; 
	matrix< double, boost::numeric::ublas::column_major> mBig(nRowsBig,nColsBig);
	for ( size_t i = 0; i != nRowsBig; ++i ){
		for ( size_t j = 0; j != nColsBig; ++j ){
			mBig(i,j) = double(rand())/double(RAND_MAX) - 0.5L;

	vector< double *> colStart( nCols );
	vector< size_t > strides( nCols );
	for ( size_t i = 0; i != nCols; ++i ) {
		colStart[i] = &(mBig(0,3*i));
		strides[i] = 5;

	gather_data_into_matrix_columns( m, colStart, strides );

	for( size_t i = 0; i != nRows; ++i ){
		for( size_t j = 0; j != nCols; ++j ){
			if ( m(i,j) != mBig( 5*i, 3*j) ){
				cout << "(i, j) = (" << i << ", " << j << ")" << endl;
				return false;
	return true;


int main(){
	bool b;
	for ( size_t i = 0; i != 1000; ++i )
	b = test_GatherScatter();

Is this a reliable test? Is there a better one than eyeballing it ;)) ?
The code is a concatenation of various files. Conceptually gives the right result. Please, let me know if there is something more you might need.
Thank you very much for your help,

ps: the motivation for this is to bring into one place (typically a matrix) data that will constitute the rhs of lapack linear equation solver.

23 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Informed but also still very wild guess: the matrix has only 16 chunks, which execute very quickly, so only 4 cores can get out of the starting blocks quickly enough to execute them. Look what happens if you increase matrix size and/or decrease grainsize and/or add bogus busy delays in the parallel code: what differences do you see, and what are the execution times? I don't suppose that the serial setup is long enough for the worker threads to go to sleep, though (how long would that take?), so it shouldn't make a difference if you do the 1000 iterations only around the call to gather_data_into_matrix_columns() so that the workers would certainly still be spinning and able to react faster than if they would have to be woken up first.

Hi Raf,
Thank you for the quick reply. The external loop was only there to fascilitate my inspecting on what shows in the task manager's performance tab.
I followed your advice and added bogus code inside the operator call of the nested class (repeated the assignment a good 1,000 times) and saw all 8 (logical) processors, fire up. The coarsening of the grain size had only partial results. Increasing the matrices past a certain limit would probably be unrealistic for the purposes of a sngle memory buffer living within the boost matrix.
Does this mean that this strategy cannot be improved for this size of problems? (typically nRows = 40 and nCols
~=1000 )
Fnally, coming back to my initial question, is there some utility within tbb to inspect such statistics? - if not, it might be a useful addition for tunn9ing purposes.
Again, thank you very much for your help,
ps: did not understand why was redirected. Was I in the wrong place?

I meant the opposite of coarsening the grainsize: refining it. You might
even try to rely entirely on auto_partitioner unless you see a real
need for using grainsize anyway, which is primarily needed with
simple_partitioner. Having no grainsize and using simple_partitioner
would be a way to diagnose the original problem by both increasing
overhead and providing more opportunity for stealing.

Scaling code apparently remains challenging even with a task-based
framework... Unless you have an opportunity to parallelise at a higher
level, this may be the best you can do, and if it's important to do
better you may have to try to be creative.

Intel offers developer products that incorporate both TBB and diagnostic tools, but the latter are not part of the former.

Hi Raf,

I spent a good chunk of time today going over these ideas.
I do have Intel's Parallel Studio and tried a few experiments with the Parallel Amplifier (PA) as well, but I am afraid all this was inconclusive.
In order to get something meaningful through the PA, I had to break up the code into smaller pieces (so, for example the filling of the big source matrix does not mask the performance - also tried an mkl mt random number generator.
Interestingly, PA even showed the vml/vsl functions as poorly threaded - I guess to make me feel a bit better about all this ;)).
Now, there is the element that I might be doing something less than clever w/the PA.
It is noteworthy that, if I do a loop of 100 reps on the gather function, the total tick_count is approximately twice (!!) the time for a single call. Now, I do realize that this is wall clock time, but is there any chance that most of the time is spent in setting up the thread pool rather than running it? If this is the case, is there a way to avoid this toll?
Finally, I ran this for very large grain sizes - assuming that this renders it single-threaded - correct?. It turns, out that my total gain is approx. 50%, which with 8 processors and nothing else happenning on my pc is rather suboptimal ;-)).
Any ideas you might have will be greatly appreciated.
Thank you very much for all your help,

The thread pool should persist across calls to TBB-related code, so its setup should be amortised away for significant TBB usage over the lifetime of the program (except that bursty use may incur a penalty of awaking sleeping workers, I suppose), which may not be the case here. It might also be interesting to consider some actual numbers related to serial setup vs. parallel gather, and whether the speedup is measured specifically on the parallel part (otherwise Amdahl necessarily limits the total impact). If the serial setup is significant, maybe it too can be parallelised. Also, your mileage will likely differ in a real program (making a program that's already finishing in a flash any faster does not seem very relevant), where different opportunities (for high-level parallelism) and/or distractions may occur compared to microbenchmarks.

Thank you for your response Raf.
I was going for some kind of generalization of the pack/unpack utilities of the vml library (part of mkl).
Hence the effort.
The reason for this is that, in PDEs you very often need to solve systems of equations at each time step.
In the class of methods I am considering, in the case of multidimensional problems, one further has to solve systems at every dimension (as many as the product of sizes of the rest of the dimensions). Hence this snippet that calculates in a flash as you say, is to be used thousand of times. Ideally the lapack routines would allow for this, allowing for some stride info of the vector arguments. Alas, LAPACK does notprovide for this. So copying among different structures becomes indispensable (sometimes there are other organizations of the problem that bypass this, but not always).
This, as a note that I did not mean to waste your time on an inane issue ;-)).
Thank you very much for all your help,
Have a great Weekend,

"Hence this snippet that calculates in a flash as you say, is to be used thousand of times."
I have found it is often more efficient to speculate than to have many rounds of Q&A before a fearful first attempt at a suggestion, but in light of this additional background information I admit it does seem rather difficult to parallelise successive time steps... (Note that I really meant "challenges" instead of just silly "distractions".)

"Ideally the lapack routines would allow for this, allowing for some
stride info of the vector arguments. Alas, LAPACK does notprovide for
Hmm, with LAPACK supposedly designed for more efficient memory access patterns and relying on BLAS, so that layout may be linked to performance, I wouldn't dare comment on that myself.

I don't really understand what you're doing, mixing LAPACK and boost uBLAS (I would suppose LAPACK uses its own BLAS?), but here's my ultimate attempt, in the form of a question anyway (consider it rhetorical): if you are doing a gathering operation to use only part of the big matrix, wouldn't other parts be gathered and processed too, and wouldn't that be the level at which to parallelise instead of the individual gathering operations? To avoid parallel overhead, it's typically best to aim as high as possible.

Hi Raf,
Thank you for your resonse. A few clarifying points:
You are right on LAPACK basing on BLAS and therefore the memory layout to be dicated thereoff- I, too. am not in a position to comment on other possible implementations, especially if all that is parallelized is the BLAS substrate and not the drivers themselves, as it now seems to me to be the case.
Hence the "ideally" in my comment. LAPACK was created with different hardware constraints and, therefore,
objectives than today's available options.
I do not mix uBLAS and LAPACK/BLAS. I only use the matrix structures from uBLAS- I consider the effort of rewritting BLAS in C++ largely misdirected and futile in the presence of vendor and freely provided mt implementations.
You are absolutely right that the result of this discussion seems to be that the organization of the parallelizqation has to happen at a higher level. However, this calls for a much more complicated design - especially if one wants to use many PDE problems based on the same PDE at once the difference among tem being the initial and boundary conditions).
Naturally, one tries the easier slution first. I was hoping there was one to my "problem", since admittedly, it seems like a rather simple one, as you also noted in a previouscomment. Alas..
Again thank you very much for all your help and your input.
Best Regards,

"LAPACK was created with different hardware constraints and, therefore, objectives than today's available options."
Could you briefly elaborate on that?

The open source BLAS and LAPACK don't directly address threading. Several major vendors have supported closed source code versions with OpenMP compatible threading, such as Intel MKL library. They expect to have a specific group of CPUs and cores assigned before starting the job, and they don't share their resources with the TBB threading model, for example. So, the story for now seems to be that MKL LAPACK is suitable for use along with TBB only in MKL single thread mode.
Many BLAS functions can be built with C or C++ after automated translation of the open Fortran source, but you can just as well compile them in Fortran and link together with C++. The Fortran source, of course, is far too old to come with ISO C binding, which you would need to add to get portable "extern C" linkage.
As you can see, I'm stabbing in the dark as to whether I can relate to the current question.

Hi Raf,
What I meant is for example choices of in-place substitution of results which was largely dictated by the fact that memory was way more "expensive" , forcing the user to copy around instead of using pre-allocated buffers.
Also, having been written in FORTRAN it cannot really provide for strided-based types (yes it could, but at the expense of a lot of code/argument -ists overhead. And yes, I understand that there are issues with efficiency, but this was not, at the time at least, the underlying concern).
As far as I am concerned, one of LAPACK's main clients are PDE solvers and with the possibility of many more cores these days, problems that would be very slow in shared memory configurations (like my pc, these days and hopefully next year) are now tractable.
So, for example, when I try to solve a 4-d PDE using ADI (maybe even higher when the knights show up) have the unknown function as a boost nulti-array and when solving for each dimension at every time step I do need to make unnecessay copies to solve the linear systems. When I was able to do efficiently 2 dimensions the overhead was less of a fraction than with more dimensions. As a coder it is just a constant headache ;-))
There is no intention here of knocking down LAPAC, of course ;-)).
For all design choices there are pros and cons. The value of LAPACK lies, as far as I am concerned, with the robustness of the mathematical algorithms rather than its design (please note I only program in C++ and am FORTRAN illiterate - and wish to stay this way ;-)) ).
Nice talking to you, Thank you again,

Hi TimP,
As far as threading goes, I thought GOTO BLAS does address it and it is free (no?).
Thank you for bringing up the constraint of using MKL and TBB at the same time. I was not clear on this and this is a really useful piece of information (although somewhat dissappointing..),
It so happens that in my code, I do not mix the two (i.e. my TBB calls do not include MKL ones, by choice, and , of course, the opposite is simply not possible) -had in mind to look into it in the future, though.
My response earlier had to do with af's observation that I was potentially mixing uBLAS and MKL for BLAS calls,
which I don't - have my own tbb-based version of vector expressions for pure BLAS-like (i.e. not LAPACK-feeding) calls - luckily I do not need to multiply matrices that much.
Thank you for the intervention, since I did learn something trully valuable.

"What I meant is for example choices of in-place substitution of results
which was largely dictated by the fact that memory was way more
"expensive" , forcing the user to copy around instead of using
pre-allocated buffers."
On http://www.netlib.org/lapack/faq.html I read: "The original goal of the LAPACK project was to make the widely used
EISPACK and LINPACK libraries run efficiently on shared-memory vector
and parallel processors. On these machines, LINPACK and EISPACK are
inefficient because their memory access patterns disregard the
multi-layered memory hierarchies of the machines, thereby spending too
much time moving data instead of doing useful floating-point operations." Would this be consistent with a design constrained by expensive memory? How expensive was memory in 1992, LAPACK's first public release? Also, it has been demonstrated elsewhere that it is sometimes more efficient to copy data, operate on it, and copy back the result, as mentioned here, e.g., by Arch Robison (original architect of TBB), if I remember well.

>>Does this mean that this strategy cannot be improved for this size of problems? (typically nRows = 40 and nCols
~=1000 )

This is likely too small of a data set to divide up 8-ways (number of threads) and amortize the task/thread start/stop overhead. Can you re-think your problem such that you can have 8 of these 40x1000 (sub-)arrays worked on at once?

Alternative suggestion. Once you rearrange the data I imagine you intend to do some work with it. If so then see if you can code such that some threads rearrange more pieces than "some threads" and as each piece is rearranged perform (enqueue) the additional work using the remaining threads. IOW pipeline the operation (not necessarily using parallel_pipeline).

Jim Dempsey

Hi Raf,
Thank you for your response.
Let me state from the start that I am not qualified to talk on "multi-layered memory architectures" - this is getting to low in what actually happens for me to have an educated opinion.
Naively speking and from my experience, LAPACK calls are only a part of bigger operations, i.e. solving a PDE and doing something with the results at every (time)step of the solution. The allocation of data in other big data structures is then no longer optional. Benchmarking just the LAPACK call then is probably not enough.
Again, I do not know enough to have an educated opinion.
The same document, for example says abit layer how LAPACK can solve very efficiently, as a result, tridiagonal systems with many rhs vectors. In practice though, unless solving for Dirichlet type problems where the one-dimensional operators are homogeneous wrt therest of the dimensions i.e there is no mixing(a situation very common in ADI type methods), I have never encountered a useful application for this ability/strength. It is very possible that I don't deal with the kind of problems for which this would be useful, however, andtherefore have a limited view.
The Robinson result sounds interesting and would like to know more about its rational (is by any chance a pointer to the paper easy to produce?)
Finally, memory in 1992 was much more expensive than today. At the time, at least on PCs, it was not even shipped in GBs ;-)) - yes I realize that LAPACK was most likely not designed for usage with the PCs ;-)).
At this point, de facto will have tocopy and maybe I am luckyand thisis the best choice.\
Thank you for your comments,

Hi Jim,
I am thinking now along these lines.
I have come to realize that, typically in a 4 dimensional equation with 50 points in each dimension,
I would have 50 * 50^3 points and then,I think I would be fine.
For a 2-dimensional problem would be suboptimal, but maybe who cares.
Your suggestion is definitely the right way to design this but would prefer to avoid the coding -and maintenance- overhead of it.
The inquiry sprang from a toy example I ran and was "struck" from the fact that I could not "take advantage" of all the computing power I had. I guess I was much more naive before this discussion.
Thank you very much,
ps: will also look into these blog and site for eduactional reasons ;-))

In TBB it is relatively easy to perform an n-way fork/join using parallel_invoke. Consider writing a shell function that potentially recursively calls itself based on (remaining)problem size. A simplified shell function might simply split the problem in two if it it larger than a threshold. You might want to consider a switch statement with 2-way, 3-way, 4-way, ?-way splits which may/may-not reduce scheduling overhead (please experiment). The Fibbonacci sample program does this. **** but you would not want to solve the series using this method *** (there is an orders of magnitude faster serial method). When parallel_for isn't suitable consider using parallel_invoke with your own split code.

Jim Dempsey

"The Robinson result sounds interesting and would like to know more about
its rational (is by any chance a pointer to the paper easy to produce?)"
I thought I remembered Arch Robison (sic) mentioning something like that on this forum, but I only found this, which to me seems to be a lot more because of applicable sort algorithms than because of memory access reasons. Perhaps James has something to say about appropriate strategies on NUMA systems?

"Finally, memory in 1992 was much more expensive than today."
I do recall choosing a computer configured with 8 MB rather than the maximum around that time (later I splurged for an upgrade to 20 MB), but it was still a serial system, so I don't know what the issues were on the hardware supposedly considered for LAPACK, and how big a typical problem really is compared to the available memory. But here is where I should probably leave the discussion to any others with the relevant knowledge and experience.

Hi Jim,
thank you very much for the suggestion - was not aware of parallel_invoke (probably have an older version of the book). Will look into it.
Thank you again,

Apologies for misspelling Robison's name..(from him too, actually)
Thank you for your help,

"The Fibbonacci sample program does this. **** but you would not want to
solve the series using this method *** (there is an orders of magnitude
faster serial method)."
You can choose anything from exponential down to logarithmic staying fully in the integral domain, and Binet's formula looks even faster (although I suspect that in practice it isn't).

But I mainly added this reply because I just remembered strings as an example of where it's often faster to copy than to use copy-on-write, although that's because of the cost of the atomic operations for maintaining the reference count on a certain very popular processor architecture (it eludes me why weaker but cheaper operations have not been added yet), not because of memory access characteristics (either).

Jim, I repeat my question: would at least NUMA sometimes mandate explicit copying? I expect there should be at least some overhead related to the chatter of keeping caches coherent when not using private copies even for read-only memory regions (which the processors wouldn't generally know without heavy manipulations of virtual-memory tables), but I have no idea how significant it would be and whether it would pay off to avoid it.

>>Jim, I repeat my question: would at least NUMA sometimes mandate explicit copying?

When your program performs write to memory the write alters cache and RAM (excepting for streaming store which alters only L1 and RAM as opposed to including last level cache). NUMA systems can be hierarchical and as such the memory your program accesses might be within the current NUMA node, one node distance, two nodes distant, ... The further the NUMA node is from the requesting processor the longer the request becomes.

When your data access is predominantly: repeated read-only in cache sized regions
Then copying is of little value (unless the copy is from scattered to contiguous)

When your data access is predominantly: repeated read/modify/write
Then copying may have value.

This does make the requirement that the other threads avoid using this data when it is not residing in its "usual" locations.

RE: which the processors wouldn't generally know without heavy manipulations of virtual-memory tables

This is usualy not material since the virtual memory tables are process related as opposed to processor related.

Jim Dempsey

Leave a Comment

Please sign in to add a comment. Not a member? Join today