Help required in parallel_for loop

Hi,

I am new to TBB and am working on parallelizing my existing code.

I could easily parallelize it with OpenMP, but we need to compare the performance of our code under both TBB and OpenMP, so I tried parallelizing it with TBB as well. I am getting errors that I am not able to resolve; please help me with them. My code is below: it just uses a parallel_for loop and a lambda function. I have included the serial, OpenMP and TBB variants. Please look at the code and tell me what else I should change for TBB to work.

        case openmp:
        {
            #pragma omp parallel for private (iter, currentDB, db)
            for (iter = 1; iter < numDB; iter++)
            {

                currentDB = this->associateDBs->GetAssociateDB(iter);
                db = this->dbGroup.getDatabase( currentDB );
                GeoRanking::GeoVerifierResultVector  resLocal;
                db->recog( fg, InternalName, resLocal );
                LOG(info,omp_get_thread_num()) << "Thread : "<<"currentDB :" <<currentDB<< "No of Res Matches: "<<resLocal.getNumberOfMatchesFound()<<"Match Names :"<<resLocal.getMatch();
                #pragma omp critical
                res.push_back(resLocal);
            }
        }
        break;
        case serial:
        {
            for (iter = 1; iter < numDB; iter++)
            {
                currentDB = this->associateDBs->GetAssociateDB(iter);
                db = this->dbGroup.getDatabase( currentDB );
                db->recog( fg, InternalName, res );
            }
        }
        break;
        case tbb:
#ifdef USING_TBB
        {
            size_t GRAIN_SIZE = 10; //boost::thread::hardware_concurrency();
            // tbb::blocked_range<size_t> range( 0, numDB, GRAIN_SIZE);
            // tbb::parallel_for( range, fg, InternalName, res);
            parallel_for( tbb::blocked_range<size_t>(0, numDB, GRAIN_SIZE),
                [&](const tbb::blocked_range<size_t>& r) -> void {
                    for (size_t iter = r.begin(); iter != r.end(); iter++)
                    {
                        std::string currentDB = this->associateDBs->GetAssociateDB(iter);
                        DatabaseAccessor_ptr db = this->dbGroup.getDatabase( currentDB );
                        GeoRanking::GeoVerifierResultVector  resLocal;
                        db->recog( fg, InternalName, resLocal );
                        res.push_back(resLocal);
                    }
                });
        }
#endif
        break;

The error I am getting is below:

/home/girijag/ripe/src/index/forward_db/ForwardDatabaseAccessor.cpp:410:5: error: no matching function for call to ‘parallel_for(tbb::blocked_range<long unsigned int>, indexing::forward::ForwardDatabaseAccessor::recog(cv::Mat, features::Camera::Type, const string&, GeoRanking::GeoVerifierResultVector&, boost::shared_ptr<features::FeatureGroup>&) const::<lambda(const tbb::blocked_range<long unsigned int>&)>)’
/home/girijag/ripe/src/index/forward_db/ForwardDatabaseAccessor.cpp:410:5: note: candidates are:
/usr/local/include/tbb/parallel_for.h:215:6: note: template<class Index, class Function> void tbb::strict_ppl::parallel_for(Index, Index, Index, const Function&)
/usr/local/include/tbb/parallel_for.h:228:6: note: template<class Index, class Function> void tbb::strict_ppl::parallel_for(Index, Index, const Function&)
/usr/local/include/tbb/parallel_for.h:235:6: note: template<class Index, class Function> void tbb::strict_ppl::parallel_for(Index, Index, Index, const Function&, tbb::task_group_context&)
/usr/local/include/tbb/parallel_for.h:248:6: note: template<class Index, class Function> void tbb::strict_ppl::parallel_for(Index, Index, const Function&, tbb::task_group_context&)
/usr/local/include/tbb/parallel_for.h:204:6: note: template<class Range, class Body> void tbb::parallel_for(const Range&, const Body&, tbb::affinity_partitioner&, tbb::task_group_context&)
/usr/local/include/tbb/parallel_for.h:197:6: note: template<class Range, class Body> void tbb::parallel_for(const Range&, const Body&, const tbb::auto_partitioner&, tbb::task_group_context&)
/usr/local/include/tbb/parallel_for.h:190:6: note: template<class Range, class Body> void tbb::parallel_for(const Range&, const Body&, const tbb::simple_partitioner&, tbb::task_group_context&)
/usr/local/include/tbb/parallel_for.h:182:6: note: template<class Range, class Body> void tbb::parallel_for(const Range&, const Body&, tbb::affinity_partitioner&)
/usr/local/include/tbb/parallel_for.h:175:6: note: template<class Range, class Body> void tbb::parallel_for(const Range&, const Body&, const tbb::auto_partitioner&)
/usr/local/include/tbb/parallel_for.h:168:6: note: template<class Range, class Body> void tbb::parallel_for(const Range&, const Body&, const tbb::simple_partitioner&)
/usr/local/include/tbb/parallel_for.h:161:6: note: template<class Range, class Body> void tbb::parallel_for(const Range&, const Body&)
/home/girijag/ripe/src/index/forward_db/ForwardDatabaseAccessor.cpp: At global scope:
/home/girijag/ripe/src/index/forward_db/ForwardDatabaseAccessor.cpp:464:6: warning: unused parameter ‘abortIndexing’ [-Wunused-parameter]
make[2]: *** [index/forward_db/CMakeFiles/forward_db.dir/ForwardDatabaseAccessor.cpp.o] Error 1
make[1]: *** [index/forward_db/CMakeFiles/forward_db.dir/all] Error 2

Hello, a couple of questions:

1. What is numDB?
2. Why did you decide to use "[&](const tbb::blocked_range<size_t>& r )->void {" instead of "[=](const tbb::blocked_range<size_t>& r )->void {" for const variables?

--Vladimir

Hi, thanks for your response :)

numDB is the number of DBs we iterate over in the for loop; we need to parallelize the search across those DBs. My function recog uses variables that are defined outside the for loop, hence I used & in the lambda expression.

Actually, I am not sure whether that code parallelizes my for loop over those numDB databases. I am very new to TBB. The chunk of code within the lambda function needs to be parallelized, and I am not sure if my code is doing that.
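
To illustrate the capture question: the lambda appends to res, which lives outside it, so res has to be captured by reference for the results to reach the caller. Here is a minimal sketch in plain C++11, with ints standing in for the result objects; the function names are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the loop body: append one result per index.
int count_with_reference_capture(int n) {
    std::vector<int> res;
    // [&] makes the lambda operate on the caller's res.
    auto body = [&](int i) { res.push_back(i); };
    for (int i = 0; i < n; ++i) body(i);
    return static_cast<int>(res.size());
}

int count_with_copy_capture(int n) {
    std::vector<int> res;
    // [=] copies res into the closure. The copy is const unless the lambda
    // is marked mutable, and changes to it never reach the outer vector.
    auto body = [=](int i) mutable { res.push_back(i); };
    for (int i = 0; i < n; ++i) body(i);
    return static_cast<int>(res.size());  // the outer res is untouched
}
```

Calling count_with_reference_capture(5) returns 5, while count_with_copy_capture(5) returns 0. Note that capturing by reference also means every parallel invocation shares the same res, so concurrent writes to it need protection.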

A few comments:

  • numDB and GRAIN_SIZE are important values. The grain size must represent a small enough amount of work: if numDB is 20, a grain size of 1 or 2 would be better. One range should represent about 10,000 to 20,000 CPU cycles.
  • In your case, I can see another problem. Accessing a database typically involves connections, mutexes and other operations that travel through the kernel and might block. In general, it is bad practice to perform operations like that in a TBB thread. Use a custom thread pool instead.
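
To make the grain size point concrete, here is a rough simulation of how a blocked_range gets divided. It is a sketch in plain C++ with no TBB dependency, and it assumes that a range is divisible while its size exceeds the grain size, that it is split at the midpoint, and that every divisible range is actually split (which is what simple_partitioner does; auto_partitioner may stop earlier). The helper names split_range and chunks_for are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [first, second), like blocked_range's begin()/end().
using Chunk = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: recursively halve [begin, end) while it is larger
// than the grain size, collecting the ranges that are no longer divisible.
void split_range(std::size_t begin, std::size_t end, std::size_t grain,
                 std::vector<Chunk>& out) {
    assert(grain > 0);
    if (end - begin > grain) {
        std::size_t middle = begin + (end - begin) / 2;
        split_range(begin, middle, grain, out);
        split_range(middle, end, grain, out);
    } else {
        out.push_back(Chunk(begin, end));
    }
}

// Returns the chunks a loop over [0, n) would be handed with this grain size.
std::vector<Chunk> chunks_for(std::size_t n, std::size_t grain) {
    std::vector<Chunk> out;
    if (n > 0) split_range(0, n, grain, out);
    return out;
}
```

Under these assumptions, chunks_for(22, 10) yields four chunks ([0,5), [5,11), [11,16), [16,22)), so at most four workers can be busy at once, while chunks_for(22, 1) yields 22 single-iteration chunks.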

Can you please let me know how to implement a custom thread pool in TBB, or point me to an article that explains it with examples?

Unfortunately, TBB is not designed to execute that kind of task. TBB algorithms work with the internal TBB scheduler, over which we do not have much control.

When I suggest a custom thread pool, I mean using another library or implementing one yourself with native calls. Boost might help here.

Actually, we also don't know how many DBs we will have in the future, so if the TBB scheduler can handle thread creation internally, that is better for us, right? Otherwise we have to manage thread creation ourselves. Numerous threads are created in our application, and it is difficult for us to decide how many to create for maximum utilization of the cores. Using OpenMP, I have parallelized the code above and it showed a significant speedup of the search in those DBs. Can we use TBB to further increase the speed? Do parallel_for loops in TBB not help with running the image search over the various DBs in parallel?

I can see two aspects to this question. First, how to change your parallel OpenMP code to parallel TBB code. Second, is this really an efficient way of doing it?

As for the first question, you can actually use parallel_for, but you have to be careful about selecting the partitioner and grain size. Just like OpenMP, TBB tries to bundle iterations together to reduce overhead. If the number of databases you want to try is close to the number of threads in your machine, or each database query takes a considerable amount of work, you probably don't want the bundling to happen. In that case use simple_partitioner (to prevent parallel_for from creating those bundles automatically) and a grain size of 1. You could also consider parallel_for_each.

As for efficiency, the main question is: what does db->recog really do? Does it connect to a database server, or is the "database" actually just some library that runs within your process? In the first case, you may want to start more queries than the number of threads in the TBB thread pool. This is tricky and not efficient. Basically, making such requests is the one situation where TBB is usually not a good idea. By the way, the same is true for the OpenMP-based solution. However, if the "database" is just a locally executed engine, then this solution may be a good one and work efficiently.

Hi, thanks for your reply.

Databases in my app are not databases like SQL or Oracle but KD trees. Hence yes, the database is a locally executed engine, just like a library that runs within my process. The db->recog function searches my database, which is a tree structure, multiple times and returns the recognized image that is similar or identical. Overall, my recog function takes 7.045 secs with serial searches, while with OpenMP it takes 0.58 secs. I want to test it with TBB and check whether it can take even less time to recognize an image.

Now, since I am new to TBB, I want to know why my error says there is no matching function for parallel_for.

I have included the following header files:

#include <tbb/parallel_for.h>
#include "tbb/task_scheduler_init.h"

and then changed the serial for loop to a parallel TBB for loop as below:

#ifdef USING_TBB
        {
            size_t GRAIN_SIZE = 1; //boost::thread::hardware_concurrency();
            parallel_for( tbb::blocked_range<size_t>(0, numDB, GRAIN_SIZE),
                [&](const tbb::blocked_range<size_t>& r) -> void {
                    for (size_t iter = r.begin(); iter != r.end(); iter++)
                    {
                        std::string currentDB = this->associateDBs->GetAssociateDB(iter);
                        DatabaseAccessor_ptr db = this->dbGroup.getDatabase( currentDB );
                        GeoRanking::GeoVerifierResultVector  resLocal;
                        db->recog( fg, InternalName, resLocal );
                        res.push_back(resLocal);
                    }
                });
        }
#endif
        break;

As of now my machine is a 32-core machine and I have 22 DBs for test purposes, so I can use a grain size of 1, right?

I need 22 parallel iterations since I have 22 DBs, so I have used 0 to numDB as the range.

Now please tell me whether I have to do anything else apart from the calls I have in that case statement.

Why do I get errors such as "no matching function for call"?

Do I have to make other code changes? I would be grateful if you could tell me what else I should change to make my TBB code work.

Hey, my previous code is compiling now :)

The only changes I made were adding C++11 support in my CMake file and upgrading GCC to version 4.7 :)

Thanks, all.

Now another question :)

In OpenMP I am using #pragma omp critical. What should I use in TBB to protect that section: a critical section, a semaphore, or a mutex?

Hi all,

I tested the code below, which is now parallelized from the serial version.

It seems TBB is faster than OpenMP: my OpenMP code takes 0.955 secs for a database search, while my TBB code takes 0.75 secs.

Below I have pasted all three cases: serial, OpenMP and TBB. Please tell me what further optimization I could do in my TBB code to increase the speed. I might not be using the proper function calls. For example, is a spin lock mutex right, or can I use something else? And is parallel_for better, or parallel_for_each?

Also, can I use dynamic containers for the vector res instead of protecting it with a mutex and locks? Would that increase the speed?

As of now I am using a grain size of 1, since mine is a 32-core machine and I have 28 DBs in the test setup. In a real scenario I can have any number of DBs, with 10,000 images in one DB (a DB is a KD tree here). Recog time for images would be 0.23 secs. How should I set my grain size if I am not sure how many DBs I will be creating? What is the optimal grain size on a 32-core machine?

Please give your suggestions on this.

    // get features for recog
    fg = this->featureFactory->createFeaturesRecog(image, cameraType);
    ForwardDatabaseAccessor::typeFlag tFlag= ForwardDatabaseAccessor::openmp;
    if (numDB > 0)
    {
        std::string currentDB;
        DatabaseAccessor_ptr db;
        size_t iter;
        switch (tFlag)
        {
        case openmp:
        {
            #pragma omp parallel for private (iter, currentDB, db)
            for (iter = 0; iter < numDB; iter++)
            {
                currentDB = this->associateDBs->GetAssociateDB(iter);
                db = this->dbGroup.getDatabase( currentDB );
                GeoRanking::GeoVerifierResultVector  resLocal;
                db->recog( fg, InternalName, resLocal );
                LOG(info,omp_get_thread_num()) << "Thread : "<<"currentDB :" <<currentDB<< "No of Res Matches: "<<resLocal.getNumberOfMatchesFound()<<"Match Names :"<<resLocal.getMatch();
                #pragma omp critical
                res.push_back(resLocal);
            }
        }
        break;
        case serial:
        {
            for (size_t iter = 0; iter < numDB; iter++)
            {
                currentDB = this->associateDBs->GetAssociateDB(iter);
                db = this->dbGroup.getDatabase( currentDB );
                db->recog( fg, InternalName, res );
            }
        } //serial case loop closed
        break;
        case tbb:
#ifdef USING_TBB
        {
            size_t GRAIN_SIZE = 1; //boost::thread::hardware_concurrency();
            parallel_for( tbb::blocked_range<size_t>(0, numDB, GRAIN_SIZE),
                [&](const tbb::blocked_range<size_t>& r) -> void {
                    for (size_t iter = r.begin(); iter != r.end(); iter++)
                    {
                        std::string currentDB = this->associateDBs->GetAssociateDB(iter);
                        DatabaseAccessor_ptr db = this->dbGroup.getDatabase( currentDB );
                        GeoRanking::GeoVerifierResultVector  resLocal;
                        db->recog( fg, InternalName, resLocal );
                        ReductionMutex::scoped_lock lock(minMutex);
                        res.push_back(resLocal);
                    }
                });
        }
#endif
        break;

        }
    }

Since it seems that db->recog takes a significant amount of CPU time, the rest is not so important, as long as you make sure that each parallel invocation runs just one db->recog. Note that this may not be the case in your last example, since you don't specify the partitioner in your parallel_for call. The default is auto_partitioner, which may provide multiple objects to your functor (lambda). I think it won't in this case, but to be sure, you may want to specify simple_partitioner. This way, you shouldn't have problems with a small number of databases. On the other hand, since the database access is long enough, it should easily beat the recommended 10k instructions per task, so having a larger grain size probably won't improve the total performance much, even if you have a lot of databases.

I don't think switching from parallel_for to parallel_for_each will have a measurable impact on the performance in this case.

You could get rid of the locking by switching from std::vector to tbb::concurrent_vector. But in this case, it may be even better to stick with the std::vector, expand it to numDB elements and then store the result using res[iter]=resLocal.
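
The pre-sized vector idea can be sketched as follows. This is a minimal illustration in standard C++ only: std::thread stands in for the TBB workers so that the example has no library dependency, and fake_recog is a hypothetical placeholder for db->recog.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for db->recog: one result per database index.
int fake_recog(std::size_t iter) { return static_cast<int>(iter * iter); }

std::vector<int> search_all(std::size_t numDB) {
    std::vector<int> res(numDB);  // one pre-allocated slot per database
    std::vector<std::thread> workers;
    workers.reserve(numDB);
    for (std::size_t iter = 0; iter < numDB; ++iter) {
        // Each worker writes only res[iter], so no two threads touch the
        // same element and no lock is needed.
        workers.emplace_back([&res, iter] { res[iter] = fake_recog(iter); });
    }
    for (std::thread& t : workers) t.join();
    return res;
}
```

Because the vector is never resized while the workers run, writes to distinct elements do not conflict. A side effect is that the results end up in database order rather than in completion order.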

Hi, thanks for your reply.

How do I determine the value for auto_partitioner? For example, if my number of DBs is 30 and I am using a 32-core machine, what should the value of the partitioner be?

I think if you leave the grain size at 1 and add tbb::simple_partitioner() as the third parameter of parallel_for, you should be fine. You could spend some extra work to reduce your overhead, but I think the benefits in this case will be negligible. Getting rid of the lock (mutex) may be a better idea, since the lock would be bad for scalability with a high degree of parallelism (a high number of cores).

Can you please explain what this auto_partitioner is and how we decide on its value? Also, what is grain_size, and how do we arrive at a value for it? Please explain in simple terms; I did not get what you meant by the following line in your previous answer: "The default is auto_partitioner, which may provide multiple objects to your functor (lambda)." It is nice to understand what I am trying to do :) I am very new to TBB and multithreading, so please help me understand this :)

As a new TBB user, you might find it beneficial to read some of the provided documentation, like http://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm, the Intel TBB user's guide, which has a section explaining partitioners in Intel Threading Building Blocks.  In brief, the parallel_for construct is truly parallel, meaning it doesn't just iterate through the list, it breaks the list up, partitions it, in order to share the work among multiple workers.  How that partitioning is done is determined by the partitioner being used, which by default is the auto-partitioner.

Looking briefly at your code, it employs a scoped lock that has been named ReductionMutex, somewhat of a misnomer since there is no reduction in the usual sense being performed here.  Instead each iteration of the loop hits the mutex, serializing the code.  Since every database request seems to produce a result, you might actually want to try a parallel_reduce, though the end result is a list of query results rather than a scalar, so you may have to do some special handling to catenate sublists together in the reduction phase.  Or you may want to consider using a thread local storage solution to have each worker accumulate a set of results and then gather those together at the end. These generally are a little hairier to implement but could eliminate the per-loop-iteration mutex call.  
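
The thread-local accumulation idea might look roughly like this. It is a sketch in standard C++ only, with std::thread standing in for the TBB workers and a hypothetical fake_recog_local in place of db->recog; in TBB itself, tbb::enumerable_thread_specific or tbb::combinable would be the natural tools for the per-worker buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for db->recog.
int fake_recog_local(std::size_t iter) { return static_cast<int>(iter) + 100; }

std::vector<int> search_with_local_buffers(std::size_t numDB,
                                           std::size_t numWorkers) {
    assert(numWorkers > 0);
    std::vector<std::vector<int>> local(numWorkers);  // one private buffer per worker
    std::vector<std::thread> workers;
    workers.reserve(numWorkers);
    for (std::size_t w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&local, w, numDB, numWorkers] {
            // Worker w handles indices w, w + numWorkers, ... and appends
            // only to its own buffer, so the loop body takes no lock.
            for (std::size_t iter = w; iter < numDB; iter += numWorkers)
                local[w].push_back(fake_recog_local(iter));
        });
    }
    for (std::thread& t : workers) t.join();

    // Serial gather step: concatenate the per-worker lists once, at the end.
    std::vector<int> res;
    for (const std::vector<int>& part : local)
        res.insert(res.end(), part.begin(), part.end());
    return res;
}
```

The per-iteration mutex disappears; the only serial part left is the final concatenation, which runs once per worker rather than once per database.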

Finally, I would expect if the lookup costs in your KD trees are pretty uniform (which they should be as long as the trees are fairly balanced), then the OpenMP code should probably win, having less setup overhead than Intel TBB; however if the individual lookups vary wildly in their times, the better load balancing for dynamic work inherent in Intel TBB should enable it to win.  I also note that the OpenMP example seems to have more serial code, both the critical section for the "reduction" and what looks like an I/O call to log progress.  Could it be the case that your OMP code is just spending more time waiting for locks?  What happens if you comment out the "LOG" line?
