Problems with spin_mutex lock for matrix assembly

Hi, 

   I am using the latest release of TBB to parallelize my C++ finite element analysis application. I am having trouble with the matrix assembly part, where each thread calculates a component of the global matrix and adds it to the global matrix (i.e., each thread writes to a global variable). To serialize these writes, I am using a spin_mutex lock. Unfortunately, every once in a while one of the threads' contributions to the matrix goes missing, and the numerical computations blow up.

Following is the structure of my code. I think I am using the spin_mutex correctly, but the problem is intermittent: sometimes it appears after two iterations, sometimes after 100 or so (and everything in between). I would appreciate it if someone could comment on whether there is an error in my use of the mutex, or whether this is a known issue with a workaround.

I am using this on Mac OS 10.8.2. The hardware is: MacBook Air with 1.7 GHz Intel Core i5 with 4GB 1333 MHz DDR3 RAM.

Thanks,

Manav

 

Update: Found this to be an error in my code. Fixed and working fine now!

 

tbb::spin_rw_mutex assembly_mutex;

class AssembleElementMatrices
{
public:
    AssembleElementMatrices(const std::vector<FESystem::Mesh::ElemBase*>& e,
                            FESystem::Numerics::VectorBase<FESystemDouble>& r,
                            FESystem::Numerics::MatrixBase<FESystemDouble>& stiff):
    elems(e),
    residual(r),
    global_stiffness_mat(stiff)
    {    }

    void operator() (const tbb::blocked_range<FESystemUInt>& r) const
    {
        FESystem::Numerics::DenseMatrix<FESystemDouble> elem_mat;
        FESystem::Numerics::LocalVector<FESystemDouble> elem_vec;
        for (FESystemUInt i=r.begin(); i!=r.end(); i++)
        {
            // code to calculate elem_vec and elem_mat
            {
                // acquire as writer
                tbb::spin_rw_mutex::scoped_lock my_lock(assembly_mutex, true);
                // adds elem_vec to appropriate locations in the residual vector
                dof_map.addToGlobalVector(*(elems[i]), elem_vec, residual);
                // adds elem_mat to appropriate locations in global_stiffness_mat
                dof_map.addToGlobalMatrix(*(elems[i]), elem_mat, global_stiffness_mat);
            }
        }
    }

protected:
    const std::vector<FESystem::Mesh::ElemBase*>& elems;
    FESystem::Numerics::VectorBase<FESystemDouble>& residual;
    FESystem::Numerics::MatrixBase<FESystemDouble>& global_stiffness_mat;
};

void calculateQuantities()
{
    const std::vector<FESystem::Mesh::ElemBase*>& elems = mesh.getElements();
    tbb::parallel_for(tbb::blocked_range<FESystemUInt>(0, elems.size()),
                      AssembleElementMatrices(elems, residual, global_stiffness_mat));
}


>>...I think I am using the spin_mutex correctly, but the the problem is random...

That thread has gone unanswered for more than two days, and I wonder if you could provide more details. If you post the source of a complete test case that reproduces your random error, somebody (for example, me) will take a look at it. Does that make sense?

Best regards,
Sergey

This could have been positioned more prominently: "Update: Found this to be an error in my code. Fixed and working fine now!"

>>..."Update: Found this to be an error in my code. Fixed and working fine now!"

That is possible. Anyway, it would be nice to hear from 'Manav B.'.

Hi Sergey and Raf,

Thanks for your message. I agree that I could have made the update more prominent.

The problem in the code was that I was unintentionally using a global matrix as scratch space in every thread. Each thread would modify it during its sequence of computations, and those modifications would be read by another thread that expected to find different values in the matrix.

It took a few days to identify this, but once rectified, the code works beautifully.

Thanks,
Manav

Hi Manav,

>>...It took a few days to identify this, but once rectified, the code works beautifully...

I have a couple of questions:

- How big is your matrix?
- Could you provide some performance numbers? ( Please provide technical details like CPU, operating system, size of the matrix, etc )
- Did you have a chance to test matrix multiplication with TBB?

Thanks in advance.

>>...Please provide technical details like CPU, operating system, size of the matrix, etc...

Sorry, I see your data:

>>...I am using this on Mac OS 10.8.2. The hardware is: MacBook Air with 1.7 GHz Intel Core i5 with 4GB 1333 MHz DDR3 RAM...

What about a size of the matrix?

The matrix size in this case is 16x16. Each thread, however, has several of such matrices that are needed for the computations.
I have also used TBB to parallelize my LU decomposition solver, which I frequently use for sparse matrices with on the order of a few hundred thousand unknowns.

I have not done a parallelization efficiency benchmarking yet, but intend to finish that in the coming days.

Manav

Thank you for additional details.

>>...The matrix size in this case is 16x16. Each thread, however, has several of such matrices that are needed for the computations.

I think 16x16 matrices are too small to be worth parallelizing on their own. What about the overhead related to threads (context switches, etc.), and how many threads do you create in total?

>>...I have not done a parallelization efficiency benchmarking yet, but intend to finish that in the coming days.

Would you be able to compare the performance of the serial and multi-threaded versions? I wouldn't be surprised if a serial version (if one exists) outperforms the multi-threaded one.

Best regards,
Sergey

Hi Sergey,

I just uploaded a PDF file that shows the speedup I was able to obtain. This was done on a Mac Pro with 2 x 3.06 GHz 6-core Intel Xeon and 32 GB of 1333 MHz DDR3 RAM, running OS X 10.8.2. So, with Hyper-Threading, the machine exposes 24 logical cores, but the speedup saturates at 12 threads (expected, I think, since there are only 12 physical cores).

I have parallelized two separate blocks of my code. The first block does matrix assembly where each thread calculates one matrix for each element assigned to it and then adds it to a global matrix. Each thread in this block owns a few matrices and vectors of dimension 16.

The second block that I have parallelized is the LU decomposition solver where I make each thread operate on a set of rows independently.

Manav

Attachment: parallel-speedup.pdf (27.93 KB)

Consider using two mutexes, so that updates to the residual vector and to the stiffness matrix do not serialize against each other:

{
    tbb::spin_rw_mutex::scoped_lock my_lock(residual_mutex, true);
    // adds elem_vec to appropriate locations in the residual vector
    dof_map.addToGlobalVector(*(elems[i]), elem_vec, residual);
}
{
    tbb::spin_rw_mutex::scoped_lock my_lock(stiffness_mutex, true);
    // adds elem_mat to appropriate locations in global_stiffness_mat
    dof_map.addToGlobalMatrix(*(elems[i]), elem_mat, global_stiffness_mat);
}

Jim Dempsey

www.quickthreadprogramming.com
