Seeing slow performance with ALLOCATE and DEALLOCATE in OpenMP app


andyb123

I'm working on a Fortran application that uses OpenMP fairly extensively to utilize threads. One part of the code executes a lot of ALLOCATE and DEALLOCATE statements in fairly quick succession for lots of small blocks. All threads will likely be doing the same, since it is inside an "!$omp parallel do" loop. I know this may not be ideal, but it is currently unavoidable. However, when I profile with VTune, I see a large time reported for "for_allocate" and "for_deallocate" inside "libifcoremt.a" and, by comparison, significantly less time in the actual libc allocation routines (roughly an order of magnitude difference), which is not what I would expect. Has anyone else seen this sort of behaviour?

Digging a little further with VTune suggests that almost all of the time sits with a single memory read instruction, preceded by a write to an adjacent location. My suspicion is that if all threads are doing this, it constitutes classic false sharing: the cache line bounces between caches, hence the delay. Checking the symbols, it looks like the code probably does something like:

for__protect_cm_ops = 0; if (for__protect_signal_ops == 1) { ...

Does this sound plausible? And if so, is there any way I can work around this? I assume these are flags for something?
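To illustrate the false-sharing pattern suspected above, here is a minimal sketch that is independent of the Intel runtime. It does not reproduce what "for_allocate" actually does. It only compares threads updating adjacent counters with threads updating counters spaced 64 bytes apart. The program name, the array names, the iteration count and the 64-byte cache-line size are all assumptions made for illustration.

```fortran
program false_sharing_sketch
  use omp_lib
  implicit none
  ! Assumption: 4-byte default integers and a 64-byte cache line.
  integer, parameter :: pad = 16
  integer, parameter :: niter = 50000000
  ! VOLATILE keeps the compiler from holding the counters in registers.
  integer, allocatable, volatile :: packed(:), padded(:,:)
  integer :: nthreads, tid, k
  double precision :: t0, t1

  nthreads = omp_get_max_threads()
  allocate(packed(nthreads), padded(pad, nthreads))
  packed = 0
  padded = 0

  ! Case 1: adjacent counters, so neighbouring threads write to the same cache line.
  t0 = omp_get_wtime()
  !$omp parallel private(tid, k)
  tid = omp_get_thread_num() + 1
  do k = 1, niter
    packed(tid) = packed(tid) + 1
  end do
  !$omp end parallel
  t1 = omp_get_wtime()
  print *, 'adjacent counters: ', t1 - t0, ' s'

  ! Case 2: each thread's counter sits pad integers (64 bytes) from the next one.
  t0 = omp_get_wtime()
  !$omp parallel private(tid, k)
  tid = omp_get_thread_num() + 1
  do k = 1, niter
    padded(1, tid) = padded(1, tid) + 1
  end do
  !$omp end parallel
  t1 = omp_get_wtime()
  print *, 'padded counters:   ', t1 - t0, ' s'

  deallocate(packed, padded)
end program false_sharing_sketch
```

If the first timing comes out noticeably larger than the second when run with several threads, that is consistent with cache lines bouncing between cores. The actual numbers depend on the hardware and compiler options.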

Thanks, Andy.

Tim Prince

If your ALLOCATE and DEALLOCATE calls behave as if they contain internal critical regions, and the blocks can all be the same (sufficiently small) size, would it help to allocate outside the parallel region and give the array a PRIVATE designation so that each thread gets its own copy?
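A minimal sketch of this suggestion, assuming a fixed block size and a compiler that follows the OpenMP 3.0 (or later) rule that a PRIVATE allocatable array which is allocated on entry gives each thread its own copy with the same bounds. The program name, "work", "blk" and the loop body are hypothetical placeholders, not taken from the original application.

```fortran
program per_thread_scratch
  implicit none
  integer, parameter :: n = 1000     ! hypothetical iteration count
  integer, parameter :: blk = 64     ! hypothetical fixed block size
  real, allocatable :: work(:)
  real :: total
  integer :: i

  total = 0.0

  ! Allocate once, outside the parallel region.
  allocate(work(blk))

  ! PRIVATE(work): each thread gets its own uninitialised copy of the same
  ! shape, so there is no ALLOCATE or DEALLOCATE inside the loop body.
  !$omp parallel do private(work) reduction(+:total)
  do i = 1, n
    work = real(i)                   ! reuse the per-thread scratch block
    total = total + sum(work)
  end do
  !$omp end parallel do

  deallocate(work)
  print *, 'total = ', total
end program per_thread_scratch
```

The private copies are still created by the runtime at the start of the parallel region, but that happens once per thread rather than once per iteration.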

jimdempseyatthecove

Alternative:

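type(Node), allocatable, target :: Nodes(:)
type(Node), pointer :: aNode
...
! Nodes is not allocated here.
! Use FIRSTPRIVATE here (there is an issue with PRIVATE and an unallocated array).
!$omp parallel FIRSTPRIVATE(Nodes), PRIVATE(i, iNode, nNodes, aNode, ...)
  ! Each thread allocates its own pool once, before the loop.
  nNodes = WorstCaseNumberOfNodesForThisThread()
  allocate(Nodes(nNodes))
  iNode = 0
  !$omp do
  do i = 1, Whatever
    ! Replaces allocate(aNode): take the next node from this thread's pool.
    iNode = iNode + 1
    if (iNode .gt. nNodes) STOP ! pool exhausted: fix code
    aNode => Nodes(iNode)
    ! Now use aNode as you did before
    ...
  end do
  !$omp end do
  ! Release the per-thread pool once, after the loop.
  deallocate(Nodes)
!$omp end parallel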

Jim Dempsey

 

www.quickthreadprogramming.com
