July 6, 2009 8:35 AM PDT
[OMP] Threadprivate common dramatically decreases performance
Dear Intel forumers,
I recently added THREADPRIVATE directives to some of my Fortran COMMON blocks so that the variables they contain are private to each thread. These commons contain quite large variables. Dynamic threading is set to OFF and I use the same number of threads in all parallel regions.
I noticed that, when I set the number of threads to 1, the performance of the program with and without the threadprivate attribute is very different (30-40% slower with threadprivate when running the optimized (O2) version of the code).
I have carefully read the documentation and I don't really understand why. Under these conditions, I thought the memory was allocated once, when the threadprivate common is first used in the first parallel region, and stayed "alive" for the whole program execution. So the cost of threadprivate should be paid only once, at the beginning. That does not seem to be the case. Could you tell me more about it? Are there options to optimize the use of threadprivate?
When you compile with OpenMP enabled .AND. specify one thread, the code generated will still use the threadprivate access methods for dereferencing threadprivate variables (a few extra instructions per dereference).
Therefore any slowdown due to threadprivate will be visible in the one-thread run of an OpenMP-compiled application that uses threadprivate variables. If you instead observe a slowdown when going from one thread (in the multithreaded build) to multiple threads, suspect adverse memory interactions (cache evictions), or perhaps a parallelized short loop that cannot amortize the thread start/stop overhead cost-effectively.
You would expect to pay the cost of copying data in and out of threadprivate at the beginning and end of each parallel region, which could be substantially more than the bare allocation cost. I don't know how the outdated compiler would affect this, except that it doesn't include the current version of the OpenMP library.
Dear Tim, thank you for your response. I haven't set any option to activate copying of threadprivate variables at the end of a parallel region. In fact, the threadprivate variables are just a set of temporary variables shared across some subroutines, and they are not needed after the parallel region ends. In this situation, are there default copies at the end of a parallel region?
This coding approach is debatable. In fact, I am trying to parallelize some big old Fortran codes with a minimum of changes to their structure.
Tim,
The shared (local) data (scalars or descriptor) would be copied (as hidden dummy arguments) but the threadprivate data persists (no copying required).
Can you supply a small code example?
Thank you for your response.
This is the idea of the global structure:
toto.h: COMMON /TEST/ TMPVAR(A_BIG_NUMBER)
!$OMP THREADPRIVATE (/TEST/)
first_routine.f
INCLUDE 'toto.h'
... work on TMPVAR in subroutines...
second_routine.f
!$OMP PARALLEL
CALL first_routine(...)
!$OMP END PARALLEL
third_routine.f
CALL first_routine(...)
In this structure, first_routine can be called inside or outside a parallel region. Since first_routine uses a common declared threadprivate, how will this common behave when used outside a parallel region? Can I organize the code in another way?
The results are correct, but the performance is not good. I noticed (to be confirmed) that when I remove my parallel regions but leave the threadprivate commons in place, the code remains slow.
Since I use OpenMP 2, maybe some improvements have been made in OpenMP 3 (I will receive IFORT 11 soon).
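On the question of how the common behaves outside a parallel region: under the OpenMP rules, a reference made in serial code goes to the initial thread's copy, and that same copy is the one the initial thread uses inside parallel regions. The program below is only a minimal sketch of this. The names counter and bump are hypothetical, and the COPYIN clause is an assumption added so that every thread starts from a defined value; it is not something the code above uses.

```fortran
program tp_scope_demo
  implicit none
  integer :: counter
  common /test/ counter
!$omp threadprivate(/test/)

  counter = 0                  ! serial code: initial thread's copy
  call bump()                  ! serial call, copy becomes 1

  ! copyin gives every thread a defined starting value (1);
  ! this clause is an assumption made for the sketch
!$omp parallel copyin(/test/)
  call bump()                  ! each thread bumps its own copy to 2
!$omp end parallel

  call bump()                  ! serial again: initial thread's copy becomes 3
  print *, 'counter after serial/parallel/serial calls:', counter
end program tp_scope_demo

subroutine bump()
  implicit none
  integer :: counter
  common /test/ counter
!$omp threadprivate(/test/)
  counter = counter + 1        ! always touches the calling thread's copy
end subroutine bump
```

The final value should be 3 whatever the number of threads, because the other threads' increments land in their own copies. Built without OpenMP, the directives are plain comments and the result is the same.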
If you can send a small working code sample illustrating your problem, we can better assist you in determining the cause (usually a coding error on your part due to unfamiliarity with the programming model).
Jim
Dear Jim, I have created a simple sample code which summarizes my performance problem. You'll see that there is no parallel section, only THREADPRIVATE statements. (PS: I know that this example would be much faster using local variables, but that is not the point of this topic.)
When I execute this code (release build), I obtain:
NON-THREADPRIVATE LOOP TIME = 17.9481196632551
THREADPRIVATE LOOP TIME = 19.7407659353339
Press any key to continue . . .
I would like to know more about this performance gap. I am sorry for the presentation of the code, I am quite new to this interface.
Threadprivate.h
integer*4 i1, i2, i3, i4, i5, i6
COMMON /TEST / i1, i2, i3
COMMON /TEST2/ i4, i5, i6
!$OMP THREADPRIVATE (/TEST/)
My_main.f
program My_threadprivate_example
use ifport
implicit none
include 'omp_lib.h'
! 8-byte loop counter: the bound 10000000000 does not fit in integer*4
integer*8 i
double precision ytim1, ytim2
ytim1 = OMP_GET_WTIME()
do i=1,10000000000_8
call My_second_routine()
enddo
ytim2 = OMP_GET_WTIME()
write(*,*) 'NON-THREADPRIVATE LOOP TIME = ', ytim2 - ytim1
ytim1 = OMP_GET_WTIME()
do i=1,10000000000_8
call My_first_routine()
enddo
ytim2 = OMP_GET_WTIME()
write(*,*) 'THREADPRIVATE LOOP TIME = ', ytim2 - ytim1
i = SYSTEM ("PAUSE")
end program My_threadprivate_example
My_first_routine.f
subroutine My_first_routine
include 'Threadprivate.h'
i1 = 1
i2 = 2
i3 = 3
end subroutine
My_second_routine.f
subroutine My_second_routine
include 'Threadprivate.h'
i4 = 1
i5 = 2
i6 = 3
end subroutine
What you observe is correct. Access to a threadprivate variable induces some additional overhead. The overhead per access is small but noticeable. In a simple routine such as the one in your example, the overhead is significant. However, it may be insignificant in practice in your real application.
I suggest you place a breakpoint in each routine and open the disassembly window. Compare the TLS accesses with the non-TLS accesses. You will notice a few extra assembly instructions to accomplish the TLS stores (some of these instructions can be optimized out in Release mode).
Example:
Assume you have a small vector called POS(3) containing the X, Y and Z components, and you wish to rotate this vector. When this vector is in thread local storage, CALL ROTATE(POS, ROT) would incur the small overhead only once, in constructing the address of POS (or the array descriptor for POS), but the routine ROTATE itself would contain no such overhead.
Your alternative to using thread local storage is to pass a thread context pointer in all subroutine and function calls that require thread-context information (a big programming effort). The overhead of doing that by hand generally far exceeds that of letting the compiler do it for you. Essentially the thread local storage becomes something similar to THREADCONTEXT%ASSEMBLYCONTEXT%yourTLSVariableHere
where THREADCONTEXT is a vendor method for obtaining a per-thread context area, ASSEMBLYCONTEXT is a vendor method for obtaining an assembly-specific (compile-time object) thread context variable, and yourTLSVariableHere is your thread local storage variable name. But the compiler does this automagically, thus hiding the THREADCONTEXT%ASSEMBLYCONTEXT% and providing a portable programming means.
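The POS/ROTATE case can be sketched as follows. This is only an illustration under assumptions: the module name rotate_mod, the common name /state/ and the 90-degree rotation matrix are made up for the example. The point is that the threadprivate address is formed at the call site, while inside ROTATE the vector is an ordinary dummy argument.

```fortran
module rotate_mod
  implicit none
contains
  ! pos is an ordinary dummy argument here, so the body needs
  ! no threadprivate lookup
  subroutine rotate(pos, rot)
    real, intent(inout) :: pos(3)
    real, intent(in)    :: rot(3,3)
    pos = matmul(rot, pos)
  end subroutine rotate
end module rotate_mod

program rotate_demo
  use rotate_mod
  implicit none
  real :: pos(3)
  common /state/ pos
!$omp threadprivate(/state/)
  real :: rot(3,3)

  ! hypothetical rotation: 90 degrees about the Z axis
  rot = 0.0
  rot(1,2) = -1.0
  rot(2,1) =  1.0
  rot(3,3) =  1.0

  pos = (/ 1.0, 0.0, 0.0 /)
  call rotate(pos, rot)        ! address of the thread's POS is formed here, once
  print *, pos                 ! the X unit vector is rotated onto Y
end program rotate_demo
```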
In my case, I notice that the overhead becomes very significant for commons that contain many very small variables and are used in functions called millions of times.
The compiler will be able to generate code that knows where i,j,k will be located at runtime.
The compiler will NOT be able to generate code that knows where tp_i,tp_j,tp_k will be located at runtime but CAN generate code to determine where tp_i,tp_j,tp_k will be located at runtime.
This will generate additional overhead which may or may not be significant in your program. In cases where it is significant use transitional variables.
Example: if tp_i, tp_j and tp_k are index bases within an array and each thread is to manipulate portions of an array relative to those bases
do i=0,count-1
  do j=0,count-1
    do k=0,count-1
      array(tp_i+i, tp_j+j, tp_k+k) = expression
    end do
  end do
end do
then you are programming with unnecessary overhead. Using transitional variables:
i_base=tp_i
j_base=tp_j
k_base=tp_k
do i=0,count-1
  do j=0,count-1
    do k=0,count-1
      array(i_base+i, j_base+j, k_base+k) = expression
    end do
  end do
end do
reduces the threadprivate access overhead to one occurrence per variable.
*** The above assumes tp_i, tp_j and tp_k do not vary during execution of the loop.
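For anyone who wants to try the transitional-variable idea, here is a small self-contained sketch. The array size, the loop count n, the base values and the common name /bases/ are all assumptions chosen so the example stays in bounds; only the pattern itself comes from the post above.

```fortran
program transitional_demo
  implicit none
  integer, parameter :: n = 4          ! hypothetical loop count
  integer :: tp_i, tp_j, tp_k
  common /bases/ tp_i, tp_j, tp_k
!$omp threadprivate(/bases/)
  integer :: array(8, 8, 8)
  integer :: i, j, k
  integer :: i_base, j_base, k_base

  array = 0
  tp_i = 1                             ! hypothetical index bases
  tp_j = 2
  tp_k = 3

  ! read each threadprivate variable once, into ordinary locals
  i_base = tp_i
  j_base = tp_j
  k_base = tp_k

  ! the loop body now touches only local variables
  do i = 0, n - 1
    do j = 0, n - 1
      do k = 0, n - 1
        array(i_base + i, j_base + j, k_base + k) = 1
      end do
    end do
  end do

  ! n**3 = 64 distinct elements were set to 1
  print *, 'elements written:', sum(array)
end program transitional_demo
```

An optimizing compiler may hoist the threadprivate lookups out of the loop by itself, so whether this helps in a given code is something to measure rather than assume.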
I am sure you are aware that the loop I illustrated can be rewritten
! convert
i_base=tp_i
j_base=tp_j
k_base=tp_k
do i=0,count-1
  do j=0,count-1
    do k=0,count-1
      array(i_base+i, j_base+j, k_base+k) = expression
    end do
  end do
end do
! rewritten as
do i=tp_i,tp_i+count-1
  do j=tp_j,tp_j+count-1
    do k=tp_k,tp_k+count-1
      array(i, j, k) = expression
    end do
  end do
end do