Does Fortran have an equivalent of memcpy in C?

Hi,

I am trying to optimize my code, which moves blocks of data quite often. For example:

          Do k = 1, nCv
            Do i = 1, nCells
              cvOld(i,k) = cv(i,k)
            End Do
          End Do

where the declarations of cvOld and cv are:

    real(kind=8), pointer :: cv(:,:), cvOld(:,:)

In C, I think the memcpy function may help improve performance. Do we have an equivalent in Fortran?
And from your experience, what is the fastest way to do this data movement? I have many such loops (just moving blocks of data)
in my big code, and I am not sure whether the compiler is smart enough to use an optimized function.

I will truly appreciate your time and help. Thanks!


ifort substitutes __intel_fast_memcpy automatically whenever that appears likely to be a good strategy. Your question "what is the fastest way" can't be answered out of context.  You might start by examining the compiler reports for clues about what the compiler chooses to do.  From there you will quickly get into performance testing to see whether you can show an advantage in second-guessing the compiler.

It's important to use parallelization consistently, including first initialization and affinity settings, to take advantage of first-touch data placement, at least on a multi-CPU platform.  This could happen semi-automatically if you apply the same parallelization to this nested copy loop as elsewhere in your application.
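
For example, a minimal OpenMP sketch of what "the same parallelization" could look like for the copy loop (this assumes the arrays were first initialized under the same loop schedule, so each thread first touches the pages it later copies):

    !$omp parallel do
    do k = 1, nCv
      do i = 1, nCells
        cvOld(i,k) = cv(i,k)
      end do
    end do
    !$omp end parallel do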

The memcpy substitution strategy may work well if your individual vectors are long enough for the library function to make the right choice about streaming stores.  A typical disadvantage of memcpy is that you give up control over temporal vs. non-temporal stores in cases where you could make the choice yourself or give the compiler sufficient information to make the decision.

MKL dcopy (blas95 copy) is another possibility, particularly if you want automatic threading when calling from a serial code region.  You would need to check whether a single copy call can handle the nested loops by treating the data as a single vector of length nCv*size(cv,dim=1).  I haven't found documentation about the temporal vs. non-temporal behavior of MKL dcopy.
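
For example, a minimal sketch (untested) of collapsing the nested loops into one BLAS call, assuming cv and cvOld are contiguous and have identical shapes:

    integer :: n
    ! total element count across both loop levels
    n = nCv * size(cv, dim=1)
    ! BLAS level-1 copy, unit stride in source and destination
    call dcopy(n, cv, 1, cvOld, 1)

With pointer arrays you would also want to verify that the compiler does not create a copy-in/copy-out temporary for the array arguments.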

Several aspects of your question have already been discussed on these forums:

http://software.intel.com/en-us/forums/topic/475372


Hi Tim,

Thank you so much for your detailed information. I checked the assembly file and found that only a few memory-movement loops use __intel_fast_memcpy. This big CFD code is fully parallelized with MPI. We want to improve its performance, and I am currently running and profiling it with TAU serially (with only one MPI task).

The TAU profiling results tell me that the CPU time consumed in memory-movement loops is comparable to that of the computation loops. That's why
I am considering the memcpy function and looking for better ways to do the memory movement. But from the link you sent me, it does not seem certain that memcpy will be faster than a fully vectorized DO loop, right?

Thanks!

Wentao

As you are running MPI, you can expect the message passing itself to involve "data movement", which you will need to distinguish from the fast_memcpy calls coming directly from your program.

If you built your MPI with Intel compilers, you could expect __intel_fast_memcpy to come in that way.  Switching off MPI's normal use of shared-memory message passing might help you determine whether those are the function calls where your profiler shows time spent.   If your MPI has environment variables to change the thresholds for streaming-store cache bypass, as Intel MPI does, working with those could be helpful.  The best setting may well vary with the number of ranks.

If your profiler shows you assembly code, you will be able to see whether non-temporal data moves are in use in the hot spots, and to evaluate the effect of changes.

As I alluded to before, __intel_fast_memcpy itself switches to non-temporal stores at some large data size.  In your case, it probably won't know when to make the switch unless you arrange to send both levels of loops to a single memcpy.   It can't guess the characteristics of your application, so you may choose to stop the compiler from substituting fast_memcpy, e.g. by placing !dir$ simd on that DO loop.  Then you would evaluate the choice made by the compiler, for which recent versions should have -opt-streaming-stores=auto as the default.  With auto, the compiler will probably choose streaming stores when it guesses the loop is long enough, provided it doesn't see that you will read the data back.
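
Applied to the inner loop of the original example, that suggestion (a sketch) would look like:

    !dir$ simd
    do i = 1, nCells
      cvOld(i,k) = cv(i,k)
    end do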

You can set !dir$ vector nontemporal on an individual loop if you want that loop to use streaming stores always.  This should change the compiler's unrolling strategy.   If you want streaming stores only where you specify them by directive, set the compiler option -opt-streaming-stores=never and check that the generated code follows your directive.  -opt-report includes reports on streaming stores.
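
A minimal sketch combining the directive with that option (spellings as used by the compiler versions of that era):

    ! compile with: ifort -opt-streaming-stores=never ...
    !dir$ vector nontemporal
    do i = 1, nCells
      cvOld(i,k) = cv(i,k)
    end do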


Hi Tim,

Thank you so much for your detailed reply. Now I understand the mechanism behind streaming stores, and, as you suggested, there are several ways to help the compiler generate streaming-store instructions. Just to clarify:

What is the mechanism behind __intel_fast_memcpy that makes it copy data faster? Do you mean that __intel_fast_memcpy also uses streaming stores to accelerate memory copies when it finds a likely benefit? I checked the assembly files and found that most of the memory-copy loops are not replaced with __intel_fast_memcpy. At this point, I am not sure whether I should change the code structure so that __intel_fast_memcpy is used more often, or just ignore it and concentrate on helping the compiler generate streaming-store instructions.

Many thanks!

jimdempseyatthecove:

>> I checked the assembly files and found that most of the memory-copy loops are not replaced with __intel_fast_memcpy.

Considering that you use "real(kind=8), pointer :: cv(:,:), cvOld(:,:)", the compiler cannot know ahead of time whether the pointers are stride 1 in every index.

Jim Dempsey

www.quickthreadprogramming.com
Ronald W Green (Intel):

As Jim said, pointer-based arrays will lead to sub-optimal code.  I don't have the code, nor do I know why pointer-based arrays were used; perhaps the pointers are needed later.  If pointer-based arrays are not necessary, allocatable arrays are always preferred and give the compiler much better information for optimization.


Hi Ronald,

Thanks for your reply. We are actually at the point of considering changing the pointer-based arrays into allocatable arrays. Take the following loop as an example:

do i = 1, N
    time(i) = timeOld(i) + dt(i)
end do

I may declare things as pointer-based arrays: 

real(kind=8), pointer :: time(:), dt(:), timeOld(:)

Or I may declare things as allocatable arrays:

real(kind=8), allocatable :: time(:), dt(:), timeOld(:)

Could you please explain further why I should expect the latter (allocatable arrays) to make the loop run faster? Is it because there is no pointer dereference? Or because there are fewer indirect memory references?

This code is quite big, and changing the whole data structure from pointer-based arrays to allocatable arrays may take several months. That's why I first want a clearer understanding of the situation before taking action. I truly appreciate your time and help. Thanks!

Your profiling probably gives you more insight than we have into where the performance of pointers is in question.  If the compiler optimization reports show the desired optimizations (including memcpy substitutions), changing those instances may be less of a priority than the places where you identify opportunities to improve performance by engaging missed optimizations.

jimdempseyatthecove:

Wentao,

Unlike C/C++, in Fortran a pointer does not necessarily point to a contiguous slice of memory, whereas an allocatable (or static) array always occupies contiguous memory. When the compiler can see all potential pointer associations, it could conceivably determine whether every possible association is contiguous, and thus optimize the code (e.g. with memcpy).
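
For instance, a minimal sketch (the array names here are illustrative, not from the original code) of why the compiler cannot assume stride 1 through a pointer:

    real(kind=8), target  :: a(100,100)
    real(kind=8), pointer :: p(:)

    p => a(:,1)   ! column section: contiguous, stride 1
    p => a(1,:)   ! row section: elements 100 apart in memory (not contiguous)

Both associations are legal, so unless the compiler can prove which one is in effect, it must generate code that works for an arbitrary stride.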

Jim Dempsey

www.quickthreadprogramming.com


Hi Jim,

Many thanks for your reply. I have a quick question regarding the following two pieces of code:

Code Piece 1:

real(kind=8), allocatable :: time(:)

allocate(time(100))

Code Piece 2:

real(kind=8), pointer :: time(:)

allocate(time(100))

From your explanation, I now know that in Code Piece 1, time will surely occupy a contiguous slice of memory and the compiler knows that.
Then what about Code Piece 2? My understanding is that time will also surely point to a contiguous slice of memory, but the compiler may not
know that for certain kinds of optimization. Am I correct?

Thanks!

Wentao

jimdempseyatthecove:

IFF (if and only if) the pointer is scope-restricted to the PROGRAM, SUBROUTINE, or FUNCTION, meaning it is not located in a MODULE that could be USE'd by a compilation unit not currently visible to the compiler, then (in this simple case) it would be possible for the compiler to determine whether the pointer => association is: a) contiguous, b) not contiguous, or c) unknown.

That said, the compiler writers are not obligated to use this information. Furthermore, they may have reasons, which they will not state on this forum, for not inserting code that determines at run time whether the pointer is contiguous (and takes a fast path when it is). Possibly some benchmark program would run poorly with the extra test, even though omitting it leaves most user applications running poorly (for array copies via pointer).

Note that for a small array (vector) copy, the extra overhead of using a temporary copy may be less than that of running the determination code. However, once the array (vector) exceeds a handful of cache lines, I think the test plus the elimination of the temporary copy would yield better performance *** provided the application is not full of strided pointers ***.

There is a CONTIGUOUS attribute that is supposed to aid in this determination. I have not had time to experiment with it in a real application.
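
A one-line sketch of what that would look like for the arrays in question (Fortran 2008):

    real(kind=8), pointer, contiguous :: cv(:,:), cvOld(:,:)

With this attribute, the pointers may only be associated with contiguous targets, so the compiler can assume stride-1 access.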

Jim Dempsey

www.quickthreadprogramming.com
