Coarrays ??

 I'd appreciate comments, suggestions, and/or criticisms on the results of some 'playing' with COARRAYS in Intel Visual Fortran Composer XE 2013 SP1 Update 3 integrated with MS VS Pro 2013. All of the results were obtained with the i7-860 2.8 GHz CPU and 8 GB RAM; Windows 7 64 bit OS.
 The two attached *.f90 files summarize the timing results in comments at the tops of the files, particularly, the "compare_to_coarray.f90" file. 'Project/property' Fortran 'command lines' are in comments at the ends of the files. Notice that the times are for the actual multiplications excluding initializations, distributions, etc.

 Note that lines 24 and 28 of "test coarray matmul.f90" define variables without codimensions. I've been unable to find anything in the various documentation about doing that. When I give all of the variables (except dsecnd) codimensions [*], the program still runs correctly, but the time is more than double that without those codimensions (8.2 vs 3.8 secs for 2000x2000 matrices).
 All of the times given here are for 'release' 'x64' configs. Win32 configs run nearly 2x faster for both the coarray and non-coarray programs. Times vary slightly from run to run and are probably significant only to ~0.3 secs.
 Unless I've done something dumb (a distinct possibility), I don't understand why anyone would use COARRAYS. The only argument that I've seen for them is the 'ease' of programming for those accustomed to FORTRAN. I certainly didn't experience 'ease', at least in this application. I found OpenMP much easier to learn, much more versatile than coarrays, and much better documented, both online and in books. In this simple app, even Intel's auto-parallelism is much more impressive than coarrays.


The purpose of COARRAYs is to run array slices in different processes (note: processes, not processors). This typically means that when your problem size exceeds that of a single process, you can use inter-process communication (MPI) to distribute the larger application across multiple processes.

Several years ago, it was not unusual for a large server to be limited to 32-bit processes. In this situation, when your problem size exceeded that of a 32-bit process, your application could use multiple 32-bit processes on the same system to get the work done. As the problem size increased, you could then use cluster and/or networking interfaces to interconnect the processes running on different systems in the cluster/network.

Now, with 64-bit processes and system memory in the 100's of GB, the need for COARRAY use reduces to problem sizes larger than what can fit within the RAM and/or caches of all the processors and/or cores of a single SMP system.

On your 1P (single-processor) system you would generally not use coarrays, except for developing a program intended for use on a larger system. That said, if you had two i7-860 systems and a fast interconnect, some problems might run faster using COARRAYs distributed across the two systems.

To get an idea of the overhead, time each phase separately:

real :: t0, t1, t2, t3
! out        t=dsecnd()          ! original single overall timer, commented out
    end if
    t0 = dsecnd()
    sync all                     ! wait for distributions to complete
    t1 = dsecnd()
    do j=1,ncols                 ! partial matrix multiplications
        do i=1,N
            do k=1,N             ! won't vectorize (using /O3), even with !DIR$ IVDEP, etc.
                ! ... partial-product statement from the attached file goes here ...
            end do
        end do
    end do
    t2 = dsecnd()
    sync all                     ! wait for all multiplications to complete
    t3 = dsecnd()
    if (im.eq.1) then            ! use matmul to get C; check against the coC and report
! out        t=dsecnd()-t
       print *,"time=",t3-t0, "1st sync=", t1-t0, "mm=", t2-t1, "2nd sync=", t3-t2

Jim Dempsey

As Jim suggests, coarrays are scalable far beyond what you can do with shared memory parallelism. Also, our implementation clearly has room for improvement in performance, and we have tasks underway to address that. Coarrays are also much better integrated with the language.

Retired 12/31/2016

Jim and Steve:
Your responses are greatly appreciated.
Although I had not measured the 'overhead' as Jim suggests, the long waits between launch and completion of the coarray version on my machine are striking.
Would either of you be kind enough to address the specific question of how non-coarray variables, like those in lines 24, 26, and 28, are handled in the coarray version? The necessity of lines 62 and 63 indicates that they are 'local' ... but then, why the big slowdown when they are given co-dimensions? In Jim's code snippet, are there (possibly different) values of t0, t1, t2, and t3 in the different images?

Without the codimensions they are local variables, independent in the images. There is some additional overhead to establishing coarrays, but once done, if you only reference them without cobounds, there isn't much added. There's no point to making a variable a coarray unless you intend to exchange values across the images.
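A minimal sketch of the distinction (variable names here are illustrative, not from the attached files):

```fortran
program coarray_scope
    implicit none
    integer :: local_n        ! no codimension: a purely local variable in each image
    integer :: shared_n[*]    ! coarray: still one copy per image, but remotely addressable
    local_n  = this_image()
    shared_n = this_image()
    sync all                  ! make every image's shared_n ready before reading it
    ! A plain reference (shared_n) touches only this image's copy and costs
    ! essentially the same as local_n; only a coindexed reference, shared_n[k],
    ! reaches across images.
    if (this_image() == 1 .and. num_images() > 1) then
        print *, "image 2 holds", shared_n[2]
    end if
end program coarray_scope
```

Built with /Qcoarray, each image prints nothing except image 1, which reads image 2's copy via the cobound.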

I don't have the time right now to do detailed analysis of your programs to see why there's a difference.

Retired 12/31/2016

>> In Jim's code snippet, are there (possibly different) values of t0, t1, t2, and t3 in the different images?

Yes, but only those in image 1 were printed. It is benign to have the other images read the time. You can use

if(im.eq.1) t0 = dsecnd()

But that clutters up the reading of the code.

If you want, you can add the image number to the printout and move the printout out of the if(im.eq.1) block. I think, though, that the performance data is only pertinent for image 1. Though you never know what will surprise you when you see all the data.

Note, on your 1P system, the CPU clock tick (obtained by RDTSC) will be synchronized amongst all hardware threads. Therefore, if you coarray the time variables (t0,...t3), you can use the difference between the t0's, ... to figure out skew amongst the different processes (images). This may be of academic interest to you.
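A sketch of that skew measurement, with the time variables given codimensions so image 1 can read every copy (system_clock stands in for dsecnd here, to keep the example self-contained without MKL):

```fortran
program image_skew
    implicit none
    integer :: i
    integer(8) :: count, rate
    real :: t0[*]                     ! codimensioned timer: image 1 can read all copies
    call system_clock(count, rate)    ! stand-in for dsecnd(); same idea, no MKL needed
    t0 = real(count) / real(rate)
    sync all                          ! ensure every image has stored its timestamp
    if (this_image() == 1) then
        do i = 2, num_images()
            print *, "image", i, "skew vs image 1 =", t0[i] - t0
        end do
    end if
end program image_skew
```

On a single socket the clocks are synchronized, so the printed differences mostly reflect how far apart the images reached that line, not clock drift.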

Jim Dempsey

Jim and Steve:

Thanks again for the responses. Since it's unlikely that I'll ever have occasion to use coarrays for any practical purpose, it's all 'academic' ... but unsatisfied curiosity is uncomfortable.

The assertion that the use of coarrays is only relevant for problems beyond what can reasonably be done inside a single process makes me uncomfortable.  Intel's implementation (and perhaps other initial implementations) may be aimed more at the distributed memory side of things, but my understanding is that they aren't obliged to be.

I would very much like to see a coarray implementation that was aimed more at the high performance desktop workstation type environment.  That's still a very active area of use of the language - look at how many questions come up with people putting Excel or GUIs as the front end for a Fortran calculation back end.  Not everyone writing Fortran has a cluster conveniently installed in their garden shed.

I appreciate (and use) that there are existing out-of-language technologies in the compiler (e.g. OpenMP and the auto-parallel stuff) that are aimed at this segment, but the in-language nature of coarrays and their potential for scalability appeal to me.

(While I'm sitting around wishing for things, I would also very much like to clean out my garden shed one day such that I had space to move around (let alone install a cluster).  I suspect the time frame for this happening is around the same as the time frame for me seeing a shared memory implementation of coarrays.)


You've enunciated exactly the point that was/is bothering me. Back in the bad old days of 32-bit systems, distributed memory was a necessity for serious number crunching. With 64-bits now common, I wonder?

From the responses by Jim and Steve, and the little I've been able to learn online and from the book "Modern Fortran Explained" (2011) by Metcalf, Reid, and Cohen, it seems to me that there is no way to make any sensible use of shared memory with coarrays. It also appears to me that any attempt to make reasonable use of shared memory would essentially duplicate OpenMP functionality. So ... why bother? Fortran programmers have always had to use "extensions" to the standards.



>>The assertion that the use of coarrays is only relevant for problems beyond what can reasonably be done inside a single process makes me uncomfortable.

Within a single process, say multi-threaded OpenMP, the static data, say those defined in a module, COMMON and SAVE are essentially shared variables amongst all the threads of the process.

Within a distributed program, i.e. COARRAY (each process single threaded), the static data, say those defined in a module, COMMON and SAVE, are essentially private variables of the thread of the process running the specific image of the coarray.

When the COARRAY program also uses multi-threading (e.g. OpenMP), then the static data, say those defined in a module, COMMON and SAVE, are essentially shared variables of all the threads of the process running the specific image of the coarray, AND private with respect to the other images.

There are exceptions to the above since both could have a block of Remote Memory Access data shared by multiple processes.

The application code is usually written with the expectation of static data being either shared or private, but definitely not "who knows, maybe". You had better know what you are doing when you blend code written in different paradigms.
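A small illustration of the per-image privacy of static data (names are illustrative):

```fortran
module counters
    implicit none
    integer :: hits = 0   ! static data: shared among OpenMP threads in one process,
                          ! but private to each coarray image
end module counters

program static_scope
    use counters
    implicit none
    hits = hits + this_image()   ! each image increments its own copy of hits
    sync all
    ! Every image prints a different value: nothing was shared between images.
    ! In a single-process OpenMP program, hits would instead be one shared variable.
    print *, "image", this_image(), "hits =", hits
end program static_scope
```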

Jim Dempsey

I think there's a conflation of implementation detail and the specification of higher level behaviour there. 

Making each coarray image (a language concept) a separate process (implementation detail) is certainly convenient for those doing the implementation, but it is not required.  You could, with non-trivial additional implementation effort, map images across to threads, with things like module variables/common/saved local variables being duplicated for each thread.  The language specifies that if I reference or define a non-coindexed variable, then I am referencing or defining something that belongs to the image.  The language doesn't care about static data or threads or processes - it just tells me how things will work if I follow its rules.

Working out how a thread-based implementation of coarrays and something like OpenMP would interact might be tricky (an obvious solution is simply "You can't do that"), but that's always a risk with out-of-language solutions when the base language evolves.  I understand Intel's coarray implementation is built on a customised form of MPI - I suspect that if a program started playing deviously with the MPI library directly, the coarray implementation would get rather confused - and Intel's support response would probably and reasonably be "You can't do that".

Beyond the aspect of the hardware that the typical corporate world mere mortal Fortran programmer has access to, consider the tools that are typically available.  The Visual Studio environment and similar supporting tools allow you to examine and debug the state of threaded programs relatively easily.  You can also debug multi-process "programs" such as coarrays, but it is much more complicated.  If you pay a bit more, Intel will bundle with the Fortran compiler tools that do things like correctness and performance checking for a threaded process.  Perhaps it is user error on my part, but my experience is that if you throw a coarray program at these tools then they become practically useless.  That's a bit disappointing.

Our coarray implementation is layered on Intel MPI, but we don't use anything non-standard in MPI other than how we "launch" the application. We have many customers who have used our coarray support with other MPI stacks. I will note that the other major coarray implementation, Cray's, uses their custom interconnect and supports only some models of their systems.

The advantage of using MPI is that it scales out far beyond what shared memory can do. I agree that for problem sizes that fit on shared memory systems, OpenMP can yield better performance, but getting OpenMP correctness and performance right is not trivial, and neither OpenMP nor MPI plays well with Fortran language semantics.

I share Ian's dismay that debugging coarray applications is difficult. We have work underway on that. Our cluster analysis tools, such as Intel Trace Analyzer and Collector, can be useful for looking at coarray application performance, and there are MPI-friendly debuggers on the market.

Retired 12/31/2016
