Webinar from December 4, 2012: Fortran Standard Parallel Programming Features, hosted by Steve Lionel
Fortran programmers have been doing parallel processing for many years using methods outside the Fortran standard such as auto-parallelization, OpenMP* and MPI. Fortran 2008, approved as an international standard in late 2010, brought parallel programming into the language for the first time with not one but two language features! In this article I will give a brief overview of these two new features, DO CONCURRENT and coarrays, both supported in the Intel Fortran compiler.
Slides from the webinar are here - there was no recording.
At the end of the webinar was a Q+A session - here are the questions asked with expanded answers:
Q: Is it true that the do concurrent construct ALWAYS parallelizes the code? It is stated that it is the coder's responsibility to make sure there are no dependencies -- yet it seems that the compiler is still checking things.
A: DO CONCURRENT requests parallelization (when auto-parallel is enabled), but it is not a guarantee. If the compiler can prove there is a dependence - for example, if you assign to a procedure-scope scalar variable inside the loop - it may decline to parallelize. You can enable optimization reports to see what it did. DO CONCURRENT is also an aid to vectorization.
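A minimal sketch of the kind of loop DO CONCURRENT is meant for (the program name and values are illustrative): each iteration touches only its own elements, so there is no cross-iteration dependence for the compiler to worry about. With the Intel compiler, auto-parallelism is enabled with -parallel (Linux) or /Qparallel (Windows).

```fortran
! Sketch: a dependence-free loop written as DO CONCURRENT, which the
! compiler may parallelize (with -parallel / /Qparallel) and/or vectorize.
program dc_example
  implicit none
  integer :: i
  real :: a(1000), b(1000)
  b = [(real(i), i = 1, 1000)]
  do concurrent (i = 1:1000)
     a(i) = 2.0 * b(i) + 1.0   ! iteration i touches only a(i) and b(i)
  end do
  print *, a(1), a(1000)
end program dc_example
```

If the loop body instead assigned to a shared scalar, the compiler could detect the dependence and decline to parallelize, as the answer above describes.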
Q: In my experience so far with coarrays, the critical factor in good performance is the proper distribution of work across images, since communication between images is very slow. Will this aspect of the Intel coarray implementation be improved?
Q: Why is data transfer between coarrays on different images so slow?
A: Yes, we are actively working on performance improvements to the coarray implementation, including the speed of data transfer between images. To enable coarray support, compile and link with -coarray. Note that coarrays are not available on MacOS.
Q: You have to supply -parallel/-Qparallel to get DO CONCURRENT to parallelize (assuming it's safe to do so)?
Q: Can I say: if the iteration sequence is not important, I can replace a DO loop with DO CONCURRENT?
A: Yes, you can do that.
Q: In shared memory, MKL already provides several linear algebra procedures that are automatically multi-threaded. Do co-arrays complement that functionality in any way? Are co-arrays compatible with BLAS and LAPACK routines?
A: Intel MKL uses OpenMP internally for threading. While we have not done extensive testing combining coarrays with other parallel methodologies, we don't know of any reason why you shouldn't be able to pass the local part of a coarray to a BLAS or LAPACK routine. It will just be treated as a local variable.
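A hedged sketch of what "passing the local part of a coarray" might look like: each image owns its own copy of x and y, and the BLAS call below operates purely on that local data, as it would for any ordinary array. This assumes a BLAS library (e.g. Intel MKL) is linked; DAXPY computes y = a*x + y.

```fortran
! Sketch: the local piece of a coarray passed to a BLAS routine.
program coarray_blas
  implicit none
  integer, parameter :: n = 4
  double precision :: x(n)[*], y(n)[*]   ! one copy per image
  external :: daxpy
  x = 1.0d0
  y = 2.0d0
  ! Operates on this image's local x and y only; no communication occurs.
  call daxpy(n, 3.0d0, x, 1, y, 1)       ! local y becomes 3*1 + 2 = 5 everywhere
  if (this_image() == 1) print *, y
end program coarray_blas
```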
Q: What about debugging tools? Debugging is currently very primitive!
A: Yes, it is. On Linux, I believe Rogue Wave TotalView can be used. On Windows there isn't anything comparable, though we're investigating options. PRINT statements work...
Q: I understand there's no in-line assembler ability (I think you just said that), but can the compiler output assembler at all?
A: Yes. On Windows, use /FA. On Linux, -S.
Q: It seems to me that coarrays do something very similar to a regular array with extra dimensions. Am I missing something?
A: A coarray application runs in parallel, which can reduce overall runtime and can potentially scale out to hundreds or thousands of "images". It's not appropriate for all applications.
Q: What is wrong with "forall"?
Q: Why "do concurrent" is better than "forall"?
A: The semantics of FORALL don't provide the guarantee against loop dependencies needed for effective parallelization. Also, many people misunderstand the way FORALL executes. DO CONCURRENT has parallel-friendly semantics and fewer restrictions on what can be in the loop body.
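A small sketch of the semantic difference (array names and values are illustrative). In a FORALL construct, each assignment statement is completed for all index values before the next statement begins, so the statements below are not independent iterations. DO CONCURRENT instead asserts that whole iterations are independent and may run in any order.

```fortran
! Sketch contrasting FORALL's statement-by-statement semantics with
! DO CONCURRENT's iteration-independence guarantee.
program forall_vs_dc
  implicit none
  integer :: i
  real :: a(5), b(5)
  a = [1.0, 2.0, 3.0, 4.0, 5.0]
  b = 0.0

  forall (i = 2:5)
     b(i) = a(i-1)      ! all of b is set from the *original* a ...
     a(i) = 0.0         ! ... before any element of a is zeroed
  end forall

  do concurrent (i = 1:5)
     a(i) = b(i) * 2.0  ! legal: iteration i touches only element i
  end do
end program forall_vs_dc
```

It is this per-statement ordering in FORALL that forces the compiler to prove more before it can parallelize, which is why DO CONCURRENT was added.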
Q: Can every MPI-enabled program be parallelized using coarrays instead? Or are coarrays suitable only for some algorithms?
A: I would not say "every" MPI program, but if your program is already structured to use MPI you'll probably find coarrays easier to integrate than if you're using OpenMP only. The advantage of coarrays is the integration into the language and ease of programming.
Q: Does the compiler support inline assembly?
A: No. Intel C++ does support this, though we discourage it in general. You can call assembler routines from Fortran if you wish.
Q: How are coarrays designed to fit in the model of using MPI to parallelise across cluster nodes and OpenMP within them? Thanks!
A: Our implementation of coarrays uses MPI as the underlying transport layer. Coarrays in general are similar to the "one-sided" MPI model.
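A minimal coarray sketch of the "one-sided" model the answer mentions: each image writes into its own copy of a coarray variable, and after a barrier one image reads every other image's copy directly, with no matching send/receive on the remote side.

```fortran
! Sketch: one-sided remote reads with a coarray (compile with -coarray).
program hello_coarray
  implicit none
  integer :: val[*]      ! one copy of 'val' per image
  integer :: i
  val = this_image()
  sync all               ! ensure every image has written before any reads
  if (this_image() == 1) then
     do i = 1, num_images()
        print *, 'image', i, 'holds', val[i]   ! remote read from image i
     end do
  end if
end program hello_coarray
```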
Q: Coarrays are not supported for Mac OS?
Q: What is the limitation that prevents Co Arrays on the Mac?
A: We do not support coarrays on MacOS. The initial barrier is that Intel MPI is not supported there, and we use special features of Intel MPI to manage the "launch" of a coarray application. While we do have the ability to disable this launch feature (requiring you to do your own mpirun), we have not built the compiler and libraries to support coarrays on MacOS. We don't have current plans to do so, but are always interested in hearing from customers who would like to see this in the future.
Q: Does the number of images correspond to the number of hardware cores in a computer?
A: By default, the number of images is the number of execution threads supported by the system, that is, sockets * cores * threads per core. You can control this with a compile option or the FOR_COARRAY_NUM_IMAGES environment variable.
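The two controls mentioned above might look like this (a sketch using Linux ifort syntax; the source file name is illustrative):

```shell
# Compile-time default image count via a -coarray option:
ifort -coarray -coarray-num-images=8 hello.f90 -o hello

# Run-time override via the environment variable:
FOR_COARRAY_NUM_IMAGES=4 ./hello
```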
Q: Will it be able to give comparable/better performance to OpenMP or MPI applications in the future?
A: It should be comparable to a well-written MPI application in the future, though MPI gives you a lot more control that may be beneficial. Look instead at coarrays as providing scalable parallelism in a way that is well-integrated with the Fortran language.
Q: Will the parallelization opportunities from DO CONCURRENT be any better than OpenMP as far as the compiler is concerned?
A: I would say no - DO CONCURRENT is pretty much equivalent to an OpenMP parallel DO. OpenMP gives you a lot more control over things such as loop-private variables, scheduling and the like. But DO CONCURRENT is easy and doesn't require you to become an expert in OpenMP.
Q: Do all routines called from a "do concurrent" need to be RECURSIVE?
A: The standard requires that any procedure you call in a DO CONCURRENT be PURE. There is no requirement that it be RECURSIVE. If you have enabled the compiler's auto-parallelism feature, procedures will be "recursive" by default but if you are calling external procedures they will need to be thread-safe.
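A sketch of the PURE requirement (module and function names are illustrative): the function below has no side effects and declares intent(in) on its argument, so it is legal inside DO CONCURRENT without any RECURSIVE attribute.

```fortran
! Sketch: a PURE function referenced from DO CONCURRENT.
module funcs
  implicit none
contains
  pure real function scaled(x)
     real, intent(in) :: x     ! intent(in) is required for PURE
     scaled = 2.0 * x + 1.0
  end function scaled
end module funcs

program pure_dc
  use funcs
  implicit none
  integer :: i
  real :: a(100)
  do concurrent (i = 1:100)
     a(i) = scaled(real(i))    ! legal: scaled is PURE
  end do
  print *, a(1), a(100)
end program pure_dc
```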
Q: How does the BLOCK : END BLOCK construct differ from the ASSOCIATE : END ASSOCIATE construct?
A: BLOCK creates a new variable whose scope is entirely within the execution of the BLOCK. In the context of DO CONCURRENT, a BLOCK in the loop body creates a separate variable for each iteration of the loop. ASSOCIATE creates an alias for a variable or expression in an outer scope, not an entirely new variable.
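A short sketch of the distinction (variable names are illustrative): BLOCK declares a genuinely new variable local to the construct, one per iteration when used inside DO CONCURRENT, while ASSOCIATE merely gives a name to an existing variable, so assigning through the alias changes the original.

```fortran
! Sketch: BLOCK vs ASSOCIATE.
program block_vs_assoc
  implicit none
  integer :: i
  real :: a(10), total
  a = 1.0
  total = 5.0

  do concurrent (i = 1:10)
     block
        real :: t          ! a *new* t, private to each iteration
        t = a(i) * 2.0
        a(i) = t + 1.0
     end block
  end do

  associate (s => total)   ! s is an alias for total, not a new variable
     s = s + 1.0           ! total itself is now 6.0
  end associate
  print *, total
end program block_vs_assoc
```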
Q: Can you set coarray threads to run on specific cores?
A: You can use Intel MPI support for process affinity to select how cores are assigned to MPI processes. Note that coarray images are separate processes, not threads.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804