Webinar from December 4, 2012: Fortran Standard Parallel Programming Features, hosted by Steve Lionel
Fortran programmers have been doing parallel processing for many years using methods outside the Fortran standard, such as auto-parallelization, OpenMP* and MPI. Fortran 2008, approved as an international standard in late 2010, brought parallel programming into the language for the first time with not one but two language features! In this article I will give a brief overview of these two new features, DO CONCURRENT and coarrays, both of which are supported in Intel Fortran Composer XE.
Slides from the webinar are here - there was no recording.
At the end of the webinar was a Q+A session - here are the questions asked with expanded answers:
Q: Is it true that the do concurrent construct ALWAYS parallelizes the code? It is stated that it is the coder's responsibility to make sure there are no dependencies -- yet it seems that the compiler is still checking things.
A: DO CONCURRENT requests parallelization (when auto-parallel is enabled) but it is not a guarantee. If the compiler can prove there is a dependence - for example if you assign to a procedure-scope scalar variable inside the loop - it may decline to parallelize. You can enable optimization reports to see what it did. DO CONCURRENT is also an aid to vectorization.
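As a minimal sketch of the kind of loop this answer has in mind, the program below uses DO CONCURRENT where every iteration touches only its own array elements. The program and variable names are illustrative assumptions, not taken from the webinar slides.

```fortran
program do_concurrent_demo
  implicit none
  integer, parameter :: n = 8
  integer :: i
  real :: a(n), b(n)

  b = [(real(i), i = 1, n)]

  ! Each iteration reads only b(i) and writes only a(i), so no iteration
  ! depends on another and the compiler is free to reorder, vectorize,
  ! or parallelize the loop.
  ! Assigning to a scalar shared by the whole procedure in here is the
  ! kind of pattern that can make the compiler decline to parallelize.
  do concurrent (i = 1:n)
     a(i) = 2.0 * b(i) + 1.0
  end do

  ! Self-check against the equivalent whole-array expression.
  if (any(a /= 2.0 * b + 1.0)) error stop "unexpected result"
  print *, a
end program do_concurrent_demo
```

Whether the loop actually runs in parallel still depends on the compiler options in effect; the optimization report is the way to confirm what was done.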
Q: In my experience so far with coarrays, the critical factor in good performance is the proper distribution of work across images, since communication between images is very slow. Will this aspect of the Intel coarray implementation be improved?
Q: Why is data transfer between coarrays on different images so slow?
A: Yes, we are actively working on performance improvements. An early effort is enabled in the Composer XE 2013 Update 1 compiler if you add the option /switch:coarray_opts (on Linux, -switch coarray_opts). These changes will be the default in a future major version. Please note that /switch (-switch) is undocumented and is not something you should depend on long-term, but if we tell you about a particular use of it, feel free to try it out and let us know how it works for you.
Q: Do you have to supply -parallel/-Qparallel to get DO CONCURRENT to parallelize (assuming it's safe to do so)?
Q: Can I say: if the iteration order is not important, replace a DO loop with DO CONCURRENT?
A: Yes, you can do that.
Q: In shared memory, MKL already provides several linear algebra procedures that are automatically multi-threaded. Do coarrays complement that functionality in any way? Are coarrays compatible with BLAS and LAPACK routines?
A: Intel MKL uses OpenMP internally for threading. While we have not done extensive testing combining coarrays with other parallel methodologies, we don't know of any reason why you shouldn't be able to pass the local part of a coarray to a BLAS or LAPACK routine. It will just be treated as a local variable.
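The sketch below illustrates passing the local part of a coarray to a routine that knows nothing about coarrays. To stay self-contained it does not link against BLAS: scale_vec is a hypothetical stand-in with a BLAS-like argument list, and all names are assumptions made for illustration.

```fortran
program coarray_local_part
  implicit none
  integer, parameter :: n = 4
  double precision :: x(n)[*]   ! one copy of x(1:n) on every image

  x = dble(this_image())

  ! Without square brackets, x refers to this image's local data, so it
  ! can be passed like any ordinary array.
  call scale_vec(n, 2.0d0, x)

  if (any(x /= 2.0d0 * dble(this_image()))) error stop "unexpected result"
  print *, "image", this_image(), "has", x

contains

  ! Hypothetical stand-in for a library routine; it sees a plain array.
  subroutine scale_vec(m, alpha, v)
    integer, intent(in) :: m
    double precision, intent(in) :: alpha
    double precision, intent(inout) :: v(m)
    v = alpha * v
  end subroutine scale_vec

end program coarray_local_part
```

With Intel Fortran, coarray programs need the coarray option enabled at compile time (-coarray on Linux, /Qcoarray on Windows).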
Q: What about debugging tools? Debugging is currently very primitive!
A: Yes, it is. In the Fortran for Linux release notes we give instructions on how to debug coarray applications. I think one can also use Rogue Wave TotalView there. On Windows there isn't anything comparable, though we're investigating options. PRINT statements work...
Q: I understand there's no in-line assembler ability (I think you just said that), but can the compiler output assembler at all?
A: Yes. On Windows, use /FA. On Linux, -S.
Q: It seems to me that coarrays do something very similar to a regular array with extra dimensions. Am I missing something?
A: A coarray application runs in parallel, which can reduce overall runtime and can potentially scale out to hundreds or even thousands of "images". Unlike an extra array dimension, each image holds its own copy of the coarray and executes the program independently. It's not appropriate for all applications.
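To make the difference from an extra dimension concrete, here is a small sketch in which every image fills its own copy of a coarray and image 1 then reads the copies held by the others. The names are illustrative assumptions.

```fortran
program coarray_vs_dimension
  implicit none
  integer, parameter :: n = 4
  real :: x(n)[*]        ! every image owns its own x(1:n)
  integer :: me, img

  me = this_image()
  x = real(me)           ! each image fills only its local copy, in parallel

  sync all               ! wait until every image has written its x

  if (me == 1) then
     do img = 1, num_images()
        ! The square brackets select which image's copy to read.
        print *, "image", img, "holds", x(:)[img]
     end do
  end if
end program coarray_vs_dimension
```

An ordinary array with an extra dimension would live in one process and be filled by one thread of execution; here the assignment to x happens on all images at once, and remote data is touched only where square brackets appear.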
Q: What is wrong with "forall"?
Q: Why is "do concurrent" better than "forall"?
A: The semantics of FORALL don't provide the guarantee against loop dependencies needed for effective parallelization. Also, many people misunderstand the way FORALL executes. DO CONCURRENT has parallel-friendly semantics and fewer restrictions on what can be in the loop body.
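The commonly misunderstood FORALL behavior can be sketched as follows: FORALL evaluates all right-hand sides before storing anything, so it is not a loop in the usual sense. The array names are illustrative assumptions.

```fortran
program forall_semantics
  implicit none
  integer, parameter :: n = 5
  integer :: i
  integer :: f(n), d(n), c(n)

  f = 1
  d = 1

  ! FORALL evaluates every right-hand side first and stores afterwards,
  ! so all "iterations" see the ORIGINAL values of f.
  forall (i = 2:n) f(i) = f(i-1) + 1

  ! An ordinary DO loop runs in order, so each iteration sees the update
  ! made by the previous one.
  do i = 2, n
     d(i) = d(i-1) + 1
  end do

  if (any(f /= [1, 2, 2, 2, 2])) error stop "FORALL result unexpected"
  if (any(d /= [1, 2, 3, 4, 5])) error stop "DO result unexpected"

  ! DO CONCURRENT is only valid when iterations are independent, as here;
  ! the recurrence above would not be allowed in a DO CONCURRENT.
  do concurrent (i = 1:n)
     c(i) = 10 * d(i)
  end do

  print *, "forall:       ", f
  print *, "do:           ", d
  print *, "do concurrent:", c
end program forall_semantics
```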
Q: Can every MPI-enabled program be parallelized using coarrays instead? Or are coarrays suitable only for some algorithms?
A: I would not say "every" MPI program, but if your program is already structured to use MPI you'll probably find coarrays easier to integrate than if you're using OpenMP only. The advantage of coarrays is the integration into the language and ease of programming.
Q: Bearing in mind this is a relatively new feature, what kind of performance benefits would you get from the present implementation of coarrays?
A: In the current implementation, an application that does a lot of work inside the image and is not dominated by moving data between images will see benefits from coarray parallelization. Right now, the productivity improvements are perhaps stronger. Over time, performance will be better.
Q: Does the compiler support inline assembly?
A: No. Intel C++ does support this, though we discourage it in general. You can call assembler routines from Fortran if you wish.
Q: Could you elaborate on C interoperability?
Q: What are the peculiarities of debugging parallel programs in Composer?
A: On Linux, the Intel Debugger (and Rogue Wave TotalView) have numerous features for debugging parallel applications. On Windows, the Visual Studio debugger can do some thread control and also has "attach to process" capability.
Q: Before updating from Intel Professional 11.1 to Composer, which parallel features of Fortran 2008 can we still use?
A: Intel Fortran 11.1 does not support any of the Fortran 2008 parallel features.
Q: How are coarrays designed to fit in the model of using MPI to parallelise across cluster nodes and OpenMP within them? Thanks!
A: Our implementation of coarrays uses MPI as the underlying transport layer. Coarrays in general are similar to the "one-sided" MPI model.
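The "one-sided" flavor can be sketched with a ring exchange in which each image writes directly into its neighbor's variable, with no matching receive on the other side. This illustrates the language feature only, not how the implementation maps it onto MPI; the names are assumptions.

```fortran
program one_sided_put
  implicit none
  integer :: inbox[*]
  integer :: me, right

  me = this_image()
  ! Neighbor to the right, wrapping around to image 1 at the end.
  right = merge(1, me + 1, me == num_images())

  inbox = 0
  sync all               ! every inbox is initialized before anyone writes

  ! One-sided "put": this image stores into its neighbor's inbox.
  ! The neighbor executes no corresponding receive.
  inbox[right] = me

  sync all               ! all puts are complete before anyone reads

  print *, "image", me, "received", inbox
end program one_sided_put
```

The two SYNC ALL statements take the place of the explicit send/receive pairing that a two-sided MPI program would need.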
Q: Coarrays are not supported for Mac OS?
Q: What prevents the implementation of these features on Macintoshes?
Q: What is the limitation that prevents Co Arrays on the Mac?
A: We do not support coarrays on OS X. The initial barrier is that Intel MPI is not supported there, and we use special features of Intel MPI to manage the "launch" of a coarray application. While we do have the ability to disable this launch feature (requiring you to do your own mpirun), we have not built the compiler and libraries to support coarrays on OS X. We don't have current plans to do so, but are always interested in hearing from customers who would like to see this in the future.
Q: When is the block feature expected to be added?
A: I can't give a timeframe for this. It remains high on our list of things to do.
Q: Can you expand on the new Event() functionality, as it pertains to a parallel operation in multi-core environments?
A: I talked about events in the context of the "Enhanced coarray features" technical specification being developed for Fortran 2015. At present, there is just a general outline which you can read in ftp://ftp.nag.co.uk/sc22wg5/N1901-N1950/N1924.txt
Q: Does the number of images correspond to the number of hardware cores in a computer?
A: By default, it uses the number of execution threads supported by the system, so that is sockets * cores * threads. You can control this with a compile option or the FOR_COARRAY_NUM_IMAGES environment variable.
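A quick way to see how many images a run actually gets is a program like the sketch below; the program name is an illustrative assumption.

```fortran
program show_images
  implicit none
  ! Every image executes this statement once, so the number of lines
  ! printed equals the number of images.
  print *, "image", this_image(), "of", num_images()
end program show_images
```

Running it once with the defaults and once with FOR_COARRAY_NUM_IMAGES set to a smaller value should show the image count change without recompiling.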
Q: Will it be able to give comparable/better performance to openmp or MPI applications in the future?
A: It should be comparable to a well-written MPI application in the future, though MPI gives you a lot more control that may be beneficial. Look instead at coarrays as providing scalable parallelism in a way that is well-integrated with the Fortran language.
Q: Will the parallelization opportunities from DO CONCURRENT be any better than OpenMP as far as the compiler is concerned?
A: I would say no - DO CONCURRENT is pretty much equivalent to an OpenMP parallel DO. OpenMP gives you a lot more control over things such as loop-private variables, scheduling and the like. But DO CONCURRENT is easy and doesn't require you to become an expert in OpenMP.
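For comparison, here is a sketch of the same loop written both ways. The schedule clause is just one example of the extra control OpenMP offers; the names are illustrative assumptions.

```fortran
program dc_vs_openmp
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: x(n), y1(n), y2(n)

  x = [(real(i), i = 1, n)]

  ! Standard Fortran 2008: no directives, no clauses to choose.
  do concurrent (i = 1:n)
     y1(i) = 2.0 * x(i)
  end do

  ! OpenMP: the directive lets you choose scheduling and data sharing.
  ! If OpenMP is not enabled, the directive lines are plain comments and
  ! the loop simply runs serially.
  !$omp parallel do schedule(static)
  do i = 1, n
     y2(i) = 2.0 * x(i)
  end do
  !$omp end parallel do

  if (any(y1 /= y2)) error stop "results differ"
  print *, "both loops agree"
end program dc_vs_openmp
```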
Q: Do all routines called from a "do concurrent" require recursive?
A: (This is a somewhat different answer than I gave during the webinar.) The standard requires that any procedure you call in a DO CONCURRENT be PURE. There is no requirement that it be RECURSIVE. If you have enabled the compiler's auto-parallelism feature, procedures will be "recursive" by default but if you are calling external procedures they will need to be thread-safe.
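A minimal sketch of the PURE requirement follows: a function declared PURE in a module and called from a DO CONCURRENT body. The module and function names are hypothetical.

```fortran
module kernels
  implicit none
contains
  ! PURE promises no side effects, which is what the standard requires
  ! of any procedure referenced inside DO CONCURRENT.
  pure function scaled(v, factor) result(r)
    real, intent(in) :: v, factor
    real :: r
    r = v * factor
  end function scaled
end module kernels

program pure_in_do_concurrent
  use kernels, only: scaled
  implicit none
  integer, parameter :: n = 6
  integer :: i
  real :: a(n)

  do concurrent (i = 1:n)
     a(i) = scaled(real(i), 0.5)
  end do

  print *, a
end program pure_in_do_concurrent
```

Removing the PURE prefix from scaled should cause a standard-conforming compiler to reject the call inside the loop.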
Q: How does the Block : End Block construct differ from the Associate : End Associate construct?
A: BLOCK creates a new variable whose scope is entirely within the execution of the BLOCK. In the context of DO CONCURRENT, a BLOCK in the loop body creates a separate variable for each iteration of the loop. ASSOCIATE creates an alias for a variable or expression in an outer scope, not an entirely new variable.
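The contrast can be sketched as below. This assumes a compiler that implements the Fortran 2008 BLOCK construct; the variable names are illustrative.

```fortran
program block_vs_associate
  implicit none
  integer, parameter :: n = 4
  integer :: i
  real :: a(n), b(n)

  a = [1.0, 2.0, 3.0, 4.0]

  do concurrent (i = 1:n)
     block
        real :: tmp          ! a fresh tmp exists for each iteration
        tmp = a(i) * a(i)
        b(i) = tmp + 1.0
     end block
  end do

  ! ASSOCIATE only gives a(n) another name; no new variable is created,
  ! so assigning to "last" changes a(n) itself.
  associate (last => a(n))
     last = 0.0
  end associate

  if (any(b /= [2.0, 5.0, 10.0, 17.0])) error stop "b unexpected"
  if (a(n) /= 0.0) error stop "a(n) should have been overwritten"
  print *, b
  print *, a
end program block_vs_associate
```

Because tmp is local to the BLOCK, iterations do not share it, which avoids the shared-scalar dependence that can otherwise block parallelization.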
Q: We have built MPICH2 with the Intel compiler and are running Superbayes (using mpirun) on 48 cores (12 nodes, SLURM). Would purchasing the Intel Cluster Toolkit or Studio help?
A: You would be able to use the Intel Trace Analyzer and Collector to look at your MPI traffic and detect problems such as deadlocks. You wouldn't have to use the Intel MPI it includes, but Intel MPI is API-compatible with MPICH2.
Q: Can you direct coarray threads to run on specific cores?
A: You can use Intel MPI support for process affinity to select how cores are assigned to MPI processes. Note that coarray images are separate processes, not threads.
Q: Are there any plans for incorporating inter-operability with the GNU UPC Compiler?