Modifying the Program to Use Coarrays

Coarrays are used to split the trials across multiple copies of the program, called images. Each image has its own local variables, plus its own part of any coarrays, which are variables that other images can also access. A coarray can be a scalar or an array. A coarray can be thought of as having extra dimensions, referred to as codimensions. To declare a coarray, either add the CODIMENSION attribute or specify the cobounds after the variable name. The cobounds are always enclosed in square brackets. Some examples:

real, dimension(100), codimension[*] :: A
integer :: B[3,*]

When specifying cobounds in a declaration, the last upper cobound must be an asterisk. This indicates that the extent of the last codimension depends on the number of images in the application. According to the Fortran standard, a coarray can have up to 15 codimensions (a corank of 15), and the sum of its rank and corank must not exceed 15; Intel Fortran raises the combined limit to 31 as an extension. As with array bounds, a lower cobound can be something other than 1, though this is not common.
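The cobound rules above can be checked with the standard intrinsics LCOBOUND, UCOBOUND, and THIS_IMAGE(coarray). The following is a minimal sketch that assumes a compiler with coarray support, such as Intel Fortran with -coarray or gfortran with -fcoarray=single. The coarray C, with a lower cobound of 0, is a hypothetical addition to the two declarations from the text.

```fortran
program cobounds_demo
  implicit none
  real, dimension(100), codimension[*] :: A   ! array coarray: rank 1, corank 1
  integer :: B[3,*]                           ! scalar coarray: rank 0, corank 2
  integer :: C[0:1,*]                         ! hypothetical: first codimension runs 0..1

  if (this_image() == 1) then
    ! The last upper cobound of each coarray is derived from NUM_IMAGES()
    print '(A,*(1X,I0))', "A lower cobounds:", lcobound(A)
    print '(A,*(1X,I0))', "A upper cobounds:", ucobound(A)
    print '(A,*(1X,I0))', "B lower cobounds:", lcobound(B)
    print '(A,*(1X,I0))', "B upper cobounds:", ucobound(B)
    print '(A,*(1X,I0))', "C lower cobounds:", lcobound(C)
    print '(A,*(1X,I0))', "C upper cobounds:", ucobound(C)
  end if

  ! With a coarray argument, THIS_IMAGE returns this image's cosubscripts for it
  print '(A,I0,A,*(1X,I0))', "Image ", this_image(), " has cosubscripts in B:", this_image(B)
end program cobounds_demo
```

Running with four images, for example, B's upper cobounds become 3 and 2, because image 4 is addressed as B[1,2].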

Because the work is split across the images, a coarray is needed to hold each image's subtotal of points that fall within the circle. At the end, the subtotals are added to form a grand total, which is used to compute pi just as in the sequential version. Reuse the variable total, but make it a coarray: delete its existing declaration and insert the following into the declaration section of the program:

! Declare scalar coarray that will exist on each image
integer(K_BIGINT) :: total[*] ! Per-image subtotal

The key property of a coarray is that each image has its own local part, but any image can also access the parts on other images. To read the value of total on image 3, use the syntax total[3]; the number in brackets is called the coindex. To reference the local copy, omit the coindex. For best performance, minimize accesses to other images' storage, because a remote access is typically much slower than a local one.
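The difference between local and coindexed references can be sketched as follows. This assumes a compiler with coarray support. K_BIGINT is assumed here to be selected_int_kind(18), since the tutorial's own definition is not shown. The sketch reads from the last image instead of image 3 so that it works with any number of images.

```fortran
program coindex_demo
  implicit none
  integer, parameter :: K_BIGINT = selected_int_kind(18)  ! assumed kind definition
  integer(K_BIGINT) :: total[*]                          ! one copy per image
  integer(K_BIGINT) :: last_value

  ! No coindex: each image writes only its own local copy
  total = 10_K_BIGINT * this_image()

  ! Ensure every image has written its copy before any image reads remotely
  sync all

  if (this_image() == 1) then
    ! Coindexed (remote) read of the copy that lives on the last image
    last_value = total[num_images()]
    print '(A,I0,A,I0)', "total on image ", num_images(), " is ", last_value
    if (last_value /= 10_K_BIGINT * num_images()) error stop "unexpected remote value"
  end if
end program coindex_demo
```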

In a coarray application, each image has its own set of I/O units. Standard input is preconnected only on image 1, while standard output is preconnected on every image. The standard recommends that implementations merge the output from all images into a single stream, but the order in which lines from different images appear is unpredictable. Intel® Fortran supports this merging.
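The following sketch illustrates these I/O rules, assuming a compiler with coarray support. Only image 1 reads from standard input, and every image writes to standard output. The variable name requested and the fallback value of 0 when no input is supplied are hypothetical.

```fortran
program per_image_io
  implicit none
  integer :: requested, ios

  ! Standard input is preconnected only on image 1, so only image 1 reads it
  if (this_image() == 1) then
    read (*, *, iostat=ios) requested
    if (ios /= 0) requested = 0   ! hypothetical default when no input is available
    print '(A,I0)', "Image 1 read ", requested
  end if

  ! Standard output is preconnected on every image; the lines are merged
  ! into one stream, but their relative order can change from run to run
  print '(A,I0,A,I0)', "Hello from image ", this_image(), " of ", num_images()
end program per_image_io
```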

It is typical to have image 1 do all setup and terminal I/O. Change the initial display to show how many images are doing the work, and verify that the number of trials is evenly divisible by the number of images (by default, the number of images is the number of cores times the number of threads per core). Image 1 also does all the timing.

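If you have not already done so, open the file mcpi_sequential.f90 and save it as mcpi_coarray.f90, and make all of the changes in this section to the new file. To change the initial display, replace:

print '(A,I0,A)', "Computing pi using ",num_trials," trials sequentially"
! Start timing
call SYSTEM_CLOCK(clock_start)

with:

! Image 1 initialization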
if (THIS_IMAGE() == 1) then
    ! Make sure that num_trials is divisible by the number of images
    if (MOD(num_trials,INT(NUM_IMAGES(),K_BIGINT)) /= 0_K_BIGINT) &
        error stop "num_trials not evenly divisible by number of images!"
    print '(A,I0,A,I0,A)', "Computing pi using ",num_trials," trials across ",NUM_IMAGES()," images"
    call SYSTEM_CLOCK(clock_start)
end if

The new code works as follows:

  1. It tests whether this is image 1 by calling the intrinsic function THIS_IMAGE, which returns the index of the invoking image when called without arguments. The rest of the block executes only on image 1.
  2. It checks that the number of trials is evenly divisible by the number of images, which the intrinsic function NUM_IMAGES returns. NUM_IMAGES returns a default integer, and MOD requires both arguments to have the same kind, so the result is converted with INT(NUM_IMAGES(),K_BIGINT). The error stop statement is similar to stop, except that it forces all images in a coarray application to exit.
  3. It prints the number of trials and the number of images.
  4. It starts the timing.
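The divisibility check can be tried on its own with the following sketch, which assumes a compiler with coarray support. K_BIGINT is assumed to be selected_int_kind(18), and the value of num_trials is hypothetical, not the tutorial's. The explicit INT conversion is what makes the two MOD arguments the same kind.

```fortran
program divisibility_check
  implicit none
  integer, parameter :: K_BIGINT = selected_int_kind(18)            ! assumed kind definition
  integer(K_BIGINT), parameter :: num_trials = 1800000000_K_BIGINT  ! hypothetical trial count
  integer(K_BIGINT) :: n_images

  ! NUM_IMAGES() is a default integer; convert it so MOD gets matching kinds
  n_images = int(num_images(), K_BIGINT)

  if (this_image() == 1) then
    if (mod(num_trials, n_images) /= 0_K_BIGINT) &
      error stop "num_trials not evenly divisible by number of images!"
    print '(A,I0,A,I0,A)', "Each of ", n_images, " images runs ", &
      num_trials/n_images, " trials"
  end if
end program divisibility_check
```

With this hypothetical value, a run on 7 images stops with the error message, because 1800000000 is not a multiple of 7.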

Images other than image 1 skip this code and continue with the statements that follow. In more complex applications, you might want the other images to wait until the initialization is done. In that case, insert a sync all statement after the initialization: no image continues past it until all images have reached it.
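Here is a minimal sketch of that pattern, assuming a compiler with coarray support. Image 1 performs the setup, sync all holds the other images back, and then they fetch the result from image 1. The coarray trials_per_image and its value are hypothetical.

```fortran
program init_then_sync
  implicit none
  integer :: trials_per_image[*]   ! hypothetical setting decided by image 1

  if (this_image() == 1) then
    trials_per_image = 1000        ! image 1 performs the setup
  end if

  ! No image passes this point until every image, including image 1, reaches it,
  ! so image 1's setup is complete before anyone reads it
  sync all

  ! The other images copy the value that image 1 computed
  if (this_image() /= 1) trials_per_image = trials_per_image[1]
  print '(A,I0,A,I0)', "Image ", this_image(), " will run ", trials_per_image
end program init_then_sync
```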

The initialization of total does not need to be changed, because each image initializes its own local copy.

The main compute loop needs to be changed to split the work. Replace:

do bigi=1_K_BIGINT,num_trials

with:

do bigi=1_K_BIGINT,num_trials/int(NUM_IMAGES(),K_BIGINT)

After the DO loop, insert:

! Wait for everyone
sync all
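The loop split, the sync all, and the final summation can be seen working together in the following sketch, which assumes a compiler with coarray support. Counting iterations stands in for the in-circle test. K_BIGINT is assumed to be selected_int_kind(18), and num_trials is hypothetical. The sketch sums into a separate variable, grand_total, for clarity.

```fortran
program split_work_demo
  implicit none
  integer, parameter :: K_BIGINT = selected_int_kind(18)         ! assumed kind definition
  integer(K_BIGINT), parameter :: num_trials = 1200000_K_BIGINT  ! hypothetical trial count
  integer(K_BIGINT) :: total[*]   ! per-image subtotal
  integer(K_BIGINT) :: bigi, grand_total
  integer :: i

  if (mod(num_trials, int(num_images(), K_BIGINT)) /= 0_K_BIGINT) &
    error stop "num_trials not evenly divisible by number of images!"

  ! Each image runs only its share of the iterations
  total = 0_K_BIGINT
  do bigi = 1_K_BIGINT, num_trials/int(num_images(), K_BIGINT)
    total = total + 1_K_BIGINT    ! stands in for counting a point inside the circle
  end do

  ! Wait for everyone, so every subtotal is final before image 1 reads it
  sync all

  if (this_image() == 1) then
    grand_total = total           ! image 1's own subtotal (no coindex)
    do i = 2, num_images()
      grand_total = grand_total + total[i]   ! remote read of image i's subtotal
    end do
    print '(A,I0)', "Iterations across all images: ", grand_total
    if (grand_total /= num_trials) error stop "work was not split correctly"
  end if
end program split_work_demo
```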

Sum the per-image subtotals, then compute and display the result. Again, this is done only on image 1. Replace:

! total/num_trials is an approximation of pi/4
computed_pi = 4.0_K_DOUBLE*(REAL(total,K_DOUBLE)/REAL(num_trials,K_DOUBLE))
print '(A,G0.8,A,G0.3)', "Computed value of pi is ", computed_pi, &
    ", Relative Error: ",ABS((computed_pi-actual_pi)/actual_pi)
! Show elapsed time
call SYSTEM_CLOCK(clock_end,clock_rate)
print '(A,G0.3,A)', "Elapsed time is ", &
  REAL(clock_end-clock_start)/REAL(clock_rate)," seconds"

with:

! Image 1 end processing
if (this_image() == 1) then
    ! Sum all of the images' subtotals
    do i=2,num_images()
        total = total + total[i]
    end do
    ! total/num_trials is an approximation of pi/4
    computed_pi = 4.0_K_DOUBLE* (REAL(total,K_DOUBLE)/REAL(num_trials,K_DOUBLE))
    print '(A,G0.8,A,G0.3)', "Computed value of pi is ", computed_pi, &
        ", Relative Error: ",ABS((computed_pi-actual_pi)/actual_pi)
    ! Show elapsed time
    call SYSTEM_CLOCK(clock_end,clock_rate)
    print '(A,G0.3,A)', "Elapsed time is ", &
        REAL(clock_end-clock_start)/REAL(clock_rate)," seconds"
end if

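The new code works as follows:

  1. It executes only on image 1.
  2. It adds the subtotals from the other images to total. Without a coindex, total refers to image 1's own copy, which already holds image 1's count; total[i] (note the coindex) reads the subtotal from image i. The sync all statement after the DO loop guarantees that every subtotal is final before image 1 reads it.
  3. The rest of the code is the same as in the sequential version.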

All of the images then reach the end of the program and exit.
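As an alternative to the explicit summation loop (not the method this tutorial uses), compilers that support the Fortran 2018 collective subroutines provide CO_SUM, which performs the reduction in a single call. The following is a minimal sketch under that assumption. K_BIGINT is assumed to be selected_int_kind(18), and each image's subtotal is a hypothetical stand-in value. CO_SUM does not require its argument to be a coarray.

```fortran
program co_sum_demo
  implicit none
  integer, parameter :: K_BIGINT = selected_int_kind(18)  ! assumed kind definition
  integer(K_BIGINT) :: subtotal, n

  ! Hypothetical per-image subtotal; a real program would count points in the circle
  subtotal = int(this_image(), K_BIGINT)

  ! Collective reduction: every image must execute this call. With result_image=1,
  ! the sum is stored in subtotal on image 1; on other images subtotal becomes undefined.
  call co_sum(subtotal, result_image=1)

  if (this_image() == 1) then
    n = int(num_images(), K_BIGINT)
    print '(A,I0)', "Sum of subtotals over all images: ", subtotal
    ! The subtotals were 1, 2, ..., n, so their sum must be n*(n+1)/2
    if (subtotal /= n*(n + 1_K_BIGINT)/2_K_BIGINT) error stop "unexpected sum"
  end if
end program co_sum_demo
```

A collective call lets the runtime choose an efficient reduction pattern instead of having image 1 fetch each subtotal one at a time. The trade-off is that every image must make the call.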
