Overloading Fortran CoArrays For Monte Carlo Analysis

I have a legacy simulation model that I am using with Intel Visual Fortran/Visual Studio 2012 for a numerical experiment. The experiment needs the simulation model to be run 1000 times with different initial conditions. The problem is that during the simulation the 1000 runs need to communicate with each other to calculate a variance across the changing outputs (the runs are independent the rest of the time). Doing this outside of the model, or stopping and restarting the model, is not an option, so I need to find a way to communicate between the 1000 model runs at various points of the simulation.

A rough example of one program run would be:

!This do loop progresses through time for this sim model
DO I=1,100
   ! ...advance the model one time step; at IF(MOD(I,10).EQ.0) exchange variance info...
END DO

My thought is that, since coarrays are similar to MPI in that the entire program is cloned, it would be ideal to have Intel Fortran make a coarray spanning 1000 images and use that for communicating when the variance needs to be calculated.

The problem is how to manage the images such that only 8 (for an 8 core processor) run at a time and stop/wait for the rest of the images until they all reach the IF(MOD(I,10).EQ.0) to exchange information.

Similar to the previous code:

!This do loop progresses through time for this sim model
DO I=1,100
   ! The loop progresses using only 8 images until IF(MOD(I,10).EQ.0) is true;
   ! then the next 8 start, and this continues until all 1000 images have made
   ! it to IF(MOD(I,10).EQ.0). The variance is then calculated, 8 images
   ! continue on, and the process repeats until all 1000 images are at the
   ! next IF(MOD(I,10).EQ.0).
END DO

What this would do is allow a legacy code to be run multiple times with minimal modification, and it would be a cool way of overloading a coarray.

Another option is to clone the program and run 1000 independent models, but I have no clue how to get Fortran to pause/wait/transmit information amongst 1000 independent programs.

Thanks for all the help and comments as always!


With the proviso that I barely know what I'm doing - you can use the SYNC IMAGES statement to set up a progressive march of images through a segment.  Some of the examples in the standard come close.

  INTEGER, PARAMETER :: set_length = 8
  IF (MOD(NUM_IMAGES(), set_length) /= 0)  &
      ERROR STOP "Thar'll be trouble ahead capt'n!"
  ! Wait for the image eight before to complete.
  IF (THIS_IMAGE() > set_length)  &
      SYNC IMAGES (THIS_IMAGE() - set_length)
  ! Do useful work.
  ! Let the image eight ahead know we are done.
  IF (THIS_IMAGE() <= NUM_IMAGES() - set_length)  &
      SYNC IMAGES (THIS_IMAGE() + set_length)
  ! Wait for everyone.
  SYNC ALL

Where I think your approach is going to break is that the implementation very much assumes that each image has a processing unit (a hardware thread) all to itself.  When an image needs to wait around for other images to do something useful, I think it just spins. Your eight images doing useful work are going to be fighting 992 images all collectively yelling "Are we there yet?"  As any parent that has embarked on a lengthy journey would know, having one [small] image [of yourself] yelling that constantly is troublesome enough; the prospect of 992 times that is enough to drive any self-respecting individual to drink instant coffee.  (Perhaps I am mistaken about the implementation though.)

I think you might be better off considering 1000 jobs that eight (or whatever are available) images need to do.

You have not indicated the type of simulation you are undertaking. If it is event based or in the time domain, then the simulations might be able to be undertaken as a single computational model, dimensioned for 1,000 independent operations.

If the 1,000 models are all in the time domain (or all in a uniform stepping dimension), then you could create a single time based step solution and then update 1000 different simulations. At each interval then the required parameters could be communicated between each simulation.

This approach differs from Monte Carlo simulations that are event based, as is the case with a number of Monte Carlo simulation packages.

My monte-carlo simulation method has always been time based, to achieve this interaction. An example is the simulation of a large port operation, which involves many berths that have significant independent operation, except for sharing some interface facilities, such as land transport or the sea interface.

While the adoption of a fixed time step is claimed to be less efficient than an event based simulation for simple systems, the interaction of (in your case) 1,000 nearly independent operations becomes a much simpler system to monitor and control with a uniformly stepped single dimension parameter.

The adoption of a single dimension parameter can create problems with finite step computation, since if the number of time steps is extremely large (> 10^10), round-off errors need to be managed. You need to keep track of the precision of the single-step performance estimates.
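As an illustration of that round-off point, here is a small, hypothetical demonstration: accumulating a single-precision clock by repeated addition of the step size drifts once the total dwarfs the step, while deriving the time from the integer step counter stays exact.

```fortran
program step_roundoff
   implicit none
   integer, parameter :: i8 = selected_int_kind(15)
   real, parameter    :: dt = 15.0               ! 15-second step, as in the port example
   integer(i8), parameter :: nsteps = 15000000_i8
   integer(i8) :: n
   real :: t_accum
   t_accum = 0.0
   do n = 1, nsteps
      t_accum = t_accum + dt                     ! single-precision accumulation drifts
   end do
   print *, 'accumulated time:', t_accum
   ! The robust alternative: compute the clock from the integer step count.
   print *, 'counted time    :', real(real(nsteps, kind(1.0d0)) * dt)
end program step_roundoff
```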

The time to run 1000 simulations simultaneously is probably less than the time it takes to develop and solve the interface between 1,000 independent computer processes shared in a computer network.

There is also the issue of model verification, so by keeping the computational method simple, the time it takes you to validate the model would be far less in the approach I have outlined. It is an approach that is more easily scaled up to the 1,000 "independent" models.



If you are running on a uni-processor (e.g. 8-core as you mentioned) or SMP, and if you do not have process memory limitations (e.g. a 32-bit process with its ~2GB VM limit being too small for your simulation), then you may be best served using a single process and OpenMP. This will reduce your inter-process communication. And you will not have the unwarranted "Are we there yet" CPU contention as IanH mentions.

On the other hand, if the 8-core platform is for development and your production "system" is 100's of networked processors then coarrays or MPI would be warranted.

Jim Dempsey

What I am trying to do is an Ensemble Kalman filter, which combines field observations with simulated results in the time domain. The simulation model acts like a black box that propagates its state variables forward until there is an observation; then the states are updated based on the observation values. For the ensemble approach you run 1000 black-box runs that each propagate forward a different initial condition, and at the time of an observation the results of the 1000 model runs are updated by comparing each of their results amongst each other and against the observation.

For my situation a single model run takes about 20 hours to complete; it contains lots of separate nonlinear iterators and ~30,000 lines of code. Since each model with a different initial condition is independent, I would like to be able to run the entire black box in parallel without much modification to the code. RAM should not be an issue, since I can spread the workload over a cluster (more than 8 threads; that was just meant as an example) or reduce the ensemble to a smaller number to meet a single CPU's RAM limit.

Ultimately the program takes the form that I showed in my first post, where there is an overall loop representing a time step that propagates the model's state forward in time. Within that loop are all the nonlinear iterators, forcings and various pieces that affect a single state vector. When the overall iteration finishes at a time when an observation is made, I want to be able to pause the program, have it communicate with the other initial-condition models (when they reach the same observation time) and compare their results with the observations. Then the state vector that is cycled to the top of the loop is literally changed for all the initial-condition models, and each begins propagating forward in time until the next observation.

At the point of pausing/waiting, I do not mind making separate programs and running them with a batch script. My problem is how to get the programs to wait, transfer their state vectors to be compared amongst each other and against the observations, and then begin processing again.
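For what it's worth, the pattern Scott describes maps fairly directly onto coarrays if one image is used per ensemble member. The sketch below is illustrative only: the state size, the dummy "black box" update, and all names are made up, and the Kalman update itself is left as a comment. It shows the shape of propagate / synchronize at observation times / compute ensemble statistics / resume.

```fortran
program enkf_sketch
   implicit none
   integer, parameter :: nstate = 5       ! size of the state vector (made up)
   real :: state(nstate)
   real :: state_co(nstate)[*]            ! each image publishes its state here
   real :: mean(nstate), var(nstate)
   integer :: i, img

   call random_number(state)              ! stand-in for a perturbed initial condition
   do i = 1, 100
      state = 0.99*state + 0.01           ! stand-in for one black-box time step
      if (mod(i,10) == 0) then            ! an observation time
         state_co = state
         sync all                         ! every member has now reached this time
         if (this_image() == 1) then
            mean = 0.0
            do img = 1, num_images()
               mean = mean + state_co(:)[img]
            end do
            mean = mean / num_images()
            var = 0.0
            do img = 1, num_images()
               var = var + (state_co(:)[img] - mean)**2
            end do
            var = var / max(num_images() - 1, 1)
            ! ... Kalman update using mean, var and the observation ...
         end if
         sync all                         ! wait for the analysis before resuming
         ! each image could now pull its updated state back, e.g. from image 1
      end if
   end do
end program enkf_sketch
```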

Scott B.,

Your Monte Carlo model approach is very different to mine. It would appear that you have 1000 models x 20 hours of computation between updates: 20,000 hours of computing per update step!

My Monte Carlo models typically run for about 15 minutes and simulate 7 years of operation, with 15-second simulation time steps; that is 15 million time steps at 60 microseconds of compute time per update. My problems must be very different to what you are solving.
I have been doing my type of Monte Carlo modelling since 1978. Given the advances in computer performance, I wonder how long your model approach would have run for if started back in 1978!

20,000 hours of computing is a huge amount of changing numbers in memory. Is there a different way to describe this numerical system?


I meant that a complete model run, from start to finish of the simulated time frame, takes 20 hours. The model runs are typically run on a Windows cluster, so it does not take a full 20,000 hours to run 1000 model calls.

A single time step usually takes about 30 seconds, depending on the number of nonlinear features being used during that time step.

Given that there has been over a 1,000-fold increase in performance since 1978 (clock rates were about 1 MHz), and that there was difficulty in finding multi-processor hardware, 20,000 hours now would have been 20 million hours back then, which is 2,283 years. If you had been able to start the run in 1978, it would not complete until 4261!

In 1976, I wrote a program to calculate a covariance matrix for loading effects in a floor, which used numerical integration on a 4th-order integral. After starting the program, I re-wrote the approach using Green's theorem to reduce the integral to 2 levels, developed the program, ran it, and got the answer before the first program had completed.

You are dealing with such long run times with your problem definition. Today, even with the advances in computer power, you appear to have the time to achieve a similar outcome if you could redefine your approach to the problem.


John, remaking the simulation model is not an option, nor is using an approximation. The main reason I posted here was to get advice on how I can trick coarrays into some sort of overloading scheme, so I can run an ensemble of models as coarray images that communicate with each other whenever an observation is reported.

As part of learning I'm trying to get a set of coarray use scenarios/examples clear in my head, so I thought about this some more.  See attached.  I don't know if this is even vaguely relevant and also, sadly, don't know if it even works, because of those three little letters "I", "C" and "E" (internal compiler error) that often conspire to ruin my day.  If Dr Fortran spots this, let me know; otherwise I shall report it to Premier Support tomorrow.

There is a restriction in the standard that "if a variable is defined on an image in a segment, it shall not be referenced, defined, or become undefined in a segment on another image unless the segments are ordered".  I've been assuming that "a variable" doesn't mean "any part of a variable" - i.e. if one image defines array(i)[j], another image can define array(i+1)[j] without requiring cross image segment ordering.  Is this actually the case?  (I've also asked this on c.l.f.)

(Edit to note that an answer on c.l.f says my assumption is wrong - so some changes made.  The compiler still chokes.)


Attachment: therealmonty.f90 (6.45 KB)
Attachment: therealmonty2.f90 (7.12 KB)

Ian, what compiler options are you using to get the ICE? I haven't been able to trigger one with various attempts. Also, which compiler version?

Retired 12/31/2016

Hi Steve, can you explain how image control can be used to make only one set of images active at any time? Kinda like using CRITICAL, but instead of letting through just one thread, send say 8 threads at a time.

I am not sure if I understand how sync images() can be used.


(The beta version, with /check:all /warn:all /standard-semantics /Qcoarray.  I thought I triggered it with the current release too at one stage, but perhaps my recollection also has catastrophic errors.)

Ian, I can't reproduce it in a later 14.0 beta version (nor in 13.1)

Scott, there isn't at present a mechanism to do what you want. There is a proposal for Fortran 2015 to add two features that may be useful to you: teams and events. With teams, you can assign a subset of images to be working on a given task, with limited communication to the parent team and its siblings. Events allow you to have a counter and have images wait until the counter goes to zero. I suppose you could have all 8 images wait and then have the "master" image "post" the event 8 times, freeing the 8 waiters, but I really think teams is more along the lines of what you're looking for. SYNC IMAGES would not help you here.

Retired 12/31/2016
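The events feature Steve describes did eventually land in the standard (as part of Fortran 2018, via TS 18508), so the "master posts, workers wait" idea can now be sketched with real syntax. This is only an illustrative shape, not code from this thread; the batch size and names are made up.

```fortran
program batch_release
   use iso_fortran_env, only: event_type
   implicit none
   type(event_type) :: go[*]     ! one event per image
   integer :: k

   if (this_image() == 1) then
      do k = 2, min(9, num_images())   ! release the first batch of 8 workers
         event post (go[k])
      end do
      ! ...later, post to images 10..17, and so on, batch by batch...
   else
      event wait (go)                  ! block until the master posts our event
      ! ...this image's share of the simulation runs here...
   end if
end program batch_release
```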

Thanks, I had read about the teams feature and will explore that in the future (when it's available).

How would you code it up so that images can wait?

For example, say NUM_IMAGES()=64, with only 8 images active at one time. One way I have thought of doing it is having all coarrays dimensioned as A[8,*], so that the second codimension represents a group.

Group 1 A[:,1]

Group 2 A[:,2]

Would the following work?

DO I=1,100
   !Do the do loop say 10 times with 8 images, then move to the next group
   !Do stuff with 8 images (identical work amongst all 64 images, just different numbers)
END DO


I would probably code this using the paradigm of a work queue, with image 1 assigning tasks and the other images checking the queue, waiting for something to do.

Retired 12/31/2016
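One common way to realise the work-queue idea is a shared counter that every image atomically increments to claim the next of the 1000 jobs. The sketch below uses ATOMIC_FETCH_ADD, which was added in Fortran 2018 (after this thread); all names are illustrative.

```fortran
program work_queue
   use iso_fortran_env, only: atomic_int_kind
   implicit none
   integer, parameter :: njobs = 1000
   integer(atomic_int_kind) :: next[*]        ! the shared counter lives on image 1
   integer(atomic_int_kind) :: job

   if (this_image() == 1) call atomic_define(next, 1_atomic_int_kind)
   sync all                                   ! make the initial value visible
   do
      ! Atomically grab the current job number and bump the counter.
      call atomic_fetch_add(next[1], 1_atomic_int_kind, job)
      if (job > njobs) exit                   ! queue drained
      ! ... run simulation number "job" on this image ...
   end do
   sync all
end program work_queue
```

Each image stays busy until the queue is empty, so eight images naturally work through the 1000 jobs without oversubscription.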


Maybe you should re-think your problem in a different way. Currently you have designed 1000 simulations as separate images to be run on fewer than 1000 cores (logical processors). This will require 1000 processes (programs) and an accompanying oversubscription issue (spin-wait time in one process consuming processing time of other processes).

If you wish to continue to use coarrays in your design, then I suggest you create a system of coarrays where the number of coarrays equals the number of logical processors available (which may be distributed amongst the cluster/network). This number of coarrays is now less than the number of simulations. The coarrays, visible to each/all images, are not your current working dataset. Rather you use them as a messaging system.

Initially, all the images come up with no simulation data to work on. Each image has a means to indicate to the master image that it is available to run a simulation. The master image then distributes the first set of simulation data to the available images. As each image completes a run, it stores a copy of the simulation locally, then indicates it is available for further processing. The master image, seeing an image available for processing, determines if there are more simulations; if so, it assigns one to that specific image. This repeats until all 1000 simulations have been distributed (in a load-balanced manner). Once all have been distributed, the master can specify how many iterations to run each simulation before you want to synchronize and produce new initial conditions. The updated initial conditions can be sent, and the next series of iterations can run (without transmission of the bulk of the data).

Jim Dempsey
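A minimal shape for this messaging scheme might look like the following: each image owns a status word the master polls, and the master writes the index of the next simulation into it (0 = idle, -1 = shut down, >0 = simulation to run). This is a hypothetical sketch using the Fortran 2008 atomic subroutines, not code from this thread; a production version would also need the data-distribution and synchronization steps Jim describes.

```fortran
program master_worker
   use iso_fortran_env, only: atomic_int_kind
   implicit none
   integer, parameter :: nsims = 1000
   integer(atomic_int_kind) :: task[*]   ! 0 = idle, -1 = shut down, >0 = sim number
   integer(atomic_int_kind) :: t
   integer :: img, next

   call atomic_define(task, 0_atomic_int_kind)
   sync all
   if (this_image() == 1) then
      next = 1
      do while (next <= nsims)
         do img = 2, num_images()                   ! poll for an idle worker
            call atomic_ref(t, task[img])
            if (t == 0 .and. next <= nsims) then
               call atomic_define(task[img], int(next, atomic_int_kind))
               next = next + 1
            end if
         end do
      end do
      do img = 2, num_images()                      ! wait for idle, then shut down
         do
            call atomic_ref(t, task[img])
            if (t == 0) exit
         end do
         call atomic_define(task[img], -1_atomic_int_kind)
      end do
   else
      do
         call atomic_ref(t, task)
         if (t == -1) exit
         if (t > 0) then
            ! ... run simulation number t here; store results locally ...
            call atomic_define(task, 0_atomic_int_kind)   ! report idle again
         end if
      end do
   end if
end program master_worker
```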
