| October 29, 2009 1:00 AM PDT | |
As an example of multicore-enabling a performance-sensitive application, we decided to convert the Discrete Hedging example to a parallel Cilk++ program (that is, to "Cilkify" the example) and increase performance on multicore systems from the popular computational finance library QuantLib. We chose DiscreteHedging.cpp, an example of using the QuantLib Monte Carlo framework.
As a preview of the results we obtained, here is a graph of the results on a 16-core system, showing that speedup is almost linear with the number of workers, and hence the number of cores used by the parallel Cilk++ program.
All together, it took a week to Cilkify this example: a day to install and become familiar with the QuantLib code, two days to parallelize it and eliminate data races, and two days to benchmark and tune the performance.
Monte Carlo techniques are popular due to their applicability to the multi-dimensional problems frequently encountered in computational finance. However, the computation can be quite time consuming due to the need for simulating many trajectories with multiple parameters. And so the arrival of multicore processors on the desktop and in the corporate data center bodes well for such compute-intensive applications – though it does require these applications to be multicore-ready. We'll show how to use Cilk++ to make the Discrete Hedging example multicore-ready and, in the process, discuss the challenges and how we addressed them
The QuantLib library consists of approximately 1,500 source files with 5-10 functions each, totaling hundreds of thousands of lines. Because Cilk preserves your code’s serial semantics, you can “eat the elephant one bite at a time” – applying Cilk++ to a relatively small portion of the code. In this particular example, we touched approximately 100 lines of code, while achieving great linear speedup (12-14 X on a 16-core system).
Let's take a look at the approach and results.
Introduction to Discrete Hedging Examples
This is the description given by the author of the example.
This example computes profit and loss of a discrete interval hedging
strategy and compares with the results of Derman & Kamal's (Goldman Sachs
Equity Derivatives Research) Research Note: "When You Cannot Hedge
Continuously: The Corrections to Black-Scholes"
http://www.ederman.com/emanuelderman/GSQSpapers/when_you_cannot_hedge.pdf
Suppose an option hedger sells an European option and receives the
Black-Scholes value as the options premium.
Then he follows a Black-Scholes hedging strategy, rehedging at discrete,
evenly spaced time intervals as the underlying stock changes. At
expiration, the hedger delivers the option payoff to the option holder,
and unwinds the hedge. We are interested in understanding the final
profit or loss of this strategy.
If the hedger had followed the exact Black-Scholes replication strategy,
re-hedging continuously as the underlying stock evolved towards its final
value at expiration, then, no matter what path the stock took, the final
P&L would be exactly zero. When the replication strategy deviates from
the exact Black-Scholes method, the final P&L may deviate from zero. This
deviation is called the replication error. When the hedger rebalances at
discrete rather than continuous intervals, the hedge is imperfect and the
replication is inexact. The more often hedging occurs, the smaller the
replication error.
We examine the range of possibilities, computing the replication error.
The example uses Monte Carlo methods to exam the range of possibilities.
Monte Carlo methods conduct a random walk and simulate the random behavior underlying the financial models.
Each path represents a possible chain of asset prices changing from the settlement to exercise time, and correspondingly the P&L according to the asset prices. The length of the chain corresponds to the number of discrete hedging steps.
The Monte Carlo methods collect the results of P&L from many paths, and report the replication error with respect to mean, deviation, etc.
Parallelization Steps
The discrete hedging example using Monte Carlo methods is intrinsically parallel. The majority of the execution time is spent on generating and pricing paths, while the computation of one path is independent of the others. One would expect to use a cilk_for loop to generate and price all paths, and use a reducer to collect the results. Our work of parallelizing this example involves three steps:
- Identify the portion of the code that lend themselves to parallelism, and expose the parallelism using Cilk++ keywords (cilk_for, cilk_spawn, cilk_sync) and hyperobjects.
- Run Cilkscreen race detector to check for data races, and fix them.
- Run the parallelized example and compute the speedup. Identify performance problems and tune the code for better parallel performance.
Step 1: Cilkify the example
As mentioned, the majority of the execution time of the example is spent on generating and pricing paths. This part of work is carried out at function ReplicationError::compute in DiscreteHedging.cpp. Therefore, we chose this function as an entry point to cilkify the discrete hedging example.
The original version of ReplicationError::compute is presented in Figure 1. The code mainly sets up the financial instruments, and calls a Monte Carlo simulation to carry out the exploration of nSamples number of paths.
Hide/Show this code
// The computation over nSamples paths of the P&L distribution
void ReplicationError::compute(Size nTimeSteps, Size nSamples)
{
QL_REQUIRE(nTimeSteps>0, "the number of steps must be > 0");
/* Black-Scholes framework: the underlying stock price evolves
lognormally with a fixed known volatility that stays constant
throughout time.
*/
Calendar calendar = TARGET();
Date today = Date::todaysDate();
DayCounter dayCount = Actual365Fixed();
Handle<quote> stateVariable(
boost::shared_ptr<quote>(new SimpleQuote(s0_)));
Handle<yieldtermstructure> riskFreeRate(
boost::shared_ptr<yieldtermstructure>(
new FlatForward(today, r_, dayCount)));
Handle<yieldtermstructure> dividendYield(
boost::shared_ptr<yieldtermstructure>(
new FlatForward(today, 0.0, dayCount)));
Handle<blackvoltermstructure> volatility(
boost::shared_ptr<blackvoltermstructure>(
new BlackConstantVol(today, calendar, sigma_, dayCount)));
boost::shared_ptr<stochasticprocess1d> diffusion(
new BlackScholesMertonProcess(stateVariable, dividendYield,
riskFreeRate, volatility));
// Black Scholes equation rules the path generator:
// at each step the log of the stock
// will have drift and sigma^2 variance
PseudoRandom::rsg_type rsg =
PseudoRandom::make_sequence_generator(nTimeSteps, 0);
bool brownianBridge = false;
typedef SingleVariate<pseudorandom>::path_generator_type generator_type;
boost::shared_ptr<generator_type> myPathGenerator(new
generator_type(diffusion, maturity_, nTimeSteps,
rsg, brownianBridge));
// The replication strategy's Profit&Loss is computed for each path
// of the stock. The path pricer knows how to price a path using its
// value() method
boost::shared_ptr<pathpricer><path> > myPathPricer(new
ReplicationPathPricer(payoff_.optionType(), payoff_.strike(),
r_, maturity_, sigma_));
// a statistics accumulator for the path-dependant Profit&Loss values
Statistics statisticsAccumulator;
// The Monte Carlo model generates paths using myPathGenerator
// each path is priced using myPathPricer
// prices will be accumulated into statisticsAccumulator
MonteCarloModel<singlevariate,pseudorandom>
MCSimulation(myPathGenerator,
myPathPricer,
statisticsAccumulator,
false);
// the model simulates nSamples paths
MCSimulation.addSamples(nSamples);
// the sampleAccumulator method
// gives access to all the methods of statisticsAccumulator
Real PLMean = MCSimulation.sampleAccumulator().mean();
Real PLStDev = MCSimulation.sampleAccumulator().standardDeviation();
Real PLSkew = MCSimulation.sampleAccumulator().skewness();
Real PLKurt = MCSimulation.sampleAccumulator().kurtosis();
// Derman and Kamal's formula
Real theorStD = std::sqrt(M_PI/4/nTimeSteps)*vega_*sigma_;
std::cout << std::fixed
<< std::setw(8) << nSamples << " | "
<< std::setw(8) << nTimeSteps << " | "
<< std::setw(8) << std::setprecision(3) << PLMean << " | "
<< std::setw(8) << std::setprecision(2) << PLStDev << " | "
<< std::setw(12) << std::setprecision(2) << theorStD << " | "
<< std::setw(8) << std::setprecision(2) << PLSkew << " | "
<< std::setw(8) << std::setprecision(2) << PLKurt << std::endl;
}
Figure 1. Original version of ReplicationError::compute function.
The Cilkified version of the compute function is shown in Figure 2. We only needed to make four simple changes in the code. The main idea is to divide the big task of exploring “nSamples” number of paths into smaller tasks which can be carried out in parallel:
- Create a Cilk++ hyperobject as a reducer to collect and summarize the results (refererred to as "Cilk++ change #1" below);
- To divide up the work, we use a cilk_for loop (refererred to as "Cilk++ change #2" below);
- "Cilk++ change #3" to pass the hyperobject as input to MonteCarloModel; and
- "Cilk++ change #4" to read the results from the hyperobject .
Hide/Show this code
// The computation over nSamples paths of the P&L distribution
void ReplicationError::compute(Size nTimeSteps, Size nSamples)
{
QL_REQUIRE(nTimeSteps>0, "the number of steps must be > 0");
//Cilk++ change #1: create a hyperobject to collect result statistics.
cilk::hyperobject<statistics> *rr = new cilk::hyperobject<statistics>;
Date today = Date::todaysDate();
Calendar calendar = TARGET();
DayCounter dayCount = Actual365Fixed();
//Cilk++ change #2: divide the big task of exploring nSamples paths into
// nTasks number of subtasks where each of them explores
// nDividedSamples number of paths. The cilk_for loop
// parallelizes the iterations over the subtasks.
int nTasks = 128;
Size nDividedSamples = nSamples / nTasks;
cilk_for(int i = 0; i < nTasks; i++) {
Handle<quote> stateVariable(
boost::shared_ptr<quote>(new SimpleQuote(s0_)));
Handle<yieldtermstructure> riskFreeRate(
boost::shared_ptr<yieldtermstructure>(
new FlatForward(today, r_, dayCount)));
Handle<yieldtermstructure> dividendYield(
boost::shared_ptr<yieldtermstructure>(
new FlatForward(today, 0.0, dayCount)));
Handle<blackvoltermstructure> volatility(
boost::shared_ptr<blackvoltermstructure>(
new BlackConstantVol(today, calendar, sigma_, dayCount)));
boost::shared_ptr<stochasticprocess1d> diffusion(
new BlackScholesMertonProcess(stateVariable, dividendYield,
riskFreeRate, volatility));
Statistics statisticsAccumulator;
PseudoRandom::rsg_type rsg =
PseudoRandom::make_sequence_generator(nTimeSteps, 0);
bool brownianBridge = false;
typedef SingleVariate<pseudorandom>::path_generator_type
generator_type;
boost::shared_ptr<generator_type> myPathGenerator(new
generator_type(diffusion, maturity_, nTimeSteps,
rsg, brownianBridge));
boost::shared_ptr<pathpricer><path> > myPathPricer(new
ReplicationPathPricer(payoff_.optionType(), payoff_.strike(),
r_, maturity_, sigma_));
MonteCarloModel<singlevariate,pseudorandom>
MCSimulation(myPathGenerator,
myPathPricer,
statisticsAccumulator,
false);
//Cilk++ change #3: Passing in the hyperobject at the result collection
// stage so the results are collected and "reduced"
// correctly.
MCSimulation.addSamples(nDividedSamples, rr);
}
//Cilk++ change #4: Getting results from the hyperobject
Size nTotalSamples = (*rr)().samples();
Real PLMean = (*rr)().mean();
Real PLStDev = (*rr)().standardDeviation();
Real PLSkew = (*rr)().skewness();
Real PLKurt = (*rr)().kurtosis();
delete rr;
Real theorStD = std::sqrt(M_PI/4/nTimeSteps)*vega_*sigma_;
std::cout << std::fixed
<< std::setw(8) << nTotalSamples << " | "
<< std::setw(8) << nTimeSteps << " | "
<< std::setw(8) << std::setprecision(3) << PLMean << " | "
<< std::setw(8) << std::setprecision(2) << PLStDev << " | "
<< std::setw(12) << std::setprecision(2) << theorStD << " | "
<< std::setw(8) << std::setprecision(2) << PLSkew << " | "
<< std::setw(8) << std::setprecision(2) << PLKurt << std::endl;
}
Figure 2: Cikified version of ReplicationError::compute function.
Based on the changes made to compute function, it is not hard to see that the remaining work at the cilkifying step is to define the statistics hyperobject, and add a method in MonteCarloModel to use the hyperobject as the data collector.
To make the Statistics class work as a hyperobject, all we need to do is to add a reduce function to the class. The Statistics class inherits the class
GeneralStatistics. We can therefore either add a reduce function on Statistics or GeneralStatistics.
We choose to add a reduce function to GeneralStatistics because such a reduce function is applicable to GeneralStatistics and all of its subclasses. Figure 3 shows the implementation of this reduce function. It simply combines the data from two GeneralStatistics objects.
Hide/Show this code
class GeneralStatistics {
public:
/*! implement reduce operation so the results can be collected
through cilk hyperobject. */
void reduce(GeneralStatistics *other) {
samples_.insert(samples_.begin(),
other->samples_.begin(), other->samples_.end());
}
. . . . . .
}
Figure 3: Implementation of reduce function in class GeneralStatiscs
Finally, we want to add a method (MonteCarloModel::addSamples(Size, cilk::hyperobject<statistics>*)) in MonteCarloModel to take the hyperobject of Statistics as the data collector. Since the orignal code has a similar method
addSamples(Size) where it uses a default collector, the new method follows the behavior of the original one, and uses the hyperobject collector.
Figure 4 shows the codes of the existing function addSample(Size), and figure 5 shows the new function with the changes.
Hide/Show this code
template <template <class> class MC, class RNG, class S>
inline void MonteCarloModel<mc,rng,s>::addSamples(Size samples) {
for(Size j = 1; j <= samples; j++) {
sample_type path = pathGenerator_->next();
result_type price = (*pathPricer_)(path.value);
. . . . . .
if (isAntitheticVariate_) {
. . . . . .
//use default collector to collect data
sampleAccumulator_.add((price+price2)/2.0, path.weight);
} else {
sampleAccumulator_.add(price, path.weight);
}
}
}
Figure 4: Existing function addSample(Size)
Hide/Show this code
template <template <class> class MC, class RNG, class S>
inline void MonteCarloModel<MC,RNG,S>::addSamples(Size samples,
cilk::hyperobject<Statistics>* acc) {
for(Size j = 1; j <= samples; j++) {
sample_type path = pathGenerator_->next();
result_type price = (*pathPricer_)(path.value);
. . . . . .
if (isAntitheticVariate_) {
. . . . . .
//use hyperobject to collect data
(*acc)().add((price+price2)/2.0, path.weight);
} else {
(*acc)().add(price, path.weight);
}
}
}
Figure 5: New function addSample(Size, cilk::hyperobject<statistics>*)
So far, we have completed the first step – Cilkifying the example.
Step 2: Check and Eliminate Races
After Cilkifying the example, we ran Cilkscreen to detect data races. The data race report indicated that the data races primarily occur in two locations: the Date class, and the random number generator.
The races on Date class are straightforward. For example, as shown in Figure 6, Date has a function minDate() which generates a static instance of the class. When minDate() is called by concurrent threads, it creates data races. A fix for this type of races is to remove the static qualifier, and instantiate a copy of the Date class.
Hide/Show this code
Date Date::minDate() {
//changed from:
//static const Date minimumDate(minimumSerialNumber());
//to:
const Date minimumDate(minimumSerialNumber());
return minimumDate;
}
Figure 6. Revised codes for minDate() function of the class Date.
The races on the random number generators took some effort to track down. The root of the data races is that each call PseudoRandom::make_sequence_generator in the compute function creates an instance of MersenneTwisterUniformRng
class to generate uniform random numbers. Each instance of MersenneTwisterUniformRng needs a seed to be initialized. The seed generator class SeedGenerator inherits Singleton class, which applies a Singleton pattern and ensures only one unique instance of the class exists at anytime.
When different parallel instances of MersenneTwisterUniformRng try to get the seed concurrently, they call the get() function of the same instance of SeedGenerator. Since get() is not a constant function and changes variables of SeedGenerator, there is a data race. Our solution to these races was to add a non-Singleton seed generator class NSSeedGenerator,
and let MersenneTwisterUniformRng get seed from an instance of NSSeedGenerator. Figure 7 shows the new class NSSeedGenerator, whose implementation is similar to that of SeedGeneraton, where the difference is that this new class doesn't inherit the Singleton class.
Figure 8 presents the revised code of MersenneTwisterUniformRngcalling NSSeedGenerator.
In summary, the loop, as originally coded, is not a good candidate for parallelization since each loop iteration depends upon the results of the previous iteration.
Hide/Show this code
class NSSeedGenerator {
public:
NSSeedGenerator();
unsigned long get();
private:
void initialize();
MersenneTwisterUniformRng rng_;
};
NSSeedGenerator::NSSeedGenerator() : rng_(42UL) {
initialize();
}
void NSSeedGenerator::initialize() {
// firstSeed is chosen based on system time and used for the rng
unsigned long firstSeed = tick();
MersenneTwisterUniformRng first(firstSeed);
// secondSeed is as random as it could be
// feel free to suggest improvements
unsigned long secondSeed = first.nextInt32();
MersenneTwisterUniformRng second(secondSeed);
// use the second rng to initialize the final one
unsigned long skip = second.nextInt32() % 1000;
std::vector<unsigned long> init(4);
init[0]=second.nextInt32();
init[1]=second.nextInt32();
init[2]=second.nextInt32();
init[3]=second.nextInt32();
rng_ = MersenneTwisterUniformRng(init);
for (unsigned long i=0; i<skip ; i++)
rng_.nextInt32();
}
unsigned long NSSeedGenerator::get() {
return rng_.nextInt32();
}
Figure 7. Codes of non-Singleton SeedGenerator class NSSeedGenerator.
Hide/Show this code
void MersenneTwisterUniformRng::seedInitialization(unsigned long seed) {
//change from:
//unsigned long s = (seed != 0 ? seed : SeedGenerator::instance().get());
//to:
unsigned long s;
if(seed == 0){
NSSeedGenerator sg;
s = sg.get();
} else {
s = seed;
}
mt[0]= s & 0xffffffffUL;
for (mti=1; mti<N; mti++) {
mt[mti] =
(1812433253UL * (mt[mti-1] ^ (mt[mti-1] >> 30)) + mti);
mt[mti] &= 0xffffffffUL;
}
}
Figure 8. Revised code of MersenneTwisterUniformRng calling NSSeedGenerator to get seed.
Step 3: Performance Tuning
With the data races eliminated, we were now ready to test the performance of our first implementation. We ran the parallelized discrete hedging code on an AMD machine with 16 cores (four Quad-Core AMD Opterons connected by HyperTransport). With hedge = 10 and nSamples = 1,000,000, we only saw a 4x speed-up on 16 cores. What was the cause of performance bottleneck?
The Cilk++ Parallel Performance Analyzer indicated that the parallelism is around 128, and therefore the program should have sufficient parallelism on 16 cores. Moreover, the burdened parallelism – which considers the runtime and scheduling overhead, and may be a better estimate of the achievable speedup – of the program is close to its parallelism around 128, which indicates that the granularity of the parallel sections is also sufficiently large. Exploring the code performance further (by incorporating timing measurements in several locations), we quickly zeroed in on the fact that the number of atomic instructions of the code measured by our Parallel Performance Analyzer was large. Specifically, the number of atomic instructions was around 5% of the number of instructions on span. These atomic operations could be quite costly in terms of performance.
Further analysis showed that these atomic operations are mainly the result of doing reference counting when using boost::shared_ptr. In particular, the atomic operations occurring in the GeneralBlackScholes::drift() function accounted for much of this overhead.
This function is called at every step of a path for all paths. Figure 9 shows the original implementation of drift() function, and a chain of functions/classes called. In summary, drift() invokes YieldTermStructure::forwardRate() function twice, where it creates a new instance of InterestRate each time. The constructor of the InterestRate class creates a new instance of DayCounter, and the DayCounter class mallocs object instances referred by shared_ptr. Therefore, the drift() function that has little "real" work, and mainly performs dynamic object management and atomic operations.
We decided to revise the drift() function and its callees as shown in Figure 10.
We add a new method simpleForwardRate to YieldTermStructure. Instead of creating an instance of InterestRate that again creates instances of DayCounter, simpleForwardRate computes and returns a Real value (type double) by following the computation of the interest rate in InterestRate class without going through the large overhead steps of object creation and shared pointer management. The changes made in drift() function and its callees turned out to significantly impact the performance. With these changes, the number of atomic operations was reduced by 96% , and the new code performs well – nearly 14X on 16 cores (see below).
Hide/Show this code
Real GeneralizedBlackScholesProcess::drift(Time t, Real x) const {
Real sigma = diffusion(t,x);
// we could be more anticipatory if we know the right dt
// for which the drift will be used
Time t1 = t + 0.0001;
return riskFreeRate_->forwardRate(t,t1,Continuous,NoFrequency,true)
- dividendYield_->forwardRate(t,t1,Continuous,NoFrequency,true)
- 0.5 * sigma * sigma;
}
InterestRate YieldTermStructure::forwardRate(Time t1,
Time t2,
Compounding comp,
Frequency freq,
bool extrapolate) const {
if (t2==t1) t2=t1+0.0001;
QL_REQUIRE(t2>t1, "t2 (" << t2 << ") < t1 (" << t2 << ")");
Real compound = discount(t1, extrapolate)/discount(t2, extrapolate);
return InterestRate::impliedRate(compound, t2-t1,
dayCounter(), comp, freq);
}
InterestRate::InterestRate(Rate r, const DayCounter& dc,
Compounding comp, Frequency freq)
: r_(r), dc_(dc), comp_(comp), freqMakesSense_(false) {
......
}
InterestRate InterestRate::impliedRate(Real compound, Time t,
const DayCounter& resultDC,
Compounding comp, Frequency freq) {
QL_REQUIRE(compound>0.0, "positive compound factor required");
QL_REQUIRE(t>0.0, "positive time required");
Rate r;
switch (comp) {
case Simple:
r = (compound - 1.0)/t;
break;
case Compounded:
r = (std::pow(compound, 1.0/(Real(freq)*t))-1.0)*Real(freq);
break;
case Continuous:
r = std::log(compound)/t;
break;
case SimpleThenCompounded:
if (t<=1.0/Real(freq))
r = (compound - 1.0)/t;
else
r = (std::pow(compound, 1.0/(Real(freq)*t))-1.0)*Real(freq);
break;
default:
QL_FAIL("unknown compounding convention ("
<< Integer(comp) << ")");
}
return InterestRate(r, resultDC, comp, freq);
}
Figure 9: Original Implementation of GeneralBlackScholes::drift() function and its callees
Hide/Show this code
Real GeneralizedBlackScholesProcess::drift(Time t, Real x) const {
Real sigma = diffusion(t,x);
Time t1 = t + 0.0001;
Rate rfRate = riskFreeRate_->forwardRateSimple(t,t1,Continuous,NoFrequency,true);
Rate dyRate = dividendYield_->forwardRateSimple(t,t1,Continuous,NoFrequency,true);
return rfRate
- dyRate
- 0.5 * sigma * sigma;
}
Rate compute_interest_rate(Real compound,
Time t,
Compounding comp,
Frequency freq) {
Rate r;
switch (comp) {
case Simple:
r = (compound - 1.0)/t;
break;
case Compounded:
r = (std::pow(compound, 1.0/(Real(freq)*t))-1.0)*Real(freq);
break;
case Continuous:
r = std::log(compound)/t;
break;
case SimpleThenCompounded:
if (t<=1.0/Real(freq))
r = (compound - 1.0)/t;
else
r = (std::pow(compound, 1.0/(Real(freq)*t))-1.0)*Real(freq);
break;
default:
QL_FAIL("unknown compounding convention ("
<< Integer(comp) << ")");
}
return r;
}
Rate YieldTermStructure::forwardRateSimple(Time t1,
Time t2,
Compounding comp,
Frequency freq,
bool extrapolate) const {
if (t2==t1) t2=t1+0.0001;
QL_REQUIRE(t2>t1, "t2 (" << t2 << ") < t1 (" << t2 << ")");
Real compound = discount(t1, extrapolate)/discount(t2, extrapolate);
return compute_interest_rate(compound, t2-t1,
comp, freq);
}
Figure 10: Revised Implementation of GeneralBlackScholes::drift() function and its callees
Performance
After one round of performance tuning, we revised a small portion of code that was is called many times and was largely responsible for the performance problems. The revised parallelized code scales nicely, achieving a speed-up of nearly 14X on 16 cores (4 quad-core 1.1 GHz CPUs with 8G RAM).
In summary, the original function was not well optimized for serial execution, and this inefficiency remained in the Cilk++ parallel program. Our optimization improves both programs, and it's always advisable to tune C++ programs as much as possible before the Cilk++ conversion. In practice, we've frequently found that Cilk++ conversion leads to close inspection of the code and results in performance inprovements to both the Cilk++ and C++ code.
For more complete information about compiler optimizations, see our Optimization Notice.
Comments (2) 
| February 28, 2011 5:44 AM PST
zc | Is it possible to get the source code? |


atul kumthekar
We are trying to evaluate performance of volatility calculations using Heston's algorithm in R language
on a 64 core architecture. Is their any way I can benchmark on the systems you are talking about?
Can i submit the code to be tried out to someone? We do use Monte Carlo and our problem is embarassingly parallel (thinks i)