Multiple Instance of Application with Degraded Performance

I am running a large set of simulations using a Fortran application.  My standard approach is to run multiple instances of the application, several jobs at a time, with a batch file in each CMD window so the entire set can be processed.  The number of instances depends on the number of physical cores, but is usually 3 or 4.  I have noticed a large increase in run time when running 3 instances versus 1 (about 2.5 times slower), although it is still faster to run 3 at a time than 3 in series.  I use MKL, and the functions I am using are multithreaded by default.  I have found that setting the environment variable MKL_NUM_THREADS=1 helps with the slowdown, but only gets it to the 2.5-times-slower mark.  I am familiar with multithreading but not with how to harness it for good.  For what it is worth, I am an engineer, not a programmer, so my application is very basic, as are my programming ninja skills.  So my first question is: if I want to keep my process the same (i.e. multiple instances), are there other ways to keep the sims from competing with each other?  I played a little with the affinity mask, both as an environment variable and through Task Manager, without any luck, but this is probably because it was the first time I messed with it.  Any help would be appreciated.

There are others here who have more experience with such things, but my first guess would be that you are "oversubscribing" the cores in the initial case - more threads than there are execution units. Setting the thread count to 1 goes the other way and probably leaves cores unused. You may also have contention for I/O, memory or cache.

My first advice would be to run a single instance of the program under Intel VTune Amplifier XE and look at its chart of concurrency over time. In fact, VTune is a great tool for figuring out what is going on with your program. If you don't already have it, you can get a 30-day free trial.

Steve - Intel Developer Support

You might want to consider running with the single-threaded (sequential) version of the MKL library (as opposed to running the multi-threaded version with 1 thread).

Jim Dempsey

You can profile it with VTune to obtain CPU counter data, and later profile with Xperf to get more insight into the interaction with the OS. If you are using Windows, of course.

So I downloaded the trial version of VTune and was able to quickly find some hot spots that needed fixing.  These helped speed up a single instance, but there is still the issue of multiple instances.  There are a lot of options in VTune and I am not sure exactly what I should be looking for.  I ran the General Exploration under Sandy Bridge... for a single instance, and again while a second instance was running in a separate command shell.  The summary tab has a couple of areas highlighted which, along with some of the other memory outputs, are listed below for the two scenarios:


The first value is for a single instance and the second value is with 2 instances running:

  LLC Miss                       0.012 / 0.118
  LLC Hit                        0.09  / 0.098
  DTLB Overhead (highlighted)    0.118 / 0.121
  ICache Misses (highlighted)    0.019 / 0.02


A little more background on my application.  My application is an engineering tool that is for internal use only.  I have to deal with large matrix multiplication where the matrix size can be up to 3000 x 3000.  At a basic level, my program is laid out using global variables via MODULE and USE statements, with allocatable arrays since my problem size is always changing.  I simply read in the sizes I need via an input file and then allocate the memory.  I use global variables since I need access to a large set of my variables within most of my functions.  I have a feeling the global allocatable arrays are affecting the multiple-instance performance, and I am sure there is plenty of room for improvement in this realm.  The problem size keeps growing, so this issue of degraded performance is getting more and more important.  In the past we have been able to deal with the time increases, so dealing with memory use is new to me.  Any suggestions on what general fixes to make or what I should look for in VTune?

Do you use any prolonged file I/O operations, and does your further processing have to wait for their completion?


I am not positive I know what you are asking, but I will try my best to answer.  I am writing to a file as the application runs: it writes a line to a text file that contains about 30 columns.  Then at the end of the job there is a large write of several large matrices.  As for waiting for completion, I guess the answer is yes; the next 'case' does not start until the previous one is finished.

The problem appears to be LLC evictions caused by running multiple instances concurrently, even with single-threaded MKL.

Consider using a memory-mapped file to hold inter-process shared variables that:

a) Hold the number of instances of the application running. (requires code in your application to figure this out)
b) Hold the size of the LLC. (requires code in your application to figure this out, or an environment variable)
c) Hold the number of LLCs (number of sockets). (requires code in your application to figure this out, or an environment variable)
d) Hold a table of LLC current load values (table size == number of LLCs)
e) Hold a table of LLC total load values (table size == number of LLCs)

The first instance of the application (and/or like applications) initializes the values.
Each instance:

a) determines its estimated LLC load. This would be a function of the matrix size (or sizes, if the app uses differing sizes).
b) searches the table of LLC total load values for the entry with the (first) lowest value, uses a thread-safe means (XCHGADD) to add its estimated load, then uses the found index into this table as the LLC number (socket number) to restrict the application to.
c) using the LLC number, affinity-pins the process to that socket.
d) at end of program, removes its estimated load from the table of LLC total load values for that LLC.

During execution, use shell (wrapper) functions to call the MKL functions (e.g. matrix multiply):

a) estimate LLC load for current call (may vary during program execution) 
b) use a thread-safe load barrier

int currentLoad;
// XCHGADD returns the pre-add value, so currentLoad + myLoad is the new total
while (((currentLoad = XCHGADD(&LLCcurrentLoad[myLLC], myLoad)) + myLoad)
       > LLCloadMax) {
  XCHGADD(&LLCcurrentLoad[myLLC], -myLoad); // over the limit: remove our load
  Sleep(0); // in event of more threads than logical processors - give up time slice
}
// here when safe to use MKL
XCHGADD(&LLCcurrentLoad[myLLC], -myLoad); // remove load
// return from shell function

Jim Dempsey
