Parallelizing Legacy code using Fine Grained Distributed Processing

By Tom Spyrou (5 posts) on September 2, 2009 at 1:21 pm

Adding parallel processing to legacy code is a desire of every software company that has an existing product which is significant in complexity and which needs to run faster. Processor clock rates are not increasing much and now multiple cores are being added to chips instead. The problem of speeding up software is moving from a hardware improvement problem to a software parallelization problem. This is a follow-on post to Why Parallel Processing? Why now? What about my legacy code?, please read this posting for more background.

Typically with multi-core processors, the first thought is to use multiple threads in a shared memory programming paradigm to parallelize a software algorithm. This approach can work very well, especially for software designed from the ground up to be thread safe and thread efficient. Thread safety means that the threads and data structures are written in such a way that there are no race conditions between the threads for shared data. Thread efficient is a term I use to discuss the efficiency of the scheme used to avoid race conditions as well as the code’s ability to efficiently use the processor and its cache and memory bandwidth to keep the processors busy. Making legacy code thread safe and thread efficient is often a difficult task, especially for large pre-existing code bases and/or code bases that have been developed over a long period of time. In such code a top level understanding of the code call chains and architecture is often not complete. Simple locking can make the code thread safe but often leads to locks which have a long duration and make the code thread safe but not thread efficient. Re-coding the data structures and code can be prohibitively expensive and lengthy for the short term needs of the software’s user.

To scale serial programs through parallelization in the most thread efficient way typically involves a major re-architecture and re-implementation. This is especially true when scaling programs not only to multi-core systems but to many-core systems. Multi-core is usually defined as systems with 4, 8 maybe 16 cpus. Many-core is defined as systems having at least 16 processors and usually 64 or 128 processors. It is nearly impossible to get significant speedups on many-core systems without coding with such systems in mind. I do not think the approaches and shortcuts I have used on multi-core systems will allow legacy software to scale on many-core systems. However it is possible on multi-core systems to get a decent speedup with smaller changes to the code using clever software engineering approaches.

In this posting I want to discuss one such approach which is the use of Fine Grained Distributed Processing to speed up compute intensive pieces of serial programs.

Many programs which run for a long time have parts of the code which are a significant bottleneck or percentage of the runtime. These parts of the code can be found by profiling the code and looking to see where the time is spent. Once the code that needs to be sped up is found, the first step, before attempting any parallel coding, is to speed up the single CPU performance as much as possible. Optimizing the single CPU algorithm and data structure layout and access is key to being able to parallelize effectively. If the algorithm is poor then even when parallelized there will still be inefficient use of cpus. If the data locality is poor causing many cache misses then the memory to CPU data bandwidth will be the bottleneck in the single CPU version of the code and this will get even worse in the N-cpu version of the code since at the limit the parallel algorithm may require N times the amount of data bandwidth. In my experience this optimization is generally a local optimization and requires changing the local algorithm and the specific data structures upon which it depends. Sometimes it turns out to be a more global problem and when this is the case it is definately better to find out about it before trying to parallelize the program and fix it in the serial version.

Once the code and data access is optimized, then the next step is to find parallelism in the algorithm. Many times algorithms or steps of a program which are time consuming involve loops or other constructs which can be decomposed into parallel pieces. If the underlying code and structures are or can be made thread safe then shared memory threads can be a great way to go. Often this is not the case. When shared memory threads are not feasible one possible approach is to start separate processes, either on the local machine, if there are cpus available, or on remote machines that have good connectivity to the master process.

The remote processes can then be set up as servers which receive messages with work to do and which reply with a result. The client server paradigm fits well here. This paradigm is similiar to having threads running and whose work is determined by populating a queue. The remote processes are servers much like a web server which take requests and return results. The main program is the client and like an internet browser orchestrates the work and is the main interface to the user.

The protocol between the master and slave processes should be set up so that whether the slave ends up running on the local machine or a remote host that the socket programming will be the same. In this way the distributed programming can work on a single multi-core machine or across a farm of machines. On a farm the master program will not know what machine the slave will wake up on so a small handshaking protocol can be set up to initialize the connection between the master and the slave. I have summarized one simple protocol below. This protocol is kept very simple to avoid any chance of deadlock and to make debugging simple. The slaves are set up to log incoming messages so that if there is a bug in the slave it can be debugged independently of the master and other slaves.

The application which I am working on uses TCL as its command line language. To keep the protocol even simpler I found it helpful for the master and slave to be the same executable and for the master to pass TCL commands whose parameters contained the work to be done and whose return values were the results of the work. In this way the act of adding new TCL commands to perform the specialized work could be done in such a way as to re-use code from the serial master and allow a lower level access to one of the sub-calculations which was part of the serial master routine that is being parallelized.

The master program was loaded with the full set of data structures and state of the original serial program. The slave invocations of the same executable were done in a stateless mode. No data was loaded and the program was set up to listen to a socket for TCL commands to execute and to reply to with results. These stateless servers have the advantage that they use very little memory and free any memory used from work command to work command. They tend to have great data locality since the incoming message is converted to a small targeted set of data structures by the same processor which will perform the calculations. It is also helpful since a process tends to have affinity for a single processor and a processor has affinity for the data which it allocates.

I call this approach Fine Grained Distributed Processing since the TCL commands can be a very fine grain of work as long as the work is enough to offset the time spent to package up the message for the socket and return the result. Even when running on a single multi-core machine the programming paradigm is distributed.

The servers do not have to strictly be stateless, but the more state they have, the more complex the protocol and synchronization becomes between the processes.

Some observations that I have with this approach are that on a modern local machine with 4cpus I could send 30K 30KB size messages per second when the master sent messages and the slaves did nothing but echo the message back to the master. I also noticed in real use that if the task size became less that 20ms that scalability really would suffer. These benchmarks need to be taken into account when designing the messages and analyzing the suitability of this approach for a given problem. On our Cadence Farm which has machines with gigabit or better connectivity I saw the benchmark degrade by about 10% to 30% when the slaves were on remote machines. I could send the same number of shorter messages or reduce the number of messages. This number I am sure will vary a lot from farm to farm and also varied for me from run to run which I assume was affected by network traffic at the time. In production usage we have used the local machine for fine grained tasks and use the remote version for larger tasks of multiple minutes in duration. In this way the message passing overhead is kept minimal.

One example application I used this technique on was to parallelize a parasitic reducer of electronic circuits. Basically on a chip the wires connecting the chips would ideally be perfect conductors but there are parasitic resistances and capacitances that need to be analyzed before the speed of the circuit can be calculated. This analysis is called reduction since the output of the analysis is a reduced mathematical model known as a transfer function. This problem is easily parallelizable since the wires connecting the devices can be computed in any order. However the code in question depends on large data structures and a math library that are not thread safe. Rather than make the major changes needed for thread safety, I tried the fine grained distributed approach with sockets and tcl commands. I was able to get around a 3X speedup on 4cpus for this step. An example of a small circuit with around 20K nets connecting components is below.

The Fine Grained Distributed Processing approach can be an easy and effective way to get parallelism out of legacy code for certain problems where the messages are easy enough to compose and not too long compared to the work that needs to be done.

One thing to note is that TCL itself has a robust socket language which is easy to use and works well across platforms. I recommend it for prototyping and quick work. http://www.tcl.tk/about/netserver.html has a small example.

This is the first time I have tried to describe this approach in pure written form, usually I do it in person. If you have any questions post a comment and we can discuss.

Categories: Parallel Programming, Software Tools
Tags: , , , ,

For more complete information about compiler optimizations, see our Optimization Notice.

Comments (10)

September 3, 2009 11:49 AM PDT

Gastón C. Hillar
Gastón C. HillarTotal Points:
4,424
Black Belt
Hey Tom,

Thanks for the detais. Really nice experience. I've enjoyed reading your posts.
You're talking about an EDA application. I've worked with this kind of tools. So, here is my question.
EDA usually perform thousands of complex mathematics operations. Have you added code to take advantage of SIMD, SSE, SSE2, SSE3, SSE4 instructions. For example, have you tested some parts of the code using Intel Math Kernel Library.
If you haven't I do believe you'd be able to achieve higher performance enhancements because you will be able to combine multicore + SIMD.
I had an experience recoding an EDA app (ten years ago) and I used SSE. It really improved performance. Now, with Intel Math Kernel Library it's easier than ever.

Again, great posts!

Cheers,

Gaston
September 3, 2009 12:22 PM PDT

Tom Spyrou
Tom Spyrou
Hi Gaston,

Thanks for the comment. Yes, when tuning the single cpu performance we tried out the Intel Math Kernel Library and like you said it was a big help. That library is obviously very well tuned. We also tried the Intel Compiler and found it to vectorize about the same spots in the code as gcc 4.0. One thing that the Intel Compiler helped with, again with single cpu code, was that it generated diagnostics that pointed us to a couple of buried problems with bitfields that were really slowing us down. So we've had a good experience with both the math libraries, which are now standard in our production tool, and ICC.

Tom
September 4, 2009 2:04 AM PDT


Robin Harker
HI Tom,

There is another, simpler way of acheiving a throughput speedup, of legacy codes, (such as EDA tools), on multi-core CPUs and without re-coding at all.

MultiCore-Optimizer, MCOPt for short, is a resource allocation middleware which only allows jobs to run when sufficient CPU and memory is available, so preventing system thrashing, memory swapping. This means jobs run at their optimum speed, because they are not contending for system resources, when scarce. Net result is you get greater throughput from your Intel or AMD multi-core CPU.

Robin
September 5, 2009 8:54 PM PDT

Tom Spyrou
Tom Spyrou
Hi Robin,

Here I was not discussing parallelizing multiple jobs but taking what currently could only be run as one job and allowing it to spread its works across mutiple cpus. It can't be automated but at this fined grained level needs the developer to have a detailed understanding of the algorithm and break up the work. Can you describe what MCOpt does in more detail?

Tom
September 7, 2009 1:59 AM PDT


Robin Harker
Hi Tom,

I realise we are talking about slightly different things, but I think it's worth bringing MCOPt into the discussion, as there are many serial jobs which wil probably benefit from a product like MCOPt and need no re-coding or tuning whatsoever.

NB MCOPt helps increase throughput, but also maintains minimum job run time, by ensuring jobs do not conflict with one another as they fight for resource.

What does MCOPt do?

Firstly we are doing real time job profiling, which allows MCOpt to know what resources are being used by running jobs, as well knowing what resoures are available on a given system and what resources are free. MCOPt then ensures optimum throughput by only allowing jobs to be started if sufficient resource is free. So if a job needs 10GB of RAM, but there is only 9GB free, that job will be blocked until at least 10GB is free. This ensures that swapping does not occur. So many EDA sites have problems in running a mixture if small and large memory jobs on the same cluster, because swapping, usually, adversely affects the large memory job, (which tend to be the expensive tools), causing runtime delay or failure to complete, rather than the small jobs.

We also improve throughput, with our pre-emptible backfilling mechanism. i.e. during an I/O wait, a running job is suspended, and it's resource is made available to a waiting job. As soon as the I/O is completed, we suspend the backfiller, re-start the original job, until it completes, or until it initiates another I/O call. This is again particularly useful for large EDA simulation runs, where the memory footprint of a job can either grow or shrink as the job is running. With MCOPt, we not only allow the large memory job to run, but also make the most efficient usage of system memory for other jobs, so maintaining optimum throughput.

Hope this helps?

Robin



September 9, 2009 2:25 PM PDT

Tom Spyrou
Tom Spyrou
Can your system work on machines managed by lsf?
September 10, 2009 1:49 AM PDT


Robin Harker
Absolutely, LSF, SGE, PBSPro, Torque, Symphony, etc. You rename bsub to our bsub, then any job that goes through the new bsub gets controlled by MCOPt.

You can also use it on standalone, (i.e. non GRID/cluster), environments. Then you prefix jobs with "mcorerun" and again job gets controlled by MCOPt, instead of by the linux scheduler.

Do you want a demo copy?

Robin
September 10, 2009 5:45 PM PDT

Tom Spyrou
Tom Spyrou
It sounds pretty useful. In our farm we all use resource strings to avoid contention but our customers don't.
I think it would be something I could recommment people try out when there job sizes are not predictable.
September 15, 2009 1:16 PM PDT


Robin Harker
Tom,

Were are you based, Santa Clara?
October 8, 2009 6:09 PM PDT

Tom Spyrou
Tom Spyrou
Hi Everyone,

I thought that I would post this since one of my co-workers at Cadence will be giving a webinar on his experience parallelizing an existing application. The application and algorithms involved are really complex. This should be especially interesting because it involves legacy code and also the development environment was Windows unlike many EDA applications which run on Linux/Unix. I really recommending watching.

Tom

October 27: John Schiavone, Cadence Design Systems: Real World Parallelism: Refactoring Legacy Code and Implementing Concurrency Cadence Allegro's complex Design Rules Checking (DCR) process is used to verify that designs meet constraint requirements. Development is currently underway to improve the performance of the DRC process using multithreading. View the design architecture and learn about the challenges faced in refactoring the legacy code, achieving platform independence, and performance verification.<a href="https://event.on24.com/event/36/88/3/rt/1/index.html?&eventid=36883&sessionid=1&key=D76A2FD29D7444AEC06765011A2D4953&tab=1&sourcepage=register">Link to John's Webinar</a>




Trackbacks (7)


Leave a comment  

To obtain technical support, please go to Software Support.
Name (required)*

Email (required; will not be displayed on this page)*

Your URL (optional)


Comment*