Why Parallel Processing? Why now? What about my legacy code?
By Tom Spyrou (5 posts) on August 31, 2009 at 4:35 pm
Many software companies have applications in use by their customers that have significant runtime, and for which fast runtime is a necessity or a competitive advantage. There has always been pressure to make such applications go faster. Historically, as processors increased in speed, the needed speedups could often be achieved by tuning the single-CPU performance of the program and by using the latest and fastest hardware. In the Electronic Design Automation industry that I am a part of, the newest machines have always been needed to run the design tools that were being used to design the next generation of processors. The speed and memory capacity of the newest machines had always been just enough to design the next-generation chips. Other types of CPU-intensive software have also ridden the hardware performance curve in this way.
We will no longer see significant increases in the clock speed of processors. The fastest possible processors consume so much power that the resulting heat cannot be dissipated effectively with known technologies. Instead, processor manufacturers are adding multiple processor cores to each chip. Why does this help? Power Consumed = Capacitance * Voltage^2 * Frequency. If a given calculation is moved perfectly from one processor running at N gigahertz to 2 parallel processors running at N/2 gigahertz, where do the savings come from? It would seem that each processor uses half the power, but now there are 2 processors, which would mean that the same total power is used. The savings come from the fact that slower processors can run at a lower voltage. For example, a processor running at half the frequency can run at around 8/10 of the voltage. 0.8^2 is 0.64, which implies a 36% power savings. If you scale this up to 32 CPUs, it becomes possible to get a lot of compute power for much lower power consumption and therefore much less heat to dissipate. Eventually it seems that even cell phones and other embedded devices will move to multicore processing for this reason: more compute capability, or longer battery life for the same capability. Both are compelling values.
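The arithmetic above can be checked with a few lines of Python. This is only a sketch of the simplified model in the post: the function name `relative_dynamic_power` is made up for illustration, capacitance is assumed to be the same for every core, and the 0.8 voltage factor is the post's rough figure rather than a measured value.

```python
def relative_dynamic_power(cores, freq_scale, voltage_scale):
    """Total dynamic power relative to one core at full frequency and voltage.

    Applies Power = Capacitance * Voltage^2 * Frequency per core, with the
    capacitance assumed identical for every core so that it cancels out.
    """
    return cores * voltage_scale ** 2 * freq_scale


# Baseline: one core at full frequency and full voltage.
baseline = relative_dynamic_power(cores=1, freq_scale=1.0, voltage_scale=1.0)

# Two cores at half the frequency and about 8/10 of the voltage.
dual = relative_dynamic_power(cores=2, freq_scale=0.5, voltage_scale=0.8)

savings = 1.0 - dual / baseline
print(f"relative power: {dual:.2f}, savings: {savings:.0%}")
```

The two half-speed cores together draw about 0.64 of the baseline power, which is where the 36% figure comes from.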
This power savings assumes that the software implementation of the parallel program running on the 2 slower processors is perfectly efficient. Well, nothing in the real world is perfectly efficient. Even if the coding is not perfectly efficient, as long as it is reasonably efficient, there is still a benefit. If the parallel coding is inefficient, the parallel program may use more power on the slower processors than the serial program running on the single fast processor. However, since faster processors that won’t melt can no longer be made, we are kind of stuck with going parallel and need to do our best.
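One way to put a number on "reasonably efficient" is to compare energy (power multiplied by runtime) rather than power alone. The sketch below rests on assumptions that are not in the post: the parallel version draws 0.64 of the serial power (the two-core example above), an efficiency of 1.0 means the two half-speed cores finish in the same time as the one fast core, and lower efficiency stretches the runtime proportionally. The name `relative_energy` is hypothetical.

```python
def relative_energy(parallel_efficiency, relative_power=0.64):
    """Energy of the parallel run relative to the serial run.

    Assumes energy = power * runtime, and that runtime grows as
    1 / parallel_efficiency compared with a perfectly efficient run.
    """
    if not 0.0 < parallel_efficiency <= 1.0:
        raise ValueError("parallel_efficiency must be in (0, 1]")
    return relative_power / parallel_efficiency


for eff in (1.0, 0.8, 0.64, 0.5):
    print(f"efficiency {eff:.2f} -> relative energy {relative_energy(eff):.2f}")
```

Under these assumptions the break-even point is an efficiency of 0.64: below that, the parallel run costs more energy than the serial run it replaced.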
I say stuck because, from a software development perspective, a large new burden is being placed on software developers. That burden is to write programs that are as efficient as possible and that make use of N processors, ideally where N is configurable by the user and can be increased as new processor chips with more cores become available. For most developers this is something really new and really complex. It also presents a huge discontinuity for software companies with large investments in legacy code.
I joined the company I work for in 2006 with the job of parallelizing a product with 6 million lines of code, developed over 10 years and made up of dozens of very complex, intertwined algorithmic steps. I wanted to write this introductory blog because I haven’t seen the path to where we are, and the need for parallel programming, described from the big picture. I plan to write a few blogs going forward on the problem and some solutions for parallelizing legacy code. The solutions will work in some cases and will parallelize programs to some degree, but not perfectly. In the end, software developers are engineers and need to make engineering trade-offs. Please check back on this blog for some tricks of the trade in parallelizing legacy code. I have been able to get around a 2X speedup on 4 CPUs for this 6 million line code base. Good, but there is a lot more to do. I am hoping to help others by sharing, and also to find new ideas from smart and innovative people who I hope will join in the discussion.
Categories: Parallel Programming, Software Tools
Tags: code, cosumption, legacy, Power Efficiency, processing
Comments (14)
| September 1, 2009 3:18 PM PDT
Aaron Tersteeg (Intel)
|
Tom, I don't know if it always makes sense to let the user configure the number of cores used. Doesn't it make more sense for the operating system to manage CPU utilization by applications? Letting the user define the processors used just sounds like a messy proposition. Aaron |
| September 1, 2009 3:59 PM PDT
Clay Breshears (Intel)
|
I love the idea that you will be blogging about, Tom. Can't wait to hear more from "the trenches" about how parallelism is being perceived and conquered in the real world. @Aaron - I wonder if the word "configure" was not the best choice. I think Tom was saying that the user should be able to choose N cores out of P (P >= N) cores. A software "knob" can be built into the application to make use of a number of cores that is appropriate. If more cores are available and the code will run efficiently, then more cores should be used; but the user should know the performance limits of the application and be able to choose the right number for the data and the computation involved. At least that is how I read that section. (Also, I don't think anyone expects that applications can be pulled out of thin air on which to execute.) |
| September 1, 2009 4:19 PM PDT
Tom Spyrou |
I think that if the user is running on his/her own box then the software should be able to automatically detect the number of CPUs to use. This is a common use model, but more and more companies are trying to leverage farms of shared machines for compute-intensive and/or memory-intensive applications. In the case of the shared farm, the user has to ask the farm for a number of CPUs. For example, in LSF (Load Sharing Facility) from Platform Computing the user would type "bsub -n 4 bigrun.csh". In this case the user's job might be put on a 16-CPU machine, but only 4 CPUs are reserved for it. The other CPUs will be allocated to other users' work. If the software sees 16 CPUs on the hardware and uses them all, then the farm machine will be over-subscribed and could thrash from too much context switching or even crash. Some farms will kill jobs that use too many CPUs. So it's important that there be coordination between the farm CPUs requested and the application's use of CPUs. In our software at Cadence we allow the user to specify a given number of CPUs or to specify use of all CPUs on the machine. In this way both use models are supported. |
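The two use models described in this comment (a user-specified CPU count, or "use everything on the machine") could be sketched roughly as follows. This is not Cadence's implementation: the function name `choose_worker_count` and its `requested` parameter are hypothetical, and the sketch assumes the requested count reaches the application through something like a command-line option.

```python
import os


def choose_worker_count(requested=None):
    """Return how many worker threads or processes to start.

    requested: the CPU count the user reserved from the farm, or None to
    use every CPU the operating system reports on this machine.
    """
    detected = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        # "Own box" model: use all CPUs on the machine.
        return detected
    if requested < 1:
        raise ValueError("requested must be at least 1")
    # Shared-farm model: never exceed the reservation, even if the
    # machine has more CPUs, so other users' jobs are not starved.
    return min(requested, detected)


if __name__ == "__main__":
    print("all CPUs:", choose_worker_count())
    print("reserved 4:", choose_worker_count(4))
```

Capping at the detected count is a design choice of this sketch; an application could equally treat a request larger than the machine as an error.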
| September 2, 2009 4:49 AM PDT
Stephen Doyle | I'm very impressed by the effort to parallelize 6 million lines of code and getting a 2x speedup on 4 cores. Did you use any particular tools or concurrency libraries such as OpenMP etc. to achieve this? |
| September 2, 2009 10:54 AM PDT
Gastón C. Hillar
|
Tom, What kind of application are you talking about? A single-tier software, a multi-tier solution? It would be great to know more details about the kind of application. I ask you this because you're talking about both farms and multicore. |
| September 2, 2009 12:05 PM PDT
Tom Spyrou | Stephen, I used a combination of PThreads, forks and fine-grained distributed remote processes. I plan to blog about these techniques soon. I didn't use OpenMP or anything higher level, mostly because I was already familiar with the other techniques. Also, my use of PThreads or any form of shared-memory threads was limited due to the difficulty of making the code thread safe. There was a lot of intertwined code, so I started out looking for ways to avoid the thread-safety issues, which the forks and the distributed processes connected by sockets provided. |
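The comment does not show any code, so the following is only a minimal illustration of the general idea of forked workers connected to the parent by sockets, written in Python for brevity rather than in the language of the actual product. The helper name `run_in_forked_workers` is hypothetical, and the sketch assumes a Unix-like system where `os.fork` is available. Each child works on its own copy-on-write view of the parent's memory, so the worker code does not need to be thread safe.

```python
import os
import pickle
import socket


def run_in_forked_workers(func, chunks):
    """Run func(chunk) in one forked child per chunk; return results in order."""
    children = []
    for chunk in chunks:
        parent_sock, child_sock = socket.socketpair()
        pid = os.fork()
        if pid == 0:
            # Child process: compute on a private copy of the parent's
            # memory and send the pickled result back over the socket.
            parent_sock.close()
            try:
                child_sock.sendall(pickle.dumps(func(chunk)))
            finally:
                child_sock.close()
                os._exit(0)  # leave without running the parent's cleanup
        # Parent process: keep only its own end of the socket pair.
        child_sock.close()
        children.append((pid, parent_sock))

    results = []
    for pid, sock in children:
        with sock:
            # Read until the child closes its end of the socket.
            data = b"".join(iter(lambda: sock.recv(65536), b""))
        os.waitpid(pid, 0)
        # A child that failed sends nothing, so loads() raises here.
        results.append(pickle.loads(data))
    return results


if __name__ == "__main__":
    print(run_in_forked_workers(sum, [[1, 2], [3, 4], [5, 6]]))
```

Anything a child changes in memory is invisible to the parent, which is why the result has to be sent back explicitly. That explicit data transfer is the price paid for avoiding shared-memory thread-safety work.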
| September 2, 2009 12:07 PM PDT
Tom Spyrou | Gaston, it is an EDA application, which is a single-tier application. Since the application often needs 64 GB of memory and can run for tens of hours on a single CPU, most users use the command-line interface and submit runs to an LSF queue. Within this model, the application can submit sub-processes to LSF to gain access to other CPUs, to which it sends work over sockets in a fine-grained distributed paradigm. I will blog about this soon. |
| September 2, 2009 3:29 PM PDT
Tom Spyrou | http://software.intel.com/en-us/blogs/2009/09/02/parallelizi..... rocessing/ is a follow-on to this post that describes fine-grained distributed processing. I hope it's helpful. |
| September 3, 2009 10:21 AM PDT
Casey Weltzin |
Great article Tom! You present a very clear picture of the motivation for multicore processing and a concise summary of the challenges that software developers are now faced with. Some thoughts to this point: I agree with previous comments that the end goal is to write applications that will effectively scale across any number of cores. Otherwise, programmers will find themselves rewriting code many times with subsequent generations of multicore CPUs. One option is having the programmer identify dependencies clearly, and then using a compiler that can take the dependencies and make intelligent decisions about what pieces of code can run in parallel threads and how many threads to use. Dataflow programming is an interesting way to do this (I work a lot with LabVIEW). Of course, the ideal solution for those working with existing sequential text-based code would be to create advanced compilers that automatically detect potential optimizations and then re-implement the code in parallel. This seems like a solution, however, that is not scalable in the long run. |
| September 5, 2009 9:57 PM PDT
Tom Spyrou | Can you post a link to a simple example using LabVIEW? I am not familiar with it. |
| September 8, 2009 8:50 AM PDT
Leon van der Westhuizen | Guys, this is awesome. I don't even live in your world but am caught in the cross hairs of this discussion. Good luck with parallelizing and multithreading. 6 Million line code base - IMPRESSIVE |
| September 9, 2009 3:23 PM PDT
Tom Spyrou | What world do you live in? If you can share the problem maybe we could brainstorm about possible solutions. |
| October 8, 2009 7:09 PM PDT
Tom Spyrou |
Hi Everyone, I thought that I would post this since one of my co-workers at Cadence will be giving a webinar on his experience parallelizing an existing application. The application and algorithms involved are really complex. This should be especially interesting because it involves legacy code, and also because the development environment was Windows, unlike many EDA applications, which run on Linux/Unix. I really recommend watching it. Tom. October 27: John Schiavone, Cadence Design Systems: Real World Parallelism: Refactoring Legacy Code and Implementing Concurrency. Cadence Allegro's complex Design Rules Checking (DRC) process is used to verify that designs meet constraint requirements. Development is currently underway to improve the performance of the DRC process using multithreading. View the design architecture and learn about the challenges faced in refactoring the legacy code, achieving platform independence, and performance verification. <a href="https://event.on24.com/event/36/88/3/rt/1/index.html?&eventid=36883&sessionid=1&key=D76A2FD29D7444AEC06765011A2D4953&tab=1&sourcepage=register">Link to John's Webinar</a> |
Gastón C. Hillar
It's a great idea to talk about the great adventure of parallelizing legacy: "6 Million lines of code developed over 10 years".
Be sure I'll follow your blog posts. :)
Cheers,
Gaston