Preliminary results on a parallelizing runtime for an implicitly parallel language


Hello everyone, my name is Dan Amelang and I'm a PhD student at the University of California, San Diego. As part of my dissertation research I'm developing a declarative, implicitly parallel programming language. The language is similar to previous work in dataflow, stream processing and functional languages. Closely related work includes SISAL, Kahn Process Networks, Dataflow Process Networks, and StreamIt, among others. It also borrows from Haskell somewhat.

Up to now, the focus has been on making the language as expressive as possible. But as of last October, the language design was mostly complete and I had a fairly large case study, so I decided to take a shot at parallelizing programs written in the language. Because the language is implicitly parallel, no program modifications should be needed to get parallel execution; I only needed to modify the compiler and runtime.


The previous runtime had to be completely rewritten. In particular, scheduling and garbage collection required redesigning (after reading several papers on the topics), and problems with false sharing required restructuring internal data structures. It was quite an adventure, but eventually I came up with something I was happy with.

Unfortunately, the changes to the runtime meant that the compiler would have to generate code somewhat differently (the compiler generates C source code as its output). I should point out that all the "parallel smarts" are in the runtime -- the compiler merely generates code that uses the runtime's API.

But, rather than take the time to fix the compiler, I decided to manually write some simple programs that resembled those typically generated by the compiler. I had already begun to do this earlier as I was trying out new ideas in the runtime. So the results I'm going to report are, unfortunately, not for code generated by the compiler. But I've tried to make the benchmarks very similar to the code that the compiler usually generates for my case studies.


For all benchmarks, the actual work performed on each floating point input number is a call to the logf() function. Because each stage's output becomes the next stage's input, any output less than zero is inverted, since negative numbers are outside the domain of logf. Initial input numbers were generated randomly via rand(). The benchmarks differ in how many times logf() is called per input number (each stage can call logf multiple times for each input number). By changing the number of logf calls per input number, we can artificially increase the workload per input number for a given benchmark. The benchmarks also differ in how long each pipeline is, how many numbers are in the input stream, and how many pipelines are executed in total during the benchmark.

The first benchmark is the least realistic, representing a best case scenario for parallel execution. It executes several hundred pipelines of 40 stages each, feeding each pipeline a stream of 3000 floating point numbers. Each stage calls logf() 100 times per input number. Intuitively, this benchmark represents a program with many, deep pipelines, a very high work/input element ratio and modest input size (per pipeline). This means that there is lots of work to do (no cores should have reason to idle), and the cores work quite a bit on their own, without needing to communicate with other cores very frequently. This is obviously a very favorable scenario for parallel execution. I ran the benchmark using 1, 2, 4, 8, 16, 24 and 32 cores. As expected, speedup was close to linear all the way up to 32 cores. Absolute time was 3m58.034s on 1 core and 0m7.680s on 32 cores, for a speedup of ~31x. Not a bad start.

The next benchmark is more realistic, reducing the number of stages per pipeline to 10. I have several pipelines in my case study that are about this deep. The workload is still unrealistic, but we'll address that later. Speedup was about linear up to 16 cores, but for 24 cores it was 22.7, and for 32 it was only 27. Not bad, but could be better.

I then further reduced the pipeline depth to 5 stages, which is pretty shallow but still common in my experience. Speedup was good up to 24 cores again, but there was virtually no improvement beyond that.

Next, I restored the pipeline depth to 10, reduced the input stream length to 1000 (which is pretty small), and drastically reduced the workload per input element: now only a single call to logf is issued per input number (as opposed to 100 calls). From my experience, this is a very realistic benchmark: moderate pipeline depth, moderate-to-smallish input stream size and a light workload. I executed a million of these pipelines, which for, say, computer graphics, is very reasonable. I was very surprised to get a linear speedup all the way to 24 cores, with a 30.7x speedup for 32 cores. Very encouraging!

I should note that at this point, results began to vary from run to run. I'm reporting the best time above, but I am a bit concerned about my scheduling algorithm and perhaps the OS scheduler's influence. Ideally, I'd give more hints to the OS about which threads should run where. I'll look into this later.

Finally, I stacked the deck quite a bit against myself by reducing the workload to a mere absolute-value calculation per input number. This is a fairly unrealistic workload. Speedup was linear up to 8 cores; after that, I started getting slowdowns. I could get even more unrealistic at this point and make the pipelines very shallow, or the input streams very short, but there isn't much point.

You'll have to excuse the lack of speedup graphs. These are "first attempt" results and I expect to put more effort into the presentation after things settle down more.


I do have a couple of ideas for tweaks to the runtime to address some scalability issues I ran into. Given where I hoped to be at this point, though, I'm pretty happy, so I'm going to wait a bit before implementing these ideas.

My focus right now is to go back and fix the compiler to generate code that matches my new parallel runtime. After that, I hope to benchmark larger programs on the Manycore Testing Lab machines and report back here.

A big thank you to Intel and the people at the MTL for providing this fabulous service. It has been very enlightening to use the MTL machines and the MTL staff have been extremely helpful and responsive.



Sounds like a great project! As a point of advice in case you decide to try to turn this into a publication, most researchers in the field will also want to see a speedup baseline that isn't the one-processor version of your own system. So, if you took an optimized sequential C version of your benchmarks:
1) How many processors do you need before you can match that C version's speed?
2) How do you continue to scale past that point?

We've been working in this space for the last four years or so (Manticore, an implicitly-threaded parallel dialect of ML), and providing a good baseline has been critical not only to publication but also to double-checking our own work. Running your parallel runtime on just 1 core tells you how much overhead you've introduced, and speedup curves against the sequential C baseline on multiple cores tell you how much of your curve reflects actually speeding up the problem versus merely eroding your own system's overhead.
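The distinction between the two baselines is just a matter of which time goes in the numerator. A minimal sketch, using the 1-core and 32-core times reported in the first benchmark above:

```c
#include <assert.h>
#include <math.h>

/* Speedup is a ratio of wall-clock times; what matters is the choice
   of baseline. Self-relative speedup uses the parallel runtime's own
   1-core time; the absolute speedup suggested here uses an optimized
   sequential C version's time instead. */
static double speedup(double baseline_secs, double parallel_secs)
{
    return baseline_secs / parallel_secs;
}
```

For example, the reported ~31x for the first benchmark is the self-relative figure speedup(238.034, 7.680); substituting a sequential C time for the first argument would show both the runtime's overhead at 1 core and how it scales past the break-even point.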

Best of luck!

Hi Lars!

Thanks for the suggestion -- it makes a lot of sense. I'm definitely going to include this comparison in my next benchmark session.

I am already somewhat familiar with the Manticore project (I think it was a DAMP paper I read last year). It also sounds like an interesting project, best of luck to you too.

