To speedup or not to speedup. That is the question

With the advent of multicores, the teaching of parallel programming is again in the spotlight. However, despite the many years of experience on this subject that resulted from the supercomputing bubble of the 1980s, there is still much to be learned. Of the many issues that we could discuss about the teaching of parallel programming, the most important in my opinion is the need to teach program optimization. By this I mean the performance improvement process that comes after we have found all the important parallelism in a program. I believe most of us have seen the need for this process at some time. The reason is that (as we are also painfully aware) in many cases the first version of a parallel program is slower than its sequential counterpart, and the only way to obtain improvements is through a careful optimization process. This process may involve replacing an inefficient parallel kernel or algorithm, increasing locality, reducing communication costs, improving load balancing, avoiding overhead, avoiding conditionals, or taking advantage of instruction-level parallelism.
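To make that optimization process concrete, here is a minimal C/OpenMP sketch (the kernel, the array size, and the use of OpenMP are illustrative choices of mine, not taken from the article). The first parallel attempt funnels every update of a shared accumulator through a critical section and is usually slower than the sequential loop it replaces; one optimization step, letting each thread accumulate privately and combining the partial sums once, removes that bottleneck.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* First parallel attempt: every update of the shared accumulator is
       serialized by a critical section, so this version is typically
       slower than the plain sequential loop it replaces. */
    double sum_naive(const double *a, long n) {
        double s = 0.0;
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            #pragma omp critical
            s += a[i];
        }
        return s;
    }

    /* One optimization step later: each thread keeps a private partial
       sum and the partials are combined once at the end, removing the
       synchronization overhead that dominated the naive version. */
    double sum_tuned(const double *a, long n) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    int main(void) {
        long n = 10000000;
        double *a = malloc(n * sizeof *a);
        for (long i = 0; i < n; i++) a[i] = 1.0;
        printf("naive: %.0f  tuned: %.0f\n", sum_naive(a, n), sum_tuned(a, n));
        free(a);
        return 0;
    }

Compiled with something like gcc -O2 -fopenmp, only the second version has a realistic chance of beating the sequential loop.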

Knowing how to optimize programs is of great importance to any programmer, not just programmers of parallel machines. An inefficient program is a bad program. Interactive programs must be reasonably fast to be "usable". Efficient programs for science and engineering enable faster exploration of ideas and designs and therefore better and more accurate results within a fixed amount of time. However, after several years of rapid advances in hardware technology, interest in teaching high-performance programming techniques declined significantly. Interest in performance continued among students of machine design and some specialized software areas such as compilers, but the vast majority of programmers of sequential machines felt that in many cases they could afford to develop programs with "reasonable" performance and devote little or no time to program optimization. In the brave new world of multicores the situation is different. The programmer of these parallel machines is on the spot. Scalability, which can be measured objectively (think speedup), is of great importance for the durability of programs. Achieving it will typically require the programmer to pay close attention to the numerous factors that determine execution time. Program optimization is closely tied to the programming of parallel machines, and a good parallel programmer needs a solid background in it.
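Since speedup is the yardstick here, a short sketch of how it is measured may help (again assuming OpenMP; the kernel and problem size are illustrative). Speedup is simply the sequential wall-clock time divided by the parallel wall-clock time, and on p cores the ideal value is p.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Illustrative kernel: square every element.  The if clause lets the
       same loop run sequentially or in parallel so both can be timed. */
    static void kernel(double *a, long n, int use_threads) {
        #pragma omp parallel for if(use_threads)
        for (long i = 0; i < n; i++)
            a[i] = a[i] * a[i];
    }

    int main(void) {
        long n = 20000000;
        double *a = malloc(n * sizeof *a);
        for (long i = 0; i < n; i++) a[i] = 2.0;

        double t0 = omp_get_wtime();
        kernel(a, n, 0);                       /* sequential run */
        double t_seq = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        kernel(a, n, 1);                       /* parallel run */
        double t_par = omp_get_wtime() - t0;

        /* Speedup = sequential time / parallel time; on p cores the ideal
           value is p, and the gap from p is one measure of scalability. */
        printf("speedup %.2f on %d threads\n",
               t_seq / t_par, omp_get_max_threads());
        free(a);
        return 0;
    }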

The need to teach parallel program optimization creates an immense difficulty in the design of our curricula. Part of the problem is that much work must be done to compensate for the lack of adequate coverage of program optimization. Although many of the topics covered today are closely related to program optimization, they are not taught with the focus needed to give students a good understanding of it. Thus, the subject of algorithm design is directly relevant to optimization, but the way it is taught today is insufficient because of its focus on operation count and the asymptotic behavior of algorithms. Asymptotic behavior does not always determine actual performance and, in any case, numerous factors other than operation counts, such as locality and parallelism (at the instruction level in the case of conventional machines), influence performance. In today's computer architecture courses students learn about components such as caches, registers, functional units, and control units, but the implications of these components for the way programs should be written for performance are seldom discussed in detail. Compiler courses also cover some notions of program performance, but automatic optimization is poorly understood (in part because even manual optimization is poorly understood), and introductory compiler courses focus on only a few isolated and typically low-level optimization issues, such as register allocation and loop-invariant removal.
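A small C sketch of the operation-count point (the matrix size and the kernel are my own illustrative choices): the two routines below perform exactly the same number of additions and have the same asymptotic complexity, yet the column-wise traversal usually runs several times slower because it wastes most of every cache line it loads.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 4096

    /* Row-wise walk: touches memory contiguously, so each cache line
       brought in is fully used before it is evicted. */
    double sum_by_rows(double (*a)[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-wise walk: identical operation count, but each access jumps
       a whole row ahead, so locality is poor and the loop is much slower. */
    double sum_by_cols(double (*a)[N]) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        double (*a)[N] = malloc(sizeof(double[N][N]));
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        printf("%.0f %.0f\n", sum_by_rows(a), sum_by_cols(a));
        free(a);
        return 0;
    }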

This situation must be improved, and I am sure in time it will be. As parallel computing becomes the norm, the importance of education in program optimization will become evident. However, for the time being the teachers of parallel programming are faced with the need to educate "performance-illiterate" pupils.

Comments

Clay Breshears (Intel)

One other area in which sequential performance optimization plays a huge role is game programming. Squeezing out a few more flops or frames per second can allow more characters or better physics to be incorporated into a game.

I had not even thought about optimizations for parallel applications being done on "serial" code, beyond the parallel set you mentioned at the end of the first paragraph. And I should have, since we always recommend to customers that all serial optimizations be made before attempting parallelism. We've seen too many cases where parallelism becomes a liability when the work is done in the reverse order.

In the "good old days" of computer programming, optimizations on size of code and speed were stressed. With the increasing amount of memory and speed of processors, those skills have gone by the wayside, for the most part. If we are to include optimization techniques in the university curriculum, I can see another "but what do we take out" debate resulting.

Dave Padua

What do we take out?

This is of course a very important issue.

My opinion is that performance is and has always been one of the main themes in Computer Science: Algorithm design is mainly about performance (although other factors such as accuracy are important). Computer organization is also mainly about performance (cache memories, instruction level parallelism, multiprocessing,…). Again, compiler optimization is about this.

I claim that we need to teach this dimension of computer science well. Teaching it well may involve introducing a few new ideas and bringing to light the connections across areas (algorithms/architecture/compilers…). Possible ways to address this problem include adding topics and exercises to existing courses. This, I believe, must be done to improve the usefulness of algorithm and machine organization courses. In addition, capstone courses could be introduced to teach program performance as a way to tie together architecture, compilers, and algorithms. Perhaps a “multicore” course could be such a capstone course.

John

I'm also concerned that most of the students I see don't understand performance issues. We do have a class on it (an elective), but it doesn't get nearly as much student traffic as it deserves. Many students currently hit upper-level classes without even an understanding of why they would use anything other than the default settings on the compiler/linker, and with no idea that there are useful ways to measure the performance of code and that those measurements should guide any thoughts of optimization.

I wish the situation were different, but although most students use computers all the time, very few come into college with any idea of how a computer accomplishes anything and what that implies for how to get a computer to do what you want.

Sam Midkiff

If you look at the "Computing Curricula 2001 Computer Science" ACM report and search for "performance" and "abstraction", you find many fewer references to the former than to the latter. "Optimization" comes up in the context of language translation and dynamic optimization. From the mid-to-late 1980s until very recently, Moore's law made worrying about performance uneconomical in most cases; the challenge was getting functionality out the door. Games, HPC, and system software were different, and programmers who did well in those areas were "heroic", an exception.

While I share John's unhappiness with most students coming to college with no idea of how computers work, my bigger worry is that too many leave viewing computers as a Java machine, or as only an abstract symbol processing unit. It has been taught for years that hand tuning code can make life difficult for the compiler, and that the best strategy is to write clean, maintainable code. And while those are still necessary goals, we also need to be teaching that some levels of tuning, those that target the underlying architecture, are good (e.g., programming for multiple cores), but that others are usually bad (e.g., doing common subexpression elimination (CSE), constant propagation, or register allocation by hand).
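A small C fragment sketching the distinction Sam is drawing (the function names and constants are hypothetical, chosen only for illustration): the first version hoists a loop-invariant product and folds a constant by hand, transformations any optimizing compiler already performs, so the rewrite only obscures the code; the second stays clean and spends the programmer's effort on the kind of tuning the compiler typically will not do on its own, distributing the independent iterations across cores.

    /* "Hand-tuned": the invariant x*y is hoisted and 2*3.14 is folded by
       hand.  A compiler at -O2 does both automatically, so the only lasting
       effect of this rewrite is code that is harder to read and maintain. */
    void scale_tuned(double *a, const double *b, long n, double x, double y) {
        double xy = x * y;        /* hand loop-invariant removal */
        double c  = 6.28;         /* hand constant folding       */
        for (long i = 0; i < n; i++)
            a[i] = c * xy * b[i];
    }

    /* Clean version plus the one transformation the compiler usually will
       not make for you: running the independent iterations on multiple
       cores (OpenMP; the pragma is ignored if OpenMP is not enabled). */
    void scale_clean(double *a, const double *b, long n, double x, double y) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = 2.0 * 3.14 * x * y * b[i];
    }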

Finally, regardless of what we teach, for most students functionality is going to be the number one goal until we make performance a metric for grading.