Using Amdahl's Law for Energy Efficient Performance Estimation?

While trying to find an answer to my previous question, I stumbled across the paper "Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era" (Computer, Dec. 2008, pp. 24-31) by Dong Hyuk Woo and Hsien-Hsin S. Lee (Georgia Institute of Technology). The title had me thinking that this might be an investigation into finding a metric or upper bound on how energy efficient an application could be. It didn't quite turn out to be that simple, but the findings are interesting.

The authors look to evaluate which of three possible processor core architectures might be best for parallel execution that minimizes energy consumption. The three model core arrangements are 1) multi-core (several large processing cores on a single chip), 2) manycore (lots and lots of simpler, more power-efficient cores), and 3) a combination of a single large core with many simpler cores. The first is like the current dual-, quad-, and hexa-core processors, the second is akin to a GPU, and the third is a hybrid conglomeration of a large core sitting atop a GPU.

For the purposes of the model formulas, the maximum power consumption of a single large core is normalized to 1, and the power consumption of an idle core is an added variable, k. For the first architecture, the new variable is used in the traditional Amdahl's Law formula as a multiplier to the serial percentage of time multiplied by the (n-1) idle cores. With some simple algebraic manipulation, the authors generate a formula for estimating the average power consumption, in watts (W), of a parallel application with n cores and the stated percentages of parallel and serial work. A similar derivation is done for the manycore model, with the power consumption per simple core being 0.25 of the large core. With the hybrid model, the assumption used to derive a corresponding formula is that the large single core handles the serial execution and the simpler cores do the parallel work.

Since a measure of the watts consumed is provided by the model, the authors then compute performance per watt (Perf/W) by dividing the original Amdahl's Law speedup by the formula for W. In order to compare the three models to each other, a power budget is imposed, which sets the number of cores available for each model.
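To make the multi-core model concrete, here is a short sketch of how I read the derivation. The active-core power is normalized to 1 and idle cores draw k, so the exact code below is my reconstruction of the formulas described above, not a copy from the paper.

```python
# Multi-core (symmetric) model: f = parallel fraction, n = cores,
# k = idle-core power relative to one active core (active power = 1).

def perf(f, n):
    """Amdahl's Law speedup with n cores."""
    return 1.0 / ((1 - f) + f / n)

def watts(f, n, k):
    """Average power: total energy divided by parallel run time.
    Serial phase: 1 active core plus (n-1) idle cores drawing k each.
    Parallel phase: all n cores active at power 1."""
    energy = (1 - f) * (1 + (n - 1) * k) + (f / n) * n
    time = (1 - f) + f / n
    return energy / time

def perf_per_watt(f, n, k):
    return perf(f, n) / watts(f, n, k)

# Fully parallel (f = 1): Perf/W stays at 1 no matter how many cores,
# since every core is always busy. Any serial fraction pulls it below 1,
# because the idle power of the waiting cores is pure overhead.
print(perf_per_watt(1.0, 16, 0.3))   # 1.0
print(perf_per_watt(0.9, 16, 0.3))   # below 1.0
```

This also shows why the power budget matters for the comparison: W grows with n, so each model only gets as many cores as the budget allows.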

The conclusions drawn from comparing the models with various numbers of cores and fractions of parallel execution of the overall execution time are probably the most interesting part of the article. For example, the first result reported is that to achieve the highest Perf/W value in the multi-core model, the parallelization must scale linearly. If the application doesn't scale linearly, the processor (model) must dissipate more energy than the serial version, since the idle power of the extra cores scales linearly.

The ultimate result of the paper was that the hybrid model, one large core and many small cores, was the most power scalable. The manycore option does well with high amounts of parallelism and lower power budgets (fewer total cores), but as that budget increases, the number of simple cores increases and the effective serial execution performance does not. The hybrid model, with the single large core in place of several simpler cores, can more efficiently handle the serial portions of the execution (than one simple core out of the dozens sitting idle).
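A rough sketch of that comparison under a shared power budget, in the spirit of the paper's setup: the large-core power of 1 and the simple-core power of 0.25 come from the article, but the simple-core *performance* of 0.5 is my own illustrative assumption (a Pollack's-rule-style guess), so the specific numbers here will differ from the paper's.

```python
# Assumptions: large core = power 1, perf 1; simple core = power 0.25
# (from the paper) and perf 0.5 (my illustrative assumption).
P_SMALL, PERF_SMALL = 0.25, 0.5

def perf_multicore(f, budget):
    n = int(budget)                    # budget buys n large cores
    return 1 / ((1 - f) + f / n)

def perf_manycore(f, budget):
    m = int(budget / P_SMALL)          # 4x as many simple cores
    # serial AND parallel work both run on the slower simple cores
    return 1 / ((1 - f) / PERF_SMALL + f / (m * PERF_SMALL))

def perf_hybrid(f, budget):
    m = int((budget - 1) / P_SMALL)    # one large core, rest simple
    # serial on the large core, parallel spread over the simple cores
    return 1 / ((1 - f) + f / (m * PERF_SMALL))

for f in (0.5, 0.9, 0.99):
    print(f, perf_multicore(f, 8), perf_manycore(f, 8), perf_hybrid(f, 8))
```

Even with these made-up performance numbers, the shape of the result matches the paper's conclusion: the manycore model is hurt by its weak serial execution, while the hybrid keeps the serial portion fast without blowing the budget.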

As I was reading the paper, I could easily identify the first model with the standard multi-core processors available in abundance today. The manycore model could easily be a GPU or MIC accelerator by itself. The hybrid model suggested the combination of a manycore accelerator and a dual-core processor (in the absence of heterogeneous core chips). I wondered where vector hardware fits into the three models. Considering just the vector registers alone might suggest it would be an instance of the manycore model. However, these registers are part of a larger core, which makes me think of the hybrid model. Maybe they are a second level of parallel execution that isn't accounted for in the three models.

It's a good paper. But I have a couple of quibbles. First, the only way that parallel execution on the multi-core model comes out worse than the equivalent serial execution is if the sequential code is run on a single-core system. A later comment in the paper makes me think that this is the assumption, but it's not too clear. This assumption is not valid in the real world. For a true apples-to-apples comparison, the serial code needs to be run on a multi-core processor, too. If that were the case, I contend that the parallel execution consumes less energy.

For example, assume that we have an execution time of 10 time units (let's call them moops). On a quad-core processor running the serial code, we would have one core running full speed for 10 moops and the other three cores generating an aggregate 30 moops of idle consumption. If the algorithm is 50% parallel, we would have 5 moops of full power consumption in serial, 5 moops of full consumption in parallel across four cores, and 15 moops total of idle consumption. Even if the code is only 10% parallel, there would be just 27 moops of total idle consumption (9 moops of serial time with three cores idle, and no idle time during the parallel portion). Any level of (perfect) parallelism is going to prove to consume less energy than the serial equivalent on the same system. Am I missing something?

Note that I included '(perfect)' at the end of the previous paragraph. There will always be overhead in parallel computations, and this will expand the execution time of the parallel portions and, consequently, the full consumption time of the execution (e.g., the 50% parallel portion above might require 5.4 moops of full consumption).
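Ignoring that overhead for a moment, the moops bookkeeping above can be checked with a few lines. This assumes perfect scaling: the serial work runs on one core while the rest idle, and the parallel portion splits evenly across all cores with no idle time.

```python
def idle_core_moops(total=10.0, f=0.5, cores=4):
    """Core-moops spent idling for a run with parallel fraction f."""
    serial_time = (1 - f) * total      # one core busy, cores-1 idle
    return (cores - 1) * serial_time   # parallel phase: nobody idles

print(idle_core_moops(f=0.0))  # serial code on the quad-core: 30.0
print(idle_core_moops(f=0.5))  # 50% parallel: 15.0
print(idle_core_moops(f=0.1))  # 10% parallel: 27.0
```

Since the full-power consumption (10 moops of busy work) is the same either way under perfect scaling, the idle consumption is the whole difference, and it only shrinks as f grows.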

Second, Amdahl's Law is an estimate of speedup. Speedup is a dimensionless number. That is, I divide the execution time of the serial code by the time of the parallel execution to get a simple number, since the moops of the two quantities cancel each other out in that calculation. If I need 10 moops of serial time versus 6.35 moops of parallel time, I get a 1.57X speedup. 1.57 whats? ('X' is not a unit.) Speedup is a metric of relative performance, but it's not what I really think of when I think of performance.

To me, "performance" is more absolute. Typically this is some countable quantity like number of transactions, floating-point operations, or feet traveled. It can be associated with a time unit, too, like transactions per moop, floating-point operations per second, or furlongs per fortnight. Thus, the metrics of transactions per watt or flops per watt or feet per watt make sense to me. Improvements that raise the performance value or lower the watt value show a trend in the right direction for achieving better energy-efficient performance.

I'm still not able to quite wrap my head around the efficacy of speedup per watt (or even speedup per joule, which is also used in the Woo and Lee paper) as an absolute measure of energy-efficient performance. It may be that I'm reading too much into this and the metrics are simply used to compare the three architectural models described (within the assumptions given). Perhaps it is simply a model after all.

1 comment


Another problem with the paper is that the authors did not consider voltage scaling after parallelization, which would save tons of power for many-core architectures.