Core Challenge In Speeding Up Python, PHP, HHVM, Node.js...

When I mention interpreted languages and you wonder if this relates to interpretive dance, the answer is "no", and my advice is "move along now." Otherwise, read on.

At a tender young age, say about 20 years old, I got my first taste of interpreted languages. Like a lot of those concepts from school, they are still valid today, particularly since all of today's hot scripting languages (Python, PHP, HHVM, JavaScript) are interpreted.

A traditional compiler translates a high-level computer program into machine code for the CPU you want to run it on. An interpreted language is instead translated into machine code for some imaginary CPU. For historical reasons, this imaginary CPU is called a "virtual machine" and its instructions are called "byte code."
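
If you want to see that byte code with your own eyes, CPython will hand it right over. A minimal sketch (the exact bytes vary with the interpreter version; this output is from CPython 3.8):

code = compile("a = b + c", "<demo>", "exec")
print(code.co_code)   # b'e\x00e\x01\x17\x00Z\x02d\x00S\x00'

Each two-byte pair in that string is one instruction for Python's imaginary CPU.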

One advantage of this approach is development speed: creating this byte code for an imaginary CPU is usually much faster than creating machine code for a real CPU. It's also more portable: that byte code can in theory run on a wide variety of machines. So a developer can write high-level code and debug it rapidly, rather than waiting for a long recompile.

Unfortunately, it also slows down execution of the program. That's a problem if your million-user web site is implemented in PHP or HHVM. It's a problem if your cloud is implemented in OpenStack, which is mostly written in Python.

But really, why is it slower to execute an interpreted language? How can we quantify what makes it slow? And how do we approach making it faster?

What it boils down to is how long it takes to get anything done. Let's take a simple example:

a = b + c

The core of that addition operation turns into a BINARY_ADD byte code. This is really a machine opcode for the imaginary Python machine. My associate Steve Dohrmann [1] did a little measurement on his desktop, and the result was 78 real machine instructions to execute this one add operation.
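
You can watch this translation happen with the standard dis module. A quick sketch (output is from CPython 3.10; on 3.11 and later the opcode is spelled BINARY_OP instead):

import dis
dis.dis("a = b + c")
#   1    0 LOAD_NAME    0 (b)
#        2 LOAD_NAME    1 (c)
#        4 BINARY_ADD
#        6 STORE_NAME   2 (a)
#        8 LOAD_CONST   0 (None)
#       10 RETURN_VALUE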

If you were to write this little bit of code in a non-interpreted language like C and compile it, the ADD instruction is ... one instruction.

At the risk of sounding obvious here, 78 instructions take a lot more time to run than 1 instruction. A lot of this extra overhead is due to Python being a dynamically typed language. That means the BINARY_ADD operation needs to figure out what types it is dealing with before it can do an addition (or a string concatenation, for example). In any case, the overhead is just very high for an interpreted language.
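
To make the dynamic-typing point concrete, here is a small sketch (the add function is mine, purely for illustration):

def add(x, y):
    return x + y       # compiles to a single BINARY_ADD...

add(1, 2)              # ...which does integer addition here
add("spam", "eggs")    # ...and string concatenation here

The same byte code serves every type, so at run time BINARY_ADD must inspect its operands, pick the right handler, and build a result object, on every single execution. A compiled C add has all of that settled before the program ever runs.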

How to speed this up?

Here's where the performance expert will step forward with the answer: decrease the number of instructions needed to complete a unit of work, while maintaining or reducing the average number of cycles per instruction. In the jargon of the trade, this means decreasing pathlength while keeping CPI constant or pushing it down.
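
For the record, the expert is just rearranging the classic performance equation:

time per unit of work = pathlength x CPI x clock period

With the clock period fixed, pathlength and CPI are the only two levers you have.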

For anyone who is not a performance expert, that's about as useful as an espresso machine on a backpacking expedition. Useful, but a bit unwieldy.

Try this instead: suppose I invented a new CPU which had a BINARY_ADD instruction built in. Presto! My Python pathlength is now 1, just like the native C version. But if that new instruction took 80 clocks to execute, then my CPI would be much larger, offsetting all of the advantage of the decreased pathlength. Both numbers matter.
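
Running the numbers on that thought experiment (every figure here is hypothetical except the measured pathlength of 78, and I'm assuming the interpreter's instructions average about one cycle each):

# back-of-the-envelope comparison, hypothetical cycle counts
interpreter_cycles = 78 * 1    # pathlength 78, CPI ~1
fancy_cpu_cycles = 1 * 80      # pathlength 1, CPI 80
print(interpreter_cycles, fancy_cpu_cycles)   # 78 vs 80: no win at all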

So I need to figure out some way to decrease pathlength while keeping an eye on the CPI. It's like trying to lose weight by eating low-calorie food, then washing it down with a milkshake.

With this basis in mind, we can now start thinking about how to speed up the core of these interpreted languages.

[1] I am indebted to Steve for helping me gather data on the topic and validate my assumptions.
