On October 21, 2015, I was invited to give a technical talk at ZendCon, the largest gathering of the PHP community. A lot of excitement this year surrounds the release of PHP 7, which represents a massive performance improvement over previous versions.
Here are a few notes from my talk. The promise was to offer a way for attendees of the talk to get even more performance out of PHP 7.
The bottom line: here are some practical things you can do right now to speed up PHP on your site. Read on.
I started with some comments about my background. I've been an operating systems and compiler guy for my whole career in computer science. Back in 2001, I first began managing a group of server performance engineers. Back then the work was focused on Microsoft SQL Server and Oracle Database. After about five years of that work, I moved into the Open Source Technology Center at Intel for an eight-year stint leading operating systems projects. In 2015, I rejoined the datacenter software technology group to launch our effort on optimizing server languages like PHP.
Our work in PHP began several years ago. We're indebted to Moh Haghighat, an Intel Senior Principal Engineer, who had the vision and foresight to encourage us to invest in this optimization work.
After this brief introduction, I wanted to set the stage for software performance engineering by talking about the general approach we take.
I think many people just assume that they can get greater software performance if they upgrade to newer generations of processors. For example, our latest Xeon server processors (codenamed Broadwell) have a number of performance-enhancing features, such as a larger out-of-order scheduler, a larger L2 TLB and a faster floating point multiplier. And while all this is true, there are some proactive steps we can take in software to boost performance on existing hardware.
The most important starting point we take is what I have dubbed the Core Software Strategy. Rather than speeding up a particular application, we pick a representative workload or benchmark and then speed up the most essential core software building block on which it is based. That way, any other customer workload which resembles our benchmark should speed up as well. We have used this strategy for years, optimizing gcc and Java. For PHP, the representative workloads are WordPress, Drupal and MediaWiki.
This decades-long approach to analyzing core software has driven us to create some of the most innovative suites of analysis tools. We use these to turn a more architectural lens on core software such as PHP. One of the first things we discovered about PHP running WordPress and the like is how many cycles are spent in the processor's "front-end".
At this point, we need a brief reminder of your first semester computer architecture course. In order to speed up performance, processors are designed with "pipelining" in mind. If this works as designed, part of the processor can be busy fetching and decoding instructions while another part of the processor core can be executing instructions which were fetched earlier or committing results of earlier operations to memory.
If it's working properly, this pipeline architecture can actually get multiple instructions executed per clock cycle. But our analysis shows that real customer workloads like WordPress actually spend 40 to 50% of cycles stalled in the processor front-end, the part which fetches and decodes instructions. This is a very high percentage; for optimal workloads, we should see front-end stall cycles down in the teens.
So what are the causes of a large percentage of cycles stalled in the front-end? This is a common problem with interpreted languages like PHP, and often it is because PHP simply has a very large code footprint. Here we talk about the performance concept of pathlength: the number of instructions it takes to process some kind of operation, like a WordPress page fetch. If you can reduce the pathlength while keeping cycles per instruction (CPI) constant or reducing it, then you will get a throughput improvement.
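To make the pathlength argument concrete, here is a minimal sketch of the arithmetic, assuming a single core and a simple model in which the time per request is pathlength times CPI divided by clock frequency. The function name and all of the numbers are hypothetical and chosen only to make the calculation easy to follow; they are not measurements from our lab.

```python
def requests_per_second(pathlength, cpi, clock_hz):
    """Requests per second one core can sustain under a simple model.

    pathlength: instructions executed per request
    cpi:        average cycles per instruction
    clock_hz:   core clock frequency in cycles per second
    """
    cycles_per_request = pathlength * cpi
    return clock_hz / cycles_per_request


# Hypothetical numbers, for illustration only.
baseline = requests_per_second(pathlength=100_000_000, cpi=1.0, clock_hz=2_500_000_000)
# Same CPI and clock, but 10% fewer instructions per request.
shorter = requests_per_second(pathlength=90_000_000, cpi=1.0, clock_hz=2_500_000_000)

print(f"baseline:           {baseline:.2f} req/s")
print(f"shorter pathlength: {shorter:.2f} req/s")
print(f"speedup:            {shorter / baseline:.3f}x")
```

In this model, throughput scales inversely with the product of pathlength and CPI, which is why a shorter pathlength only helps if CPI does not get worse at the same time.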
If you have a workload with a large code footprint, some culprits to look for are a high miss rate in the ITLB (the virtual memory Translation Lookaside Buffer dedicated to instructions), a high miss rate for the instruction cache or a high level of branch mispredicts, all of which we can measure. The bottom line is that front-end issues add latency: cycles spent waiting for instructions to be loaded and decoded.
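As a rough illustration of how such measurements can be compared, the sketch below normalizes raw event counts into a front-end stall fraction and misses per thousand instructions. The counter names and values are hypothetical placeholders, not real hardware event names or real data; substitute whatever your profiling tool reports.

```python
def misses_per_kilo_instruction(misses, instructions):
    """Normalize an event count to events per 1000 retired instructions."""
    return 1000.0 * misses / instructions


def frontend_stall_fraction(frontend_stall_cycles, total_cycles):
    """Fraction of all cycles in which the front-end was stalled."""
    return frontend_stall_cycles / total_cycles


# Hypothetical counter readings from one run of a workload.
sample = {
    "cycles": 10_000_000_000,
    "instructions": 8_000_000_000,
    "frontend_stall_cycles": 4_500_000_000,
    "itlb_misses": 16_000_000,
    "icache_misses": 240_000_000,
    "branch_mispredicts": 64_000_000,
}

stall = frontend_stall_fraction(sample["frontend_stall_cycles"], sample["cycles"])
print(f"front-end stall fraction: {stall:.0%}")

for event in ("itlb_misses", "icache_misses", "branch_mispredicts"):
    mpki = misses_per_kilo_instruction(sample[event], sample["instructions"])
    print(f"{event}: {mpki:.1f} per 1000 instructions")
```

Normalizing by instruction count makes runs of different lengths comparable, so a change in one of these rates can be tracked alongside throughput.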
So what can be done about it?
Before I get into solutions, we should take a moment and talk about discipline. I find that we can make progress with performance if we can set up our lab to measure things in a scientific manner. We want to have a workload which we can run repeatedly and get nearly the same results every time we run it. Besides repeatability, we want the workload to be representative of the kinds of work we are doing. And it's usually good to have something which will run for at least 10 minutes or so, rather than complete in less than a second.
Once we have a stable and repeatable workload, it's important that we only change one thing at a time and measure throughput after making the change. Otherwise, we might not know which change caused performance to improve or regress. Finally, it's critical to make sure we're not bottlenecked on some system resource or tuning parameter.
Examples of the latter are tuning parameters in the filesystem or the operating system itself. I generally like to see CPU utilization greater than 90%. This means we're not hung up on inadequate I/O or something similar.
In our group, we set up what we call our 0-day lab. This is a facility for downloading the sources of our open source PHP project every night, building it and measuring its performance. I've noticed that the performance of PHP can swing several percentage points every day. It's really nice to see where we're at every day, so that if I make a 1% performance improvement it doesn't get swamped by a 10% performance regression caused by someone else's patch.
We have a nice little management dashboard we use for monitoring the nightly 0-day lab results. We also mail the nightly results to the php-internals mailing list, so that the entire community can benefit. In the nightly mail, we show the performance change from the previous day's run and do a comparison with an earlier release or milestone, so that you get a cumulative idea of how performance is going. With PHP, we have our big three workloads of WordPress, Drupal and MediaWiki, but we also have a number of micro-benchmarks. Finally, we report on the relative standard deviation of our performance runs. This shows how repeatable the workload is on our setup. For example, if the nightly report shows a 0.25% performance regression, but the relative standard deviation is 0.6%, I won't get too excited about it, since the result is in the noise level.
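The noise check described above can be sketched in a few lines. This is only an illustration: the function names and throughput samples are hypothetical, and treating any change no larger than one relative standard deviation as noise is a simplifying assumption, not necessarily the rule our nightly report applies.

```python
import statistics


def relative_std_dev_pct(samples):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


def percent_change(baseline, current):
    """Signed change from baseline to current, in percent."""
    return 100.0 * (current - baseline) / baseline


def in_the_noise(change_pct, rsd_pct):
    """Assumption: a change no larger than one RSD is treated as noise."""
    return abs(change_pct) <= rsd_pct


# Hypothetical throughput samples (requests per second) from repeated runs.
runs = [1000.0, 1006.0, 994.0, 1003.0, 997.0]
rsd = relative_std_dev_pct(runs)
print(f"relative standard deviation: {rsd:.2f}%")

# The example from the text: a 0.25% regression against a 0.6% RSD.
print("0.25% regression in the noise?", in_the_noise(-0.25, 0.6))
# A 10% regression against the same RSD clearly is not.
print("10% regression in the noise?", in_the_noise(-10.0, 0.6))
```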
(If you want to see these nightly emails, subscribe to the php-internals mailing list. Soon we might have the dashboard view available as well.)
Now that we have a nice repeatable lab setup, we're ready to make some optimization changes! How can we make some things run faster?
First of all, let's see if we can get the compiler to generate better code.
As gcc compiles the PHP interpreter, it employs a number of heuristics to suss out how the code is going to run. This enables the compiler to generate code so that the processors will likely predict the correct branches to take. Unfortunately, these heuristics can only go so far. What the compiler really needs is an example of the PHP interpreter running, so that it is more certain of which branches will actually be taken.
To provide this example to the compiler, we employ a technique called "Profile Guided Optimization" or PGO. Using PGO, we can train the compiler by creating an instrumented version of PHP, and letting it run for a while, generating a profile. We then take this profile and recompile the PHP interpreter. The resulting compiled binary of PHP will be tuned for how it is actually used in practice.
In my talk I actually showed a demo of this: a stock PHP 7 build running WordPress compared against a PGO-trained PHP 7 build running the same WordPress workload. The demo showed the relative performance delta on an instantaneous basis. The performance benefit was around 7%. For a real customer workload and for no code changes, this is actually an excellent improvement. (This was a great demo, thanks to Bogdan Andone for creating it!)
You can take advantage of this as well. You should be able to get a nice boost in throughput for your PHP application if you download the PHP 7 sources, use "make prof-gen", use the resulting PHP to run your application and then recompile it using "make prof-use". The resulting PHP binary from this second compilation should be faster. If you have a repeatable and stable version of your PHP application to run with this as the only change, it should show an improvement.
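The sequence above can be written down as a small driver script. This is a sketch under stated assumptions: only the two make targets come from the text, while the `php-src` directory, the `./run_training_workload.sh` script and the helper names are hypothetical stand-ins for your own setup. It assumes the source tree has already been configured, and depending on your tree you may need to clear stale object files between the two builds. The script defaults to a dry run that only prints the commands.

```python
import subprocess


def pgo_build_plan(training_cmd):
    """Return the ordered commands for a PGO build of the PHP interpreter."""
    return [
        ["make", "prof-gen"],  # build the instrumented interpreter
        list(training_cmd),    # run your workload so a profile is generated
        ["make", "prof-use"],  # recompile using the collected profile
    ]


def run_plan(plan, source_dir, dry_run=True):
    """Print each command; execute it in source_dir unless dry_run is set."""
    for cmd in plan:
        print("$ " + " ".join(cmd))
        if not dry_run:
            # check=True stops the sequence as soon as one step fails.
            subprocess.run(cmd, cwd=source_dir, check=True)


# Hypothetical training script that drives your application against
# the instrumented PHP build for a while.
plan = pgo_build_plan(["./run_training_workload.sh"])
run_plan(plan, source_dir="php-src", dry_run=True)
```

Keeping the training step as a parameter makes it easy to rerun the same build against a different workload and compare the resulting binaries one change at a time.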
In the future, I would really like to have some kind of "golden profile" or super representative training workload which would allow us to include PGO training as a default part of the PHP build. We're not there yet in PHP 7, but you can custom-build your own optimized PHP 7 binary in the meantime.
The other key learning I passed on at the conference was to use a server processor like the Intel Xeon processor to run PHP. I gave the example of a patch created by one of our engineers, a patch to the fast_memcpy() function within the PHP interpreter. This patch enhanced the function with some of Intel's specialized SSE2 instructions.
Our engineer was only able to see a 1% performance improvement when running the patched PHP on a desktop Haswell processor. To our delight, when it was deployed on a Xeon server version of Haswell, the same PHP binary showed a 2.9% boost on WordPress and 5.9% improvement on MediaWiki.
So the other actionable step you can take is to run PHP on latest-generation Xeon processors for the best improvement.
So in summary, here are some things you can do to amp up your PHP-based website:
- Switch to PHP 7. The boost in performance is so large that it's well worth making the change.
- Set up a disciplined lab which will let you compare PHP performance changes.
- Download the PHP sources and train PHP with PGO using your lab workload.
- If you want to develop more on the PHP interpreter itself, join the community and keep your eyes on our 0-day results.