Using Linux Top to troubleshoot multi-core scalability issues at DreamWorks Animation

Imagine you are working in an animated movie production environment where multiple applications run concurrently to solve a problem, and each application uses fork-join process parallelism during its run. You are asked why the overall run does not scale well with the number of cores on the system.

A good first step is to understand the system's behavior. You want to see the CPU utilization of the cores, how much I/O is going on, etc. If the applications run for hours, you also want to log the results to a file for later analysis. Linux provides many tools to help in this regard, such as vmstat, mpstat, and sar.

But only a few tools let you track CPU utilization by process name. Top is very useful in this regard. It has a handy -b option to run in batch mode, so its output can be redirected to a file. It is easy to write a script to parse the file. It may also be useful to use gnuplot to draw graphs of the CPU utilization of the processes, and to use evince or gv to display the PostScript plots.
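As an illustration of such a parsing script, here is a minimal Python sketch. It is not the script used for the graphs in this post. The function name `parse_top_batch` is made up for this example, and the sketch assumes the usual batch layout: each update begins with a summary line starting with "top - ", followed by a column header row that contains PID and COMMAND, followed by one row per process (or per thread), with COMMAND as the last column.

```python
def parse_top_batch(text):
    """Split `top -b` output into snapshots (one per update).

    Each snapshot is a list of dicts, one per process row, keyed by the
    column names found in that snapshot's header row.
    """
    snapshots = []
    columns = None  # column names of the snapshot currently being read
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("top - "):
            # Summary block of a new update: stop reading process rows.
            columns = None
        elif "PID" in fields and "COMMAND" in fields:
            # Header row: remember the column names, open a new snapshot.
            columns = fields
            snapshots.append([])
        elif columns and fields:
            # Process row: split into at most len(columns) pieces so that a
            # command containing spaces stays together in the last column.
            values = line.strip().split(None, len(columns) - 1)
            if len(values) == len(columns):
                snapshots[-1].append(dict(zip(columns, values)))
    return snapshots
```

Looking columns up by header name, rather than by position, means the same sketch keeps working if fields are added or reordered in the top configuration.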

So you may think you are home free, but the not-so-obvious part is that some of top’s interesting options cannot be set in batch mode. For example, the PPID (Parent Process ID) column is not displayed by default when you run top, and it can only be turned on in an interactive run of top.

~/.toprc file to the rescue! This file in your home directory can be created to store your preferred options for all future top runs (batch or interactive).

Suppose you want the batch top run to create updates every 1 second, show threads, ignore idle processes, and show PPID. Here is how you create this particular ~/.toprc file. In a command window:

top (start top in command window; by default top updates every 3 seconds)
s (hit s)
1 (hit 1; set 1 second delay between updates)
H (hit H; show threads)
i (hit i; ignore idle processes)
f (hit f; bring up selections to choose additional fields to display)
b (hit b; chooses PPID field)
o (hit o; bring up selection to change order of fields; e.g. keep PID and PPID together)
W (hit W; save preference to ~/.toprc file)


Now, you can run top in batch mode:

top -b -n 800000 > results.file (-b means run in batch mode; -n 800000 means top outputs 800000 updates)


Here is the format of what is captured in the “results.file” every 1 second and you can see the parent/children relationship among the “rend” processes:


Let us now see how data mining the top generated file in different ways helped us to spot multi-core scalability issues at DreamWorks Animation.

Here is one graph generated from the top output when a production shot (frame) was rendered in parallel on a 4-core system (Graph 1). Clearly, the 4-core system is underutilized till about 3500 seconds (100% utilization means all 4 cores are fully used). Note: green represents user time and red represents system time in Graph 1.
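The post does not say exactly how the data behind Graph 1 was extracted. One plausible source is the "Cpu(s):" summary line that top prints at the start of every update, which reports user and system time averaged over all cores. The sketch below pulls those two numbers out of the saved file contents. The function name `cpu_summary_series` is hypothetical, and the regular expression assumes one of two common layouts ("25.0%us" in older top versions, "25.0 us" in newer ones).

```python
import re

# Matches "25.0%us" as well as "25.0 us"; only user and system time are kept.
CPU_FIELD = re.compile(r"([\d.]+)\s*%?\s*(us|sy)\b")

def cpu_summary_series(text):
    """Return one (user_pct, system_pct) pair per update, read from the
    'Cpu(s):' summary line of top's batch output."""
    series = []
    for line in text.splitlines():
        # Newer top versions print "%Cpu(s):", older ones "Cpu(s):".
        if line.lstrip("%").startswith("Cpu(s):"):
            found = {name: float(value)
                     for value, name in CPU_FIELD.findall(line)}
            if "us" in found and "sy" in found:
                series.append((found["us"], found["sy"]))
    return series
```

With a 1 second delay between updates, the position of each pair in the returned list is roughly the elapsed time in seconds, which gives the x-axis for a plot of this kind.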


Graph 1: Overall CPU utilization on a 4-core system during render of a shot

OK, but what processes were running? Data mining the top-generated file again (for simplicity I am only showing processes relevant to our discussion), we see that two main processes/programs were running, rend1 and rend2 (Graph 2). Rendering pipeline stage 1, represented by program rend1, was not running in parallel at all in the 500 second to ~2800 second range (approximately 20% overall CPU utilization on the 4-core system, which is less than one core's worth of work), which surprised us. Rendering pipeline stage 2, represented by program rend2, ran fully in parallel (100% overall CPU utilization across the 4 cores).
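A per-program view like this can be derived by adding up the %CPU column for all rows that share a command name. The sketch below is an assumption about how one might do it, not the script behind Graph 2. It takes one update from the top file as a list of dicts keyed by column name, and it assumes top's default scaling, where 100 in the %CPU column means one fully busy core. The function name and the sample values in the usage note are made up.

```python
from collections import defaultdict

def utilization_by_command(snapshot, num_cores):
    """Sum %CPU per command name for one top update and express it as a
    share of the whole machine (100 means all cores are busy)."""
    totals = defaultdict(float)
    for row in snapshot:
        totals[row["COMMAND"]] += float(row["%CPU"])
    # Dividing by the core count rescales "100 per core" to "100 per machine".
    return {name: total / num_cores for name, total in totals.items()}
```

For example, a single rend1 row at 80 %CPU on a 4-core machine comes out as 20, while four rend2 rows at 100 %CPU each come out as 100.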


Graph 2: CPU utilization of processes executing rendering pipeline stage 1 (rend1) and stage 2 (rend2)

Let’s drill down into render pipeline stage 2 first (3500-4300 sec). In Graph 3 below, you can see that two rend2 processes (PPID 30739 and 30570) forked 4 children each, and all 4 cores were fully utilized.
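This is where the PPID column pays off: grouping rows by PPID recovers the fork-join structure from a flat process list. The sketch below shows one way to do that grouping for a single update, again represented as a list of dicts keyed by column name. The function name and the child PIDs in the usage note are hypothetical; only the two PPIDs come from the graph above.

```python
from collections import defaultdict

def children_by_parent(snapshot, command):
    """Map each PPID to the (PID, %CPU) pairs of its children that run the
    given command, for one top update."""
    groups = defaultdict(list)
    for row in snapshot:
        if row["COMMAND"] == command:
            groups[row["PPID"]].append((row["PID"], float(row["%CPU"])))
    return dict(groups)
```

Calling `children_by_parent(snapshot, "rend2")` on an update from the stage 2 time range would be expected to return two keys, "30739" and "30570", each with a list of child entries. Note that when top is showing threads (the H option), each row is a thread rather than a process, so the grouping should be read with that in mind.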


Graph 3: CPU utilization of child processes of rendering pipeline stage 2

What happened in the render pipeline stage 1? Let us focus in on the 1000 – 2500 second range. In Graph 4 below we see some interesting behaviors. Only three rend1 processes are running (we expected four, one per core on the 4-core system). In addition, only one of the rend1 processes seems to utilize a single core fully. The other two rend1 processes are barely using the cores.


Graph 4: CPU utilization of child processes of rendering pipeline stage 1 (1000-2500 sec)

This analysis of the CPU utilization of parent and child processes using top helped us zero in on areas to investigate in DreamWorks Animation’s render applications, and to make further improvements in scalability. If there is interest, I may write about using gnuplot to generate graphs such as the ones shown above.