You are invited to try Intel® VTune™ Amplifier - Platform Profiler feature. Free download.
Intel® VTune™ Amplifier - Platform Profiler is a tool that helps users to identify how well an application uses the underlying architecture and how users can optimize hardware configuration of their system. It displays high-level system configuration such as processor, memory, storage layout, PCIe* and network interfaces (see Figure 1), as well as performance metrics observed on the system such as CPU and memory utilization, CPU frequency, cycles per instruction (CPI), memory and disk input/output (I/O) throughput, power consumption, cache miss rate per instruction, and so on. Performance metrics collected by the tool can be used for deeper analysis and optimization.
There are two primary audiences for Platform Profiler:
Figure 1. High-level system configuration view from Platform Profiler
The main difference between Platform Profiler and other VTune Amplifier analyses is that Platform Profiler helps profile a platform for longer periods of time incurring very little performance overhead and generating a small amount of data. The current version of Platform Profiler can run up to 13 hours and generate less data than VTune Amplifier would do in 13 hours. So, one can simply start Platform Profiler running, and keep it running for 13 hours, meanwhile utilizing the system as he or she prefers, then stopping Platform Profiler profiling. Platform Profiler will collect all the profiled data and display the system utilization diagrams. VTune Amplifier, on the other hand, cannot run for such a long period of time, since it generates profiled data of gigabytes of magnitude in a matter of minutes. Thus, VTune Amplifier is more appropriate for fine tuning or for analyzing an application rather than a system. How well is my application using the machine or perhaps How well is the machine being used are key questions that Platform Profiler can answer. But, How do I fix my application to use the machine better is the question that VTune Amplifier answers.
Platform Profiler is composed of two main components: a data collector and a server.
The steps required for tool installation and usage are described in Platform Profiler User Guide.
In the rest of the article, I demonstrate how to navigate and analyze the result data it collects. I use a movie recommendation system application as an example in this article. The movie recommendation code is obtained from the Spark* Training GitHub* website. The underlying platform is a two-socket Haswell server (Intel® Xeon® CPU E5-2699 v3) with Intel® Hyper-Threading Technology enabled, 72 logical cores, 64 GB of memory, running an Ubuntu* 14.04 operating system.
The code is run in Spark on a single node as follows:
spark-submit --driver-memory 2g --class MovieLensALS --master local movielens-als_2.10-0.1.jar movies movies/test.dat
With the command line above, Spark runs in local mode with four threads specified with the --master local option. In local mode there is only one driver, which acts as an executor, and the executor spawns the threads to execute tasks. There are two arguments that can be changed before launching the application which are driver memory (--driver-memory 2g) and number of threads to run with (local). My goal is to see how much I stress my system by changing these arguments, and to find out if I can identify any interesting pattern happening during the execution using Platform Profiler's profiled data.
Here are the four test cases that were run and their corresponding run times:
spark-submit --driver-memory 2g --class MovieLensALS --master local movielens-als_2.10-0.1.jar movies movies/test.dat (16 minutes 11 seconds)
spark-submit --driver-memory 2g --class MovieLensALS --master local movielens-als_2.10-0.1.jar movies movies/test.dat (11 minutes 35 seconds)
spark-submit --driver-memory 8g --class MovieLensALS --master local movielens-als_2.10-0.1.jar movies movies/test.dat (7 minutes 40 seconds)
spark-submit --driver-memory 16g --class MovieLensALS --master local movielens-als_2.10-0.1.jar movies movies/test.dat (8 minutes 14 seconds)
Figures 2 and 3 show observed CPU metrics during the first and second tests, respectively. Figure 2 shows that the CPU is underutilized and the user can add more work, if the rest of the system is similarly underutilized. The CPU frequency slows down often, supporting the thesis that the CPU will not be the limiter of performance. Figure 3 shows that the test utilizes the CPU more due to an increase in number of threads, but there is still significant headroom. Moreover, it is interesting to see that by increasing the number of threads we also decreased the CPI rate, as shown in the CPI chart of Figure 3.
Figure 2. Overview of CPU usage in Test 1.
Figure 3. Overview of CPU usage in Test 2.
Figure 4. Memory read/write throughput on Socket 0 for Test 1.
The increase in number of threads also increased the number of memory accesses, when you compare Figures 4 and 5. This is an expected behavior and is verified by the data collected by Platform Profiler.
Figures 6 and 7 show L1 and L2 miss rates per instruction for Tests 1 and 2, respectively. Increasing the number of threads in Test 2 drastically decreased L1 and L2 miss rates, as depicted in Figure 7. We found out that the application incurs less CPI rate and less L1 and L2 miss rate when you run the code with more threads, which means that once data is loaded from the memory to caches a fairly good amount of data reuse happens, which benefits the overall performance.
Figure 6. L1 and L2 miss rate per instruction for Test 1.
Figure 7. L1 and L2 miss rate per instruction for Test 2.
Figure 8 shows the memory usage chart for Test 3. Similar memory usage patterns are observed for all other tests as well; that is, used memory is between 15-25 percent, whereas cached memory is between 45-60 percent. Spark caches its intermediate results in memory for later processing, hence we see high utilization of cached memory.
Figure 8. Memory utilization overview for Test 3.
Finally, Figures 9-12 show the disk utilization overview for all four test runs. As the amount of work has increased across the four test runs, the data shows that a faster disk improves the performance of the tests. The number of bytes transacted is not a lot, but the I/O operations (iops) are spending significant time waiting for completion. This can be seen by the Queue Depth chart. If the user is unable to change the disk then adding more threads would help to tolerate the disk access latency.
Figure 9. Disk utilization overview for Test 1.
Figure 10. Disk utilization overview for Test 2.
Figure 11. Disk utilization overview for Test 3.
Figure 12. Disk utilization overview for Test 4.
Using Platform Profiler, I was able to understand the execution behavior of the movie recommendation workload and observe how certain performance metrics change across a different number of threads and driver memory settings. Moreover, I was surprised to find out that a lot of disk write operations happen during the execution of the workload, since Spark applications are designed to run in memory. In order to investigate the code further, I will proceed with running VTune Amplifier's Disk I/O analysis to find out the details behind the disk I/O performance.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804