(This work was done by Vivek Lingegowda during his internship at Intel.)
When parallel applications have to initialize their data from file as their input or write result data to file as their output, when not optimized, the data read/write can become a limiting factor for the multi-threaded application’s performance in a multi-core environment. Such file I/O bottlenecks can be easily overcome in data parallel applications. Data parallel applications are those where data can be distributed among threads for concurrent processing without the need for sharing or communication between them.
In this blog we will consider the example of Blackscholes – a data-parallel application that is part of the PARSEC benchmark suite. We will show how optimizing data read/write can significantly improve the scalability of the multi-threaded application.
Blackscholes prices a portfolio of options with the Black-Scholes partial differential equation. The application input is a file with rows of options data with each row containing details for one option. The input dataset we used consists of 2.5 million options. During the initialization phase, the application uses a single thread to read the options data from file line by line and initialize the data structures. It then uses multiple threads in parallel to compute the prices for all the options, and then finally writes the result prices one per option to an output file line by line in a single thread serially. The graph below shows the performance scaling of the application for varying number of threads on a system with 80 logical cores [Performance Ratio: Application run time with 1 thread/Application run time with x threads]
Figure 1: Blackscholes Pre-Optimization Performance Scaling
From the above graph it can be seen that there is no performance gained by increasing the number of threads above 15.
The below MPSTAT chart shows the CPU loads at different instants of time over the course of the application running time when run with 80 threads.
Figure 2: MPSTAT data showing the CPU load at different instants of time
Here we see that when the application is run with 80 threads, 65% of the time is spent on reading and initializing options data from file serially and another 25% of the time is spent on writing the results to file serially. The performance scaling tops out at 15 threads because the processing time of the serial parts of the application begins to outweigh the processing time spent in the parallel parts. This is a good example of Amdahl’s law – the serial part of the program imposes limitations on the overall performance gains of the applications. Adding more threads/CPU cores is not going to help unless we optimize the serial part of the application.
We will now look into one of the approaches taken to optimize the application data read/write for better performance scaling.
Optimizing Data initialization:
During the application initialization phase, rather than reading options data from file line by line, the complete file data is read in bulk to an in-memory buffer and pointers to each option data are created through tokenizing the buffer in a single thread. Subsequently options data reading and initialization to data structures is parallelized by distributing the options data pointers among threads. Now multiple threads concurrently read the options data and also process them to compute prices.
Optimizing data writing:
After multiple threads concurrently process the options data, instead of a single thread writing the results line by line to an output file, multiple threads concurrently write the results to their part of a common output buffer. When all parallel threads are done with writing the results to the buffer, the complete buffer is written in bulk to an output file in a single thread.
The below graph compares the performance of the BlackScholes application after optimization.
Figure 3: Blackscholes Performance scaling after Optimization
With the above approach of data read/write optimization we are able see 5x speedup in the application, thus by optimizing the serial part we can improve the overall performance of a parallel application.