Performance tuning of an existing application is a real challenge, and it depends on many factors: the nature of the algorithm the application implements, whether the implementation is scalable enough to take advantage of thread and data parallelism, and so on. The most logical approach for a developer tuning an application is to profile it dynamically under different workloads, analyze the hotspots, and then fine-tune those hotspots to work best on a given hardware architecture.
This tutorial introduces the basic hardware and software architecture of the Intel® Xeon Phi™ coprocessor, describes its general features, and provides a first look at the various programming models that support high-performance computing on hosts equipped with the coprocessor, including offloading selected code to the coprocessor, Virtual Shared Memory, and parallel programming with OpenMP*, Intel® Cilk™ Plus, and Intel® Threading Building Blocks.
Ambient occlusion is an algorithm that approximates the shadowing of ambient light on non-reflective surfaces. Since calculating true light transport is prohibitively expensive on today's hardware, algorithms like ambient occlusion are used to get convincingly close. Ambient occlusion casts a ray from the origin through each pixel on the screen and finds its intersections with objects in the scene. If there is an intersection (a "hit"), it searches for intersections again, this time using the "hit" as the origin. The more intersections it finds, the darker the pixel is shaded to imitate shadows, which is the goal of ambient occlusion.

Intel® Cilk™ Plus cilk_for is used to render multiple horizontal lines in parallel, while Intel Cilk Plus Array Notation is used to speed up the search for intersections from the "hit" origin. In the scalar implementation, the auto-vectorizer does a somewhat poor job of vectorizing the ambient occlusion calculation (the intersections from the "hit" origin), which can be fixed by adding a single Intel Cilk Plus SIMD Notation line.
The following samples demonstrate Intel® Cilk™ Plus implementations of popular classic algorithms and their performance benefits. Select a sample name for more detailed information.
Monte Carlo algorithms solve deterministic problems by using a probabilistic analogue. They require repeated random simulations, which lend themselves well to parallel processing and vectorization. The simulations in this example are run serially, with Intel® Cilk™ Plus Array Notation (AN) for vectorization, with Intel Cilk Plus cilk_for for parallelization, and with vectorization and cilk_for combined. In this example, the Monte Carlo algorithm is used to estimate the valuation of a European swaption, which is fundamentally the difference between the strike price and the estimated future value, or forward swap rate. The algorithm estimates the valuation by applying the initial conditions to a normal distribution over many simulations and averaging the results.
This section contains the following utility classes:
- Timer Utility: Utility class written in C++ that can be used to measure performance. Supported platforms: Windows*, Linux*, and OS X*.
This webinar focuses on the Intel System Studio components that can be used to implement signal processing workloads. The webinar complements the previous “tools of the embedded trade” webinars on Intel® VTune™ Amplifier, Intel® Inspector, Intel® JTAG Debugger, and other process-oriented features (e.g. cross-compilation). We provide a case study to familiarize you with the signal processing functionality of Intel System Studio and cover the following components: