Tuning the performance of an existing application is a challenge. How much can be gained depends on many factors, such as the nature of the algorithm and whether the implementation can scale to exploit thread and data parallelism. The usual approach is to profile the application dynamically under different workloads, identify its hotspots, and then tune those hotspots for the target hardware architecture. A hotspot is typically a function or loop that carries a heavy computational load. Intel provides a dynamic profiling tool, Intel® VTune™ Amplifier XE, for this purpose. Once the hotspots are identified, the next step is to analyze the corresponding algorithm for unexploited thread or data parallelism. It is also good practice to write code that scales across all available cores (thread parallelism) and uses the SIMD (Single Instruction, Multiple Data) registers in each core (data parallelism). This paper recommends a step-by-step approach to enabling both task parallelism and data parallelism in an application using Intel® Cilk™ Plus. It also explains each explicit vectorization extension in detail, with examples that show how and when to use it.
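To make the two levels of parallelism concrete, here is a minimal sketch of a hotspot loop annotated for both. It is not taken from the paper and does not use Intel® Cilk™ Plus syntax. It uses the OpenMP `parallel for simd` directive in standard C++ as a stand-in, and the `saxpy` kernel and its parameter names are hypothetical. If the compiler is not run with OpenMP enabled, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hotspot (an assumption for illustration, not from the paper):
// computes y[i] = a * x[i] + y[i] for every index both vectors share.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t count = x.size() < y.size() ? x.size() : y.size();
    const long n = static_cast<long>(count);
    const float* xp = x.data();
    float* yp = y.data();

    // "parallel for" asks the runtime to split iterations across threads
    // (thread parallelism); "simd" asks the compiler to vectorize each
    // thread's share of iterations (data parallelism). Iterations are
    // independent, so both transformations preserve the result.
    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The loop qualifies for both forms of parallelism because no iteration reads a value written by another. Whether the annotations actually pay off depends on the workload size and hardware, which is why profiling comes first.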