Optimizing Image Resizing Example of Intel® Integrated Performance Primitives (Intel® IPP) With Intel® Threading Building Blocks and Intel® VTune™ Amplifier

For Intel® System Studio 2015, find the corresponding article here -> click

For Intel® System Studio 2016, find the corresponding article here -> click

 

< Overview >

 In this article, we enable and use Intel® Integrated Performance Primitives (IPP), Intel® Threading Building Blocks (TBB), and Intel® VTune™ Amplifier on Linux. We build and run one of the examples that ships with IPP, then apply TBB and VTune to it to observe the performance improvement gained from these Intel® System Studio (ISS) features.

 Test environments are listed below. 

  •  Linux, Ubuntu 18.04 LTS 64bit
  •  Intel® System Studio 2019 Update 1
  •  Intel® Integrated Performance Primitives 2019 Update 1
  •  Intel® Threading Building Blocks 2019 Update 2
  •  Intel® VTune™ Amplifier 2019 Update 1

 This example was tested on a dual-core Intel® Core™ i5 platform with Hyper-Threading enabled (4 logical cores).

 

< Building the IPP example with TBB libraries >

 

STEP 1. Set up the environment variables for IPP and TBB

  We need to set up the environment variables so that IPP and TBB work properly. Run the following command to set them. You must pass the correct target architecture and platform: for <arch type>, use 'ia32' for an IA-32 target or 'intel64' for an Intel® 64 target; for <platform type>, use 'linux' for a Linux target, 'android' for an Android target, or 'mac' for a Mac target.

$ source /opt/intel/system_studio_2019/compilers_and_libraries/linux/bin/compilervars.sh -arch <arch type> -platform <platform type>

  To verify that the command ran correctly, type 'printenv' and check that 'IPPROOT' and 'TBBROOT' are listed and point to the IPP and TBB install directories, and that 'PATH' includes '/opt/intel/system_studio_2019/compilers_and_libraries/linux/bin/<arch type>'. For future use, it is recommended to write a bash script that enables multiple ISS features at once.

$ printenv | grep IPPROOT

$ printenv | grep TBBROOT

$ printenv | grep PATH
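
The bash script recommended above could look like the following minimal sketch. The file name iss_env.sh, the variables ISS_VARS, ISS_ARCH and ISS_PLATFORM, and the helper check_var are hypothetical names chosen for illustration; the compilervars.sh path is the default install location used in this article.

```shell
#!/usr/bin/env bash
# iss_env.sh (hypothetical name): set up and verify the ISS environment.
# ISS_VARS, ISS_ARCH, ISS_PLATFORM and check_var are illustrative names.

ISS_VARS="${ISS_VARS:-/opt/intel/system_studio_2019/compilers_and_libraries/linux/bin/compilervars.sh}"
ISS_ARCH="${ISS_ARCH:-intel64}"
ISS_PLATFORM="${ISS_PLATFORM:-linux}"

# Print NAME=value if the variable is set, otherwise report it and return 1.
check_var() {
    local name="$1"
    if [ -n "${!name:-}" ]; then
        echo "$name=${!name}"
    else
        echo "$name is not set"
        return 1
    fi
}

# Source the Intel script only if it exists at the assumed location.
if [ -f "$ISS_VARS" ]; then
    source "$ISS_VARS" -arch "$ISS_ARCH" -platform "$ISS_PLATFORM"
else
    echo "compilervars.sh not found at $ISS_VARS" >&2
fi

check_var IPPROOT || true
check_var TBBROOT || true
```

Load it into the current shell with 'source iss_env.sh' rather than executing it, so that the variables remain set afterwards.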

 

STEP 2. Find the example

  First, we locate the IPP example and prepare to build it with additional ISS features such as TBB and ICC.

  When you install ISS 2019 with the default settings, the IPP example archive is located at

/opt/intel/system_studio_2019/compilers_and_libraries/linux/ipp/examples

 You will find 'components_and_examples_lin_iss.tgz' in that location. Extract the examples wherever you like, but avoid a directory that requires elevated permissions; choose one where you can work without typing 'sudo', otherwise building the example gets complicated. Then find the 'ipp_resize_mt' example folder, which is the example we use here. After extracting, you can find additional documentation at '<Extracted Examples>/components/examples_core/documentation/ipp-examples.html'.

 

STEP 3. Build the example

 Go to the main examples directory, examples_core, and build the samples using the Makefiles. Execute GNU* Make as

$ make [ARCH=ia32|intel64] [CONF=configuration] [clean]

where ARCH and CONF are optional parameters.

CONF=configuration selects a release or debug build of the example application, with or without Intel® TBB support. The possible configurations are:

Configuration   Description
release         Default configuration for release build without Intel® TBB. The compiler option is "-O2"
debug           Debug build with compiler options "-O0 -g" without Intel® TBB
release_tbb     Release build with Intel® TBB support. The compiler options are "-O2 -DUSE_TBB"
debug_tbb       Debug build with Intel® TBB support. The compiler options are "-O0 -g -DUSE_TBB". The debug libraries of Intel® TBB are used

The optional parameter 'clean' cleans up the working directory.

You can run the make command either from the top directory of the Intel® IPP examples, components/examples_core, or from an example-specific subdirectory, for example components/examples_core/ipp_fft. In the first case all examples are built; in the second, only the specific example is built.

Let's build all examples with the 'debug' and 'debug_tbb' configurations.

<Extracted Examples>/components/examples_core$ make ARCH=intel64 CONF=debug
<Extracted Examples>/components/examples_core$ make ARCH=intel64 CONF=debug_tbb

Then check that both builds completed without errors.

If they did, there will be 'debug' and 'debug_tbb' folders in <Extracted Examples>/components/examples_core/_build/intel64.
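
As a quick check of the build results, a small shell function can confirm that each configuration produced the ipp_resize_mt binary. This is only a sketch: the function name check_builds and the variable EXAMPLES_CORE are hypothetical, and the folder layout is the one described in this article.

```shell
# check_builds ROOT: report whether each configuration produced ipp_resize_mt.
# ROOT is the examples_core directory; returns 1 if any binary is missing.
check_builds() {
    local root="$1" conf status=0
    for conf in debug debug_tbb; do
        if [ -x "$root/_build/intel64/$conf/ipp_resize_mt" ]; then
            echo "$conf: ok"
        else
            echo "$conf: missing"
            status=1
        fi
    done
    return $status
}

# EXAMPLES_CORE is an assumed variable; it defaults to the current directory.
check_builds "${EXAMPLES_CORE:-.}" || echo "at least one build is missing" >&2
```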

 

 

STEP 4. Run

 

We will compare the performance of the IPP Resize example with and without TBB enabled. 

The IPP Resize example reports the average time it spends resizing one image.

 

First, let's look at the options and arguments accepted by the resize sample built without TBB.

<Extracted Examples>/components/examples_core/_build/intel64/debug$ ./ipp_resize_mt -h

 

Now, let's run the example built without TBB, with the argument '-l 1000' to run the resize 1000 times.

<Extracted Examples>/components/examples_core/_build/intel64/debug$ ./ipp_resize_mt -l 1000

The average time to resize one image is 0.782 ms.

 

Now let's look at the options of the resize sample built with TBB, and then run it.

<Extracted Examples>/components/examples_core/_build/intel64/debug_tbb$ ./ipp_resize_mt -h

Unlike the version without TBB, this build has a '-t' option to choose the number of TBB threads. We leave '-t' at its default here, but if you want to see what difference it makes, try values from 1 up to the number of logical cores on your platform.

<Extracted Examples>/components/examples_core/_build/intel64/debug_tbb$ ./ipp_resize_mt -l 1000

The average time for one image is 0.499 ms. Compared with 0.782 ms without TBB, using 4 threads gives about a 1.57x speedup (roughly 36% less time per image) in this case.
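
The two measured averages can be compared with a few lines of shell and awk. The function name compare_times is a hypothetical helper for this sketch; the inputs are the 0.782 ms and 0.499 ms figures reported above.

```shell
# compare_times SERIAL_MS TBB_MS: print the speedup factor and the time saved.
compare_times() {
    awk -v serial="$1" -v tbb="$2" 'BEGIN {
        printf "speedup: %.2fx\n", serial / tbb
        printf "time saved: %.1f%%\n", (serial - tbb) / serial * 100
    }'
}

compare_times 0.782 0.499
# prints:
# speedup: 1.57x
# time saved: 36.2%
```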

Intel® Threading Building Blocks (TBB) lets you write parallel C++ programs that take advantage of multicore performance, are portable and composable, and scale as core counts grow.
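
To explore the '-t' option mentioned above, the TBB build can be run once per thread count. The sketch below assumes that '-t' takes the thread count as a separate argument (check the '-h' output to confirm); sweep_threads is a hypothetical helper name.

```shell
# sweep_threads BIN MAX [LOOPS]: run BIN once for each thread count 1..MAX.
# Assumes BIN accepts '-l <loops>' and '-t <threads>' as described in its -h output.
sweep_threads() {
    local bin="$1" max="$2" loops="${3:-1000}" t
    for t in $(seq 1 "$max"); do
        echo "== $t thread(s) =="
        "$bin" -l "$loops" -t "$t"
    done
}

# Example, run from the debug_tbb build directory:
#   sweep_threads ./ipp_resize_mt "$(nproc)"
```

Here 'nproc' reports the number of logical cores, which is 4 on the test platform used in this article.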

 

STEP 5. Analysis

 To verify that the example actually uses all 4 logical cores simultaneously, we can investigate with VTune.
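
This article does not state how the VTune results were collected. One possible way, assuming the VTune Amplifier command-line tool amplxe-cl is on PATH, is a hotspots collection per build configuration. In the sketch below, the helper name collect_hotspots and the result directory names vtune_debug and vtune_debug_tbb are hypothetical; the function is meant to be run from the _build/intel64 directory and simply prints the command when amplxe-cl is not available.

```shell
# collect_hotspots CONF: run (or print) a VTune hotspots collection for one build.
# CONF is 'debug' or 'debug_tbb'; results go to vtune_<CONF> (an assumed name).
collect_hotspots() {
    local conf="$1"
    local cmd=(amplxe-cl -collect hotspots -result-dir "vtune_$conf" -- "./$conf/ipp_resize_mt" -l 1000)
    if command -v amplxe-cl >/dev/null 2>&1; then
        "${cmd[@]}"          # run the collection
    else
        echo "${cmd[*]}"     # tool not found: show the command instead
    fi
}

# Example:
#   collect_hotspots debug
#   collect_hotspots debug_tbb
```

The resulting directories can then be opened in the VTune GUI to inspect CPU utilization and the Bottom-up view.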

Let's first look at the VTune results without TBB.

< CPU Utilization - Resize Without TBB >

The horizontal axis shows how many cores were utilized simultaneously, and the vertical axis shows the accumulated time spent at that level of utilization. The CPU utilization of the example without TBB shows that only 1 core is utilized at a time; no multithreading happened during the collection. Time when no core is working counts as 'Idle'.

 

The CPU utilization graph of the resize example with TBB, however, looks different.

< CPU Utilization - Resize With TBB >

It clearly shows that 2 to 4 cores were utilized simultaneously for a long time, which means multiple threads were working on the same task.




< Bottom-up view - Resize Without TBB >

In the Bottom-up tab, we can see that 'l9_ownRow1Linear8uQ14' and 'l9_ownColLinear8uQ14' are called by 'l9_ippiResizeLinear_8u_C1R' and do the main work, as the call stacks of these functions show. Please refer to the images below.

< The callstacks of the main hotspots  >

These two functions take most of the execution time, so they are the main hotspots and are worth a closer look. When we filter on those two functions only, we can see the thread that works the hardest, and we can compare how the main threads differ between the two tests.

< Thread 'l9_ownRow1Linear8uQ14' and 'l9_ownColLinear8uQ14' without TBB >



< Thread 'l9_ownRow1Linear8uQ14' and 'l9_ownColLinear8uQ14' with TBB >

 

As we can see above, TBB parallelizes these functions so that the cores work together. The workload is distributed evenly across the TBB threads on the available cores. If one core finishes its work early and becomes idle while other cores still have a significant amount of work queued, TBB reassigns some of that work from a busy core to the idle core.

 

STEP 6. Conclusion

  We saw how easily an IPP example can be built and tested with other components of Intel® System Studio. It is recommended to take a close look at the IPP example source to learn how to program with IPP and TBB. Here, TBB parallelizes the resize across the dual-core processor and improves performance.

  The tools used in this article are Intel® IPP, Intel® TBB, and Intel® VTune™ Amplifier. For more information about them, please see their respective product pages.

For more complete information about compiler optimizations, see our Optimization Notice.