Accelerate Your Application via IPP Image Processing in Parallel Studio - C code vs. IPP Resize

Summary
Intel® Parallel Studio 2011 was released recently. IPP, a key component of Intel® Parallel Composer, gives users an easy way to accelerate digital applications. This article shows how to use an IPP image processing function to develop a parallel-ready application, and provides a sample that shows the performance difference between IPP and general C code when resizing an image, a widely used operation in image processing. Tests show that the IPP function runs 44x faster than the corresponding C code. With threading enabled, the speedup can exceed 50x on a Core 2 Quad 2.66GHz machine.

Attached is the sample project, a Parallel Composer 2011 project for the Microsoft Visual Studio 2005 IDE.
Some developers may have installed Intel Parallel Composer with Microsoft Visual Studio 2010. Here is the project for that version.

How to build the Sample

1. Build system requirements

Software:
•   Intel Parallel Studio 2011 and Microsoft* Visual Studio 2005 or later
•   (optional)  The static IPP libraries, installed separately from http://software.intel.com/en-us/articles/intel-ipp-static-libraries/

Hardware:  A recent dual-core or quad-core machine running Windows XP, Windows Vista, or Windows 7

2. Download Resize_Image_PS_VS2005.zip and unzip it to a directory, referred to here as <WorkDIR>.

3. Go to <WorkDIR> and double-click Resize Image.sln.  The Visual Studio 2005 IDE opens automatically.

4. From the main menu, select Project >> Intel Parallel Composer 2011 >> Select Build Component (or right-click the project in Solution Explorer), check Use IPP, and click OK.

5. Build the application: from the main menu, select Build >> Build Solution.

For build details, see "Use Intel IPP in Intel® Parallel Composer".

How to run the application

1. Run the application
From the main menu, select Debug >> Start Without Debugging. When the application window opens, click Open File and select LennaC1.bmp.
[Figure: ReadLenna.JPG]

2. Click the menu item "Process => Resize image" to resize the image.
Enter the zoom factors for the horizontal (x) and vertical (y) directions in the Resize dialog box, then click OK.
[Figure: Process.JPG]

3. Click LennaC1.bmp and repeat step 2, this time making sure to click the USE_IPP button. The result is shown in the image below.
[Figure: result1.JPG]

IPP Function Adoption: 
Assume the sample is the application whose performance we want to improve with IPP functions.
1. Find the C resize function in RESIZE.cpp:

unsigned long C_Code_Resize(unsigned char * src, int srcWidth, int srcHeight, int srcStep, unsigned char* dst, int dstWidth, int dstHeight, int dstStep, double m_zoom_x, double m_zoom_y, int interpolation)

It is called by the function ProcessImage(CSampleDoc *pSrc) in ippiAddC.cpp.

2. Check the IPP manual, ippiman.pdf, and find that the function ippiResizeSqrPixel provides the same functionality. Then replace the C function with the IPP function.
Declare a similar function in RESIZE.cpp:
unsigned long IPP_Resize(void* src, int srcWidth, int srcHeight, int srcStep, void* dst, int dstWidth, int dstHeight, int dstStep, double m_nzoom_x, double m_nzoom_y, int interpolation)

Call it in ProcessImage(CSampleDoc *pSrc) in ippiAddC.cpp instead of calling C_Code_Resize(). (To compare the performance, we keep the C function call here as well.)

if (m_USE_IPP)
{
    ippStaticInit();
    //---- run the IPP function code to resize an image -----//
    run_time = IPP_Resize(pSrc->DataPtr(), pSrc->Width(), pSrc->Height(), pSrc->Step(),
                          (Ipp8u*)pDst->DataPtr(), pDst->Width(), pDst->Height(), pDst->Step(),
                          m_zoom_x, m_zoom_y, m_Interpolation);
}
else
{
    //---- run the C code to resize an image -----//
    run_time = C_Code_Resize((unsigned char *)pSrc->DataPtr(), pSrc->Width(), pSrc->Height(), pSrc->Step(),
                             (unsigned char *)pDst->DataPtr(), pDst->Width(), pDst->Height(), pDst->Step(),
                             m_zoom_x, m_zoom_y, m_Interpolation);
}
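Both calls above receive a destination image whose width, height, and step are already set. The article does not show how the sample sizes pDst, so the following is only a minimal sketch under two assumptions: the destination extent is the source extent scaled by the zoom factor and rounded to nearest, and rows are padded to a 4-byte boundary as in BMP files. The helper names scaled_extent and aligned_step are hypothetical and are not part of the sample or of IPP.

```c
#include <assert.h>

/* Hypothetical helper: destination extent for one axis. The source extent
   is scaled by the zoom factor, rounded to nearest, and kept at one pixel
   or more. This rounding rule is an assumption, not the sample's code. */
int scaled_extent(int srcExtent, double zoom)
{
    assert(srcExtent > 0 && zoom > 0.0);
    int d = (int)(srcExtent * zoom + 0.5);
    return d < 1 ? 1 : d;
}

/* Hypothetical helper: row stride in bytes for an 8-bit image with the
   given number of channels, padded up to a multiple of 4 bytes (the BMP
   row alignment). */
int aligned_step(int width, int channels)
{
    assert(width > 0 && channels > 0);
    return (width * channels + 3) & ~3;
}
```

For example, a 512-pixel-wide source with a horizontal zoom factor of 2.0 gives a 1024-pixel-wide destination, and a 510-pixel-wide single-channel row occupies 512 bytes.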

3. Write the IPP code to replace the C code.  The original C code (C_Code_Resize) is listed first, followed by the IPP code (IPP_Resize).

unsigned long C_Code_Resize(unsigned char * src, int srcWidth, int srcHeight, int srcStep, unsigned char* dst, int dstWidth, int dstHeight, int dstStep, double m_zoom_x, double m_zoom_y, int interpolation)
{
    //---------- Perform first-order (linear) interpolation ---
    // define record time variable
    unsigned long start_clock, stop_clock;
    start_clock = RUNTIME;

    const unsigned char *tmpSrc;
    unsigned char *tmpRef;
    int width = srcWidth;
    int height = srcHeight;
    double xInv = 1.0 / m_zoom_x;
    double yInv = 1.0 / m_zoom_y;

    int colInd, rowInd;
    int i, j, xSrc0, xSrc1, ySrc0, ySrc1, wdroi, hdroi;
    int idxl, idyt, icol, jrow;
    double row, col;
    double y1, y2, y3, y4, v, v1, v2, tempV, tempV2;

    idxl = 0;
    idyt = 0;
    wdroi = dstWidth;
    hdroi = dstHeight;

    tmpSrc = src;
    for (int kloop = 0; kloop < LOOP; kloop++)
    {
        tmpRef = dst;
        for (j = 0, jrow = idyt; j < hdroi; j++, jrow++) {
            // map the destination row center back to source coordinates
            row = (jrow + 0.5) * yInv - 0.5;
            rowInd = (int)floor(row);
            ySrc0 = ts_iGetCoord_vs(rowInd, rowInd, 0, srcHeight, srcHeight);
            ySrc1 = ts_iGetCoord_vs(rowInd, rowInd + 1, 0, srcHeight, srcHeight);
            for (i = 0, icol = idxl; i < wdroi; i++, icol++) {
                // map the destination column center back to source coordinates
                col = (icol + 0.5) * xInv - 0.5;
                colInd = (int)floor(col);
                xSrc0 = ts_iGetCoord_vs(colInd, colInd, 0, srcWidth, srcWidth);
                xSrc1 = ts_iGetCoord_vs(colInd, colInd + 1, 0, srcWidth, srcWidth);
                // the four neighboring source pixels
                y1 = (double)tmpSrc[ySrc0 * srcStep + xSrc0];
                y2 = (double)tmpSrc[ySrc0 * srcStep + xSrc1];
                y3 = (double)tmpSrc[ySrc1 * srcStep + xSrc0];
                y4 = (double)tmpSrc[ySrc1 * srcStep + xSrc1];
                // interpolate horizontally on both rows, then vertically
                ts_iLinearCalcSP_vs(col + 0.5, colInd + 0.5, colInd + 1.5, y1, y2, &v1);
                ts_iLinearCalcSP_vs(col + 0.5, colInd + 0.5, colInd + 1.5, y3, y4, &v2);
                ts_iLinearCalcSP_vs(row + 0.5, rowInd + 0.5, rowInd + 1.5, v1, v2, &v);
                //(ts_isaturate_vs(v);
                // round and saturate to the 8-bit range
                tempV = (int)(v + EXP + 0.5);
                tmpRef[i] = (unsigned char)((tempV > 255) ? 255 : (tempV < 0) ? 0 : tempV);
            }
            tmpRef += dstStep;
        }
    }

    stop_clock = RUNTIME;

    int mhz;
    ippGetCpuFreqMhz(&mhz);

    return (stop_clock - start_clock)/mhz/LOOP;
}
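The C listing relies on the helpers ts_iGetCoord_vs and ts_iLinearCalcSP_vs and on the constant EXP, none of which are shown in this article. The following self-contained sketch illustrates the same general technique (pixel-center coordinate mapping followed by a bilinear blend) under stated assumptions: source coordinates outside the image are clamped to the nearest edge pixel, the EXP term is omitted, and the names clamp_index, floor_to_int, and bilinear_resize_8u are hypothetical stand-ins. It is not the sample's implementation and its output may differ from the sample's at the borders.

```c
#include <assert.h>

/* Clamp an index into [0, n-1]. Assumed border handling; the sample's
   ts_iGetCoord_vs is not shown and may behave differently. */
static int clamp_index(int i, int n)
{
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Floor of a double as an int, without needing <math.h>. */
static int floor_to_int(double x)
{
    int i = (int)x;               /* truncates toward zero */
    return (x < i) ? i - 1 : i;   /* step down for negative fractions */
}

/* Resize an 8-bit, single-channel image with bilinear interpolation.
   srcStep and dstStep are row strides in bytes. */
void bilinear_resize_8u(const unsigned char *src, int srcW, int srcH, int srcStep,
                        unsigned char *dst, int dstW, int dstH, int dstStep,
                        double zoomX, double zoomY)
{
    assert(src && dst && srcW > 0 && srcH > 0 && zoomX > 0.0 && zoomY > 0.0);

    for (int j = 0; j < dstH; j++) {
        /* Map the destination pixel center back into source coordinates. */
        double row = (j + 0.5) / zoomY - 0.5;
        int r0 = floor_to_int(row);
        double fy = row - r0;                 /* vertical weight in [0, 1) */
        int y0 = clamp_index(r0, srcH);
        int y1 = clamp_index(r0 + 1, srcH);

        for (int i = 0; i < dstW; i++) {
            double col = (i + 0.5) / zoomX - 0.5;
            int c0 = floor_to_int(col);
            double fx = col - c0;             /* horizontal weight in [0, 1) */
            int x0 = clamp_index(c0, srcW);
            int x1 = clamp_index(c0 + 1, srcW);

            /* Blend horizontally on both rows, then vertically. */
            double top = src[y0 * srcStep + x0] * (1.0 - fx) + src[y0 * srcStep + x1] * fx;
            double bot = src[y1 * srcStep + x0] * (1.0 - fx) + src[y1 * srcStep + x1] * fx;
            double v = top * (1.0 - fy) + bot * fy;

            /* Round to nearest and saturate to the 8-bit range. */
            int rounded = (int)(v + 0.5);
            dst[j * dstStep + i] =
                (unsigned char)(rounded > 255 ? 255 : (rounded < 0 ? 0 : rounded));
        }
    }
}
```

Every output pixel costs two coordinate mappings, four loads, and three blends in double precision, which is the per-pixel work the IPP function replaces with optimized code.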

unsigned long IPP_Resize(void* src, int srcWidth, int srcHeight, int srcStep, void* dst, int dstWidth, int dstHeight, int dstStep, double m_nzoom_x, double m_nzoom_y, int interpolation)
{
    // define record time variable
    unsigned long start_clock, stop_clock;
    start_clock = RUNTIME;

    // define IPP function parameters
    IppiRect srcRoi = {0, 0, srcWidth, srcHeight};
    IppiRect dstRoi = {0, 0, dstWidth, dstHeight};

    IppiSize srcSize = {srcWidth, srcHeight};
    IppiSize dstSize = {dstWidth, dstHeight};

    // query the size of the work buffer and allocate it
    int BufferSize;
    ippiResizeGetBufSize(srcRoi, dstRoi, 1, interpolation, &BufferSize);
    Ipp8u* pBuffer = ippsMalloc_8u(BufferSize);

    for (int i = 0; i < LOOP; i++)
    {
        //---------- Perform IPP function: ippiResizeSqrPixel_8u_C1R ----------//
        ippiResizeSqrPixel_8u_C1R((Ipp8u*)src, srcSize, srcStep, srcRoi, (Ipp8u*)dst, dstStep, dstRoi, m_nzoom_x, m_nzoom_y, 0, 0, interpolation, pBuffer);
    }

    ippsFree(pBuffer);
    stop_clock = RUNTIME;
    int mhz;
    ippGetCpuFreqMhz(&mhz);
    return (stop_clock - start_clock)/mhz/LOOP;
}

Performance Gain

On one test machine (Core 2 Quad 2.66GHz), the result image shows a performance gain of 15654/353 = 44x.

The test links the serial IPP static library. Because ippiResize is threaded in the dynamic library and in the threaded IPP static library, enabling multithreading raises the performance gain to more than 50x (depending on the number of cores and the image size).
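The arithmetic behind these figures can be sketched as follows. Both functions return (stop_clock - start_clock)/mhz/LOOP, and the reported gain is the ratio of the two returned values. The sketch assumes that RUNTIME yields a CPU clock-cycle count, in which case dividing by the frequency in MHz gives microseconds per iteration; the article does not show the definition of RUNTIME or the value of LOOP, so that unit, the loop count used in the example, and the helper names cycles_to_us_per_iter and speedup are all assumptions.

```c
#include <assert.h>

/* Convert an elapsed clock-cycle count to microseconds per iteration,
   mirroring the expression (stop_clock - start_clock)/mhz/LOOP.
   Assumes the count is in CPU cycles; integer division truncates. */
unsigned long cycles_to_us_per_iter(unsigned long cycles, int mhz, int loops)
{
    assert(mhz > 0 && loops > 0);
    return cycles / (unsigned long)mhz / (unsigned long)loops;
}

/* Speedup of the optimized path relative to the baseline path. */
double speedup(unsigned long baseline_time, unsigned long optimized_time)
{
    assert(optimized_time > 0);
    return (double)baseline_time / (double)optimized_time;
}
```

With the values reported above, speedup(15654, 353) evaluates to roughly 44.3, which the article rounds to 44x.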

Conclusion
Intel® Parallel Studio 2011 provides developers with a suite of tools for easily developing parallel applications on multi-core platforms. IPP is a key component of Intel® Parallel Studio. It provides thousands of highly optimized functions that support the development of high-performance digital media applications. This article briefly describes how to adopt an IPP function in place of source code in a Parallel Studio project and gain a speedup of over 40x.

