Complex multiply by conjugate

Complex multiply by conjugate

It would save memory bandwidth if there was a function that would multiply two complex vectorswith the result being as if one of thesource vectors had been conjugated. This is currently a two step process of conjugating one vector and then performing the complex vector multiply.

Please consider this for a future release.

16 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

couldn't you use ippsMulPackConj_%

If I understand MulPackConj correctly, I would have to convert my complex data to packed format, multiply, and then convert it back. My data is stored as Ipp32fc.

My main objective with this request is to reduce the memory footprint and/or memory accesses required to perform the multiplication. The conj() and mul() sequence that I currently use requires 3 reads and 2 writes of complex data to get the job done when it is possible (when combined) to use 2 reads and 1 write.

I would hazard a guess that conjugate multiply is already used internally by some of the other algorithms provided. It is a relatively common operation in singal processing involving complex data.

I have been there also, trying to get IPP DFT effective.

My problem was that I couldn't find any good FFT/DFT samples from IPP.

I'd like to have a good sample that loads a grayscale image, and then perform some frequency domain filtering, and then saving the result. The frequency domain filter could be a butterworth filter for example.

If this was available, we'd be able to continue with other filters. Google only indicates general frequency domain theory. Here, we need specific Intel IPP information, to get top performance.

Maybe Intel could expand Picnic to include a simple FFT/DFT workbench function.

Ok, it would be a nice to have,

but if you're sure your pointers are 16 byte aligned and your processor supports sse3, it's really easy to implement in SSE intrinsics (_mm_mul_ps/_mm_hadd_ps for single precision)

Thanks for the suggestions.

I haven't written any sse intrinsic code in Visual Studio. I did some monkey see monkey do code changes in Unix/gcc in a previous life that used them.

What would that look like in Visual Studio C/C++? Or do you have a good reference link?

it will look like

[bash]void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  for( int i=0; i

I did not compare the outputs of SSE implementation and C, maybe the imaginary part of SSE has the wrong sign.

Thanks a bunch. I will try this out and see what kind of change it has on the timing of my algorithm.

I would still like Intel to supply the functionality so it will work on all platforms and be supported/optimized by them on future processors.

Best Reply

sorry, yesterday I just had 10min to "hack" some buggy code

here's the correction

//the C version
void MulConjCompC( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  for( int i=0; i  {
    pDst[0] = pSrc1[0]*pSrc2[0] - pSrc1[1]*pSrc2[1];
    pDst[1] = pSrc1[1]*pSrc2[0] - pSrc1[0]*pSrc2[1];
  }
}

//2 complex values with 1 loop
void MulConjCompSSE( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF));  //check for 16byte align
  __m128* src1 = (__m128*)pSrc1;
  __m128* src2 = (__m128*)pSrc2;
  __m128* dst = (__m128*)pDst;
  for( ; count>0; count-=4, src1++, src2++, dst++)
  {
    __m128 d1 = _mm_mul_ps( src1[0], src2[0]);
    __m128 ds = _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1));
    __m128 d2 = _mm_mul_ps( ds, src2[0]);
    ds = _mm_hsub_ps( d1, d2);  //horizontally add 2 values
    *dst = _mm_shuffle_ps( ds, ds, _MM_SHUFFLE(3, 1, 2, 0));
  }
  MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count);  //rest
}

//4 complex values with 1 loop
void MulConjCompSSE3_2( float* pSrc1, float* pSrc2, float* pDst, int count)
{
  assert( !(((INT_PTR)pSrc1 | (INT_PTR)pSrc2 | (INT_PTR)pDst) & 0xF)); //check for 16byte align
  __m128* src1 = (__m128*)pSrc1;
  __m128* src2 = (__m128*)pSrc2;
  __m128* dst = (__m128*)pDst;
  for( ; count>0; count-=8, src1+=2, src2+=2, dst+=2)
  {
   __m128 d1 = _mm_mul_ps( src1[0], src2[0]);
   __m128 d2 = _mm_mul_ps( _mm_shuffle_ps( src1[0], src1[0], _MM_SHUFFLE(2, 3, 0, 1)), src2[0]);
   __m128 e1 = _mm_mul_ps( src1[1], src2[1]); 
   __m128 e2 = _mm_mul_ps( _mm_shuffle_ps( src1[1], src1[1], _MM_SHUFFLE(2, 3, 0, 1)), src2[1]);
   __m128 f1 = _mm_hsub_ps( d1, e1);
   __m128 f2 = _mm_hsub_ps( d2, e2);
   dst[0] = _mm_unpacklo_ps( f1, f2);
   dst[1] = _mm_unpackhi_ps( f1, f2);
  }
  MulConjCompC( (float*)src1, (float*)src2, (float*)dst, count); //rest
}


renegr - I didn't expect someone to take the time to actually code it. Many thanks. Your help is very much appreciated.

Any one else that is interested,

Since what I actually wanted was dst = src1 * conj(src2) the real calculation inMulConjCompC() should be:

pDst[0] = pSrc1[0]*pSrc2[0]+ pSrc1[1]*pSrc2[1]; // notice the + instead of -

and therefor I think the first mm_hsub_ps() in MulConjCompSSE3_2() should be:

__m128 f1 = _mm_hadd_ps( d1, e1); // noitice the hadd instead of hsub

Of course you're right (it was correct in the c function)

I'm currently gaining my SSE experiences so it was a nice practice. Would be nice if you could provide some measures on how your performance did increase by this functions.

Hi,

I have a similar need to have a Complex multiply by conjugate for arrayfunction. Currently the function is a C implementation in my application. The function takes a quite a good amount of time in my application.
I have to optimize the implementation using IPP. I triedtwo versions (given below )but both are slower then the C version.

//IPP implementation

IppStatus ippStats;

Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format

/*//Implement-1

// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j

ippsMul_32f(pAreal,pBreal,pCreal,Np);

ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);

//Calculating q = (aI*bR-aR*bI)

ippsMul_32f(pAimag,pBreal,pCimag,Np);

//Make -bI vector from bI vector

ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);

ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);

*/

//Implement-2

//Allocate memory for pSrcA, pSrcB and pSrcC buffer

pSrcA = ippsMalloc_32fc(3*Np);

pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);

pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);

if((pSrcA==0)||(pDst==0)||(pSrcB==0))

{

IPP_ERR_STATUS_var = ippStsMemAllocErr;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//First Convert 2 separte IQ buffers to single complex number buffer

//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);

ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);

ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

//Using IPP API:-

//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);

ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);

if(ippStats!=ippStsNoErr)

{

IPP_ERR_STATUS_var = ippStats;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//Now Convert output complex buffer to 2 separte IQ buffers

//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);

ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

Then I implemented the SSE version of the function as

void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)

{

/*16-byte alignment check*/

{

__m128 m1, m2, m3, m4;

//__m128 sr1,sr2,si1,si2,dr1,dr2;

// __declspec(align(16)) float *pSrc1re2;

__m128 *srcR1 = (__m128*)pSrcRe1; //a1re

__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im

__m128 *srcR2 = (__m128*)pSrcRe2; //b1re

__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im

__m128 *destR = (__m128*)pDstRe; //ResRe

__m128 *destI = (__m128*)pDstIm; //ResIm

for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)

{

m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)

m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)

m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)

m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)

*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)

*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)

}

mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real

const Float_t *pAimag, // source of array 'A' imag

const Float_t *pBreal, // source of array 'B' real

const Float_t *pBimag, // source of array 'B' imag

Uword32 Np, // number of points

Float_t *pCreal, // Dest for real part

Float_t *pCimag)

C version

int s;

float ar, ai, br, bi;

for (s = 0; s < count; s++) // for each array element

{

ar = *pSrcRe1++; // get the real and imaginary parts

ai = *pSrcIm1++;

br = *pSrcRe2++;

bi = *pSrcIm2++;

*pDstRe++ = ar * br + ai * bi; // do the complex multiply

*pDstIm++ = ai * br - ar * bi; // B conjugate of B

}

This function is very fast compare to C as long all the input vectors are 16-byte aligned. I have limitation not to change the current framework of application. And I am not sure how to force a vector to be 16-byte aligned. Kindly suggest what are the options I can try
1. algin vectors to 16-byte.
2. or using IPP to have faster routine.
Hoping to get a speedy reply from experts.

Note: I have just started using IPP/SSE. Its hardly a week's experience.

Regards
Rohit

//////// Posted Twice/////////

Hi,

I have similar requirementto have Complex multiply by conjugate function for array . Currently the function is a C implementation. The function takes quite a good amount of time in my application. I have to optimise wit either IPP or SSE.

Details on the function
--------------------------

Description:Array A x Comp Conj of B; Result in C.

mult_fc_fcConj_arrays (const Float_t *pAreal, // source of array 'A' real

const Float_t *pAimag, // source of array 'A' imag

const Float_t *pBreal, // source of array 'B' real

const Float_t *pBimag, // source of array 'B' imag

Uword32 Np, // number of points

Float_t *pCreal, // Dest for real part

Float_t *pCimag) // Dest for imag part

I have tried two version of IPP implementation. But both version are slower then C.

//IPP implementation

IppStatus ippStats;

Ipp32fc *pSrcA, *pSrcB, *pDst; // Typecasting input and outpointers in IPP Complex format

/*//Implement-1

// p + qj = (aR+aI*j)*(bR-bI*j)=(aR*bR+aI*bI) + (aI*bR-aR*bI)j

ippsMul_32f(pAreal,pBreal,pCreal,Np);

ippsAddProduct_32f(pAimag,pBimag,pCreal,Np);

//Calculating q = (aI*bR-aR*bI)

ippsMul_32f(pAimag,pBreal,pCimag,Np);

//Make -bI vector from bI vector

ippsSubCRev_32f_I(0,(Ipp32f *)pBimag,Np);

ippsAddProduct_32f(pAreal,pBimag,pCreal,Np);

*/

//Implement-2

//Allocate memory for pSrcA, pSrcB and pSrcC buffer

pSrcA = ippsMalloc_32fc(3*Np);

pSrcB = &pSrcA[Np];//ippsMalloc_32fc(Np);

pDst = &pSrcA[2*Np];//ippsMalloc_32fc(Np);

if((pSrcA==0)||(pDst==0)||(pSrcB==0))

{

IPP_ERR_STATUS_var = ippStsMemAllocErr;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//First Convert 2 separte IQ buffers to single complex number buffer

//IppStatus ippsRealToCplx_32f(const Ipp32f* pSrcRe, const Ipp32f* pSrcIm, Ipp32fc* pDst, int len);

ippsRealToCplx_32f(pAreal, pAimag, pSrcA, Np);

ippsRealToCplx_32f(pBreal, pBimag, pSrcB, Np);

//Using IPP API:-

//IppStatus ippsMulByConj_32fc_A21 (const Ipp32fc* pSrc1, const Ipp32fc* pSrc2, Ipp32fc* pDst, Ipp32s len);

ippStats = ippsMulByConj_32fc_A21 (pSrcA, pSrcB, pDst, Np);

if(ippStats!=ippStsNoErr)

{

IPP_ERR_STATUS_var = ippStats;

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

return afDspUtils_Filter_Intel_IPP_error_See_IPP_ERR_STATUS;

}

//Now Convert output complex buffer to 2 separte IQ buffers

//IppStatus ippsCplxToReal_32fc(const Ipp32fc* pSrc, Ipp32f* pDstRe, Ipp32f* pDstIm, int len);

ippsCplxToReal_32fc( pDst, pCreal, pCimag, Np);

myIppsFree(pSrcA);//myIppsFree(pSrcB);myIppsFree(pDst);

Then I implemented the function using SSE. Which is quite fast and works perfect in my application as long as the vectors are 16-byte aligned.

void MulConjCompSSE( float* pSrcRe1,float* pSrcIm1, float* pSrcRe2,float* pSrcIm2, float* pDstRe,float* pDstIm, int count)

{

//Check for 16-byte alignment

{

__m128 m1, m2, m3, m4;

//__m128 sr1,sr2,si1,si2,dr1,dr2;

// __declspec(align(16)) float *pSrc1re2;

__m128 *srcR1 = (__m128*)pSrcRe1; //a1re

__m128 *srcI1 = (__m128*)pSrcIm1; //a1Im

__m128 *srcR2 = (__m128*)pSrcRe2; //b1re

__m128 *srcI2 = (__m128*)pSrcIm2; //b1Im

__m128 *destR = (__m128*)pDstRe; //ResRe

__m128 *destI = (__m128*)pDstIm; //ResIm

for(int i = 0 ;i< count; count-=4, srcR1+=1,srcI1+=1, srcR2+=1,srcI2+=1, destR+=1,destI+=1)

{

m1 = _mm_mul_ps( *srcR1, *srcR2); //(a1Re*b1Re)

m2 = _mm_mul_ps(*srcI1,*srcI2); //(a1Im*b1Im)

m3 = _mm_mul_ps(*srcR1,*srcI2); //(a1Re*b1Im)

m4 = _mm_mul_ps(*srcI1,*srcR2); //(a1Im*b1Re)

*destR = _mm_add_ps(m1,m2); // (a1Re*b1Re)+(a1Im*b1Im)

*destI = _mm_sub_ps(m4,m3); //(a1Im*b1Re)-(a1Re*b1Im)

}

}

I have limitation not to disturb the application framework i.e changing the interface. I have to use vectors. And I don't know how to force the vector to have 16-byte aligned allocation. I have to optimised the speed for this function. Kindly suggest the options for the following
1. Either a method to allocate vector 16-byte aligned
2. Or a fast IPP based method to implement this function
3. SSE implementation to handle non-aligned vectors.

Hoping to get a speedy reply from the experts.

Note: I have just started working on IPP/SSE. its just a week into my first IPP/SSE routine.

Regards
Rohit

I see that you tried already to use__declspec( align(16) ) declaration but it is commented out. Did you have any problems?

You canalsouse_mm_malloc function:

...
__m128 *pVec1 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );
...

Best regards,
Sergey

I have tried both options. It will work when the vectors are defined with in the function. I asked the case how to handle the vectors passed to function are not 16-byte aligned. i.e if the function are passed with vectors which are not aligned to 16-byte boundary.RegardsRohit

Hi Rohit,

I think I've already provided an answer at http://software.intel.com/en-us/forums/showthread.php?t=101316 thread.

Regards,
Igor

Leave a Comment

Please sign in to add a comment. Not a member? Join today