Missing ippiMulPack_32f for real unpacked input

Missing ippiMulPack_32f for real unpacked input

I'm working on a DFT filter for images using a Real filter.

I forward transform the image using ippiDFTFwd_RToPack_32f_C1R(). The result is packed complex.

I then want to multiply my filter into the transform, but there is a problem: I don't know how to create a packed filter, and there is no variant of ippiMulPack_32f that takes an non-packed filter as source.

Of course, I could create a complex filter, then convert that to packed complex, but that extra step is costly.

Is there an ippiMulPack that takes unpacked real input?

Does anybody know how to write a packed complex image?

(w*h 32fc image array)

23 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

the packed format is well described in the according pdf documentation. You can write your own function which transforms your filter into packed format.

I would not say "well described" for Intells explanation/documentation of their RCPack2D format.

The PDF indicates shaded colum only for Even, but there is no shaded column...

Also, I was looking for a quick solution, on how to take an Ipp32f image (data is Real, Im=0), and shuffle that into a packed Ipp32f.

Normally, I'd look elsewhere, but since this format is from Intel, I'm hoping they have a quick solution, and of cource, all in the light of speed/high performance.

So you mean ippiCplxExtendToPack_ ?

No, I mean, instead of Create Real Filter Cplx, followed by ippiCplxExtendToPack, followed by ippiMulPack_32f, I'm looking to see if there is a way to skip one step, doing Create Real Filter 32f, followed by some unknown function ippiMulRToPack_32f.

This way, the filter uses less memory, ippiCplxExtendToPack does not have to be called == more performance.

Alternatively, I there is a function to pack a Real image, or if I could write my own function (I can't, because documentation is too low), then I'd just use the packed functions directly, also with high performance.

We have our own function which converts real into IPP packed data - but sorry, that's copyrighted. Of course it is no function with 2 or 3 lines (our C optimized function needs about 90 lines of code). I don't see any code piece here which could be optimized by SSE, so I think therefore it isn't part of IPP

Your 90 lines of code are what prevents me to start recoding that myself!

I still feel its a pity Intel does not supply this for us. Looking at UIC, Intel hands out tons of code to implement image compression. For Fourier Transforms, Intel does not really have a good sample. Of course there is some sample code, but that is more for the beginning, on how to prepare the buffers and call the functions.

I really miss a good DFT/FFT sample, one that does some optimized image filtering, for instance, a Butterworth sharpening filter. One that includes all border management (still a problem for many), and a useful real to packed function.

Anyway, you've told me about your copyrighted code, so what now? Can you give a hint with some pseudo code? :))

We've developed (a collegue) the transformation based on the information of RCPack2D format described in ippiman.pdf (March 2009) on page 718-719. There is a table on how the data is located in memory. I don't think that there is a better way to describe the format than with a table which is good to imagine.

The 90 lines of code include also

  • moves imaginary data into RCPack2D
  • handles even and also odd # of rows
  • optimized by removing jumps for odd/even column/row index within loops

Maybe you need none of these points.

I only need the first point.

In the PDF, "Table 10-5, RCPack2D Storage for Even Number of Rows", I see that the first row is boxed, why is that?

If I ignore the box, I see an incomplete pattern:

First row (odd rows):
Re A(0,0) Re A(0,1) Im A(0,1) ... Re A(0,(N-1)/2) Im A(0,(N-1)/2) Re A(0,N/2)

How am I supposed to know what is in the "..." ?
If it is A(I,J), I can see that I is 0 for the row.
I can see that J goes 0,1,1, but what comes after that? I don't like guesswork in my programming!
Could it be (j+1)/2, where j=0..N/2 ?

Then the selection of Re/Im components. I seen Re, Re, Im, ..., how am I supposed to known what is in "..."?
Is the next Im or Re?
Is the full pattern Re, Re, Im, Im, Re, Re, Im, Im etc, or is it Re, Re, Im, Re, Re, Im, etc?

Any help highly appreciated!

You think completely in the wrong direction.

Let's look at first row: Re Re Im .... Re Im Re -> the "..." here means pairs of Re/Im for each row index, only the first and the last element are real parts.

Now look at the columns from row 2 on, the first and last column are pairs of Re/Im while the columns between are completely real or imaginary parts.

And now the last point for the even # of rows the last row is symmetric to the 1. one.

I guess you mean "means pairs of Re/Im for each column index...", no?

Lets for arguments sake say that N=32, therefore there are 16 columns in the packed array (even width). Is the first row then this (according to you, not according to the PDF)? :

Re(0,0), Re(0,1), Im(0,1), Re(0,2), Im(0,2), Re(0,3), Im(0,3), Re(0,4), Im(0,4), Re(0,5), Im(0,5), Re(0,6), Im(0,6), Re(0,7), Im(0,7), Re(0,8)

The PDF does not say this.

>> "means pairs of Re/Im for each column index..." yes of course

>> Lets for arguments sake say that N=32, therefore there are 16 columns in the packed array (even width).

No, if N==32, then 30 columns are in the packed array, just Im(0,0) and Im(0,N/2) are removed

Ahh, yes, source items are Re+Im.

So, for N=32, there are 32 source items (with Re+Im) in a row, and the packed array will have 32 float items (packed array has same number of floats as source complex array).

I guess I'll have to decide speed from between writing my filter directly to a packed array versus writing a simple float array, and then calling ippiCplxExtendToPack. Since I only do one single image when filtering, the computation always includes generating the filter.

- Generate Real filter
- ippiCplxExtendToPack
- ippiDFTFwd_RToPack
- ippiMulPack_32f
- ippiDFTInv_PackToR

- Generate packed filter
- ippiDFTFwd_RToPack
- ippiMulPack_32f
- ippiDFTInv_PackToR

(C) future, whished for, IPP function
- Generate Real filter
- ippiDFTFwd_RToPack
- ippiMulRealToPack_32f32fc
- ippiDFTInv_PackToR

Which one would be fastest?

>>Which one would be fastest?

Of course it will be (A) or (B)

  1. in most cases you will create the packed filter once and do ippiMulPack multiple times
  2. ippiMulPack is much faster than ippiMulRealToPack because source1 and source2 have the same format and 4 floating point multiplications can be done at once in SSE

Therefore (B) only has an advantage over (A) if you really often generate new packed filters

I need to generate the filter for every image since each image has another (unknown) dimension.

I guess I should decide if

- Generate Real filter Cplx
- ippiCplxExtendToPack

is slower/faster than

- Generate packed filter

Since ippiDFTFwd_RToPack+ippiMulPack_32f+ippiDFTInv_PackToR is rather time consuming anyway, I guess I might as well use (A), and drop my ideas of generating a packed filter.

I do not immediately agree that 4 multiplications can be done at once in SSE because complex mulitiplication is not just using multiply, but also add.

In fact, since multiplying a real number to a complex number only involves multiplying the real number into each of Re, Im, that would be very fast if it was supported by IPP.

Of course complex multiplication can be fast (see entry "Complex multiply by conjugate" within this forum where I posted some SSE3 code doing this.

>>In fact, since multiplying a real number to a complex number only involves multiplying the real number into each of Re, Im, that would be very fast if it was supported by IPP.

No because your real data buffer has a completely different data aligning compared with the packed data. So you need a loop with jumps or an aligning correction which will be slower than just streaming data through the SSE unit.

The slowest step in your code is the DFT. What you need is a FFT for arbitrary sizes, which isn't supported in IPP. Maybe you should take a look at FFTW.

>What you need is a FFT for arbitrary sizes, which isn't supported in IPP. Maybe you should take a look at FFTW.

Maybe you can let us know some details about the limitation of DFT functions:

ippiFFTFwd_/ippiFFTInv_ functions needs the data lengths must be integer powers of 2.

But for,

ippiDFTFwd_/ippiDFTInv_ functions, the N and M can take on arbitrary integer non-negative values.



The limitation of the DFT is its complexity O(N^2), while FFT has a complexity of O(N log N). For more details about arbitrary length FFT's, you can take a look at wikipedia (prime-factor fft algorithm)


Since I know that IPP convolution filters will internally use Fourier transforms if it determines it would be faster, I'd like to know if IPP DFT internally uses FFT for the same reasons?

Those internal IPP decisions are also a bit less documented. Is there a specific location for more docs?

The name "Convolution" does not reason to the internal algorithm which is used. The name "DFT" of course. I wouldn't name it DFT if it is a FFT in some cases :)

But will be easy to find out, just take an image with a power of 2 for width/height and test the DFT/FFT

With convolution I mean ippiFilter32f. If I'm informed well, it uses DFT/FFT internally, if its size is larger than X.

I did take a look at FFTW, but it has a price tag for non-GPL applications.


The name of "DFT" may bring some confusing here. For IPP DFT transform function (ippiDFTFwd_/ippiDFTInv), it is optimized with some fast implementation. If its length is power of 2, it calls corresponding FFT functions. For some other lengths, it is based on prime-factor algorithm. the function have optimized for many primes.

For ippiFilter function, it does not use FFT internally, it has direct implementation.

ippiConv and ippiCrossCorr use 2D FFT. So if you needs filtering an image with rather big kernel (~20x20), it's better to use ippiConv, instead of ippiFilter.



Thanks for the good explanation.

IMO a FFT is a FFT even if it has arbitrary length. But maybe it is historical and based on the first implementation of the DFT which really was a DFT and till now the algorithm was more and more optimized and uses FFT instead of a real DFT now.

In summary this means I can remove all decisions whether to call DFT or FFT by just calling FFT everytime because performance loss here will be insignificant

Leave a Comment

Please sign in to add a comment. Not a member? Join today