# FIR filter code optimization

Hi all,
I'm trying to implement a FIR filter (convolution) on Xeon Phi. First, what I mean by a FIR filter: let x[i] be a signal, where [] means it is discrete, and let a[i] be a set of coefficients. The FIR filter is

y[i] = SUM from n=0 to n=T-1 of ( x[i-n] * a[n] )

, where T is the number of taps. This is a FIR filter on a single frequency channel. The actual filter is multi-channel: if I have N channels and I compute one channel after another, I access the data like this:

```
|*------N-1 numbers-----|*------N-1 numbers-----|*------N-1 numbers-----|...
 ^                       ^                       ^
 first x accessed        second x                third x ...
 (beginning)             (stride N)
```

The coefficients are distinct for each tap and each channel, so they have size Channels*Taps. When performing the FIR filter repeatedly, I use the same coefficients.
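To make the setup concrete, here is a minimal scalar reference of the multi-channel FIR described above (a sketch only; the function and variable names are illustrative, not from the attached code, and the tap index runs forward here, which is equivalent to the textbook x[i-n] form with reversed coefficients):

```c
#include <assert.h>
#include <stddef.h>

/* Scalar multi-channel FIR reference (illustrative, not the Phi code):
 * x holds N spectra of nChannels floats each, stored spectrum after
 * spectrum (so one channel's samples are nChannels apart, as in the
 * diagram above); coeff holds nTaps*nChannels floats in the same layout.
 * The output has (N - nTaps + 1) spectra. */
static void fir_multichannel(const float *x, const float *coeff, float *y,
                             size_t N, size_t nChannels, size_t nTaps)
{
    for (size_t s = 0; s + nTaps <= N; s++)          /* output spectrum */
        for (size_t c = 0; c < nChannels; c++) {     /* channel */
            float acc = 0.0f;
            for (size_t t = 0; t < nTaps; t++)       /* tap */
                acc += x[(s + t)*nChannels + c] * coeff[t*nChannels + c];
            y[s*nChannels + c] = acc;
        }
}
```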

Performance:
So far I have only been able to achieve F=150 GFlop/s; the calculation takes t=0.078 s for 2 GB of data. It is not ideal to measure bandwidth-bound code in flops, but I do not have bandwidth results. For the Xeon Phi I am using code modified from a CPU version, which achieves F=79 GFlop/s in t=0.147 s on two Xeon E5-2650. Also, if I increase the number of taps (thus increasing data reuse), the Xeon Phi performance decreases, in contrast to the CPU code, where performance increases. I believe this is a cache effect, so I wrote what I believe to be a blocked version of the code; however, it does not perform well.

Taps/performance (GFlop/s):

| Taps | 8   | 12  | 16  | 20 | 24 | 28  |
|------|-----|-----|-----|----|----|-----|
| Phi  | 111 | 150 | 62  | 48 | 48 | 49  |
| CPU  | 76  | 80  | 110 | 98 | 98 | 102 |
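One way to read these numbers is to convert GFlop/s into effective DRAM traffic. As a back-of-the-envelope sketch (my own assumption, not a measurement): each output sample costs 2*nTaps flops (one multiply and one add per tap), and with perfect cache reuse only about 8 bytes of DRAM traffic (a 4-byte input read plus a 4-byte output write):

```c
#include <assert.h>

/* Back-of-the-envelope model (assumption): 2*nTaps flops per output
 * sample, ~8 bytes of DRAM traffic per sample with perfect reuse, so
 * effective bandwidth in GB/s is GFlop/s * 8 / (2*nTaps). */
static double effective_gbps(double gflops, int nTaps)
{
    return gflops * 8.0 / (2.0 * nTaps);
}
```

Under this model the best Phi point (150 GFlop/s at 12 taps) corresponds to roughly 50 GB/s of effective traffic, while the 28-tap point (49 GFlop/s) corresponds to only about 7 GB/s, which fits the suspicion that the blocked code is not actually getting the intended cache reuse.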

Do you have any suggestions for improving the code? Is my code faulty in some way?

Here is simplified code:

```c
INNER = (96*1024/(4*nTaps)); // number of channels that fit into ~96 KB of cache
OUTER = nChannels/INNER;
REMAINDER = REG - OUTER*INNER;

for (o = 0; o < OUTER; o++) {
  for (blt = 0; blt < block_step; blt++) {
    for (i = 0; i < INNER; i++) {
      i_spectra[0] = _mm512_setzero_ps(); i_spectra[1] = _mm512_setzero_ps();
      i_spectra[2] = _mm512_setzero_ps(); i_spectra[3] = _mm512_setzero_ps();

      for (t = 0; t < nTaps; t++) {
        // loads and _mm512_fmadd_ps omitted for brevity
      } // for nTaps

      _mm512_store_ps(&spectra[(RpCl*c)*FpR   + 2*bl*nChannels],     i_spectra[0]);
      _mm512_store_ps(&spectra[(RpCl*c+1)*FpR + 2*bl*nChannels],     i_spectra[1]);
      _mm512_store_ps(&spectra[(RpCl*c)*FpR   + (2*bl+1)*nChannels], i_spectra[2]);
      _mm512_store_ps(&spectra[(RpCl*c+1)*FpR + (2*bl+1)*nChannels], i_spectra[3]);
    } // for INNER
  } // for nSpectra
} // for OUTER
```


The compiler should interleave the loads. You might try this if it does not:

```c
INNER = (96*1024/(4*nTaps)); // number of channels that fit into ~96 KB of cache
OUTER = nChannels/INNER;
REMAINDER = REG - OUTER*INNER;

for (o = 0; o < OUTER; o++) {
  for (blt = 0; blt < block_step; blt++) {
    for (i = 0; i < INNER; i++) {
      i_spectra[0] = _mm512_setzero_ps(); i_spectra[1] = _mm512_setzero_ps();
      i_spectra[2] = _mm512_setzero_ps(); i_spectra[3] = _mm512_setzero_ps();

      for (t = 0; t < nTaps; t++) {

      } // for nTaps

      _mm512_store_ps(&spectra[(RpCl*c)*FpR   + 2*bl*nChannels],     i_spectra[0]);
      _mm512_store_ps(&spectra[(RpCl*c+1)*FpR + 2*bl*nChannels],     i_spectra[1]);
      _mm512_store_ps(&spectra[(RpCl*c)*FpR   + (2*bl+1)*nChannels], i_spectra[2]);
      _mm512_store_ps(&spectra[(RpCl*c+1)*FpR + (2*bl+1)*nChannels], i_spectra[3]);
    } // for INNER
  } // for nSpectra
} // for OUTER
```

Jim Dempsey

Hi,
I'm very sorry for my late reply.

I might be misunderstanding the concept of blocking to fit into a cache: I simply divide the processed data into blocks that fit into the cache and then process them one after another. So why do I see such a massive drop in performance? I understand there will be some decline, but the code behaves as if there were no blocking at all.

To jimdempseyatthecove:
I should always state which compiler I am using, and I always forget. I am using the Intel compiler 13.1.1 (for native compilation). I have tried what you suggested, without any difference, so the compiler is probably already doing it on its own.
Thanks for the good suggestion; I will try to remember that.

What are the values for nSpectra, nChannels and nTaps? I know you stated nTaps is variable, but I do not know the suitable ranges. nChannels may be fixed. Also, because you are using _mm512_load_ps, this would appear to imply you are loading 16 floats for processing as a vector.

If you can describe your data layout, it would be helpful to us (me) to make recommendations.

If you can, provide a stripped down main that generates test data and calls your function in a timed loop. Do not supply your entire application. Supply just the essentials to "benchmark" your function.

I notice you have a for(o=0; o < OUTER; ++o) loop, yet none of the internal indices are strided by o. This leads me to believe you are redundantly processing data.

Data organization makes a big difference. Correct me if this is wrong: I assume all lanes of your _mm512 data are productive (IOW this is not scalar operations with one channel out of 16 available).

More efficient use of cache can be attained by using stride-1 vector loads (cache line aligned). Your code has stride of nChannels.

On Xeon Phi you have 4 hardware threads per core. The hardware threads within a core share the L1 and L2 cache of their core, but can also access data within the L2 of other cores (at higher latency). You have a variety of environment variables you can set to juxtapose hardware threads with OpenMP logical processors: KMP_AFFINITY=COMPACT, KMP_AFFINITY=SCATTER, and many others.

Analyze your load patterns, reorganize if necessary, such that the L2 cache hits have majority within the same core. Note, the optimal hardware thread in core for this loop may differ from the optimal layout for other portions of your code.

Due to you having block_step=nSpectra/nThreads, you've chosen the number of blocks==nThreads. Therefore, the optimal loop nesting order will depend on the size relationship between block_step and nTaps.

I came across an acronym that is pertinent for optimization: OHIO

"Only Handle It Once"

As it applies to programming, this is extended to Only Handle It As Minimally As Possible (OHIASMAP, which is harder to pronounce and remember). OHIO is the optimal solution; strive to get as close to OHIO as possible.

Jim Dempsey

Convolution and dense matrix multiplication are the only two commonly used kernels capable of approaching peak performance on most cached systems, but the implementation details can be quite complex. In particular, both cases require aggressive reuse of summation variables in registers in order to reduce the number of loads from cache. Following Jim Dempsey's comment, it is critical to know the sizes of the arrays and filters, and the number of arrays and filters, along with a sample layout for the whole thing, to give a concrete starting point.

Optimization of convolutions is not terribly hard on a "traditional" cached system, but SIMD arithmetic makes it much more challenging to implement the required register reuse.  Obviously Intel has done it, since they get very good performance with DGEMM, but they are free to choose their block size in that case, while your convolution filter width may not be adjustable.

As in most systems, there is a trade-off between blocking for the L1 cache and blocking for the L2 cache. The L1 cache provides the best available bandwidth, but with only 1/16 the capacity of the L2. The L2 has a maximum sustainable read bandwidth of about 1/3 of the L1 read bandwidth (measured in isolation, i.e., ignoring the cache fills and writebacks that occur when streaming data from memory).

But before we worry about these details, looking at some specific sizes of interest would be helpful. You should be able to stream (contiguous) data from memory at 120 GB/s or better, and it looks like you are not very close to that yet. This could be due either to compute limitations (typical for convolutions with cacheable filter sizes) or to inefficient use of the cache lines that are actually transferred.

"Dr. Bandwidth"

Hi,
it seems that I explained poorly, or not at all, how I do things. Let me try again.

The real data comes in complex format, i.e. pairs of single-precision numbers (floats): <[real part][imaginary part]>;<[real part][imaginary part]>;... To simplify things, it can be thought of as an array of floats. The complex nature has no effect on code or data parallelism; it just doubles the size of the computed array.

Sizes:
nSpectra, nChannels and nTaps are all somewhat arbitrary, but it can be assumed that nChannels is a multiple of 16 and that nSpectra is much bigger than the number of threads. I assume 120000 in a single run, but the code is meant to run continuously in offload, so in production it will most probably run with nSpectra < 100000 because of double-buffered offload. The number of taps is so far completely arbitrary, but from the final-result perspective, more is better.

Input:
So the input is an array of floats, grouped into sets of size nChannels. In the end one gets Input=<Spectra01>;<Spectra02>;..., where each spectrum is Spectra=<float>;<float>;... and contains nChannels numbers (floats). Each number represents one channel. The input is a multiple of nChannels: sizeof(Input)=N*nChannels.

Coefficients:
The coefficients have size sizeof(coeff)=T*nChannels, where T is the number of taps. Again, they are grouped in sets of size nChannels.

Output:
The output is similar to the input but has size only sizeof(output)=(N-T+1)*nChannels. This is due to the algorithm used.

Align:
Data allocation is done using _mm_malloc with an alignment of 4096.

Each thread reads, via _mm512_load_ps, 16 floats from 16 channels of the first spectrum; this way I read one cache line at a time. It then loads the associated 16 floats from the coeff array (again a whole cache line). These are multiplied and accumulated using _mm512_fmadd_ps. This cycle is repeated T (taps) times, where iteration t of the tap loop reads the channels (16 floats) from spectrum bl+t, bl being the index of the current spectrum.

When the loop over taps is finished, the code proceeds to the next 16 channels (floats). This is repeated until the whole spectrum is processed.

Then the next spectrum, and so on.

I have found that the fastest thread assignment is affinity=compact with threads ordered so that the first thread (id=0) processes the first spectrum (and thus also accesses the next T-1 spectra), the second thread the second spectrum, and so on. Considering what thread id=0 computes, it starts at spectra [0], [nThreads], [2*nThreads], ..., where nThreads is the total number of threads.
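The round-robin assignment described above boils down to a single index computation (a sketch; the names bl, blt, th_id and nThreads follow the code in this thread):

```c
#include <assert.h>

/* Round-robin spectra-to-thread mapping: thread th_id handles spectra
 * th_id, th_id + nThreads, th_id + 2*nThreads, ...; blt is the
 * thread-local iteration counter (0 .. block_step-1). */
static int spectrum_index(int blt, int th_id, int nThreads)
{
    return blt * nThreads + th_id;
}
```

With bl = spectrum_index(blt, th_id, nThreads), thread 0 visits spectra [0], [nThreads], [2*nThreads], ..., exactly the pattern above.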

Non-blocked simplified code:

```c
int FpR = 16; // floats per register: 512/(8*4)
int REG = nChannels/FpR;

#pragma omp parallel shared(input_data,spectra,coeff) private(blt,th_id,block_step,bl,t,c,i_data,i_coeff,i_spectra)
{
  // note: this is the ideal case where nSpectra is a multiple of the number of threads => remainder = 0

  for (blt = 0; blt < block_step; blt++) {
    for (c = 0; c < REG; c++) {
      i_spectra = _mm512_setzero_ps();
      for (t = 0; t < nTaps; t++) {
        // loads of i_data and i_coeff omitted for brevity
        i_spectra = _mm512_fmadd_ps(i_data, i_coeff, i_spectra);
      } // for nTaps

      _mm512_store_ps(&spectra[c*FpR + bl*nChannels], i_spectra);
    } // nChannels
  } // for nSpectra
}
```

An enhancement to this is that one thread reads two consecutive spectra (for example, spectra 2 and 3) and loads more than one cache line. I attached a file with a working version that can be run natively on Xeon Phi.

The blocking is done by splitting the channel for-loop into one outside the spectra loop and one inside it. I hoped this would be sufficient to create blocked code that takes advantage of the cache, but the performance does not seem to reflect that, since for large numbers of taps (T=64+) the performance is roughly that predicted by the roofline model, which does not take the cache into account. I assume each thread has 512 KB/4 of cache, and lowered that a bit.
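For what it's worth, the sizing arithmetic works out as follows (a sketch of the same formula used in the blocked code, with the ~96 KB per-thread budget taken from the description above):

```c
#include <assert.h>

/* INNER trip count: the number of 16-float channel vectors whose
 * nTaps-deep working set of 4-byte floats fits in a ~96 KB per-thread
 * budget (512 KB L2 shared by 4 hardware threads, lowered a bit).
 * FpR is floats per vector register (16 on Xeon Phi). */
static int inner_trip_count(int nTaps, int FpR)
{
    return (96 * 1024 / (4 * nTaps)) / FpR;
}
```

For nTaps=16 and FpR=16 this gives INNER=96, i.e. 96*16=1536 channels per block, and a working set of 1536 channels * 16 taps * 4 B = 96 KB, right at the budget.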

Blocked simplified code:

```c
int FpR = 16; // floats per cache line and floats per VPU register
int REG = nChannels/FpR;
int INNER = (96*1024/(4*nTaps))/FpR; // number of channel vectors per cache block
int OUTER = REG/INNER;
// this is the ideal case where REG is a multiple of INNER
// INNER is the number of channels processed in one loop

#pragma omp parallel shared(input_data,spectra,coeff) private(REG,i,o,blt,th_id,block_step,bl,t,c,i_data,i_coeff,i_spectra)
{
  for (o = 0; o < OUTER; o++) {
    for (blt = 0; blt < block_step; blt++) {
      for (i = 0; i < INNER; i++) {
        c = i + o*INNER;
        i_spectra = _mm512_setzero_ps();
        for (t = 0; t < nTaps; t++) {
          // loads and _mm512_fmadd_ps omitted for brevity
        } // for nTaps

        _mm512_store_ps(&spectra[c*FpR + bl*nChannels], i_spectra);
      } // for INNER
    } // for nSpectra
  } // for OUTER
}
```

I hope this is clearer, and apologies for the previous version of the post, which was buggy.

## Attachments:

(attached file, 8.45 KB)

To jimdempseyatthecove:
>> What are the values for nSpectra, nChannels and nTaps?

In my previous post.

>> If you can describe your data layout...

In my previous post.

>> If you can, provide a stripped down main that generates test data and calls your function in a timed loop.

In the attachment of my previous post.

>> I notice you have a for(o=0; o < OUTER; ++o) loop...

That was regrettably a mistake, sorry. The correct version of the blocked code is in the second code block.

>> Data organization makes a big difference. Correct me...

Yes all lanes are productive.

>> More efficient use of cache can be attained by using stride-1 vector loads (cache line aligned).

Well, my code is strided (sort of) within the taps loop. However, I tried to switch the loops so that the loop order would be for(nSpectra) for(nTaps) for(nChannels), but in that case I need to save the contents of the i_spectra register and load it repeatedly in order to perform the multiply-add operation.

>> On Xeon Phi you have 4 hardware threads per core. The hardware...

Thread parallelization and affinity is mentioned above in my previous post.

As I mentioned above, I tried to switch the for loops. Reorganising the data is not cost effective.

>> Due to you having block_step=nSpectra/nThreads, you've chosen the number of blocks==nThreads. Therefore, the optimal loop nesting order will depend on the size relationship between block_step and nTaps.

I do not know what you mean by that.

Ignac,

I am in the process of upgrading my software (MPSS 3.2.3 and a newer Parallel Studio XE). Hopefully the upgrade will be completed today; then I can take a closer look at your code.

Jim Dempsey