Hi all,

I'm trying to implement FIR filter (convolution) on XeonPhi. So what I mean by FIR filter (convolution). Let's have a signal x[i] where [] means that it is discrete. Let's have also some coefficients a[i]. The FIR filter is

y[i]=SUM from n=0 to n=T of (x[i-n]*a[n])

, where T is number of taps. This is FIR filter on single frequency channel. The actual filter is multi-channel. This means that if I have N channels and if I'm computing one channel after another I access data like this:

|*------N-1 numbers-----|*------N-1 numbers-----|*------N-1 numbers-----|...

^ ^ ^

Begining second x third x ...

(first x accessed) stride N

The coefficients are are distinct for each tab and each channel, so they have size Channels*Taps. If I'm performing repeated FIR filter, I'm using same coefficients.

Performance:

So far I've been able to achieve only F=150GFlop/s duration of the calculation is t=0.078s, this is for 2GB of data. It is not ideal to measure bandwidth bound code by flops, but I don't have bandwidth results. For Xeon Phi I'm using a modified code from CPU which on two Xeon E5-2650 does F=79GFlops in t=0.147s. Also If I increase number of taps (thus increasing data reuse) the Xeon Phi performance decreasing in contrast with CPU code where performance is increasing. I know this is cache effect so I produced, what I believe to be blocked code, however this code is not performing well.

Taps/performance

Phi: 8/111 12/150 16/62 20/48 24/48 28/49

CPU: 8/76 12/80 16/110 20/98 24/98 28/102

Do you have any suggestions how to improve the code? Is my code faulty in some way?

Here is simplified code:

INNER=(96*1024/(4*nTaps)); //this should give number of channels that would fit into 100kB cache OUTER=nChannels/INNER; REMAINDER=REG-OUTER*INNER; th_id = omp_get_thread_num(); nThreads = omp_get_num_threads(); block_step=nSpectra/nThreads; for(o=0;o<OUTER;o++){ for(blt=0;blt<block_step;blt++){ bl=blt*nThreads+th_id; for(i=0;i<INNER;i++){ i_spectra[0]=_mm512_setzero_ps();i_spectra[1]=_mm512_setzero_ps(); i_spectra[2]=_mm512_setzero_ps();i_spectra[3]=_mm512_setzero_ps(); for(t=0;t<nTaps;t++){ i_coeff[0]=_mm512_load_ps(&coeff[(RpCl*c)*FpR+t*nChannels]); i_coeff[1]=_mm512_load_ps(&coeff[(RpCl*c+1)*FpR+t*nChannels]); i_data[0]=_mm512_load_ps(&input_data[(t+2*bl)*nChannels+(RpCl*c)*FpR]); i_data[1]=_mm512_load_ps(&input_data[(t+2*bl)*nChannels+(RpCl*c+1)*FpR]); i_spectra[0]=_mm512_fmadd_ps(i_coeff[0],i_data[0],i_spectra[0]); i_spectra[1]=_mm512_fmadd_ps(i_coeff[1],i_data[1],i_spectra[1]); i_data[0]=_mm512_load_ps(&input_data[(t+2*bl+1)*nChannels+(RpCl*c)*FpR]); i_data[1]=_mm512_load_ps(&input_data[(t+2*bl+1)*nChannels+(RpCl*c+1)*FpR]); i_spectra[2]=_mm512_fmadd_ps(i_coeff[0],i_data[0],i_spectra[2]); i_spectra[3]=_mm512_fmadd_ps(i_coeff[1],i_data[1],i_spectra[3]); } // for nTaps _mm512_store_ps(&spectra[(RpCl*c)*FpR + 2*bl*nChannels],i_spectra[0]); _mm512_store_ps(&spectra[(RpCl*c+1)*FpR + 2*bl*nChannels],i_spectra[1]); _mm512_store_ps(&spectra[(RpCl*c)*FpR + (2*bl+1)*nChannels],i_spectra[2]); _mm512_store_ps(&spectra[(RpCl*c+1)*FpR + (2*bl+1)*nChannels],i_spectra[3]); }// for INNER }// for nSpectra }// for OUTER