Why is my AVX slower than SSE?

According to the article "IIR Gaussian Blur Filter Implementation using Intel® Advanced Vector Extensions", the AVX version should be faster than the SSE version. However, my performance measurements are as follows:

 The computer supports AVX
number CPU in the system = 4

 IIR Gaussian Filter Coefficients are:
a0 = 0.021175, a1 = -0.017807, a2 = 0.021103, a3 = -0.017875, b1 = -1.837578, b2 = 0.844174, cprev = 0.510583, cnext = 0.489409

image width = 1024, height = 1024

Running multi threaded SSE code

Running multi threaded AVX code

Extract non-zero bytes from __m128i


I have four __m128i values (64 bytes in total) whose elements can be zero or non-zero (positive or negative). I want to extract the non-zero values from them.

I looked at _mm_extract_epi8/_mm_extract_epi16, but the signature is int _mm_extract_epi16 (__m128i a, int imm), where imm is the index, so I would have to loop to get the non-zero values.

Are there any intrinsics that would let me avoid the loop? Any input is appreciated.
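
For illustration, here is a minimal sketch of one common approach, assuming SSE2 and byte-sized elements. It compares against zero with _mm_cmpeq_epi8, turns the result into a bitmask with _mm_movemask_epi8, and then visits only the set bits. The helper name extract_nonzero_bytes and the output buffer are hypothetical, and __builtin_ctz is a GCC/Clang builtin (MSVC would need _BitScanForward instead). This does not remove the loop entirely; it only shortens it to one iteration per non-zero byte.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the post): copies the non-zero bytes of v
   into out, in element order, and returns how many bytes were written.
   out must have room for up to 16 bytes. */
static size_t extract_nonzero_bytes(__m128i v, int8_t *out)
{
    assert(out != NULL);

    /* 0xFF in every byte lane that equals zero, 0x00 elsewhere. */
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());

    /* One bit per byte lane; invert so that set bits mark non-zero bytes. */
    unsigned mask = ~(unsigned)_mm_movemask_epi8(is_zero) & 0xFFFFu;

    /* Spill the vector so individual bytes can be read by index. */
    int8_t bytes[16];
    _mm_storeu_si128((__m128i *)bytes, v);

    size_t count = 0;
    while (mask != 0) {
        unsigned idx = (unsigned)__builtin_ctz(mask); /* lowest set bit */
        out[count++] = bytes[idx];
        mask &= mask - 1;                             /* clear that bit */
    }
    return count;
}
```

Calling this once per __m128i and appending the results would cover all four registers. A fully loop-free packing is usually built on _mm_shuffle_epi8 (SSSE3) with a lookup table indexed by the mask, which is not shown here.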


Is it OK to create an array of __m256i?

Hi all! I am parallelizing a certain dynamic programming problem using AVX2.
In the main iteration of my calculation, I compute a column of a matrix in which each cell is an AVX2 register (__m256i). I use the values from the previous matrix column as input for calculating the current column. Columns can be big, so I keep an array of __m256i values, like this: __m256i prevColumn[N].

Cache and _mm_prefetch


I have some code in which I iterate over an array in reverse order. I already use SSE or AVX (depending on what the CPU supports). Normally the CPU's prefetching should work fine if I iterate over an array from beginning to end. But what about iterating from end to beginning, i.e. in reverse? Does the CPU recognize this pattern?

Or should I give hints with _mm_prefetch? If so, how do I use this intrinsic? Should I always specify L1 as the cache level? And how many iterations ahead should I prefetch data?
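
To make the question concrete, here is a minimal sketch of what a software prefetch in a reverse loop could look like. _mm_prefetch and _MM_HINT_T0 (fetch into all cache levels, including L1) come from xmmintrin.h. The function name sum_reverse and the PREFETCH_DISTANCE of 64 elements are assumptions made for illustration, not recommended values; the right distance would have to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Assumed tuning value, in elements: how far ahead of the current
   position (towards the start of the array) to prefetch. */
#define PREFETCH_DISTANCE 64

/* Hypothetical example: sums an array from the last element to the first,
   hinting the cache about data that will be needed a few iterations later. */
static float sum_reverse(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    float sum = 0.0f;
    for (size_t i = n; i-- > 0; ) {
        if (i >= PREFETCH_DISTANCE) {
            /* A prefetch is only a hint; it does not change the result. */
            _mm_prefetch((const char *)&data[i - PREFETCH_DISTANCE],
                         _MM_HINT_T0);
        }
        sum += data[i];
    }
    return sum;
}
```

This sketch issues a prefetch on every iteration to keep the code short. Since a prefetch fetches a whole cache line, one hint per cache line would be enough in real code.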

Intel SDE and VS2013


I am trying to use the Intel® Software Development Emulator with Visual Studio 2013, but I am having trouble.

Trying to start a debug session with the SDE debugger fails with a message saying that my program (the Visual Studio project) could not be launched because of a missing component DLL. Ctrl+F5 starts the program, but it crashes soon afterwards.

Running the SDE tool with my executables from the command line works fine. But one of my shuffles seems to be wrong, so I would like to inspect the vector registers at a certain point. Any ideas? Or is VS2013 not supported?

Haswell and crosslan


I wrote code for integral image computation with SSE, and it works quite well. But I have serious problems making use of AVX/AVX2. I run my code on an i5-4460.

The basis is this: for the integral image I need a row sum, which is not ideal for vector units but can be done with shuffles and adds. As a second step, I need to broadcast the last element to all elements. This can be done with a shuffle.
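
For reference, the two SSE steps described above might look roughly like the sketch below, assuming 32-bit float data and that "row sum" means a running (prefix) sum along the row. The helper names are hypothetical, and the byte shift _mm_slli_si128 stands in for the shuffle the post mentions, so this is not necessarily the original code.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: inclusive prefix sum of the 4 floats in x. */
static __m128 prefix_sum4(__m128 x)
{
    /* [a, b, c, d] + [0, a, b, c] = [a, a+b, b+c, c+d] */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    /* ... + [0, 0, a, a+b] = [a, a+b, a+b+c, a+b+c+d] */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

/* Hypothetical helper: copies the last (highest) element to all 4 lanes. */
static __m128 broadcast_last(__m128 x)
{
    return _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 3, 3));
}

/* Running sum over one row; n is assumed to be a multiple of 4. */
static void row_prefix_sum(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);

    __m128 carry = _mm_setzero_ps();  /* total of everything before this block */
    for (size_t i = 0; i < n; i += 4) {
        __m128 s = _mm_add_ps(prefix_sum4(_mm_loadu_ps(in + i)), carry);
        _mm_storeu_ps(out + i, s);
        carry = broadcast_last(s);    /* second step: broadcast the last element */
    }
}
```

The broadcast is what carries the running total from one 4-element block into the next, which is why a shuffle that reaches the last element from every lane is needed.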

Now with AVX, there is no full-width shuffle for 32-bit elements, but I can achieve the same with a normal shuffle plus _mm256_permute2f128_ps.

