Intel® Streaming SIMD Extensions

Vector programming. SSE4.2 to AVX2 conversion examples.

In this blog I’ll try to show how to convert SSE4.2 assembly to AVX2 (using the schemes from the blog Programming using AVX2) and how this affects performance.

  • Easy case: when it is enough to add the “v” prefix and replace “xmm” with “ymm”.

Consider we have the following loop:

Interpreting Intel SDE avx/sse transition tracker

Hello, I am running Intel SDE in 'ast' mode (the AVX/SSE transition tracker) on Mac OS X, and I am struggling to interpret the results.

First off, in the output, it shows function addresses, not function names. Should it not show the symbols? I built my app with -g.

Next, this is the output I see: are these numbers indicative of excessive transitions? Or are they in a normal range?

_mm256_blend_epi16 doesn't work as documented

The documentation for _mm256_blend_epi16 doesn't indicate that it operates on individual 128-bit channels, but this is the behavior I am seeing. Is this the correct behavior? Below is reproducer code showing the behavior of _mm256_blend_epi16 and _mm256_blend_epi32, where I attempt to insert a value into the first position of a vector using the blend instruction.
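The behavior described in this question can be modeled in plain C. This is a minimal scalar sketch, not the poster's reproducer: the helper names `blend_epi16_model` and `blend_epi32_model` are hypothetical. It assumes the semantics given in Intel's instruction reference, where the 8-bit immediate supplies only eight control bits. With sixteen 16-bit elements those bits are reused for the upper 128-bit half, whereas with eight 32-bit elements each element gets its own bit.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 256-bit blend over sixteen 16-bit elements.
   Element j takes b[j] when bit (j % 8) of imm8 is set, otherwise a[j],
   so the same eight control bits apply to both 128-bit halves. */
void blend_epi16_model(const uint16_t a[16], const uint16_t b[16],
                       uint8_t imm8, uint16_t dst[16])
{
    assert(a && b && dst);
    for (int j = 0; j < 16; ++j) {
        int take_b = (imm8 >> (j % 8)) & 1;
        dst[j] = take_b ? b[j] : a[j];
    }
}

/* Scalar model of a 256-bit blend over eight 32-bit elements.
   There are eight elements and eight control bits, so element j is
   controlled by bit j alone and the halves are independent. */
void blend_epi32_model(const uint32_t a[8], const uint32_t b[8],
                       uint8_t imm8, uint32_t dst[8])
{
    assert(a && b && dst);
    for (int j = 0; j < 8; ++j) {
        int take_b = (imm8 >> j) & 1;
        dst[j] = take_b ? b[j] : a[j];
    }
}
```

Under this model, an immediate of 0x01 replaces elements 0 and 8 in the 16-bit case but only element 0 in the 32-bit case, which would explain why an attempt to insert a value into just the first position behaves differently for the two intrinsics.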

Speedup with bulk/burst/coupled streaming write?

Hello everyone,

I have a very simple question. At least, I hope it is really simple.

As I have read, and already tried myself, bulk (coupled) streaming reads/writes should give anywhere from a modest to a significant speedup.

After some more profiling, I found a very small, older method in our software that takes too much time in my opinion. Most of the time is spent in the last instruction, which writes the data. To preempt a likely follow-up question: by design there is no guarantee that the destination memory fits in any cache, or that it has not already been evicted from the cache, so there really are some access penalties.
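For readers unfamiliar with streaming writes, here is a minimal sketch of what a non-temporal store loop can look like with SSE2 intrinsics. It is not the method from the post: the helper name `copy_streaming` is hypothetical, and the sketch assumes a 16-byte-aligned destination and a length that is a multiple of 16. Whether this helps at all depends on the access pattern, so it should be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy n bytes from src to dst using non-temporal
   (streaming) stores where SSE2 is available.
   Assumptions: dst is 16-byte aligned and n is a multiple of 16. */
void copy_streaming(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert((n % 16) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i) {
        __m128i v = _mm_loadu_si128(s + i); /* source may be unaligned */
        _mm_stream_si128(d + i, v);         /* store with a non-temporal hint */
    }
    _mm_sfence(); /* order the streamed stores before any later stores */
#else
    memcpy(dst, src, n); /* portable fallback when SSE2 is not enabled */
#endif
}
```

Streaming stores are intended for data that will not be read again soon, since they avoid filling the cache with the destination lines. If the written data is consumed shortly afterwards, ordinary stores may well be faster.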

PCIe Root Complex and the PCH

Hello All,

First of all, sorry this is not in the appropriate forum but I was directed to post this here.

I have a question that's been bugging me regarding the PCIe Root Complex and the PCH and I'm hoping someone will be able to help clear things up a bit.

I've always presumed that the PCIe Root Complex was a combination of the CPU and the PCH as they both contain PCIe Root Ports, thereby connecting PCIe devices to CPU/memory. 

Early indicators of AVX512 performance on Skylake?

Hi all,

Looking ahead, what can we expect from the first generation of AVX512 on the desktop - or when should we expect an announcement?

In the past:

- The first generations of SSE CPUs didn't have a full-width engine; they broke 128-bit SSE operations into two 64-bit uOps

- The first AVX CPUs (Sandy Bridge / Ivy Bridge) needed two cycles for an AVX store - the L1 cache didn't have the bandwidth to perform a store in one cycle

So what I'd like to know is:

- Will the AVX512 desktop CPUs be able to handle a full-width L1 load and store per cycle?

pmovzxbd using memory operands

Is there a way to use pmovzxbd with a memory operand from intrinsics? Currently I have either

_mm_cvtepu8_epi32(_mm_cvtsi32_si128(ptr[offset])); //(movd)

_mm_cvtepu8_epi32(_mm_insert_epi32(_mm_setzero_si128(),ptr[offset],0));  //(pinsrd)

The movd or pinsrd should not be needed; in assembly I can write something like


pmovzxbd xmm0,[rax+rdx*4]


Is there a way I can make this call using intrinsics instead of assembly?
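One common way to express this with intrinsics is sketched below. The helper name `widen_u8x4_to_u32x4` is hypothetical, and the sketch falls back to a scalar loop when SSE4.1 is not enabled at compile time. An optimizing compiler may fold the 4-byte load into the memory operand of pmovzxbd, but that is an assumption about code generation rather than a guarantee, so it is worth checking the emitted assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Hypothetical helper: zero-extend the four bytes at p into four
   32-bit values written to out. */
void widen_u8x4_to_u32x4(const uint8_t *p, uint32_t out[4])
{
    assert(p && out);
#if defined(__SSE4_1__)
    int32_t bits;
    memcpy(&bits, p, sizeof bits);         /* alias-safe 4-byte load */
    __m128i v = _mm_cvtsi32_si128(bits);   /* movd, unless the compiler folds it */
    __m128i w = _mm_cvtepu8_epi32(v);      /* pmovzxbd */
    _mm_storeu_si128((__m128i *)out, w);   /* unaligned 16-byte store */
#else
    for (int i = 0; i < 4; ++i)            /* scalar fallback without SSE4.1 */
        out[i] = p[i];
#endif
}
```

The memcpy avoids strict-aliasing problems when reading four bytes as an integer, and compilers typically reduce it to a single load.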
