Intel ISA Extensions

Looking for smartest way to insert a DWORD into AVX register

Hi all,

I'm looking for the smartest (= fastest) way to insert a DWORD into an AVX register.

Here is what I found so far:

AVX vinsertps doesn't work because it clears the upper 128 bits, and the immediate operand can't address the upper 128 bits anyway.

AVX vpinsrd doesn't work for the same reason, and, truly sad unless the docs are wrong, it hasn't been promoted in AVX2 either, even though the immediate operand has room to encode an insert position within a 256-bit vector.
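A workaround sketch, assuming AVX2 and a runtime element index: detour through whichever 128-bit lane holds the target element, since pinsrd and vinserti128 can address it there (the function name and structure are mine, not from the post; both instructions take immediates, hence the switches):

```c
#include <immintrin.h>

/* Insert a 32-bit value into element `idx` (0..7) of a 256-bit vector
 * by going through the addressable 128-bit lane. Assumes AVX2. */
__attribute__((target("avx2")))
__m256i insert_dword(__m256i v, int x, int idx)
{
    if (idx < 4) {
        __m128i lo = _mm256_castsi256_si128(v);
        switch (idx & 3) {          /* pinsrd needs an immediate */
        case 0:  lo = _mm_insert_epi32(lo, x, 0); break;
        case 1:  lo = _mm_insert_epi32(lo, x, 1); break;
        case 2:  lo = _mm_insert_epi32(lo, x, 2); break;
        default: lo = _mm_insert_epi32(lo, x, 3); break;
        }
        return _mm256_inserti128_si256(v, lo, 0);
    } else {
        __m128i hi = _mm256_extracti128_si256(v, 1);
        switch (idx & 3) {
        case 0:  hi = _mm_insert_epi32(hi, x, 0); break;
        case 1:  hi = _mm_insert_epi32(hi, x, 1); break;
        case 2:  hi = _mm_insert_epi32(hi, x, 2); break;
        default: hi = _mm_insert_epi32(hi, x, 3); break;
        }
        return _mm256_inserti128_si256(v, hi, 1);
    }
}

/* Demo: returns 1 if inserting 42 at position 5 behaves as expected. */
__attribute__((target("avx2")))
int insert_dword_demo(void)
{
    int out[8];
    __m256i v = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    _mm256_storeu_si256((__m256i *)out, insert_dword(v, 42, 5));
    return out[5] == 42 && out[0] == 0 && out[4] == 4 && out[7] == 7;
}
```

If the index is a compile-time constant, the compiler can of course fold all of this down to the two or three instructions of the taken path.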

Simple question about single and double float terminology

On a 32-bit system, a program loop using double-precision floating-point arithmetic takes the same time as one using single precision. The double-float calculations are done in hardware, as opposed to some sort of software emulation, as is done on most GPUs. The GPU takes more than twice as long to process a loop of doubles as it does a loop of singles.

Please exclude all thought of SSE or AVX registers or calculations for the moment.
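To make the comparison concrete, here is a minimal sketch of the two scalar loops being compared (illustrative code, not from the original post): the same reduction in single and in double precision, with one multiply and one add per element.

```c
#include <stddef.h>

/* Single-precision version of the loop. */
float sum_f(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * 1.0000001f;   /* one mul + one add per element */
    return s;
}

/* Double-precision version: same operation count, wider operands. */
double sum_d(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * 1.0000001;
    return s;
}
```

On the CPU both variants execute in floating-point hardware, so per-element latency is comparable; any difference comes mainly from the doubled memory traffic, not from emulation.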

How do you move a 128-bit value into both lanes of a 256-bit register?

I have yet to find the optimal way. There are two and a half solutions, none of them flawless.

1. vbroadcasti128 ymm0, xmmword ptr [...]

The second operand is a memory operand, which is perfect for loading global constants while saving 16 bytes, but not if you already have the source in an xmm register. Why is there no register-to-register form? Even the intrinsic takes a value type, which the compiler has to spill first just to reload it. Crazy.

2. cast xmm0 to ymm0, vinserti128 ymm0, ymm0, xmm0, 1
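The register-to-register forms can be sketched with intrinsics as follows (assumes AVX2 for the integer variants; the function names are mine). The second is the "half" solution: vperm2i128 with imm8 = 0x00 selects the low lane of the first source for both destination lanes.

```c
#include <immintrin.h>

/* Option 2 from the post: cast, then vinserti128 into the high lane. */
__attribute__((target("avx2")))
__m256i bcast128_insert(__m128i x)
{
    return _mm256_inserti128_si256(_mm256_castsi128_si256(x), x, 1);
}

/* Alternative: vperm2i128, imm8 = 0x00 duplicates src1's low 128 bits. */
__attribute__((target("avx2")))
__m256i bcast128_perm(__m128i x)
{
    __m256i t = _mm256_castsi128_si256(x);
    return _mm256_permute2x128_si256(t, t, 0x00);
}

/* Demo: both forms must yield the pattern {0,1,2,3,0,1,2,3}. */
__attribute__((target("avx2")))
int bcast128_demo(void)
{
    int a[8], b[8];
    __m128i x = _mm_setr_epi32(0, 1, 2, 3);
    _mm256_storeu_si256((__m256i *)a, bcast128_insert(x));
    _mm256_storeu_si256((__m256i *)b, bcast128_perm(x));
    for (int i = 0; i < 8; i++)
        if (a[i] != (i & 3) || b[i] != (i & 3))
            return 0;
    return 1;
}
```

Note that the undefined upper lane left by the cast is harmless in both forms, since vinserti128 overwrites it and the 0x00 selector never reads it.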

Function Vectorization

Recently I tried to vectorize some code with the simd pragma, and in the Intel VTune report I see that almost 20% of CPU time is in the "__svml_sincos4_e9" function, which is apparently the vectorized version of the trigonometric functions. My question is: why does this function take so much time, when the non-vectorized version takes less than 1% of CPU time?

I'm using Intel C++ 13.3 with the -xAVX and -axAVX flags.

Mixing AVX and MMX code

Dear all,

I hope this hasn't been asked before, but I couldn't find a way to search the forum.

In high-performance code I'm using MMX and SSE together, since this gives me 8 additional very valuable registers. Looking at the AVX docs, this seems no longer possible with AVX code, since the MMX-related SSE instructions have not been promoted with a VEX prefix, and are therefore legacy instructions which I may no longer use (or face the deadly mixing penalty that requires VZEROUPPER etc.).
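For completeness, a sketch of how the two worlds can still coexist if the MMX work and the AVX work are batched instead of interleaved (the function name and values are mine; the EMMS/VZEROUPPER placement follows the manuals' transition rules):

```c
#include <immintrin.h>
#include <mmintrin.h>

__attribute__((target("avx")))
float mixed_phase_demo(void)
{
    /* MMX phase: a 64-bit paddd. */
    __m64 m = _mm_add_pi32(_mm_set_pi32(1, 2), _mm_set_pi32(3, 4));
    int lo = _mm_cvtsi64_si32(m);   /* low DWORD: 2 + 4 = 6 */
    _mm_empty();                    /* EMMS before leaving the MMX phase */

    /* AVX phase, kept strictly after the EMMS. */
    __m256 v = _mm256_set1_ps((float)lo);
    __m256 w = _mm256_add_ps(v, v);
    float out[8];
    _mm256_storeu_ps(out, w);
    _mm256_zeroupper();             /* before any legacy-SSE code runs */
    return out[0];                  /* 6 + 6 = 12.0f */
}
```

This doesn't recover the interleaved register-pressure trick, but it keeps both instruction sets usable in the same function without the transition penalty.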

Use pointer to __m256 or use _mm256_load_ps

Hi

I noticed there are two popular ways of moving data into ymm registers when writing intrinsics. I'll use a simple vector-addition example to clarify my question. Assuming a[], b[], c[] are three aligned memory buffers, I would like to do "c[] = a[] + b[]".

First option, use pointers:

    __m256* vecAp = (__m256*)a;
    __m256* vecBp = (__m256*)b;
    __m256* vecCp = (__m256*)c;

    for (int i = 0; i < ARR_SIZE; i += 8) {
        *vecCp = _mm256_add_ps(*vecAp, *vecBp);
        vecAp++;
        vecBp++;
        vecCp++;
    }
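For reference, the second option alluded to above, with explicit load/store intrinsics (a sketch under the same assumptions: AVX available, 32-byte-aligned buffers, element count a multiple of 8; the function name is mine):

```c
#include <immintrin.h>

/* c[] = a[] + b[] with explicit AVX load/store intrinsics.
 * a, b, c must be 32-byte aligned; n must be a multiple of 8. */
__attribute__((target("avx")))
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   /* aligned 32-byte load */
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(c + i, _mm256_add_ps(va, vb));
    }
}
```

Both forms typically compile to the same vmovaps/vaddps sequence; the load/store form just makes the alignment contract explicit instead of hiding it in the pointer casts.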

CPU Family and Model

Hello everyone,

I'm writing a small program which needs CPU identification, but perhaps I'm doing something wrong.

With CPUID and EAX=01H, I want to get the Family ID and Model ID. On a machine with an X5570 CPU, I get

Family ID = 6 (bits 8 to 11 in EAX after execution)

Model ID = A (bits 4 to 7 in EAX after execution).

But I can't find a DisplayFamily_DisplayModel with this value in the "Developer's Manual".

My question:

How is DisplayFamily_DisplayModel calculated?

Thanks,

Bo
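The computation the SDM describes for CPUID.01H:EAX can be sketched as follows. The raw EAX constant 0x000106A5 in the demo is my assumption for an X5570 (it matches the Family = 6, Model = A fields reported above, plus Extended Model = 1), which decodes to the 06_1AH entry in the manual.

```c
#include <stdint.h>

/* Decode CPUID.01H:EAX per the SDM:
 *   DisplayFamily = Family + ExtendedFamily     if Family == 0FH, else Family
 *   DisplayModel  = (ExtendedModel << 4) + Model if Family is 06H or 0FH,
 *                   else Model                                             */
void decode_signature(uint32_t eax,
                      unsigned *display_family, unsigned *display_model)
{
    unsigned model      = (eax >> 4)  & 0xF;
    unsigned family     = (eax >> 8)  & 0xF;
    unsigned ext_model  = (eax >> 16) & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;

    *display_family = (family == 0xF) ? family + ext_family : family;
    *display_model  = (family == 0x6 || family == 0xF)
                          ? (ext_model << 4) + model
                          : model;
}
```

So the Model ID alone is not enough: for Family 6 the Extended Model field (bits 16 to 19) must be prepended, which is why Model = A becomes DisplayModel = 1AH.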

IB gpr load latency and displacement size

I've been looking at load latency on my Ivy Bridge. I've noted that the optimization guide discusses the effect displacement size has on load latency. So I decided to test that, and as best I can tell you only get 4 clks of load latency if you DON'T have a displacement. As soon as a displacement is added, I observe 1 extra clk of latency, which is counter to what the optimization guide states.
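A sketch of how such a test is commonly structured (my assumptions, not the poster's actual harness): a dependent pointer chase, so each load's address comes from the previous load, and total cycles divided by iterations approximates load-to-use latency. Chasing `p = *(void **)((char *)p + DISP)` with a nonzero DISP exercises the displacement form; the version below uses the plain [base] form.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

#define N 4096

static void *chain[N];

/* Time a dependent pointer chase with the no-displacement addressing
 * form (mov rax, [rax]), the candidate for the 4-clk fast path. */
uint64_t chase_cycles(long iters)
{
    /* Identity chain: each slot points at the next, wrapping around. */
    for (int i = 0; i < N; i++)
        chain[i] = &chain[(i + 1) & (N - 1)];

    void **p = &chain[0];
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        p = (void **)*p;             /* each load depends on the last */
    uint64_t t1 = __rdtsc();

    /* Use p so the compiler cannot delete the chase. */
    return (p == (void **)0) ? 0 : (t1 - t0);
}
```

A real measurement would pin the thread, warm the buffer into L1, and compare this against a variant whose chain requires a displacement in the addressing mode; rdtsc counts reference cycles, so turbo/power states also need to be controlled.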
