# Get _mm_alignr_epi8 functionality on 256-bit vector registers (AVX2)

Hello,

I'm porting an application from SSE to AVX2 and KNC.

# FMA manipulation of register’s content for XMM, YMM and ZMM register sets

Hello, there wasn’t a typical introduction thread, so since it’s my first post I thought I’d introduce myself. My name is Mile (yes, like the unit of measurement) and I’m a student. I’m a noob in this area.

I’m writing a paper for school, and before posting my question(s) here I researched thoroughly online to the best of my abilities, but I didn’t manage to find an answer. After browsing the forum, I decided to post a new topic instead of going off topic in another one.

# gather instructions and the size of indices for a given base gpr size

Hi,

Hi all,

I'm a little puzzled about the generated assembly code for this little piece of Cilk code:

    void gemv(const float* restrict A[4], const float *restrict x, float * restrict y){
        __assume_aligned(y, 32);
        __assume_aligned(x, 32);
        __assume_aligned(A, 32);
        y[0:4]  = A[0:4][0] * x[0];
        y[0:4] += A[0:4][1] * x[1];
        y[0:4] += A[0:4][2] * x[2];
        y[0:4] += A[0:4][3] * x[3];
    }
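
For readers unfamiliar with the array notation, here is a rough plain C sketch of what the function above appears to compute: a 4x4 matrix-vector product, with `A` passed as four row pointers. The name `gemv_scalar` is made up for illustration, and the `restrict` qualifiers and `__assume_aligned` hints are left out, so this shows only the arithmetic, not what the compiler is allowed to assume.

```c
/* Hypothetical scalar equivalent of the array-notation gemv above.
 * A[i] points to row i of a 4x4 matrix; computes y = A * x. */
void gemv_scalar(const float *A[4], const float *x, float *y) {
    for (int i = 0; i < 4; ++i) {
        /* Same accumulation order as the four array-notation statements. */
        float acc = A[i][0] * x[0];
        for (int j = 1; j < 4; ++j) {
            acc += A[i][j] * x[j];
        }
        y[i] = acc;
    }
}
```

Each `y[0:4] += A[0:4][j] * x[j]` statement corresponds to one value of `j` applied across all four rows, which is why a vectorizer can treat it as a column gathered from the row pointers times a broadcast scalar.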

Looking at the generated assembly code:

# Will AVX-512 replace the need for dedicated GPUs?

I do not expect it to replace high-end graphics cards, and it will likely be less power-efficient than a dedicated GPU (integrated or discrete). As far as I can tell, performance-wise it will easily put a CPU on par with a mid-range GPU, which is far above what the majority of people need. A 3 GHz, 4-core Skylake will have 768 GFLOPS (3 GHz * 4 cores * 2 FMA units * 16 single-precision lanes * 2 FLOPs per FMA). The GPU takes up enough die space to allow for 8-core chips, which would double the peak FLOPS. Intel already has the OpenGL and DirectX software renderers from Larrabee.
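
The 768 GFLOPS figure can be reproduced with a small helper. This is only a sketch of the arithmetic: the function name `peak_gflops` is made up, and the inputs (two 512-bit FMA units per core, 16 single-precision lanes each, 3 GHz sustained) are the assumptions behind the estimate above, not confirmed hardware specifications.

```c
/* Hypothetical helper: theoretical peak single-precision GFLOP/s.
 * ghz       - core clock in GHz
 * cores     - number of cores
 * fma_units - FMA units per core (assumed 2 in the estimate above)
 * lanes     - floats per vector (16 for a 512-bit register) */
double peak_gflops(double ghz, int cores, int fma_units, int lanes) {
    const int flops_per_fma = 2; /* one multiply plus one add per lane */
    return ghz * cores * fma_units * lanes * flops_per_fma;
}
```

With these assumptions, `peak_gflops(3.0, 4, 2, 16)` gives 768 and `peak_gflops(3.0, 8, 2, 16)` gives 1536, matching the doubling expected from an 8-core chip. It is an upper bound that ignores memory bandwidth and any clock reduction under wide-vector load.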

# unaligned loads avx-128 vs. -256

I just saw that my cases using _mm256_loadu_ps show better performance than _mm_loadu_ps on corei7-4, where the latter was faster on earlier AVX platforms (in part due to the ability of ICL/icc to compile the SSE intrinsic to AVX-128).

Does this mean that the advice to consider AVX-128 will soon be of only historical value? I'm ready to designate my Westmere and corei7 Linux boxes as historic vehicles.
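
To make the comparison concrete, here is a minimal sketch of the two load strategies: one 256-bit unaligned load per 8 floats, versus two 128-bit unaligned loads combined into a 256-bit register. The function names are made up for illustration, and the `target("avx")` attribute assumes GCC or Clang (other compilers would need an AVX compile flag instead). It is not a benchmark; which variant is faster depends on the CPU, as described above.

```c
#include <immintrin.h>
#include <stddef.h>

/* Horizontal sum of the 8 lanes of a 256-bit accumulator. */
__attribute__((target("avx")))
static float hsum256(__m256 v) {
    float lanes[8];
    float s = 0.0f;
    _mm256_storeu_ps(lanes, v);
    for (int k = 0; k < 8; ++k) s += lanes[k];
    return s;
}

/* Variant 1: a single 256-bit unaligned load per iteration. */
__attribute__((target("avx")))
float sum_loadu256(const float *p, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float s = hsum256(acc);
    for (; i < n; ++i) s += p[i];   /* scalar tail */
    return s;
}

/* Variant 2: two 128-bit unaligned loads merged into one 256-bit value. */
__attribute__((target("avx")))
float sum_loadu128x2(const float *p, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128 lo = _mm_loadu_ps(p + i);
        __m128 hi = _mm_loadu_ps(p + i + 4);
        __m256 v = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
        acc = _mm256_add_ps(acc, v);
    }
    float s = hsum256(acc);
    for (; i < n; ++i) s += p[i];   /* scalar tail */
    return s;
}
```

Both functions return the same result for the same input; timing them over a large, deliberately misaligned buffer is one way to check which load width a given machine prefers.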

# Latest ASM compiler other than Intel C and C++ Compilers

Hi,

I am trying to write my application in assembly to run on x86. Please suggest a suitable compiler, other than the Intel Compiler, that supports all SSE4.2 assembly instructions. Any links that explain the build and execution procedure would be helpful.

# Are there any books about SIMD (SSE, AVX and so on) optimization?

Can someone please recommend a few books on program optimization?

I use multithreading and SIMD to improve the performance of my program.

So far I have learned SIMD from websites and by asking questions online.

Now I want to buy some books to learn from. Are there any books on SIMD? Thanks.

# Instruction set extensions programming reference, revision 17

An updated instruction set extensions programming reference, revision 17, has been posted here.

• Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions
• Intel® Secure Hash Algorithm (Intel® SHA) extensions
• Intel® Memory Protection Extensions (Intel® MPX)