Intel ISA Extensions

MPX instructions not in the Appendix A opcode map


In the latest release (55) of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 2C, page A-11, the MPX instructions do not appear. I usually rely on the opcode maps to find instruction encodings. I am not sure whether this forum is the right place to report typos like this; just tell me if it is not.



small typo in Intel® 64 and IA-32 Architectures Software Developer’s Manual


There seems to be a small typo in the Intel® 64 and IA-32 Architectures Software Developer’s Manual (Order Number: 253665-054US, April 2015), page 3-149 (CMPSS instruction):

128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The second source operand (second operand) can be an XMM register or 64-bit memory location.

It should be 32-bit memory location.
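
To illustrate why a 32-bit memory operand is the expected size, here is a minimal sketch using the SSE intrinsics `_mm_load_ss` and `_mm_cmplt_ss`. The helper name `cmplt_low_float` is hypothetical, and whether the compiler folds the load into a CMPSS memory operand is up to the compiler; the point is only that the scalar compare consumes a single 4-byte float.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: compare only the low float of a with the float at *p
   using the scalar "less than" compare (CMPSS).
   Lane 0 of the result is all-ones or all-zeros; lanes 1..3 are copied from a. */
static __m128 cmplt_low_float(__m128 a, const float *p)
{
    __m128 b = _mm_load_ss(p);   /* reads exactly 4 bytes, upper lanes are zeroed */
    return _mm_cmplt_ss(a, b);   /* only lane 0 takes part in the comparison */
}
```

Only lane 0 of the result depends on the memory operand, which is consistent with a 32-bit (not 64-bit) source.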




Guaranteed atomic operation clarification


I'm trying to understand a line in the Intel Architecture manual. It's a description of a memory operation that is guaranteed to be atomic.

The line is at Chapter 8, Section 8.1.1 "Guaranteed Atomic Operations", second bullet list, second item:
>16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The way I interpret this (which must be wrong) is: Accesses to 16-bit regions of memory that are not currently cached and that fit within a data bus that transfers 32-bit values.

APIC dropping MSI-X interrupts

Hello, I have a difficult problem. The scenario is as follows:

The hardware environment is an Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz with an Altera FPGA board.

The OS is Linux debian-rss 3.16.7-ckt7.

The FPGA issues 32 DMA transfers to the CPU and generates one interrupt per transfer.

These 32 interrupts are distributed across 8 different MSI-X IRQs.

According to the APIC spec, for each interrupt vector one interrupt can be in the ISR and one in the IRR; a third may be dropped.

But now I distribute only 2 interrupts to each IRQ, so why might interrupts still be dropped?

Dynamic Shift


I am trying to achieve a dynamic shift. Let me explain the task. I process data with SSE and AVX. Data is loaded, processed, and the results are stored later. To support arbitrary lengths, I need some kind of maskload, but for SSE as well.

Suppose my length is 9 elements, and I work with int32 and SSE. The first and second loads are fine. The third load is also fine with respect to memory bounds, so that is no problem. But only element 0 of the vector register is valid; the others need to be zero. How do I best achieve this?
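
One possible approach, sketched under the assumption stated in the question that the full 16-byte read is safe: load an unaligned mask from a small sliding-window table and AND it with the loaded data. The names `tail_mask_table` and `load_tail_epi32` are made up for this illustration; only SSE2 intrinsics are used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sliding-window table: four all-ones lanes followed by four zero lanes. */
static const int32_t tail_mask_table[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };

/* Hypothetical helper: load 4 int32 values from p and zero every lane whose
   index is >= valid (0 <= valid <= 4).
   Assumes reading 16 bytes at p does not cross into unmapped memory. */
static __m128i load_tail_epi32(const int32_t *p, int valid)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    /* Starting at offset 4 - valid yields `valid` ones followed by zeros. */
    __m128i mask = _mm_loadu_si128((const __m128i *)(tail_mask_table + 4 - valid));
    return _mm_and_si128(v, mask);
}
```

For the 9-element case, the third load would be `load_tail_epi32(data + 8, 1)`, which keeps element 0 and zeros the other three lanes. The same table idea extends to AVX with a 16-entry table, although AVX also offers maskload instructions directly.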

Why is my AVX slower than SSE?

According to the article "IIR Gaussian Blur Filter Implementation using Intel® Advanced Vector Extensions", AVX should be faster than SSE. But my performance measurements are as follows:

 The computer supports AVX
number CPU in the system = 4

 IIR Gaussian Filter Coefficients are:
a0 = 0.021175, a1 = -0.017807, a2 = 0.021103, a3 = -0.017875, b1 = -1.837578, b2 = 0.844174, cprev = 0.510583, cnext = 0.489409

image width = 1024, height = 1024

Running multi threaded SSE code

Running multi threaded AVX code

Extract non-zero bytes from __m128i


I have 4 __m128i values (64 bytes in total) whose elements can be zero or non-zero (positive or negative). I want to extract the non-zero values from them.

I looked at _mm_extract_epi8/_mm_extract_epi16, but the signature is int _mm_extract_epi16 (__m128i a, int imm), where imm is the index, so I would have to loop to get the non-zero values.

Any intrinsic functions that avoid the loop would be helpful. Input appreciated.
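
As a starting point, here is a sketch that finds all non-zero bytes of one vector in a single step with `_mm_cmpeq_epi8` and `_mm_movemask_epi8` (both SSE2). It assumes byte-sized elements, and the helper names `nonzero_mask_epi8` and `compact_nonzero_epi8` are invented for this example. The compaction step still uses a short scalar loop; a fully vectorized left-pack would need something like a shuffle lookup table, which is not shown here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: bit i of the result is set when byte i of v is non-zero. */
static unsigned nonzero_mask_epi8(__m128i v)
{
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* movemask collects the top bit of each byte; invert to mark non-zero bytes. */
    return ~(unsigned)_mm_movemask_epi8(is_zero) & 0xFFFFu;
}

/* Hypothetical helper: copy the non-zero bytes of v to out (room for 16 bytes)
   in their original order and return how many were written. */
static int compact_nonzero_epi8(__m128i v, int8_t *out)
{
    int8_t bytes[16];
    unsigned mask = nonzero_mask_epi8(v);
    int count = 0;
    _mm_storeu_si128((__m128i *)bytes, v);
    /* The loop ends as soon as no set bits remain in the mask. */
    for (int i = 0; mask != 0; ++i, mask >>= 1) {
        if (mask & 1u)
            out[count++] = bytes[i];
    }
    return count;
}
```

For four vectors, the same helper can be called once per vector while advancing `out` by the returned count.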

