Intel ISA Extensions

Bloated Instruction counts in SDE as compared with that from HW PMC 0xC0

I have noticed on multiple occasions (infrequent, but frequent enough) that the instruction counts for the execution of a binary reported by SDE and by PMC 0xC0 differ by orders of magnitude.  I just ran a version of hmmer compiled with Intel 14.0, and SDE (v5.38, run with sde -mix -top_blocks 3000 on a Haswell system) reports that hmmer took 60 trillion instructions to execute.  I know that number is bogus, since with Open64 it took only 1.05 trillion instructions to execute, as reported by the PMCs.

AVX Optimizations and Performance: VisualStudio vs GCC


I recently wrote some code that uses AVX function calls to perform a convolution in my software. I compiled and ran this code on two platforms, with the following notable compilation settings:

1. Windows 7 w/ Visual Studio 2010 on an i7-2760QM

   Optimization: Maximize Speed (/O2)

   Inline Function Expansion: Only __inline (/Ob1)

   Enable Intrinsic Functions: No

   Favor Size or Speed: Favor fast code (/Ot)

2. Fedora Linux 15 w/ gcc 4.6 on an i7-3612QE

_mm_load_ps generates VMOVUPS

Hi all,

I've tested the following case using Intel XE Compiler 2011.3 and 2013.4.

I have a question. Let's take a very basic SSE function:

void test1(float * pool)
{
    __m128 v = _mm_load_ps(pool);
    __m128 a = _mm_load_ps(pool + 8);
    _mm_store_ps(pool + 16, _mm_add_ps(v, a));
    printf("test1: %g\n", pool[16]);
}

If I compile it without specific flags, I get the expected SSE code, with aligned loads (explicit for pool, implicit for pool + 20h) and an aligned store (pool + 40h):
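The byte offsets quoted above (20h and 40h) follow from the float indices used in test1, since each float is assumed to occupy 4 bytes. A minimal sketch that checks this arithmetic and the 16-byte alignment that _mm_load_ps and _mm_store_ps require; the helper names float_byte_offset and is_aligned16 are hypothetical and not part of the original post:

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of element `index` in a float array (hypothetical helper). */
size_t float_byte_offset(size_t index)
{
    return index * sizeof(float);
}

/* Nonzero if `p` sits on a 16-byte boundary, which the aligned SSE
 * load/store intrinsics (_mm_load_ps, _mm_store_ps) require. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

With 4-byte floats, pool + 8 is 32 bytes (20h) and pool + 16 is 64 bytes (40h) past pool, so if pool itself is 16-byte aligned, all three addresses are.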

Is _mm256_load_ps slower than _mm_load_ps?

I tried to improve the performance of some simple code with SSE and AVX, but I found that the AVX code needs more time than the SSE code:

void testfun()
    int dataLen = 4800;
    int N = 10000000;
    float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));

Gather of byte/word with avx2

We have some SSE code that is effectively trying to do 3D texturing from a volume dataset. Our datasets often have only 16-bits of information per voxel. Our inner loop calculates 8 indices into the volume data array and grabs 8 voxels based on those indices. I thought the new gather instructions would be ideal for this, but unfortunately they do not support loading sixteen bit quantities.
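One possible workaround, sketched here under stated assumptions rather than taken from the thread: gather 32-bit values with a byte scale of 2, then mask off the upper 16 bits of each lane. This assumes a little-endian layout (as on x86), non-negative indices, and at least 2 bytes of padding after the last voxel, because each 32-bit gather reads one element past the requested one. The function name gather_u16x8 is hypothetical; without AVX2 enabled at compile time the sketch falls back to a plain loop.

```c
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Loads base[idx[i]] for i = 0..7 and zero-extends each 16-bit value to
 * 32 bits. Assumption: the array has at least one extra element of
 * padding, since the AVX2 path reads 4 bytes per index. */
void gather_u16x8(const uint16_t *base, const int32_t idx[8], uint32_t out[8])
{
#ifdef __AVX2__
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 2: each index is multiplied by sizeof(uint16_t) to form a byte offset. */
    __m256i v = _mm256_i32gather_epi32((const int *)base, vidx, 2);
    /* Keep only the low 16 bits of each 32-bit lane (the requested voxel). */
    v = _mm256_and_si256(v, _mm256_set1_epi32(0xFFFF));
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 8; ++i)
        out[i] = base[idx[i]];
#endif
}
```

Whether this beats eight scalar loads depends on the microarchitecture, so it would need to be measured on the target hardware.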

Unable to activate the SSE instruction set by adding compile flag “march=native” in gcc

My machine has a Core2 microarchitecture CPU, and I am trying to compile some arithmetic code using the SSE instruction set. From searching the web and the official manual, all I should need to do (in the simplest case) is add the flag march=native, because my chip supports SSE. But when I use "gcc -march=native -Q --help=target -v" to check whether the flag really works, the results displayed on the screen are not what I expected:

-msse [disabled]

-msse2 [disabled]

-msse2avx [disabled]

-msse3 [disabled]

-msse4 [disabled]

-msse4.1 [disabled]
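One way to cross-check what the compiler actually enables is to look at the predefined macros it sets for each instruction set (for example via "gcc -march=native -dM -E - < /dev/null"). A minimal sketch, assuming a GCC-compatible compiler that defines __SSE__, __SSE2__ and so on; the function name report_sse_macros is hypothetical:

```c
#include <stdio.h>

/* Prints which SSE-family feature macros the compiler predefined for this
 * translation unit and returns how many of them were found. */
int report_sse_macros(void)
{
    int enabled = 0;
#ifdef __SSE__
    puts("__SSE__ is defined");
    ++enabled;
#endif
#ifdef __SSE2__
    puts("__SSE2__ is defined");
    ++enabled;
#endif
#ifdef __SSE3__
    puts("__SSE3__ is defined");
    ++enabled;
#endif
#ifdef __SSSE3__
    puts("__SSSE3__ is defined");
    ++enabled;
#endif
#ifdef __SSE4_1__
    puts("__SSE4_1__ is defined");
    ++enabled;
#endif
#ifdef __SSE4_2__
    puts("__SSE4_2__ is defined");
    ++enabled;
#endif
    return enabled;
}
```

Compiling this once with and once without -march=native and comparing the output shows which extensions the flag turned on, independently of what the -Q --help=target listing prints.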

Is Haswell's new transactional memory 'TSX' actually slower than locking?

Dear all,

just got my fingers on a Haswell system and tried the new TSX extension, hoping to boost performance of my multi-threaded app.

But what I found was rather shocking. The numbers are execution times in microseconds:

A) 29122 - App running with a single thread and without any locking

B) 42762 - Same as A) above, but just adding an XBEGIN/XEND pair (with nothing in between) at the critical sections. So even though I don't do any transaction yet, the code takes 46% longer to execute. That's much more than I had expected.

Why weren't PINSR* instructions extended to 256-bits in AVX2

In the process of testing VGATHER* instructions, a couple of questions arose.  One needs to put the indexes into an {X|Y}MM register for the VSIB addressing.  To do so, I imagine it's advantageous to move those indexes from GPRs to XMM.  To do this most efficiently, I'd imagine you would put the GPR value directly into the proper location of an XMM or YMM register.  This can be done with VPINSRD and VPINSRQ; however, you can't put these values into the upper 128 bits of a YMM register.  Was there some rationale as to why this wasn't considered important?
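For reference, one way to reach the upper half today is an extract, insert, re-insert sequence. This is a sketch of a possible workaround, not an answer from Intel: it writes a 32-bit value into element 5 (the second dword of the upper 128 bits). The function name insert_dword_lane5 is hypothetical, and the array-based interface is used only so the sketch also builds without AVX2, where it falls back to a plain store.

```c
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Replaces element 5 of an 8 x 32-bit vector held in memory. */
void insert_dword_lane5(int32_t v[8], int32_t x)
{
#ifdef __AVX2__
    __m256i y  = _mm256_loadu_si256((const __m256i *)v);
    __m128i hi = _mm256_extracti128_si256(y, 1);  /* upper 128 bits */
    hi = _mm_insert_epi32(hi, x, 1);              /* PINSRD into dword 1 of that half */
    y  = _mm256_inserti128_si256(y, hi, 1);       /* put the half back */
    _mm256_storeu_si256((__m256i *)v, y);
#else
    v[5] = x;  /* scalar fallback with the same effect */
#endif
}
```

The three-instruction sequence is what makes the missing 256-bit form noticeable: the lower half needs a single VPINSRD.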

Question/Advice on PERMD and PERMPS..

Intel, the instructions above take the form:


The mask containing the indexes is stored in XSRC1 and the bytes to be permuted are in XSRC2.  Why is the mask not in XSRC2, and vice versa?  PERMILPS and other instructions have used the implicit mask as the last SRC.  Is there a reason you changed this?  I ask because the x86 instruction set is complicated as it is, and now some forms of PERM instructions use one orientation and others the opposite. It's just confusing.
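To make the two orientations concrete, here is a scalar model of both instructions, written as a sketch with hypothetical function names. The parameter order deliberately mirrors the assembly source order described above: indexes first for VPERMD, control last for the variable form of VPERMILPS. Values are modeled as 32-bit patterns, and dst must not alias src.

```c
#include <stdint.h>

/* VPERMD model: each destination dword picks any of the 8 source dwords,
 * selected by the low 3 bits of the matching index (index is the FIRST source). */
void permd_model(uint32_t dst[8], const uint32_t idx[8], const uint32_t src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = src[idx[i] & 7u];
}

/* VPERMILPS (variable) model: each destination dword picks one of the 4
 * dwords in its own 128-bit lane, selected by the low 2 bits of the
 * matching control (control is the LAST source). */
void permilps_model(uint32_t dst[8], const uint32_t src[8], const uint32_t ctrl[8])
{
    for (int i = 0; i < 8; ++i) {
        int lane_base = (i / 4) * 4;
        dst[i] = src[lane_base + (int)(ctrl[i] & 3u)];
    }
}
```

At the intrinsic level the difference is hidden: both _mm256_permutevar8x32_epi32 and _mm256_permutevar_ps take the data as the first argument and the indexes or control as the second.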
