Vectorization - single compilation unit doubles performance(!)

Hi - I'm using Visual Studio C++ 2012 on Windows 8 with the Intel Compiler 16.0 to develop code implementing a digital signal processing algorithm. The main loop iterates over received 'symbol' data (1200 symbols) and lends itself well to vectorization. My laptop has an i5-4300U, which supports AVX2 instructions.
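
For context, the kind of per-symbol inner loop that auto-vectorizes well looks like the sketch below. The kernel and all names are hypothetical (not the poster's actual code); the point is that a simple indexed loop over `restrict`-qualified pointers lets the compiler emit AVX2 code (e.g. with `/QxCORE-AVX2` on the Intel compiler):

```c
#include <stddef.h>

/* Hypothetical DSP kernel: scale each received symbol by a per-symbol
 * equalizer coefficient. The restrict qualifiers promise the compiler
 * the arrays don't alias, which is what enables clean vectorization. */
void equalize(float *restrict out, const float *restrict sym,
              const float *restrict coef, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = sym[i] * coef[i];   /* one FP multiply per symbol */
}
```

With n = 1200 and 32-byte-aligned arrays, the compiler's vectorization report should show this loop processed 8 floats per AVX2 iteration.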

Do Non-Temporal Loads Prefetch?

I can't find any information on this anywhere. Do non-temporal load instructions (e.g. MOVNTDQA), which use separate non-temporal streaming-load buffers rather than the cache hierarchy, do any prefetching? How do their latency and bandwidth compare to a normal load from main memory?

Is the right way to think about this buffer that it is as "close" to main memory as the L3 cache, but as "close" to the register file as the L1 cache?
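
For reference, MOVNTDQA is issued via the `_mm_stream_load_si128` intrinsic (SSE4.1). A minimal sketch, assuming a 16-byte-aligned buffer; note that on ordinary write-back memory most CPUs treat the instruction like a plain load, and its special buffered behavior applies to write-combining (WC) memory:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sum a buffer of 64-bit integers 16 bytes at a time using streaming
 * loads. The target attribute enables SSE4.1 codegen on GCC/Clang;
 * with the Intel compiler the intrinsic is available directly. */
__attribute__((target("sse4.1")))
int64_t sum_streamed(const void *src, size_t bytes)
{
    __m128i acc = _mm_setzero_si128();
    const __m128i *p = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; ++i)
        /* MOVNTDQA: the intrinsic's prototype takes a non-const pointer */
        acc = _mm_add_epi64(acc, _mm_stream_load_si128((__m128i *)&p[i]));
    int64_t out[2];
    _mm_storeu_si128((__m128i *)out, acc);
    return out[0] + out[1];
}
```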

Possible bug in SDE - jump with 16-bit operand size

The Software Developer's Manual, and the corresponding AMD document, indicate that after the new RIP is calculated, it is then truncated to whatever the instruction's operand size is.

To see if this was actually true, I assembled a JMP instruction with a 66 prefix, setting an operand size of 16 bits. I would expect this to truncate the destination to 16 bits.

Running this instruction on my AMD Steamroller CPU, I got a segmentation fault.

But running it with SDE, the trace shows a jump without truncating the destination address.

It would appear that SDE is incorrect.

Random performance - Difficult optimization issue, likely related to instruction loop alignment

I'm currently having a hard time trying to optimize a performance-oriented library for Intel Core.

With most essential ingredients in place, I'm now progressing more slowly, making small modifications, checking performance change at each iteration.

The problem is that I'm in a difficult phase: modifying one part of the code can trigger large performance differences in other *unrelated* parts of the same code.
This is not measurement noise: it's perfectly reproducible and significant (between 10-20%, depending on the scope tested).

First touch time greater than parallel time

Hi all,

I was looking to parallelize my code for speedup.

Since the Xeon Phi is a NUMA machine, I used first-touch placement of the data.

While the Xeon Phi no doubt performs better than the Xeon on the loop itself, the problem is that the total time (first-touch time + loop time) is greater.

How do I resolve this issue?

When integrated into the main code (which I cannot post here), this code will call the state function many times from various different places. So even if I don't do the first touch as in the code attached below, is it possible that this overhead is just a one-time cost?
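
The usual first-touch pattern is sketched below (names illustrative, not the poster's attached code): initialize the array with the same static parallel loop schedule that will later compute on it, so each thread's pages are placed on its own NUMA node. The initialization cost is paid once and amortized over all subsequent calls.

```c
#include <stdlib.h>

/* First-touch allocation sketch: the OS places each page on the NUMA
 * node of the thread that first writes it, so initializing with the
 * same #pragma omp parallel for schedule(static) used by the compute
 * loop gives each thread local memory. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (!a)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = 0.0;   /* the first write decides page placement */
    return a;
}
```

If the state function is called many times on the same arrays, the first-touch overhead should indeed be a one-time cost relative to the repeated loop time.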

Intel MIC MPI symmetric job profiling using VTune


I want to profile my MPI application executing on HOST+MIC using symmetric-mode execution. I used the following command, but it says it cannot execute the binary. I sourced the environment and then used the following:

mpirun -host test -n 2 amplxe-cl -collect hotspots -r result-dir1 ./hello : -host test-mic0 -n 4 amplxe-cl -collect hotspots -r result-dir1 ./hello.mic

Can someone help me profile my MPI application in symmetric-mode execution?

As a second option I tried

Compiling for Xeon Phi co-processor

Can somebody help me build the Aerospike database server for the Intel Xeon Phi coprocessor? A step-by-step guide would be appreciated, as I am new to the Intel MIC. I am able to build the database server in the host environment, but it is native execution of the server on the Xeon Phi coprocessor where I am completely lost. Thank you in advance.


Hey Guys, 

I am playing with the CPU_MASK mechanism in COI (in both MPSS 3.5.2 and MPSS 3.6). However, I found it is not working as I expected. Suppose we have 224 threads (on a Phi) and we divide them into 4 partitions, so each partition has 56 threads.

Partition 1: thread 1 -- thread 56

Partition 2: thread 57 -- thread 112

Partition 3: thread 113 -- thread 168

Partition 4: thread 169 -- thread 224. 
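
The bit arithmetic for building one partition's mask can be sketched as below. This only shows how the 224 bits are laid out across an array of 64-bit words (the representation COI's `COI_CPU_MASK` uses); it does not show the COI API calls themselves, and it numbers threads from 0 rather than 1:

```c
#include <stdint.h>
#include <string.h>

#define NWORDS 16   /* COI_CPU_MASK holds 16 x 64-bit words */

/* Set the bits for partition p, covering threads p*56 .. p*56+55
 * (0-based). A 56-thread partition can straddle a 64-bit word
 * boundary, which the per-thread loop handles naturally. */
void build_partition_mask(uint64_t mask[NWORDS], int partition)
{
    memset(mask, 0, NWORDS * sizeof(uint64_t));
    int first = partition * 56;
    for (int t = first; t < first + 56; ++t)
        mask[t / 64] |= 1ULL << (t % 64);
}
```

Partition 1 (0-based) illustrates the straddle: its 56 threads occupy the top 8 bits of word 0 and the low 48 bits of word 1.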
