Intel® Streaming SIMD Extensions

Does VPMASKMOV require an aligned address?

The online for VPMASKMOV says that "mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated."  But the documentation in the Intel Instruction Set Reference Guide does not mention an alignment requirement, and seems to imply that it is not required: "Faults occur only due to mask-bit required memory accesses that caused the faults.".  

Huge time cost while assigning

Hello Guys:)

It is very nice to have this forum. I'm a fresh on the ISA Extension and expect to have your insight:)

My code snippet, which conducts a convolution computing, is attached as a figure. and here is my confusing issue:

Time was consumed hugely when I tried to assign the computed result to image buffer. Computing time of extension sets(line 512~544) only takes about 7~8ms, but the assign work(line 548) takes about 25~26ms.

Cannnot change IA32_PERF_CTL value: it gets overwritten by the operating system


I'm running an experiment on a server machine with a quad-core Xeon X5355 processor running a linux system.

I try to control core voltage and frequency separately by writing to the msr IA32_PERF_CTL (0x199). I change the value of IA32_PERF_CTL using a "wtmsr" command and verify that its value has been changed using a "rdmsr" command. However, when I run "rdmsr 0x199" again a few seconds later, I find that the value of IA32_PERF_CTL is overwritten with its previous value. The value of IA32_PERF_STATUS does not represent my change either.

Suggestion about memory-access-signaling mechanism


while I was trying to solve some particular multi-thread problem, it occurred to me that it could be solved more efficiently with special assistance from the CPU.

The situation is as follows: say one thread needs to block until the content of a particular 4-byte (or can be other size) location in the memory is changed. (It think the usefulness of this is very obvious and there is no need to give concrete examples to demonstrate it).

What are the currently available options:

TSX results - please explain

I am using Roman Dementiev's code as a base and modifying it to determine if TSX operations are behaving according to expectations.

I am counting the number of times that xbegin() returns successful, the number of times it aborts and the number of times that fallback lock is used.

Processing of data in SSE/AVX/AVX2


Im working on my project and Im looking for the answer:

When Im processing 256-bits of data, is better to use (in one core) for this one whole YMMx register or to split them for 2x128-bits and process them through 2 XMMx registers at different ports, hence on different SSE/AVX unit (in Sandy Bridge there are 3 ports per core for AVX)?  Which option is faster?

Suscribirse a Intel® Streaming SIMD Extensions