I am trying to use the Intel® Software Development Emulator (SDE) with Visual Studio 2013, but I am running into trouble.
Starting a debug session with the SDE debugger fails with a message that my program (the Visual Studio project's executable) could not be launched because a component DLL is missing. Ctrl+F5 starts the program, but it crashes soon afterwards.
Running the SDE tool on my executables from the command line works fine. But one of my shuffles seems wrong, so I would like to inspect the vector registers at a certain point. Any ideas? Or is VS2013 not supported?
I wrote code for integral image computation with SSE and it works quite well. But I have serious problems making use of AVX/AVX2. I run my code on an i5-4460.
The basis: for the integral image I need a row sum (a running sum along each row), which is not ideal for vector units but can be done with shuffles and adds. As a second step, I need to broadcast the last element to all elements. This can be done with a shuffle.
Now with AVX, there is no full-width shuffle for 32-bit elements (the shuffles work within each 128-bit lane), but I can do it with a normal in-lane shuffle and _mm256_permute2f128_ps.
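The SSE row-sum step described above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the names `scan4` and `row_prefix_sum` are hypothetical, the length is assumed to be a multiple of 4, and the in-register shift uses the SSE2 byte shift `_mm_slli_si128` in place of a shuffle. The last element is broadcast with `_mm_shuffle_ps` and carried into the next group of four.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Inclusive prefix sum inside one register:
   [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d]. */
static __m128 scan4(__m128 x)
{
    /* Shift up by one float (4 bytes, zero-filled) and add. */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    /* Shift up by two floats (8 bytes) and add. */
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

/* dst[i] = src[0] + ... + src[i]; n is assumed to be a multiple of 4. */
static void row_prefix_sum(const float *src, float *dst, int n)
{
    __m128 carry = _mm_setzero_ps();   /* running total of previous groups */
    for (int i = 0; i < n; i += 4) {
        __m128 s = _mm_add_ps(scan4(_mm_loadu_ps(src + i)), carry);
        _mm_storeu_ps(dst + i, s);
        /* Broadcast the last element to all four positions. */
        carry = _mm_shuffle_ps(s, s, _MM_SHUFFLE(3, 3, 3, 3));
    }
}
```

With AVX the same two steps have to be done per 128-bit lane first, and the low lane's total then has to be moved into the high lane, which is where _mm256_permute2f128_ps comes in.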
I would like to see the Intel Intrinsics Guide include all of the intrinsics, not just the SSE-based intrinsics.
It would be helpful if the 'interlocked' and other intrinsics were also included, especially with timing information.
I am using AVX/SSE instructions to replace memcmp and our workload includes comparing 64 bytes and occasionally 64 and 128 bytes. I am using the following function, cmp32, for 32-byte comparisons and extending it 2 times for 64 bytes or 4 times for 128 bytes, and I am getting hardly 1% performance improvement. Testing was done on an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu 14.04 x86_64.
I tried replacing the following lines:
vcmp = _mm256_cmpeq_epi64(xmm0, xmm1);
vmask = _mm256_movemask_epi8(vcmp);
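For readers who want to see the compare-and-movemask pattern in a complete function, here is a minimal sketch. It is not the poster's cmp32: the name `equal_blocks16` is hypothetical, it uses 16-byte SSE2 operations as a stand-in so that it builds without extra compiler flags, the length is assumed to be a multiple of 16, and it only answers "equal or not" rather than returning memcmp's ordering. The 32-byte AVX2 form follows the same shape with `_mm256_loadu_si256`, `_mm256_cmpeq_epi8` and `_mm256_movemask_epi8`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Returns 1 if the n bytes at a and b are identical, 0 otherwise.
   Assumption: n is a multiple of 16. */
static int equal_blocks16(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(pa + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(pb + i));
        /* Each equal byte becomes 0xFF, each differing byte 0x00. */
        __m128i eq = _mm_cmpeq_epi8(va, vb);
        /* One mask bit per byte; all 16 bits set means all bytes matched. */
        if (_mm_movemask_epi8(eq) != 0xFFFF)
            return 0;
    }
    return 1;
}
```

For buffers as small as 64 or 128 bytes, call overhead and load latency may dominate, so a small gain over a well-tuned library memcmp would not be surprising.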
FLOPS means floating-point operations per second, a metric used in high-performance computing. In general, Intel(R) VTune(TM) Amplifier XE
provides only a metric named Cycles Per Instruction (average CPI), which measures performance for general programs.
I'm currently writing an x86 code emulator,
so I need to know the detailed operation of each instruction.
I was implementing the DAA instruction; the detailed operation is like this:
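For comparison, the DAA pseudocode in the Intel Software Developer's Manual can be transcribed to C roughly as below. This is a minimal sketch and not the poster's code: the struct `cpu8` and the function `emu_daa` are hypothetical names, only AL and the affected flags are modelled, and OF is left out because the manual lists it as undefined after DAA.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimal CPU state: AL plus the flags DAA touches. */
typedef struct {
    uint8_t al;
    bool cf, af, zf, sf, pf;
} cpu8;

/* PF is set when the result byte has an even number of 1 bits. */
static bool parity_even(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (v & 1) == 0;
}

/* Decimal Adjust AL after Addition, following the manual's pseudocode. */
static void emu_daa(cpu8 *c)
{
    uint8_t old_al = c->al;
    bool old_cf = c->cf;

    c->cf = false;
    if ((c->al & 0x0F) > 9 || c->af) {
        uint16_t t = (uint16_t)c->al + 6;
        c->al = (uint8_t)t;
        c->cf = old_cf || (t > 0xFF);   /* carry out of AL + 6 */
        c->af = true;
    } else {
        c->af = false;
    }

    if (old_al > 0x99 || old_cf) {
        c->al = (uint8_t)(c->al + 0x60);
        c->cf = true;
    } else {
        c->cf = false;
    }

    /* SF, ZF and PF reflect the adjusted AL. */
    c->zf = (c->al == 0);
    c->sf = (c->al & 0x80) != 0;
    c->pf = parity_even(c->al);
}
```

As a check, adding the packed BCD values 0x79 and 0x35 gives AL = 0xAE with CF = 0 and AF = 0; after DAA, AL is 0x14 with CF = 1, which reads as decimal 114.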