Intel ISA Extensions

Question on AVX Instruction set reference

Suppose there is a legacy 128-bit SIMD instruction and the data is held in a 256-bit register. If the entire 256 bits of data are to be processed but only a 128-bit SIMD instruction is available, how can one:

1. run the instruction on the lower 128 bits
2. run the instruction on the upper 128 bits
3. shift the upper 128 bits to the lower 128 bit position
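One possible way to express the three steps above with compiler intrinsics is sketched here. It is only an illustration, not an official answer: the function name `sqrt_256_via_128` is hypothetical, `_mm_sqrt_ps` stands in for whatever 128-bit operation is meant, and the `target("avx")` attribute assumes GCC or Clang (other compilers would need their own AVX switch instead).

```c
#include <immintrin.h>

/* Hypothetical helper: applies a 128-bit operation (here _mm_sqrt_ps)
   to both halves of a 256-bit value.
   Assumes GCC/Clang; the attribute enables AVX for this function only. */
__attribute__((target("avx")))
void sqrt_256_via_128(const float in[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(in);            /* 8 floats in one 256-bit register */

    /* 1. Lower 128 bits: a cast, no data movement is required. */
    __m128 lo = _mm256_castps256_ps128(v);

    /* 3. Upper 128 bits moved into a 128-bit register (VEXTRACTF128). */
    __m128 hi = _mm256_extractf128_ps(v, 1);

    /* 1. and 2. Run the 128-bit instruction on each half. */
    lo = _mm_sqrt_ps(lo);
    hi = _mm_sqrt_ps(hi);

    /* Recombine: lo becomes bits 0..127, hi is inserted into bits 128..255. */
    __m256 r = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    _mm256_storeu_ps(out, r);
}
```

The point of the sketch is that the upper half is reached by extracting it into a 128-bit register first, while the lower half is already addressable through the corresponding 128-bit register.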

Thanks

Performance Counters to measure L1, L2 Cache Misses

Hi,
I'm currently optimizing some algorithms in assembler using software prefetching.

Now I'd like to measure the changes.
I used the performance counters below on my Xeon 5130 (Intel Core microarchitecture). But while the execution time decreases after optimization, the L1 and L2 cache misses seem to increase.

Performance counters I used:

Use           Eventnum   Umask
L1 Requests   0x40       0x0F
L2 Requests   0x2E       0xFF
L1 Misses     0xCB       0x02
L2 Misses     0x24       0x01
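For reference, an event number and umask pair like the ones listed is normally combined into a single value for an IA32_PERFEVTSELx register. The sketch below shows only that bit layout; the helper name `perfevtsel_encode` is hypothetical, counting in both user and kernel mode is an assumption, and the sketch says nothing about whether these particular events are the right ones for this processor.

```c
#include <stdint.h>

/* Bit fields of IA32_PERFEVTSELx:
   bits 0..7 event select, bits 8..15 unit mask,
   bit 16 USR, bit 17 OS, bit 22 EN (enable counter). */
#define PERFEVTSEL_USR (1u << 16)
#define PERFEVTSEL_OS  (1u << 17)
#define PERFEVTSEL_EN  (1u << 22)

/* Hypothetical helper: builds the register value for one event,
   assuming the counter should run in user and kernel mode. */
uint32_t perfevtsel_encode(uint8_t event, uint8_t umask)
{
    return (uint32_t)event
         | ((uint32_t)umask << 8)
         | PERFEVTSEL_USR
         | PERFEVTSEL_OS
         | PERFEVTSEL_EN;
}
```

With this layout, the "L1 Misses" row (event 0xCB, umask 0x02) would encode to 0x4302CB. Writing the value to the MSR requires ring 0 access and is not shown.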

Are these the right performance counters?

Thanks in advance,
Michael

Intel Compiler AVX Instructions

The option QxAVX (for using AVX instructions) is not available with the evaluation version of the ICC compiler 11.1065 on VS 2008. Also, the header file gmmintrin.h mentioned in the help cannot be found when it is included:

catastrophic error: could not open source file "gmmintrin.h"

How can we build source code that contains AVX intrinsics? Is it sufficient to use O3 (Intel-specific optimization)?

Is there a solution to this problem?

P-State invariant TSC on Nehalem platforms with multi-packages

The following comment was made by Intel:

"The time-stamp counter on Nehalem is reset to zero each time the
processor package has RESET asserted. From that point onwards the TSC
will continue to tick constantly across frequency changes, turbo mode
and ACPI C-states. All parts that see RESET synchronously will have
their TSC's completely synchronized. This synchronous distribution of
RESET is required for all sockets connected to a single PCH. For large,
multi-node systems, RESET might not be synchronous."

Dot Products and overhead of Address increments....

All,

I was just trying to optimize the "Dot Product" operation on two vectors. Both vectors are laid out in aligned memory locations as arrays.

I wrote an assembly implementation, only to realize that repeated additions are causing resource stalls (at least that's what I infer).
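One common way to reduce stalls of this kind, assuming they come from every addition depending on the previous one through a single accumulator, is to keep several independent partial sums. The following C sketch illustrates that idea only; it is not the poster's code, and the name `dot_product_4acc` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: dot product with four independent accumulators,
   so consecutive additions do not all wait on the same register. */
double dot_product_4acc(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: one shared index for both arrays, four elements per pass. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Remainder loop for lengths that are not a multiple of four. */
    for (; i < n; i++) {
        s0 += a[i] * b[i];
    }

    return (s0 + s1) + (s2 + s3);
}
```

Using one index for both arrays also means only a single increment per iteration. Note that splitting the sum changes the order of floating-point additions, so results can differ slightly from a strictly sequential loop.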

For example, consider this:
