1024 bit AVX

The following blog article describes AVX as having been designed to scale to registers of up to 1024 bits.

http://electronicdesign.com/article/digital/intel_s_avx_scales_to_1024_b...

Is this for real? Where could I find an Intel description of this with some background? The AVX intro document mentions it too, but only in passing.

With 1024 bit length registers, how would the memory supply enough data to feed this beast?

Matthias Kretz

This is only part of the VEX prefix, which is used to encode the AVX instructions. In the prefix there's a two-bit field which denotes the vector width: 0 - 128, 1 - 256, 2 - 512, 3 - 1024. Only 0 and 1 are defined at this point, though, because no other hardware exists and no plans for hardware that uses 2 or 3 in this field are known.

Obviously this is a field for future-proofing the AVX instructions, which is a first for x86 AFAIK, and a really good decision IMHO. One may assume that a 1024-bit vector CPU would use cache lines that are at least 1024 bits wide. Nowadays they're still at 512 bits.
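To put the cache-line remark in concrete terms, here is a minimal sketch that counts how many 512-bit (64-byte) lines a single vector access touches. The function name `cache_lines_touched` and the constant `CACHE_LINE_BYTES` are made up for illustration; the 64-byte line size is the figure given in the post, not something queried from hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte (512-bit) cache lines, as stated in the post. */
#define CACHE_LINE_BYTES 64u

/* Number of cache lines touched by an access of `nbytes` bytes
 * starting at byte address `addr`. Hypothetical helper. */
static size_t cache_lines_touched(uintptr_t addr, size_t nbytes)
{
    assert(nbytes > 0);
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last  = (addr + nbytes - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}
```

Under that assumption, an aligned 256-bit (32-byte) access stays within one line, an aligned 1024-bit (128-byte) access always covers two, and a misaligned 1024-bit access can cover three.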

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc

Quoting magicfoot
With 1024 bit length registers, how would the memory supply enough data to feed this beast?

It doesn't necessarily have to process 1024 bits in one clock cycle. The Pentium 3 and 4 executed 128-bit SSE instructions on 64-bit execution units by splitting them into two uops.

This implicitly allows access to more physical register space. An AVX-1024 instruction would be the equivalent of "unrolling" it into four AVX-256 instructions, but without the risk of running out of ymm registers. This means it's easier to cover instruction latencies, and thus you can reach higher effective throughput.
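As a rough illustration of the "unrolling" equivalence, the sketch below models a 1024-bit add as four passes over a 256-bit-wide operation. It is plain portable C rather than intrinsics, and the names `vec1024`, `add256`, `add1024` and `vec1024_get` are hypothetical; no such AVX-1024 type or instruction exists.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES_256 = 8, CHUNKS = 4, LANES_1024 = LANES_256 * CHUNKS };

/* Hypothetical 1024-bit register: 32 single-precision lanes. */
typedef struct { float lane[LANES_1024]; } vec1024;

/* Stand-in for one 256-bit execution unit pass (8 floats). */
static void add256(float *dst, const float *a, const float *b)
{
    for (size_t i = 0; i < LANES_256; ++i)
        dst[i] = a[i] + b[i];
}

/* One "AVX-1024" add, sequenced as four 256-bit passes. */
static vec1024 add1024(const vec1024 *a, const vec1024 *b)
{
    vec1024 r;
    for (size_t c = 0; c < CHUNKS; ++c) {
        size_t off = c * LANES_256;
        add256(&r.lane[off], &a->lane[off], &b->lane[off]);
    }
    return r;
}

/* Bounds-checked lane read, for inspection only. */
static float vec1024_get(const vec1024 *v, size_t i)
{
    assert(i < LANES_1024);
    return v->lane[i];
}
```

The point of the model is that the caller names one wide register, while the four passes inside `add1024` are the hardware's business, so no extra architectural registers are consumed by the unrolling.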

A possibly even more compelling reason to implement AVX-1024 without widening the execution path is power consumption. Instead of splitting the instruction into four uops, I believe it could remain a single uop by performing the actual sequencing at the issue stage. This means the entire front-end and even part of the scheduler could be clock-gated due to this lower instruction rate.

Note that this is quite similar to how GPUs function. AMD processes 2048-bit vectors on 512-bit execution units in four cycles, while NVIDIA processes 1024-bit vectors on 512-bit execution units using a front-end clocked at half the frequency.

Together with AVX2's gather and FMA support, I believe this would make GPGPU processing obsolete. The CPU is much more flexible, and a homogeneous high-throughput architecture would eliminate the CPU-GPU communication overhead, offering even higher effective performance.

Quoting c0d1f1ed
Together with AVX2's gather and FMA support, I believe this would make GPGPU processing obsolete. The CPU is much more flexible, and a homogeneous high-throughput architecture would eliminate the CPU-GPU communication overhead, offering even higher effective performance.

I have heard this for the last few years, and so far CUDA has proved to be much faster.
I tried using AVX on a dual-CPU system, each CPU with 6 cores, and I continually get stuck with poor memory bandwidth.
It looks as if AVX was meant to improve the floating-point vector API, yet no one took the bandwidth problem into account. In certain cases the SSE or C code gets better results.

Quoting gilgil
I have heard this for the last few years, and so far CUDA has proved to be much faster. I tried using AVX on a dual-CPU system, each CPU with 6 cores, and I continually get stuck with poor memory bandwidth.
It looks as if AVX was meant to improve the floating-point vector API, yet no one took the bandwidth problem into account. In certain cases the SSE or C code gets better results.

That's mainly due to the lack of gather support. A single gather instruction can replace 18 instructions!
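For readers unfamiliar with the operation, this is a scalar model of what an 8-lane gather computes: one indexed load per lane. Without hardware gather, each of those lanes costs its own load plus an insert into the vector, which is where the long instruction sequences come from. The function `gather8` and its bounds parameter `n` are hypothetical, and the sketch ignores the mask and scale operands that the AVX2 gather instructions take.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GATHER_LANES = 8 };  /* eight 32-bit lanes in a 256-bit vector */

/* Scalar model of an 8-lane gather: dst[lane] = base[idx[lane]].
 * `n` is the table length, used only for the bounds check. */
static void gather8(float dst[GATHER_LANES], const float *base, size_t n,
                    const int32_t idx[GATHER_LANES])
{
    for (size_t lane = 0; lane < GATHER_LANES; ++lane) {
        assert(idx[lane] >= 0 && (size_t)idx[lane] < n);
        dst[lane] = base[idx[lane]];
    }
}
```

Note that a gather still has to fetch every addressed element from the cache hierarchy, so it reduces instruction count rather than memory traffic.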

FMA would also offer a significant increase in computing power. And because of this Haswell is also expected to double the cache bandwidth. Note that Sandy Bridge-E is said to already have twice the RAM bandwidth. So the combination of all these things would make the CPU highly effective at throughput computing.

Quoting Matthias Kretz
This is only part of the VEX prefix, which is used to encode the AVX instructions. In the prefix there's a two-bit field which denotes the vector width: 0 - 128, 1 - 256, 2 - 512, 3 - 1024. Only 0 and 1 are defined at this point, though, because no other hardware exists and no plans for hardware that uses 2 or 3 in this field are known.

I checked the Programming Reference and I actually couldn't locate this 2-bit field. There's only a 1-bit VEX.L field for indicating 128-bit or 256-bit operation... Or did I miss something?

There's a VEX.m-mmmm field in the 3-byte VEX format which has three reserved bits though.
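To make the field layout concrete, here is a small sketch that pulls the fields out of a 2-byte (C5) or 3-byte (C4) VEX prefix. The struct and function names (`vex_fields`, `vex_decode`) are made up for this post, and the sketch assumes the bytes are already known to be a VEX prefix; it does not handle the disambiguation from the legacy LES/LDS opcodes outside 64-bit mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned map;     /* m-mmmm opcode map (implied 1 = 0F in the 2-byte form) */
    unsigned w;       /* VEX.W (implied 0 in the 2-byte form) */
    unsigned vvvv;    /* extra source register, stored inverted in the prefix */
    unsigned l;       /* VEX.L: 0 = 128-bit, 1 = 256-bit */
    unsigned pp;      /* implied SIMD prefix: 0 none, 1 = 66, 2 = F3, 3 = F2 */
    unsigned length;  /* prefix length in bytes */
} vex_fields;

/* Returns 1 and fills *out if p starts with a VEX prefix, else 0. */
static int vex_decode(const uint8_t *p, size_t n, vex_fields *out)
{
    assert(p != NULL && out != NULL);
    uint8_t last;  /* the byte holding [W] vvvv L pp */

    if (n >= 2 && p[0] == 0xC5) {          /* 2-byte form: R vvvv L pp */
        out->map = 1;
        out->w = 0;
        out->length = 2;
        last = p[1];
    } else if (n >= 3 && p[0] == 0xC4) {   /* 3-byte form: RXB m-mmmm | W vvvv L pp */
        out->map = p[1] & 0x1Fu;
        out->w = (p[2] >> 7) & 1u;
        out->length = 3;
        last = p[2];
    } else {
        return 0;
    }
    out->vvvv = (~(unsigned)last >> 3) & 0xFu;
    out->l = ((unsigned)last >> 2) & 1u;
    out->pp = (unsigned)last & 3u;
    return 1;
}
```

For example, `vaddps ymm0, ymm1, ymm2` encodes as `C5 F4 58 C2`, which decodes to L = 1 (256-bit), vvvv = 1 (ymm1) and pp = 0. The single L bit is the only width selector in this layout, and only the low bits of the 5-bit map field are needed for the 0F, 0F 38 and 0F 3A maps.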

Quoting c0d1f1ed

There's a VEX.m-mmmm field in the 3-byte VEX format which has three reserved bits though.

I just found out that one of these bits is used by AMD's XOP instructions. Interestingly, it's the middle bit of the trio, which suggests that perhaps Intel already has specific plans for the first bit...

Hi Gilgil, do you have any results (a graph or table) showing how the memory bandwidth is affected on your multicore system? I run some intense routines on an 8-core chip and see only a slight degradation (maybe 5%), which I assume may be due to memory bandwidth bottlenecking. What is the nature of the routines that are causing the bottleneck, e.g. Monte Carlo?

Quoting c0d1f1ed

Quoting c0d1f1ed

There's a VEX.m-mmmm field in the 3-byte VEX format which has three reserved bits though.

I just found out that one of these bits is used by AMD's XOP instructions. Interestingly, it's the middle bit of the trio, which suggests that perhaps Intel already has specific plans for the first bit...

Never mind, I've finally taken the time to study the VEX encoding (in particular how it avoids collision with legacy instructions), and I found out that while the XOP encoding uses the same format, it actually maps to a previously unused part of a 'group opcode', where the mod field overlaps with the VEX.mmmmm field and needs to have a fixed value.

But for VEX I believe the three first bits of the mmmmm field are all still available for future extensions, like 1024-bit AVX...
