AVX512f on non-MIC this year?

Hi all,

Can we expect AVX512f on non-MIC systems this year, or only on Knights Landing during 2015?

Thanks,

  Angus.

What is the difference between AVX-512 and AVX-512F?

AVX-512 can be viewed as the larger family of instructions; AVX-512F is the "foundation" subset of that family. All chips supporting AVX-512 will have AVX-512F.

A couple of blog posts try to elaborate:

https://software.intel.com/en-us/blogs/2013/avx-512-instructions

https://software.intel.com/en-us/blogs/additional-avx-512-instructions

 

Quote:

iliyapolak wrote:

What is the difference between AVX-512 and AVX-512F?

You can easily find all the AVX-512 subsets disclosed so far in the Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

When you hover the mouse over the AVX-512 checkbox, the list expands and you can then pick AVX-512F, AVX-512BW, etc.

Thank you guys for your help.

Many places on the interwebs have reported that Intel announced that Skylake should be shipping in the second half of 2015.

Skylake is expected to support AVX-512F (e.g., https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2015-...), but I have not been able to find any documentation on which AVX-512 subsets will be supported on the various products.

"Dr. Bandwidth"

The reason I ask is that AVX-512F seems to have been confirmed only for "Skylake Xeons".

The latest version of Intel SDE has support for two unreleased uArchs, SKL and SKX. The latter appears to support AVX-512F, the former does not. Others have speculated that "SKX" means "Skylake Xeon", which would make some sense, since AVX-512 seems like quite a lot of extra silicon for lower-power mobile cores to carry.

And yet for the last few generations, mobile has shipped first, with workstation/server parts about a year later.

If Skylake follows the same pattern, we wouldn't expect Xeon parts until next year, hence the original question.

Quote:

angus-hewlett wrote:
Others have speculated that "SKX" means "Skylake Xeon"

"SKX" for "Skylake Xeon", is more than speculation since it is used in some Intel documents, such as this one (see slide 6): http://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=get&target=Cauldron14_AVX-512_Vector_ISA_Kirill_Yukhin_20140711.pdf

Interesting point about the use of the term "Skylake Xeon". I freely admit that I am bewildered by the many code names that Intel uses, but I don't recall Intel releasing a mobile or desktop processor that does not include the primary SIMD ISA extension supported by the server processors referred to by the same code name. For Nehalem/Westmere, Sandy Bridge/Ivy Bridge, and Haswell/Broadwell, I think that the Core i3/5/7 parts support the same SIMD ISA as the corresponding Xeon processors.

Intel has certainly released processors without the newest ISA support (e.g., Atom & Xeon Phi), but they have been very clear about not referring to them with the same code names as any of the parts with the newest ISA.

Of course, there is always a first time. And there is certainly the possibility that AVX-512 could be supported "functionally" on some low-power chips -- giving compatibility without speedup (relative to AVX2, for example). There is nothing wrong with this, but I would not want to be the one in charge of managing the marketing message -- disastrous confusion would be very easy to achieve!

I am still trying to wrap my head around the various optional subsets.  The Intel Architecture ISA Extensions document mentions (1-6) as instruction "groups" and (9) as a modifier for most instructions in groups (1-6).  Groups (7-8) are listed in a separate chapter -- they appear to target the Xeon Phi family (along with (1-2), as noted in the 2nd blog entry referenced above).

  1. AVX512F       Foundation -- adds 512-bit support to many instructions (>100?), plus new instructions (I counted 84 new ones).   Some of the new instructions support treating two ZMM registers as a single 1024-bit vector!
  2. AVX512CD    "Conflict Detection" -- includes three instructions:  (i) broadcast 8 or 16 bits of the mask register to the 2/4/8/16 fields in a vector register; (ii) detect duplicate 32/64-bit values in a 128/256/512-bit register; (iii) count leading zeros in each 32/64-bit word of a 128/256/512-bit register.
  3. AVX512DQ    Double and Quad (integer) vector support -- approximately 41 instructions (most with many possible operand types)
  4. AVX512BW    Byte and Word (integer) vector support -- approximately 51 instructions (most with many possible operand types), but lots of overlap with the AVX512DQ set -- just extending the data types supported for the instructions
  5. AVX512IFMA   Integer Fused Multiply-Add: 2 instructions for high/low result of Fused Multiply-Add for 2/4/8-element vectors of 52-bit integers stored in 64-bit fields of 128/256/512-bit vectors.
  6. AVX512VBMI   Byte-level vector permute (on 128/256/512/1024-bit vectors) and a select+pack instruction -- 4 instructions (with multiple argument types)
  7. AVX512ER      Exponential and Reciprocal Functions -- 10 instructions (with multiple argument types)
  8. AVX512PF     Prefetch Instructions -- 4 instructions (with multiple argument types).  These are all labelled "sparse", but of course the indices can point to contiguous cache lines -- allowing a single instruction to fetch up to 8 or 16 cache lines (depending on the size of the index variables).
  9. AVX512VL     Add support for 128-bit and 256-bit vector lengths (in addition to the standard 512-bit) -- applies to most of the instructions in groups (1-6), but not to the instructions in groups (7-8).
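
To make item (2) a bit more concrete, here is a scalar Python model of what the conflict-detection and leading-zero-count operations compute per lane, as far as I understand them. It is only a sketch of the semantics: the function names are made up for this example, and real hardware does this for all lanes of a vector register at once.

```python
def conflict_detect(lanes):
    """Model the per-lane result of a conflict-detection operation.

    For lane i, bit j (with j < i) of the result is set when lanes[j]
    holds the same value as lanes[i]. A result of 0 means "no earlier
    lane duplicates this one".
    """
    result = []
    for i, value in enumerate(lanes):
        mask = 0
        for j in range(i):
            if lanes[j] == value:
                mask |= 1 << j
        result.append(mask)
    return result


def leading_zeros32(x):
    """Count leading zero bits of a 32-bit unsigned value (32 for x == 0)."""
    return 32 - (x & 0xFFFFFFFF).bit_length()
```

For example, `conflict_detect([7, 3, 7, 7])` marks lane 2 as conflicting with lane 0, and lane 3 as conflicting with lanes 0 and 2.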

I guess I am also glad that I am not a compiler writer.   In practice there are probably only going to be a few combinations of the subsets supported, but since the subsets are all identified independently, I can imagine very painful combinatorial explosions in the logic for code generation.   E.g., Use 512-bit encodings for vectors of 32-bit integers if AVX512DQ is supported, but revert to AVX2 for vectors of 16-bit integers if AVX512BW is *not* supported.  For code that mixes the two, consider using the 256-bit optional argument length for the vectors of 32-bit integers if AVX512VL is supported, otherwise write code that promotes the vectors of 16-bit values internally into vectors of 32-bit values and include lots of guard code to make sure that all the exceptional cases are handled.  This could make use of the AVX512CD "Counting Leading Zero Bits" instruction -- if AVX512CD is supported -- otherwise you have to figure another way to do it. 
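
A toy Python sketch of the kind of decision logic described above. The rule table simply encodes the example from this post (which subset is assumed to be needed for which element width); it is hypothetical and far simpler than what a real compiler would need, and the feature-name strings and function names are invented for the illustration.

```python
# Hypothetical rule table taken from the example in the post above: which
# AVX-512 subset is assumed to be required for each integer element width.
REQUIRED_SUBSET = {32: "avx512dq", 16: "avx512bw"}


def pick_vector_path(features, elem_bits):
    """Return (isa, vector_bits) for a loop over integers of elem_bits width."""
    subset = REQUIRED_SUBSET[elem_bits]
    if "avx512f" in features and subset in features:
        return ("avx512", 512)
    return ("avx2", 256)          # fall back when the subset is missing


def pick_mixed_path(features):
    """Choose a strategy for code that mixes 16-bit and 32-bit integers."""
    wide32 = pick_vector_path(features, 32)
    wide16 = pick_vector_path(features, 16)
    if wide32 == wide16:
        return "same width for both"
    if "avx512vl" in features:
        return "use 256-bit AVX512VL encodings for the 32-bit data"
    return "promote 16-bit data to 32-bit with guard code"
```

Even with only two element widths and three optional subsets there are already three distinct outcomes, which is the combinatorial growth I am worried about.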

Are we headed back to the 1980s, with CISC instruction sets that are too complex to be usable by compilers?

"Dr. Bandwidth"

Quote:

John D. McCalpin wrote:

I am still trying to wrap my head around the various optional subsets. 

I suggest having a look at slide 3 of the PDF I referred to in my previous post; it clarifies what is planned for KNL and SKX.

Huh... AVX512IFMA... 52-bit integers in a 64-bit field.

Looks like a hack of double precision FMA using denormalized numbers. The programmer may need a #pragma or !DEC$ to disable/enable the integer fma on the next statement. I anticipate that the 52-bit restriction applies to the operands for the multiplication and not the addition.

Jim Dempsey

The IFMA instructions are certainly strange. 

It is not hard to imagine adding them to the Floating-Point pipeline --- mostly all you have to do is turn off the normalization.

It is harder to imagine why it would be considered a worthwhile investment in development and debugging.  With the 52-bit limitation I can't see that it will be useful in anything other than a library built specifically around "bignums" constructed in 52-bit chunks?  Of course it would have to be vectorized too, which is not a property that is typically associated with bignums?

The low-order result might be easy to use if you are working with inputs to the multiplication that are all 26 bits or less.

"Dr. Bandwidth"

52-bit * 52-bit could yield a 104-bit result (40-bit high and 64-bit unsigned low). You'd need two instructions, one to generate the high result and one to generate the low result, or one with two output registers. You would also need something to propagate the carry; I believe the new integer instructions handle multi-precision arithmetic (though I haven't used them).

Jim Dempsey

The new instructions do provide the option to save either the low 52 bits or the high 52 bits of each field of the vector of 104-bit products, and add those to the previous contents of the corresponding 64-bit field in the output SIMD register.
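
As a sanity check on my reading of the documentation, here is a per-lane Python model of the two operations. The names `madd52lo` and `madd52hi` are made up for this sketch, and Python lists stand in for the 64-bit fields of a SIMD register.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, b, c):
    """Per lane: acc + low 52 bits of (b[51:0] * c[51:0]), wrapped to 64 bits."""
    return [(a + (((x & MASK52) * (y & MASK52)) & MASK52)) & MASK64
            for a, x, y in zip(acc, b, c)]


def madd52hi(acc, b, c):
    """Per lane: acc + high 52 bits of the 104-bit product, wrapped to 64 bits."""
    return [(a + (((x & MASK52) * (y & MASK52)) >> 52)) & MASK64
            for a, x, y in zip(acc, b, c)]
```

Starting from zeroed accumulators, `hi * 2**52 + lo` reproduces the full 104-bit product in each lane.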

This makes my head hurt a bit, since the 52+52 alignment does not correspond to boundaries of the 64-bit accumulators.  If you wanted to use the high and low results to implement a 104-bit accumulator, I think you would need to identify when the low-order 64-bit output overflows 52 bits, since that needs to be propagated to the upper 52 bits.  Hmmm...  maybe you can just do the accumulation for the full vector length, then at the very end use packed bitwise operations to extract the upper 12 bits of the low-order 64-bit accumulator and add them to the high-order 64-bit accumulator fields.   Since the upper accumulator is in a different register, you don't need to worry about shifting across lanes in a single SIMD register.    Then you can either use the results as a 52bit+52bit "bignum" or you can do the extracts and shifts required to convert it to a 40bit+64bit result in a 64bit+64-bit contiguous field.
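
Here is that idea written out in plain Python for a single lane, mostly to convince myself the bookkeeping works. Python integers stand in for the 64-bit accumulator fields, and all function names are hypothetical; this is a sketch of the scheme as I described it, not a tested SIMD implementation.

```python
MASK52 = (1 << 52) - 1


def accumulate_products(pairs):
    """Sum 52-bit x 52-bit products into separate low and high 64-bit fields."""
    lo_acc = hi_acc = 0
    for b, c in pairs:
        product = (b & MASK52) * (c & MASK52)   # up to 104 bits
        lo_acc += product & MASK52              # low 52 bits of each product
        hi_acc += product >> 52                 # high 52 bits of each product
    # The 12 spare bits per field only cover a limited number of terms.
    assert lo_acc < 1 << 64 and hi_acc < 1 << 64, "too many terms"
    return lo_acc, hi_acc


def fold_carries(lo_acc, hi_acc):
    """Move the upper 12 bits of the low accumulator into the high accumulator."""
    carry = lo_acc >> 52
    return lo_acc & MASK52, hi_acc + carry


def to_64bit_halves(lo52, hi):
    """Repack the 52+52 style result as (high 64 bits, low 64 bits)."""
    total = (hi << 52) | lo52
    return total >> 64, total & ((1 << 64) - 1)
```

After `fold_carries`, the low field is back under 52 bits and `hi * 2**52 + lo` equals the exact sum of the products.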

I can certainly imagine that there are use cases.  It is hard to imagine that the set of applications that will benefit from this correspond to enough revenue to justify the development and validation costs.   But maybe I just need a better imagination.  ;-)

"Dr. Bandwidth"

Thinking about it some more, the IFMA instructions are pretty close to one piece of what you need to implement the "exact dot product" algorithm proposed by Kulisch (look for "Kulisch exact dot product").   This algorithm uses a 4288-bit accumulator to accumulate products of 64-bit IEEE floating-point values with no rounding errors (until the final rounding).   

The absence of rounding errors is of interest in some parallel computing applications, for which bit-wise reproducibility of dot products (independent of the number of tasks executing or the order in which the tasks execute) is considered to be a very useful property.
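
A minimal Python sketch of the exact-accumulation idea, using an arbitrary-precision Python integer in place of the fixed 4288-bit accumulator. This is my own illustration of the principle, not Kulisch's formulation; `exact_dot` and `SCALE_BITS` are names invented for the example, and it assumes all inputs are finite.

```python
from fractions import Fraction

# Every finite double is an integer multiple of 2**-1074, so every product of
# two doubles is an integer multiple of 2**-2148 and can be counted exactly.
SCALE_BITS = 2 * 1074


def exact_dot(xs, ys):
    """Dot product of finite doubles with a single rounding at the very end."""
    acc = 0  # Python int standing in for the long fixed-point accumulator
    for x, y in zip(xs, ys):
        nx, dx = x.as_integer_ratio()   # x == nx / dx, with dx a power of two
        ny, dy = y.as_integer_ratio()
        shift = SCALE_BITS - (dx.bit_length() - 1) - (dy.bit_length() - 1)
        acc += (nx * ny) << shift       # exact: integer arithmetic only
    return float(Fraction(acc, 1 << SCALE_BITS))
```

Because the accumulation is pure integer addition, the result does not depend on the order of the terms: `exact_dot([1e16, 1.0, -1e16], [1.0, 1.0, 1.0])` gives 1.0 for any permutation of the inputs, whereas accumulating the same terms left to right in double precision loses the 1.0.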

4288 bits is 67 64-bit words, which sounds slightly insane, but it is actually not too bad in its primary use cases.  

  • Using 64-bit ADD and ADC, it takes 67 instructions and 133 cycles to add two of these accumulators together.  Since ADC has a 2-cycle latency on most processors you can do two of these in parallel in 134 cycles.
  • Adding products of two 64-bit floating-point numbers into the accumulator requires updating either 2 or 3 64-bit fields in the accumulator, depending on the alignment of the 104-bit intermediate product.  (This is the part that looks similar to the IFMA instructions.)   Carries out of the high-order 64-bit word will be rare, and carries to the next field beyond that will be extraordinarily rare (only occurring when all 64 bits are set before the carry arrives).   I have not worked out the implementation in detail, but it should be possible to use multiple cores to update the accumulator as fast as you can load two streams of 64-bit floating-point numbers from memory.
  • On a clustered system it turns out that the time required to send a 67-word exact dot product accumulator is almost the same as the time required to send a single 8-Byte double-precision accumulator -- most of the time is latency and overhead, not actual data transmission time.   With FDR Infiniband, for example, sending the extra 66 64-bit words should add about 0.1 microseconds to the ~1 microsecond nominal cost of sending a short message.
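
For the first bullet, the add-with-carry chain over the 67 words looks like this when modeled in Python (one ADD followed by 66 ADCs at 2 cycles each is consistent with the 133-cycle figure). The names are made up for the sketch, and a little-endian list of integers stands in for the accumulator in memory.

```python
WORDS = 4288 // 64          # 67 words of 64 bits each
MASK64 = (1 << 64) - 1


def add_accumulators(a, b):
    """Add two little-endian lists of 64-bit words, rippling one carry bit."""
    assert len(a) == len(b) == WORDS
    out, carry = [], 0
    for wa, wb in zip(a, b):    # the first step acts as ADD, the rest as ADC
        s = wa + wb + carry
        out.append(s & MASK64)
        carry = s >> 64         # always 0 or 1
    return out, carry           # a final carry of 1 would signal overflow
```

Each step depends on the carry from the previous one, which is why the chain is bound by ADC latency rather than throughput.
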
"Dr. Bandwidth"

>>I think you would need to identify when the low-order 64-bit output overflows 52 bits, since that needs to be propagated to the upper 52 bits.  Hmmm...  maybe you can just do the accumulation for the full vector length, then at the very end use packed bitwise operations to extract the upper 12 bits of the low-order 64-bit accumulator and add them to the high-order 64-bit accumulator fields.

Not having read the paywalled description linked in #16, and speculating a bit, I think it goes along these lines:

In traditional multi-word integer addition, there is a single carry bit that must be propagated immediately into the next-higher word, because the carry flag is reused by that next operation.

In the scheme proposed, the carry can potentially be 12 bits wide, and is not held in a fleeting condition register but stored along with the result data of the partial accumulation (i.e. in the extra 12 bits of the 64-bit field that holds the 52-bit partial result). This carry can thus be re-used 4095 times without loss of value, meaning that carry propagation across the width of the multi-52-bit-word value is needed only at relatively long intervals (either every 4096 accumulations, or potentially whenever any 12-bit carry field exceeds an easily determinable upper value).
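
A small Python sketch of this deferred-carry idea, under the assumption that a multi-precision number is stored as 52-bit limbs, one per 64-bit field. The function names are hypothetical and a list of Python integers stands in for the vector of fields.

```python
LIMB_BITS = 52
LIMB_MASK = (1 << LIMB_BITS) - 1
MAX_LAZY_ADDS = 1 << (64 - LIMB_BITS)   # 4096 adds of 52-bit limbs fit in 64 bits


def to_limbs(n, count):
    """Split a non-negative int into `count` little-endian 52-bit limbs."""
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def lazy_sum(numbers, count):
    """Add limb-wise with no carry handling, then propagate carries once."""
    assert len(numbers) <= MAX_LAZY_ADDS
    fields = [0] * count                       # each models a 64-bit field
    for n in numbers:
        for i, limb in enumerate(to_limbs(n, count)):
            fields[i] += limb                  # independent per lane, no carry
    assert all(f < 1 << 64 for f in fields)    # the 12 spare bits were enough
    carry = 0
    for i in range(count):                     # single deferred carry pass
        total = fields[i] + carry
        fields[i] = total & LIMB_MASK
        carry = total >> LIMB_BITS
    return fields, carry


def from_limbs(fields, carry=0):
    """Reassemble the integer value from normalized limbs and a final carry."""
    value = sum(f << (LIMB_BITS * i) for i, f in enumerate(fields))
    return value + (carry << (LIMB_BITS * len(fields)))
```

The inner additions have no dependency between lanes, which is what would make them a fit for SIMD; only the final pass is sequential.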

Jim Dempsey
