AVX-512 is a big step forward - but repeating past mistakes!

Posted by Agner

AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems and AVX512 is better, so I am quite happy that AVX512 seems to be replacing the Knights Corner instruction set. But there are still some shortsighted issues that are likely to cause problems for later extensions.

We have to learn from history. When the 64-bit mmx registers were replaced by the 128-bit xmm registers, nobody thought about preparing for the predictable next extension. The consequence of this lack of foresight is that we now have the complication of two versions of all xmm instructions and three states of the ymm register file. We have to issue a vzeroupper instruction before every call and return to ABI-compliant functions, or alternatively make two versions of all library functions, with and without VEX.

Such lack of foresight can be disastrous. Unfortunately, it looks like the AVX512 design is similarly lacking foresight. I want to point out two issues here that are particularly problematic:

  1. AVX512 does not provide for clean extensions of the mask registers
     
  2. The overloading of the register extension bits will mess up possible future expansions of the general purpose register space

First the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.
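A minimal scalar sketch of the save/restore problem (plain Python emulating the 16-bit mask move semantics as I read the manual; the register contents are made-up illustrative values):

```python
def kmovw_store(k):
    """Emulate saving a mask register with the 16-bit move: only the low 16 bits survive."""
    return k & 0xFFFF

def kmovw_load(saved):
    """Emulate restoring: the upper 48 bits are zeroed, not restored."""
    return saved & 0xFFFF

# A hypothetical future 64-bit mask value (e.g. one bit per byte of a 512-bit vector)
k1 = 0xDEADBEEF12345678
restored = kmovw_load(kmovw_store(k1))
print(hex(restored))   # 0x5678 -- the upper 48 bits are lost
```

So any callee-save or interrupt-handler convention built on the 16-bit move silently truncates a future wider mask.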

It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet, we can already predict that 64 bits will be insufficient within a few years. There seem to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024-bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will be needing another clumsy patch at that time.
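The arithmetic behind this prediction is simple: a mask needs one bit per vector element, so the required mask width is the vector width divided by the element width:

```python
def mask_bits(vector_bits, element_bits):
    """One mask bit per element: the required mask register width."""
    return vector_bits // element_bits

# AVX512 today: 512-bit vectors of 32-bit elements -> 16 mask bits
print(mask_bits(512, 32))    # 16
# 512-bit vectors of 8-bit integers would already need 64 bits
print(mask_bits(512, 8))     # 64
# A 1024-bit vector of 8-bit integers needs a 128-bit mask
print(mask_bits(1024, 8))    # 128
```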

Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.

So let me propose: Drop the new mask registers and the instructions for manipulating them. Allow seven of the vector registers (e.g. xmm1 - xmm7 or xmm25 - xmm31) to be used as mask registers. All mask functionality will be the same as currently specified by AVX512. This will make future extensions problem-free and allow the synergy of using the same instructions for manipulating vectors and manipulating masks.
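A scalar sketch of the proposal: the mask lives in an ordinary vector register, but only its low bits are used, one bit per element, so the masking semantics stay exactly as AVX512 specifies them (both the zeroing and merging variants; the function name is mine, for illustration):

```python
def masked_add(a, b, mask, zeroing=True, old_dest=None):
    """Per-element add under a one-bit-per-element mask, as in AVX512.
    mask is an integer whose low len(a) bits select the active lanes."""
    result = []
    for i in range(len(a)):
        if (mask >> i) & 1:
            result.append(a[i] + b[i])          # active lane: compute
        elif zeroing:
            result.append(0)                    # zeroing-masking: clear inactive lane
        else:
            result.append(old_dest[i])          # merging-masking: keep old destination
    return result

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(masked_add(a, b, 0b0101))                                        # [11, 0, 33, 0]
print(masked_add(a, b, 0b0101, zeroing=False, old_dest=[9, 9, 9, 9]))  # [11, 9, 33, 9]
```

Since a mask always needs fewer bits than the vector it masks, the low bits of an xmm register can never run out of room, no matter how wide the vectors grow.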

The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact, it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension. Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extensions for their original purpose. We need two more bits (B' and X') to make a clean extension of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and use bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.
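To make the bit arithmetic concrete, here is a sketch of how a register number is assembled from a 3-bit field plus extension bits (the B/B' names follow the discussion above; the actual EVEX field layout is more involved than this illustration):

```python
def reg_index(base3, ext1, ext2=0):
    """Combine a 3-bit register field with one or two extension bits.
    One extension bit (REX-style) gives 16 registers; a second
    (the proposed B'/X') would give a clean 32."""
    return (ext2 << 4) | (ext1 << 3) | base3

# One extension bit reaches r8-r15:
print(reg_index(0b101, 1))      # 13, i.e. r13
# A clean second extension bit would reach r16-r31:
print(reg_index(0b101, 1, 1))   # 29, a hypothetical r29
```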

There are other less attractive solutions in case the Knights Corner bit cannot be used, but anyway I think it is important to keep the possibility open for future extensions of the register space instead of messing up everything with short-sighted patches.

I will repeat what I have argued before, that instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these.

 

www.agner.org
Posted by QIAOMIN Q. (Intel)

Good catch, I agree with your last comment.

---QIAOMIN.Q

Posted by c0d1f1ed

Hi Agner,

Great to see you on this forum! Here are my interpretations of why certain AVX-512 design decisions were made:

Quote:

Agner wrote:
  • AVX512 does not provide for clean extensions of the mask registers

First the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.

The currently defined instructions which operate on the mask registers all start with a "W" to indicate they're 16-bit. I imagine that AVX-1024 would have "D" variants to operate on 32 bits. I don't see much of a problem there, and 64-bit variants should be possible as well.

I don't see any value in giving some of the mask registers callee-save status. AVX-512 is intended to accelerate the execution of data-parallel loops (i.e. executing multiple iterations with each instruction in a SIMD fashion). There should be no function calls in the loop, unless they can be inlined (meaning there is no actual call). If things can't be inlined then the loop is not likely to be suitable for AVX-512 in the first place. But if you look at GPUs and how they can still use their wide vectors to execute quite complex compute shaders, there really aren't many data-parallel workloads which can't be vectorized.

I didn't double-check, but I don't think interrupt handlers or drivers can use the mask registers, or any part of AVX-512 for that matter. In my opinion any actual computing should happen in user space anyway. It is why device drivers have a user space component now as well, which improves both robustness and performance.

Quote:

It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet, we can already predict that 64 bits will be insufficient within a few years. There seem to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024-bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will be needing another clumsy patch at that time.

GPUs do not have instructions which operate on 8-bit or 16-bit data (except possibly as the inner fields of a few vector-within-vector instructions). The way they handle 8-bit and 16-bit variables in the source code is to always expand them to 32-bit. Each mask bit therefore controls a 32-bit field. Yet they are clearly not significantly limited by this approach as they can process trillions of operations per second. So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields. If more computing power is desired it should be achieved through widening the vectors, adding more vector units per core, and/or adding more cores. This would benefit all workloads, not just 8/16-bit ones. Vector-within-vector instructions can be very useful but only a limited set is required and the mask bits can still apply to whole 32-bit fields.
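A sketch of the expansion approach described above: 8-bit source data is widened so that each value occupies its own 32-bit lane, and one mask bit then controls one whole lane (plain Python standing in for the GPU lanes; function names are mine, for illustration):

```python
def widen_u8_to_lanes(data):
    """Expand 8-bit values to one 32-bit lane each, as a GPU would."""
    return [x & 0xFF for x in data]

def lanes_masked_scale(lanes, factor, mask):
    """Each mask bit controls one whole 32-bit lane."""
    return [(x * factor) & 0xFFFFFFFF if (mask >> i) & 1 else x
            for i, x in enumerate(lanes)]

bytes_in = [5, 6, 7, 8]
lanes = widen_u8_to_lanes(bytes_in)
print(lanes_masked_scale(lanes, 10, 0b0011))   # [50, 60, 7, 8]
```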

So it's important to realize that AVX-512 is not the 512-bit extension of AVX2. It is intended for an SPMD approach the way a GPU does it. Some multimedia applications which operate mainly on 8-bit and 16-bit data might be better off sticking with AVX2. Just like on a GPU, AVX-512 implementations will probably have a relatively low bandwidth per FLOP ratio. That's fine if you're doing lots of computations on relatively little data with lots of reuse of intermediate results. If on the other hand you're streaming data through and only perform a few operations on it like in some multimedia applications, then AVX2 might already be bandwidth limited so there would be no use for extending it to 512-bit for such applications.

Quote:

Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.

I don't think that's an option. Two FMA instructions per cycle per core already result in 6 vector register file accesses per cycle (many operands could come from the bypass network, but that is affected too). Using the vector registers as masks as well would require even more accesses per cycle, and more dependency checking, and I doubt that's reasonably feasible (without exceeding power consumption targets). So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.

The only other option would be to not do the mask operation as part of the same instruction. The vblendv instruction has been quite useful for implementing SIMD branching in SSE and AVX. But as you know it takes two uops; one to extract the MSBs from a vector register to form a compact mask and one vblend operation (where the compact mask is passed in where the immediate operand would otherwise be used). So basically it's three times slower than doing the mask operation as part of an arithmetic instruction. Perhaps that can be improved on though since FMA takes three full width input operands. Also, branching is relatively rare in data-parallel code and only results which are used after the branch merge point need vblendv operations (using mask registers would typically also affect temporary results which aren't used outside of the basic block).
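The two-uop sequence described here can be sketched in scalar code: first the sign bits of the mask vector are compacted into an integer (the movemask step), then a blend selects per element (helper names are mine, modeled loosely on vmovmskps/vblendv):

```python
def movemask(vec, width=32):
    """Compact the MSB of each element into an integer bitmask (like vmovmskps)."""
    m = 0
    for i, x in enumerate(vec):
        if (x >> (width - 1)) & 1:
            m |= 1 << i
    return m

def blend(a, b, mask):
    """Pick b[i] where the mask bit is set, a[i] otherwise (like a blend)."""
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(len(a))]

mask_vec = [0x80000000, 0, 0x80000000, 0]   # full-width mask, MSBs set in lanes 0 and 2
m = movemask(mask_vec)
print(bin(m))                                    # 0b101
print(blend([1, 2, 3, 4], [10, 20, 30, 40], m))  # [10, 2, 30, 4]
```

With AVX-512's inline masking, both steps disappear into the arithmetic instruction itself.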

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power. Note also that this wouldn't be possible if the masks were full width vector registers which control how the result gets blended into the destination register on a per-bit basis instead of per-field. So I'm sure that Intel has evaluated these options and found that dedicated compact mask registers were well worth it.

I don't think it adds much compiler complication. After all you can start with straightforward code which uses blend instructions, and then iteratively optimize it by making use of the mask registers. Note again that you shouldn't have to deal with saving mask registers across calls at all.

Quote:

  • The overloading of the register extension bits will mess up possible future expansions of the general purpose register space

The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact, it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension. Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extensions for their original purpose. We need two more bits (B' and X') to make a clean extension of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and use bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.

I don't think there's much of a point in having 32 general purpose registers for x86, even when looking far into the future. Having 16 registers instead of 8 (or 6 if you don't count the stack registers) made a measurable difference, but 32 registers only make sense if you have lots of temporary results or you wish to use unrolling. Typical scalar code is actually quite branchy so you barely ever have long basic blocks which have a use for more than 16 registers. And the exceptions are more often than not data-parallel code, for which you can/should use AVX-512 (which is why it does have 32 registers).

Yes, ARMv8 does have 32 scalar registers but the effort for achieving that was much lower than what it would cost x86, and I suspect that they didn't recognise that any code which has some benefit from 32 registers would actually be better off with wider vector instructions. So I find that AVX-512 does a good job at creating a balanced homogeneous high-throughput ISA with plenty of opportunity for future extensions.

That said you bring up some interesting ideas for instruction encodings but I think they're more valuable to be reserved for other extensions than for having 32 general purpose registers.

Quote:

I will repeat what I have argued before, that instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these ones.

It's great that Intel now shares information on future ISA extensions ahead of the hardware being available, but I'm not sure if it's realistic or useful to go a step further and engage in discussions on an open forum. Many companies' employees are restricted or strongly discouraged from posting on public forums due to the risk of revealing company IP or revealing plans to the competition. So Intel's software partners would just discuss things privately instead. Also, the independent people who understand all the aspects that are impacted are few and far between. With all due respect for your much appreciated evaluations of x86 architectures, it looks like even you didn't fully understand how AVX-512 is intended to be used and why certain design decisions were made. Don't get me wrong, I'm not claiming to understand every aspect either, and we'd probably need information that only Intel has access to in order to make a full assessment. Also note that from a PR point of view Intel probably doesn't want to sound desperate by explicitly asking the general public for help.

So the openness they have right now and our ability to discuss things here is probably the best we should expect and certainly offers a lot of room for suggestions. So please continue to visit these forums to share your valuable opinions.

Posted by Agner

Thanks for trying to clarify this.
You talk about how GPUs do things. AVX512 is not only for GPUs, it is planned to be implemented in mainstream CPUs in 2015 as I understand it. It is going to be used for all kinds of purposes. This would certainly include data types that fit into 8 or 16 bits, such as text processing, encryption, file compression, DNA analysis, finite field algebra, and probably a lot of other applications that we haven't thought of.

The Knights Corner/Xeon Phi instruction set, which has no official name and no CPUID bit, was introduced as an instruction set for a MIC coprocessor, but it was designed so that it can be combined with the existing x86 instruction set. Later it was modified and renamed to AVX-512. Knights Corner was based on the old Pentium microarchitecture, and strangely supports the obsolete x87 but not SSE. It can convert integers of all sizes to floats when loading from memory and convert back to integer when saving. This means that a vector of 16 8-bit integers is handled as 16 32-bit floats. This is certainly an inefficient and power-wasting way of handling 8-bit data.

With AVX-512 we have to handle 8-bit and 16-bit integers either with AVX2 using full-size masks or by converting to 32-bit integers and using AVX-512 with bit masks. It is inconvenient to have two different kinds of masks. I am sure that the software community will ask for support for 8-bit and 16-bit integers in zmm registers. In fact, some have already asked for this.

Quote:

Using the vector registers as masks as well would require even more accesses per cycle, and more dependency checking, and I doubt that's reasonably feasible (without exceeding power consumption targets). So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.

Intel processors have a limitation of two input dependencies per µop except for FMA instructions. AMD has never had any such limitation, I don't know why. You seem to know about internal technical details that I have no access to. As I understand you, it is easier to check for an extra input dependency if it is from a different register file?

Quote:

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power. Note also that this wouldn't be possible if the masks were full width vector registers which control how the result gets blended into the destination register on a per-bit bases instead of per-field. So I'm sure that Intel has evaluated these options and found that dedicated compact mask registers were well worth it.

My idea is to use only one bit per field. When masking a vector of 32 floats, I would use only the lower 32 bits of an xmm register. Why do you think it is easier to turn off unused parts of the execution unit when the mask bits are in a mask register than in an xmm register or general purpose register?

Quote:

Many companies' employees are restricted or strongly discouraged from posting on public forums due to the risk of revealing company IP or revealing plans to the competition.

That's an old fashioned way of thinking. The trend goes towards more openness. People want open standards today, not de facto industry standards decided through short-sighted market considerations.

The x86 instruction set encoding is an ugly patchwork generated by a long history of short-sighted solutions and dirty patches. AVX512 only makes this worse. The instruction length decoder is a power-hungry bottleneck due to all the complicated workarounds. Things would be much simpler today if Intel had made more future-oriented decisions along the way.

For example, there are three bytes in the opcode map that are used for old undocumented opcodes dating back to the first 8086 processor. I can't believe that all processors today still support these undocumented opcodes. If Intel had made an official announcement 30 years ago that these undocumented opcodes could not be used, then they could have disabled them 25 years ago and they would be available for extensions today. Instead, they are doing all kinds of desperate tricks to squeeze more and more instruction codes into an already overloaded opcode map. And yet, there is still no official statement about these undocumented opcodes. Why not disable them today so that they will be available for other purposes in some unknown future? (They are disabled in 64-bit mode anyway.)

In the same way, I think we need a decision about the future of the x87 and mmx registers. Many people regard them as obsolete, and I think they are quite costly to implement in the CPU pipeline. Microsoft tried to disable them in 64-bit Windows, but changed their mind. If the x87 registers should ever be disabled then it has to be announced maybe 10 or 20 years in advance. That's why it is not too early to discuss the future of x87 and perhaps an alternative solution for high precision math.

We need long term planning, but AVX512 is repeating past mistakes by making short-term patches that cannot easily be extended, should the need arise in the future.

www.agner.org
Posted by iliyapolak

>>>So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.>>>

Do you mean architectural registers or registers of physical register file?

Btw very interesting post.

Posted by iliyapolak

>>>I didn't double-check, but I don't think interrupt handlers or drivers can use the mask registers, or any part of AVX-512 for that matter. In my opinion any actual computing should happen in user space anyway.>>>

I think that a kernel-mode driver performing floating point calculations which involve AVX instructions will need to preserve its floating point context by calling KeSaveExtendedProcessorState. Now regarding future AVX-512 usage in kernel mode, it depends on support from the OS. IIRC it is unsafe to use AVX registers during an ISR routine.

Moving some of the functionality to a user-space driver component is related to the reduction of costly context switches between user-mode and kernel-mode components.

Posted by c0d1f1ed

Quote:

Agner wrote: Thanks for trying to clarify this.

You talk about how GPUs do things. AVX512 is not only for GPUs, it is planned to be implemented in mainstream CPUs in 2015 as I understand it. It is going to be used for all kinds of purposes. This would certainly include data types that fit into 8 or 16 bits, such as text processing, encryption, file compression, DNA analysis, finite field algebra, and probably a lot of other applications that we haven't thought of.

Yes, AVX-512 definitely seems intended to be implemented in CPUs. And that's exactly what I'm talking about. But you shouldn't forget where the design ideas originated from, as they determine the intended manner in which they should be used (but not limit what they should be used for)!

GPUs achieve their phenomenal computing power through multiple very wide SIMD units in which each lane processes one instance of a scalar 'program'. This can be a pixel shader, a vertex shader, a compute shader, etc. These programs are really just the inner code of a (nested) loop; they are executed on thousands of elements. So multiple (scalar) iterations of this loop are bundled together to form vectorized code. But each lane of the vector ALUs on which it is executed is 32-bit, even when the program processes 8-bit or 16-bit values.

Again note that this is all the GPU needs to achieve its massive throughput. It can do 8-bit and 16-bit processing perfectly well, including for all the applications you mention. So AVX-512 brings that same technology straight into the CPU cores. There's no reason to suddenly go and change the fundamental computing concepts for these SIMD units. Each lane processes one iteration of a loop. If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

Trying to instead process 16 x 32-bit and 32 x 16-bit and 64 x 8-bit breaks down this concept of one loop iteration per lane. Switching the number of things you process in parallel requires lots of data swizzling instructions which rapidly defeats the benefits of processing more elements in the first place. The only valuable compromise is having small 32-bit or 64-bit vector operations within each lane, thus forming vector-within-vector instructions. Only a limited number of those would be of value (possibly inspired by MMX).

The only difference between AVX-512 and the GPU is that now this computing power will become part of the CPU cores. And that's the really revolutionary part. Using GPUs for generic applications has been notoriously difficult. Whenever your software has a big data processing loop the developer would have to isolate it and rewrite it in an SPMD fashion (the loop iteration is removed but you tell an API you want N instances of it, which could be executed across many SIMD lanes and many cores and many threads). There's significant overhead in migrating tasks to a different processor type, and the tasks have to be large enough to potentially compensate for that. So often the developer has to find ways to combine multiple algorithms into even bigger loops. This is very hard and never a perfect process because scalar and vectorizable code become entangled. The inner code can't be too big either because GPUs have very limited register and cache storage per scalar strand. It gets even worse when trying to target multiple architectures with wildly different characteristics. So it's an impossible balancing act for anything that isn't as embarrassingly parallel as graphics. Heterogeneous GPGPU is destined to fail despite AMD's gallant efforts to lower part of the overhead.

Intel appears to still want to tap into the good bits of a GPU architecture, but makes it developer-friendly by unifying the CPU and GPU into a homogeneous architecture. It even becomes compiler-friendly. But to get back to my point: don't try to make AVX-512 into something it's not; that would defeat this unification with GPU technology.

Quote:

The Knights Corner/Xeon Phi instruction set, which has no official name and no CPUID bit, was introduced as an instruction set for a MIC coprocessor, but it was designed so that it can be combined with the existing x86 instruction set. Later it was modified and renamed to AVX-512. Knights Corner was based on the old Pentium microarchitecture, and strangely supports the obsolete x87 but not SSE.

Intel's first attempt at stealing the thunder of the GPU companies was to try and build a discrete GPU with CPU qualities: Larrabee. This was a mistake. It would have taken billions of dollars to get enough of a market share for it that developers would take a peek at it for anything other than graphics. But it's still heterogeneous, and still suffers from all its inherent overhead. Meanwhile Intel already has a huge market share for its CPUs and homogeneous computing is much more welcomed by developers. So they can just incrementally introduce GPU technology into the CPU cores at relatively limited cost and risk.

AVX-512 is a bit more revolutionary than evolutionary because it breaks away from the legacy way of thinking about CPU vector instructions, which have tightly packed 8-bit and 16-bit data. GPUs show that this isn't necessary. In fact it causes a great deal of complication with ever wider vectors. So it's a necessary break and developers who program at the assembly level have to get educated about the change in paradigm.

Quote:

It can convert integers of all sizes to floats when loading from memory and convert back to integer when saving. This means that a vector of 16 8-bit integers is handled as 16 32-bit floats. This is certainly an inefficient and power-wasting way of handling 8-bit data.

Is it really? AMD's top dog GPU has 2816 scalar 32-bit ALUs on a single die. And it's only going to continue to increase! So ALUs are cheap. Even if we look at the CPU the scalar ALUs are all 64-bit and yet we use them for 16-bit and 8-bit values all the time. It's not worth it trying to use 16-bit and 8-bit ALUs instead. At a system level it's insignificant and not power-wasting at all to have 64-bit ALUs. Of course you can stuff 8 x 8-bit in 64-bit registers and process them with 64-bit vector ALUs, but that's not hugely more power-efficient than expanding them to 8 x 32-bit, especially when you take into account that those 256-bit vector ALUs are mostly used for processing actual 32-bit elements and the whole thing can scale up to 512-bit and beyond if it's not encumbered by having to cater for tightly packed 8-bit data.

And again, vector-within-vector instructions offer the best of both vector paradigms for certain use cases.

Quote:

With AVX-512 we have to handle 8-bit and 16-bit integers either with AVX2 using full size masks or converting to 32-bit integers and use AVX-512 with bit masks. It is inconvenient to have two different kinds of masks. I am sure that the software community will ask for support for 8-bit and 16-bit integers in zmm registers. In fact, some have already asked for this.

I don't think it's that inconvenient to have different kinds of masks. Try to focus your reasoning on programming for a single lane in a high-level language. Either you use an if() statement, which at the vector assembly level can use the dedicated mask registers, or you use AND/OR operators on integer masks, which become full vectors at the assembly level after vectorization. So they really are very different but well-understood concepts in a high-level language. It only gets confusing or inconvenient when you try to reason at the assembly level.
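A quick sketch of the two forms in C (hypothetical per-lane view; the assumption is that a vectorizer would map the first to k-register predication and the second to ordinary full-width vector AND):

```c
#include <stdint.h>

/* Form 1: control-flow masking. After vectorization each iteration is a
 * lane, and the if() condition becomes a bit in a dedicated mask register. */
void relu_if(const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        if (x[i] > 0.0f)
            y[i] = x[i];
        else
            y[i] = 0.0f;
    }
}

/* Form 2: explicit integer masks. The AND is an ordinary data operation,
 * so after vectorization the mask is a full-width vector, not a k register. */
void clear_low_bits(const uint32_t *x, uint32_t *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] & 0xFFFFFF00u;   /* mask lives in the data domain */
}
```

The two masks never meet at the source level, which is the point: the distinction only surfaces below the compiler.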

I hate to say this, but the time when programming in assembly made you smart or cool is coming to an end. It's not a wise thing to do any more, and mostly only compiler writers should concern themselves with it. But don't despair. There was a time when people who programmed in assembly were sneered at and looked down on by the people who were programming in binary. It's a good thing that we got past that, and many great things came out of higher-level programming. Only the writers of compiler back-ends should worry about binary. Don't get me wrong, I find it incredibly valuable for high-level programmers to know assembly, but only insofar as it makes them better high-level programmers. I wrote some compilers myself, basically to get away from writing assembly.

AVX-512 will create a new era in computing, but the vast majority of developers should be able to use it effortlessly or even unknowingly in a high-level language. So only issues at the assembly level that would affect things in a high-level language should be brought to Intel's attention. Having two types of masking at the assembly level is a trivial concept in a high-level language, and only a minor inconvenience for compiler writers.

Quote:

Quote:

Using the vector registers as masks as well would require even more accesses per cycle, and more dependency checking, and I doubt that's reasonably feasible (without exceeding power consumption targets). So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.

Intel processors have a limitation of two input dependencies per µop, except for FMA instructions. AMD has never had any such limitation; I don't know why. You seem to know internal technical details that I have no access to. As I understand you, it is easier to check for an extra input dependency if it comes from a different register file?

Actually, I think your documents on microarchitecture are what first made me aware of register file port limits. Anyway, since then I've also gained some common knowledge of digital circuit design and what makes circuits more complex or power-hungry. I can highly recommend "Digital Integrated Circuits" by Rabaey et al.

I don't have any knowledge of Intel's design but I believe multi-banking is now the norm for creating multi-ported register files. But you still have to balance the number of banks against the number of bank conflicts, and each bank can be multi-ported itself. Either way using vector registers as masks and expecting the masking to be done in the same pipelined instruction requires extra reads from the same register file, which definitely makes them more complex and power hungry. Note that GPUs even avoid the complexity of multi-banking, by reading the operands over multiple cycles. They have multiple register files which are independent because they are used for different threads/fibers. CPUs can't do that because they are expected to have high single-threaded performance, but clearly for AVX-512 to compete with GPUs in performance/Watt they need to minimize the required ports per register file. Hence having a separate register file for the masks is a good thing.

The scheduler has to check for the extra dependency on the mask register regardless of which register file it's from; I forgot about that before. If they keep the unified scheduler, then I imagine the result bus is shared between scalar and vector units, so that doesn't cause any extra routing. There might be some clever reuse of the result bus for flags.

Quote:

Quote:

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power. Note also that this wouldn't be possible if the masks were full-width vector registers which control how the result gets blended into the destination register on a per-bit basis instead of per-field. So I'm sure that Intel has evaluated these options and found that dedicated compact mask registers were well worth it.

My idea is to use only one bit per field. When masking a vector of 32 floats, I would use only the lower 32 bits of an xmm register. Why do you think it is easier to turn off unused parts of the execution unit when the mask bits are in a mask register than in an xmm register or general purpose register?

Sorry, I didn't get that you'd compact the mask into the lower part of a vector register. But that would mean that depending on the instruction the mask bits would apply to 8-bit, 16-bit, 32-bit or 64-bit fields. They'd have to be routed to very different locations based on the instruction type, and it gets worse as the SIMD width gets larger. Routing things that far with different path lengths depending on the instruction type is very difficult at ~4 GHz. Note that GPUs don't have to route masks across lanes at all. Each lane just operates as if it's a scalar core, and they each can have their own register file for the masks, which are single bits per scalar core. They practically don't have to know they're part of a SIMD array, since there's no data being exchanged between lanes (some exceptions exist).

So for AVX-512 to be power efficient and scalable you need to keep mask bits close to the lanes they affect.

Quote:

Quote:

Many company's employees are restricted or strongly discouraged from posting on public forums due to the risk of revealing company IP or revealing plans to the competition.

That's an old fashioned way of thinking. The trend goes towards more openness. People want open standards today, not de facto industry standards decided through short-sighted market considerations.

I applaud the trend towards openness. I'm just trying to be realistic and I'm afraid there's still a lot of old fashioned thinking. A large portion of knowledgeable people just won't contribute to an open forum, for various reasons. There are also lots of armchair chip designers who, while intelligent, just lack the inside knowledge for why their proposal won't offer a good cost/gain balance. You can't expect Intel engineers to engage with them for the small chance of picking up a valuable idea. Don't get me wrong, hardware engineers should talk a lot to software engineers, but I'm not sure an open forum is the single most efficient way to exchange information on needs and limitations. It is one end of the communication spectrum, with a lot of noise, and I think the open discussions we have right here are the best interaction with Intel that individual developers can hope for.

Quote:

The x86 instruction set encoding is an ugly patchwork generated by a long history of short sighted solutions and dirty patches.

I agree it's a patchwork and not a pretty one, but technology doesn't have to be pretty (on the inside) to be commercially successful. x86 has been a huge success for multiple decades because it's an extendable patchwork. Could you have designed it significantly better while still keeping it successful for the needs during all those years? Sure, in hindsight someone could easily design a better 64-bit ISA than x86-64 for today's needs. But that architecture would have been very inefficient to implement a decade or two ago. So x86 has managed to stay relevant by extending the patchwork to the best of the possibilities available at each point in time. It has fantastic backward compatibility, and the value of this should not be underestimated. The advantages of a new ISA would not outweigh the disadvantages of breaking backward compatibility. There's still room for more patchwork before we get to that point.

Note that ARM is relatively young but it's starting to show signs of becoming a patchwork as well and it's going to get worse if it wants to survive and retain backward compatibility. It's just an inevitability even for the smartest engineers in the world.

Quote:

AVX512 only makes this worse. The instruction length decoder is a power-hungry bottleneck due to all the complicated workarounds. Things would be much simpler today if Intel had made more future-oriented decisions along the way.

First of all, let's not forget that decoding instructions with an average length of maybe 6 bytes shouldn't cost more power than loading the data for them and executing them 512 bits wide. Also, Intel's proven ability to compete with ARM designs in the mobile market with both Haswell and Atom architectures shows that the ISA isn't a huge limiting factor.

And again, it's questionable whether they could have made more future-oriented decisions while still being as successful every step of the way. The instructions you might find useless today must have had a significant purpose at some point. Or maybe they even thought they were being future-oriented, but things went in a different direction than expected. The things you're proposing today could be equally well intentioned but turn out to be a limiting factor in the future.

Quote:

For example, there are three bytes in the opcode map that are used for old undocumented opcodes dating back to the first 8086 processor. I can't believe that all processors today still support these undocumented opcodes. If Intel had made an official announcement 30 years ago that these undocumented opcodes could not be used, then they could have disabled them 25 years ago and they would be available for extensions today. Instead, they are doing all kinds of desperate tricks to squeeze more and more instruction codes into an already overloaded opcode map. And yet, there is still no official statement about these undocumented opcodes. Why not disable them today so that they will be available for other purposes in some unknown future. (They are disabled in 64-bit mode anyway).

I'm not sure which bytes you're talking about exactly, but some have been repurposed for 64-bit mode so that's exactly the kind of useful deprecation you're talking about. Also repurposing them for 32-bit mode would be useless since everything is evolving toward 64-bit.

Quote:

In the same way, I think we need a decision about the future of the x87 and mmx registers. Many people regard them as obsolete, and I think they are quite costly to implement in the CPU pipeline. Microsoft tried to disable them in 64-bit Windows, but changed their mind. If the x87 registers should ever be disabled then it has to be announced maybe 10 or 20 years in advance. That's why it is not too early to discuss the future of x87 and perhaps an alternative solution for high precision math.

Agreed that it's the right time to think about a new purpose for that opcode space. AMD has already dropped 3DNow! support. x87 and MMX are still used quite a bit by legacy software, though. I don't think they're costly to implement; after all, they were introduced many years ago, and by now they only cost a tiny fraction of the transistor budget, which keeps getting cheaper. But their opcode space could be quite valuable.

User avatar: Clothoid

Quote:

Again note that this is all the GPU needs to achieve its massive throughput. It can do 8-bit and 16-bit processing perfectly well, including for all the applications you mention. So AVX-512 brings that same technology straight into the CPU cores. There's no reason to suddenly go and change the fundamental computing concepts for these SIMD units. Each lane processes one iteration of a loop. If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

An important piece of functionality that the ISA has to support to enable this is gather operations that upconvert 8-bit or 16-bit data types into 32-bit types, and scatter operations that downconvert from 32-bit to 8-bit and 16-bit data types. The current Xeon Phi ISA and the original Larrabee ISA supported this functionality, but I am not sure whether it was forgotten in AVX-512.

At least when working with the latest Intel compiler, 14.0 SP1 Update 1, my impression is that upconversion and downconversion with gather/scatter are missing in AVX-512.

User avatar: andysem

Quote:

c0d1f1ed wrote:

GPUs do not have instructions which operate on 8-bit or 16-bit data (except possibly as the inner fields of a few vector-within-vector instructions). The way they handle 8-bit and 16-bit variables in the source code is to always expand them to 32-bit. Each mask bit therefore controls a 32-bit field. Yet they are clearly not significantly limited by this approach, as they can process trillions of operations per second. So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields. If more computing power is desired, it should be achieved by widening the vectors, adding more vector units per core, and/or adding more cores. This would benefit all workloads, not just 8/16-bit ones. Vector-within-vector instructions can be very useful, but only a limited set is required, and the mask bits can still apply to whole 32-bit fields.

So it's important to realize that AVX-512 is not the 512-bit extension of AVX2. It is intended for an SPMD approach the way a GPU does it. Some multimedia applications which operate mainly on 8-bit and 16-bit data might be better off sticking with AVX2. Just like on a GPU, AVX-512 implementations will probably have a relatively low bandwidth per FLOP ratio. That's fine if you're doing lots of computations on relatively little data with lots of reuse of intermediate results. If on the other hand you're streaming data through and only perform a few operations on it like in some multimedia applications, then AVX2 might already be bandwidth limited so there would be no use for extending it to 512-bit for such applications.

I humbly disagree that 8/16 bit operations are not needed. It sounds like "640k will be enough for anyone."

Bandwidth has potential for increase (DDR4, Crystal Well) while CPU clocks and CPI seem to be close to their limits. Even with current technologies, I find it hard to believe that multimedia applications won't benefit from wider vectors, given that 8/16-bit operations are available. After all, if a routine is slower than memcpy then chances are high that it's computation-bound, and this is usually the case in my (multimedia-related) practice. It is often possible to design your data processing workflow to conserve bandwidth and favor more intermediate calculations. It may not be easy, but it makes it possible to increase performance when wider vectors are available. And having wide vectors without being able to operate on the smaller units just feels like a waste to me.
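A small sketch of the "conserve bandwidth, favor intermediate calculations" idea: instead of streaming the buffer through memory twice (one pass to scale, one pass to offset), the two operations are fused into a single pass, so each element is loaded and stored once while the arithmetic per element doubles. The function and its parameters are illustrative:

```c
#include <stdint.h>

/* Fused single-pass transform: halves the memory traffic of the
 * equivalent two-pass (scale pass, then offset pass) pipeline while
 * doing the same arithmetic. Assumes offset >= 0 so the single clamp
 * matches what clamping after each separate pass would produce. */
void scale_then_offset_fused(const uint8_t *src, uint8_t *dst, int n,
                             int scale, int offset)
{
    for (int i = 0; i < n; i++) {
        int v = src[i] * scale + offset;  /* both ops per load/store pair */
        if (v > 255) v = 255;             /* saturate to 8 bits           */
        dst[i] = (uint8_t)v;
    }
}
```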

Quote:

c0d1f1ed wrote:

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power.

I'm not sure clock gating actually provides any benefit in this case. You either use masks plus xmm registers, or only xmm registers, or neither (in the case of generic x86 code). Clock gating only saves power in the second case relative to the first. But if you use xmm registers as masks, you effectively make the first case equivalent to the second, and thus more power-efficient.

Quote:

c0d1f1ed wrote:

I don't think it adds much compiler complication. After all you can start with straightforward code which uses blend instructions, and then iteratively optimize it by making use of the mask registers. Note again that you shouldn't have to deal with saving mask registers across calls at all.

Not sure how much complication it brings, but it requires more from the register allocator and dependency tracker in the compiler. And the compiler may have to spill/restore mask registers in contexts other than function calls. For example, the algorithm may require more masks than there are mask registers in the CPU.

And having to call a function in a SIMD-optimized loop is not that far-fetched an example, actually. For instance, I once implemented some string formatting routines optimized for SSE/AVX which had to call a function to store the partial results in an STL stream. Naturally, I had to vzeroall before the call in the AVX case, and in the SSE case the compiler generated xmm register spills/restores around it. You may argue that this is not a good algorithm for SIMD optimization, but I think otherwise. As long as it's faster than the scalar version, it is a good candidate.
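As a rough sketch of that pattern (all names hypothetical, and the loop body is scalar here; the point is only the structure): a formatting loop periodically crosses an ABI boundary through an opaque sink, and in a real AVX version a vzeroall/vzeroupper plus any spills of live vector or mask state would have to sit right before that call.

```c
#include <stddef.h>
#include <string.h>

typedef void (*sink_fn)(const char *buf, size_t len, void *ctx);

/* Format non-negative integers one per line, flushing each through an
 * opaque callback. In an AVX version, the digit conversion would be
 * vectorized and the call site would need vzeroall/vzeroupper and
 * spills of live vector/mask state. */
void format_numbers(const int *vals, int n, sink_fn flush, void *ctx)
{
    char buf[32];
    for (int i = 0; i < n; i++) {
        int len = 0, v = vals[i];
        char tmp[16];
        do { tmp[len++] = (char)('0' + v % 10); v /= 10; } while (v);
        for (int j = 0; j < len; j++) buf[j] = tmp[len - 1 - j];
        buf[len] = '\n';
        /* ABI boundary: vector state must be made call-safe here */
        flush(buf, (size_t)len + 1, ctx);
    }
}

/* Example sink used below: append the flushed bytes to a memory buffer. */
struct acc { char s[64]; size_t pos; };
void append_sink(const char *b, size_t l, void *c)
{
    struct acc *a = (struct acc *)c;
    memcpy(a->s + a->pos, b, l);
    a->pos += l;
}
```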

User avatar: andysem

Quote:

c0d1f1ed wrote:

Even on the CPU the scalar ALUs are all 64-bit, and yet we use them for 16-bit and 8-bit values all the time; it's not worth trying to use 16-bit and 8-bit ALUs instead. At a system level it is insignificant, and not power-wasting at all, to have 64-bit ALUs. Of course you can stuff 8 x 8-bit in 64-bit registers and process them with 64-bit vector ALUs, but that's not hugely more power-efficient than expanding them to 8 x 32-bit, especially when you take into account that those 256-bit vector ALUs are mostly used for processing actual 32-bit elements, and the whole thing can scale up to 512-bit and beyond if it's not encumbered by having to cater for tightly packed 8-bit data.

And again, vector-within-vector instructions offer the best of both vector paradigms for certain use cases.

Using wider unit sizes and vector-within-vector tricks often requires additional overhead to handle the overflows and saturation that come naturally with native support for narrow unit sizes. Unit size conversion also takes clock cycles.
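To illustrate andysem's point: with packed-byte SIMD a saturating 8-bit add is a single instruction (e.g. paddusb), but with only 32-bit lanes it must be emulated with a widen plus an explicit clamp. A scalar sketch of the emulated version, with hypothetical naming:

```c
#include <stdint.h>

/* Saturating unsigned 8-bit add, emulated through a 32-bit lane:
 * what one packed-byte instruction would do natively now takes a
 * widen, an add, a compare, and a select per element. */
uint8_t addu8_sat_emulated(uint8_t a, uint8_t b)
{
    int32_t wide = (int32_t)a + (int32_t)b;     /* widen to 32-bit lane */
    return (uint8_t)(wide > 255 ? 255 : wide);  /* manual saturation    */
}
```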

User avatar: iliyapolak

Quote:

I can highly recommend "Digital Integrated Circuits" by Rabaey et al.

If you are interested in the design of digital logic, I would like to recommend a fairly advanced book on digital system design and firmware algorithms written by Milos D. Ercegovac: "Digital Systems and Hardware/Firmware Algorithms".

User avatar: iliyapolak

Quote:

(the loop iteration is removed but you tell an API you want N number of instances of it, which could be executed across many SIMD lanes and many cores and many threads)

Actually, it is a GPU kernel whose N instances are executed by N threads, each operating on one data element.

User avatar: Agner

c0d1f1ed wrote:
Quote:

But you shouldn't forget where the design ideas originated from, as they determine the intended manner in which they should be used (but not limit what they should be used for)!

This is the kind of thinking I am warning about. If you have one particular type of application in mind while designing something you risk creating a specialized system that is not suited for other purposes.

Quote:

If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

Why go for a ton when you can have 4 tons? You can have 16 floats (or 32-bit ints) in a 512 bit register, but you can have 64 8-bit integers with the same register size if it supports 8-bit operations.

Quote:

Trying to instead process 16 x 32-bit and 32 x 16-bit and 64 x 8-bit breaks down this concept of one loop iteration per lane. Switching the number of things you process in parallel requires lots of data swizzling instructions which rapidly defeats the benefits of processing more elements in the first place.

It is the Knights Corner/Xeon Phi that requires a lot of swizzling. It will convert a vector of 8-bit integers to floats when loading from memory, do all calculations on floats, and convert back to 8-bit integers when saving the results. This is a very inefficient way of using the vector register; it is slower, and it uses more power.

Quote:

The only difference between AVX-512 and the GPU is that now this computing power becomes part of the CPU cores. And that's the really revolutionary part. Using GPUs for generic applications has been notoriously difficult.

I agree that it is good to move the work from the GPU to the CPU. I have never used GPUs for generic applications because of the overhead and compatibility problems. But revolutionary? It is just an extension of the register size from 256 to 512 bits. That was certainly expected. Masked/predicated instructions, more registers, and many new instructions: that is a big step forward, and thank you for that, but I wouldn't call it revolutionary :)  But if you are right that people will stop using the GPU then, yes, you might call that revolutionary.

Quote:

I hate to say this but the time when programming in assembly made you smart or cool is coming to an end. It's not a wise thing to do any more and mostly only compiler writers should concern themselves with it.

I agree. The lowest level that a programmer would deal with today is intrinsic functions. But still, somebody has to make the compiler.

Quote:

I applaud the trend towards openness. I'm just trying to be realistic and I'm afraid there's still a lot of old fashioned thinking.

That's why we have to put pressure on them and tell them that their customers want openness.

Quote:

Don't get me wrong, hardware engineers should talk a lot to software engineers, but I'm not sure an open forum is the single most efficient way to exchange information on needs and limitations. It is one end of the communication spectrum, with a lot of noise, and I think the open discussions we have right here are the best interaction with Intel that individual developers can hope for.

This is not interaction, it is one-way communication. Intel engineers may listen but keep silent, for the reasons you mention. The problem is that no matter what excellent ideas we might come up with in this forum, it is inevitably too late to change anything by the time Intel have published a finished and complete plan. They probably have it all implemented in silicon by now. Noise can be dealt with by moderating the forum, but I don't think there is much noise here compared to other forums.

Don't forget that AMD have sued Intel for unfair competition. They might sue again if they can make a credible claim that Intel gets an advantage by dictating a de facto standard that AMD have to follow with a time lag of several years.

Quote:

The x86 instruction set encoding is an ugly patchwork generated by a long history of short sighted solutions and dirty patches.

I agree it's a patchwork and not a pretty one, but technology doesn't have to be pretty (on the inside) to be commercially successful. x86 has been a huge success for multiple decades because it's an extendable patchwork.

I would say it's a success despite being very difficult to extend. If they had left just a few unused opcode bytes for future extensions, the coding would be much simpler today, and for example SSSE3 instructions would be one byte shorter. The ugly patches in the instruction coding scheme are not seen even by assembly programmers unless you study the details of how the bits are coded. I had to do this because I have made a disassembler, but others are happily unaware of how complicated it is.

Quote:

x87 and MMX are still used quite a bit though by legacy software that is still in use. I don't think they're costly to implement, after all they were introduced many years ago and now they only cost a tiny fraction of the transistor budget and it keeps getting cheaper. But their opcode space could be quite valuable.

Some CPUs have an extra pipeline stage just for handling the rolling x87 register stack. An extra pipeline stage means longer branch misprediction delays, also for non-x87 code. The rolling register stack is a completely different paradigm from the rest of the design. I can imagine that this is messing up the pipeline design and calling for a lot of compromises. It might be useful to move all x87 processing to a separate on-chip "coprocessor", but the overlay of mmx registers on x87 registers makes this difficult. It is probably easier to get rid of MMX code than x87 code since most MMX code has been updated to SSE2 anyway.

andysem wrote:
Quote:

And having to call a function in a SIMD-optimized loop is not that far fetched example

I agree. Intel have an excellent function library called SVML (Small Vector Math Library) with all the common mathematical functions in vector form. This library is indeed intended to be called from vectorized code. If we could make one or a few of the mask registers callee-save, then we could use a mask register across a function call, and the leaf function itself would of course use other mask registers that are not callee-save. We are certainly missing an instruction to save a full mask register in a way that is compatible with future extensions.
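A plain-C model of the save/restore hazard (the function is purely illustrative): if the only spill instruction available moves 16 bits, as kmovw does, then saving and restoring a hypothetical future 64-bit mask silently zeroes its top 48 bits.

```c
#include <stdint.h>

/* Model of a kmovw-style round trip on a hypothetical 64-bit mask:
 * the 16-bit spill truncates, and the restore zero-extends, so the
 * upper 48 bits of the caller's mask are silently lost. */
uint64_t save_restore_16bit(uint64_t mask64)
{
    uint16_t spilled = (uint16_t)mask64;  /* 16-bit store to the stack */
    return (uint64_t)spilled;             /* 16-bit load zero-extends  */
}
```

This is exactly why a callee-save convention for mask registers needs a full-width, extension-compatible save instruction first.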

www.agner.org
User avatar: c0d1f1ed

Quote:

Clothoid wrote:

An important piece of functionality that the ISA has to support to enable this is gather operations that upconvert 8-bit or 16-bit data types into 32-bit types, and scatter operations that downconvert from 32-bit to 8-bit and 16-bit data types. The current Xeon Phi ISA and the original Larrabee ISA supported this functionality, but I am not sure whether it was forgotten in AVX-512.

AVX-512 has down-convert instructions and zero/sign-extend instructions for dealing with packed small data elements. And just like the current Xeon Phi ISA, it only has 32-bit and 64-bit gather. Smaller elements can be gathered by masking off the upper parts of 32-bit elements, if you allocate some padding. In the worst case you have to fall back to scalar code. But you should really consider rethinking your algorithms if you need lots of byte gathers.
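A scalar emulation of that workaround (illustrative names; little-endian assumed, as on x86): each "gather" loads a full 32-bit word at a byte-scaled index and then masks off everything above the low byte, which is why the table needs a few bytes of padding past the last gatherable element.

```c
#include <stdint.h>
#include <string.h>

/* Emulated byte gather via 32-bit loads: read a dword at each byte
 * index, keep only the low byte (little-endian). The table must have
 * at least 3 bytes of padding after the last gatherable byte, because
 * every 32-bit load reads past the target byte. */
void gather_bytes_via_dwords(const uint8_t *table, const int32_t *idx,
                             uint8_t *out, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t dword;
        memcpy(&dword, table + idx[i], 4);  /* unaligned 32-bit "gather" */
        out[i] = (uint8_t)(dword & 0xFF);   /* mask off the upper bytes  */
    }
}
```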

Quote:

At least when working with the latest Intel compiler, 14.0 SP1 Update 1, my impression is that upconversion and downconversion with gather/scatter are missing in AVX-512.

As far as I know that has never existed as part of gather/scatter. Anyway, note that there's always the possibility of adding a few more instructions. There isn't an architectural or encoding limitation in AVX-512 that stands in the way of it, as far as I can tell. It's just a matter of requiring really good use cases to justify it. Of course it would be nice to make the ISA as complete as possible from the get-go, but as we've seen in the past, adding too many instructions too early can be a waste if they have no real value.

User avatar: c0d1f1ed

Quote:

andysem wrote:

I humbly disagree that 8/16 bit operations are not needed. It sounds like "640k will be enough for anyone."

I understand your concern, but please avoid that somewhat straw-man like argument. It is abused or misinterpreted too often.

I didn't say 8/16-bit operations aren't needed at all. I was merely saying that the mask registers don't have to apply to individual packed bytes. You can have vector-within-vector instructions in which the lanes are 32-bit or larger. And for certain packed 8/16-bit operations you can also still use AVX2.

What I also tried to say is that 512-bit instructions that operate on packed 8/16-bit elements might not be possible without sacrificing latency and/or power, and even more importantly may stand in the way of future widening. As proven by GPU computing, most workloads are 32/64-bit so it's more valuable to be able to later extend to 1024-bit than to make sure the full width of your ALUs is used for any data. Unless you want to claim 512-bit is enough for everyone doing 32/64-bit processing?

Perhaps it really is, because GPUs have stopped widening their vectors because otherwise branch divergence would lead to too low efficiency...

Quote:

Bandwidth has potential for increase (DDR4, Crystal Well) while CPU clocks and CPI seem to be close to their limits. Even with current technologies, I find it hard to believe that multimedia applications won't benefit from wider vectors, given that 8/16-bit operations are available. After all, if a routine is slower than memcpy then chances are high that it's computation-bound, and this is usually the case in my (multimedia-related) practice. It is often possible to design your data processing workflow to conserve bandwidth and favor more intermediate calculations. It may not be easy, but it makes it possible to increase performance when wider vectors are available. And having wide vectors without being able to operate on the smaller units just feels like a waste to me.

While DDR4 and L4 cache are nice ways to increase bandwidth, they're only relatively small effective improvements, and they don't come for free. These technologies will have to deliver all the bandwidth we need for many years to come. Meanwhile, computational power increases at a faster pace, as it always has. Core count is virtually unlimited, while a DDR5 or an L5 cache would have even more sharply diminishing returns. GPUs already suffer from this quite badly. The historic rule of 1 byte of RAM bandwidth per FLOP is no longer achieved, and there will be plenty of room for more compute cores on new process nodes, which will make it worse.

AVX-512 and its successors will eliminate the GPU but inherit its issues. In a sense it's of course great that many workloads won't be arithmetic limited any more. So my point is that you shouldn't be too concerned about these 8/16-bit element workloads, especially with vector-within-vector instructions for the most common cases.

Quote:

I'm not sure clock gating actually provides any benefit in this case. You either use masks plus xmm registers, or only xmm registers, or neither (in the case of generic x86 code). Clock gating only saves power in the second case relative to the first. But if you use xmm registers as masks, you effectively make the first case equivalent to the second, and thus more power-efficient.

It's really the opposite. The k mask registers allow clock gating per scalar element in the vector. Masking turns off calculations for branches of code that those elements are logically not executing, while SIMD forces them to stay in lock-step with the elements (a.k.a. strands/lanes) that do take those code paths.

Not having the mask registers makes it impractical to implement this clock gating. Care to elaborate on why you think not having them would save more power through clock gating?

Quote:

Not sure how much complication it brings but it requires more from the register allocator and dependency tracker in the compiler. And the compiler may have to spill/restore mask registers in contexts other than function calls. For example, the algorithm may require more masks than mask registers available in the CPU.

It doesn't require "more" from the register allocator at all. It is completely orthogonal and can be used opportunistically. It would have made things harder if the mask registers had to be shared in some way, but you get them in addition to all the rest.

Yes, some code could use more mask registers than available, but how is this worse than when having zero mask registers?

Quote:

And having to call a function in a SIMD-optimized loop is not that far fetched example, actually. For instance, I once implemented some string formatting routines optimized for SSE/AVX, which had to call a function to store the partial results in the STL stream. Naturally, I had to vzeroall before the call in case of AVX, and in case of SSE the compiler generated xmm register spills/restores around it. You may argue that this is not a good algorithm for SIMD-optimization but I think otherwise. As long as it's faster than the scalar version, it is a good candidate.

Something being just a little faster than the scalar version does not justify breaking AVX-512's programming model. It aims to parallelize 32-bit code 16-fold. It seems that you're still stuck thinking in the old CPU vector instruction paradigms. The only way to understand AVX-512, and realize what you shouldn't try to force it to be, is to consider it the unification of GPU technology into the CPU cores.

Posted by andysem

Quote:

c0d1f1ed wrote:

I understand your concern, but please avoid that somewhat straw-man-like argument. It is abused or misinterpreted too often.

I'm sorry, but I don't think I'm making a straw-man argument. I really got the impression you're making the point that 8/16-bit operations are not needed (in AVX-512). Perhaps these words made me think so:

Quote:

c0d1f1ed wrote:

So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields.

In any case, no offense was meant on my side.

Quote:

c0d1f1ed wrote:

What I also tried to say is that 512-bit instructions that operate on packed 8/16-bit elements might not be possible without sacrificing latency and/or power, and even more importantly may stand in the way of future widening. As proven by GPU computing, most workloads are 32/64-bit so it's more valuable to be able to later extend to 1024-bit than to make sure the full width of your ALUs is used for any data.

I don't have the data to judge which workloads (on smaller or wider units) are more widespread, but I don't think that GPU computing makes a strong case here. GPUs are known to have a more narrow scope of applications than CPU instruction extensions. There are multiple reasons for that and I suspect that architectural limitations are not the last of them.

I won't argue that support for 8/16-bit elements is free. I'm just saying that this support is needed for a certain (and quite significant, IMHO) number of applications. You also said that ALUs are cheap, so it seems there shouldn't be much of a problem putting them into silicon to support smaller elements for the benefit of those applications.

However, I could understand if support for smaller elements was too big a step to make for the first implementation of AVX-512, so that the support is added later, in AVX-512-2 or whatever its name will be. But in order for this to be possible, mask registers should also have a clear path of extension.

Saying that AVX2 is enough for 8/16-bit operations and AVX-512 and later instruction sets are for larger units only is what I disagree with and was comparing to the "640k will be enough for anyone" saying.

Quote:

c0d1f1ed wrote:

While DDR4 and L4 cache are nice ways to increase bandwidth, they're only relatively small effective improvements and they don't come for free. These technologies will have to deliver all the bandwidth we need for many years to come. Meanwhile computational power increases at a faster pace, as it always has. Core count is virtually unlimited, while DDR5 and an L5 cache would have even lower diminishing returns. GPUs already suffer from this quite badly. The historic rule of 1 byte of RAM bandwidth per FLOP is no longer achieved, and there will be plenty more room for more compute cores on new process nodes, which makes it worse.

Yes, increasing the number of cores is a way of increasing computational power, but it is also known that not all workloads can benefit from this kind of parallelism. I think this is the main reason we're mostly using 2-4 core CPUs now (sans supercomputers). I consider single thread performance just as important as the number of hardware threads.

Quote:

c0d1f1ed wrote:

The k mask registers allow clock gating per scalar element in the vector. It's masking out calculations for branches of code those elements are logically not executing, but SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths.

Not having the mask registers makes it impractical to implement this clock gating.

I don't have the background in designing CPUs, so my point of view is probably quite naive. I don't understand what you meant by "SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths." Why is it not practical to perform clock gating based on bits from xmm registers instead of k registers?

Quote:

c0d1f1ed wrote:

Care to elaborate why you think not having them would save more power through clock gating instead?

By removing k registers you also remove all power consumption associated with them, don't you?

Quote:

c0d1f1ed wrote:

Yes, some code could use more mask registers than available, but how is this worse than when having zero mask registers?

My point was that there is a need for operations to save and restore the registers. If xmm registers are used as masks, then such operations already exist and the point is moot.

Quote:

c0d1f1ed wrote:

Something being just a little faster than the scalar version does not justify breaking AVX-512's programming model. It aims to parallelize 32-bit code 16-fold. It seems that you're still stuck thinking in the old CPU vector instruction paradigms. The only way to understand AVX-512, and realize what you shouldn't try to force it to be, is to consider it the unification of GPU technology into the CPU cores.

"A little faster" was something like an order of magnitude in terms of data throughput, IIRC, so it was quite significant. Maybe I'm thinking in old ways, but I apply the suggested architectural improvements to the cases I have at hand and see that they don't apply that well.

Posted by iliyapolak

Quote:

The k mask registers allow clock gating per scalar element in the vector. It's masking out calculations for branches of code those elements are logically not executing, but SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths.

Not having the mask registers makes it impractical to implement this clock gating.

As far as I understand, adding mask registers at the hardware level (a separate register file), instead of using xmm/ymm bits, could force a redesign of the control signals (whether vertically or horizontally encoded) to include the additional mask register bits, and could add wire latency (delay) when accessing those bits, thus consuming more energy within the minuscule time interval of one clock.

Posted by bronxzv

Quote:

andysem wrote:

By removing k registers you also remove all power consumption associated with them, don't you?

The k logical registers will most probably map to the same physical register file as the GPRs (*1), so "removing" the k registers would not save power: the register files would actually be kept unchanged. On the other hand, doing the logical operations on 512-bit registers (as defined in the OP's proposal, if I got it right) instead of on 8-bit/16-bit masks will arguably waste more power. As you probably know, a classical use case with masks is to compute a series of masks with compare instructions, then AND or OR them together, and only use the final mask for actual masking. It's quite common to have more instructions computing the masks than instructions using them, so having the k logical registers at the smallest useful width is arguably a sensible choice to save power, by moving around and computing up to 64x fewer bits (with packed doubles) than with full-width zmm registers.

*1: in the initial AVX-512 products; but maybe they will map to the physical vector register file (as the x87/xmm/ymm/zmm logical registers map to a common physical register file) in the future, allowing for example 128-bit masks for AVX-1024 with 8-bit elements. It would just be a matter of defining the new maximum width for k registers as 128 bits and providing the necessary spill/fill instructions.

 

Posted by c0d1f1ed

Quote:

Agner wrote:

c0d1f1ed wrote:
Quote:

But you shouldn't forget where the design ideas originated from, as they determine the intended manner in which they should be used (but not limit what they should be used for)!

This is the kind of thinking I am warning about. If you have one particular type of application in mind while designing something you risk creating a specialized system that is not suited for other purposes.

You talked about x86 being a patchwork due to mistakes made in the past. I'm afraid that in many cases trying to design something with multiple computing paradigms in mind is exactly what led to such mistakes. Only a few paradigms survive the test of time and lead to an elegant, scalable architecture. I say paradigm instead of "application" because my interpretation of AVX-512 as the unification of GPU technology into the CPU is absolutely not limited to one type of application.

GPU-style parallel computing is a powerful 'new' computing paradigm being introduced into the CPU. But trying to run more than 16 maskable strands in a 512-bit SIMD unit is in my opinion a mistake which leads to bad scalability. So with all due respect I find it quite ironic that you're being so hard on Intel for creating a patchwork with mistakes that lead to overhead, while you are potentially suggesting something that would be regarded as a problem in the future. GPUs have 32-bit lanes for very good reasons, even though some of the data is 16- or 8-bit, so why would you ignore that altogether?

I'm not claiming I have all the right answers here. In fact my point is that nobody has them. Only hindsight is 20/20, so it's easy to identify past mistakes, but right now we're dealing with the very same kind of unknowns that the Intel engineers were facing when they made the past mistakes that make x86 less than optimal. So I'm just humbly asking to leave that bias behind so we can focus our discussion on the technical issues and hopefully help avoid mistakes.

Quote:

Quote:

If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

Why go for a ton when you can have 4 tons? You can have 16 floats (or 32-bit ints) in a 512-bit register, but you can have 64 8-bit integers in the same register size if it supports 8-bit operations.

Because being 'greedy' like that seldom produces the best results in the long run. Yes, you can have 4x higher performance for 8-bit, in this time frame. But 20 years from now we'll have CPUs with more cores and wider vectors than memory bandwidth and power consumption allow to be active all at the same time. So computing resources are not the issue; power and bandwidth are. Of course 8-bit vector instructions can save power for 8-bit vectorizable workloads, but you have to look at the bigger picture. What's the impact on scalar and 32-bit vector workloads? And why not take the best of both worlds with vector-within-vector instructions? What kind of use cases do you have that are worth risking the scalability that GPUs have?

Quote:

Quote:

Trying to instead process 16 x 32-bit and 32 x 16-bit and 64 x 8-bit breaks down this concept of one loop iteration per lane. Switching the number of things you process in parallel requires lots of data swizzling instructions which rapidly defeats the benefits of processing more elements in the first place.

It is the Knights Corner/Xeon Phi that requires a lot of swizzling. It will convert a vector of 8-bit integers to floats when loading from memory, do all calculations on floats, and convert back to 8-bit integers when saving the results. This is a very inefficient way of using the vector register, it is slower, and it uses more power.

As the master of analyzing x86 architectures, could you try and measure the exact power consumption difference between using 8-bit and 32-bit vector instructions (same number of elements, i.e. MMX vs. AVX2)?

I don't think the difference will be all that big. And again, look at the bigger picture of what the impact would be to have 16-bit and 8-bit variants of all 512-bit vector instructions, as well as masking for these smaller elements. How much perf/Watt for 32-bit workloads are you willing to sacrifice for that? And I don't mean to sound like a broken record, but why can't vector-within-vector instructions be a near-perfect compromise?

Quote:

Quote:

The only difference between AVX-512 and the GPU is that now this computing power will becomes part of the CPU cores. And that's the really revolutionary part. Using GPUs for generic applications has been notoriously difficult.

I agree that it is good to move the work from the GPU to the CPU. I have never used GPUs for generic applications because of the overhead and compatibility problems. But revolutionary? It is just an extension of register size from 256 to 512 bits. That was certainly expected. Masked/predicated instructions, more registers, and many new instructions. That is a big step forward, and thank you for that, but I wouldn't call it revolutionary :)  But if you are right that people will stop using the GPU then, yes, you might call that revolutionary.

I think you're vastly underestimating the possibilities. Almost any loop with independent iterations can be vectorized. That's most loops in compute-intensive applications, and the hotspots are always in the loops. None of today's mainstream compilers vectorize loops this way, so it's going to be a huge paradigm shift when suddenly 16 iterations are executed per instruction. AMD is putting a crazy amount of effort (relative to their resources) into trying to create a heterogeneous architecture in which the GPU can be reasonably used by generic applications. They do this because the gains are so huge. Unfortunately for them heterogeneous computing is inherently developer-unfriendly.

Mark my words, this paradigm shift will revolutionize the meaning of computing. It has revolutionized graphics computing within GPUs, but we have yet to see the full extent to which it can and will impact generic computing on the CPU. There will be new applications that are simply not commercially feasible today. You're right that it's 'only' a doubling from 256-bit to 512-bit, but how many applications do you know which use 256-bit SIMD vectorization of loops? AVX2 requires intrinsics, which takes a lot of hours and skilled engineers (read: expensive). AVX-512 makes it feasible to let the compiler do it for you. We'll basically go from scalar loops to 16x parallelized loops. I don't know any word other than revolutionary to describe that.

Quote:

Quote:

I agree it's a patchwork and not a pretty one, but technology doesn't have to be pretty (on the inside) to be commercially successful. x86 has been a huge success for multiple decades because it's an extendable patchwork.

I would say it's a success despite being very difficult to extend. If they had left just a few unused code bytes for future extensions, the coding would be much simpler today, and for example SSSE3 instructions would be one byte shorter. The ugly patches in the instruction coding scheme are not seen even by assembly programmers unless you study the details of how bits are coded. I had to do this because I have made a disassembler, but others are happily unaware how complicated it is.

You need a bit of economic perspective here. That's very little effort in the big picture. I've written an assembler as well, and I've fixed encoding bugs in LLVM. Every new extension cost me another week of work, tops. But that saves millions of other developers who use this software from doing that same work. A 'cleaner' ISA makes no significant difference overall; it's never a completely trivial task anyway. And heck, LLVM's assembly encoding is in just a couple of files, while there are thousands of files for everything else within the project.

So ISA complexity or ugliness is pretty meaningless. It's merely a challenge for a handful of well-paid engineers who have great job security. Nobody cares. It's what the ISA achieves that really matters to the rest of the world. So I stand by my point: x86 is successful because it's extendable, no matter how challenging that is.

Quote:

Quote:

x87 and MMX are still used quite a bit though by legacy software that is still in use. I don't think they're costly to implement, after all they were introduced many years ago and now they only cost a tiny fraction of the transistor budget and it keeps getting cheaper. But their opcode space could be quite valuable.

Some CPUs have an extra pipeline stage just for handling the rolling x87 register stack. An extra pipeline stage means longer branch misprediction delays, also for non-x87 code. The rolling register stack is a completely different paradigm from the rest of the design.

Exactly. This is an example of trying to support another paradigm, solely for the ease of implementing a legacy expression evaluator, which ended up not being worth it in the longer run by not looking at the longer term picture. You'd be making the same kind of mistake when adding 8-bit maskable lanes to AVX-512 for short-term reasons. It doesn't mesh well with the more valuable and scalable paradigm of executing multiple scalar iterations of a loop in parallel. If you want 8-bit elements, make them part of vector-within-vector instructions and leave the 32-bit lanes alone.

Posted by bronxzv

Quote:

c0d1f1ed wrote:

AVX2 requires intrinsics, which takes a lot of hours and skilled engineers (read: expensive). AVX-512 makes it feasible to let the compiler do it for you.

I'll be glad to learn which feature(s) of AVX-512 missing from AVX2 make it feasible to use the auto-vectorizer with AVX-512 but not AVX2.

Posted by c0d1f1ed

Quote:

andysem wrote:

I'm sorry, but I don't think I'm making a straw-man argument. I really got the impression you're making the point that 8/16-bit operations are not needed (in AVX-512). Perhaps these words made me think so:

Quote:

c0d1f1ed wrote:

So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields.

In any case, no offense was meant on my side.

And none was taken. I just think it's "cheap" to use the 640k argument without detailed argumentation. I mean, anything someone thinks would be a waste and would hamper future extensions could be attacked using the 640k argument. But it's a bad argument. GPUs have scaled performance at an incredible rate, despite processing 8-bit values in 32-bit lanes. So that goes squarely against the 640k argument.

Look, with scalar code nobody expects 8-bit arithmetic to be faster than 32-bit. Why would it have to be faster with SIMD parallelism? Except for vector-within-vector, it makes the hardware considerably more complex to have many lane widths.

Quote:

I don't have the data to judge which workloads (on smaller or wider units) are more widespread, but I don't think that GPU computing makes a strong case here. GPUs are known to have a more narrow scope of applications than CPU instruction extensions. There are multiple reasons for that and I suspect that architectural limitations are not the last of them.

GPUs have a narrow scope of applications because they have to be programmed heterogeneously, and because they have low single-threaded performance. AVX-512 has neither of those issues. 8-bit and 16-bit vector instructions that are not vector-within-vector are not going to make the scope any wider. If you think otherwise, please sum up some applications that would benefit from them. And please don't use the 640k anti-argument.

Quote:

I won't argue that support for 8/16-bit elements is free. I'm just saying that this support is needed for a certain (and quite significant, IMHO) amount of applications. You also said that ALUs are cheap, so it seems there shouldn't be much of a problem putting them into silicon to support smaller elements for the benefit of those applications.

The problem isn't the ALUs. The problem is the masks. I am suggesting vector-within-vector instructions, which adds a tiny amount of ALU complexity, but keeps the masking simple.

Quote:

However, I could understand if support for smaller elements was a too big step to make for the first implementation of AVX-512, so that the support is added later, in AVX-512-2 or whatever its name is. But in order for this to be possible, mask registers should also have a clear path of extension.

I asked Agner this same question: why would you want a different number of lanes for 32-bit, 16-bit and 8-bit elements? It breaks the paradigm of one loop iteration per lane, and it takes many shuffle instructions to switch between them. What's wrong with just keeping 8-bit data in 32-bit lanes, or using vector-within-vector instructions?

Quote:

I don't have the background in designing CPUs, so my point of view is probably quite naive. I don't understand what you meant by "SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths." Why is it not practical to perform clock gating based on bits from xmm registers instead of k registers?

If you vectorize a loop which contains conditional statements, you need to execute all the paths that any of the elements of the vector are taking, and then blend the results together. But you're computing certain elements of the vectors that you're throwing away. The worst part of that is the wasted power.

The mask registers not only do the blending in the same instruction, they also allow clock-gating the lanes whose results are thrown away anyway. This isn't possible with xmm registers because you can't read that many operands from the same register file per cycle at acceptable power consumption. You'd also need forwarding paths from the low 128 bits to the entire 512-bit width (or more), from every output to every input, splitting the bits up to each 8/16/32/64-bit element. That's not desirable either. With dedicated mask registers for 32/64-bit elements only, it gets much simpler.

Quote:

Quote:

c0d1f1ed wrote:

Yes, some code could use more mask registers than available, but how is this worse than when having zero mask registers?

My point was that there is a need for operations to save and restore the registers. If xmm registers are used as masks, then such operations already exist and the point is moot.

So your concern is the need for extra instructions when using dedicated mask registers? That's really not an issue compared to the major difficulties if vector registers were to be used as compact masks.

Quote:

Quote:

c0d1f1ed wrote:

Something being just a little faster than the scalar version does not justify breaking AVX-512's programming model. It aims to parallelize 32-bit code 16-fold. It seems that you're still stuck thinking in the old CPU vector instruction paradigms. The only way to understand AVX-512, and realize what you shouldn't try to force it to be, is to consider it the unification of GPU technology into the CPU cores.

"A little faster" was something like an order of magnitude in terms of data throughput, IIRC, so it was quite significant. Maybe I'm thinking in old ways, but I apply the suggested architectural improvements to the cases I have at hand and see that they don't apply that well.

I'm sure that wasn't an order of magnitude compared to using half the vector width. With AVX-512 we have the opportunity for 16x parallelization of a lot more code, with the potential for 32x in the future. That's way more valuable than doubling the performance of the AVX2 code you already have and get to keep. I mean, it's a simple choice: do you want 2x more but only for 8-bit data processing, or do you want 16x for a ton of applications? And again, vector-within-vector instructions can help you get that 2x or even 4x on top of that for 16-bit and 8-bit data respectively. The AVX-512 foundation seems very suitable to be extended that way, without needing any changes to how the mask registers work.

Posted by c0d1f1ed

Quote:

bronxzv wrote:

Quote:

c0d1f1ed wrote:

AVX2 requires intrinsics, which takes a lot of hours and skilled engineers (read: expensive). AVX-512 makes it feasible to let the compiler do it for you.

I'll be glad to learn which feature(s) of AVX-512 missing from AVX2 make it feasible to use the auto-vectorizer with AVX-512 but not AVX2.

AVX2 is a great leap forward (vector-vector shift was long missing), but compilers still have trouble using it for auto-vectorization because not every scalar operation within a loop translates to a single, equally fast vector instruction. In particular, AVX2 lacks a fast gather implementation, and loops with branches are unlikely to be vectorized due to having to execute multiple paths that burn full power and requiring blend instructions. Dedicated mask registers could make a marked difference. Think about the case where a vectorized loop has only one element to process: AVX-512 code wouldn't be so bad even for this worst case.

It also doesn't hurt to have the potential for 16x parallelization of 32-bit code instead of 'only' 8x. On top of the other features it should make compilers a lot more confident that vectorizing a candidate loop will result in faster execution.

Posted by bronxzv

Quote:

c0d1f1ed wrote: AVX2 is a great leap forward (vector-vector shift was long missing), but compilers still have trouble using it for auto-vectorization

The auto-vectorizer in the Intel compiler does a good job already with AVX2 IMHO, so your comment about programmers being required to use intrinsics is bogus. If you were right, it would mean we would have to wait for AVX-512 enabled CPUs before using the auto-vectorizer or the CILK+ array notation; this is wrong and potentially very misleading for newbies reading your post(s).

Btw, the latest Intel compiler features both AVX2 and AVX-512 targets, so if you are right it should be easy to find plenty of examples that vectorize well for AVX-512 but not for AVX2. I'm afraid you'll not find many examples besides the ones where scatter instructions are needed; on the other hand, all the cases with 8-bit or 16-bit elements are typically vectorized with AVX2. In fact, the best is to use a blend of AVX/AVX2/AVX-512 in the general case, so it doesn't make much sense IMHO to say that one ISA is easier to auto-vectorize for than the other.

Quote:

c0d1f1ed wrote: AVX2 lacks a fast gather implementation,

As you know, this is implementation dependent, not ISA dependent. By the time CPUs with AVX-512 ship, AVX2 gather will be faster than it is in HSW, most probably at the same speed as AVX-512, since they will use common hardware for gather.

Anyway, here again it is easy to test with current compilers, and as a matter of fact today's Intel compiler (both the auto-vectorizer and the CILK+ array notation) generates vectorized code using AVX2 gather instructions.

Quote:

c0d1f1ed wrote: and loops with branches are unlikely to be vectorized

They are typically vectorized using VBLENDVPS for branch elimination; btw, a programmer would not be able to do much better using intrinsics. The fact that AVX-512 will allow a more power-efficient implementation will be a welcome enhancement, but it is not relevant to the fact that current auto-vectorizers can already do the job for today's targets without the programmer having to use intrinsics. Already today this leads to much faster and more energy-efficient (J / work unit) code when compared to an equivalent scalar code.

Posted by andysem

Quote:

bronxzv wrote:

Quote:

andysem wrote:

By removing k registers you also remove all power consumption associated with them, don't you?

The k logical registers will most probably map to the same physical register file as the GPRs (*1), so "removing" the k registers would not save power: the register files would actually be kept unchanged. On the other hand, doing the logical operations on 512-bit registers (as defined in the OP's proposal, if I got it right) instead of on 8-bit/16-bit masks will arguably waste more power. As you probably know, a classical use case with masks is to compute a series of masks with compare instructions, then AND or OR them together, and only use the final mask for actual masking. It's quite common to have more instructions computing the masks than instructions using them, so having the k logical registers at the smallest useful width is arguably a sensible choice to save power, by moving around and computing up to 64x fewer bits (with packed doubles) than with full-width zmm registers.

*1: in the initial AVX-512 products; but maybe they will map to the physical vector register file (as the x87/xmm/ymm/zmm logical registers map to a common physical register file) in the future, allowing for example 128-bit masks for AVX-1024 with 8-bit elements. It would just be a matter of defining the new maximum width for k registers as 128 bits and providing the necessary spill/fill instructions.

You may be right in that using xmm registers for masks could consume more power, although reusing the GPR file would probably have a negative effect on register renaming. If the GPR file is enlarged to accommodate the new registers, then its power use is increased as well. And there is another thing to consider. In vectorized code the majority of instructions don't even touch general purpose registers (except for load and store operations and the parts that are not vectorized). I suppose this allows the GPR file to consume less power than it would if the file were used for k registers. I don't know how big the difference is (or if there is any); some educated research is needed to weigh all the pros and cons.

Posted by andysem

Quote:

c0d1f1ed wrote:

Look, with scalar code nobody expects 8-bit arithmetic to be faster than 32-bit. Why would it have to be faster with SIMD parallelism?

Because scalar code by definition processes data units sequentially and vectorised code does it in parallel. The size of the vector (in units) defines the performance gain from vectorization. With SSE2 I'm able to process 16 bytes or 8 words at once, and AVX2 doubles that amount, so there is an obvious gain. Surely, you may not reach a 2x speedup in a real application, but at least the potential is there and a real speedup is possible. With AVX-512 I'm able to process 16 bytes or words, which is worse than AVX2 in the case of bytes and the same in the case of words. Additionally, I have to perform conversions of 8/16-bit elements to 32-bit and back, which also takes time. This is hardly an improvement; more likely the opposite.

You suggest using the vector-within-vector approach, and if I understand you correctly and am reading the AVX-512 instructions right, that's something like what people used before MMX, with general purpose registers. These tricks can be useful as long as the algorithm is simple enough and you know the "pseudo-elements" within the "sub-vector" won't interfere with each other as the calculation goes. As soon as this doesn't hold, the approach becomes infeasible and you're back to scalar (in the case of AVX-512, 32-bit unit) code. The granularity of mask registers only complicates this because you're not able to select units at the byte or word level. Again, you have to resort to bit masks and logical operations, which in turn is tricky because of the 32-bit-only units. So no, I don't consider the vector-within-vector approach a suitable solution for the limitation AVX-512 imposes.

Please, correct me if I misunderstood your vector-within-vector suggestion and you meant something different.

Quote:

c0d1f1ed wrote:

GPUs have a narrow scope of applications because they have to be programmed heterogeneously, and because they have low single-threaded performance. AVX-512 has neither of those issues. 8-bit and 16-bit vector instructions that are not vector-within-vector, are not going to help make the scope even wider. If you think otherwise please plus sum up some applications that would benefit from them.

I'm currently interested in realtime multimedia processing, which includes video and audio processing (scaling, colorspace conversion, blending, mixing, etc.) and compression. I wrote quite a few media processing algorithms for my employer, and for the most part these algorithms involve 8 and 16-bit operations on data. I generally try to avoid FP calculations because they're slower than integer/fixed point, and I haven't found much use for 32-bit integer operations in my area. So my primary interest is 8 and 16-bit integer operations.

GPUs are probably better tailored for my tasks, but their use is not beneficial in my case for various reasons, technical and not. One of the reasons is too much overhead because of the need to transfer data between CPU and GPU memory. So I'm very much interested in increasing data processing performance on the CPU.

Quote:

c0d1f1ed wrote:

The problem isn't the ALUs. The problem is the masks. I am suggesting vector-within-vector instructions, which adds a tiny amount of ALU complexity, but keeps the masking simple.

Sorry, but why is applying a 16-bit mask to 32-bit lanes less complex than applying a 64-bit mask to 8-bit lanes?

Quote:

c0d1f1ed wrote:

I asked Agner this same question: why would you want a different number of lanes for 32-bit, 16-bit and 8-bit elements? It breaks the paradigm of one loop iteration per lane, and it takes many shuffle instructions to switch between them. What's wrong with just keeping 8-bit data in 32-bit lanes, or using vector-within-vector instructions?

I think I answered this above. I'll just add that it doesn't require any excessive amount of shuffle instructions, probably because the input and output data are 8/16-bit in most of my cases.

Quote:

c0d1f1ed wrote:

If you vectorize a loop which contains conditional statements, you need to execute all the paths that any of the elements of the vector are taking, and then blend the results together. But you're computing certain elements of the vectors that you're throwing away. The worst part of that is the wasted power.

The mask registers not only do the blending in the same instruction, they also allow clock-gating of the lanes whose results are thrown away anyway. This isn't possible with xmm registers because you can't read that many operands from the same register file per cycle at acceptable power consumption. You'd also need forwarding paths from the low 128 bits to the entire 512-bit width (or more), from every output to every input, splitting the bits up to each 8/16/32/64-bit element. That's not desirable either. With dedicated mask registers for 32/64-bit elements only, it gets much simpler.

I was thinking that the sign bit of each lane would be used as a mask bit, so there would be no need to forward bits between different lanes. But if it's not possible or reasonable to implement the xmm/ymm/zmm register file so that it can serve for masks as well, then OK, there's no choice but to have the separate mask registers. But that brings us back to the original concern: the suggested set of operations on the mask registers is incomplete, and their extension path is uncertain with respect to 8/16-bit units and larger vectors.

Quote:

c0d1f1ed wrote:

I'm sure that wasn't an order of magnitude compared to using half the vector width. With AVX-512 we have the opportunity for 16x parallelization of a lot more code, with the potential for 32x in the future. That's way more valuable than doubling the performance of the AVX2 code you already have and get to keep. I mean, it's a simple choice: do you want 2x more but only for 8-bit data processing, or do you want 16x for a ton of applications? And again, vector-within-vector instructions can help you get that 2x or even 4x on top of that for 16-bit and 8-bit data respectively. The AVX-512 foundation seems very suitable to be extended that way, without needing any changes to how the mask registers work.

It's not like AVX-512 is introducing operations on 32-bit elements. Those operations have existed since SSE2, and while they did not offer a 16x speedup like for bytes, 4x is also a big gain. And AVX-512 brings a 2x speedup for 32-bit operations compared to AVX2, just as it could for bytes and words. I'm not trying to make a choice here between 32-bit and 8-bit, and I don't see why such a choice should even exist. I want performance gains for all kinds of applications.

User avatar: bronxzv

Quote:

andysem wrote: although reusing the GPR file would probably have a negative effect on register renaming. If the GPR file is increased to accommodate the new registers

the integer PRF has 168 entries in Haswell, for example: http://www.realworldtech.com/haswell-cpu/3/

I don't see why the k registers would need anything more since, as you say, vector code typically puts low pressure on the integer register file

[EDIT] on the other hand, if the vector PRF were used for the masks in ZMM registers (as per the OP's proposal), it would probably be necessary to add more entries to this (8x wider) structure (and, a lot worse, to add ports to sustain a decent IPC with masks) to adapt to the heavy usage (think of an inner loop where most instructions use masks and there are a lot of FMA instructions); in the end there would be even more imbalance between the vector and integer PRFs when executing vector code

User avatar: iliyapolak

Hi bronxzv

Thanks for posting that link. There is a lot of valuable information.

BTW, do you have any info about uops encoding (horizontal or vertical)?

User avatar: bronxzv

Quote:

iliyapolak wrote:

Hi bronxzv

Thanks for posting that link. There is a lot of valuable information.

BTW, do you have any info about uops encoding (horizontal or vertical)?

I have no information about uops encoding and I suppose people with access to this information aren't allowed to disclose anything

User avatar: iliyapolak

>>>I don't see why the k registers would need anything more since, as you say, vector code typically puts low pressure on the integer register file>>>

And also on the floating point register file (if at the hardware level a distinction is made between registers operating on FP vector code and those operating on integer vector code). Here I mean that for short sequences of Horner-like scheme code you will use at most 3-4 (architectural) registers per single term calculation, so there should not be a need to rename registers.

>>>I have no information about uops encoding and I suppose people with access to this information aren't allowed to disclose anything>>>

It seems that everything related to uops is kept a closely guarded secret.

User avatar: c0d1f1ed

Quote:

andysem wrote:

Quote:

c0d1f1ed wrote:

Look, with scalar code nobody expects 8-bit arithmetic to be faster than 32-bit. Why would it have to be faster with SIMD parallelism?

Because scalar code by definition processes data units sequentially and vectorised code does it in parallel.

It is still SIMD parallelism if you have 8-bit or 16-bit data in 32-bit lanes. So it's faster over the entire width, but does not have to be per lane. Again, GPUs achieve tremendous performance despite this due to the high total width. So there isn't one way it has to be done "by definition". Processing tightly packed 8-bit values comes at a cost, especially if you want each of them maskable with a predicate bit or want to shuffle them over a great distance. So you may want to compromise some 8/16-bit performance to keep 32/64-bit processing scalable. But you can get the best of both worlds with vector-within-vector instructions:

Quote:

You suggest using a vector-within-vector approach, and if I understand you correctly and am reading the AVX-512 instructions right, that's something like what people used before MMX, with general purpose registers. These tricks can be useful as long as the algorithm is simple enough and you know the "pseudo-elements" within the "sub-vector" won't interfere with each other as the calculation goes. As soon as this doesn't hold, the approach becomes infeasible and you're back to scalar (in the case of AVX-512, 32-bit unit) code. The granularity of the mask registers only complicates this because you're not able to select units at byte or word level. Again, you have to resort to bit masks and logical operations, which in turn is tricky because of the 32-bit-only units. So no, I don't consider the vector-within-vector approach a suitable solution for the limitation AVX-512 imposes.

Please, correct me if I misunderstood your vector-within-vector suggestion and you meant something different.

Vector-within-vector means each SIMD lane executes a small vector operation independently from the other lanes. For instance if you have a loop where you add RGBA colors that have 8-bit components, this can be SIMD parallelized. The 'Single Instruction' part of SIMD is just a 4x8-bit vector operation in this case instead of a scalar operation. The whole instruction thus becomes a 16x4x8-bit vector-within-vector instruction in the case of AVX-512. The difference with a 64x8-bit instruction as you are requesting, is that with a vector-within-vector instruction the mask bits still predicate an entire 32-bit lane or in other words the 4x8-bit inner vectors, instead of each 8-bit element individually. This is perfectly fine, since your original loop contains 4x8-bit operations and any branching would happen at 32-bit granularity!

Quote:

Quote:

c0d1f1ed wrote:

GPUs have a narrow scope of applications because they have to be programmed heterogeneously, and because they have low single-threaded performance. AVX-512 has neither of those issues. 8-bit and 16-bit vector instructions that are not vector-within-vector are not going to help make the scope even wider. If you think otherwise, please sum up some applications that would benefit from them.

I'm currently interested in realtime multimedia processing, which includes video and audio processing (scaling, colorspace conversion, blending, mixing, etc.) and compression. I wrote quite a few media processing algorithms for my employer, and for the most part these algorithms involve 8 and 16-bit operations on data. I generally try to avoid FP calculations because they're slower than integer/fixed point, and I haven't found much use for 32-bit integer operations in my area. So my primary interest is 8 and 16-bit integer operations.

Then I think vector-within-vector instructions would be right for you, offering the full power of 512-bit without the need for 8-bit masking granularity.

Quote:

GPUs are probably better tailored for my tasks, but their use is not beneficial in my case for various reasons, technical and not. One of the reasons is too much overhead because of the need to transfer data between CPU and GPU memory. So I'm very much interested in increasing data processing performance on the CPU.

I agree. GPGPU is a minefield and even with AMD's HSA efforts there will still be too many variants which each have their own pitfalls. That's too hard for the average developer, or for that matter the compiler, to master. Unless your workload is 'embarrassingly parallel' the ROI for using the GPU doesn't add up, and with things like AVX-512 it will keep diminishing.

Quote:

Quote:

c0d1f1ed wrote:

The problem isn't the ALUs. The problem is the masks. I am suggesting vector-within-vector instructions, which adds a tiny amount of ALU complexity, but keeps the masking simple.

Sorry, but why is applying a 16-bit mask to 32-bit lanes less complex than applying a 64-bit mask to 8-bit lanes?

First of all, you shouldn't compare just two widths which differ by only 2x. This is about supporting 8/16/32/64-bit mask granularity versus just 32/64-bit. In the case of AVX-512 you'd have to route the lower 8 mask bits to 8x64-bit lanes, the lower 16 bits to 16x32-bit, the lower 32 bits to 32x16-bit, and all 64 bits to 64x8-bit. That's a total of 120 bits running 'horizontally' over a great distance (assuming the SIMD lanes run vertically), instead of just 24. And not just that: you need to route two select bits instead of one to choose which of these four granularities the next instruction uses, based on the instruction type. Then there are three vector execution ports, each with integer and float domains. So supporting four instead of two predication granularities adds various gate and wire delays which either make the design slower or consume more power.

GPUs use clever tricks to support both 32-bit and 64-bit lanes with a minimum of cross-lane communication, and I imagine AVX-512 implementations will use similar tricks. The physical layout can differ quite a bit from the logical layout (so forget everything I said about vertical and horizontal). But I'm sure that supporting 16-bit and 8-bit granularity of predication significantly complicates things.

Quote:

Quote:

c0d1f1ed wrote:

I'm sure that wasn't an order of magnitude compared to using half the vector width. With AVX-512 we have the opportunity for 16x parallelization of a lot more code, with the potential for 32x in the future. That's way more valuable than doubling the performance of the AVX2 code you already have and get to keep. I mean, it's a simple choice: do you want 2x more but only for 8-bit data processing, or do you want 16x for a ton of applications? And again, vector-within-vector instructions can help you get that 2x or even 4x on top of that for 16-bit and 8-bit data respectively. The AVX-512 foundation seems very suitable to be extended that way, without needing any changes to how the mask registers work.

It's not like AVX-512 is introducing operations on 32-bit elements. Those operations have existed since SSE2, and while they did not offer a 16x speedup like for bytes, 4x is also a big gain. And AVX-512 brings a 2x speedup for 32-bit operations compared to AVX2, just as it could for bytes and words. I'm not trying to make a choice here between 32-bit and 8-bit, and I don't see why such a choice should even exist. I want performance gains for all kinds of applications.

Yes, you get 4x32-bit with SSE2 and 8x32-bit with AVX2, but their use is more limited than AVX-512, even if it were restricted to 128-bit and 256-bit respectively. AVX-512 adds things that make it highly suitable for vectorizing generic loops. A fast gather operation is essential when you're doing any indexed addressing, predication masks keep branches efficient, broadcast allows the use of scalar constants, etc. And as I said before, it took until AVX2 to get vector-vector shift. So anything before that was unable to easily vectorize loops containing a shift operation.

So it's important to realize that we'll suddenly see a whole lot more applications benefit from SIMD. And not just 4x. We'll get up to 16x. And on top of that TSX is helping multi-core performance. So it's a few things that in isolation are only evolutionary, but combined result in a relatively sudden revolutionary change in the CPU's capabilities. Mark my words, it will be a new era in computing. You no longer have to ask whether this code is threadable or vectorizable and what device it should run on. It's all just code, and with the help of the compiler the CPU will extract any kind of parallelism that's in there.

User avatar: c0d1f1ed

Quote:

andysem wrote: And there is another thing to consider. In vectorized code the majority of instructions don't even touch general purpose registers (except for load and store operations and the parts that are not vectorized). I suppose it allows the GPR file to consume less power than it would if the file were used for the k registers.

Even just for load/store pointers and indices (e.g. the loop iterator), it amounts to a significant number of scalar register accesses for otherwise highly data-parallel code. So the scalar register file is always in use and from a performance/Watt perspective it would be more wasteful to not make use of the additional read ports. It's burning quite a lot of power anyway, so you might as well access 2-3 registers per cycle instead of ~1 for data-parallel code. It makes perfect sense to store the k registers in the scalar register file, and explains their 64-bit size.

And it's not just the register file itself. Renaming and scheduling are expensive stages too, so you want to reuse all of that.

User avatar: andysem

Quote:

c0d1f1ed wrote:

Quote:

You suggest using a vector-within-vector approach, and if I understand you correctly and am reading the AVX-512 instructions right, that's something like what people used before MMX, with general purpose registers. These tricks can be useful as long as the algorithm is simple enough and you know the "pseudo-elements" within the "sub-vector" won't interfere with each other as the calculation goes. As soon as this doesn't hold, the approach becomes infeasible and you're back to scalar (in the case of AVX-512, 32-bit unit) code. The granularity of the mask registers only complicates this because you're not able to select units at byte or word level. Again, you have to resort to bit masks and logical operations, which in turn is tricky because of the 32-bit-only units. So no, I don't consider the vector-within-vector approach a suitable solution for the limitation AVX-512 imposes.

Please, correct me if I misunderstood your vector-within-vector suggestion and you meant something different.

Vector-within-vector means each SIMD lane executes a small vector operation independently from the other lanes.

I haven't seen such instructions in AVX-512, and I haven't come across any references to future extensions that introduce them. Do you have such references?

In any case, such instructions are much less useful than the true support for 8/16-bit elements, see below.

Quote:

c0d1f1ed wrote:

For instance if you have a loop where you add RGBA colors that have 8-bit components, this can be SIMD parallelized. The 'Single Instruction' part of SIMD is just a 4x8-bit vector operation in this case instead of a scalar operation. The whole instruction thus becomes a 16x4x8-bit vector-within-vector instruction in the case of AVX-512. The difference with a 64x8-bit instruction as you are requesting, is that with a vector-within-vector instruction the mask bits still predicate an entire 32-bit lane or in other words the 4x8-bit inner vectors, instead of each 8-bit element individually. This is perfectly fine, since your original loop contains 4x8-bit operations and any branching would happen at 32-bit granularity!

RGBA is just a special case, typically encountered in image processing. In video processing you typically deal with some variation of a YUV colorspace, which is stored in planar format. The algorithm processes each plane individually, and every byte in the plane corresponds to a pixel (or several pixels) of the image. You have to apply the mask to 8-bit elements of the vector.

In audio processing, 16-bit samples are dominant nowadays, and since samples are stored sequentially, this requires masking 16-bit elements of the vector. 32-bit masking might be OK for the interleaved stereo case, but again, that is just a special case.

In string processing, you hardly ever deal with any interleaved data or with elements larger than 8 bits. UTF-16 is widespread on Windows, but that's an exception and still isn't 32-bit.

In general I'd say that any case with interleaved data streams is more an exception than a rule, so creating hardware instructions to target particularly these cases seems unwise to me. Native support for 8/16-bit elements, on the other hand, would be very welcome.

Quote:

c0d1f1ed wrote:

Quote:

Quote:

Sorry, but why is applying a 16-bit mask to 32-bit lanes less complex than applying a 64-bit mask to 8-bit lanes?

First of all, you shouldn't compare just two widths which differ by only 2x. This is about supporting 8/16/32/64-bit mask granularity versus just 32/64-bit. In the case of AVX-512 you'd have to route the lower 8 mask bits to 8x64-bit lanes, the lower 16 bits to 16x32-bit, the lower 32 bits to 32x16-bit, and all 64 bits to 64x8-bit. That's a total of 120 bits running 'horizontally' over a great distance (assuming the SIMD lanes run vertically), instead of just 24. And not just that: you need to route two select bits instead of one to choose which of these four granularities the next instruction uses, based on the instruction type. Then there are three vector execution ports, each with integer and float domains. So supporting four instead of two predication granularities adds various gate and wire delays which either make the design slower or consume more power.

GPUs use clever tricks to support both 32-bit and 64-bit lanes with a minimum of cross-lane communication, and I imagine AVX-512 implementations will use similar tricks. The physical layout can differ quite a bit from the logical layout (so forget everything I said about vertical and horizontal). But I'm sure that supporting 16-bit and 8-bit granularity of predication significantly complicates things.

If the mask registers were designed with byte granularity in mind, all that complexity would be unnecessary. Just let the k registers be 64-bit from the start; every bit would mask the corresponding byte in the vector. Larger-granularity masking would just use more than one bit per lane, or just one (e.g. the highest) bit, to mask out the operation on the element. Comparison instructions would also set bits corresponding to the operation granularity (like pcmpgt/pcmpeq do now with xmm registers). No need for wiring different bits to different lanes at all.

Vector width extension then comes naturally with k register width extension. I suppose at that point the integer PRF wouldn't be enough to store the extended k registers, unless one k register is multiplexed from two physical registers in the file.

Quote:

c0d1f1ed wrote:

Yes, you get 4x32-bit with SSE2 and 8x32-bit with AVX2, but their use is more limited than AVX-512, even if it were restricted to 128-bit and 256-bit respectively. AVX-512 adds things that make it highly suitable for vectorizing generic loops. A fast gather operation is essential when you're doing any indexed addressing, predication masks keep branches efficient, broadcast allows the use of scalar constants, etc. And as I said before, it took until AVX2 to get vector-vector shift. So anything before that was unable to easily vectorize loops containing a shift operation.

So it's important to realize that we'll suddenly see a whole lot more applications benefit from SIMD. And not just 4x. We'll get up to 16x. And on top of that TSX is helping multi-core performance. So it's a few things that in isolation are only evolutionary, but combined result in a relatively sudden revolutionary change in the CPU's capabilities. Mark my words, it will be a new era in computing. You no longer have to ask whether this code is threadable or vectorizable and what device it should run on. It's all just code, and with the help of the compiler the CPU will extract any kind of parallelism that's in there.

While I agree that many features you refer to are very welcome, I wouldn't call AVX-512 a revolutionary extension or say that everything before it was not beneficial for 32-bit vectorization. Yes, AVX-512 opens new possibilities, and surely new applications and compilers will make use of it, eventually. But I also understand that most current (and near future) performance-critical code is tailored specifically for SSE/AVX, either by the efforts of developers or of the compiler (in which I personally have little faith). The code that is not vectorized is either (a) not performance critical or (b) too complex to be vectorized. AVX-512 may help with (b), but chances are high that the code will still be too complex for it as well, because otherwise it would have been rewritten for SSE/AVX already. So I can't agree with your 16x estimate.

In any case, my point was that you shouldn't restrict yourself to 32/64-bit only, because the real-world use cases are much more diverse. GPUs win the performance race because they are much more parallel than AVX2 or AVX-512, even though they are restricted to 32/64-bit. With shorter vectors, CPUs should be better suited for the smaller data units which are actually commonly used in applications.

User avatar: Agner

As I understand c0d1f1ed, your argument is that 32-bit granularity with clock-gating masks is a new paradigm, and that 8-bit granularity is more costly in terms of routing clock gates, long-distance shuffle instructions, and gather instructions. What you call vector-within-vector instructions are just vector instructions with 8-bit or 16-bit elements, but masking with 32-bit granularity. I think it is quite cheap to extend the ALU so that it can handle carry at 8-bit granularity. Full-width shuffle and gather instructions with 8-bit granularity are not available with AVX2 or AVX512, and I doubt that they will ever be available. You have to use a two-step shuffle for that.

Whether it is too expensive to mask with 8-bit or 16-bit granularity is difficult to argue when the people who know the hardware details seem to be gagged. Even if it is expensive to do low-granularity masking today, it may be cheap tomorrow, or they may have to do it anyway because of demand from software people. My point is that the design should be extensible because we can't predict the future. The history of x86 instruction sets is full of examples of very shortsighted solutions with no concern for extensibility. They are making the same mistake again with AVX512 by designing mask registers with no plan for extension beyond 64 bits.

www.agner.org
User avatar: bronxzv

[EDITED]

Quote:

andysem wrote:In any case, such instructions are much less useful than the true support for 8/16-bit elements, see below.

indeed, and btw Intel has never said it will not introduce full support (clean SoA support, not a clumsy AoS-within-SoA mess) for 8-bit and 16-bit elements in the future; at the least, scatter/gather support for 8-bit and 16-bit elements (case in point: FP16 half-floats, useful also for FP code) is much needed even for 16-way SIMD only

predicting the future of Intel's vector ISAs is a difficult art, as shown for example in this (not so old) c0d1f1ed's post : http://software.intel.com/en-us/comment/reply/277741/1460656

Quote:

andysem wrote:

Just let k registers be 64-bit from the start,

k registers are already defined as 64-bit, and it will be easy to add the 32-bit/64-bit spill/fill instructions in due time; if the ABI requires a callee to not modify some k registers you'll have to recompile it, which doesn't look like the significant drawback the OP makes it sound like, IMHO

User avatar: andysem

Quote:

bronxzv wrote:

Quote:

andysem wrote:

Just let k registers be 64-bit from the start,

k registers are already defined as 64-bit, and it will be easy to add the 32-bit/64-bit spill/fill instructions in due time; if the ABI requires a callee to not modify some k registers you'll have to recompile it, which doesn't look like the significant drawback the OP makes it sound like, IMHO

What I meant is that all 64 bits could be used, to simplify signal routing and add support for 8/16-bit vector elements. I.e. in the case of 8-bit elements every bit of the mask is in effect; for 16-bit elements, bits 1, 3, 5 and so on; for 32-bit, bits 3, 7, 11 and so on; for 64-bit, bits 7, 15, 23 and so on.

User avatar: bronxzv

[EDITED]

Quote:

andysem wrote:

Quote:

bronxzv wrote:

Quote:

andysem wrote:

Just let k registers be 64-bit from the start,

k registers are already defined as 64-bit, and it will be easy to add the 32-bit/64-bit spill/fill instructions in due time; if the ABI requires a callee to not modify some k registers you'll have to recompile it, which doesn't look like the significant drawback the OP makes it sound like, IMHO

What I meant is that all 64 bits could be used, to simplify signal routing and add support for 8/16-bit vector elements. I.e. in the case of 8-bit elements every bit of the mask is in effect; for 16-bit elements, bits 1, 3, 5 and so on; for 32-bit, bits 3, 7, 11 and so on; for 64-bit, bits 7, 15, 23 and so on.

ah yes, I see what you mean. I can't comment on the simplified-routing aspect, but as a programmer, using only the LSBs looks far easier to grasp because it is consistent with the values returned by VMOVMSKPS/VMOVMSKPD/VPMOVMSKB etc.; otherwise we will have 3 types of masks (the legacy SSE/AVX masks with the MSB of each element used as a mask bit, the packed masks returned by VMOVMSKPS and the like, and the new masks as per your definition)

User avatar: c0d1f1ed

Quote:

andysem wrote:

c0d1f1ed wrote:

Vector-within-vector means each SIMD lane executes a small vector operation independently from the other lanes.

I haven't seen such instructions in AVX-512, and I haven't come across any references to future extensions that introduce them. Do you have such references?

I don't know of any that have been specified already. But my point is that AVX-512 can easily be extended to support them, without any changes to the predication masks. It's about thinking long term. In the shorter term, 32/64-bit operations are the most valuable use of those extra transistors so that's what we're getting first.

Quote:

In any case, such instructions are much less useful than the true support for 8/16-bit elements, see below.

c0d1f1ed wrote:

For instance if you have a loop where you add RGBA colors that have 8-bit components, this can be SIMD parallelized. The 'Single Instruction' part of SIMD is just a 4x8-bit vector operation in this case instead of a scalar operation. The whole instruction thus becomes a 16x4x8-bit vector-within-vector instruction in the case of AVX-512. The difference with a 64x8-bit instruction as you are requesting, is that with a vector-within-vector instruction the mask bits still predicate an entire 32-bit lane or in other words the 4x8-bit inner vectors, instead of each 8-bit element individually. This is perfectly fine, since your original loop contains 4x8-bit operations and any branching would happen at 32-bit granularity!

RGBA is just a special case, typically encountered in image processing. In video processing you typically deal with some variation of a YUV colorspace, which is stored in planar format. The algorithm processes each plane individually, and every byte in the plane corresponds to a pixel (or several pixels) of the image. You have to apply the mask to 8-bit elements of the vector.

In audio processing, 16-bit samples are dominant nowadays, and since samples are stored sequentially, this requires masking 16-bit elements of the vector. 32-bit masking might be OK for the interleaved stereo case, but again, that is just a special case.

In string processing, you hardly ever deal with any interleaved data or with elements larger than 8 bits. UTF-16 is widespread on Windows, but that's an exception and still isn't 32-bit.

In general I'd say that any case with interleaved data streams is more an exception than a rule, so creating hardware instructions to target particularly these cases seems unwise to me. Native support for 8/16-bit elements, on the other hand, would be very welcome.

I know about all those use cases. But you haven't given me a compelling reason why these should be maskable at an 8/16-bit granularity. So far we've been able to live without predication at all, by using blend instructions and logic operations. They work just fine. And with vector-within-vector instructions, there would be absolutely no difference from what has been available in the past.

Predication masks are orthogonal to that. The only value they add is that the blend happens as part of the same instruction, and it can get clock-gated per lane. In my opinion those features aren't very important to the use cases you've described. You need code with a significant number of branches with a fair bit of divergence before this complexity starts to pay off. Parallelizable code with 8-bit or 16-bit values generally does not fall into that category. And even if it does, you still get the choice between using blend instructions or storing the values in 32-bit lanes. Depending on the situation, one of these is perfectly acceptable. Again, we've gone without predication masks for ages, so it is not worth losing 32/64-bit scalability by demanding 8/16-bit predication granularity.

Quote:

c0d1f1ed wrote:

First of all you shouldn't compare just two widths which only differ by 2x. This is about supporting 8/16/32/64-bit mask granularity or just 32/64-bit. In the case of AVX-512 you'd have to route the lower 8 mask bits to 8x64-bit, the lower 16 bits to 16x32-bit, the lower 32 bits to 32x16-bit and all 64 bits to 64x8-bit. That's a total of 120 bits running 'horizontally' over a great distance (assuming the SIMD lanes run vertically), instead of just 24. And not just that: you need to route two signal bits instead of one to select which of these four should be used by the next instruction, based on the instruction type. Then there are three vector execution ports, each with integer and float domains. So supporting four instead of two predication granularities adds various gate and wire delays that have to be taken into account, which either make the design slower or consume more power.

GPUs use clever tricks to support both 32-bit and 64-bit lanes with a minimum of cross-lane communication, and I imagine AVX-512 implementations will use similar tricks. The physical layout can differ quite a bit from the logical layout (so forget everything I said about vertical and horizontal). But I'm sure that supporting 16-bit and 8-bit granularity of predication significantly complicates things.

If mask registers were designed with byte granularity in mind, all that complexity would be unnecessary. Just let k registers be 64-bit from the start, every bit would mask the corresponding byte in the vector. Larger-granularity masking would just use more than one bit per lane or just one (e.g. highest) bit to mask out the operation on the element. Comparison instructions would also set bits corresponding to the operation granularity (like pcmpgt/pcmpeq do now with xmm registers). No need for wiring different bits to different lanes at all.

That would be an excellent suggestion if not for the fact that AVX-512 would already use all 64 bits of the k registers. Extending to 1024-bit would get very messy. I don't think that's a smart move for a predication granularity that's not going to be of great value anyway.

Quote:

Vector width extension then comes naturally with k registers width extension. I suppose, at that point the integer PRF wouldn't be enough to store the extended k registers, unless one k register is multiplexed from two physical registers in the file.

Then you need two register file accesses per predication mask (up to a total of six per cycle, without counting any pointer and index registers). Also, you'd easily run out of physical registers. This sounds like a costly hack to me, and again I don't think it's worth the effort.

Quote:

c0d1f1ed wrote:

Yes you get 4x32-bit with SSE2 and 8x32-bit with AVX2, but their use is more limited than AVX-512, even if it were restricted to 128-bit and 256-bit respectively. AVX-512 adds things that make it highly suitable for vectorizing generic loops. A fast gather operation is essential when you're doing any indexed addressing, predication masks keep branches efficient, broadcast allows the use of scalar constants, etc. And as I said before, it took until AVX2 to get vector-vector shift. So anything before that was unable to easily vectorize loops containing a shift operation.

So it's important to realize that we'll suddenly see a whole lot more applications benefit from SIMD. And not just 4x. We'll get up to 16x. And on top of that TSX is helping multi-core performance. So it's a few things that in isolation are only evolutionary, but combined result in a relatively sudden revolutionary change in the CPU's capabilities. Mark my words, it will be a new era in computing. You no longer have to ask whether code is threadable or vectorizable and which device it should run on. It's all just code, and with the help of the compiler the CPU will extract whatever kind of parallelism is in there.

While I agree that many features you refer to are very welcome, I wouldn't call AVX-512 a revolutionary extension and say that everything before it was not beneficial for 32-bit vectorization. Yes, AVX-512 opens new possibilities, and surely new applications and compilers will make use of it, eventually. But I also understand that most current (and near-future) performance-critical code is tailored specifically for SSE/AVX, either by the efforts of developers or of the compiler (in which I personally have little faith). The code that is not vectorized is either (a) not performance critical or (b) too complex to be vectorized. AVX-512 may help (b), but chances are high that the code will still be too complex for it as well, because otherwise it would have been rewritten for SSE/AVX already. So I can't agree with your 16x estimate.

It's all about ROI. 4x isn't that compelling to most developers, considering they have to learn SSE2 intrinsics and all the quirks and limitations, leaving 2x at best in most cases. 8x peak gets more interesting but again most developers just won't leave the realm of their high-level language. 16x and the ability to have the compiler take care of it all, now that's hard to pass up on. Even if some performance is lost in the process, it's a pretty sure deal.

Note that AMD is betting the farm on HSA. The potential peak return is roughly the same as AVX-512, but the investment and risks are considerably higher. So don't underestimate what Intel is going to achieve here. AMD definitely thinks there's enough applications that would benefit from the raw processing power of the GPU's wide SIMD units. But heterogeneous is inherently more complex and has more overhead. So if HSA is supposed to be revolutionary, then I don't think AVX-512 should be regarded as anything less.

Quote:

In any case, my point was that you shouldn't restrict yourself to 32/64-bit only, because the real-world use cases are much more diverse. GPUs win the performance race because they are much more parallel than AVX2 or AVX-512, even though they are restricted to 32/64-bit. With shorter vectors, CPUs should be better suited for smaller data units, which are actually commonly used in applications.

GPUs are not more parallel than AVX-512. Kaveri, AMD's latest APU, can only do 740 GFLOPS in the GPU. A quad-core with AVX-512 would be capable of 768 GFLOPS at 3 GHz. But you could easily fit 6 or 8 cores on a die instead if you don't waste area on a GPU.

Sure, the GPU is more "parallel" in the sense that it does more things in parallel, but it does them far more slowly. So in practice the compute density is about the same. GPU manufacturers claim slower but wider is more power efficient. Future CPUs might achieve the same thing by having two clusters of SIMD units which alternately execute AVX-1024 instructions on 512-bit units over two cycles, with each cluster dedicated to one thread.

So I don't think CPUs are any more or any less suited for small data. And even though such workloads are fairly common, I'm still not convinced that they would require 8/16-bit predication granularity.

User avatar: andysem

Quote:

bronxzv wrote:

Quote:

andysem wrote: What I meant is that all 64 bits could be used to simplify signal routing and add support for 8/16-bit vector elements. I.e. in the case of 8-bit elements every bit of the mask is in effect; in the case of 16-bit elements, bits 1, 3, 5 and so on; 32-bit, bits 3, 7, 11 and so on; 64-bit, bits 7, 15, 23 and so on.

ah yes, I see what you mean. I can't comment on the simplified-routing aspect, but as a programmer, using only the LSBs looks far easier to grasp because it is consistent with the values returned by VMOVMSKPS/VMOVMSKPD/VPMOVMSKB etc.; otherwise we will have 3 types of masks (the legacy SSE/AVX masks with the MSB of each element used as mask bit, the packed masks returned by VMOVMSKPS and the like, and the new masks as per your definition)

The masks returned by movmsk* instructions are actually very similar to those described in AVX-512, in k registers (i.e. every bit in the mask is effective), while the sparse masks I described are similar to the masks in xmm/ymm registers (i.e. only MSB in the group of bits is the effective one). I don't think there will be much confusion, when the concept is understood that way. The additional benefit of the sparse masks is that they become independent of the vector granularity. You can create the mask by a 32-bit operation and then use it in 8 or 16-bit context without any changes. The programmer's interface is also simplified, since there would be no need for __mmask8, __mmask16, etc. types but just __mmask64. The drawback is that movmsk* instructions won't be useful for mask construction, but given their poor performance and availability of the new cmp* instructions I don't think this would be an issue.

User avatar: andysem

Quote:

c0d1f1ed wrote:

I know about all those use cases. But you haven't given me a compelling reason why these should be predicated at an 8/16-bit granularity. So far we've been able to live without predication at all, by using blend instructions and logic operations. They work just fine. And with vector-within-vector instructions, there would be absolutely no difference from what has been available in the past.

Predication masks are orthogonal to that. The only value they add is that the blend happens as part of the same instruction, and it can get clock-gated per lane. In my opinion those features aren't very important to the use cases you've described. You need code with a significant number of branches with a fair bit of divergence before this complexity starts to pay off. Parallelizable code with 8-bit or 16-bit values generally does not fall into that category. And even if it does, you still get the choice between using blend instructions or storing them in 32-bit lanes. Depending on the situation, one of these is perfectly acceptable. Again, we've gone without predication masks for ages, so it is not worth losing 32/64-bit scalability by demanding 8/16-bit predication granularity.

Provided that Intel adds vector-within-vector operations, that would make predication useless for a considerable range of applications. Don't you think this is a somewhat wasted investment? Predication is a general new feature, and limiting it to 32/64-bit only seems unreasonable to me.

Yes, we currently use blend and logical operations to solve branching cases, but what makes you think media and string algorithms wouldn't benefit from replacing them with predication? You described the benefits yourself. Depending on the hardware implementation, I imagine there could even be some throughput gains, if the CPU is able to execute more instructions in parallel when more elements of the operands are masked out. IMHO, predication should replace blend operations almost entirely.

Quote:

c0d1f1ed wrote:

Quote:

If mask registers were designed with byte granularity in mind, all that complexity would be unnecessary. Just let k registers be 64-bit from the start, every bit would mask the corresponding byte in the vector. Larger-granularity masking would just use more than one bit per lane or just one (e.g. highest) bit to mask out the operation on the element. Comparison instructions would also set bits corresponding to the operation granularity (like pcmpgt/pcmpeq do now with xmm registers). No need for wiring different bits to different lanes at all.

That would be an excellent suggestion if not for the fact that AVX-512 would already use all 64 bits of the k registers. Extending to 1024-bit would get very messy. I don't think that's a smart move for a predication granularity that's not going to be of great value anyway.

Quote:

Vector width extension then comes naturally with k registers width extension. I suppose, at that point the integer PRF wouldn't be enough to store the extended k registers, unless one k register is multiplexed from two physical registers in the file.

Then you need two register file accesses per predication mask (up to a total of six per cycle, without counting any pointer and index registers). Also, you'd easily run out of physical registers. This sounds like a costly hack to me, and again I don't think it's worth the effort.

Well, it's hard for me to judge how difficult such an extension would be, but it looks worthwhile to me. There are solutions to this problem besides multiplexing k registers. An x86-128 extension of IA-32, for example :-D. Seriously though, the k registers could be moved to a separate register file, which could stay dormant most of the time, so power consumption would not increase much. At some point, I think, 128-bit registers may appear anyway, since the need for higher-precision numbers is already present in scientific/math applications.

All in all, I'm not arguing that full support for 8/16-bit operations and predication comes without a cost. My opinion is that the demand for it is significant enough to justify the possible complication. Vectors won't continue to grow much after 512 bits (1024 - yes; 2048? don't know), so the amount of complication is limited and predictable.

User avatar: bronxzv

Quote:

andysem wrote:The masks returned by movmsk* instructions are actually very similar to those described in AVX-512,

yes, so we have basically 2 types of mask, thus my comment "otherwise we will have 3 types of masks"

Quote:

andysem wrote:The programmer's interface is also simplified, since there would be no need for __mmask8, __mmask16, etc. types but just __mmask64.

well I suppose it's a matter of personal taste; I strongly (pun intended) prefer strong typing as offered by __m512, __m512i, __m512d, etc.

Quote:

andysem wrote:drawback is that movmsk* instructions won't be useful for mask construction, but given their poor performance and availability of the new cmp* instructions I don't think this would be an issue.

movmsk* instructions don't have poor performance in my experience, at least not on Intel CPUs

User avatar: andysem

Quote:

bronxzv wrote:

Quote:

andysem wrote:The masks returned by movmsk* instructions are actually very similar to those described in AVX-512,

yes, so we have basically 2 types of mask, thus my comment "otherwise we will have 3 types of masks"

Well, since they have the same representation and semantics, I don't separate them. But ok.

Quote:

bronxzv wrote:

Quote:

andysem wrote:The programmer's interface is also simplified, since there would be no need for __mmask8, __mmask16, etc. types but just __mmask64.

well I suppose it's a matter of personal taste; I strongly (pun intended) prefer strong typing as offered by __m512, __m512i, __m512d, etc.

The parallel is not correct: a mask is always a mask. It's the number of bits that differs, not their semantics (i.e. integer vs. FP). You do use __m512i to store different-sized integers, after all.

Quote:

bronxzv wrote:

Quote:

andysem wrote:drawback is that movmsk* instructions won't be useful for mask construction, but given their poor performance and availability of the new cmp* instructions I don't think this would be an issue.

movmsk* instructions don't have poor performance in my experience, at least not on Intel CPUs

movmsk* are slower than just cmp* instructions, especially on older CPUs. Of course, we don't have timings for AVX-512 yet, but I expect cmp* in AVX-512 to have performance comparable to the previous extensions.

User avatar: bronxzv

Quote:

andysem wrote:movmsk* are slower than just cmp* instructions, especially on older CPUs.

they are 1 clock throughput / 2 clock latency since Conroe or even before IIRC

another advantage of the AVX-512 tight masks (vs. your sparse-mask proposal) that comes to mind is that they are directly suited to indexing lookup tables; it's doable to have a LUT with 16-bit indices, but obviously not with 64-bit indices

User avatar: iliyapolak

>>>What I meant is that all 64 bits could be used to simplify signal routing and add support for 8/16-bit vector elements. I.e. in case of 8-bit elements every bit of the mask is in effect, in case of 16-bit elements - bits 1, 3, 5 and so on, 32-bit - 3, 7, 11 and so on, 64-bit - 7, 15, 23 and so on.>>>

Signal routing will probably be implemented at the control micro-instruction (uop?) level, and with 8/16/32/64-bit mask granularity a 2-bit field would be needed to represent the particular mask granularity (here I suppose that operands are not encoded in the micro-instruction and are simply decoupled). At the hardware level, masking could probably be performed by the ALU (which would probably need a specific control-signal input), and it is interesting whether the same execution-port ALU will be responsible for performing the masking operation on AVX-512 vectors, thus possibly staying busy while vector integer code is being dispatched for execution.

User avatar: c0d1f1ed

Quote:

Agner wrote:Whether it is too expensive to mask with 8-bit or 16-bit granularity is difficult to argue when the people who know the hardware details seem to be gagged.

GPU designers aren't doing it, and AVX-512 doesn't support it either. I think that tells us something about the significant cost of supporting fine-grained predication. And I have yet to come across a valuable use case for which 32-bit lanes or legacy blend instructions and vector-within-vector operations would not be acceptable.

Quote:

Even if it is expensive to do low-granularity masking today, it may be cheap tomorrow, or they may have to do it anyway because of demand from SW people. My point is that the design should be extensible because we can't predict the future. The history of x86 instruction sets is full of examples of very shortsighted solutions with no concern for extensibility. They are making the same mistake again with AVX512 by designing mask registers with no plan for extension beyond 64 bits.

Can't you see the irony in that? You say we can't predict the future, but you also say we should keep the mask registers extensible for low-granularity masking, which potentially hampers widening the SIMD width itself. That seems to me like exactly the shortsightedness you intended to avoid in the first place. You have no cause for this other than an expectation that it might be demanded in the distant future. So the reality is that trying not to make a prediction results in making a prediction as well. Engineers face this dilemma all the time, so in the end you just have to recognize that all the past 'mistakes' stem from the same limited foresight that we are facing now. So you can't call them mistakes unless you're willing to admit that your proposal might very well be the next one.

So in my opinion we have to try to predict the future to the best of our abilities and design for that instead of designing for something we may or may not need. Worst case you're creating something that's really powerful but for a slightly limited range of applications. Looking at GPUs and how lots of people are trying to use them for generic computing, it's not going to be bad at all to use them as the guideline. Bringing that technology into CPU cores in a homogeneous manner in and of itself widens the range of applications. And GPUs have proven that having a minimal lane width of 32-bit is not a significant limitation. For the few cases where you need maximum 8-bit performance, vector-within-vector instructions offer a practical solution. So why take any unnecessary risks?

User avatar: andysem

Quote:

bronxzv wrote:

Quote:

andysem wrote:movmsk* are slower than just cmp* instructions, especially on older CPUs.

they are 1 clock throughput / 2 clock latency since Conroe or even before IIRC

As per the Intrinsics Guide, on Sandy/Ivy Bridge, pmovmskb has a latency of 2 clocks and a throughput of 1 clock; pcmpeqb and pcmpgtb - 1/0.5. On Conroe/Wolfdale the timings are [unknown] (probably 1?)/1 and 1/0.33 respectively. On NetBurst CPUs the difference is more pronounced: 7/2 vs. 2/2. I'm not sure about AMD CPUs, but I think pmovmskb is very slow on some models.
