Adders and Multipliers in a processor

Hello,
I have 2 questions:
1) On average, how many adders and arithmetic multipliers (not clock multipliers) are used in present-day processors?
2) Is it true that more complex design blocks are fewer in number than comparatively simpler blocks? For example, is the number of adders in a processor generally greater than the number of arithmetic multipliers?

Actually, these questions arose from a doubt regarding constrained resources in hardware.
When a program written in a high-level language multiplies two integer or floating-point numbers, which resource in the processor actually performs the multiplication: a multiplier doing the multiplication directly, or an adder doing successive additions?

In which case is the execution faster? This should also take into account the continuous switching between programs needed for multitasking, and the time for which a blocked program has to wait for the required resources to become free.

Generalizations such as you request seem impossible.

Before the widespread adoption of fused multiply add, equal numbers of add and multiply units were usual.  However, the latency (number of cycles for a single operation) generally is significantly less for a plain add.  For that reason, compilers frequently replace *2 by add.  As you hint, there might then be missed opportunities for keeping the multiplier busy.

Now that transistors are cheap, explicitly accessible functional units are usually provided in equal numbers.  

For floating-point operations, there have been a few designs over the years that have had more adders than multipliers.  Floating-point-intensive codes typically have more adds than multiplies (though not by a huge ratio), and adders are smaller/cheaper than multipliers.  Without an increase in instruction fetch/issue bandwidth and an increased number of register ports, this approach should be considered as an increase in generality, rather than an increase in peak performance.  I.e., with 2 adders and 1 multiplier, it may not be possible to issue all three in one cycle, but the processor could issue 2 adds in one cycle (rather than being limited to 1 add + 1 multiply per cycle).

There are almost always more adders than multipliers if you count the address generation units as adders, but don't count their shift capability as a multiplier.  In transistor-constrained designs, address generation and arithmetic may share the adder unit(s), but recent processors have separate add and/or shift+add units for address generation.   Looking at the addressing modes supported by Intel processors, you can see the capabilities required for these units.  (Section 3.7.5 of Volume 1 of the Intel Architectures Software Developer's Manual, document 253665-068, November 2018).

"Dr. Bandwidth"

Thanks for replying, McCalpin, John and Tim P.

I understand your point about having almost equal numbers of design blocks of all types due to improvements in fabrication technology.

But is there still any possibility of higher latency when executing code that is waiting for a multiplier to be free? Obviously a multiplier block on its own will be faster than using an adder block successively for multiplication. But that holds only when the multiplier itself is free to do the task. If it is engaged in some other task, then the program depending on it will have to wait for it to become free, whereas a program depending on an adder can do successive addition, assuming that the adder is free.

In other words, coders who work in programming languages like C, C++, Java, Python, etc. consider the order of complexity of their solution, like O(n), O(n^2), O(n log n), etc., thus keeping a check on the latency of the computation, but they do not keep the actual hardware in mind. HDL (Verilog, VHDL) coders, on the other hand, have to consider both the latency of their solution and the simplicity of the hardware used for it. What if the hardware that provides low latency is not available, or is busy with some other task? Simplifying the hardware increases the chance of it being available, and regularly free, in an actual processor.

The other thing I wanted to ask: what are the different categories of adders and arithmetic multipliers generally used in processors? If you can't say in a general way, can you answer the same for some particular Intel processor, or suggest a source from which I can get this information?

Last thing: I could not find the "Volume 1 of the Intel Architectures Software Developer's Manual, document 253665-068, November 2018" which you mentioned in your (McCalpin, John) reply. I looked for it on the Intel website and googled it, but it always showed the October 2016 version.

Bumping this thread, as I didn't get any response.

I don't know of any hardware that uses repeated addition to substitute for a multiplier, but there are a number of analogous tricks that are used in Intel processors. The Intel instruction set provides access to the ALUs in the address generation units using the LEA (Load Effective Address) instruction. This instruction is capable of performing integer arithmetic of the form:
result = base + scale*index + offset
where
"base" is an integer value in a GPR
"scale" is 1, 2, 4, or 8
"index" is an integer value in another GPR
"offset" is an 8-bit, 16-bit, or 32-bit constant
Sometimes the compiler will make use of this functionality for computations that have nothing to do with addresses. If the "scale" value is 2, 4, or 8, then this is much faster than the corresponding sequence of ALU operations. On many Intel processor models, different execution ports have different performance for the LEA instruction depending on the number of arguments provided.

The current versions of the Intel Software Developer Manuals are available at https://software.intel.com/en-us/articles/intel-sdm

"Dr. Bandwidth"

Thanks for replying. I got it.
