Hardware acceleration of Special Functions.

Hi!
I would like to ask Intel's employees on this forum: why have Intel CPU architects never implemented in hardware some of the more "popular" special functions, such as gamma, beta and the various Bessel functions of integer order? All these functions could have been exposed as x87 ISA instructions.

Is anyone interested in this question?

I am not affiliated with Intel but I strongly suspect that the silicon for that is much better spent elsewhere.
To be honest: How often did you need such functions?

> How often did you need such functions?

I liked your answer :) I know that the trigonometric functions (fsin, fcos) are more useful than the special functions I mentioned, but there are various applications that can benefit from a hardware implementation of such functions.
For example, Bessel functions are used in signal processing and in wave propagation.
Gamma functions are used in statistics, for example in the gamma distribution.
These functions can be approximated by a polynomial fit with pre-calculated coefficients, which is straightforward to implement with SSE when less than the 80-bit x87 precision is needed. I suppose that the CPU designers, being aware of such functions and of the possibility of approximating them accurately in software, simply decided not to implement them in hardware.
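To make the "polynomial with pre-calculated coefficients" idea concrete, here is a minimal sketch in Python. It evaluates an odd polynomial for sin(x) with Horner's scheme. The name `poly_sin` is made up for this example, and the coefficients are plain Taylor coefficients chosen for simplicity; a tuned library would more likely use minimax-fitted coefficients and vectorize the same multiply-add chain with SSE/AVX.

```python
import math

# Pre-calculated coefficients of sin(x)/x as a polynomial in x**2,
# highest degree first. Assumption: plain Taylor coefficients, used here
# only for illustration (a real library would fit minimax coefficients).
SIN_COEFFS = [
    -1.0 / 39916800.0,  # x**11 term
    1.0 / 362880.0,     # x**9 term
    -1.0 / 5040.0,      # x**7 term
    1.0 / 120.0,        # x**5 term
    -1.0 / 6.0,         # x**3 term
    1.0,                # x term
]


def poly_sin(x):
    """Approximate sin(x) for |x| <= pi/2 using Horner's scheme."""
    x2 = x * x
    acc = 0.0
    for c in SIN_COEFFS:
        # One multiply and one add per coefficient (an FMA on newer CPUs).
        acc = acc * x2 + c
    return acc * x


if __name__ == "__main__":
    for x in (0.1, 0.5, 1.0, math.pi / 2):
        print(x, poly_sin(x), math.sin(x))
```

The whole evaluation is six multiply-add steps plus two multiplications, with no table lookups or library calls, which is why this form maps so well onto SIMD registers.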

one key limitation of the legacy x87 transcendental instructions is that they are scalar; for high-performance code one will use SSEn or AVX software implementations to benefit from vectorization (i.e. higher throughput)

> one key limitation of the legacy x87 transcendental instructions is that they are scalar; for high-performance code one will use SSEn or AVX software implementations to benefit from vectorization (i.e. higher throughput)

As our tests have shown, a highly optimized SSE-based sine() function is almost as fast as x87 fsin. But the comparison was made against fsin, which probably implements range reduction in hardware.
I think that Intel could have implemented transcendental functions in microcode with the help of SSE technology. I mean creating an SSE instruction, implemented in microcode, that takes single-precision or double-precision values as input and returns the sine of these values.

> But the comparison was made against fsin, which probably implements range reduction in hardware

obviously any complete implementation will include a proper range reduction; that's the case for the x87 instructions and for high-performance vectorized implementations such as the MKL Vector Mathematical Functions Library [1]: vsCos, vsSin, vsSincos, vsAcos, etc.

[1] http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm
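For readers unfamiliar with the term, here is a rough sketch of what range reduction means for sine, written in Python. The function names are invented for this example and it is not how fsin or MKL do it. The argument is reduced to r in [-pi/4, pi/4] plus a quadrant number, and a short polynomial kernel is evaluated on r. Subtracting k*(pi/2) in plain double precision, as done here, loses accuracy for very large arguments; production code uses a more careful multi-word reduction.

```python
import math

HALF_PI = math.pi / 2.0


def _sin_kernel(r):
    """Short polynomial for sin(r), intended for |r| <= pi/4."""
    r2 = r * r
    return r * (1.0 + r2 * (-1.0 / 6 + r2 * (1.0 / 120
               + r2 * (-1.0 / 5040 + r2 / 362880.0))))


def _cos_kernel(r):
    """Short polynomial for cos(r), intended for |r| <= pi/4."""
    r2 = r * r
    return 1.0 + r2 * (-0.5 + r2 * (1.0 / 24 + r2 * (-1.0 / 720
               + r2 * (1.0 / 40320 - r2 / 3628800.0))))


def reduced_sin(x):
    """sin(x) via naive range reduction to [-pi/4, pi/4] (sketch only)."""
    k = round(x / HALF_PI)   # nearest multiple of pi/2
    r = x - k * HALF_PI      # reduced argument, |r| <= pi/4
    quadrant = k % 4         # Python's % is non-negative for negative k
    if quadrant == 0:
        return _sin_kernel(r)
    if quadrant == 1:
        return _cos_kernel(r)
    if quadrant == 2:
        return -_sin_kernel(r)
    return -_cos_kernel(r)


if __name__ == "__main__":
    for x in (-3.0, 0.3, 10.0, 100.0):
        print(x, reduced_sin(x), math.sin(x))
```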

> As our tests have shown, a highly optimized SSE-based sine() function is almost as fast as x87 fsin.
This is exactly the reason for not wasting silicon, especially when such a function is only very seldom needed.

> I think that Intel could have implemented in microcode...
Of course they could have - but why?

> obviously any complete implementation will include a proper range reduction; that's the case for the x87 instructions and for high-performance vectorized implementations such as MKL vsCos, vsSin, vsSincos, vsAcos

Yes, that's true. Did you test the MKL transcendentals?
Take for example a function like Gamma, which is not periodic and grows very fast. I think Intel could have implemented it in microcode as an SSE or AVX instruction. It could be even faster when coded as a minimax polynomial approximation (eliminating the dependency on exp and pow).

> This is exactly the reason for not wasting silicon, especially when such a function is only very seldom needed

I have to agree with you :) You are speaking from a practical point of view. If Intel were creating a custom DSP processor tailored for applications of Bessel functions, it would probably be mandatory for the engineers to implement them in hardware. But in the case of Intel CPUs, where such exotic functions can be efficiently approximated with simpler SSE/AVX instructions, they did not waste silicon on this.
@sirrida
>> Of course they could have - but why?
I have no answer to this question. As you have said, Intel engineers probably used the silicon for more important things, for example branch-prediction logic.

> Yes, that's true. Did you test the MKL transcendentals?

no, I didn't, but you can find very detailed performance data here:
http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

as stated by sirrida, I don't think it would be a good idea to spend chip area and/or validation budget (read: potential delays) on such specialized things in hardware. The forthcoming FMA in Haswell will probably provide a strong boost to all polynomial/rational approximations, and gather instructions will help table-based methods. It's the right way forward IMHO: powerful new general-purpose instructions that let us speed up a lot of special cases

also, the more functions you add, the more you open the door to someone asking for yet another one; with a hardware-based solution you are talking about a 3+ year turnaround for just one new function, while with software it's more like 3 weeks

Talking of validation, we don't need another fdiv bug...

> no, I didn't, but you can find very detailed performance data here:
> http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

Thank you very much for posting this link. You made my day :)
As I have now seen, the VML gamma function is also slow: 123 clocks per value. I suppose that they did not eliminate the dependency on library calls. Which static library implements VML? If I knew this, I would be able to disassemble the library and try to understand their implementation.

>> as stated by sirrida, I don't think it would be a good idea to spend chip area and/or validation budget (read: potential delays) on such specialized things in hardware. The forthcoming FMA in Haswell will probably provide a strong boost to all polynomial/rational approximations, and gather instructions will help table-based methods. That's the right way forward IMHO: powerful new general-purpose instructions that let us speed up a lot of special cases

Yes, very true. The ability to approximate such functions with the help of SSE/AVX instructions is the argument against a hardware implementation of special functions.

> Talking of validation, we don't need another fdiv bug...

indeed, I was actually thinking of it when mentioning validation; it would be really bad to delay or recall new CPUs due to a hard-to-catch microcode bug in an instruction used by 0.0001 % of the code base

@bronxzv
I would like to ask you: in which static library is VML implemented? I am searching this directory on my computer: C:\Program Files\Intel\ComposerXE-2011\mkl\lib\ia32
There are many .lib files.

I have actually bought this product and accepted its license (excerpt below)

"
3. LICENSE RESTRICTIONS:
[...]
B. You may NOT: [...] (v) reverse engineer, decompile, or disassemble the Materials;
"

> reverse engineer, decompile, or disassemble the Materials

Sorry, I did not know this.

@bronxzv

The MKL tgamma result for 1000 randomly chosen double values is 123 cycles per value, very close to my results. It would be interesting to know which approximation they used.
They were also able to achieve an accuracy of 0.5 ulp even on the problematic range [0.0001, 1.0]. Maybe they used the Lanczos approximation?
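For reference, this is roughly what a Lanczos approximation of Gamma looks like, as a Python sketch. It uses the widely published coefficient set for g = 7 with nine terms and the reflection formula for arguments below 0.5. Whether MKL uses anything like this is pure speculation. The function name `lanczos_gamma` is made up for this example, and this simple form is not expected to reach 0.5 ulp.

```python
import math

# Widely published Lanczos coefficients for g = 7, n = 9.
LANCZOS_G = 7.0
LANCZOS_COEFFS = [
    0.99999999999980993,
    676.5203681218851,
    -1259.1392167224028,
    771.32342877765313,
    -176.61502916214059,
    12.507343278686905,
    -0.13857109526572012,
    9.9843695780195716e-6,
    1.5056327351493116e-7,
]


def lanczos_gamma(z):
    """Gamma(z) for real z that is not a non-positive integer (sketch)."""
    if z < 0.5:
        # Reflection formula: Gamma(z) * Gamma(1 - z) = pi / sin(pi * z)
        return math.pi / (math.sin(math.pi * z) * lanczos_gamma(1.0 - z))
    z -= 1.0
    series = LANCZOS_COEFFS[0]
    for i in range(1, len(LANCZOS_COEFFS)):
        series += LANCZOS_COEFFS[i] / (z + i)
    t = z + LANCZOS_G + 0.5
    return math.sqrt(2.0 * math.pi) * t ** (z + 0.5) * math.exp(-t) * series


if __name__ == "__main__":
    for z in (0.0001, 0.5, 1.0, 5.0, 10.5):
        print(z, lanczos_gamma(z), math.gamma(z))
```

Note that this form still depends on pow and exp, which is exactly the dependency a pure minimax polynomial on a restricted interval would avoid.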

as I confessed the other day I have zero experience with Gamma functions, these look pretty much like strange animals to me

the poster with the best knowledge of MKL here is TimP AFAIK

I will post this question on the MKL forum, but I am afraid that nobody will answer a question about the implementation of an algorithm.

Quoting iliyapolak:
> I would like to ask Intel's employees on this forum. Why have Intel CPU architects never implemented in hardware some of the more "popular" Special Functions like 'GAMMA', 'BETA' and various 'BESSEL' functions of an integer order? All these functions could have been accessed by x87 ISA instructions.

I don't think that Intel will add such a set of instructions. These functions are "Special" and they are not "Fundamental".
Intel clearly made a statement: "Use SSE or AVX to achieve the best possible performance..."

Usually, big or small companies need to balance:

What do markets demand?
What do competitors do?
What do some customers want or expect?

And what is going on now? There is a growing demand for more powerful and energy-efficient CPUs to run
bigger (!) versions of different "mobile" and "desktop" OSs.

Iliya, you mentioned a couple of times that some function calculates a result in ~120 clock cycles.
Let's put it on the LEFT side of some "Magic Scale". Let's assume that some big company added hardware
support for that special function to its CPU and that it allows getting the same result in ~60 clock cycles. We put it
on the RIGHT side of our "Magic Scale". But that is not everything: a cost, something like $500,000,000 USD,
will need to be added on the RIGHT side as well. This is because the company needs to complete R&D, testing,
verification and different production-related tasks, and during that time salaries must be paid.

Personally, I would be glad to see hardware-accelerated matrix multiplication for matrices with
sizes up to 1,024x1,024.

Best regards,
Sergey

> I don't think that Intel will add such a set of instructions. These functions are "Special" and they are not "Fundamental".
> Intel clearly made a statement: "Use SSE or AVX to achieve the best possible performance..."

Sergey,
After reading your answer and the answers from the other posters, I have come to the same conclusion as you and the other people answering in this thread. Various special functions can be represented very accurately by Taylor series, which in turn can be implemented at machine-code level as a sequence of additions and multiplications with pre-calculated coefficients. This removes the burden of a microcode-level hardware implementation of special functions.
SSE/AVX instructions are perfect fundamental building blocks for such implementations.
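As a small illustration of the point about series built from additions and multiplications, here is a Python sketch of the power series of the Bessel function J0, J0(x) = sum over k of (-1)^k (x^2/4)^k / (k!)^2. The name `bessel_j0_series` is made up for this example. The series is only practical for small |x| (cancellation grows with x); real libraries switch to other approximations for larger arguments.

```python
def bessel_j0_series(x, terms=40):
    """Bessel J0(x) from its power series; a sketch suited to small |x|."""
    q = -(x * x) / 4.0
    term = 1.0    # k = 0 term of the series
    total = 1.0
    for k in range(1, terms):
        # Each new term is the previous one times -(x^2/4) / k^2, so the
        # loop needs only multiplications, one division and one addition.
        term *= q / (k * k)
        total += term
    return total


if __name__ == "__main__":
    print(bessel_j0_series(0.0))  # J0(0) is exactly 1
    print(bessel_j0_series(1.0))  # tabulated value is about 0.7651976866
```

With the factors 1/k^2 pre-calculated as constants, the loop body reduces to multiply-add steps, which is the pattern SSE/AVX handle well.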

I think that it does make sense to implement, for example, Bessel functions in hardware, but it must be some kind of chip or controller used to control FM modulation/demodulation circuits.

Hi Iliya,

Quoting iliyapolak:
> I think that it does make sense to implement, for example, Bessel functions in hardware, but it must be some kind of chip or controller used to control FM modulation/demodulation circuits.

I won't be surprised if specialized CPUs for DSP have it already. I would personally put Intel CPUs for personal
computers into the category "General Purpose CPUs".

Best regards,
Sergey

I did a quick search on the web but was not able to find any DSP that implements some of the special functions in hardware. It is probably easier to use custom ISA instructions like Intel x87 or SSE/AVX.

Quoting iliyapolak:
> Did a quick search on the web but was not able to find any DSP that implements some of the special functions in hardware...

Did you try to look at the websites of these companies:

- AMD (I remember AMD had a great set of RISC microcontrollers, like Am29200, Am29205, etc.)
- Texas Instruments
- Motorola
- Marvell

You could also try to email these companies.

Best regards,
Sergey

> AMD (I remember AMD had a great set of RISC microcontrollers, like Am29200, Am29205

I have read the AMD manual and found the information regarding the microcontroller's ISA; it was clearly written that this controller does not even have an FPU.
I also checked TI DSPs, but these too have only a basic floating-point and integer ISA.
Probably some special gear such as FM modulators and Bessel filters can use Bessel functions; the question is how they were implemented in such complex gear. I suppose that a general-purpose TI DSP does not have dedicated hardware-accelerated Bessel functions but implements a software approximation.

Quoting iliyapolak:

> AMD (I remember AMD had a great set of RISC microcontrollers, like Am29200, Am29205

> I have read the AMD manual and found the information regarding the microcontroller's ISA; it was clearly written that this controller does not even have an FPU...

You could call AMD to verify that the Am29200 & Am29205 RISC microcontrollers have single-precision (32-bit) and
double-precision (64-bit) floating-point instructions (18 instructions in total).

There is a book, "29K Family" (AMD, User Manual, 1994), and it has lots of technical details about these RISC microcontrollers.

> You could call AMD to verify that the Am29200 & Am29205 RISC microcontrollers have single-precision (32-bit) and
> double-precision (64-bit) floating-point instructions (18 instructions in total).

I was probably not reading the right manual.

Even if those microcontrollers have floating-point instructions, it is hard to believe that AMD engineers implemented some of the special functions in hardware.
It is easier to provide simple fp arithmetic instructions and use a polynomial or rational approximation to calculate special function values.

Quoting Sergey Kostrov:
> ...Am29200 & Am29205 RISC microcontrollers have single-precision (32-bit) and double-precision (64-bit) floating-point instructions (18 instructions in total)...

Single-precision instructions:

FADD
FSUB
FMUL
FDIV
FEQ
FGE
FGT

Double-precision instructions:

DADD
DSUB
FDMUL
DMUL
DDIV
DEQ
DGE
DGT

Other:

SQRT
CONVERT
CLASS

Quoting iliyapolak:
> ...it is hard to believe that AMD engineers implemented some of the special functions in hardware...

That is correct, and they provided a CRT library of standard functions.

yes, but all implemented as slow software emulation (software traps, much like x87 code on the 386 and 486SX), isn't it?

> yes, but all implemented as slow software emulation (software traps, much like x87 code on the 386 and 486SX), isn't it?

IIRC the AMD microcontroller specification that I read clearly stated that floating-point instructions are executed by trap handlers.
Here is an excerpt from the official AMD document "Am29200 and Am29205 RISC Microcontrollers" (copied from the PDF document):

"The floating point instructions are not executed directly, but are emulated by trap handlers"

yes, in other words no FPU

> yes, in other words no FPU

When you think about it, there is no FPU, so the basic floating-point arithmetic instructions have to be emulated in software. On top of this, the more complicated approximations (sine, cosine, atan...) are implemented by a software library, which in turn calls into the floating-point trap handlers to execute primitive fp instructions like fadd and fmul.
Not so efficient for real-time applications based on heavy usage of fp instructions.

> When you think about it, there is no FPU, so the basic floating-point arithmetic instructions have to be emulated in software

in these ancient times even hardware support for integer multiplication wasn't always a given, so the multiplication of the mantissas was a critical part of your floating-point emulation routines
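To show why the mantissa product was the expensive part, here is a Python sketch of multiplication using only shifts and adds, which is the kind of loop an emulation routine needs when the CPU has no integer multiplier. It is an illustration of the general technique only, not the code of any actual trap handler; `shift_add_mul` is a made-up name.

```python
def shift_add_mul(a, b):
    """Multiply two non-negative integers using only shifts and adds."""
    product = 0
    while b:
        if b & 1:       # lowest bit of b set: add the shifted multiplicand
            product += a
        a <<= 1         # next bit of b weighs twice as much
        b >>= 1
    return product


if __name__ == "__main__":
    # 24-bit single-precision mantissas with the implicit leading 1:
    # 1.5 is 0xC00000 and 1.25 is 0xA00000 when scaled by 2**23.
    m = shift_add_mul(0xC00000, 0xA00000)
    # The raw product is scaled by 2**46; shift back to a 2**23 scale.
    print(hex(m >> 23))  # 0xf00000, i.e. 1.875 * 2**23
```

For a 24-bit mantissa the loop runs up to 24 iterations, each with a test, a conditional add and two shifts, before normalization and rounding even start.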

> in these ancient times

This reminds me of a book on computer graphics written by Foley (I do not remember the title) in which the author describes a 2D video hardware engine.

> This reminds me of a book on computer graphics written by Foley (I do not remember the title) in which the author describes a 2D video hardware engine.

I still have Foley et al., Computer Graphics: Principles and Practice, 2nd Edition, Addison Wesley 1990; it has an overview of the Silicon Graphics Power IRIS 4D/240GTX architecture, which is 3D already

IIRC the most used chip for general-purpose 3D in the early 90s was the Intel i860 "Cray on a chip", used on a lot of 3D accelerators, notably the #9 and Dupont Pixel offerings

the Am292xx series discussed here (like the Intel i960) was used in graphics mostly for 2D raster applications like laser printers, scanners, etc.

Quoting bronxzv:
> ...yes, but all implemented as slow software emulation (software traps, much like x87 code on the 386 and 486SX), isn't it?

I'll need to verify it.

I have the second edition of the Foley book. This book is outdated but has a very good intro to computer graphics theory.
The third edition is expected 1/2013. The video adapter described in the second edition is mostly used to create 2D raster images on the screen.

> The video adapter described in the second edition is mostly used to create 2D raster images on the screen.

I was referring to chapter 18 "Advanced Raster Graphics Architecture", which is mostly about hardware architectures for 3D rendering based on a standard graphics pipeline; it looks like you have another chapter in mind, probably chapter 4 "Graphics Hardware"

> it looks like you have another chapter in mind, probably chapter 4 "Graphics Hardware"

Yes, I was referring to chapter 4 "Graphics Hardware".

The Intel i860 was very advanced in those days.
Here are some of the few benchmarks published in Foley's book:
13 MFLOPS of double-precision instructions. Today one processing core can exceed this speed by 1000x.
50,000 Gouraud-shaded 100-pixel triangles per second. What would the average speed be, measured in Gouraud-shaded 100-pixel triangles, when executed on an Intel Sandy Bridge CPU?

> 50,000 Gouraud-shaded 100-pixel triangles per second. What would the average speed be, measured in Gouraud-shaded 100-pixel triangles, when executed on an Intel Sandy Bridge CPU?

on a quad-core Sandy Bridge it will be something like 30M-60M 100-sample polygons per second (i.e. 1000x more) with a dumb Z-buffer algorithm (CPU only, using all cores and fully vectorized AVX code)

here is an example with 49M polygons with per-sample normal interpolation and reflection mapping (more costly than Gouraud shading) that runs at 20+ fps (~ 1G polygons/second apparent) on a quad-core Sandy Bridge:
http://www.inartis.com/Company/Lab/KribiBenchmark/KB_Robots.aspx

it's possible thanks to scene graph traversal optimizations such as occlusion culling, where most polygons are not actually drawn

> on a quad-core Sandy Bridge it will be something like 30M-60M 100-sample polygons per second (i.e. 1000x more) with a dumb Z-buffer algorithm (CPU only, using all cores and fully vectorized AVX code)

It is simply amazing how much CPU processing power has increased over a period of 20 years.

>> here is an example with 49M polygons with per-sample normal interpolation and reflection mapping

Looking at the Robots' surfaces, which interpolation was used to calculate the samples?
Was it bicubic interpolation? It is costly, but it can give significantly smoother surface colour transitions.

> Looking at the Robots' surfaces, which interpolation was used to calculate the samples?
> Was it bicubic interpolation?

normals are bilinearly interpolated in world space and the reflection map is bilinearly interpolated in texture space; bicubic interpolation is useful mostly for texture magnification but would be overkill here IMHO
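For anyone following along, bilinear interpolation of a texture sample can be sketched in a few lines of Python. This is a generic illustration on a small scalar grid with clamped borders, not code from the demo; `bilinear_sample` and the grid layout (rows indexed by y) are assumptions of this example.

```python
import math


def bilinear_sample(grid, x, y):
    """Bilinearly interpolate grid[y][x] at fractional (x, y), clamped."""
    height, width = len(grid), len(grid[0])
    # Clamp the sample position to the valid texel range.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = grid[y0][x0] * (1.0 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1.0 - fx) + grid[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


if __name__ == "__main__":
    texels = [[0.0, 10.0],
              [20.0, 30.0]]
    print(bilinear_sample(texels, 0.5, 0.5))  # centre of the four texels
```

Each sample costs four texel reads and three linear blends, whereas bicubic needs sixteen reads, which is part of why it is usually reserved for magnification.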

> normals are bilinearly interpolated in world space and the reflection map is bilinearly interpolated in texture space; bicubic interpolation is useful mostly for texture magnification but would be overkill here IMHO

Still, the demo programmers were able to achieve smooth transitions. Maybe the use of bilinear interpolation is compensated by the high level of detail and high-frequency sampling?

Quoting iliyapolak:

> yes, in other words no FPU

> ...
> Not so efficient for real-time applications based on heavy usage of fp instructions.

The 29K family microcontrollers were designed for embedded systems, like laser printers, scanners and X terminals, and
these microcontrollers don't have an FPU in order to reduce the cost of system integration. Even though FP instructions
on these microcontrollers cause "lightweight" traps, only ~3 clock cycles are needed to complete a vector fetch.

> Still, the demo programmers were able to achieve smooth transitions. Maybe the use of bilinear interpolation is compensated by the high level of detail and high-frequency sampling?

yes, bilinear interpolation is fine for reflection maps since there is generally no magnification but instead a very high-frequency sampling in texture space (due to the wild variation of normal directions); the sampling scheme is thus paramount for good quality: adaptive stochastic antialiasing in this example, when you don't move the mouse

> Even though FP instructions on these microcontrollers cause "lightweight" traps, only ~3 clock cycles are needed to complete a vector fetch.

what was a "vector fetch" on such an ancient purely scalar chip?

btw, do you know how many cycles were required for emulating basic fp instructions like FADD and FMUL? FMUL was particularly slow due to the lack of an integer multiplier AFAIK

> adaptive stochastic antialiasing in this example, when you don't move the mouse

Adaptive stochastic antialiasing is very good at minimizing computational cost and memory bandwidth, but at the cost of an irregular sampling pattern introduced by the random (stochastic) sampling.
What is the sampling filter used in the Robots demo?
Is it a simple box filter or a sinc filter?

> The 29K family microcontrollers were designed for embedded systems, like laser printers, scanners and X terminals

For intensive floating-point applications the better option is to use Analog Devices SHARC processors.
But even these DSP processors do not have special functions directly implemented in hardware.
I think that we can come to the conclusion that none of the general-purpose DSPs implements such functions in hardware and that they use software libraries instead.

> What is the sampling filter used in the Robots demo?
> Is it a simple box filter or a sinc filter?

the reconstruction filter is a box in this example; it's generally the best filter for low-resolution raster images since other filters such as Gaussian and raised cosine lead to too much blurring, and 2-3 lobe Lanczos to too much ringing (note that we have these alternate reconstruction filters available, with a user-selectable filter radius)

NB: sinc is a theoretical filter, not something you can use in practice with a real-world FIR filter kernel
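The difference between these filters is easy to see from the kernel weights themselves. Below is a Python sketch of a box kernel, the ideal sinc, and a Lanczos kernel (sinc windowed to a finite radius). The function names and default radii are choices made for this example, not the demo's actual settings.

```python
import math


def sinc(x):
    """Normalized sinc; it never reaches zero support, hence 'theoretical'."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def box(x, radius=0.5):
    """Box kernel: constant weight inside the radius, zero outside."""
    return 1.0 if abs(x) <= radius else 0.0


def lanczos(x, lobes=2):
    """Lanczos kernel: sinc windowed by a wider sinc, cut off at `lobes`."""
    if abs(x) >= lobes:
        return 0.0
    return sinc(x) * sinc(x / lobes)


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.5, 2.5, 10.5):
        print(x, box(x), lanczos(x), sinc(x))
```

At x = 1.5 the Lanczos weight is negative; those negative lobes are what produce ringing around sharp edges, while the box kernel has no negative weights at all. The sinc column stays non-zero arbitrarily far out, which is why it cannot be used directly as a finite FIR kernel.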
