Does the latest x86 architecture offer native support for quad precision (QP) floating-point (FP) arithmetic?

If not, can QP be emulated on XMM and YMM registers with small overhead (< 2X slowdown) compared to double precision FP arithmetic?

Thanks,

Nick

# Quad precision floating point arithmetic with SSE/AVX?


Did you mean a "boost over the double precision floating point" instead of a "boost over the quad precision floating point"?

Nick

Tim,

are there any plans to incorporate native hardware-based support for quad precision, at least into Xeon processors? The performance of the pure software implementation is generally too slow for our purposes - mostly large regular financial summation tasks. If not, is there a forum of sorts where one can register interest in hardware support for quad precision in Xeon processors?

Thanks,

Anders

As this would be a long term project (years), I hope you are working with the current implementations of parallelism.

Hi Anders

If by quad-precision you have 128-bit Binary Integer Decimal in mind, or as a candidate for consideration (BID encoding can deal with rounding and precision propagation issues better than binary FP encoding), Intel's DFP library is a great place to start.

You might want to contact the leader of the Intel DFP library; he may be able to brief you on the future release plan for that library.

http://software.intel.com/en-us/articles/intel-decimal-floating-point-ma...

You can also contact me offline to explore potential performance headroom on second and third generation Intel Core processors or Intel Xeon E3 and E5 processors.

Shihjong

Quoting akirkeby *...The performance of the pure software implementation is generally too slow for our purposes - mostly large **regular financial summation tasks**...*

Could you explain why you need 128-bit precision in that case?

Rounding problems create real trouble in exchange operations, and it would be interesting to understand what your problem is.

Best regards,

Sergey

Could you explain why you need 128-bit precision in that case?

Sometimes it could be useful, when you deal with speed of execution vs. precision and you do not want an arbitrary precision implementation, which is slower than hardware registers.

For example, Pi is a transcendental number with an infinite expansion, and it could benefit from wider FP registers so that range-reduction algorithms could provide a more accurate mapping of large arguments to the range suitable for sine calculation.

Quoting iliyapolak

Could you explain why you need 128-bit precision in that case?

*Sometimes it could be useful, when you deal with speed of execution vs. precision and you do not want an arbitrary precision implementation, which is slower than hardware registers. For example, Pi is a transcendental number with an infinite expansion, and it could benefit from wider FP registers so that range-reduction algorithms could provide a more accurate mapping of large arguments to the range suitable for sine calculation.*

This range reduction has been a subject of extensive research, and practical solutions have been implemented which don't rely on extra hardware precision. Anyway, to justify the investment in a higher precision, among other things a corresponding math function library is required, requiring yet again a higher precision algorithm for range reduction.

You could find plenty of references on the limitations of simply relying on extra precision for range reduction, as the x87 firmware does.

My point was that no matter how much hardware precision you have, you still need a higher precision range reduction algorithm to support trig functions on your new high precision.

If the market demand were seen, no doubt someone would study the feasibility of vector quad precision on future 256- and 512-bit register platforms.

I didn't say there is no need for quad precision. All widely used Fortran compilers have it, for example, with a software implementation. The performance deficiency of current quad precision is due as much to lack of vectorizability as to the lack of a single hardware instruction.

I agree with you on this. We must also ask for what purpose the hardware and ISA should be modified to implement quad precision or even more. I suppose that there are not many mainstream math or engineering applications that need to calculate quad precision transcendental function values. And for those esoteric applications or highly sophisticated math packages (Mathematica, Matlab) which calculate trig functions with arbitrary precision, the memory array model will be the best implementation, albeit at the price of speed of execution.

**>>you still need a higher precision range reduction algorithm to support trig functions on your new high precision**

It is a catch-22 situation.

Quoting TimP (Intel) *I didn't say there is no need for quad precision...*

Borland C++ compiler v5.x includes a **BCD Number Library** and it allows working with numbers up to 5,000 digits. A question is:

Should I wait for hardware support of 256-bit or 512-bit precisions if some workaround could be used?

Also, having worked in the financial industry for many years, I could say that accuracy of calculations is more important than speed.

Borland C++ compiler v5.x includes a BCD Number Library and it allows working with numbers up to 5,000 digits.

Java also has two arbitrary precision classes: BigInteger and BigDecimal. But it is unintuitive to work with these classes because numerical primitives like float or int are represented by objects, so simple arithmetic operations are done on objects. You have a large memory overhead to store them, and calculation is very slow, even hundreds of times slower than arithmetic done on primitive types.

The question is what kind of applications, besides some esoteric pure mathematical software which calculates Pi to thousands of digits and sophisticated math packages like Mathematica, need such precision.

In e.g. C++ you have structs/classes without necessarily needing heap space, and you have operator overloading. Also, some C(++) compilers (e.g. gcc: __int128) allow for 128-bit integers. Intel's C compiler also knows about some kind of 128-bit floats which are emulated quite efficiently.

Thanks all for your comments so far. Been away for a bit so let me try to answer all comments and questions to add context in one big post:

The business problem is that 14-15 significant digits is not enough to retain sufficient precision for amounts of money in bookkeeping applications where large transaction volumes are totaled up on a regular basis. The problem is particularly apparent when adding low unit value currencies such as VND or IDR.
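To make the loss of significant digits concrete, here is a minimal C sketch. The function names are hypothetical and not taken from any library mentioned in this thread. It assumes IEEE 754 binary64 `double`: near 1.0e15 adjacent doubles are 0.125 apart, so adding 0.01 leaves the total unchanged, whereas a 64-bit integer count of hundredths stays exact.

```c
#include <stdint.h>

/* Returns 1 if adding `amount` to `total` in binary64 leaves `total`
   unchanged, i.e. the amount falls below half the spacing between
   adjacent doubles at that magnitude. */
static int absorbed_in_double(double total, double amount)
{
    volatile double sum = total + amount;  /* volatile: force rounding to double */
    return sum == total;
}

/* The same addition carried out on integer hundredths of a currency unit. */
static int64_t add_hundredths(int64_t total_hundredths, int64_t amount_hundredths)
{
    return total_hundredths + amount_hundredths;
}
```

With these helpers, `absorbed_in_double(1.0e15, 0.01)` reports that the hundredth was lost, while `add_hundredths` keeps it as long as the total fits in 64 bits.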

The numbers in play are not typically integers, although for simple summation they could be shifted a few digits, but this would only solve some of the applications and thus adds to the overall complexity.

Our current solution under investigation is a software implementation of a 128-bit decimal type based on IEEE 754-2008. The performance so far is 1-2 orders of magnitude slower than the corresponding 64-bit data types currently used.

Since our software is deployed in Windows environments, the only alternative to a software implementation I currently see is the FPGA route. But that's not particularly attractive, as FPGA hardware would have to be installed in bulk on servers in outsourced data centres at substantial cost.

I'm aware that asking for 128-bit precision support at CPU level is a request for the long term. However, with the current performance penalty we see from the software implementation, it is clear that while it may work in limited areas for a while, it will never be something we or our clients will be happy with.

Thanks

Quoting akirkeby *...The problem is particularly apparent when adding low unit value currencies such as VND or IDR.*

*The numbers in play are not typically integers, although for simple summation they could be shifted a few digits but this would only solve some of the applications and thus adds to the overall complexity...*

[**SergeyK**] **VND** is the Vietnamese Dong and the current exchange rate is about $4.8*10^-5 USD ( 0.0000479846 ).

Believe me, the currency exchange problem has been solved for years ( since the first PCs appeared at banks ) by introducing a normalization factor ( or a CurrencyUnit ), and in the case of **VND** it has to be equal to 10^5.

Another way is to do calculations in a Base Currency, and usually this is USD ( $100 USD = 1 / 0.0000479846 VND * 100 ).

I would like to repeat that something is really wrong conceptually with the way summations are done in your software. It could also be related to an inefficient database design.

*...I currently see is the FPGA route. But that's not particularly attractive as FPGA hardware it would have to be installed in bulk on servers is outsourced data centres at substantial cost...*

[**SergeyK**] That "FPGA solution" is clearly not the best one, and I make that statement because I worked as a C++ Software Developer for the Financial Industry for more than 8 years and was involved in the design and implementation of several financial systems (two of them were certified at a National Bank of some country).

Best regards,

Sergey

Our current solution under investigation is a software implementation of a 128-bit decimal type based on IEEE 754-2008. The performance so far is 1-2 orders of magnitude slower than the corresponding 64-bit data types currently used.

Without native hardware acceleration, nothing can be done to improve the performance, i.e. to eliminate the store/load overhead needed to represent a 128-bit number implemented in software.

Although I can't speak to the state of the art with respect to 128-bit binary FP software libraries, I can share some insights about performance improvements of IEEE 754-2008 BID floating-point math that are readily feasible without 128-bit FP hardware support.

If you look at how these FP numbers are encoded, whether it's a hardware or a software solution, they must deal with

1. Bit-fields extractions

2. Special case pruning of INF/NANs

3. Binary arithmetic operations on large bit strings to achieve the precision provided in the width of the mantissa

4. Normalization of mantissa/exponents and required rounding operations

5. Fix-up and packing bit fields back into IEEE encoding.
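Steps 1, 2 and 5 of the list above can be sketched for the BID128 layout as follows. This is a minimal illustration, not the Intel BID library's code: the type and function names are hypothetical, and it assumes the IEEE 754-2008 decimal128 binary-integer encoding (sign in bit 127, a 14-bit biased exponent in bits 126..113 with bias 6176, and a 113-bit integer coefficient) stored as two little-endian qwords. Coefficient canonicity (below 10^34) is not checked.

```c
#include <assert.h>
#include <stdint.h>

#define BID128_EXP_BIAS 6176   /* decimal128 exponent bias */

typedef struct { uint64_t lo, hi; } bid128_t;   /* little-endian qword pair */
typedef enum { BID_FINITE, BID_INF, BID_NAN, BID_NONCANONICAL } bid_class_t;

typedef struct {
    bid_class_t cls;
    int sign;
    int exp;             /* unbiased decimal exponent */
    uint64_t coeff_hi;   /* upper 49 bits of the coefficient */
    uint64_t coeff_lo;   /* lower 64 bits of the coefficient */
} bid128_fields_t;

/* Step 5: pack fields back into the encoding (common form only). */
static bid128_t bid128_pack(int sign, uint64_t coeff_hi, uint64_t coeff_lo, int exp)
{
    int biased = exp + BID128_EXP_BIAS;
    assert(biased >= 0 && biased <= 0x2FFF);   /* valid decimal128 exponent range */
    bid128_t v;
    v.lo = coeff_lo;
    v.hi = ((uint64_t)(sign & 1) << 63)
         | ((uint64_t)biased << 49)
         | (coeff_hi & ((1ULL << 49) - 1));
    return v;
}

/* Steps 1 and 2: extract bit fields and prune special cases. */
static bid128_fields_t bid128_unpack(bid128_t v)
{
    bid128_fields_t f = { BID_FINITE, (int)(v.hi >> 63), 0, 0, 0 };
    unsigned top5 = (unsigned)(v.hi >> 58) & 0x1F;   /* bits 126..122 */
    if (top5 == 0x1F) { f.cls = BID_NAN; return f; }
    if (top5 == 0x1E) { f.cls = BID_INF; return f; }
    /* Bits 126..125 == 11 selects the alternate form, whose coefficient
       exceeds 34 digits in decimal128 and is therefore non-canonical. */
    if ((top5 >> 3) == 0x3) { f.cls = BID_NONCANONICAL; return f; }
    f.exp = (int)((v.hi >> 49) & 0x3FFF) - BID128_EXP_BIAS;
    f.coeff_hi = v.hi & ((1ULL << 49) - 1);
    f.coeff_lo = v.lo;
    return f;
}
```

Even this simplified version shows why the dependency chain is long: every operation starts with shifts, masks and branches before any arithmetic on the coefficient can begin.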

Whether it's hardware or software, these result-dependency chains are hard to get around. I would venture to suggest any expectation of not more than an order of magnitude slow down, when committing to use quad-precision, is unrealistic.

In my experience with Intel's Binary Integer Decimal (BID) FP library, basic arithmetic operations do take more than 10x as many cycles as the hardware-accelerated 64-bit binary FP encoding. Bear in mind that the cycle counts of arithmetic operations are value-dependent: the most cycle-consuming BID128_add case can take 180ish cycles, while typically it's more like 80ish. On the other hand, the most cycle-consuming case of BID128_mul can take more than 400 cycles, while mid-200s is more typical.

Without 128-bit FP hardware, there are several places where algorithmic changes, the existing ISA and application architecture can yield large speedups.

Take BID128_mul for example: multi-precision arithmetic using the existing ISA's 64-bit MUL instruction to produce a 128-bit result will have the strongest impact on accelerating BID128_mul. Additional gains will follow from using the MULX instruction that will be in the market in 2013.
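As an illustration of the 64-bit MUL building block, here is a minimal sketch of a 64x64 to 128-bit multiply. The names are hypothetical. The first version assumes the GCC/Clang `unsigned __int128` extension, which on x86-64 typically compiles to a single widening MUL; the second is a portable fallback built from 32-bit halves, shown only to make the extra work visible.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128_t;

/* Widening multiply through the compiler's 128-bit integer extension. */
static u128_t mul64x64_native(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    u128_t r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}

/* Portable fallback: four 32x32->64 partial products plus carry handling. */
static u128_t mul64x64_portable(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of three values below 2^32 each: cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    u128_t r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
```

Both functions return the same result; the difference is only in how many instructions the compiler has to emit.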

The chore of initial bit field extraction and special case pruning can be handled using existing SSE instructions, so that the lengthy multi-precision math code can start sooner.

The net result is that on a Sandy Bridge based processor, the most cycle-consuming BID128_mul case would take less than 200 cycles and typical cases are well below 100 cycles, without 128-bit FP hardware and with no new ISA.

It may be feasible in an application to handle the special case input ranges at a different stage, rather than relying on an arithmetic library function to implement the defensive pre-processing stage, as a typical software library needs to do.

Quoting Shih Kuo (Intel)

*Although I can't speak to the state of the art with respect to 128-bit binary FP software libraries, I can share some insights about performance improvements of IEEE 754-2008 BID floating-point math that are readily feasible without 128-bit FP hardware support.*

*If you look at how these FP numbers are encoded, whether it's a hardware or a software solution, they must deal with*

**1. Bit-fields extractions**

**2. Special case pruning of INF/NANs**

**3. Binary arithmetic operations on large bit strings to achieve the precision provided in the width of the mantissa**

**4. Normalization of mantissa/exponents and required rounding operations**

**5. Fix-up and packing bit fields back into IEEE encoding...**

Well, could you try to explain this to a person who does accounting in some company that is in the Currency Exchange business? Or, have a chat with somebody who does accounting at Intel. I'll be very glad to hear a response from that person.

Intel Software Engineers, please try to look **Out-Of-The-Box**.

Nick has a real life problem related to a "Rounding of a Financial Transaction". Even if Nick's company spends many millions of dollars on FPGAs, or a 128-bit/256-bit/etc precision library, it won't fix a really simple problem. Take a look at:

http://www.irishwebmasterforum.com/coding-help/5997-accounting-for-rounding-errors.html

Please tell me how "magical" 128-bit precision hardware will solve that problem?

Some big companies, like SAP, have very flexible rules on how to do rounding. Take a look:

http://help.sap.com/saphelp_rc10/helpdata/en/18/8b8a3a068ada7fe10000000a114084/content.htm

Or, take a look at:

http://blog.acrossecurity.com/2012/01/is-your-online-bank-vulnerable-to.html

http://docs.oracle.com/cd/A60725_05/html/comnls/us/mrc/currco01.htm

Thanks in advance for your time!

Best regards,

Sergey

Sergey's suggestion of a normalization factor is good.

Another route to consider is to choose a monetary quantum and perform all calculations in units of quanta.

A quantum would be defined as an indivisible unit of money. An example of choice would be $1.0e-8.

This is approximately 1.0e-3 VND. A signed 64-bit integer could then handle amounts up to about +/-$92 billion (2^63 quanta of $1.0e-8 each).

Round off error of ~1/1000th of 1 VND might be acceptable. Even 1 million carefully crafted transactions couldn't skew the result by more than 1 cent.
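The quantum idea can be sketched in C as below. All names are hypothetical, and the $1.0e-8 quantum is simply the example value from this post. Sums are exact integer additions with an explicit overflow check (a signed 64-bit count of such quanta tops out near $9.2e10), and rounding to cents happens only once, at the reporting step.

```c
#include <stdint.h>

#define QUANTA_PER_USD 100000000LL   /* 1 quantum = $1.0e-8 (example value) */

typedef int64_t quanta_t;

static quanta_t quanta_from_usd_cents(int64_t cents)
{
    return cents * (QUANTA_PER_USD / 100);
}

/* Sum exactly in quanta; returns 0 on success, -1 if the total would overflow. */
static int quanta_sum(const quanta_t *amounts, int n, quanta_t *total)
{
    quanta_t acc = 0;
    for (int i = 0; i < n; i++) {
        quanta_t x = amounts[i];
        if ((x > 0 && acc > INT64_MAX - x) || (x < 0 && acc < INT64_MIN - x))
            return -1;
        acc += x;
    }
    *total = acc;
    return 0;
}

/* Round to whole cents, half away from zero, only when reporting.
   (INT64_MIN itself is not handled in this sketch.) */
static int64_t quanta_to_cents_rounded(quanta_t q)
{
    const int64_t per_cent = QUANTA_PER_USD / 100;   /* 1,000,000 quanta per cent */
    const int64_t half = per_cent / 2;
    return (q >= 0) ? (q + half) / per_cent : -((-q + half) / per_cent);
}
```

Whether the roughly $92 billion ceiling is enough depends on the totals being aggregated; a coarser quantum trades resolution for range.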

Periodically the quantum could be deflated (the normalization factor Sergey was talking about). At some date a value for 1 quantum is chosen and defined as having a normalization factor of 0. Then at some future date (assuming inflation) you could declare that we are now using the "new" quantum with a normalization factor of 1 with respect to the first generation quantum.

Jim Dempsey

Quoting Shih Kuo (Intel) *...In my experiences with Intel's Binary Integer Decimal (BID) FP library...*

Could you provide a test-case or link(s) to binaries / sources of the **Intel BID FP** library? Thanks.

Best regards,

Sergey

Quoting TimP (Intel) *Intel decimal library is included in the gcc source distribution.*

Thank you, Tim! Are there plans to include the Intel BID library in the Intel C++ compiler for Windows?

Quoting TimP (Intel) *The claims for the netlib version of the library imply you are entitled to try it yourself.*

This is exactly what I was looking for. Thank you, Tim!

Now I need to schedule some time for R&D and I will compare the **Intel BID** library with the **Borland BCD Number** library.

Best regards,

Sergey

Quoting yuriisig **Dekker's method for doubled-single extended** gives accuracy of 128 bits...

Yurii, I can't find any references for that method on the Internet (a **Google** search was used). Could you provide me with internet links or docs, please? Thanks in advance.

PS: I've found this http://en.wikipedia.org/wiki/Dekker's_algorithm but it is a different one, for concurrent programming.
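For readers looking for the arithmetic technique rather than the mutual-exclusion algorithm: the method referred to is most likely Dekker's double-length ("double-double") arithmetic, which represents a value as an unevaluated sum of two doubles. Below is a minimal sketch of addition only, with hypothetical names. It assumes IEEE 754 binary64, round-to-nearest, and compilation without value-changing options such as `-ffast-math`. Note that a pair of doubles carries roughly 106 significand bits, not a full 128.

```c
typedef struct { double hi, lo; } dd_t;   /* value = hi + lo, |lo| much smaller than |hi| */

/* Error-free sum: returns s = fl(a + b) together with the exact rounding error. */
static dd_t two_sum(double a, double b)
{
    double s = a + b;
    double bb = s - a;
    double err = (a - (s - bb)) + (b - bb);
    dd_t r = { s, err };
    return r;
}

/* Double-double addition (a simple variant that adds the low parts directly). */
static dd_t dd_add(dd_t x, dd_t y)
{
    dd_t s = two_sum(x.hi, y.hi);
    double lo = s.lo + (x.lo + y.lo);
    /* Renormalize with the fast two-sum step, valid because |s.hi| >= |lo|. */
    double hi = s.hi + lo;
    lo = lo - (hi - s.hi);
    dd_t r = { hi, lo };
    return r;
}
```

Each `dd_add` costs on the order of ten dependent double additions, and the operations vectorize across independent elements, which is why this approach is often considered for SSE/AVX.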

Hi

I can sketch the test approach of my study, which was geared towards uncovering opportunities of vectorization and native ISA performance headroom that were not exploited. This is different from typical usage of library users. But some parts may be of interest to you.

As background, GCC supports its own data type, _Decimal128, which maps to BID128 when built for the x64 architecture. GCC's native language support for the _Decimal128 extension on x64 is essentially a wrapper on top of the Intel BID library with some of the flexibility trimmed out. The Intel BID library is released in source form that can be built for Linux/Windows using common compilers to run on x86, x64 and IPF. The flexibility provided by the API of the Intel BID library includes: passing by value or reference, explicit rounding behavior control, exception reporting, endianness etc.

Internal to the Intel BID library, 128-bit and higher precision data are represented as arrays of qwords. For testing the throughput of basic arithmetic operations, one of the tasks is to generate test bit patterns to characterize cycle behavior. Different considerations come into play for Bid128_mul vs. Bid128_add.

Studying the source code of the Intel BID library to understand its algorithmic and implementation aspects was quite a task, even with my scope limited to one arithmetic operation at a time.

I was not interested in the flexibility of exception reporting, nor parameter passing choices, and I chose only to focus on round-nearest behavior as a proxy to capture the necessary algorithmic requirements.

So, I made some simplification choices: (a) extract a proxy implementation of the target BID arithmetic library function that retains the functional, algorithmic and performance characteristics of the original library implementation, (b) build a calibration harness to correlate the actual performance of the extracted proxy implementation with off-the-shelf BID128 performance, (c) make my test evaluation and vectorized POC run on both Windows and Linux.

The simplest calibration test code simply uses GCC's _Decimal128 data type extension and the standard operators '*' and '+' provided by that extension. But passing by value not only creates a portability problem; implicit data type conversion to/from _Decimal128 will also invoke additional BID conversion routines, adding overhead.

So, my proxy of the Bid128 arithmetic source implementation adopts the pass-by-reference API of the Intel BID library and uses the same data layout of arrays of qwords in little-endian order on x64.

For Bid128_mul performance evaluations, the primary knob affecting cycles is the dynamic range of the mantissas of the two input values. The Bid128 encoding provides a maximum encodable mantissa of 34 decimal digits using 113 bits within the 128-bit container.

Hence the basic flow of testing Bid128_mul looks like:

    extern void _my_Bid128_pack (__BID_UINT128 *pV128, int sign, __UINT64 qw_ho, __UINT64 qw_lo, int exp);

    void Test_BID128_MUL( /* knob parameters for random pattern generation */ )
    {
        int sign1, exp1, sign2, exp2;
        __UINT64 man_hi1, man_lo1, man_hi2, man_lo2;
        __BID_UINT128 a128, b128;
        __BID_UINT128 *pA = &a128, *pB = &b128;

        /* generate desired mantissa bit patterns, exponent values, signs */
        _my_Bid128_pack (&a128, sign1, man_hi1, man_lo1, exp1);
        _my_Bid128_pack (&b128, sign2, man_hi2, man_lo2, exp2);

    #ifdef _TARGET_GCC_LINUX_EVAL_
        _Decimal128 ref, result, *a = (_Decimal128 *) pA, *b = (_Decimal128 *) pB;

        ref = (*a) * (*b);                      // linked to gcc provided library code
        result = _proxy_BID128_MUL( pA, pB);    // link to locally compiled proxy code
        // compare ref against result, exit if different
        // measure throughput of either ref = (*a) * (*b);
        //result = _my_poc_BID128_MUL( pA, pB); // link to local vectorized poc
        // measure throughput of _proxy_BID128_MUL, my poc
    #else
        __BID_UINT128 ref, result;

        ref = _proxy_BID128_MUL( pA, pB);       // link to locally compiled proxy code
        result = _my_poc_BID128_MUL( pA, pB);   // link to locally compiled vectorized poc code
        // compare ref against result, exit if different
        // measure throughput of _proxy_BID128_MUL, my poc
    #endif
    }

Quoting yuriisig *See the book: ***Handbook of Floating-Point Arithmetic***...*

Thank you! Just downloaded...

Thank you, Shih!

Quoting Shih Kuo (Intel) *...Internal to ***Intel BID library, 128-bit and higher precision data***...*

Could you clarify the statement, please? Does it mean that **256-bit** or **512-bit** precisions are supported as well?

Best regards,

Sergey

Hi,

Thanks for your comments. Various scaling approaches are indeed sensible in many cases. However, to clarify our challenge: the problem is exacerbated when dealing with low unit value currencies, but we encounter the same challenges with clients working in USD and EUR. The fundamental issue is not something which can easily be solved through a normalisation factor, not least because the right factor is dependent on the specific context and would thus require application developer involvement rather than be transparent to the developer.

The problem is fundamentally that we run out of significant digits. The alternative, to lose some precision, is not deemed acceptable. In investment accounting specific rounding rules must be applied at specific points, and early loss of precision can impact final results visibly.

Thanks,

Anders

Since the mantissa of the input values can reach 113 bits in dynamic range, the intermediate result of the multiprecision multiply of the two input mantissas needs an even larger container than 128 bits, as large as 256 bits.
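The container growth can be seen in a schoolbook multiply over qword arrays. This is a minimal sketch with hypothetical names, not the BID library's routine; it assumes the GCC/Clang `unsigned __int128` extension for the 64x64 partial products and little-endian qword order (`w[0]` is least significant).

```c
#include <stdint.h>

typedef struct { uint64_t w[2]; } u128_t;   /* w[0] = least significant qword */
typedef struct { uint64_t w[4]; } u256_t;

/* Full 128x128 -> 256-bit product from four 64x64 -> 128-bit partial products. */
static u256_t mul128x128(u128_t a, u128_t b)
{
    u256_t r = { {0, 0, 0, 0} };
    for (int i = 0; i < 2; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 2; j++) {
            /* product + accumulator + carry always fits in 128 bits */
            unsigned __int128 t = (unsigned __int128)a.w[i] * b.w[j]
                                + r.w[i + j] + carry;
            r.w[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r.w[i + 2] = carry;
    }
    return r;
}
```

For two 34-digit coefficients the product stays below 10^68, which needs 226 bits, so four qwords are enough but three are not.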

Furthermore, the intermediate product of the two input mantissas needs to be normalized in conjunction with the desired rounding behavior to fit the encoding precision of 34 decimal digits defined by the IEEE 754 DFP spec. That is usually done by Montgomery reduction.

So the 256-bit container of the intermediate product is multiplied again by a large-enough constant to perform 256-bit integer division, producing a quotient with at least 113 bits of precision in the extreme case.
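The multiply-by-a-constant step can be illustrated at a much smaller scale. The sketch below divides a 64-bit value by 10 using a precomputed reciprocal instead of a divide instruction. The function name is hypothetical, the `unsigned __int128` extension of GCC/Clang is assumed, and the real library works on 256-bit products and larger powers of ten, so this shows only the principle.

```c
#include <stdint.h>

/* x / 10 for any 64-bit x: multiply by ceil(2^67 / 10), keep the bits above 2^67. */
static uint64_t div10_by_reciprocal(uint64_t x)
{
    const uint64_t magic = 0xCCCCCCCCCCCCCCCDULL;   /* ceil(2^67 / 10) */
    unsigned __int128 p = (unsigned __int128)x * magic;
    return (uint64_t)(p >> 67);
}
```

The constant must carry enough bits that the truncated result is exact over the whole input range, which is why the wider operands in the BID case call for a correspondingly wide constant and product.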