x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ)

Introduction

This document details how floating-point assists are handled differently by x87 and Single Instruction Multiple Data (SIMD) instructions, and explains how to change that behavior when using Streaming SIMD Extensions (SSE) and SSE2.

Because denormals and underflow numbers degrade performance on most processors, control modes have been added to the SSE and SSE2 instruction sets that fix these numbers internally in hardware without invoking a software exception handler. SSE/SSE2 and x87 operations are handled differently, so this paper describes them in two separate sections.


x87 FP Assists

Denormals are handled in one of two ways, depending on how the FPU control word (Figure 1) is set. Each exception can be individually masked or unmasked through its bit in the FPU control word, as shown in the figure below. By default, all of the exceptions are masked.



Figure 1. x87 FPU Control Word

When masked (bit = 1), the hardware produces the IEEE 754 standard default result itself. For a denormal, the value is automatically normalized when the 32-bit single-precision or 64-bit double-precision operand is converted to 80-bit double extended-precision (Intel Architecture Guide 8.5.2). Denormals are not detected in Intel hardware until the result is stored in memory unless the denormal is an 80-bit double extended-precision denormal. To mask the denormal operand bit manually, use the following code snippet:

     short int x;

     __asm FSTCW     x
     x |= 0x02;          // to mask the Denormal control bit
     __asm FLDCW     x

 

When unmasked (bit = 0), the denormal will be fixed in a software exception handler. If an application does not have a specific software exception handler for the denormal exception, the application crashes with an unhandled exception. To unmask the denormal operand bit manually, you can use the following code snippet:

     short int x;

     __asm FSTCW     x
     x &= 0xFFFD;        // to unmask the Denormal control bit
     __asm FLDCW     x

 

Unmasking the denormal operand exception is a useful debugging tool for the programmer. If an application does not have its own exception handler for denormals, the program crashes when the bit is unmasked, allowing you to see exactly where the x87 assist occurs within the program. With the denormal operand exception masked, by contrast, there is only a performance problem, and nothing indicates where it occurred unless a tool such as the VTune analyzer is used to detect the event.

The FPU status word (Figure 2) doesn't allow the user to manipulate how exceptions or operations are handled. Instead, it gives a status of what has happened in the system.



Figure 2. x87 FPU Status Word

When a process is started, the status bits for that process are all cleared to zero. Figure 3 shows the Registers window of a Visual Studio* 6.0 project after an application has started, showing the FPU status word has been initialized to zero.



Figure 3. FPU Status Word after application begins

When a denormal operand is detected, it sets the exception flag in the status word. Each of the exception flags is a sticky bit, which means that they all remain on until specifically set to zero. This enables programmers to check an application for exceptions by setting the bit to zero before the section of concern, and then verifying the status of the exception bit in the status word after the section has completed to see if a denormal occurred. Figure 4 is an example of what the FPU status word looks like after a denormal exception has been detected in an application.
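The clear-then-check pattern described above can be sketched as follows. This is a minimal sketch, not code from the original article: it assumes an x86 target and a compiler that accepts GCC-style inline assembly (GCC or Clang), and the helper names `x87_clear_flags`, `x87_read_status`, and `x87_load_sets_denormal_flag` are hypothetical. The "section of concern" here is only a float-to-long-double conversion, which such compilers are assumed to perform with an x87 load.

```c
#include <stdint.h>

/* Denormal-operand flag (DE), bit 1 of the x87 FPU status word. */
#define X87_SW_DE 0x0002u

/* Clear all sticky x87 exception flags (FNCLEX). */
static void x87_clear_flags(void)
{
    __asm__ volatile ("fnclex" : : : "memory");
}

/* Read the x87 FPU status word (FNSTSW AX). */
static uint16_t x87_read_status(void)
{
    uint16_t sw;
    __asm__ volatile ("fnstsw %0" : "=a" (sw) : : "memory");
    return sw;
}

/* Hypothetical section of concern: widen a float on the x87 stack.
 * Returns 1 if the denormal-operand flag was raised, 0 otherwise. */
static int x87_load_sets_denormal_flag(float input)
{
    volatile float src = input;      /* force a real load at run time */
    volatile long double widened;

    x87_clear_flags();               /* set the sticky bits to zero first */
    widened = src;                   /* assumed to compile to an x87 load */
    (void)widened;
    return (x87_read_status() & X87_SW_DE) != 0;
}
```

Calling `x87_load_sets_denormal_flag` with a single-precision denormal such as `1.0e-40f` is expected to report the flag, while a normal value such as `1.0f` is not.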



Figure 4. FPU Status Word after denormal exception occurs


SSE and SSE2 Assists

Within the XMM registers, 32-bit single-precision and 64-bit double-precision floating point values are not internally converted to 80-bit extended-precision floating point values as they are within the x87 FPU. Because of this, special handling is needed when SSE and SSE2 instructions operate on denormal or underflowing floating-point numbers. By default, this handling occurs internally according to IEEE Standard 754 for floating-point calculations, and operations on denormal numbers run at reduced speed.

However, with SSE and SSE2, a hardware Flush-to-Zero (FTZ) mode is possible. Some of the later processors that support SSE2 also support a Denormals-Are-Zero (DAZ) mode. To determine whether a processor supports the DAZ mode, a software check must be performed (see Intel® 64 and IA-32 Architectures Software Developer's Manual: Vol. 2A in References for further information). Both of these modes are less precise than the IEEE 754 standard method, so they are not compatible with the standard, but they provide substantially faster execution when absolute accuracy according to the standard is not needed.
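One way to perform such a software check is to execute FXSAVE and inspect the MXCSR_MASK field of the saved image: a set bit 6 indicates that the DAZ bit may be written. The sketch below assumes an x86 target, GCC-style inline assembly, and a processor that supports FXSAVE; the names `fxsave_image`, `read_mxcsr_mask`, and `daz_supported` are hypothetical. The byte offset of MXCSR_MASK (28) and the fallback mask 0x0000FFBF for processors that report a mask of zero follow the Software Developer's Manual as the editor understands it, and should be verified against the manual.

```c
#include <stdint.h>
#include <string.h>

#define MXCSR_DAZ_BIT       6
#define MXCSR_MASK_OFFSET   28            /* byte offset of MXCSR_MASK in the image */
#define MXCSR_MASK_DEFAULT  0x0000FFBFu   /* used when the saved mask reads as zero */

/* FXSAVE needs a 512-byte, 16-byte-aligned memory region. */
struct fxsave_image {
    unsigned char bytes[512];
} __attribute__((aligned(16)));

/* Return the set of MXCSR bits the processor allows software to write. */
static uint32_t read_mxcsr_mask(void)
{
    struct fxsave_image image;
    uint32_t mask;

    memset(&image, 0, sizeof image);
    __asm__ volatile ("fxsave %0" : "=m" (image) : : "memory");
    memcpy(&mask, image.bytes + MXCSR_MASK_OFFSET, sizeof mask);
    return mask != 0 ? mask : MXCSR_MASK_DEFAULT;
}

/* Return 1 if the DAZ bit (bit 6) of MXCSR is supported, 0 otherwise. */
static int daz_supported(void)
{
    return (int)((read_mxcsr_mask() >> MXCSR_DAZ_BIT) & 1u);
}
```

An application would call `daz_supported()` once at startup and only set the DAZ bit when it returns 1, since writing an unsupported MXCSR bit raises a general-protection fault.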

FTZ and DAZ modes both handle the cases when invalid floating-point data occurs or is processed with underflow or denormal conditions. See Table 1 for a layout of what floating point numbers look like in these cases. The difference between a number that is handled by FTZ and DAZ is very subtle. FTZ handles underflow conditions while DAZ handles denormals. An underflow condition occurs when a computation results in a denormal. In this case, FTZ mode sets the output to zero. DAZ fixes the cases when denormals are used as input, either as constants or by reading invalid memory into registers. DAZ mode sets the inputs of the calculation to zero before computation. FTZ can then be said to handle SSE Output Assists while DAZ handles SSE Input Assists.

Table 1. Floating-Point Number and NaN Encodings

Flush-To-Zero Mode

FTZ is a method of bypassing IEEE 754 methods of dealing with invalid floating-point numbers due to underflows. As previously mentioned, this mode is less precise, but much faster. Two conditions must be met for FTZ processing to occur:

  • The FTZ bit (bit 15) in the MXCSR register must be set (value = 1).
  • The underflow exception (bit 11) needs to be masked (value = 1).

 

If both of these conditions are met and an underflow occurs, the following steps are taken in the hardware:

  • Returns a zero with the sign of the true result.
  • Sets the precision (PE) and underflow (UE) flags.

 

If the underflow exception is masked, handling of the underflow output is determined by the FTZ mode bit. When the underflow exception is masked and the FTZ mode bit is clear (value = 0), the operation continues according to IEEE Standard 754 and produces a denormalized result.

When the underflow exception bit is unmasked, the system attempts to call a software exception handler. If no software exception handler is present, an unhandled exception occurs.

The results of using the combination of the FTZ bit and underflow mask bit are shown in Table 2.

Table 2. FTZ and Underflow Mask Bit Results

FTZ mode (bit 15)   Underflow mask (bit 11)   Effect
1                   1                         Handled, output zero, PE and UE set
0                   1                         Handled, output precise, PE and UE set
1                   0                         Software exception, UE set
0                   0                         Software exception, UE set
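The first two rows of Table 2 can be observed with a short experiment. The following is a minimal sketch rather than the article's own code: it uses the `_mm_getcsr` and `_mm_setcsr` intrinsics from `xmmintrin.h` instead of inline assembly, the helper name `mul_under_ftz` is hypothetical, and it assumes an x86 processor with SSE. The function masks the underflow exception, clears the sticky flags, performs one scalar multiply with FTZ on or off, and reports the flags that the multiply raised.

```c
#include <float.h>
#include <xmmintrin.h>

#define MXCSR_FLAGS 0x003Fu        /* sticky exception flags, bits 0-5 */
#define MXCSR_UE    (1u << 4)      /* underflow flag */
#define MXCSR_PE    (1u << 5)      /* precision flag */
#define MXCSR_UM    (1u << 11)     /* underflow exception mask */
#define MXCSR_FTZ   (1u << 15)     /* flush-to-zero mode */

/* Multiply a * b with the underflow exception masked and FTZ on or off.
 * The flags raised by the multiply are stored in *flags_out. */
static float mul_under_ftz(float a, float b, int ftz_on, unsigned *flags_out)
{
    unsigned saved = _mm_getcsr();
    unsigned csr = (saved | MXCSR_UM) & ~(MXCSR_FLAGS | MXCSR_FTZ);
    volatile float va = a, vb = b, product;   /* keep the multiply at run time */

    if (ftz_on)
        csr |= MXCSR_FTZ;
    _mm_setcsr(csr);
    product = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    *flags_out = _mm_getcsr() & MXCSR_FLAGS;
    _mm_setcsr(saved);                        /* restore the caller's MXCSR */
    return product;
}
```

For example, `mul_under_ftz(FLT_MIN, 0.5f, 1, &flags)` is expected to return zero, while the same call with `ftz_on` equal to 0 returns a nonzero denormal.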

 

The FTZ bit is cleared upon reset of the processor, but usually the OS sets this bit by default (Windows* 2000 and Windows XP behave this way).

Denormals-Are-Zero Mode

DAZ is very similar to FTZ in many ways. DAZ mode is a method of bypassing IEEE 754 methods of dealing with denormal floating-point numbers. As mentioned, this mode is less precise, but much faster, and is typically used in applications like streaming media where minute differences in quality are essentially undetectable. Two conditions must be met in order for DAZ processing to occur:

  • The DAZ bit (bit 6) in the MXCSR register must be set (value = 1).
  • The processor must support DAZ mode. Initial steppings of Pentium® 4 processors did not support DAZ. Information about detection of DAZ can be found in the Intel® 64 and IA-32 Architectures Software Developer's Manual: Vol. 2A in the References section.

 

If both of these conditions are met and a denormal operand is encountered, the operand is treated as a zero with the sign of the original operand. With DAZ mode, the denormal-operand exception flag is never set, regardless of the value of the denormal exception mask.

When the DAZ bit is set, no software exception is raised for denormal operands. When the denormal exception is unmasked, either denormals are set to zero or a software exception is raised, according to the value of the DAZ bit. For IEEE Standard 754 processing, the denormal exception must be masked and the DAZ bit must be clear, as shown in Table 3.


Table 3. DAZ and Denormal Mask Bit Results

DAZ mode (bit 6)   Denormal mask (bit 8)   Effect
1                  1                       Handled, output zero, no flags set
0                  1                       Handled, output precise, DE set
1                  0                       Handled, output zero, no flags set
0                  0                       Software exception, DE set
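The first two rows of Table 3 can be checked in the same spirit. This is a hedged sketch, not part of the original article: it uses the `_mm_getcsr` and `_mm_setcsr` intrinsics from `xmmintrin.h`, the helper name `mul_under_daz` is hypothetical, and it assumes a processor that supports DAZ (setting the DAZ bit on a processor without support faults). The denormal exception is kept masked, so no software exception handler is involved.

```c
#include <xmmintrin.h>

#define MXCSR_FLAGS 0x003Fu        /* sticky exception flags, bits 0-5 */
#define MXCSR_DE    (1u << 1)      /* denormal-operand flag */
#define MXCSR_DAZ   (1u << 6)      /* denormals-are-zero mode */
#define MXCSR_DM    (1u << 8)      /* denormal exception mask */

/* Multiply a * b with the denormal exception masked and DAZ on or off.
 * Assumes the processor supports DAZ. The flags raised by the multiply
 * are stored in *flags_out. */
static float mul_under_daz(float a, float b, int daz_on, unsigned *flags_out)
{
    unsigned saved = _mm_getcsr();
    unsigned csr = (saved | MXCSR_DM) & ~(MXCSR_FLAGS | MXCSR_DAZ);
    volatile float va = a, vb = b, product;   /* keep the multiply at run time */

    if (daz_on)
        csr |= MXCSR_DAZ;
    _mm_setcsr(csr);
    product = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
    *flags_out = _mm_getcsr() & MXCSR_FLAGS;
    _mm_setcsr(saved);                        /* restore the caller's MXCSR */
    return product;
}
```

With a denormal input such as `1.0e-40f` multiplied by `1.0e10f`, the DAZ-on call is expected to return zero with the DE flag clear, and the DAZ-off call a nonzero product with the DE flag set.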

 

By default the DAZ bit is not set.  


How to Modify Modes

The FTZ and DAZ modes can be set in the MXCSR control/status register (Figure 5). This lets you have the hardware replace denormal data with zero, avoiding the performance cost traditionally associated with software exception handlers. As with the x87 FP unit, the exceptions raised by SSE and SSE2 instructions can also be masked.






Figure 5. MXCSR Control/Status Register

The FP control word, MMX, XMM, and MXCSR registers can all be saved at the same time using the FXSAVE instruction. The variable in which this state is stored must be 512 bytes long and 16-byte aligned, laid out according to the structure shown in Figure 6.



Figure 6. Layout of FXSAVE and FXRSTOR Memory Region

Once the state has been saved with FXSAVE, the MXCSR value can be modified and loaded back using the LDMXCSR instruction, as shown in the code example in the next section.


Code Example

// With Visual Studio Proc Pack
#include <xmmintrin.h>
#include <string.h>

#define X87FLAGBITS                6    // number of exception flag bits (bits 0-5)
#define DAZ_BIT                    6
#define FTZ_BIT                    15
#define DENORMAL_EXCEPTION_MASK    8
#define UNDERFLOW_EXCEPTION_MASK   11

// Set one bit of MXCSR to 1.
void set_mxcsr_on(int bit_num)
{
    __m128    state[32];                        // 512-byte, 16-byte aligned FXSAVE area
    __int32   x;
    __asm fxsave   state
    memcpy( (void*)&x, (char*)state+24, 4);     // MXCSR is stored at byte offset 24
    x |= (1 << bit_num);
    __asm ldmxcsr   x
}

// Clear one bit of MXCSR to 0.
void set_mxcsr_off(int bit_num)
{
    __m128    state[32];
    __int32   x;
    __asm fxsave   state
    memcpy( (void*)&x, (char*)state+24, 4);
    x &= ~(1 << bit_num);
    __asm ldmxcsr   x
}

// Clear the sticky exception flags (the low X87FLAGBITS bits of MXCSR).
void clear_flags()
{
    __m128    state[32];
    __int32   x;
    __asm fxsave   state
    memcpy( (void*)&x, (char*)state+24, 4);
    x = x >> X87FLAGBITS;
    x = x << X87FLAGBITS;
    __asm ldmxcsr   x
}

// Add two denormals: denormal inputs that also produce a denormal result.
void make_denormal()
{
    __m128   denormal;
    int      den_vec[4] = {1,1,1,1};            // bit pattern 1 is the smallest denormal
    memcpy( &denormal, den_vec, sizeof(int)*4 );
    denormal = _mm_add_ps( denormal , denormal );
}

int main(void)
{
    // UNDERFLOWS: FTZ = 1, underflow mask = 0 (see Table 2)
    set_mxcsr_on(FTZ_BIT);
    set_mxcsr_off(UNDERFLOW_EXCEPTION_MASK);
    make_denormal();
    clear_flags();

    // DENORMALS: DAZ = 0, denormal mask = 1 (see Table 3)
    set_mxcsr_off(DAZ_BIT);
    set_mxcsr_on(DENORMAL_EXCEPTION_MASK);
    make_denormal();
    clear_flags();

    return 0;
}
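As an alternative to the FXSAVE-based helpers above, compilers that ship `xmmintrin.h` also provide intrinsics that read and write MXCSR directly. The following sketch is an assumption-laden illustration, not the article's method: the helper names `enter_fast_denormal_mode` and `leave_fast_denormal_mode` are hypothetical, the DAZ bit is written as a raw bit, and the processor is assumed to support DAZ. Saving and restoring the previous MXCSR value keeps the non-IEEE modes confined to the section that needs them.

```c
#include <xmmintrin.h>

#define MXCSR_DAZ (1u << 6)        /* denormals-are-zero mode, MXCSR bit 6 */

/* Turn on FTZ and DAZ for the calling thread.
 * Returns the previous MXCSR value so that it can be restored later.
 * Assumes the processor supports DAZ. */
static unsigned enter_fast_denormal_mode(void)
{
    unsigned saved = _mm_getcsr();

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);   /* sets MXCSR bit 15 */
    _mm_setcsr(_mm_getcsr() | MXCSR_DAZ);         /* sets MXCSR bit 6 */
    return saved;
}

/* Restore the MXCSR value returned by enter_fast_denormal_mode(). */
static void leave_fast_denormal_mode(unsigned saved)
{
    _mm_setcsr(saved);
}
```

A performance-critical loop would be bracketed by a call to `enter_fast_denormal_mode()` before it and `leave_fast_denormal_mode(saved)` after it.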


 


Conclusion

To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.

References


Intel® 64 and IA-32 Architectures Software Developer's Manuals, Volume 1: Basic Architecture

Intel® 64 and IA-32 Architectures Software Developer's Manuals, Volume 2A: Instruction Set Reference A-M

Related Links

Read Intel engineer Rich Winterton's article on Software Exception Handling

Additional articles on Intel Pentium 4 processors

Information on Intel® Software Development Products

About the Author

Shawn Casey is an Application Engineer with Intel's Software & Solutions Group. He joined Intel in 2000 and has worked on application tuning, optimization and related training for the Intel Pentium 4 and next generation processors. Currently, Shawn is working on enabling threading and mobile technologies for future generation processors. He earned a BSE in Electrical Engineering in 1996 from Arkansas State University.

