Benefits of Intel® Advanced Vector Extensions For Quaternion Spherical Linear Interpolation (Slerp)

Introduction

Intel® Advanced Vector Extensions (Intel® AVX), introduced with the new Intel® microarchitecture code named Sandy Bridge, extends the capabilities of Intel® Streaming SIMD Extensions (Intel® SSE), especially for floating-point data and operations. Intel® AVX essentially doubles the width of the existing XMM registers and adds new extensions that can operate on the wider data width. Intel® AVX significantly increases floating-point performance density with improved power efficiency over previous 128-bit SIMD instruction set extensions. This document specifically examines how Intel® AVX and Sandy Bridge microarchitecture features such as the wider 256-bit SIMD registers, a distinct destination operand, new data manipulation and arithmetic primitives, dual 128-bit load ports and the doubled computational execution width can benefit the Spherical Linear Interpolation (Slerp) operation on quaternion data sets.

Intel also provides a set of Intel® AVX software development tools, such as Intel® AVX enabled compilers, the Intel® Software Development Emulator (Intel® SDE) and the Intel® Architecture Code Analyzer. All of these tools were used effectively during the development of this kernel and can be downloaded from the Intel® AVX website: https://software.intel.com/en-us/intel-isa-extensions

Testing Environment

The performance speedups stated in this paper are based on runs on actual Sandy Bridge silicon. The test data is assumed to be resident in the first-level processor cache prior to the computation of the Slerp algorithm. Performance comparisons are made between Intel® AVX and corresponding Intel® SSE implementations written with C intrinsics, both run on the Sandy Bridge silicon. The code was compiled using the Intel® C++ Compiler version 11.0.074.

3D Orientation Representations

In 3D graphics, orientations can be represented as Euler angles, rotation matrices or quaternions. Transformation between coordinate spaces and interpolation between orientations are common graphics operations. Matrices are necessary when transforming and rotating orientation vectors between coordinate spaces. Spherical linear interpolation, on the other hand, is best performed on quaternions, which provide the smoothest interpolation between orientations. Quaternions are small and efficient and are a good replacement for rotation matrices: they take less storage, only 4 scalar values compared to 9 for a 3x3 rotation matrix, quaternion multiplication is more efficient than matrix multiplication, and a quaternion can easily be converted to a matrix when needed. Unlike Euler angles, quaternions do not suffer from problems such as gimbal lock. These properties make quaternions ideal for many algorithms, for example those used in 3D animation. Refer to Reference [2] for background on quaternions and on converting between rotation matrices and quaternions.

Quaternion Slerp

Animation systems often use interpolation to generate rotations in between key frames. Different animations can also be blended together to achieve smooth transitions from one animation to another. There are three general approaches to quaternion interpolation: Slerp, Lerp with re-normalization, and exponential map interpolation. Of these, Slerp is considered to provide the optimal interpolation curve between two general rotations. The Slerp algorithm, however, involves several trigonometric functions and is a computationally intensive operation.

Given two quaternions q0 and q1 and a parameter t in the range [0, 1], the spherical linear interpolation is defined as follows:

    Slerp(q0, q1; t) = q0 * sin((1 - t) * alpha) / sin(alpha) + q1 * sin(t * alpha) / sin(alpha)

where alpha is the angle between q0 and q1, which can be calculated from the dot product of the two (unit) quaternions:

    cos(alpha) = q0 . q1 = x0*x1 + y0*y1 + z0*z1 + w0*w1

Reference [1] provides detailed information on the optimized Slerp algorithm that was followed for both the scalar and SIMD implementations.
 

Previous Work

The Slerp algorithm used in the Intel® AVX implementation discussed in this whitepaper follows the optimized algorithm presented in Reference [1] (J.M.P. van Waveren, Slerping Clock Cycles, Id Software, Inc., 2005). First the Intel® SSE code from that paper was ported to C intrinsics form, and then it was migrated to Intel® AVX.

Intel® AVX Implementation of Quaternion Slerp

Similar to the Intel® SSE implementation in the referenced whitepaper, the Intel® AVX implementation does not perform an isolated spherical linear interpolation between two quaternions, but instead interpolates between two lists of quaternions. A skeletal animation system, for example, usually requires interpolation between two key frames, where each key frame is a list of joints that defines a pose of the skeleton. A joint from a key frame is stored as a quaternion for the orientation and a 4D vector for the position.
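
The partial listing later in this paper refers to fromQuaternion, toQuaternion, resultQuaternion, g_factor and g_quaternionCount. The following minimal sketch shows the data layout those names imply; the exact declarations (and the MAX_QUATERNIONS capacity constant) are assumptions made here for illustration, not code taken from the downloadable source.

//Assumed layout, for illustration only
#define MAX_QUATERNIONS 1024                      //hypothetical capacity
typedef struct { float x, y, z, w; } Quaternion;  //orientation, 4 floats
typedef struct { float x, y, z, w; } Vector4;     //position (not used by the Slerp kernel)

//Key-frame lists and the output list, 32-Byte aligned so the kernel can use
//_mm256_load_ps/_mm256_store_ps (the alignment syntax varies by compiler)
__attribute__((aligned(32))) Quaternion fromQuaternion[MAX_QUATERNIONS];
__attribute__((aligned(32))) Quaternion toQuaternion[MAX_QUATERNIONS];
__attribute__((aligned(32))) Quaternion resultQuaternion[MAX_QUATERNIONS];

float g_factor;           //interpolation parameter t in [0, 1]
int   g_quaternionCount;  //number of quaternions in each list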

It is fairly straightforward to port the Slerp algorithm from Intel® SSE to Intel® AVX. The initial port provides a 1.58x speedup without any additional optimizations. However, several optimizations can be applied to achieve even higher performance levels.

A careful inspection of the original algorithm revealed that swizzling after the initial multiplications reduces the instruction count. There is no need to swizzle both input quaternions first and then multiply; instead, the inputs can be multiplied together first and that result swizzled. This optimization eliminated four _mm256_unpackxx_ps and four _mm256_unpackxx_pd instructions.

There is another optimization for the swizzle operations which reduces the instruction count even further. This swizzle optimization is possible because it is not necessary to do a complete swizzle that arranges all of the elements in sequential order; it is sufficient to arrange them in the same respective order within their destination registers. Processing begins by reading the first eight quaternions from memory into 4 ymm registers using the _mm256_load_ps instruction. The X components from this set of quaternions are then gathered into one register, and similar operations are applied to gather the Y, Z, and W components. This approach reduces the number of instructions executing on the shuffle port from 36 to 24.
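
In the optimized code this partial swizzle is applied to the products of the multiplications rather than to the raw components; for example, the X product terms are gathered as follows (excerpted from the full source listing at the end of this paper):

   //Step 1: interleave the product terms of quaternions 0-3 and 4-7
   //ymmR4 = y3, y1, x3, x1, y2, y0, x2, x0
   ymmR4 = _mm256_unpacklo_ps(ymmR0, ymmR1);
   //ymmR6 = y7, y5, x7, x5, y6, y4, x6, x4
   ymmR6 = _mm256_unpacklo_ps(ymmR2, ymmR3);
   //Step 2: ymmR0 = x7, x5, x3, x1, x6, x4, x2, x0 - not in sequential order,
   //but in the same relative order in which the Y, Z and W terms end up
   ymmR0 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(ymmR4), _mm256_castps_pd(ymmR6)));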




Once the components are gathered into proper positions, the SLERP operation is performed. A similar operation is applied to gather all components of each quaternion back together. The results are then written to memory.

It is also beneficial to unroll the for-loop once, thus performing the SLERP operation on sixteen quaternions from each list per loop iteration. This optimization produces another 2.4% performance gain compared to the non-unrolled Intel® AVX version.

There were some instances where it was beneficial to re-arrange instructions. The sum of products at the beginning of the algorithm is one such area. One partial sum can be calculated before the operands used in the second partial sum are swizzled.

Memory alignment and the use of _mm256_load_ps instead of _mm256_loadu_ps also provide performance gains, so alignment of 256-bit operands on 32-Byte boundaries is strongly recommended. This means that anyone using this code needs to pass in aligned data. If alignment of the data cannot be guaranteed, then the unaligned versions of the load and store instructions must be used. It is helpful to review the code generated by the compiler and select the load and store instructions that produce the most efficient approach to memory access.
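
As an illustration of how a caller might satisfy this requirement, the following sketch uses _mm_malloc for 32-Byte aligned allocation (released with _mm_free) and falls back to _mm256_loadu_ps when alignment cannot be verified. The function names and parameters here are hypothetical and are not part of the downloadable source.

#include <immintrin.h>
#include <stdint.h>

//Illustration only: allocate quaternion storage on a 32-Byte boundary so that
//_mm256_load_ps/_mm256_store_ps can be used (quatCount is a hypothetical parameter)
float *allocateQuaternionBuffer(int quatCount)	{
   return (float *)_mm_malloc((size_t)quatCount * 4 * sizeof(float), 32);
}

//Illustration only: use the aligned load only when the pointer really is
//aligned on a 32-Byte boundary
__m256 loadEightFloats(const float *p)	{
   return (((uintptr_t)p & 31) == 0) ? _mm256_load_ps(p) : _mm256_loadu_ps(p);
}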

The Intel® Architecture Code Analyzer identified critical path instructions and provided execution port usage information. This analysis tool can help developers choose the best instructions for the implementation of an algorithm.

Results

The 128-bit code was compiled for the Intel® microarchitecture code named Nehalem to generate Intel® SSE code and executed on the Sandy Bridge silicon. The corresponding 256-bit Intel® AVX enabled code was compiled for the Sandy Bridge microarchitecture and executed on the Sandy Bridge silicon. Data was aligned on 16-Byte boundaries for the Intel® SSE code and 32-Byte boundaries for the Intel® AVX code. The speedups listed here are for the 256-bit code relative to the 128-bit code.

It should also be noted that a performance improvement greater than 2x would not normally be expected when moving from 128-bit Intel® SSE instructions to 256-bit Intel® AVX instructions. In this case, the results are slightly greater than 2x due to a few additional optimizations applied to the Intel® AVX version that were not used in the Intel® SSE version. The Intel® SSE version was based on the original white paper (Reference [1]) with no additional optimizations applied; the only difference is that the Intel® SSE code in the original paper was written in assembly and was converted to C intrinsics for the Intel® AVX speedup comparisons.

For data sets whose sizes are multiples of 16, a speedup of 2.16x was achieved with Intel® AVX. This is because the Intel® AVX code processes twice as much floating-point data per instruction as Intel® SSE, and the one-way loop unrolling allows 16 elements to be processed per iteration.

For data sets whose sizes are not multiples of 16, the remaining elements have to be handled in a separate loop. This second loop is currently processed as simple scalar C code. However, it can be further optimized using Intel® AVX masked move instructions, which can conditionally load and store only certain array elements without crossing array bounds; a sketch of this approach follows.
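
The following sketch illustrates what such a masked-move remainder loop might look like. It is an illustration of the idea only, is not part of the downloadable source, and elides the actual Slerp arithmetic.

   //Sketch: process the leftover (g_quaternionCount & 15) quaternions with AVX
   //masked moves, two quaternions (8 floats) at a time, without reading or
   //writing past the end of the arrays
   int remainingFloats = (g_quaternionCount - mod16QuaternionCount) * 4;
   for(int j = mod16QuaternionCount * 4; remainingFloats > 0; j += 8, remainingFloats -= 8)	{
      int valid = remainingFloats > 8 ? 8 : remainingFloats;
      //An element is loaded or stored only if the sign bit of its mask lane is set
      __m256i mask = _mm256_setr_epi32(valid > 0 ? -1 : 0, valid > 1 ? -1 : 0,
                                       valid > 2 ? -1 : 0, valid > 3 ? -1 : 0,
                                       valid > 4 ? -1 : 0, valid > 5 ? -1 : 0,
                                       valid > 6 ? -1 : 0, valid > 7 ? -1 : 0);
      __m256 from = _mm256_maskload_ps(pFromQuat + j, mask);
      __m256 to   = _mm256_maskload_ps(pToQuat + j, mask);
      //... perform the same Slerp computation as in the main loop ...
      __m256 result = from;	//placeholder only; the real code would store the Slerp result
      _mm256_maskstore_ps(pResults + j, mask, result);
   }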
 

Algorithm                                         Speedup   Parameters
Quaternion Slerp (only multiples of 16)           2.16x     Source is two sets of 64 quaternions each
Quaternion Slerp (includes non-multiples of 16)   1.86x     Source is two sets of 67 quaternions each

Table 1 - 256-bit to 128-bit Speedup Results

Intel® Architecture Code Analyzer Correlation

The Intel® Architecture Code Analyzer is a useful tool for analyzing a block of instructions for execution port utilization and for identifying throughput bottlenecks. This tool was used effectively while developing the Slerp kernel to analyze execution port 5 utilization at various steps of the algorithm. The code analyzer can also be used to estimate the expected speedup. For the Slerp kernel, the code analyzer's expected speedup is within 4% of the silicon speedup. However, it should be noted that results from the code analyzer matching silicon results this closely is not typical. The Intel® Architecture Code Analyzer assumes that all data resides in the first-level cache and that there are no dynamic micro-architectural penalties. The tool should only be used to get a rough estimate of the expected speedup.

Intel® Architecture Code Analyzer Intel® SSE throughput analysis report:
----------------------------------------------------------------------------

Block Throughput: 47.75 Cycles       Throughput Bottleneck: Port0

Port Binding In Cycles Per Iteration:
-----------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-----------------------------------------------------------------------------
| Cycles | 46.0   0.0  | 34.0 | 19.5   16.5 | 19.5   16.5 | 6.0  | 33.0 |
-----------------------------------------------------------------------------



Intel® Architecture Code Analyzer Intel® AVX throughput analysis report:
-----------------------------------------------------------------------------

Block Throughput: 97.20 Cycles       Throughput Bottleneck: Port0

Port Binding In Cycles Per Iteration:
-----------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-----------------------------------------------------------------------------
| Cycles | 96.0   0.0  | 68.0 | 33.5   56.0 | 33.5   56.0 | 22.0 | 61.0 |
-----------------------------------------------------------------------------


Note: To collect the above code analyzer statistics, the first loop, which handles quaternion counts that are multiples of 16, was instrumented by inserting IACA_START and IACA_END markers at the beginning and end of the block of instructions inside the loop.
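
For reference, the following sketch shows the marker placement; iacaMarks.h is the marker header distributed with the Intel® Architecture Code Analyzer, and the loop body itself is elided here.

#include "iacaMarks.h"	//defines the IACA_START and IACA_END marker macros

for(int i = 0; i < mod16QuaternionCount; i += 16)	{
   IACA_START	//marks the beginning of the analyzed block of instructions
   //... body of the multiple-of-16 Slerp loop from the source listing ...
   IACA_END	//marks the end of the analyzed block of instructions
}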

For 64 quaternions,
Total Intel® SSE Intel® Architecture Code Analyzer count = 47.75 * 64/4 = 764 cycles
Total Intel® AVX Intel® Architecture Code Analyzer count = 97.20 * 64/16 = 368.5 cycles
Predicted Intel® Architecture Code Analyzer speedup = Intel® SSE count/ Intel® AVX count = 2.07x

The predicted speedup is within 4% of the speedup observed with actual Sandy Bridge silicon.

Conclusion

The results for the Spherical Linear Interpolation of two sets of aligned quaternion data show that the Intel® AVX version outperformed the original Intel® SSE implementation by 2.16x. This algorithm aptly demonstrates how Intel® AVX achieves higher throughput than Intel® SSE by operating on twice the number of operands per instruction when the entire register width is utilized. The algorithm also achieves higher throughput by performing Slerp calculations on eight quaternions simultaneously, thereby exploiting the Intel® AVX SIMD architecture. Having a higher ratio of computational instructions to memory operations was also beneficial. Additional optimizations that helped were simple loop unrolling and hand scheduling the loads from the unrolled iteration ahead of the higher-latency compute instructions. Performing initialization of all constants outside the main compute loop also helps increase Intel® AVX performance.

Source Code for Quaternion Slerp

The complete source code for the quaternion SLERP can be downloaded here.

Partial source code listing is provided in the following section.

//Some constants needed when calculating sin(x)
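//sin(x) is evaluated as x*(1 + x^2*(SINC4 + x^2*(SINC3 + x^2*(SINC2 + x^2*(SINC1 + x^2*SINC0)))))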
#define SINC0 -2.39e-08f
#define SINC1 2.7526e-06f
#define SINC2 -1.98409e-04f
#define SINC3 8.3333315e-03f
#define SINC4 -1.666666664e-01f

//Some constants needed for Newton-Rapson iteration
#define RSQRT_C0 3.0f
#define RSQRT_C1 -0.5f

//Some constants needed when calculating arc tangent
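//atan(x), for x in [0, 1], is evaluated as x*(1 + x^2*(ATANC7 + x^2*(ATANC6 + ... + x^2*ATANC0)))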
#define ATANC0 0.0028662257f
#define ATANC1 -0.0161657367f
#define ATANC2 0.0429096138f
#define ATANC3 -0.0752896400f
#define ATANC4 0.1065626393f
#define ATANC5 -0.1420889944f
#define ATANC6 0.1999355085f
#define ATANC7 -0.3333314528f

//General purpose constants
#define ALL_ONES 1.0f
#define CLEAR_SIGN_BIT 0x7fffffff
#define SET_SIGN_BIT 0x80000000
#define HALF_PI (float) (22.0f/7.0f/2.0f)
#define TINY 0.1e-09f

#define DECLARE_INIT_M256(name, value) __m256 name = {value, value, value, value, value, value, value, value};
#define DECLARE_INIT_M256I(name, value) __m256i name = {value, value, value, value, value, value, value, value};

//Initialize constants
DECLARE_INIT_M256(sinC0, SINC0);
DECLARE_INIT_M256(sinC1, SINC1);
DECLARE_INIT_M256(sinC2, SINC2);
DECLARE_INIT_M256(sinC3, SINC3);
DECLARE_INIT_M256(sinC4, SINC4);
DECLARE_INIT_M256(sqrtC0, RSQRT_C0);
DECLARE_INIT_M256(sqrtC1, RSQRT_C1);
DECLARE_INIT_M256(atanC0, ATANC0);
DECLARE_INIT_M256(atanC1, ATANC1);
DECLARE_INIT_M256(atanC2, ATANC2);
DECLARE_INIT_M256(atanC3, ATANC3);
DECLARE_INIT_M256(atanC4, ATANC4);
DECLARE_INIT_M256(atanC5, ATANC5);
DECLARE_INIT_M256(atanC6, ATANC6);
DECLARE_INIT_M256(atanC7, ATANC7);
DECLARE_INIT_M256(allOnes, ALL_ONES);
DECLARE_INIT_M256(halfPi, HALF_PI);
DECLARE_INIT_M256(tiny, TINY);
DECLARE_INIT_M256I(dClearSignBit, CLEAR_SIGN_BIT);
DECLARE_INIT_M256I(dSetSignBit, SET_SIGN_BIT);
DECLARE_INIT_M256(clearSignBit, 0);
DECLARE_INIT_M256(setSignBit, 0);

//Application initialization - call once before invoking slerp 
void initializationFunction(void)	{
	
   //Initialize some constants
   clearSignBit = _mm256_castsi256_ps(dClearSignBit);
   setSignBit = _mm256_castsi256_ps(dSetSignBit);
}

//Perform the SLERP operation
void ProcessOptimizedKernel(void)	{
__m256 fromX, fromY, fromZ, fromW, toX, toY, toZ, toW, factor;
__m256 cosom, absCosom, sinsqr, sinom;
__m256 ymmR0, ymmR1, ymmR2, ymmR3, ymmR4, ymmR5, ymmR6, ymmR7;
__m256 scale0, scale1;
__m256 signBit;
__m256 omega;

float * pFromQuat;
float * pToQuat;
float * pResults;

//Initialize data in and out pointers
pFromQuat = (float *) fromQuaternion;
pToQuat = (float *) toQuaternion;
pResults = (float *) resultQuaternion;

//Other initializations
factor = _mm256_broadcast_ss(&g_factor);
int mod16QuaternionCount = g_quaternionCount & 0xfffffff0;

//Operate on all the quaternions
for(int i = 0; i < mod16QuaternionCount; i += 16)	{

   //Load From-Quat 0 and Quat 1 into one register
   fromX = _mm256_load_ps(pFromQuat + i/8*32);
   //Load From-Quat 2 and Quat 3 into one register
   fromY = _mm256_load_ps(pFromQuat + i/8*32 + 2*4);
   //Load From-Quat 4 and Quat 5 into one register
   fromZ = _mm256_load_ps(pFromQuat + i/8*32 + 4*4);
   //Load From-Quat 6 and Quat 7 into one register	
   fromW = _mm256_load_ps(pFromQuat + i/8*32 + 6*4);
   //Load To-Quat 0 and Quat 1 into one register
   toX = _mm256_load_ps(pToQuat + i/8*32);
   //Load To-Quat 2 and Quat 3 into one register
   toY = _mm256_load_ps(pToQuat + i/8*32 + 2*4);
   //Load To-Quat 4 and Quat 5 into one register
   toZ = _mm256_load_ps(pToQuat + i/8*32 + 4*4);
   //Load To-Quat 6 and Quat 7 into one register
   toW = _mm256_load_ps(pToQuat + i/8*32 + 6*4);

   //*** Compute product terms: fromx*tox, fromy*toy, fromz*toz, fromw*tow
   //ymmR0 = fromw1*tow1, fromz1*toz1, fromy1*toy1, fromx1*tox
   //        fromw0*tow0, fromz0*toz0, fromy0*toy0, fromx0*tox0
   ymmR0 = _mm256_mul_ps(fromX, toX);	//Quaternion 0 and 1
   ymmR1 = _mm256_mul_ps(fromY, toY);	//Quaternion 2 and 3
   ymmR2 = _mm256_mul_ps(fromZ, toZ);	//Quaternion 4 and 5
   ymmR3 = _mm256_mul_ps(fromW, toW);	//Quaternion 6 and 7

   //*** Swizzle products to prepare for additions
   //Start collecting product terms (step 1 of 2)
   //Get y3, y1, x3, x1, y2, y0, x2, x0
   ymmR4 = _mm256_unpacklo_ps(ymmR0, ymmR1);
   //Get w3, w1, z3, z1, w2, w0, z2, z0
   ymmR5 = _mm256_unpackhi_ps(ymmR0, ymmR1);
   //Get y7, y5, x7, x5, y6, y4, x6, x4
   ymmR6 = _mm256_unpacklo_ps(ymmR2, ymmR3);
   //Get w7, w5, z7, z5, w6, w4, z6, z4
   ymmR7 = _mm256_unpackhi_ps(ymmR2, ymmR3);

   //Finish collecting product terms (step 2 of 2)
   //Get x7, x5, x3, x1, x6, x4, x2, x0
   ymmR0 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(ymmR4), _mm256_castps_pd(ymmR6)));
   //Get y7, y5, y3, y1, y6, y4, y2, y0
   ymmR1 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(ymmR4), _mm256_castps_pd(ymmR6)));
   //calculate first partial sum
   ymmR4 = _mm256_add_ps(ymmR0, ymmR1);
   //Get z7, z5, z3, z1, z6, z4, z2, z0
   ymmR2 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(ymmR5), _mm256_castps_pd(ymmR7)));
   //Get w7, w5, w3, w1, w6, w4, w2, w0
   ymmR3 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(ymmR5), _mm256_castps_pd(ymmR7)));

   //Finish adding the terms
   //Second partial sum
   ymmR5 = _mm256_add_ps(ymmR2, ymmR3);
   //Final sum
   cosom = _mm256_add_ps(ymmR4, ymmR5);	
		
   //*** Compute the absolute value of above
   absCosom = _mm256_and_ps(cosom, clearSignBit);
   //Save the sign bit, used to invert scale1 later
   signBit = _mm256_and_ps(cosom, setSignBit);	
			
   //*** Compute 1.0 - (above squared)   
   ymmR0 = _mm256_mul_ps(absCosom, absCosom);
   sinsqr = _mm256_sub_ps(allOnes, ymmR0);
   //Watch for very small values
   ymmR1 = _mm256_xor_ps(ymmR1, ymmR1);		//Zeros
   ymmR1 = _mm256_cmpeq_ps(ymmR1, sinsqr);
   ymmR1 = _mm256_and_ps(ymmR1, tiny);		//Make them tiny, not zero
   sinsqr = _mm256_and_ps(sinsqr, clearSignBit);//Make them positive
   sinsqr = _mm256_or_ps(sinsqr, ymmR1);

   //Compute 1/(square root of above)
   sinom = _mm256_rsqrt_ps(sinsqr);
   ymmR0 = _mm256_mul_ps(sinsqr, sinom);	
   ymmR0 = _mm256_mul_ps(ymmR0, sinom);
   ymmR0 = _mm256_sub_ps(ymmR0, sqrtC0);
   ymmR1 = _mm256_mul_ps(sinom, sqrtC1);
   sinom = _mm256_mul_ps(ymmR0, ymmR1);	//Newton-Raphson refined 1/sqrt(sinsqr)
   ymmR0 = _mm256_mul_ps(sinsqr, sinom);	//sinsqr*sinom = sin(omega)
		
   //*** Compute the ArcTangent (omega = arctan(sinom*sinsqr, absCosom))
   ymmR1 = absCosom;
   absCosom = _mm256_min_ps(absCosom, ymmR0);
   ymmR0 = _mm256_max_ps(ymmR0, ymmR1);
   ymmR1 = _mm256_cmpeq_ps(ymmR1, absCosom);
   ymmR2 = _mm256_rcp_ps(ymmR0);
   ymmR0 = _mm256_mul_ps(ymmR0, ymmR2);
   ymmR0 = _mm256_mul_ps(ymmR0, ymmR2);
   ymmR2 = _mm256_add_ps(ymmR2, ymmR2);
   ymmR2 = _mm256_sub_ps(ymmR2, ymmR0);
   absCosom = _mm256_mul_ps(absCosom, ymmR2);
   ymmR0 = ymmR1;		//copy the mask
   ymmR0 = _mm256_and_ps(ymmR0, setSignBit);
   absCosom = _mm256_xor_ps(absCosom, ymmR0);

   //cmp mask determines contents: Pi/2 or 0.0
   ymmR1 = _mm256_and_ps(ymmR1, halfPi);

   //finish ArcTangent calculation
   ymmR0 = _mm256_mul_ps(absCosom, absCosom);
   ymmR2 = _mm256_mul_ps(atanC0, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC1);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC2);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC3);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC4);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC5);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC6);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC7);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, allOnes);
   ymmR2 = _mm256_mul_ps(ymmR2, absCosom);
   //Adjust by adding Pi/2 or 0.0
   omega = _mm256_add_ps(ymmR2, ymmR1);
		
   //*** Compute the first scale factor
   //Compute (1.0 - factor)*omega
   ymmR1 = _mm256_mul_ps(factor, omega);
   ymmR0 = _mm256_sub_ps(omega, ymmR1);
		
   //Compute both scale factors
   //Compute sin(above)
   ymmR1 = ymmR0;						//used to calculate scale0
   //Compute factor*omega
   ymmR6 = _mm256_mul_ps(factor, omega);
   ymmR4 = ymmR6;						//used to calculate scale1
   ymmR1 = _mm256_mul_ps(ymmR1, ymmR1);
   ymmR4 = _mm256_mul_ps(ymmR4, ymmR4); //scale1
   ymmR2 = _mm256_mul_ps(sinC0, ymmR1);
   ymmR5 = _mm256_mul_ps(sinC0, ymmR4); //scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC1);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC1); //scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4); //scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC2);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC2); //scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4); //scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC3);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC3); //scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4); //scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC4);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC4); //scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4); //scale1
   ymmR2 = _mm256_add_ps(ymmR2, allOnes);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);	//ymmR2 has sin((1-factor)*omega) for eight quaternions
   ymmR5 = _mm256_add_ps(ymmR5, allOnes);  //scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR6);	//ymmR5 has sin(factor*omega) for eight quaternions
		
   //Compute above*sinom
   scale0 = _mm256_mul_ps(ymmR2, sinom);	//scale0 has the first scale factor
   scale1 = _mm256_mul_ps(ymmR5, sinom);	//scale1 has the second scale factor

   //*** invert the second scale factor if that first sum of products is negative
   scale1 = _mm256_xor_ps(scale1, signBit);
		
   //*** Compute the final results by scaling
   //Copy scale values for quat 1 and 0 to all spots in their 128-bit lane
   //ymmR6 = scale01, scale01, scale01, scale01, scale00, scale00, scale00, scale00
   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0x00);
   //ymmR7 = scale11, scale11, scale11, scale11, scale10, scale10, scale10, scale10
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0x00);
   //Scale the original From quaternions 1 and 0
   ymmR0 = _mm256_mul_ps(ymmR6, fromX);
   //Scale the original To quaternions 1 and 0
   ymmR1 = _mm256_mul_ps(ymmR7, toX);
   //Add to complete the SLERP, then store results to memory. 
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);	//ymmR2 has all the quat 0 and 1 results
   _mm256_store_ps(pResults + i/8*32, ymmR2);

   //Copy scale values for quat 3 and 2 to all spots in their 128-bit lane
   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0x55);
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0x55);
   //Scale, add, and store results for quats 2 and 3
   ymmR0 = _mm256_mul_ps(ymmR6, fromY);
   ymmR1 = _mm256_mul_ps(ymmR7, toY);
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);//ymmR2 has all the quat 2 and 3 results
   _mm256_store_ps(pResults + i/8*32 + 2*4, ymmR2);

   //Copy scale values for quat 5 and 4 to all spots in their 128-bit lane
   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0xaa);
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0xaa);
   //Scale, add, and store results for quats 4 and 5
   ymmR0 = _mm256_mul_ps(ymmR6, fromZ);
   ymmR1 = _mm256_mul_ps(ymmR7, toZ);
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);	//ymmR2 has all the quat 4 and 5 results
   _mm256_store_ps(pResults + i/8*32 + 4*4, ymmR2);

   //Copy scale values for quat 7 and 6 to all spots in their 128-bit lane
   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0xff);
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0xff);
   //Scale, add, and store results for quats 6 and 7
   ymmR0 = _mm256_mul_ps(ymmR6, fromW);
   ymmR1 = _mm256_mul_ps(ymmR7, toW);
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);	//ymmR2 has all the quat 6 and 7 results
   _mm256_store_ps(pResults + i/8*32 + 6*4, ymmR2);

   //*****************************************************************
   //Unrolled once - repeat the above for another eight quaternions
   //Load From-Quat 0 and Quat 1 into one register
   fromX = _mm256_load_ps(pFromQuat + i/8*32 + 32);
   //Load From-Quat 2 and Quat 3 into one register
   fromY = _mm256_load_ps(pFromQuat + i/8*32 + 2*4 + 32);
   //Load From-Quat 4 and Quat 5 into one register
   fromZ = _mm256_load_ps(pFromQuat + i/8*32 + 4*4 + 32);
   //Load From-Quat 6 and Quat 7 into one register
   fromW = _mm256_load_ps(pFromQuat + i/8*32 + 6*4 + 32);
   //Load To-Quat 0 and Quat 1 into one register
   toX = _mm256_load_ps(pToQuat + i/8*32 + 32);
   //Load To-Quat 2 and Quat 3 into one register
   toY = _mm256_load_ps(pToQuat + i/8*32 + 2*4 + 32);
   //Load To-Quat 4 and Quat 5 into one register
   toZ = _mm256_load_ps(pToQuat + i/8*32 + 4*4 + 32);
   //Load To-Quat 6 and Quat 7 into one register
   toW = _mm256_load_ps(pToQuat + i/8*32 + 6*4 + 32);

   //*** Compute fromx*tox, fromy*toy, fromz*toz, fromw*tow
   ymmR0 = _mm256_mul_ps(fromX, toX);	//Quaternion 0 and 1
   ymmR1 = _mm256_mul_ps(fromY, toY);	//Quaternion 2 and 3
   ymmR2 = _mm256_mul_ps(fromZ, toZ);	//Quaternion 4 and 5
   ymmR3 = _mm256_mul_ps(fromW, toW);	//Quaternion 6 and 7

   //*** Swizzle to perform adds
   //Start collecting terms (step 1 of 2)
   //Get y3, y1, x3, x1, y2, y0, x2, x0
   ymmR4 = _mm256_unpacklo_ps(ymmR0, ymmR1);
   //Get w3, w1, z3, z1, w2, w0, z2, z0
   ymmR5 = _mm256_unpackhi_ps(ymmR0, ymmR1);
   //Get y7, y5, x7, x5, y6, y4, x6, x4
   ymmR6 = _mm256_unpacklo_ps(ymmR2, ymmR3);
   //Get w7, w5, z7, z5, w6, w4, z6, z4
   ymmR7 = _mm256_unpackhi_ps(ymmR2, ymmR3);

   //Finish collecting terms (step 2 of 2)
   //Get x7, x5, x3, x1, x6, x4, x2, x0
   ymmR0 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(ymmR4), _mm256_castps_pd(ymmR6)));
   //Get y7, y5, y3, y1, y6, y4, y2, y0
   ymmR1 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(ymmR4), _mm256_castps_pd(ymmR6)));
   ymmR4 = _mm256_add_ps(ymmR0, ymmR1);
   //Get z7, z5, z3, z1, z6, z4, z2, z0
   ymmR2 = _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(ymmR5), _mm256_castps_pd(ymmR7)));
   //Get w7, w5, w3, w1, w6, w4, w2, w0
   ymmR3 = _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(ymmR5), _mm256_castps_pd(ymmR7)));

   //Finish adding the terms
   ymmR5 = _mm256_add_ps(ymmR2, ymmR3);
   cosom = _mm256_add_ps(ymmR4, ymmR5);
		
   //*** Compute the absolute value of above
   absCosom = _mm256_and_ps(cosom, clearSignBit);
   signBit = _mm256_and_ps(cosom, setSignBit);	//Used to invert scale1 later
	
   //*** Compute 1.0 - (above squared)
   ymmR0 = _mm256_mul_ps(absCosom, absCosom);
   sinsqr = _mm256_sub_ps(allOnes, ymmR0);
   //Watch for very small values
   ymmR1 = _mm256_xor_ps(ymmR1, ymmR1);		//Zeros
   ymmR1 = _mm256_cmpeq_ps(ymmR1, sinsqr);
   ymmR1 = _mm256_and_ps(ymmR1, tiny);		//Make them tiny, not zero
   sinsqr = _mm256_and_ps(sinsqr, clearSignBit);//Make them positive
   sinsqr = _mm256_or_ps(sinsqr, ymmR1);

   //Compute 1/(square root of above)
   sinom = _mm256_rsqrt_ps(sinsqr);
   ymmR0 = _mm256_mul_ps(sinsqr, sinom);	
   ymmR0 = _mm256_mul_ps(ymmR0, sinom);
   ymmR0 = _mm256_sub_ps(ymmR0, sqrtC0);
   ymmR1 = _mm256_mul_ps(sinom, sqrtC1);
   sinom = _mm256_mul_ps(ymmR0, ymmR1);	//Newton-Raphson refined 1/sqrt(sinsqr)
   ymmR0 = _mm256_mul_ps(sinsqr, sinom);	//sinsqr*sinom = sin(omega)
		
   //*** Compute the ArcTangent (omega = arctan(sinom*sinsqr, absCosom))
   ymmR1 = absCosom;
   absCosom = _mm256_min_ps(absCosom, ymmR0);
   ymmR0 = _mm256_max_ps(ymmR0, ymmR1);
   ymmR1 = _mm256_cmpeq_ps(ymmR1, absCosom);
   ymmR2 = _mm256_rcp_ps(ymmR0);
   ymmR0 = _mm256_mul_ps(ymmR0, ymmR2);	
   ymmR0 = _mm256_mul_ps(ymmR0, ymmR2);
   ymmR2 = _mm256_add_ps(ymmR2, ymmR2);
   ymmR2 = _mm256_sub_ps(ymmR2, ymmR0);
   absCosom = _mm256_mul_ps(absCosom, ymmR2);
   ymmR0 = ymmR1;								//copy the mask
   ymmR0 = _mm256_and_ps(ymmR0, setSignBit);
   absCosom = _mm256_xor_ps(absCosom, ymmR0);

   //cmp mask determines contents: Pi/2 or 0.0
   ymmR1 = _mm256_and_ps(ymmR1, halfPi);
   ymmR0 = _mm256_mul_ps(absCosom, absCosom);
   ymmR2 = _mm256_mul_ps(atanC0, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC1);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC2);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC3);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC4);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC5);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC6);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, atanC7);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);
   ymmR2 = _mm256_add_ps(ymmR2, allOnes);
   ymmR2 = _mm256_mul_ps(ymmR2, absCosom);
   omega = _mm256_add_ps(ymmR2, ymmR1);
		
   //*** Compute the first scale factor
   //Compute (1.0 - factor)*omega
   ymmR1 = _mm256_mul_ps(factor, omega);
   ymmR0 = _mm256_sub_ps(omega, ymmR1);
		
   //Compute both scale factors
   //Compute sin(above)
   ymmR1 = ymmR0;						//used to calculate scale0
   //Compute factor*omega
   ymmR6 = _mm256_mul_ps(factor, omega);
   ymmR4 = ymmR6;						//used to calculate scale1
   ymmR1 = _mm256_mul_ps(ymmR1, ymmR1);
   ymmR4 = _mm256_mul_ps(ymmR4, ymmR4);	//scale1
   ymmR2 = _mm256_mul_ps(sinC0, ymmR1);
   ymmR5 = _mm256_mul_ps(sinC0, ymmR4);	//scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC1);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC1);	//scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4);	//scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC2);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC2);	//scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4);	//scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC3);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC3);	//scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4);	//scale1
   ymmR2 = _mm256_add_ps(ymmR2, sinC4);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR1);
   ymmR5 = _mm256_add_ps(ymmR5, sinC4);	//scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR4);	//scale1
   ymmR2 = _mm256_add_ps(ymmR2, allOnes);
   ymmR2 = _mm256_mul_ps(ymmR2, ymmR0);	//ymmR2 has sin((1-factor)*omega) for eight quaternions
   ymmR5 = _mm256_add_ps(ymmR5, allOnes);  //scale1
   ymmR5 = _mm256_mul_ps(ymmR5, ymmR6);	//ymmR5 has sin(factor*omega) for eight quaternions
		
   //Compute above*sinom
   scale0 = _mm256_mul_ps(ymmR2, sinom);	//scale0 has the first scale factor
   scale1 = _mm256_mul_ps(ymmR5, sinom);	//scale1 has the second scale factor

   //*** invert the second scale factor if
   //*** that first sum of products is negative
   scale1 = _mm256_xor_ps(scale1, signBit);	
		
   //*** Compute the final results by scaling, adding, then storing
   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0x00);
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0x00);
   ymmR0 = _mm256_mul_ps(ymmR6, fromX);
   ymmR1 = _mm256_mul_ps(ymmR7, toX);
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);	//ymmR2 has all the quat 0 and 1 results
   _mm256_store_ps(pResults + i/8*32 + 32, ymmR2);

   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0x55);
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0x55);
   ymmR0 = _mm256_mul_ps(ymmR6, fromY);
   ymmR1 = _mm256_mul_ps(ymmR7, toY);
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);	//ymmR2 has all the quat 2 and 3 results
   _mm256_store_ps(pResults + i/8*32 + 2*4 + 32, ymmR2);

   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0xaa);
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0xaa);
   ymmR0 = _mm256_mul_ps(ymmR6, fromZ);
   ymmR1 = _mm256_mul_ps(ymmR7, toZ);
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);	//ymmR2 has all the quat 4 and 5 results
   _mm256_store_ps(pResults + i/8*32 + 4*4 + 32, ymmR2);

   ymmR6 = _mm256_shuffle_ps(scale0, scale0, 0xff);
   ymmR7 = _mm256_shuffle_ps(scale1, scale1, 0xff);
   ymmR0 = _mm256_mul_ps(ymmR6, fromW);
   ymmR1 = _mm256_mul_ps(ymmR7, toW);
   ymmR2 = _mm256_add_ps(ymmR0, ymmR1);	//ymmR2 has all the quat 6 and 7 results
   _mm256_store_ps(pResults + i/8*32 + 6*4 + 32, ymmR2);
}	

//Handle any remaining quaternions
//Operate on all the quaternions
{
   float cosom, absCosom, sinom, sinSqr, omega, scale0, scale1;
   for(int i = mod16QuaternionCount; i < g_quaternionCount; i++)	{
      cosom = fromQuaternion[i].x*toQuaternion[i].x + fromQuaternion[i].y*toQuaternion[i].y +
              fromQuaternion[i].z*toQuaternion[i].z + fromQuaternion[i].w*toQuaternion[i].w;
      absCosom = fabs(cosom);
      if((1.0f - absCosom) > 1e-6f)	{
         sinSqr = 1.0f - absCosom*absCosom;
         sinom = reciprocalSqrt(sinSqr);
         omega = atanPositive(sinSqr*sinom, absCosom, i);
         scale0 = sinZeroHalfPi((1.0f - g_factor) * omega) * sinom;
         scale1 = sinZeroHalfPi(g_factor * omega) * sinom;
      }
      else	{
         scale0 = 1.0f - g_factor;
         scale1 = g_factor;
      }
      scale1 = (cosom >= 0.0f) ? scale1 : -scale1;
      resultQuaternion[i].x = scale0 * fromQuaternion[i].x + scale1 * toQuaternion[i].x;
      resultQuaternion[i].y = scale0 * fromQuaternion[i].y + scale1 * toQuaternion[i].y;
      resultQuaternion[i].z = scale0 * fromQuaternion[i].z + scale1 * toQuaternion[i].z;
      resultQuaternion[i].w = scale0 * fromQuaternion[i].w + scale1 * toQuaternion[i].w;	
      }
   }
}


 

 

References

The following documents are referenced in this application note, and provide background or supporting information for understanding the topics presented in this document.

1. J.M.P. van Waveren, Slerping Clock Cycles, Id Software, Inc. 2005, /sites/default/files/m/d/4/1/d/8/293747_293747.pdf

2. J.M.P. van Waveren, From Quaternions to Matrix and Back, Id Software, Inc. 2005, /sites/default/files/m/d/4/1/d/8/293748.pdf

3. Ken Shoemake, Animating Rotation with Quaternion Curves, Computer Graphics 19(3): 245-254, 1985, http://portal.acm.org/citation.cfm?doid=325334.325242
 

About the Authors

Pallavi Mehrotra - Pallavi joined Intel as a Senior Software Engineer in 2006 after working as a CDMA network software developer at Motorola. She is currently a member of the SSG Apple enabling team, working on optimizing Mac OS* X applications for power and performance. Pallavi holds a Master's degree in Computer Science from Arizona State University and a Bachelor's degree in Electrical Engineering from India.

Richard Hubbard - Richard is a Senior Software Engineer and a member of the SSG Apple enabling team, working on optimizing Mac OS* X applications for power and performance. Richard holds a Master's degree in Electrical Engineering from Stevens Institute of Technology and a Bachelor's degree in Computer Engineering from New Jersey Institute of Technology.

 
For more complete information about compiler optimizations, see our Optimization Notice.