Using SIMD Technologies on Intel® Architecture to Speed Up Game Code

Abstract

As we know, numerous and complex mathematical operations are the cornerstones of 2D and 3D game development and execution, and these operations frequently require many vector and matrix transformations with correspondingly high computation demands. In this article, we look at matrix multiplication and use “The Last Defender,” an Android*-based 3D shooting game popular in the PRC, as a practical case to describe how to use one of the SIMD technologies for Intel® Architecture (IA), Intel® Streaming SIMD Extensions (Intel® SSE[XX]) and Supplemental Streaming SIMD Extensions (SSSE[XX]), to speed up mathematical operations and improve game performance. At the end of the article, some common optimization solutions for IA-based code are mentioned.

Background

Simply put, Single Instruction Multiple Data, or SIMD, performs the same operation on multiple data elements with one instruction; it is a form of parallel computing at the instruction level. Because it exploits this parallelism, Intel® SSE is a very good match for vector and matrix operations in game code: each element in the vector or matrix can usually be operated on independently (developers must check that elements do not depend on each other, which occasionally happens). With traditional serial computing, where each element operation takes one instruction cycle, a 4x4 matrix, for example, would take 16 cycles to process. But if we use SIMD technology and assume the implementation can handle 4 data operations per instruction cycle in parallel, 4 element operations finish in 1 instruction cycle, and all 16 element operations take only 4 instruction cycles!
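To make the 16-versus-4 comparison concrete, here is a minimal sketch that adds two 16-element float arrays both ways. The function names add16Scalar and add16Sse are illustrative and do not come from the game code; unaligned loads and stores are used so that the sketch makes no assumption about how the arrays are allocated.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 / x86-64 targets only
#include <cassert>

// Serial version: one addition per loop iteration, 16 iterations in total.
void add16Scalar(const float* a, const float* b, float* out) {
    assert(a && b && out);
    for (int i = 0; i < 16; ++i) {
        out[i] = a[i] + b[i];
    }
}

// SIMD version: each _mm_add_ps adds four floats at once,
// so 4 iterations cover all 16 elements.
void add16Sse(const float* a, const float* b, float* out) {
    assert(a && b && out);
    for (int i = 0; i < 16; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load: no alignment assumed
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
```

Both functions produce the same values; the difference lies only in how many arithmetic instructions are issued. Real speedups depend on memory access and the processor's pipeline, so the factor of 4 is best read as an upper bound for this kind of code.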

Intel SSE instructions are widely used in SIMD implementations on IA; they are especially seen in game engines. Intel SSE includes many extensions: SSE, SSE2, SSE3, and SSE4.x, and they support many kinds of integer and floating-point types of operations. They also support highly efficient memory access. In this article, we refer to them all as “Intel SSE” for simplification, and here we will show how to use them in the matrix multiplication operation.

There are two kinds of matrix storage models in memory:

1. In Column-Major Order (Figure 1), all elements of the matrix are put in column order one by one in memory. If memory is accessed by contiguous addressing, the matrix elements will be accessed column by column. As we know, OpenGL* uses this method to store matrices in memory.


Figure 1. Column-Major Order

2. In Row-Major Order (Figure 2), all elements of a matrix are put in row order, one by one in memory. If memory is accessed by contiguous addressing, the matrix elements will be accessed row by row. “The Last Defender” game uses this method as its high-level matrix storage model.


Figure 2. Row-Major Order

Because this article focuses on “The Last Defender” (Row-Major order) as the use case for game development on Android, and OpenGL ES* (Column-Major order) is the only low-level graphics hardware acceleration API on Android, we will cover both storage models and their respective optimizations.
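The difference between the two storage models comes down to how a (row, column) pair maps to an offset in the 16-float array. The following sketch shows both mappings and a conversion between them; the helper names are illustrative and are not taken from OpenGL ES or from the game.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (row, col) in a 4x4 matrix stored as 16 contiguous floats.
constexpr std::size_t rowMajorIndex(std::size_t row, std::size_t col) {
    return row * 4 + col;  // a whole row is contiguous in memory
}
constexpr std::size_t colMajorIndex(std::size_t row, std::size_t col) {
    return col * 4 + row;  // a whole column is contiguous in memory
}

// Copies a row-major matrix into column-major storage.
// On the raw 16-float array this is the same as a transpose.
void rowMajorToColMajor(const float* in, float* out) {
    assert(in != out);  // not written as an in-place conversion
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < 4; ++c) {
            out[colMajorIndex(r, c)] = in[rowMajorIndex(r, c)];
        }
    }
}
```

For example, element (1, 2) sits at offset 6 in Row-Major order and at offset 9 in Column-Major order.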

In the following discussion, the order of matrix multiplication (vector transformation) is the same in OpenGL ES and “The Last Defender”. Both use “matrix premultiplication”, as this equation shows (V is a vector, M is a matrix):

    Vo = Mn x ... x M2 x M1 x Vi
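As a small illustration of this ordering, the sketch below applies a scale (M1) and then a translation (M2) to a vector, using plain scalar code. The types Vec4 and Mat4 and the helper names are assumptions made for this example only; they are not the game's classes.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<float, 16>;  // row-major here: element (r, c) is m[r * 4 + c]

// Matrix times column vector.
Vec4 mulMV(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += m[r * 4 + c] * v[c];
    return out;
}

// Matrix times matrix.
Mat4 mulMM(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            for (int k = 0; k < 4; ++k)
                out[r * 4 + c] += a[r * 4 + k] * b[k * 4 + c];
    return out;
}

Mat4 makeScale(float s) {
    return Mat4{{s, 0.0f, 0.0f, 0.0f,
                 0.0f, s, 0.0f, 0.0f,
                 0.0f, 0.0f, s, 0.0f,
                 0.0f, 0.0f, 0.0f, 1.0f}};
}

Mat4 makeTranslate(float tx, float ty, float tz) {
    return Mat4{{1.0f, 0.0f, 0.0f, tx,
                 0.0f, 1.0f, 0.0f, ty,
                 0.0f, 0.0f, 1.0f, tz,
                 0.0f, 0.0f, 0.0f, 1.0f}};
}
```

With M1 = makeScale(2), M2 = makeTranslate(1, 0, 0) and Vi = (1, 1, 1, 1), both mulMV(M2, mulMV(M1, Vi)) and mulMV(mulMM(M2, M1), Vi) give (3, 2, 2, 1): the matrix nearest the vector is applied first. Swapping the matrices gives (4, 2, 2, 1) instead, which is why the order in the equation matters.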

Please refer to Figure 3:


Figure 3. MxV

Figure 3 shows that “MxV” results in a new vector. Multiplying a 4x4 matrix by a 4-element vector needs 16 multiplication operations and 12 add operations, so the operation load is heavy. If these operations are performed serially, they take many instruction cycles and a long time. Game engines in particular perform many operations of this kind, so this is a key area for optimization.

In Figure 3, the columns in the blue box are the 4 element operations that can be performed simultaneously, which means a whole MxV operation needs only 4 parallel multiplication operations and 3 parallel add operations. In the ideal case, this reduces the number of arithmetic instructions by a factor of 4. In fact, Intel SSE is very good at executing all multiplication and add operations concurrently in each column, as Figure 4 shows.


Figure 4. Matrix Parallel Operations
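The column-wise scheme can be sketched with SSE intrinsics as follows. This is an illustrative example rather than code from the game: the function name mulMVSse is hypothetical, the matrix is assumed to be stored in Column-Major order so that each column is one contiguous group of four floats, and unaligned loads are used to avoid any alignment assumption.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 / x86-64 targets only
#include <cassert>

// out = M x v, where m holds a 4x4 matrix in Column-Major order (16 floats)
// and v and out each hold 4 floats.
// Uses 4 packed multiplications and 3 packed additions.
void mulMVSse(const float* m, const float* v, float* out) {
    assert(m && v && out);
    __m128 col0 = _mm_loadu_ps(m);
    __m128 col1 = _mm_loadu_ps(m + 4);
    __m128 col2 = _mm_loadu_ps(m + 8);
    __m128 col3 = _mm_loadu_ps(m + 12);

    // _mm_set1_ps copies one vector element into all four lanes,
    // so each column is scaled by its matching element of v.
    __m128 r = _mm_mul_ps(col0, _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(col1, _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(col2, _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(col3, _mm_set1_ps(v[3])));

    _mm_storeu_ps(out, r);  // the result is stored only after all inputs are read
}
```

For a matrix whose storage holds the values 1 to 16 in order and v = (1, 2, 3, 4), the result is (90, 100, 110, 120), the same as the serial computation with its 16 multiplications and 12 additions.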


Figure 5. Matrix X Matrix

Figure 5 shows an example of multiplying two matrices, which is an extension of multiplying a matrix by a vector, shown previously. But there is one important point we should know: this operation will be impacted by the matrix storage models. The different colors in Figure 5 indicate parallel operations in two different matrix storage modes. Pink shows the parallel operation in Row-Major mode, and orange shows another parallel operation in Column-Major mode. Because Intel SSE also provides highly efficient memory access instructions for different matrix storage models, we should use different algorithms to utilize these acceleration instructions. But memory alignment restrictions for memory access instructions must be noted.

The next two sections show two different solutions based on the two matrix storage models. Both use Intel SSE to parallelize matrix operations to speed up code execution.

Row-Major Matrix Multiplication Optimization Based on “The Last Defender”

Before optimizing the “The Last Defender” game engine code with Intel SSE, we analyzed it using Intel® VTune™ Amplifier 2011 for Android to profile its computing consumption, specifically noting the “Matrix4f::mul” function as shown in Figure 6:


Figure 6. Original (Baseline) Matrix4f::mul computing consumption reference

We profiled our code on the Motorola MT788 smartphone with the Intel® Atom™ processor Z2480. After specific operations, we found Matrix4f::mul computing consumption reference was 83,340,000—a very time-consuming operation. In real code, it looks like this:

void Matrix4f::mul(Matrix4f *m1, Matrix4f *m2)
{
	// The destination does not alias either input: write the results directly.
	if (this != m1 && this != m2)
	{

		this->f[m00] = m1->f[m00] * m2->f[m00] + m1->f[m01] * m2->f[m10] + m1->f[m02] * m2->f[m20] + m1->f[m03] * m2->f[m30];
		this->f[m01] = m1->f[m00] * m2->f[m01] + m1->f[m01] * m2->f[m11] + m1->f[m02] * m2->f[m21] + m1->f[m03] * m2->f[m31];
		this->f[m02] = m1->f[m00] * m2->f[m02] + m1->f[m01] * m2->f[m12] + m1->f[m02] * m2->f[m22] + m1->f[m03] * m2->f[m32];
		this->f[m03] = m1->f[m00] * m2->f[m03] + m1->f[m01] * m2->f[m13] + m1->f[m02] * m2->f[m23] + m1->f[m03] * m2->f[m33];

		this->f[m10] = m1->f[m10] * m2->f[m00] + m1->f[m11] * m2->f[m10] + m1->f[m12] * m2->f[m20] + m1->f[m13] * m2->f[m30];
		this->f[m11] = m1->f[m10] * m2->f[m01] + m1->f[m11] * m2->f[m11] + m1->f[m12] * m2->f[m21] + m1->f[m13] * m2->f[m31];
		this->f[m12] = m1->f[m10] * m2->f[m02] + m1->f[m11] * m2->f[m12] + m1->f[m12] * m2->f[m22] + m1->f[m13] * m2->f[m32];
		this->f[m13] = m1->f[m10] * m2->f[m03] + m1->f[m11] * m2->f[m13] + m1->f[m12] * m2->f[m23] + m1->f[m13] * m2->f[m33];

		this->f[m20] = m1->f[m20] * m2->f[m00] + m1->f[m21] * m2->f[m10] + m1->f[m22] * m2->f[m20] + m1->f[m23] * m2->f[m30];
		this->f[m21] = m1->f[m20] * m2->f[m01] + m1->f[m21] * m2->f[m11] + m1->f[m22] * m2->f[m21] + m1->f[m23] * m2->f[m31];
		this->f[m22] = m1->f[m20] * m2->f[m02] + m1->f[m21] * m2->f[m12] + m1->f[m22] * m2->f[m22] + m1->f[m23] * m2->f[m32];
		this->f[m23] = m1->f[m20] * m2->f[m03] + m1->f[m21] * m2->f[m13] + m1->f[m22] * m2->f[m23] + m1->f[m23] * m2->f[m33];

		this->f[m30] = m1->f[m30] * m2->f[m00] + m1->f[m31] * m2->f[m10] + m1->f[m32] * m2->f[m20] + m1->f[m33] * m2->f[m30];
		this->f[m31] = m1->f[m30] * m2->f[m01] + m1->f[m31] * m2->f[m11] + m1->f[m32] * m2->f[m21] + m1->f[m33] * m2->f[m31];
		this->f[m32] = m1->f[m30] * m2->f[m02] + m1->f[m31] * m2->f[m12] + m1->f[m32] * m2->f[m22] + m1->f[m33] * m2->f[m32];
		this->f[m33] = m1->f[m30] * m2->f[m03] + m1->f[m31] * m2->f[m13] + m1->f[m32] * m2->f[m23] + m1->f[m33] * m2->f[m33];
	}
	// The destination is also an input: compute into temporaries first, then copy.
	else
	{
		float _m00, _m01, _m02, _m03, _m10, _m11, _m12, _m13, _m20, _m21, _m22, _m23, _m30, _m31, _m32, _m33; // temporaries for the result matrix
		_m00 = m1->f[m00] * m2->f[m00] + m1->f[m01] * m2->f[m10] + m1->f[m02] * m2->f[m20] + m1->f[m03] * m2->f[m30];
		_m01 = m1->f[m00] * m2->f[m01] + m1->f[m01] * m2->f[m11] + m1->f[m02] * m2->f[m21] + m1->f[m03] * m2->f[m31];
		_m02 = m1->f[m00] * m2->f[m02] + m1->f[m01] * m2->f[m12] + m1->f[m02] * m2->f[m22] + m1->f[m03] * m2->f[m32];
		_m03 = m1->f[m00] * m2->f[m03] + m1->f[m01] * m2->f[m13] + m1->f[m02] * m2->f[m23] + m1->f[m03] * m2->f[m33];

		_m10 = m1->f[m10] * m2->f[m00] + m1->f[m11] * m2->f[m10] + m1->f[m12] * m2->f[m20] + m1->f[m13] * m2->f[m30];
		_m11 = m1->f[m10] * m2->f[m01] + m1->f[m11] * m2->f[m11] + m1->f[m12] * m2->f[m21] + m1->f[m13] * m2->f[m31];
		_m12 = m1->f[m10] * m2->f[m02] + m1->f[m11] * m2->f[m12] + m1->f[m12] * m2->f[m22] + m1->f[m13] * m2->f[m32];
		_m13 = m1->f[m10] * m2->f[m03] + m1->f[m11] * m2->f[m13] + m1->f[m12] * m2->f[m23] + m1->f[m13] * m2->f[m33];

		_m20 = m1->f[m20] * m2->f[m00] + m1->f[m21] * m2->f[m10] + m1->f[m22] * m2->f[m20] + m1->f[m23] * m2->f[m30];
		_m21 = m1->f[m20] * m2->f[m01] + m1->f[m21] * m2->f[m11] + m1->f[m22] * m2->f[m21] + m1->f[m23] * m2->f[m31];
		_m22 = m1->f[m20] * m2->f[m02] + m1->f[m21] * m2->f[m12] + m1->f[m22] * m2->f[m22] + m1->f[m23] * m2->f[m32];
		_m23 = m1->f[m20] * m2->f[m03] + m1->f[m21] * m2->f[m13] + m1->f[m22] * m2->f[m23] + m1->f[m23] * m2->f[m33];

		_m30 = m1->f[m30] * m2->f[m00] + m1->f[m31] * m2->f[m10] + m1->f[m32] * m2->f[m20] + m1->f[m33] * m2->f[m30];
		_m31 = m1->f[m30] * m2->f[m01] + m1->f[m31] * m2->f[m11] + m1->f[m32] * m2->f[m21] + m1->f[m33] * m2->f[m31];
		_m32 = m1->f[m30] * m2->f[m02] + m1->f[m31] * m2->f[m12] + m1->f[m32] * m2->f[m22] + m1->f[m33] * m2->f[m32];
		_m33 = m1->f[m30] * m2->f[m03] + m1->f[m31] * m2->f[m13] + m1->f[m32] * m2->f[m23] + m1->f[m33] * m2->f[m33];

		this->f[m00] = _m00;
		this->f[m01] = _m01;
		this->f[m02] = _m02;
		this->f[m03] = _m03;
		this->f[m10] = _m10;
		this->f[m11] = _m11;
		this->f[m12] = _m12;
		this->f[m13] = _m13;
		this->f[m20] = _m20;
		this->f[m21] = _m21;
		this->f[m22] = _m22;
		this->f[m23] = _m23;
		this->f[m30] = _m30;
		this->f[m31] = _m31;
		this->f[m32] = _m32;
		this->f[m33] = _m33;
	}
}

Yes, this code is clear and simple. But obviously, it is long and time-consuming, and the function is called frequently in the game engine, affecting performance and making it a prime candidate for optimization.

As mentioned above, the simple SSE optimization can be applied as follows (in Row-Major order):

	__m128 m1_row_0 = _mm_setr_ps(m1->f[m00], m1->f[m01], m1->f[m02], m1->f[m03]);
	__m128 m1_row_1 = _mm_setr_ps(m1->f[m10], m1->f[m11], m1->f[m12], m1->f[m13]);
	__m128 m1_row_2 = _mm_setr_ps(m1->f[m20], m1->f[m21], m1->f[m22], m1->f[m23]);
	__m128 m1_row_3 = _mm_setr_ps(m1->f[m30], m1->f[m31], m1->f[m32], m1->f[m33]);

	__m128 m2_row_0 = _mm_setr_ps(m2->f[m00], m2->f[m01], m2->f[m02], m2->f[m03]);
	__m128 m2_row_1 = _mm_setr_ps(m2->f[m10], m2->f[m11], m2->f[m12], m2->f[m13]);
	__m128 m2_row_2 = _mm_setr_ps(m2->f[m20], m2->f[m21], m2->f[m22], m2->f[m23]);
	__m128 m2_row_3 = _mm_setr_ps(m2->f[m30], m2->f[m31], m2->f[m32], m2->f[m33]);

	__m128 out0;
	__m128 out1;
	__m128 out2;
	__m128 out3;

	out0 = _mm_mul_ps(m2_row_0, _mm_replicate_x_ps(m1_row_0));
	out1 = _mm_mul_ps(m2_row_0, _mm_replicate_x_ps(m1_row_1));
	out2 = _mm_mul_ps(m2_row_0, _mm_replicate_x_ps(m1_row_2));
	out3 = _mm_mul_ps(m2_row_0, _mm_replicate_x_ps(m1_row_3));

	out0 = _mm_madd_ps(m2_row_1, _mm_replicate_y_ps(m1_row_0), out0);
	out1 = _mm_madd_ps(m2_row_1, _mm_replicate_y_ps(m1_row_1), out1);
	out2 = _mm_madd_ps(m2_row_1, _mm_replicate_y_ps(m1_row_2), out2);
	out3 = _mm_madd_ps(m2_row_1, _mm_replicate_y_ps(m1_row_3), out3);

	out0 = _mm_madd_ps(m2_row_2, _mm_replicate_z_ps(m1_row_0), out0);
	out1 = _mm_madd_ps(m2_row_2, _mm_replicate_z_ps(m1_row_1), out1);
	out2 = _mm_madd_ps(m2_row_2, _mm_replicate_z_ps(m1_row_2), out2);
	out3 = _mm_madd_ps(m2_row_2, _mm_replicate_z_ps(m1_row_3), out3);

	out0 = _mm_madd_ps(m2_row_3, _mm_replicate_w_ps(m1_row_0), out0);
	out1 = _mm_madd_ps(m2_row_3, _mm_replicate_w_ps(m1_row_1), out1);
	out2 = _mm_madd_ps(m2_row_3, _mm_replicate_w_ps(m1_row_2), out2);
	out3 = _mm_madd_ps(m2_row_3, _mm_replicate_w_ps(m1_row_3), out3);

	_mm_store_ps(&this->f[0], out0);
	_mm_store_ps(&this->f[4], out1);
	_mm_store_ps(&this->f[8], out2);
	_mm_store_ps(&this->f[12], out3);

This implementation is based on Intel SSE Intrinsics. We recommend that developers use SSE Intrinsics, when compilers support them, instead of writing pure assembly language. They are easier to use and more intuitive than assembly language, typically with little or no performance loss.

    __m128 is a data type for SSE Intrinsics. It is 128 bits (16 bytes) long and can store four 32-bit single-precision floats.

    __m128 _mm_setr_ps(float z , float y , float x , float w );
    This intrinsic packs four single-precision floats into one __m128 value in argument order: r0 := z, r1 := y, r2 := x, r3 := w.

    __m128 _mm_mul_ps(__m128 a , __m128 b );
    This intrinsic multiplies the four single-precision floats of “a” by the four single-precision floats of “b” in parallel:
    r0 := a0 * b0
    r1 := a1 * b1
    r2 := a2 * b2
    r3 := a3 * b3

    _mm_replicate_(x~w)_ps is a set of four macros defined as follows:
    #define _mm_replicate_x_ps(v) \
    _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(0, 0, 0, 0))

    #define _mm_replicate_y_ps(v) \
    _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(1, 1, 1, 1))

    #define _mm_replicate_z_ps(v) \
    _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(2, 2, 2, 2))

    #define _mm_replicate_w_ps(v) \
    _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(3, 3, 3, 3))

    __m128 _mm_shuffle_ps(__m128 a , __m128 b , int i );
    This intrinsic builds a new __m128 value from “a” and “b” according to mask “i”: the two low elements of the result are selected from “a” and the two high elements from “b”. The mask must be an immediate (compile-time constant) value. Figure 7 shows the detailed rules:

    _MM_SHUFFLE(z, y, x, w)
    /* expands to the following value */
    (z<<6) | (y<<4) | (x<<2) | w


Figure 7. SHUFFLE description

    The following SHUFFLE macro has been used to optimize “The Last Defender”:
    #define SHUFFLE_PARAM(x, y, z, w) \
    ((x) | ((y) << 2) | ((z) << 4) | ((w) << 6))
    With the same index passed four times, as in the _mm_replicate_(x~w)_ps macros, it copies one single-precision float of “v” into all four positions of a __m128 value, so that the same element value can take part in four operations simultaneously.

    #define _mm_madd_ps(a, b, c) \
    _mm_add_ps(_mm_mul_ps((a), (b)), (c))
    This macro combines a multiplication followed by an addition (a * b + c), which makes the code easier to write.

    void _mm_store_ps(float *p, __m128 a );
    This intrinsic is a memory access operation. It stores a __m128 value to address “p”, which must be 16-byte aligned:
    p[0] := a0
    p[1] := a1
    p[2] := a2
    p[3] := a3
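To show how these pieces fit together, here is a self-contained sketch that reuses the macros described above and checks the SSE result against a plain scalar loop. The function names mulRowMajorScalar and mulRowMajorSse are illustrative, the matrices are passed as raw 16-float arrays rather than Matrix4f objects, and unaligned loads and stores (_mm_loadu_ps, _mm_storeu_ps) replace the aligned ones so that the sketch makes no assumption about how the arrays are allocated.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 / x86-64 targets only
#include <cassert>

// Macros as described in the article.
#define SHUFFLE_PARAM(x, y, z, w) ((x) | ((y) << 2) | ((z) << 4) | ((w) << 6))
#define _mm_replicate_x_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(0, 0, 0, 0))
#define _mm_replicate_y_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(1, 1, 1, 1))
#define _mm_replicate_z_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(2, 2, 2, 2))
#define _mm_replicate_w_ps(v) _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(3, 3, 3, 3))
#define _mm_madd_ps(a, b, c) _mm_add_ps(_mm_mul_ps((a), (b)), (c))

// Scalar reference: out = a x b, all matrices in Row-Major order.
void mulRowMajorScalar(const float* a, const float* b, float* out) {
    assert(out != a && out != b);  // the reference version does not support aliasing
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) sum += a[i * 4 + k] * b[k * 4 + j];
            out[i * 4 + j] = sum;
        }
    }
}

// SSE version: row i of the result is the sum over k of a[i][k] * (row k of b).
void mulRowMajorSse(const float* a, const float* b, float* out) {
    // All rows of b are read before anything is written.
    __m128 b0 = _mm_loadu_ps(b);
    __m128 b1 = _mm_loadu_ps(b + 4);
    __m128 b2 = _mm_loadu_ps(b + 8);
    __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        __m128 ai = _mm_loadu_ps(a + 4 * i);  // row i of a, read before row i of out is stored
        __m128 r = _mm_mul_ps(b0, _mm_replicate_x_ps(ai));
        r = _mm_madd_ps(b1, _mm_replicate_y_ps(ai), r);
        r = _mm_madd_ps(b2, _mm_replicate_z_ps(ai), r);
        r = _mm_madd_ps(b3, _mm_replicate_w_ps(ai), r);
        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

Because every input row is loaded into a register before the matching output row is written, this sketch also works when out points to the same storage as a or b, so the two separate branches of the scalar code are not needed here.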

After this simple optimization, we used Intel VTune Amplifier 2011 for Android to profile the same operation again and got the result shown in Figure 8.


Figure 8. Optimized Matrix4f::mul Computing Consumption Reference

The computing consumption reference was reduced from 83,340,000 to 18,780,000, a performance improvement of over 4 times (see footnote 1). We executed the same test steps and operations in the same scenario (same scene, same vehicles, same weapons, same test duration, and nearly the same enemies), but because the AI behavior and the number of enemies varied slightly between runs, the test results were slightly affected. This example shows the power of Intel SSE parallel computing.

Column-Major Matrix Multiplication Optimization for OpenGL ES

For matrix operations in OpenGL ES-based applications, the Column-Major storage model is strongly recommended. Not only can this model meet the OpenGL ES specification, it can also be parallelized more easily. As we mentioned previously, developers can apply many highly efficient memory access technologies to optimize their code. The following code is a classic conversion sample from ARM NEON* to Intel SSE:

void NEON_Matrix4Mul(const float* a, const float* b, float* output )
{
    __asm__ volatile
    (
     // Load A & B, leaving q4-q7 untouched because they should be preserved
     "vldmia %1, { q0-q3 } \n\t"
     "vldmia %2, { q8-q11 }\n\t"

     // result = first column of B x first row of A
     "vmul.f32 q12, q8, d0[0]\n\t"
     "vmul.f32 q13, q8, d2[0]\n\t"
     "vmul.f32 q14, q8, d4[0]\n\t"
     "vmul.f32 q15, q8, d6[0]\n\t"

     // result += second column of B x second row of A
     "vmla.f32 q12, q9, d0[1]\n\t"
     "vmla.f32 q13, q9, d2[1]\n\t"
     "vmla.f32 q14, q9, d4[1]\n\t"
     "vmla.f32 q15, q9, d6[1]\n\t"

     // result += third column of B x third row of A
     "vmla.f32 q12, q10, d1[0]\n\t"
     "vmla.f32 q13, q10, d3[0]\n\t"
     "vmla.f32 q14, q10, d5[0]\n\t"
     "vmla.f32 q15, q10, d7[0]\n\t"

     // result += last column of B x last row of A
     "vmla.f32 q12, q11, d1[1]\n\t"
     "vmla.f32 q13, q11, d3[1]\n\t"
     "vmla.f32 q14, q11, d5[1]\n\t"
     "vmla.f32 q15, q11, d7[1]\n\t"

     // output = result registers
     "vstmia %0, { q12-q15 }"
     : // no output
     : "r" (output), "r" (a), "r" (b)     // input - note *value* of pointer doesn't change
     : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15" //clobber
     );
}
//####################################################################################################################
static inline void SSE_Matrix4Mul(const float* a, const float* b, float* output)
{
	// load matrices a and b as column-major order of OpenGL ES
	__m128 ma_col_0 = _mm_load_ps(a);
	__m128 ma_col_1 = _mm_load_ps(a + 4);
	__m128 ma_col_2 = _mm_load_ps(a + 8);
	__m128 ma_col_3 = _mm_load_ps(a + 12);

	__m128 mb_col_0 = _mm_load_ps(b);
	__m128 mb_col_1 = _mm_load_ps(b + 4);
	__m128 mb_col_2 = _mm_load_ps(b + 8);
	__m128 mb_col_3 = _mm_load_ps(b + 12);

	// get ready to store the result
	__m128 result0;
	__m128 result1;
	__m128 result2;
	__m128 result3;

	// result = first column of B x first row of A
	result0 = _mm_mul_ps(mb_col_0, _mm_replicate_x_ps(ma_col_0));
	result1 = _mm_mul_ps(mb_col_0, _mm_replicate_x_ps(ma_col_1));
	result2 = _mm_mul_ps(mb_col_0, _mm_replicate_x_ps(ma_col_2));
	result3 = _mm_mul_ps(mb_col_0, _mm_replicate_x_ps(ma_col_3));

	// result += second column of B x second row of A
	result0 = _mm_madd_ps(mb_col_1, _mm_replicate_y_ps(ma_col_0), result0);
	result1 = _mm_madd_ps(mb_col_1, _mm_replicate_y_ps(ma_col_1), result1);
	result2 = _mm_madd_ps(mb_col_1, _mm_replicate_y_ps(ma_col_2), result2);
	result3 = _mm_madd_ps(mb_col_1, _mm_replicate_y_ps(ma_col_3), result3);

	// result += third column of B x third row of A
	result0 = _mm_madd_ps(mb_col_2, _mm_replicate_z_ps(ma_col_0), result0);
	result1 = _mm_madd_ps(mb_col_2, _mm_replicate_z_ps(ma_col_1), result1);
	result2 = _mm_madd_ps(mb_col_2, _mm_replicate_z_ps(ma_col_2), result2);
	result3 = _mm_madd_ps(mb_col_2, _mm_replicate_z_ps(ma_col_3), result3);

	// result += last column of B x last row of A
	result0 = _mm_madd_ps(mb_col_3, _mm_replicate_w_ps(ma_col_0), result0);
	result1 = _mm_madd_ps(mb_col_3, _mm_replicate_w_ps(ma_col_1), result1);
	result2 = _mm_madd_ps(mb_col_3, _mm_replicate_w_ps(ma_col_2), result2);
	result3 = _mm_madd_ps(mb_col_3, _mm_replicate_w_ps(ma_col_3), result3);

	// store the result to memory
	_mm_store_ps(output, result0);
	_mm_store_ps(output+4, result1);
	_mm_store_ps(output+8, result2);
	_mm_store_ps(output+12, result3);
}

This code is just a reference sample that I hope you find helpful.

    __m128 _mm_load_ps(float * p );
    This intrinsic loads four contiguous single floating point data from “p” address (“p” must be 16-byte aligned) to a __m128 data.
    r0 := p[0]
    r1 := p[1]
    r2 := p[2]
    r3 := p[3]
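Since _mm_load_ps and _mm_store_ps require 16-byte aligned addresses, it helps to guarantee or check alignment before calling them. The sketch below is one possible way to do this in C++11; the names isAligned16, AlignedMat4, and copyMat4 are hypothetical and are not part of the Intel intrinsics or of the game code.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 / x86-64 targets only
#include <cassert>
#include <cstdint>

// True if p lies on a 16-byte boundary, as _mm_load_ps and _mm_store_ps require.
inline bool isAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical matrix storage whose alignment is guaranteed by the type itself.
struct alignas(16) AlignedMat4 {
    float f[16];
};

// Copies 16 floats. Uses the aligned intrinsics when both pointers allow it
// and falls back to the unaligned variants otherwise.
void copyMat4(const float* src, float* dst) {
    assert(src && dst);
    const bool aligned = isAligned16(src) && isAligned16(dst);
    for (int i = 0; i < 16; i += 4) {
        if (aligned) {
            _mm_store_ps(dst + i, _mm_load_ps(src + i));
        } else {
            _mm_storeu_ps(dst + i, _mm_loadu_ps(src + i));
        }
    }
}
```

Declaring matrix storage with alignas(16), as in AlignedMat4, lets code such as SSE_Matrix4Mul use the aligned intrinsics safely. When the origin of a pointer is unknown, the unaligned variants are the safer choice, at a possible cost in speed on some processors.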

Additional Optimization Technologies

There are more optimization technologies and techniques that can be applied to game coding. One of them is the Intel® C++ Compiler for Android*, a good, easy-to-use candidate that can compile the NDK portions of the game code. Intel C++ Compiler for Android OS provides many special optimizations for Intel® CPU architecture, such as pipeline, cache, and memory utilization. We also use GCC to compile our code, but we need to set compilation options to tune performance and improve cache and memory utilization as follows:

Optimization Compilation Options for GCC:
    LOCAL_CFLAGS := -O3 -ffast-math -mtune=atom -msse3 -mfpmath=sse

Optimization Compilation Options for Intel C++ Compiler for Android OS:
    LOCAL_CFLAGS := -O3 -xSSSE3_atom -ipo -no-prec-div

“Sharp tools make good work”, as the saying goes. Intel VTune Amplifier 2011 for Android can help developers quickly locate a program’s hotspots (the most time-consuming code) and check cache and memory usage to improve performance and quality. Intel® Graphics Performance Analyzers (Intel® GPA) is another powerful set of tools. It helps developers monitor the real-time status of executing software from a whole-system perspective, including CPU, GPU, memory, IO, graphics API, and so on, to find bottlenecks. Intel GPA is excellent for game development!


Figure 9. Intel® Graphics Performance Analyzers

Summary

By combining Intel® SSE, compilation with the Intel C++ Compiler for Android OS, and analysis with Intel GPA, we achieved an obvious improvement in The Last Defender’s performance. Using the same test scenario as before, the FPS improved from 30 to 39, or about 30% (see footnote 2)!


Figure 10. Snapshot for Non-optimized Version of The Last Defender (FPS can be turned on via game settings)


Figure 11. Snapshot for Optimized Version of The Last Defender (FPS can be turned on via game settings)

Using Intel® SSE technology to speed up game code is amazing and fun, but also challenging. Although we could only give a short description of our process in this article, we hope we inspired Android game developers to use the features that are available on IA, and helped them optimize their game code to get faster gameplay and better user experience!

Thanks!

Author Bio

YANG Yi is a software application engineer at Intel. He currently focuses on game engine and graphics-related enablement for Android* on IA in the PRC. Drawing on many advanced Intel® technologies, he helps PRC game ISVs enable high-performance, high-quality game engines and popular game titles with excellent gameplay on Intel® x86 Android* platforms.

To learn more about Intel tools for the Android developer, visit Intel® Developer Zone for Android.

Intel, the Intel logo, and Atom are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

1 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance

2 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance

For more complete information about compiler optimizations, see our Optimization Notice.