*Jacco Bikker, NHTV University of Applied Sciences, Breda, The Netherlands*

### 1. Introduction

Recently, it has been shown that ray tracing can be done in real-time on consumer PCs. This is an interesting development; ray tracing solves many of the problems that rasterization has by taking into account global effects (shadows, reflections, refractions) in an intuitive way, it is able to create more realistic graphics, and does so in a more elegant way. At the same time, ray tracing is a very resource-intensive algorithm; to make it real-time requires optimal use of modern hardware.

This article is the first in a series of two. Together the articles describe the architecture of a real-time ray tracer, and how thread level and instruction level parallelism is employed to maximize performance. In this first article, thread-level parallelism is discussed, an introduction to SIMD is provided (in the context of real-time ray tracing), and SIMD code is applied to the first stage of the ray tracing process, ray generation. The second article focuses on the shading pipeline, and describes how instruction-level parallelism is used to improve the performance of this stage considerably.

These articles assume some familiarity with the ray tracing algorithm. For an introduction see e.g. http://www.devmaster.net/articles/raytracing_series/part1.php.

### 2. High level overview

A basic ray tracer looks like this (in pseudo-code):

A few things should be noted about this process: In the first place, individual rays do share data (e.g., the primitives in the scene, materials), but this is all read-only and does not need any synchronization. Rays are thus fully self-contained, and can easily be traced in parallel. Secondly, naïve (brute-force) ray tracing is an extremely time consuming process, since it requires each ray to be intersected with all primitives in the scene. For this reason, a spatial subdivision should be used. This can be a grid, an octree, a bounding volume hierarchy (BVH) or a binary space partition (BSP); in practice, the most efficient scheme appears to be the kd-tree, which is an axis-aligned BSP tree. And finally, ray tracing is a recursive process. One implication of this is that the number of rays used to determine the color of a single pixel varies, and thus the time this takes is not constant.for eachscreen pixelgeneratea rayfromthe cameratothe pixelintersectthe raywithall objects in the sceneforthe closest intersectiondeterminethe materialfor eachlight in the scenegeneratea rayfromthe intersectiontothe light

if not obstructed: Apply illuminationspawnsecondary rays (e.g., reflection, refraction)combineresults

The key parts of the ray tracer are:

- Ray generation
- Ray traversal
- Ray/object intersection
- Shading

Shading may involve generation of new rays, in which case the process is repeated in a recursive manner.

### 3. Thread-level parallelism

To maximize ray tracing performance, both thread level and instruction level parallelism should be used. As mentioned in section 2, the ray tracing algorithm is parallel by nature; rays can be traced independently and in any order. Creating a multithreaded ray tracer that uses all available cores is therefore straightforward. In principle, each ray can be rendered in its own thread. In practice, it is more efficient to assign tiles of pixels to each rendering thread, to reduce threading overhead.

Multithreaded ray tracing is implemented using a ‘master’ thread and several ‘worker’ threads. The master prepares tasks for the workers, while the workers perform the actual rendering. The process looks as follows:

- Master thread receives render call. Workers are asleep;
- Master thread prepares tasks for workers;
- Master thread wakes workers, and goes to sleep;
- Workers process tasks;
- When no tasks are left, each worker signals master, and goes to sleep;
- When all workers are done, the master finishes the image.

The worker thread code that waits for the signal from the master thread, gets the next task, renders assigned tiles, and signals the master when there are no more tiles to be rendered is shown below. This code does not rely on the OS for synchronization during the actual processing of tasks. Minimizing calls to system routines considerably decreases threading overhead.

**while** (1)

{

WaitForSingleObject( go_signal[threadindex], INFINITE );

**while** (1)

{

**if** (jobs_waiting > 0)

{

**volatile LONG** w = InterlockedDecrement( &jobs_waiting );

**if** (w >= 0) RenderTile( w );

}

**else**

{

SetEvent( done_signal[threadindex] );

**break**;

}

}

}

Some efficiency notes: The number of worker threads should be equal to the number of available cores, and each worker should be linked to a single core (using SetT hreadIdealProcessor). Since worker threads operate on tiles of pixels, the rays traced for a tile tend to access the same data (kd-tree nodes, primitives, textures). Locking a worker thread to a core thus improves cache usage.

### 4. Instruction-level parallelism

A ray tracer that uses thread-level parallelism (as described in section 3) scales almost linearly in performance as the number of available cores increases. A well-written implementation, using a high-quality kd-tree and efficient traversal and shading code, should be able to achieve millions of rays per second, per core. This number can be increased significantly by using instruction-level parallelism. For this, SIMD instructions are used to operate on multiple data streams in parallel. SIMD instructions are implemented by the SSE instruction sets on modern processors.

The process is illustrated in figure 1: To use SIMD instructions optimally, four streams (128 bits of data) pass the same instructions, to produce four results. It is important to note that during the entire process, the streams do not interact. They are however processed simultaneously by SIMD instructions. For this reason, conditional code and branching is non-trivial: It is not possible to follow a different code path for one of the streams.

To use SIMD efficiently for ray tracing, we must thus operate on four rays (or multiples thereof) in parallel. This requires a specific layout of our data, since the streams must be fed to the SIMD instructions as 128-bit data. The code below shows an example implementation of a ray packet:

**class** RayPacket

{

**public**:

**union**

{

**struct**

{

**union** { **float** dx[4]; __m128 dx4; };

**union** { **float** dy[4]; __m128 dy4; };

**union** { **float** dz[4]; __m128 dz4; };

**union** { **float** ox[4]; __m128 ox4; };

**union** { **float** oy[4]; __m128 oy4; };

**union** { **float** oz[4]; __m128 oz4; };

};

**struct**

{

**union** { **float** dcell[12]; __m128 dc4[3]; };

**union** { **float** ocell[12]; __m128 oc4[3]; };

};

; };

}

The RayPacket object stores data for four rays (direction and origin), and allows easy access to individual values in the 128-bit variables, using unions. Note that each 128-bit vector can be safely unioned with an array of four floats: The __m128 data type stores just that. The ‘dcell’ and ‘ocell’ arrays can be used to access data using the axis as an index.

Setting up a RayPacket is now similar to setting up a single ray. The direction of the first ray in the packet can be done as follows:

RayPacket rp;

// direction and origin in: vector3 origin, direction;

rp.dx[0] = direction.x;

rp.dy[0] = direction.y;

rp.dz[0] = direction.z;

rp.ox[0] = origin.x;

rp.oy[0] = origin.y;

rp.oz[0] = origin.z;

The process of converting the data stored in a vector3 structure to the arrays in the RayPacket is called ‘array of structures’ (AOS) to ‘structure of arrays’ (SOA) conversion. This is a common operation when working with SIMD code. Obviously, it would have been more efficient if directions and origins were already passed in SOA form. If the preceding calculations allow this, this should be preferred.

For most calculations, ray directions need to be normalized. This is where the SOA pays off. Vector normalization is highly efficient using SIMD code. Using scalar code, one would normalize a single vector by dividing its components by the length of the vector:

float length = sqrt( dx * dx + dy * dy + dz * dz );

float length_reciprocal = 1.0f / length;

dx *= length_reciprocal;

dy *= length_reciprocal;

dz *= length_reciprocal;

Using SIMD code, all four directions in a RayPacket can be normalized using an equal amount of SIMD instructions.

__m128 dx_squared = _mm_mul_ps( dx4, dx4 );

__m128 dy_squared = _mm_mul_ps( dy4, dy4 );

__m128 dz_squared = _mm_mul_ps( dz4, dz4 );

__m128 length_squared = _mm_add_ps ( _mm_add_ps( dx_squared, dy_squared ), dz_squared );

__m128 length4 = _mm_sqrt_ps( length_squared );

__m128 one4 = _mm_set_ps1( 1.0f );

__m128 length_reci4 = _mm_div_ps( one4, length4 );

dx4 = _mm_mul_ps( dx4, length_reci4 );

dy4 = _mm_mul_ps( dy4, length_reci4 );

dz4 = _mm_mul_ps( dz4, length_reci4 );

Note that, even though the SIMD code looks twice as long, it actually translates to the same number of operations: six multiplications, two additions, one division, and one square root. The only difference is that the SIMD code is executed on four values simultaneously. The way SSE intrinsics are written tends to expand code quite a bit.

The SIMD code is already quite a bit faster than the scalar code (which needs to be executed four times, to process the same amount of data), but it can still be improved considerably. The SSE instruction set has a specialized instruction for calculating the reciprocal of a square root: _mm_rsqrt_ps. This instruction uses a hardware look-up table, which limits its accuracy somewhat. This accuracy can be improved by performing a single Newton-Raphson iteration on the table supplied approximation. Using scalar code, this could be written as:

float approx = sqrtLUT[v]

float muls = v * approx * approx;

return (approx * 0.5f) * (3 – muls);

This would be translated to the following SIMD code:

static __forceinline __m128 fastrsqrt( const __m128 v )

{

const __m128 approx = _mm_rsqrt_ps( v );

const __m128 muls = _mm_mul_ps(_mm_mul_ps(v, approx), approx);

return _mm_mul_ps(_mm_mul_ps(_half4, approx), _mm_sub_ps(_three, muls) );

}

Where ‘_half4’ is the vector (0.5, 0.5, 0.5, 0.5) and ‘_three’ is the vector (3, 3, 3, 3). The performance of the fast reverse square root function is timed with the following code, where ‘TCOUNT’ is set to 20M and ‘QCOUNT’ to 5M. With these settings, these loops normalize 20 million vectors stored in arrays.

// initialize test data

vector3* v = new vector3[TCOUNT];

__m128* vx4 = new __m128[QCOUNT]; float* vx = (float*)vx4;

__m128* vy4 = new __m128[QCOUNT]; float* vy = (float*)vy4;

__m128* vz4 = new __m128[QCOUNT]; float* vz = (float*)vz4;

for ( int i = 0; i < TCOUNT; i++ )

v[i] = vector3( vx[i] = 1, vy[i] = 2, vz[i] = 3 );

// sse test

int start = GetTickCount();

for ( int i = 0; i < QCOUNT; i++ )

{

const __m128 sqx4 = _mm_mul_ps( vx4[i], vx4[i] );

const __m128 sqy4 = _mm_mul_ps( vy4[i], vy4[i] );

const __m128 sqz4 = _mm_mul_ps( vz4[i], vz4[i] );

const __m128 rlen4 = fastrsqrt( _mm_add_ps( _mm_add_ps( sqx4, sqy4 ), sqz4 ) );

vx4[i] = _mm_mul_ps( vx4[i], rlen4 );

vy4[i] = _mm_mul_ps( vy4[i], rlen4 );

vz4[i] = _mm_mul_ps( vz4[i], rlen4 );

}

int elapsed2 = GetTickCount() - start;

// scalar test

start = GetTickCount();

for ( unsigned int i = 0; i < TCOUNT; i++ )

{

const float rlen = 1.0f / sqrtf( v[i].x * v[i].x + v[i].y * v[i].y + v[i].z * v[i].z );

v[i].x *= rlen, v[i].y *= rlen; v[i].z *= rlen;

}

int elapsed1 = GetTickCount() - start;

The code was tested on a Intel® Core™2 Duo laptop running at 1.6Ghz. On this system, the scalar loop executes in 3156ms. The SSE loop executes in 172ms, which is 18.35 times faster.

Instruction-level parallelism can be used throughout the pipeline of the ray tracer. In an earlier article, efficient ray traversal using ray packets was described. Besides initialization and traversal, the shading stage of the ray tracer also lends itself well to vectorization using SIMD code.

### 5. Data layout

The data that is used by the ray tracer can be roughly divided in the following categories:

*Ray packets*

These contain the directions and origins of rays. When secondary rays are spawned, the primary ray is preserved and a new one is created. The data from the primary ray may be needed for shading, once tracing of the secondary ray completes.

*Intersection data*

The result of tracing a ray is information about the primitive that was found (if any), the location of the intersection point of the ray and the scene, and the position of the intersection point on the primitive (u/v), which is needed for texturing.

*Scene data*

The scene data consists of the primitives of the scene, the light sources and the camera.

*Acceleration structure*

Spatial subdivision data: the kd-tree, or another structure.

*Texture data*

Textures and normal maps.

*Housekeeping data*

Various data used for ray tracing, such as the current recursion depth and a tree traversal stack.

For optimal performance, data should be organized so that ‘hot’ data is isolated from ‘cold’ data, and the data should be localized. ‘Hot data’ refers to data that changes frequently; ‘cold data’ is data that is mostly static. Localizing data means that data that is specific for a thread should be stored with that thread. For example, when a rendering thread needs a new ray packet, it could retrieve it from a shared memory manager, but performance is better when it maintains its own array of spare ray packets.

In the above list, a lot of data can be considered ‘cold’. The scene data and acceleration structure are static, even when lights or primitives animate, since during rendering, their position will not change. The ray packets are not as cold as that, but still not as hot as the intersection data. During traversal, the intersection data may be updated several times. The housekeeping data is a typical example of data that should be stored for each thread.

Note that data access patterns for a ray tracer are highly incoherent. A single ray or a ray packet first accesses arbitrary parts of the acceleration structure, then one or more primitives, then a single pixel from a texture. Before a pixel is plotted to the screen, usually a second ray is spawned to determine if a point is lit by a light source. Cache misses are therefore a major source of performance loss.

Besides isolating and localizing data, it also pays off to prepare data for the type of operations that will be used on it. The texture data is a good example of this. Since a ray tracer usually operates on floating-point color, it makes sense to store te xtures in this format, even though it takes more space. Texture data can then be retrieved without repeatedly having to convert. Likewise, the position of a light source should be stored in expanded __m128 variables for x, y and z. When shadow rays are generated, the directions for four rays are calculated by subtracting the position of the light from the intersection points of the four primary rays in a ray packet. Recall the code in section 4 to setup a single ray. If we have both the start and end points for four shadow rays, we can setup the shadow rays far more efficiently:

RayPacket rp;

// light position: __m128 lx4, ly4, lz4;

// intersection point: __m128 ix4, iy4, iz4;

rp.dx4 = _mm_sub_ps( ix4, lx4 );

rp.dy4 = _mm_sub_ps( iy4, ly4 );

rp.dz4 = _mm_sub_ps( iz4, lz4 );

rp.ox4 = ix4;

rp.oy4 = iy4;

rp.oz4 = iz4;

Again, we are using the same amount of operations, but we operate on four times the amount of data.

### 6. Conclusion

Applying parallelism to the ray tracing algorithm can greatly improve the performance of a single-threaded, non-vectorized implementation. Dividing work over multiple threads is relatively simple for this rendering algorithm, since rays are independent. Splitting work by assigning tiles of pixels to rendering threads allows us to control the granularity of the multithreading, while at the same time increasing coherency of memory access. Using an efficient master/worker model, ray tracing has the potential to scale almost linearly with the number of available cores, which makes it the perfect test case for today’s multi-core processors.

Vectorization using SIMD instructions is used to improve the performance of individual rendering threads. By tracing four rays simultaneously, all stages of a ray tracer can be sped up considerably: In this first article, it was shown that normalization of ray directions is about eighteen times faster. By working with packets of rays throughout the stages of ray tracing, data conversion is minimized.

In the second article, an efficient implementation of the shading stage is described, where the four rays and their intersection data form the starting point.

### 7. Pointers and literature

The ray traversal stage, which is omitted from this article, is described in an earlier article that appeared on the Intel Developer Zone: “Interactive Ray Tracing”, http://software.intel.com/en-us/articles/interactive-ray-tracing

Further reading material on the topic of high performance ray tracing can be found at various locations on the internet.**Recommended literature:**

Ingo Wald’s thesis on real-time ray tracing and interactive global illumination, which covers many details, including SIMD kd-tree traversal and ray/triangle intersection:

/sites/default/files/m/8/6/c/phd.pdf

Ray Tracing Theory and Implem entation: Series of articles on ray tracing from the ground up. http://www.devmaster.net/articles/raytracing_series/part1.php

Discussions on the topic of ray tracing (both interactive and offline) can be found on the OMPF forum: http://igad2.nhtv.nl/ompf2/

The source code for a real-time ray tracer, Arauna, is available from the same forum: http://ompf.org/forum/viewtopic.php?p=1006#1006

A stand-alone kd-tree compiler with text output of the tree is available at

http://ompf.org/forum/viewtopic.php?t=438

### About the Author

Jacco Bikker has been in the games industry for 10 years, previously working at Lost Boys and Davilex (Netherlands), primarily on 3D technology. He has authored a number of articles on game development websites, including some on ray tracing, visibility determination and mobile game development. Currently he works as a lecturer for the International Game Architecture & Design program of the NHTV University of Applied Sciences in Breda, the Netherlands.