Interactive Ray Tracing

Submit New Article

Last Modified On :   October 24, 2008 10:16 AM PDT
Rate
 


by Jacco Bikker


Introduction

See how advances such as Hyper-Threading Technology and dual-core processors affect ray tracing—allowing for three- to four-fold speed increases.

Ever since the introduction of ray tracing it has been regarded mostly as an off-line process; a really expensive way to produce really pretty images. Recently, ray tracing has been made much more interactive, even using consumer hardware. As ray tracing begins to make optimal use of recent advances in processor technology, such as Hyper-Threading Technology and dual-core processors, we expect that it will become practical for both interactive and real time applications in the near future. This article shows how low-level optimizations targeted at this technology allow for a three- to four-fold speed increase compared to existing implementations.

A ray tracer typically consists of the following parts:

  • Ray generation
  • Ray traversal
  • Primitive intersection
  • Shading

 

After summarizing earlier work, the rest of this paper describes the implementation of these parts in a SIMD optimized packet ray tracer.

The first section describes the generation of ray packets. This section also details when not to use ray packets, and how to handle this case. The second section describes the kd-tree structure and its use for fast location of primitives to be intersected with a ray. The third section discusses efficient ray packet / triangle intersection. The fourth section describes the shading model used in our ray tracer.

Finally, we present performance figures, and areas of future research.


Ray Tracing

Ray tracing is a very intuitive way of visualizing data, as it mimics the flow of light from light sources, via surfaces, to a camera. Or, stated the other way around, light travels in straight lines, so it doesn't matter at which end we start. Starting at the camera is more practical, as only a small portion of light emitted by light sources actually reaches the camera.

The fact that we simulate the light flow as it appears in nature lets us simulate all kinds of related effects by applying the laws of physics for refraction, reflection, the behavior of light in semi-translucent materials, shadows, color bleeding and so on. Adding a feature to a ray tracer often starts by reading a physics book.

The basic algorithm for ray tracing looks like this:

Generate a ray through each screen pixel, starting at the camera. For each ray:

  • Find the nearest intersection of the ray and a surface
  • Determine the color of the surface at the intersection point
  • Adjust this color based on the lights in the scene.

 

This can be a recursive process; determining the color at the intersection point might involve casting extra rays for reflective or refracting surfaces, while shading requires extra rays to see which light sources are visible from the intersection point.

Previous work

Several algorithms have been proposed to improve the speed of classic Whitted-style ray tracing.

As the execution time is linearly dependent to the number of ra ys cast, it is advantageous to minimize the number of rays required to render an image. Several approaches to accomplish this have been proposed. Some examples are: Adaptive sampling1, reusing samples (e.g., the RenderCache2,3), reducing the number of shadow rays (e.g. shadow caching4).

Besides reducing the number of ray queries, the speed of a ray tracer can also be improved by reducing the cost of each individual ray query. This can be done by using spatial subdivisions. One of the most efficient structures for accelerating ray queries is the kd-tree5.

One approach that is seldom mentioned is low level optimizing. Wald showed that vectorization of the ray tracing algorithm approximately doubles the performance of a classic implementation6. In his thesis, he describes a ray tracer that handles four rays in parallel, utilizing SIMD instructions to do the necessary calculations in parallel. Wald further shows that utilizing multiple CPUs in a networked environment enables real-time rendering of very complex scenes; the ray tracing algorithm essentially scales linearly with the number of available processors.

Our implementation focuses on single-machine rendering, and extends on Wald's ideas by showing that a highly optimized ray tracer can achieve a performance of over 10 million rays per second for complex scenes on state-of-the-art consumer hardware.

Our ray tracer focuses on improving the speed of individual ray queries. It does not aim to reduce the number of rays required to render an image, as these algorithms often reduce the image quality and make the renderer less generally applicable.

1. Andrew S. Glassner, An introduction to ray tracing, Academic Press Ltd., London, UK, 1989

2. Bruce Walter, George Drettakis, Steven Parker, Interactive Rendering Using the Render Cache, in Rendering Techniques '99, G. Larson, D. Lichinski (eds), Springer-Verlag, (Proc. 10th Eurographics workshop on Rendering, Granada, Spain.)

3. Bruce Walter, George Drettakis, Donald P. Greenberg, Enhancing and Optimizing the Render Cache, in Rendering Techniques 2002, (13th Eurographics Workshop on Rendering. Pisa, Italy), Springer-Verlag, pp. 37-42, 2002.

4. E. Haines and D. Greenberg, "The Light Buffer: A Shadow-Testing Accelerator," IEEE CG&A, Vol. 6, No. 9, Sept. 1986, pp. 6-16.

5. L.Szirmay-Kalos, V.Havran, B.Balazs, L.Szecsi: "On the Efficiency of Ray-shooting Acceleration Schemes", Proceedings of SCCG'02 conference, Budmerice, Slovakia, Published by ACM SIGGRAPH, PDF, pp. 89--98, April 2002.

6. Ingo Wald, Realtime Ray Tracing and Interactive Global Illumination - PhD thesis, Computer Graphics Group, Saarland University, 2004.

 


SIMD Packet Ray Tracing

We have already noted that ray tracing is by nature exceptionally well suited to parallel processing. Individual rays do not share any data and can be rendered in any order. This means that the ray tracing algorithm could, in theory, make optimal use of recent processor technology. Where most applications are able to execute only partially in parallel, ray tracing can relatively easil y be adapted to parallel processing technologies such as SIMD, Hyper-Threading Technology and multi-core technology.

For Hyper-Threading Technology enabled or dual-core platforms, the ray queries can simply be evenly distributed over the available processors. To facilitate this, we run two ray tracing threads, each of which takes care of half of the work. This can easily be extended to more complex configurations; ray tracing performance scales almost linearly with the number of available processors.

It is harder to exploit SIMD technology. The ray tracing algorithm becomes memory bandwidth bound, as each ray traverses a spatial structure and is tested against several primitives before the nearest intersection is found.

Let us examine SIMD briefly. SIMD lets us operate on four single precision floating point values at once. These values are stored in 128-bit wide registers. If we for example use the SSE instruction ADDPS( r0, r1 ), the result is a 128-bit value containing four sums: The first value in r0 is added to the first value in r1, and so on.

Figure 1.

It's important to note that there is no horizontal movement; there are four operations going on in parallel, without any influence on each other.

Using SIMD instructions, we can add two vectors using one instruction, or calculate the square root or reciprocal of four values. As long as there is a way to rewrite an algorithm to operate on four values in parallel, vectorization is a powerful way to improve the efficiency of the algorithm.

To exploit vectorization in the ray tracer, we trace four rays in parallel: Any data that is used in the algorithm is replaced by an array of four values, one for each ray. This way, we can use SSE instructions for almost every operation.

For a single ray we need the following data:

  • The origin of the ray
  • The direction of the ray

Or, written out:

struct ray
{
float ox, oy, oz;
float dx, dy, dz;
};

 

Preparing this structure for vectorization yields:

struct RayPacket
{
union
{
struct
{
union { real ox[4]; __m128 ox4; };
union { real oy[4]; __m128 oy4; };
union { real oz[4]; __m128 oz4; };
union { real dx[4]; __m128 dx4; };
union { real dy[4]; __m128 dy4; };
union { real dz[4]; __m128 dz4; };
};
};
};

 

Since each 128-bit value literally contains four regular floating point numbers, we can access individual values by unioning the register with a float array.

The Ray Packet data structure now represents four rays. In practice, these rays represent 2x2 pixels, so that they stay as close together as possible. This way, the four rays most of the time access the same data. It's quite common that the rays hit the same primitive, and most of the time, they traverse the same nodes of the kd-tree. This also means that vectorization helps to reduce bandwidth usage. Instead of fetching kd-tree nodes and primitives for each ray, we now request this data for four rays at once, theoretically cutting down data transfer by 75%. This makes the ray tracer less dependent on bandwidth.

Once the ray origins and directions are set for a packet, we can start exploiting the new data layout to quickly normalize the rays in the packet:

__m128 v1 = _mm_mul_ps( RP->dx4, RP->dx4 );
__m128 v2 = _mm_mul_ps( RP->dy4, RP->dy4 );
__m128 v3 = _mm_mul_ps( RP->dz4, RP->dz4 );
__m128 sum = _mm_add_ps( v1, _mm_add_ps( v2, v3 ) );
__m128 sqrroot = _mm_sqrt_ps( sum );
__m128 reciprocal = _mm_div_ps( one, sqrroot );
RP->dx4 = _mm_mul_ps( RP->dx4, reciprocal );
RP->dy4 = _mm_mul_ps( RP->dy4, reciprocal );
RP->dz4 = _mm_mul_ps( RP->dz4, reciprocal );

 

Note that the above code assumes the existence of a __m128 variable named 'one' that contains four '1.0f's.

The above code is a straight SIMD adaptation of the original C version:

float sum = RP->dx * Rp->dx + RP->dy * RP->dy + RP->dz * RP->dz;
float reciprocal = 1.0f / sqrt( sum );
RP->dx *= reciprocal;
RP->dy *= reciprocal;
RP->dz *= reciprocal;

 

The SIMD version is already quite a bit faster, but we can do better. Intel has added a fast 1/sqrt(x) function to the SSE2 instruction set. The only drawback is that its precision is limited. We need the precision, so we refine it using Newton-Rhapson:

__m128 nr = _mm_rsqrt_ps( x );
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr );
result = _mm_mul_ps( _mm_mul_ps( half, nr ), _mm_sub_ps( three, muls ) );

 

This code assumes the existence of a __m128 variable named 'half' (four times 0.5f) and a variable 'three' (four times 3.0f).


Kd-tree's and Ray Packets

The kd-tree is a spatial subdivision scheme similar to the BSP-tree: Space is cut in half by a splitting plane, and each half space is processed recursively in the same manner. Rather than using arbitrary split plane orientations, a kd-tree uses axis-aligned split planes. This obviously reduces the freedom in creating the tree, but the resulting tree is a highly efficient structure for ray queries. In this paper we will concentrate on ray packet traversal of kd-trees, rather than the building of them.

Basic kd-tree traversal is done in a recursive manner.

Figure 2.

Figure 2 shows a kd-tree with a vertical split in the root node. Each half-space is split once more, using a horizontal split. Tracing a ray through this tree is a matter of handling three cases:

Figure 3.

  • Case 1 shows a ray that traverses only the 'far' child of the node ('far' with respect to the ray direction).
  • Case 2 shows a ray that traverses both child nodes. In case 3, the ray only traverses the 'near' child node.

 

Each point on the ray is defined as p = Oray + t * Dray, where Oray is the origin of the ray, Dray is the direction of the ray and t is a distance along the ray. Using this ray representation, we can store the entry and exit point for the ray in the current node as two values of t: tnear and tfar. The ray enters the current node at tnear and leaves at tfar. The three cases can now be identified by comparing the distance of the intersection point of the ray and the splitting plane to tnear and tfar. For case 1, the intersection point is closer than the entry point of the ray. For case 3, the intersection point is further than the exit point of the ray.

  • In the root node, the ray intersects the splitting plane inside the node: the intersection point is closer than tfar and further than tnear. We thus process both child nodes, first the 'near' node (in this case, the right child node) and then the 'far' node.
  • In the right child node, the ray again intersects the splitting plane inside the node. We again process both child nodes.
  • Finally, in the left node, only one child node needs to be traversed. The other node is ignored.

 

In pseudo code, traversing a single ray through a kd-tree looks like this:

while (!node->IsLeaf())
{
d = (node->splitpos – ray.origin[node->axis]) / ray.dir[node->axis]
if (d <= tnear) node = node->far
else
if (d >= tfar) node = node->near
else
{
Push( node->far )
node = node->near
}
}

 

Once a leaf has been found, the ray is checked against the primitives in that leaf. This will be discussed in more detail in the next section.

If we try to extend this traversal code to ray packets, we run into a problem: The rays in the packet usually have slightly varying directions, and so they do not necessarily traverse the exact same kd-tree nodes.

This is solved by traversing a node with all the rays in the packet if at least one ray needs to traverse that node. Results for the redundant rays are masked out. These rays are detected by comparing tnear to tfar. Rays that should not have traversed a particular node do not meet the condition tfar >= tnear.

Pseudo code to traverse four rays through the kd-tree using vector instructions is shown below.

mask = all_valid;
while (!node->IsLeaf())
{
splitpos[0…3] = node->splitpos )
dist[0…3] = (splitpos[0...3] – rp.origin[0...3][axis]) / (rp.dir[0...3][axis])
comparisonresult = (dist[0…3] <= tnear[0...3])
comparisonresult &= mask
if (all of comparisonresult[0...3] = 'true')
{
node = node->far
continue
}
comparisonresult = (dist[0...3] >= tfar[0…3])
comparisonresult &= mask
if (all of comparisonresult[0...3] = 'true')
{
node = node->near
continue
}
if (any of comparisonresult[0...3] > 0)
{
Push( node->far, dist[0...3], tfar[0...3] )
node = node->near
mask = (tnear[0...3] < tfar[0...3])
tfar[0...3] = dist[0...3] where mask = 'true'
}
}

 

Note that this code is identical to the original ray traversal code; the only difference is the fact that arrays of rays are processed, and that a node is visited if any of the rays require this.

An optimized C implementation of the above pseudo code is shown below.

const int offs[4] = { (RP->dcell[0] >= 0)?1:0, (RP->dcell[4] >= 0)?1:0, (RP> ;dcell[8] >= 0)?1:0, 0 };
while (!node->IsLeaf())
{
const __m128 spos = _mm_load_ps( node->m_Split );
const int aidx = node->GetAxis();
const __m128 d4 = _mm_mul_ps( _mm_sub_ps( spos, RP->oc4[aidx] ), RP->rdc4[aidx] );
const KdTreeNode* restrict ln = node->GetLeft() + offs[aidx];
if (!_mm_movemask_ps( _mm_and_ps( _mm_cmpgt_ps( d4, tn4 ), mask ) )) { node = ln;
continue; }
node = node->GetLeft() + (offs[aidx]^1);
if (_mm_movemask_ps( _mm_and_ps( _mm_cmplt_ps( d4, tf4 ), mask ) ))
{
const __m128 mask2 = _mm_cmpgt_ps( d4, tn4 );
const __m128 mask3 = _mm_cmplt_ps( d4, tf4 );
m_Stack[stackptr].tf4 = tf4;
tf4 = _mm_or_ps( _mm_and_ps( mask3, d4 ), _mm_andnot_ps( mask3, tf4 ) );
m_Stack[stackptr].node = (KdTreeNode*)ln;
m_Stack[stackptr].mask = mask;
m_Stack[stackptr++].tn4 = _mm_or_ps( _mm_and_ps( mask2, d4 ), _mm_andnot_ps( mask2, tn ) );
mask = _mm_cmplt_ps( tn4, tf4 );
}
}

 

In the above code, the static 'offs' array is used to determine 'near' and 'far' nodes based on the ray directions. Our ray tracer stores left and right child nodes in consecutive memory locations; the 'right' child thus has an offset of 1 to the 'left' child. Also note that the intersection point is calculated using pre-calculated reciprocals of the ray direction.

Since the classification of the 'near' and 'far' child depends on the sign of the ray direction, the rays in a ray packet must all have the same signs. This is usually the case; when this is not the case we traverse the rays independently using 'mono ray' fallback code.

Recall that the ray packet is defined as follows:

struct
{
union { real ox[4]; __m128 ox4; };
union { real oy[4]; __m128 oy4; };
union { real oz[4]; __m128 oz4; };
union { real dx[4]; __m128 dx4; };
union { real dy[4]; __m128 dy4; };
union { real dz[4]; __m128 dz4; };
}

 

A ray packet needs to use the mono ray fall-back code when the signs in dx4, dy4 or dz4 differ. This can be determined efficiently using the _mm_movemask intrinsic, which is normally used to convert a vector comparison result to a 4-bit integer. Since this conversion evaluates only the sign bit of each of the four floats, it can also be used to check the signs efficiently:

int m1 = _mm_movemask_ps( rp->dx4 );
int m2 = _mm_movemask_ps( rp->dy4 );
int m3 = _mm_movemask_ps( rp->dz4 );
if (((m1 == 0) || (m1 == 15)) && ((m2 == 0) || (m2 == 15)) && ((m3 == 0) || (m3 == 15))) valid = true;

 


Packet-triangle Intersections

The last ingredient for the ray query is the primitive intersection. Geometry is stored in the leaves of the kd-tree. Whenever the intersection code encounters a leaf, the ray packet is intersected with the primitives st ored in that leaf.

A fast approach to determine ray-triangle intersections is to use barycentric coordinates. The code below shows how to do this:

for ( each primitive )
{
prim = node->prim[…]
k = prim->k
d = 1 / ray.D[k] + prim->nu * ray.D[ku] + prim->nv * ray.D[kv]
t = (prim->nd – ray.O[k] - prim->nu * ray.O[ku] - prim->nv * ray.O[kv]) * d
if (!(dist > t && t > 0)) continue
hu = ray.O[ku] + t * ray.D[ku] - prim->au
hv = ray.O[kv] + t * ray.D[kv] - prim->av
u = hv * prim->bnu + hu * prim->bnv
if (u < 0) continue
v = hu * prim->cnu + hv * prim->cnv
if ((v < 0) || ((u + v) > 1)) continue
// intersection found
}

 

This code assumes the following pre-calculated data for each triangle (vector A, B and C define the triangle):

vector3 c = B - A;
vector3 b = C - A;
vector3 N = b.Cross( c );
// determine dominant axis
int k = 2, u, v;
if (_fabs( N.x ) > _fabs( N.y)) { if (_fabs( N.x ) > _fabs( N.z )) k = 0; }
else { if (_fabs( N.y ) > _fabs( N.z )) k = 1; }
u = (k + 1) % 3, v = (k + 2) % 3;
nu = N.cell[u] / N.cell[rayidk];
nv = N.cell[v] / N.cell[rayidk];
nd = N.Dot( A ) / N.cell[rayidk];
bnu = b[u] / (b[u] * c[v] – b[v] * c[u]);
bnv = -b[v] / (b[u] * c[v] – b[v] * c[u]);
cnu = c[v] / (b[u] * c[v] – b[v] * c[u]);
cnv = -c[u] / (b[u] * c[v] – b[v] * c[u]);
au = A[u], av = A[v];

 

Modifying the intersection code so that it can handle four rays simultaneously is not too difficult, although it is complicated by the conditionals in the code and the fact that not all rays visiting the leaf are guaranteed to be valid.

The code below shows how to implement a vectorized version of the above pseudo code:

const static unsigned int modulo[] = { 0, 1, 2, 0, 1 };
const int k = prim->k;
const unsigned int _ku = modulo[k + 1] << 2;
const unsigned int _kv = modulo[k + 2] << 2;
const unsigned int _ka = k << 2;
const __m128 &da = *(__m128*)&RP->dcell[_ka];
const __m128 &du = *(__m128*)&RP->dcell[_ku];
const __m128 &dv = *(__m128*)&RP->dcell[_kv];
const __m128 &oa = *(__m128*)&RP->ocell[_ka];
const __m128 &ou = *(__m128*)&RP->ocell[_ku];
const __m128 &ov = *(__m128*)&RP->ocell[_kv];
const __m128 inu = _mm_load_ps ( &ta->nu );
const __m128 inv = _mm_load_ps ( &ta->nv );
const __m128 ind = _mm_load_ps ( &ta->nd );
const __m128 v2 = _mm_add_ps( _mm _add_ps( _mm_mul_ps( inu, du ), _mm_mul_ps( inv,dv ) ), da );
const __m128 v1 = _mm_sub_ps( _mm_sub_ps( _mm_sub_ps( ind, oa ), _mm_mul_ps( ou, inu ) ), _mm_mul_ps( ov, inv ) );
const __m128 t4 = fastxovery( v1, v2 );
const __m128 comp_a = _mm_cmpgt_ps( ID->dist4, t4 );
const __m128 comp_b = _mm_cmpgt_ps( t4, zero );
const __m128 mask = _mm_and_ps(comp_a, comp_b);
if (_mm_movemask_ps( mask ) > 0)
{
const __m128 au4 = _mm_load_ps( ta->au );
const __m128 av4 = _mm_load_ps( ta->av );
const __m128 v1 = _mm_sub_ps( _mm_add_ps( ou, _mm_mul_ps( t4, du ) ), au4 );
const __m128 v2 = _mm_sub_ps( _mm_add_ps( ov, _mm_mul_ps( t4, dv ) ), av4 );
const __m128 ibnu = _mm_load_ps( ta->bnu );
const __m128 ibnv = _mm_load_ps( ta->bnv );
const __m128 icnu = _mm_load_ps( ta->cnu );
const __m128 icnv = _mm_load_ps( ta->cnv );
const __m128 vu4 = _mm_add_ps( _mm_mul_ps( v2, ibnu ), _mm_mul_ps( v1, ibnv ) );
const __m128 mask2 = _mm_and_ps( mask, _mm_cmpge_ps( vu4, zero ) );
const __m128 vv4 = _mm_add_ps( _mm_mul_ps( v1, icnu ), _mm_mul_ps( v2, icnv ) );
const __m128 mask3 = _mm_and_ps( mask2, _mm_cmpge_ps( vv4, zero ) );
union { __m128 mask4; __m128i imask; };
mask4 = _mm_and_ps( mask3, _mm_cmple_ps( _mm_add_ps( vu4, vv4 ), one ) );
const __m128i tacc4 = _mm_set1_epi32( (unsigned int)ta );
ID->tacc4 = _mm_or_si128( _mm_andnot_si128( imask, ID->tacc4 ), _mm_and_si128( imask, tacc4 ) );
ID->dist4 = _mm_or_ps( _mm_andnot_ps( mask4, ID->dist4 ), _mm_and_ps( mask4, t4 ) );
ID->u4 = _mm_or_ps( _mm_andnot_ps( mask4, ID->u4 ), _mm_and_ps( mask4, vu4 ) );
ID->v4 = _mm_or_ps( _mm_andnot_ps( mask4, ID->v4 ), _mm_and_ps( mask4, vv4 ) );
}

 

We now have all the ingredients to trace four rays in parallel through a kd-tree, as long as these rays obey certain rules. The rays should be as close together as possible and the signs of their directions must match. When this is the case, it is possible to exploit vector instructions to speed up generic ray queries, at the cost of a moderate increase in code complexity. Fall-back code to trace single rays is still needed for packets that do not meet the criteria.


Performance

We tested the packet ray tracer on several scenes to determine the performance. The packet tracer is compared to the build-in fall-back code for individual rays, to guarantee equal circumstances.

The first scene is 'legocar', a simple scene consisting of 10k triangles. This scene contains some empty screen space, which is an ideal case for the packet tracer, as empty space tends to contain relatively large kd-tree nodes, where the rays in the packet are less likely to diverge.



The second scene is 'cloister', a scene that consists of 81k triangles. This scene is relatively hard for the packet tracer, as it has high shading overhead, as well as evenly distributed geometry.



The third scene is 'kitchen', a 'real-life' scene consisting of 181k triangles. This scene has some complex areas, and some low-detail areas.



The performance was measured on a Mobile Intel® Pentium® 4 processor-based laptop @ 1700Mhz. The ray tracer rendered images of 800 by 600 pixels.

Scene Packet tracer Mono tracer
Legocar 3277k rays/s, 4.6 fps 1064k rays/s, 1.9 fps
Cloister 1554k rays/s, 1.4 fps 283k rays/s, 0.2 fps
Kitchen 2135k rays/s, 2.5 fps 452k rays/s, 0.6 fps

 

As can be seen from above figures, the difference between the packet tracer and the mono tracer is considerable, especially for the cloister scene. The very large difference for this scene can be explained by the fact that a coherent packet also uses SSE code for shading. The cloister scene is heavily textured, using a bilinear filter. For coherent packets, this filtering is performed on four pixels simultaneously.

When the performance improvement of the ray queries is considered without shading, the typical speed increase is 300%.

We have shown that the performance of a ray tracer can be improved considerably by using vectorization to trace four rays in parallel. The vectorization adds a modest amount of complexity to the ray tracer, yet it improves performance considerably.