It’s a Fast, Smooth, Well-lit World Out There: Confetti, MLAA, and Intel® Microarchitecture Codename Sandy Bridge Remodel Unreality

By Nancy Nicolaisen

Abstract

This case study details the work of Confetti Special Effects Inc., a graphics think-tank specializing in Morphological Anti-aliasing (MLAA) technology.  Confetti's methodology as well as its work involving Intel Graphics Performance Analyzers (GPA) on the Sandy Bridge family of processors is documented below.
 

Introduction/Goal

As documented by extensive neurobiologic research, about 80 percent of the entire bandwidth of the human brain is devoted to vision. Another 12 percent or so is devoted to fun. This second bit is conjecture but represents a reasonable guess based on casual observation of the electronic entertainment industry. All this goes a long way toward explaining why, of all the dynamic, rapidly evolving fields of technology, computer-generated visuals are among the most vibrant in terms of innovation. One thing that makes graphics software development such fertile ground for innovators is that as rendering algorithms have become increasingly sophisticated, cutting-edge graphics hardware has advanced dramatically in performance and fallen in price. This trend is making systems formerly accessible to the fortunate few commonplace in home and personal gaming systems. High-end graphics developers Wolfgang Engel and Peter Santoki were in the best imaginable position to see the opportunity this trajectory posed when they decided to strike out on their own and leave Rockstar Games, publisher of some of the most successful games of all time.

Engel and Santoki enjoyed the sort of jobs that most in the game industry would kill for: Engel was lead graphics programmer and a key game engine developer, while Santoki was an acknowledged wizard in the creation and application of visual effects. However, the two friends and colleagues hungered for that elusive "something more." Driven, as Engel puts it, "to follow their dream," the two founded Confetti Special Effects Inc. to pursue research in leading-edge graphics programming technology and to become a game middleware powerhouse (Figure 1). Dreams, it seems, can come true. Confetti has become one of the earliest providers of advanced graphics middleware tools based on Morphological Anti-aliasing (MLAA) technology.


Figure 1. Launching a rocket using depth of field, lit soft particles, and point light shadows

 

Smooth Those Edges and Make It Snappy

MLAA is a relatively new technique, but it has the potential to dramatically improve game experiences. Initially showcased on high-end graphics processing units (GPUs), the technology enables developers to boost the image quality of their games efficiently and in real time. To understand how much improvement MLAA represents, you need to look back. Historically, dynamically rendering computer graphics dictated a simple trade-off: You could have it look good, happen fast, or run well on inexpensive platforms, but it was a "pick any two" situation. In part, this limitation stems from the physiology of human eyesight. It turns out that, given a random sample of the population, the number of color-sensitive cones on one person's retina may differ by up to 40 times relative to that of another person, but there will be no measurable difference in their ability to see. This is because we see mostly with our brains, not our eyes. More important still, we don't "see" everything in our field of view equally.

Human vision is fine-tuned to detect alignment, which is why edges that appear to have "stair steps" are much more noticeable and visually displeasing in video graphics than minor variations in color, to which we are significantly less sensitive. (The phenomenon of blocky, stair-stepped edges is known by computer scientists as spatial aliasing.) For these reasons, it has long been standard practice in graphics programming to "soften away" spatial aliases by applying gradients of color change at edge boundaries, a practice commonly known as anti-aliasing (AA). Detecting edge boundaries in complex shapes (particularly shapes with holes in them, like grates or fences) is a demanding problem that can be computationally costly to solve, particularly in the case of repetitively used graphic objects such as texture maps. Although many programming techniques exist to smooth edges and give textures a pleasing or realistic look and feel, they all suffer from one or both of these problems: They are resource intensive, requiring extraordinary amounts of computing power for reasonable performance, or they omit remediation of some amount of aliasing, causing images to appear to shimmer or have obvious remaining "jaggies." One of Confetti's noteworthy early successes was to create cutting-edge AA middleware tools for use by game developers. Those AA tools finally achieve the graphic programmer's trifecta of happy outcomes: They create smooth, sinuous edges; they're fast; and they run well on the most common configurations used by PC gamers (Figure 2).


Figure 2. Launching a rocket using depth particles, fire, point lights and shadows, and sparks
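
To make "gradients of color change at edge boundaries" concrete, here is a minimal sketch in Python. It uses brute-force supersampling, which is not MLAA and not Confetti's method; it only illustrates why an aliased edge looks stair-stepped and what a softened edge looks like numerically. The edge equation, image size, and function names are assumptions chosen for illustration.

```python
def coverage(px, py, samples):
    """Fraction of sub-samples in pixel (px, py) that lie below the edge y = 0.5 * x."""
    inside = 0
    for sy in range(samples):
        for sx in range(samples):
            # Sample positions are spread evenly across the pixel
            x = px + (sx + 0.5) / samples
            y = py + (sy + 0.5) / samples
            if y < 0.5 * x:
                inside += 1
    return inside / (samples * samples)


def render(width, height, samples):
    """Render the edge as a grid of intensities between 0.0 and 1.0."""
    return [[coverage(x, y, samples) for x in range(width)] for y in range(height)]


aliased = render(8, 4, 1)    # one sample per pixel: every pixel is fully on or off
smoothed = render(8, 4, 4)   # 16 samples per pixel: edge pixels get in-between values

for row in smoothed:
    print(" ".join(f"{v:.2f}" for v in row))
```

With one sample per pixel, each pixel is either 0.0 or 1.0, which produces the stair steps. With 16 samples, pixels crossed by the edge take fractional values, which is the gradient the eye reads as a smooth line. The cost is 16 times the sampling work, which is the kind of expense image-space techniques such as MLAA try to avoid.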

 

Confetti's Road to MLAA

Why did Confetti settle on a plan to implement MLAA in its product offering? Confetti CTO Engel, a widely published authority on computer-generated visuals, reflects here on his previous experience with AA techniques:

"I wrote a significant part of the game engine used in Rockstar's products, so optimizing image quality is something I've thought about a lot. If you use older software anti aliasing strategies like Multi Sampling Anti-aliasing (MSAA), you get excellent image quality, but it can quickly become expensive in terms of performance impacts and sacrifices gamers with low-end platforms. Similarly, if you code to hardware anti-aliasing, then you limit your audience size to a specific installed hardware base. MLAA seemed to me to have the potential to overcome a lot of barriers for us at once, because it works in a compartment image space, making it efficient and portable."

Engel explains the conceptual way in which Confetti implements MLAA in its deferred lighting engine like this:

  1. Confetti's engine detects edges, discovering verticals and horizontals.
  2. Edges are assigned values.
  3. Based on the edge values, the image space is blurred in horizontal and vertical directions.
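
The three steps above can be sketched on the CPU as follows. This is a deliberately simplified illustration, not Confetti's implementation: it works on a grayscale image stored as a list of rows, and the threshold, the constant blend weight, and all function names are assumptions. A production MLAA pass classifies edge shapes and derives the blend weights from them rather than using a constant.

```python
THRESHOLD = 0.1   # assumed luminance difference that counts as an edge
STRENGTH = 0.5    # assumed constant blend weight for every detected edge


def detect_edges(img):
    """Step 1: flag pixels that differ noticeably from their left or top neighbour."""
    h, w = len(img), len(img[0])
    vertical = [[x > 0 and abs(img[y][x] - img[y][x - 1]) > THRESHOLD
                 for x in range(w)] for y in range(h)]
    horizontal = [[y > 0 and abs(img[y][x] - img[y - 1][x]) > THRESHOLD
                   for x in range(w)] for y in range(h)]
    return vertical, horizontal


def assign_weights(flags):
    """Step 2: give each flagged edge a blend value (a constant in this sketch)."""
    return [[STRENGTH if f else 0.0 for f in row] for row in flags]


def blend(img, weights, dx, dy):
    """Step 3: mix each flagged pixel with its neighbour at offset (dx, dy)."""
    out = [row[:] for row in img]
    for y in range(len(img)):
        for x in range(len(img[0])):
            k = weights[y][x]
            if k > 0.0:
                out[y][x] = (1.0 - k) * img[y][x] + k * img[y + dy][x + dx]
    return out


def antialias(img):
    vertical, horizontal = detect_edges(img)
    img = blend(img, assign_weights(vertical), -1, 0)      # blur in the horizontal direction
    return blend(img, assign_weights(horizontal), 0, -1)   # blur in the vertical direction


print(antialias([[0.0, 0.0, 1.0, 1.0]]))
```

Because every step reads and writes only the finished image, the pass does not need to know anything about the geometry that produced it, which is the property Engel credits for MLAA's efficiency and portability.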

"It sounds simple," says Engel, "but there are challenges. How do you detect edges? And how do you apply a chroma filter in vertical and horizontal directions?" One thing with which the Confetti team was most impressed with in their early MLAA implementation experience was the technology's performance. Their MLAA-reliant lighting engine ran as fast as if there were no AA being performed at all. Image quality was another standout benefit.

"The quality is really close to MSAA, but doing anti-aliasing in image space is so much simpler and easier. It gives you all kinds of flexibility, like the ability to switch anti-aliasing on and off without rebooting the whole game. For example, you can do things like toggling anti-aliasing on or off depending on the workload or the camera style."

 

The Intel® Microarchitecture Codename Sandy Bridge Piece of the Solution

It is not surprising that MLAA image enhancement works beautifully with the 2nd generation Intel® Core™ processor family, which features Intel microarchitecture codename Sandy Bridge, because in a real sense the two were made for each other (see Figure 3 and Figure 4). The efficiency of this pairing wasn't lost on Engel and his team. When asked to describe their typical workflow and design process, he made these observations:

"We look at the hardware first and see what it can do. From there, we try to come up with assumptions about how we can get a certain effect from a given hardware configuration. Of course, at a very early point in our evaluations, we also look at potential market federation for a given platform. We closely monitor the monthly Steam hardware survey to identify trends.

"In December 2010, I believe the last survey showed something like 74 percent Intel® architecture usage. Given these statistics, one thing we realized when we first looked at Sandy Bridge was that it would certainly have a large market presence. In our early evaluation group, we thought 'Oh, that's so cool, because suddenly we have a very powerful platform that will have lots of users.' The next thought was, 'How can we use this?'"

(Steam conducts a voluntary survey to collect data about what kinds of computer hardware and software people are using. Find it at http://store.steampowered.com/hwsurvey.)


Figure 3. MLAA before and after. The image on the left is rendered without MLAA; the image on the right has MLAA applied.


Figure 4. Detail: Notice the spatial aliasing in the shoulder line before applying MLAA and the smoother profile of the right image after the MLAA processing.

Intel microarchitecture codename Sandy Bridge provided many advantages to Confetti. It offered increased memory bandwidth, much better rendering performance, and flexibility in designing new solutions.

"I've worked a great deal with previous generations of graphics hardware, where it was very difficult to do deferred lighting because the memory bandwidth was just not there. We did a couple of test runs on [Intel microarchitecture codename] Sandy Bridge, and it really surprised us with the fact that there was suddenly so much more memory bandwidth. That was actually the first time we could run on processor graphics."

(Processor graphics are a defining feature of the 2nd Generation Intel® Core™ i7/i5/i3 processors, wherein all the graphics capabilities are built right into the CPU chip.) Engel and his team were most pleased about being able to run with their new shadow system on a piece of hardware that will be common in the market. As Engel notes:

"What really surprised us was that when we started to do performance measurements, we found that the cost per light and the cost per shadow map were really good. We all still have our gaming console hardware and lots of specialized entertainment hardware, so coming from that background, we were very surprised that we can use more lights and we can render more shadows on the [Intel] Sandy Bridge platforms. We are quite excited about this, because it means that whatever we do now will run on the majority of PCs."

Doing the Metrics

Confetti did a lot of analysis before implementing its deferred lighting engine, because it was a "no-going-back" kind of step. In earlier versions of the company's middleware and tools, the team left this feature out. Seductive as it was, it was simply too big a market risk. Says Engel:

"Two years ago, we were thinking about doing deferred lighting and decided against including it, because the low-end consumer platforms just couldn't handle it. They really couldn't do it at all. And because deferred lighting influences the look and feel of your game, we couldn't just say, 'Okay let's have a fall-back method.' If you have deferred lighting in your engine, you can't dynamically drop back to a low-end approach."

The Confetti team had to be sure they were creating a product that could be used by as wide a variety of gamers as possible, and they needed empirical performance metrics to back up their decision. They used the Intel® Graphics Performance Analyzers (Intel® GPA) to put their estimates of rendering performance and characteristics on a solid foundation.

Engel and his peers liked Intel GPA a lot:

"Let me just first say GPA is great. It's awesome. For someone like me, coming from a game console background, the standard of comparison is high. Video game consoles have really great profilers, so we were kind of spoiled. Targeting consoles, we had been able to go down to the nitty gritty details. Intel GPA is the first PC-based tool where we can say 'Okay, this is comparable to game console tools.' That pretty much says it all. We get a very detailed view. We can also reuse our custom tagging system. This is key, because we were already productive with that tool, and we were comfortable with it. We tag parts of our code and can see, for example, how we render lights and how we render shadows. We get millisecond orders of feedback on performance, down to whatever level of granularity we want. It just worked. And that whole system was very reliable. One of the optimizations we figured out with [Intel] GPA was an improvement in rendering cube shadow maps. With [Microsoft*] DirectX* 10, you can render in all six faces of a cube shadow map with one draw call. The geometry shader will then replicate-if necessary-triangles into the six faces. It also does frustum and triangle culling in the geometry shader, so the geometry shader is pretty busy."

The code in this inner loop might look like Listing 1.

Listing 1. Example Inner-loop Code

// Loop over cube faces
[unroll]
for (int i = 0; i < 6; i++)
{
	// Translate the view projection matrix to the position of the light
	float4x4 pViewProjArray = viewProjArray[i];

	//
	// translate
	//
	// access the row HLSL[row][column]
	pViewProjArray[0].w += dot(pViewProjArray[0].xyz, -In[0].lightpos.xyz);
	pViewProjArray[1].w += dot(pViewProjArray[1].xyz, -In[0].lightpos.xyz);
	pViewProjArray[2].w += dot(pViewProjArray[2].xyz, -In[0].lightpos.xyz);
	pViewProjArray[3].w += dot(pViewProjArray[3].xyz, -In[0].lightpos.xyz);

	float4 pos[3];
	pos[0] = mul(pViewProjArray, float4(In[0].position.xyz, 1.0));
	pos[1] = mul(pViewProjArray, float4(In[1].position.xyz, 1.0));
	pos[2] = mul(pViewProjArray, float4(In[2].position.xyz, 1.0));

	// Use frustum culling to improve performance
	float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
	float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
	float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
	float4 t = t0 * t1 * t2;

	[branch]
	if (!any(t))
	{
		// Use backface culling to improve performance
		float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
		float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

		[branch]
		if (d1.x * d0.y > d0.x * d1.y ||
		    min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
		{
			Out.face = i;

			[unroll]
			for (int k = 0; k < 3; k++)
			{
				Out.position = pos[k];
				Stream.Append(Out);
			}
			Stream.RestartStrip();
		}
	}
}
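
The two culling tests in Listing 1 are compact enough to be hard to read at first glance. The following Python sketch restates them on the CPU, assuming each vertex is a clip-space tuple (x, y, z, w). The function names are illustrative and do not appear in Confetti's code; the arithmetic mirrors the listing line for line.

```python
def saturate(v):
    """Clamp to [0, 1], like the HLSL intrinsic of the same name."""
    return max(0.0, min(1.0, v))


def outside_mask(pos):
    """Per-plane flags: positive if the vertex is beyond the left, bottom, right, or top plane."""
    x, y, _, w = pos
    # Same as saturate(pos.xyxy * float4(-1, -1, 1, 1) - pos.w) in the listing
    return [saturate(-x - w), saturate(-y - w), saturate(x - w), saturate(y - w)]


def frustum_culled(tri):
    """True if all three vertices lie outside the same side plane."""
    masks = [outside_mask(p) for p in tri]
    # The component-wise product is non-zero only when every vertex fails the same plane
    return any(a * b * c > 0.0 for a, b, c in zip(*masks))


def passes_backface_test(tri):
    """True if the triangle should be emitted: either its winding passes, or a vertex has w < 0."""
    p0, p1, p2 = tri
    # Edge vectors in homogeneous form, which avoids dividing by w
    d0 = (p1[0] * p0[3] - p0[0] * p1[3], p1[1] * p0[3] - p0[1] * p1[3])
    d1 = (p2[0] * p0[3] - p0[0] * p2[3], p2[1] * p0[3] - p0[1] * p2[3])
    return d1[0] * d0[1] > d0[0] * d1[1] or min(p0[3], p1[3], p2[3]) < 0.0


inside = [(0.0, 0.0, 0.5, 1.0), (0.0, 0.5, 0.5, 1.0), (0.5, 0.0, 0.5, 1.0)]
off_right = [(3.0, 0.0, 0.5, 1.0), (4.0, 0.0, 0.5, 1.0), (3.0, 1.0, 0.5, 1.0)]
print(frustum_culled(inside), frustum_culled(off_right))
```

Note that the frustum test is conservative: a triangle is discarded only when all three vertices fail the same plane, so a triangle that straddles a plane is kept. The backface test likewise keeps any triangle with a vertex behind the viewpoint (w < 0), where the winding computation is not reliable.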

To relieve the workload of the geometry shader, Engel's team moved the offset and transformation code into the vertex shader. This change produced a performance gain of more than 25 percent. Listing 2 shows the source code for the revised vertex and geometry shaders.

Listing 2. Vertex Shader and Geometry Shader

float4x4 viewProjArray[6];
float3 LightPos;

GsIn main(VsIn In)
{
	GsIn Out;

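	// Offset the vertex so that the light sits at the origin
	float3 position = In.position - LightPos;

	// One full transform per axis pair (even-numbered face),
	// plus only the z component for the opposite (odd-numbered) face
	[unroll]
	for (int i=0; i<3; ++i)
	{
		Out.position[i] = mul(viewProjArray[i*2], float4(position.xyz, 1.0));
		Out.extraZ[i] = mul(viewProjArray[i*2+1], float4(position.xyz, 1.0)).z;
	}

	return Out;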
}


//------------------------------------------------------------------------------
[Geometry shader]

#define POSITIVE_X 0
#define NEGATIVE_X 1
#define POSITIVE_Y 2
#define NEGATIVE_Y 3
#define POSITIVE_Z 4
#define NEGATIVE_Z 5

float4 UnpackPositionForFace(GsIn data, int face)
{
	float4 res = data.position[face/2];

	[flatten]
	if (face%2)
	{
		res.w = -res.w;
		res.z = data.extraZ[face/2];
		[flatten]
		if (face==NEGATIVE_Y)
			res.y = -res.y;
		else
			res.x = -res.x;
	}

	return res;
}

[maxvertexcount(18)]
void main(triangle GsIn In[3], inout TriangleStream<PsIn> Stream)
{
	PsIn Out;

	// Loop over cube faces
	[unroll]
	for (int i = 0; i < 6; i++)
	{
		float4 pos[3];
		pos[0] = UnpackPositionForFace(In[0], i);
		pos[1] = UnpackPositionForFace(In[1], i);
		pos[2] = UnpackPositionForFace(In[2], i);

		// Use frustum culling to improve performance
		float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
		float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
		float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
		float4 t = t0 * t1 * t2;

		[branch]
		if (!any(t))
		{
			// Use backface culling to improve performance
			float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
			float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

			[branch]
			if (d1.x * d0.y > d0.x * d1.y ||
			    min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
			{
				Out.face = i;

				[unroll]
				for (int k = 0; k < 3; k++)
				{
					Out.position = pos[k];
					Stream.Append(Out);
				}
				Stream.RestartStrip();
			}
		}
	}
}
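
Why can the vertex shader get away with three full transforms instead of six? Opposite cube faces look along the same axis in opposite directions, so their clip-space positions differ only by sign flips, apart from z, which carries a depth offset and therefore has to be stored separately. The Python sketch below checks this numerically. The face orientations, the 90-degree left-handed projection, and the near and far values are assumptions based on the usual Direct3D cube-map layout; the article does not show how viewProjArray is built.

```python
# (forward, up) per face, in the order of the listing's #defines (assumed orientations)
FACES = [
    ((1, 0, 0), (0, 1, 0)),    # POSITIVE_X
    ((-1, 0, 0), (0, 1, 0)),   # NEGATIVE_X
    ((0, 1, 0), (0, 0, -1)),   # POSITIVE_Y
    ((0, -1, 0), (0, 0, 1)),   # NEGATIVE_Y
    ((0, 0, 1), (0, 1, 0)),    # POSITIVE_Z
    ((0, 0, -1), (0, 1, 0)),   # NEGATIVE_Z
]
NEGATIVE_Y = 3
NEAR, FAR = 0.1, 100.0         # assumed clip planes


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def view_proj(forward, up):
    """Rows of a 90-degree, left-handed view-projection matrix with the eye at the origin."""
    right = cross(up, forward)
    true_up = cross(forward, right)
    q = FAR / (FAR - NEAR)
    return [
        [right[0], right[1], right[2], 0.0],
        [true_up[0], true_up[1], true_up[2], 0.0],
        [q * forward[0], q * forward[1], q * forward[2], -q * NEAR],
        [forward[0], forward[1], forward[2], 0.0],
    ]


def transform(m, p):
    """Equivalent of mul(m, float4(p, 1.0)): each output component is a row dotted with p."""
    v = (p[0], p[1], p[2], 1.0)
    return [sum(row[c] * v[c] for c in range(4)) for row in m]


VIEW_PROJ = [view_proj(f, u) for f, u in FACES]


def vertex_shader(world_pos, light_pos):
    """Mirror of the Listing 2 vertex shader: three positions plus three extra z values."""
    p = tuple(w - l for w, l in zip(world_pos, light_pos))
    position = [transform(VIEW_PROJ[i * 2], p) for i in range(3)]
    extra_z = [transform(VIEW_PROJ[i * 2 + 1], p)[2] for i in range(3)]
    return position, extra_z


def unpack_position_for_face(position, extra_z, face):
    """Mirror of UnpackPositionForFace in Listing 2."""
    res = list(position[face // 2])
    if face % 2:
        res[3] = -res[3]
        res[2] = extra_z[face // 2]
        if face == NEGATIVE_Y:
            res[1] = -res[1]
        else:
            res[0] = -res[0]
    return res


light, world = (1.0, 2.0, 3.0), (4.0, 2.5, 1.0)
position, extra_z = vertex_shader(world, light)
relative = tuple(w - l for w, l in zip(world, light))
for face in range(6):
    direct = transform(VIEW_PROJ[face], relative)
    unpacked = unpack_position_for_face(position, extra_z, face)
    print(face, all(abs(a - b) < 1e-9 for a, b in zip(direct, unpacked)))
```

Under these assumptions the unpacked position matches a direct transform for all six faces, so the geometry shader only has to flip signs instead of performing a matrix multiply per vertex per face.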

Optimizing the Geometry Shader for Intel® Microarchitecture Codename Sandy Bridge Using Intel® GPA

Confetti used Intel GPA to tune code and establish hard metrics about optimization results (Figure 5).


Figure 5. The baseline in Intel® GPA as the Confetti team identified areas of potential for optimizing shader code. Note the GPU time in GS:1447 on the Shaders tab: On January 8, this was 1444.0 ms.

"We integrated [Intel] GPA very quickly and used it a lot in optimizing for [Intel microarchitecture codename] Sandy Bridge," said Engel (Figure 6).


Figure 6. GPA metrics quantifying improvements in shader performance. Note the GPU time in GS:1357 on the Shaders tab: For the baseline, this was 1098.5 ms.

"I felt like it was the best system compared to other systems. I don't know what else we can say. It's just cool." Engel is right: There's nothing more to add to that story. Except maybe graphics. The next two figures show Confetti's Dynamic Skydome System during a 24-hour day-night cycle. Figure 7 shows Confetti's depth-of-field and point light shadows technology.


Figure 7. Confetti's depth-of-field and point light shadows technology are important components of its Dynamic Skydome technology.

Figure 8 provides a detail of light in-scattering.


Figure 8. Detail of the use of in-scattering of light in a rendered scene

Conclusion

The Confetti team have a long history of aggressively implementing advanced technologies and also have a broad cross-platform background. Based on that depth of experience, they approached Intel microarchitecture codename Sandy Bridge-based platforms with expectations of finding good graphics performance and real opportunities to expand their audience. They got more than that, however: dramatically improved rendering performance; increased memory bandwidth and storage; an architecture that allowed them to implement MLAA in a fashion entirely compatible with moderately priced systems; and best-of-breed optimization tools, so they could know to a certainty they were delivering beautiful, immersive game experiences for the typical user. For more information on Confetti, go to http://www.conffx.com or become a friend of Confetti Special Effects on Facebook.

About the Author

Nancy Nicolaisen is the author of numerous books on software engineering techniques. She specializes in the design and development of solutions for small mobile and embedded systems. Her involvement with the game industry dates back to 1981, when she worked at gaming pioneer Imagic, developer of Demon Attack and other classics.
