We describe updates to the original conservative morphological anti-aliasing (CMAA) algorithm including a new DirectX* Compute Shader implementation that we call CMAA2. This version provides additional performance, improved quality, optional integration with multi-sample anti-aliasing (MSAA), and the ability to be leveraged asynchronously on graphics processing units (GPUs) that support this capability. We also describe some of the weak points of the post-process anti-aliasing (AA) techniques, reflect on scenarios in which they work well and, in some cases, complement MSAA solutions. In comparison to fast approximate anti-aliasing (FXAA), our testing using the publicly available Amazon Lumberyard* Bistro scene shows that CMAA2 provides as good or better anti-aliasing while preserving significantly more of the input image sharpness at roughly similar performance. At higher resolutions, CMAA2 is as fast or faster depending on the hardware. We provide quality and performance for a number of hardware and resolution configurations as well as the source code to a DirectX* 11 DirectCompute 5.0 sample application.
Figure 1. On the left, the image with no anti-aliasing enabled. Second image shows CMAA2 applied alone. Third image is 8xMSAA alone, showing better subpixel AA on the napkin holder and other geometry than the second, but no AA on shadows. On the right is 4xMSAA+CMAA2 providing better AA compared to 8xMSAA alone, at lower overall computation cost.
We present an update to the image based conservative morphological anti-aliasing (CMAA) algorithm and discuss the performance and quality tradeoffs as well as the integration with multi-sample anti-aliasing (MSAA). This new implementation improves on the anti-aliasing quality and performance of previous implementations.1 Also, we have transitioned from a pixel shader to a series of compute shaders that can improve performance on hardware that supports the simultaneous execution of work scheduled in a graphics and compute command queue. The compute shaders are DirectX* Compute Shader 5.0 and we provide a sample implementation in DirectX 11. Finally, we discuss how to use CMAA together with MSAA and how the two approaches complement each other.
While we focused on optimizations to improve the performance on Intel® Integrated Native Developer Experience (Intel® INDE) graphics, we worked to ensure the algorithm works well on the latest discrete graphics processing units (GPUs) from other vendors. We welcome contributions from the community to refine the implementation.
The move from a pixel shader to compute shader implementation provided us with several benefits. First, we can schedule the work concurrently on hardware that supports execution of graphics and compute simultaneously via asynchronous compute. Second, we leverage shared local memory which is not available in pixel shaders to avoid more expensive memory access. Third, by leveraging indirect dispatch, we are able to better utilize GPU compute resources when compared to a pure pixel shader implementation, especially for shape detection and resolution which relies on branching in the pixel shader. Many other benefits are described in Section 2 when we describe the improvements over previous releases of CMAA.
For the purposes of this paper, we will use CMAA2 to refer to CMAA 2.0.
We use the term conservative to refer to the unique approach of classification and resolving of aliasing artifacts designed to preserve overall sharpness of the input image while still providing a visually pleasing aliased result. In other words, our anti-aliasing heuristic is guided by two objectives: getting as close to ground truth as possible while at the same time minimizing the change to the input image. We use the term morphological to identify it as belonging to the same class of algorithms as the original morphological antialiasing.2
Inspired by the work to integrate SMAA with 2xMSAA (SMAA S2x)3, there are a few specific reasons we wanted to investigate the integration of CMAA with MSAA. First, MSAA alone is intended to resolve rasterization artifacts but does not help with aliasing issues that arise during pixel shading, so increasing the sample count above 4x or 8x provides diminishing returns at the exponential increase in cost. Examples of aliasing sources that do not benefit from MSAA are high power specular components, sharp shadow maps or even ray-traced shadows as shown in John Story's "Hybrid Ray-Traced Shadows".4 Second, because MSAA is not as efficient on some integrated GPUs as we would like, higher sample counts can be impractically expensive in certain scenarios. We are able to show that the integration of CMAA2 and MSAA provides a net benefit over doing one or the other individually, providing an effect that is better than the sum of its parts. Specifically, we show that 4x MSAA with CMAA2 can provides higher quality than 8x MSAA at a lower or similar overall performance cost depending on the graphics hardware and share these results in Section 6.
Anti-aliasing techniques have been studied by computer graphicists for many years. For the purposes of this article we limit references to popular post-processing techniques, all of which attempt to approximate a super sampled anti-aliased (SSAA) image taking into account only the pixels generated in the target frame buffer at frame buffer resolution. In some cases which include temporal sampling, SMAA T2x for example, they additionally leverage previously generated frame buffers. Integrated temporal sampling with an earlier CMAA2 implementation is presented in Sungye Kim's "Temporally Stable Conservative Morphological Anti-Aliasing" article.5
A retrospective of the post-processing era is available, so we refer the reader to this survey for more background as well as a survey of games that have shipped that adopted implementations of post process anti-aliasing techniques.6 This has been an active area of work and we will review four of the interesting implementations. First, we will describe morphological anti-aliasing (MLAA) which forms the foundation of our technique as well as the others. Next, we review fast-approximate anti-aliasing (FXAA) which has been widely adopted as a very fast post-process anti-aliasing technique. We also review subpixel morphological anti-aliasing (SMAA) which was used in CRYENGINE* as well as many other titles.7 Finally, we compare our updates to the previous implementations of CMAA.
A seminal work in the field of post process anti-aliasing is Alexander Reshetov's MLAA paper.2 MLAA was originally implemented as a CPU-side post-processing technique well suited but not limited to ray tracing. Shortly after this invention the technique was implemented on the GPU.8 In Reshetov's implementation, edges in color space were classified as Z, U, or L shapes where Z and U shapes can be decomposed into 2 L-shapes, as seen in the figure below. While we provide a short summary here, it is recommended for those interested in additional details to consult the original paper.
Figure 2. Shapes classified in the original MLAA paper. From left to right: Z, U, and L shapes. The shapes are also identified that match these patterns rotated 90, 180 and 270 degrees. Since Z and U shapes decompose into L-shapes, the MLAA implementation need only be concerned with processing the decomposed version of Z and U into two L-shaped subcomponents.
All of the L-shapes are blended in the final stage by blending the colors of the pixels that touch the edges they contain. To do this blending a primary and secondary edge is defined, the primary edge being the one that was found at the first step, and the secondary edge the one found in the second merge phase. Interestingly, the secondary edge is only one pixel in length, even in cases when the number of pixels detected in a previous phase belonging to the secondary edge could be longer. This is safe because the rest of the pixels not processed in this stage will be independently processed as a separate shape. The pixels are blended based on a weighted average of the area of each pixel occupied, found by drawing a line from the midpoint of the secondary edge to the farthest vertex of the primary edge.
Like other post-processing techniques, FXAA infers edges in color space and operates only on those edges to minimize the amount of computation. The value proposition of existing FXAA implementations is as follows:
FXAA has remained a popular and practical post-processing technique, particularly in deferred rendering. The implementation described in FXAA by Timothy Lottes9 is a single pass on the color image and integrates the ability to estimate luminance strictly from the red and green channels observing that blue aliasing rarely occurs in game content. The computations described in "Filtering Approaches for Real-Time Anti-Aliasing",10 are kept localized to the nearest neighbor pixels and optionally increasing the filter width up to 8 pixels. We use a publicly available implementation.11 We use FXAA version 3.11 at default settings for testing and comparison purposes.
SMAA is an evolution of the original MLAA12 and describes how to handle many issues not addressed in previously published post-processing techniques including local contrast adaptation (LCA), improves handling of sharp corners, diagonal line segments, subpixel features, motion, and specular aliasing. SMAA highlights its modularity allowing the implementation to adapt to discrete design points in quality and performance. For example, a low-end hardware target may choose to include the features to resolve diagonal line segments and local contrast but may omit the temporal stages. Similarly, a high-end target may include all of the capabilities and show even better results in both performance and quality than a high bandwidth MSAA implementation. We use a publicly available SMAA implementation at default settings for testing and comparison purposes3.
Intel has published earlier implementations of CMAA.1 This was a pixel shader-based implementation that had commercial use in game titles, for example CodeMaster* GRID 2.13 We improve upon this previous implementation that we describe in more detail.
DirectX DispatchIndirect()method to reduce the amount of work relative to a pixel shader approach that may have required less efficient branching or stencil masking.14
Intel published a temporally stable adaptation of a recent implementation which enhances the frame to frame coherence of the original CMAA implementation, temporally stable conservative morphological anti-aliasing (TSCMAA).5 In fact, the request for updating the original CMAA for use in the TSCMAA implementation is what inspired this stand-alone CMAA2 update. Given the significant changes, we felt details of the revised CMAA technique independently justified its own treatment, in addition to updates that have occurred since the original TSCMAA implementation.
Locate edges in input images, store as candidates for later evaluation
For each edge, calculate new weighted color values, append results to per pixel linked list
Buffer of per
For each pixel of the output image, blend linked lists elements in a weighted average into the output buffer
An overview of the three primary stages of the CMAA 2.0 algorithm. Each stage is an independent compute shader and takes input from the previous stages to compute the final render target. Details of each stage are described below.
In this section we provide an overview of our CMAA2 algorithm. We compare and contrast with two of the most popular post-process anti-aliasing solutions on the market today, FXAA and SMAA.
One way to distinguish different post process anti-aliasing techniques is by looking at the aliasing shapes they identify and how they utilize those shapes to anti-alias by blending with neighboring pixels. Summarizing from Reshetov and Jimenez's presentation6, we can simplify the key points of MLAA or SMAA as follows:
Figure 4. The MLAA edge detection step.
Figure 5. The MLAA shape classification and revectorization step.
Figure 6. The final MLAA blending step.
CMAA2 uses a first step that is analogous to the first step above (Figure 4), where axis-aligned lines separating pixel with different colors (edges) are found.
However, the fundamental difference from MLAA (and SMAA) is in the rules governing shape classification that follow the reconstruction of the silhouette lines. Here CMAA2 classifies only one main shape type – a symmetrical Z shape, different from the Z, U and L shapes from MLAA and SMAA.
Figure 7. An example of a CMAA2's symmetrical Z-shape.
CMAA2's Z-shape transition from one row or column to the next is always one pixel in height and width, while the arms are extending as long as it can walk an unbroken color edge on the sides, with the additional constraints that the length of one arm is never larger than the length of the other multiplied by a constant value, for example 1.2, 1.3, or 2, and they always have a combined length of at least 4 pixels.
Finally, the blending is performed assuming the underlying geometry's silhouette is a straight line from one to the other Z-shape arm's centers.
Figure 8. The revectorization (red line) and blending based on CMAA2's symmetrical Z-shapes. The blend takes place between the midpoint of the two long arms of the z-shape.
This symmetrical Z-shape classification and blending approach is the core of the CMAA2 approach and is more restrictive when compared to the one employed by MLAA (and consequently SMAA). Symmetrical Z handling inherently avoids the over blurring scenarios that are handled as a special case in SMAA using the longer crossing edges condition, as described in Chapter 4.2 Pattern Handling – Sharp geometric features12.
Figure 9. From left to right: the source image, FXAA 3.11, SMAA, and CMAA2.
Figure 9 shows different behaviors of FXAA, SMAA and CMAA2 on a simple synthetic example of a simple rotating square, with visible differences in sharpness and handling of corners. Next, we will cover the practical implications of these differences in the real-world scenarios.
A careful observer will notice that there is one-pixel blurring in the corners of the CMAA2 case and also that we have previously stated that one of the conditions for the detection of a symmetrical Z-shape is that "the combined length of the arms is 4 or above”. This would have excluded these corners as well as a subset of other aliasing artifacts such as a diagonal line.
CMAA2 uses a separate simple shape detection and filter to handle these cases. This simple shape filter works on pixels individually, taking as input info on the 12 surrounding edges as shown in Figure 10 and performs re-vectorization and blending between the pixel and its four neighbors (left, top, right, bottom). The blending weights computation is based on a relatively simple heuristic that is explained in the implementation details. This filter is not executed on pixels that were previously processed by the main symmetrical Z-shape.
Figure 10. CMAA2's handling of secondary 'simple shapes'. The blue lines indicate 12 edges used to resolve blending for the dark green pixel in the center and red lines indicate resulting revectorization.
CMAA2 does not yet include advanced handling of diagonal aliasing patterns such as the SMAA approach based on using precomputed texture to obtain more accurate coverage areas (Chapter 4.2 Pattern Handling – Diagonal patterns, 11). This is because our implementation of this approach had a non-trivial impact on performance and also reduced sharpness in certain scenarios, which goes against the conservativeness ethos. However, we are still exploring potential solutions, as this is one area where SMAA produces superior quality results compared to both FXAA and CMAA2.
Figure 11. Handling of diagonals, from left: source image, FXAA 3.11, SMAA, CMAA2
There are a number of places throughout the implementation where a tradeoff is made between preserving the sharpness or applying more anti-aliasing. For example, the strength of local contrast adaptation for edge detection, processing symmetrical Z-shape constraints, simple shape re-vectorization aggressiveness, etc. We have captured these settings under a global "Extra sharpness" switch, providing the user an easy way to toggle between two modes. In some scenarios, such as when text and the UI are embedded into the 3D scene, or the art contains sharp high frequency textures, extra sharpness is desirable. In others, more anti-aliasing provides more pleasing results.
Figure 12. Sharpness vs anti-aliasing; from left: source image, FXAA 3.11, SMAA, CMAA2, CMAA2-ExtraSharpness
The CMAA2 algorithm is broken into three phases: Edge Detection, Shape Processing, and Resolve. To simplify the process of understanding the overall algorithm we will provide an overview then describe the enhancements to the basic algorithm separately:
Now that we've presented an overview of CMAA2, we present the enhancements and next level of detail on our implementation. First, we describe several improvements made to the base algorithm, including the use of dispatch indirect and optimizations to increase utilization of data we had pulled from memory into the cache and registers. In addition, we describe our encoding scheme for internal data structures and the use of shared local memory, as well as interoperability with MSAA. The next section goes into detail on each of the three phases of the implementation. To motivate these optimizations, it is useful to be familiar with the Compute Architecture of the Intel® Integrated Development Environment graphics GPUs. An excellent reference is the whitepaper, The Compute Architecture of Intel® Processor Graphics Gen9. Also, we assume the reader is familiar with terms related to the DirectX API such as groupshared, dispatch indirect, and threadgroup.
Section 3 provided an overview of the CMAA2 algorithm and we continue to instill improvements. However, we continue to evolve the implementation with improvements. To this end, we will describe several of the enhancements included in this version. Also, the Compute Shader Model 5.0 implementation has been tested and is designed to work in both DirectX11 and DirectX12.
For the second (Shape Processing) and third (Resolve) stages we utilize dispatch indirect to efficiently coalesce and schedule useful work that results from the previous stage and avoid processing pixels that do not require anti-aliasing. This way we can also split the work in 14 optimal thread group sizes that are tuned with regard to shader shared local memory, known as [groupshared] in DirectX High-Level Shading Language (HLSL) and minimize register pressure. For setting up the dispatch indirect calls we invoke two trivial Compute Dispatch Argument passes, omitted in the previous description for clarity, before the second and third main steps. This is in contrast to our original pixel shader CMAA implementation (and SMAA as well) which rely on stencil masking to avoid unnecessary work in subsequent passes on areas that do not require anti-aliasing. Using the stencil mask for this is less efficient than the dispatch indirect approach because of the need and cost of outputting and later using the stencil mask as well as pixel shader thread dispatch granularity that results in some maskedout pixels with no required work still causing the dispatch of redundant threads
During performance analysis we observed that the first compute shader version that processed a single pixel per thread was underutilizing the hardware threads on the Intel® Integrated Development Environment graphics GPU as well as performing sampling and math operations where the results could be shared with neighboring pixels. After some experimentation we determined that processing pixels in 2x2 micro tiles per thread provided the right balance between the front end launching thread groups and the back end, as well as the reuse of texture sampling, edge and local contrast adaptation computation. The downside of coalescing the work of four pixels into a single thread is that it increases the complexity of the first stage but in the end the speedup was worth it. These benefits translate directly to discrete GPUs which had similar underutilization issues.
In edge detection, we encode the edge candidates into a 4-bit per-pixel buffer using the bits as on or off flags to indicate presence of an edge in each direction. This means that each 2x2 pixel block writes only 16 bits of edge information, which significantly reduces overall memory bandwidth usage of the algorithm.
For intermediate storage of 8-bit non-HDR sRGB packed colors we adopt the R11G11B10_E4_FLOAT encoding from the Microsoft DirectX* 12 MiniEngine, and for processing of HDR color we support standard R11G11B10_FLOAT encoding. It is trivial to extend this to any other encoding if needed although using an encoding that requires more than 32-bits per color will increase memory bandwidth usage and storage requirements. An analysis of small float format precision for storing and using color information can be found in Bart Wronski's "Small float formats – R11G11B10F precision".16
Tight packing of data is employed in all other cases such as when storing the blending values created in Process Shapes and consumed by Resolve. All of these routines are documented with source provided in the sample code and will work on MSAA or non-MSAA surfaces with no changes.
We pack edge information into shared local memory (in HLSL) during edge detection. This provides speedup compared to global memory because we can take advantage of the heavily banked shared local memory portion of the L3 cache on Intel® Integrated Development Environment GPUs. Discrete GPUs also benefit from this optimization.
One issue we face in the shape processing pass is that only a small subset of candidates will produce viable Z-shapes, and the processing of each is comparatively an expensive operation. This is problematic as most of the threads in the threadgroup used for evaluation will not result in viable Z-shapes and will continue idling while they wait for the few threads that did find viable Z-shapes to finish their lengthy and expensive operation that involves texture reads, blending and storage operations. In order to reduce the resulting underutilization of the hardware threads we measure the viable Z-shapes as before but then split further blending and processing operations into individual per-pixel micro work items, with arguments encoded into SLM memory. Then, after a group barrier and group sync, we redistribute these work items to all threads in the group and execute them in parallel. Please refer to CMAA2_COLLECT_EXPAND_BLEND_ITEMS macro in the source code for more details.
One can attempt to apply post-process AA after the MSAA resolve but this breaks down for areas already anti-aliased with MSAA since edge detection and heuristics no longer have the original values and cannot correctly deduce rasterization artifacts.
More advanced integration of post-process AA with MSAA is explored in detail in SMAA11 for the SMAA+2xMSAA mode called SMAA S2x, where applying the effect on a per-sample-plane before MSAA resolve is shown to induce unwanted blurring. This happens because rasterization for each MSAA sample plane is effectively offset in order to extract subpixel information, so the assumptions on underlying geometry that the post-process AA techniques will make based on the produced image will be shifted and result in divergence from actual geometry, with noticeable blurring. To remedy this, SMAA S2x uses specialized offsetted revectorization heuristics based on the MSAA sample plane offset, as described in11 chapter 4.3 Subpixel Rendering, which counters the rasterization blurring and produces sharp geometry anti-aliasing. However, it does introduce additional blurring on non-rasterization aliasing artifacts that come out of pixel shading because during MSAA pixel shading is done per-pixel and at pixel centers without offsets so in this case the heuristic actually causes a reduction in sharpness.
Initially we tried to separate rasterization aliasing artifacts from pixel shading aliasing artifacts in the MSAA surface by assuming the former on pixels with different sample values within a pixel and later when the MSAA samples within the pixel are identical. This proved to be a relatively good heuristic with regards to quality, but the complexity did not allow for performance that was practical.
Consequently, we decided to revert to the simple CMAA2 + MSAA per-sample-plane approach, with an optional reduction in the CMAA2 effect strength to reduce blurring. Interestingly, although the slight increase in blurriness is noticeable, we still get good quality results that get closer to the ground truth reference (see the section in Chapter 5: CMAA2 Integration with MSAA) in 2x, 4x and 8x MSAA scenarios, and in the case of CMAA2+4xMSAA better quality than 8xMSAA alone. This is potentially due to higher overall sharpness and more conservative revectorization employed by CMAA2 in general when compared to SMAA. From performance perspective, avoiding the offsetted revectorization lets us handle identical samples as one, resulting in acceptable overall cost.
However, we would like to stress that the integration of the CMAA2 and MSAA is provided as a reference, and we have a number of potential future optimizations not yet explored.
Edge detection can optionally be supported in an 8-bit per-pixel (bpp) pre-computed log-luma texture to reduce memory bandwidth in the first pass by a factor of four at the cost of slightly lower quality edge detection. This is most useful when previous stages may have already generated such an image and we are able to re-use the texture.
Edge detection operates on blocks of pixel data with user configurable size, each outputting edge data and anti-aliasing shape candidates. It is contained entirely in the EdgesColor2x2CS function in the provided shader source code. We empirically found that the optimal processing block size on all tested hardware is 32x32 pixels. This results in effective processing of 28x28 pixels by each block since the 2-wide borders of the kernel are used to provide neighbor data to the inner kernel but cannot produce valid output themselves.
Edge detection is done in sRGB (not linear) color space to better detect edges that the human visual system is more likely to observe in a scene. For each pixel, we use the pixel to the right and bottom to determine if the color difference is above a threshold and if so we consider it a candidate edge to be evaluated later. A lower threshold will cover more aliasing cases but increase the overall execution time as this results in more edges being considered to require filtering. This difference is measured as the maximum difference in color channels between a pixel to the right and bottom of the current pixel under consideration. In a higher performance implementation, we only measure the difference between precomputed luma values. We store these differences in shared local memory (in DirectX) for later processing.
Edge detection is done using the RGB values to create a luma approximation [Wiki 2018]. We experimented with using RGB directly; however, luminance proved to be faster with minimal quality loss. As an alternative to the supplied RGB to luma based approximation scheme we also include an optional edge detection using a luminance map that may have been created for other stages in the rendering pipeline, for example a tone mapping post process. We leave it up to the user to decide what works best in their specific use case.
The sample code also includes an optional weighted euclidean distance metric for higher quality edge detection. Our experimentation did not show significant improvement in quality that justifies the added performance cost.
To avoid smaller color discontinuities being considered with the same importance as the more noticeable ones, in the second part of edge candidate detection we apply a local contrast adaptation (LCA) step that culls edges when a lot of higher contrast ones exist in the neighborhood. In a technique similar to SMAA as described in Chapter 4.1 Edge detection – Local contrast adaptation11 we reduce the already computed color or luma difference by the maximum of the four values representing connecting edges of opposite orientation, multiplied by a user-selectable contrast adaptation multiplier as in Figure 13. For more detail refer to the ComputeLocalContrastV and ComputeLocalContrastH functions in the shader code. We use shared local memory to keep this computation entirely in GPU memory.
Figure 13. Local contrast adaptation suppresses the blue edge value by the four connecting red edge values of opposite orientation.
Finally, if the remaining value is above the threshold, the edge is considered to be present and we store this as a binary flag for the edge.
The last part of the edge detection phase has two parts. One is to coalesce the edge information so that each pixel's processing location holds a full set of flag values indicating the presence of an edge towards each neighbor. This 4-bit value is stored in a 2D texture that we call the edges working buffer. In the second part, this information is used to identify if the location can be a starting point for a simple or complex aliasing shape and if that is the case it is stored in a global list that we call shape candidate list for later processing.
In the second phase we process the locations identified in Phase 1 and stored the results in the shape candidate list. We spawn one instance of ProcessCandidatesCS per shape candidate using an indirect dispatch on the GPU18 and process each using the following steps:
Figure 14. Simple shape kernel size; 12 edges used for shape classification shown in red
If the resulting classification from Step 2 requires color blending, it is performed between the pixel and its four neighbors and the final color value is stored in a separate list using the StoreColorSample function in the shader code, for processing in the last phase.
Figure 15. Tracing the edges to determine Z-shape arm lengths
Figure 16. Applying anti-aliasing based on the Z-shape
We then observe the area that the assumed geometry edge covers for each pixel so we can determine the amount of under-representation: Zl, Zc and Ls define the triangle on the left side of the Z-shape that is under-represented and equally Zr, Zc and Rs define the triangle on the right. We then attempt to correct the under-representation by assuming that the missing color information is best represented by sampling the values below (left) or above (right) of the traced line and blending the color value of each pixel on a line with it, with the ratio corresponding to the area of the pixel covered by the under-representation triangle. This process (shown as red arrows in Figure 16) accounts for most of the anti-aliasing effect in CMAA2. We store the color values from the previous step in the same way we stored color values from processing the simple shapes, by using the StoreColorSample function with an additional flag to indicate that the values are coming from the complex shapes (Z-shapes), so the last stage can distinguish between the two.
The previous stages used the StoreColorSample function to write the anti-aliased color values they expected to store into the final input/output color buffer. Considering that output pixel locations of these color values can sometimes overlap between different complex shapes and between simple shapes and complex shapes, and the processing happens in parallel on the GPU in a non-deterministic order, we store them in separate per-pixel lists and process them together to ensure determinism. Final color value for each pixel is computed as a weighted average with the values originating from complex shapes processing having 10x the weight compared to those from the simple shapes which is a simple way to ensure that the former, higher quality heuristic has the priority. At the end, the final color values are written to the input/output color buffer.
One major upgrade for CMAA2 is its interoperability with MSAA. An obvious approach would be to simply apply one pass after the other. While this would correctly solve most non-rasterization sources of aliasing since these are not affected by MSAA, it would not do a good job in areas already anti-aliased by MSAA since edge detection and the other heuristics are no longer dealing with the original color values. See Figure 17 for an example.
Figure 17. A zoomed in image of a table leg showing geometric and shader-based aliasing. The geometric aliasing is along the leg itself, and the shadow demonstrates the shader-based aliasing. Top-left: no AA; top-right: 4xMSAA; bottom-left: 4xMSAA+CMAA2 applied sequentially (the obvious approach); bottom-right: 4xMSAA+CMAA2 applied per-plane (a more visually correct approach)
Instead of applying the techniques in two separate passes, we instead apply CMAA2 to the individual MSAA planes before the resolve phase, without taking account of individual MSAA sample offsets as explained in Chapter 4: Interoperability with MSAA. This way anti-aliasing of rasterization gradients is enhanced with what we consider an acceptable amount of unwanted blurring and the shading aliasing artifacts are resolved in the same way as they would be with no MSAA, with no additional blurring.
Another issue to handle with MSAA surfaces is tone mapping which generally applies a non-linear color transform. The result of this is a significant loss of the MSAA effect in high contrast areas when hardware resolve is used. Instead, most modern graphics engines that use MSAA will perform a custom step which resolves MSAA samples after tone mapping [Persson 2008] or perform an inverse luma-weighted blend. In the case of CMAA2 with MSAA, we intercept the custom post-tone mapped resolve to additionally output tone mapped per-sample (not after the multi-sampled resolve) values into a 2D texture array or another multi-sample texture, in addition to regular output for the resolved color buffer. This step can be skipped if tone mapping is not performed. We also create a complexity map texture during tone mapping or separately which is the equivalent of the GL_EXT_shader_samples_identical OpenGL extension20.
Instead of trivially applying CMAA2 on each multi-sample (MS) plane separately and then resolving, each of the three CMAA2 compute shader passes are extended to support the processing of MS surfaces, operating on unresolved MS color buffers as the input and the resolved color buffer as the output, using the information from the complexity map to avoid repeating work on identical samples.
The first part of edge detection can skip subsequent loading of the colors, computing and storing the strength values to shared local memory in cases when the texel and its right and bottom neighbors are classified by the complexity map as all having identical samples. This was implemented in the prototype and in our testing has an execution savings of 30-40% for the CMAA2+4xMSAA scenario.
However, there are a number of potential opportunities for future optimization of the CMAA2+MSAA code, such as having API access to the GL_EXT_shader_samples_identical-like information to avoid computing it manually and being able to efficiently apply tone mapping to the unresolved MSAA color texture, both in order to avoid copying values to intermediate texture array, as well as to enable the usage of the more efficient hardware MSAA resolve.
For improved usability and validation, we implemented a debug draw mode. This allows us to see the edges detected in the Edge Detection phase. This mode was also used in practice to adjust the edge detection threshold and can be enabled by calling DebugDrawEdgesCS after edge detection as shown in the sample code.
While ease of integration and suitability to specific scenarios is important, choice of a post-processing anti-aliasing (AA) technique will depend mainly on quality and performance. We provide full DirectX 11 source code for the sample that, in addition to the test 3D scene, has the ability to load any user image and apply CMAA2, FXAA and SMAA in order to test performance and quality, and to find the best settings. We also provide our analysis based on the Amazon Lumberyard Bistro scene, available from the Open Research Content Archive. The interior part of the scene geometry has not been significantly modified, however in order to keep the assets within practical size the textures were downsized and compressed using block-based compression (BC) and some parts of the scene exterior geometry was removed. The reduction of texture resolution reduces the sharpness and detail of the scene to a degree, but the scene remains representative of an average real-time graphics workload, is easily accessible at under 250 MB, and can run on low-end hardware.
Figure 18. From left to right: Source, FXAA, SMAA and CMAA2.
The FXAA implementation that we use is available at12 and was only modified to consume input luma from a single channel R8 texture using an optimal gather texture sampling codepath. We use default edge threshold and quality settings, with the exception of fxaaQualitySubpix which was adjusted from the default of 0.75 to the value of 0.55 which provides a sharper effect and best peak signal-to-noise-ratio (PSNR) versus ground truth reference numbers, with no change to performance. In the future we plan to also integrate the compute shader FXAA implementation from Microsoft21 as compute shader implementations are more relevant for modern engines.
For SMAA we use the implementation available at GitHub*, upgraded to DirectX 11, at default settings with no changes. We are not aware of a public compute shader implementation but intend to update the code if it becomes available.
Visual inspection shows that, in general, CMAA2 and SMAA produce a sharp image with some degradation of textures and text, with the exception of a couple of fail cases mostly in the synthetic tests, the difference is minimal. FXAA is blurrier in all scenarios, even with fxaaQualitySubpixreduced at the expense of AA quality. The CMAA2 Extra Sharp mode is clearly best with regards to preserving the input image at the expense of handling fewer aliasing artifacts. With regards to anti-aliasing quality, both CMAA2 and SMAA provide clean, visually pleasing results and SMAA has unrivaled handling of diagonals. FXAA provides smooth anti-aliasing effect with loss of sharpness that can be noticeable in some scenarios and acceptable in others.
For numerical analysis we use the Amazon Lumberyard Bistro scene. We compare image differences to the 8xMSAA reference, to the super-sampled reference and to the raw source images. The super-sampled reference has 36 taps per pixel each at 8x MSAA. The numbers provided below are the average of PSNR differences computed at 8 different test scene locations, at 1920x1080 on Nvidia GeForce* GTX 1070. This may not be fully representative of your title, studio or workload. This is one of the reasons we provide a working implementation that can be used to apply the effects on any input image.
|AA Technique||PSNR (db) Compared to|
|SS Reference||Source (no-AA)|
|CMAA2 ES||36.63||CMAA2 ES|
Table 1. Various AA techniques, PSNR (db) from SS reference and from source (no-AA).
We can make a couple of observations based on the numbers in Table 1:
Our conclusion is that all three techniques provide similar PSNR results when measured as the difference to the super sampled reference, however CMAA2 and CMAA2 Extra Sharp modes achieve this with the least amount of change to the source image, thus preserving most of the detail of the original, while FXAA causes the most significant amount of change (mostly noticeable as unwanted over-blurring).
It should be noted that a smaller PSNR difference from the super sampled reference does not necessarily imply a visually less aliased image. For example, a post-process AA technique can (and will) make incorrect assumptions on the underlying geometry and perform reconstruction that deviates from the ground truth but is still visually pleasing. This can still be a net gain and the balance depends on the input content and visual requirements. This is why we added the Extra Sharp mode – we prefer the way default CMAA2 looks as it has less aliasing than the Extra Sharp mode, but in some cases the latter will be preferred. The same point can be made for SMAA and FXAA; there are cases where SMAA's handling of diagonals and other higher-end features produce better looking results and cases where FXAA's loss of sharpness can be acceptable or even desirable, such as for example with anti-aliasing of ray-traced shadows, as demonstrated in Hybrid Ray-Traced Shadows.4
We measure performance by timing a deterministic 4800 frame flythrough of the Amazon Lumberyard* Bistro scene, with the performance cost of each AA technique defined as the delta of the full run with it enabled compared to the baseline with no anti-aliasing. This gives an average added performance cost for each technique per frame. We repeat the process and compute an average out of four runs. The test can be performed with the "Run performance benchmarks" option in the code sample 1.
|Added Performance Cost of AA Techniques, as Average Over 4800 Frames, Milliseconds|
|Nvidia GeForce* GTX 1080||AMD Vega 64||Intel® NUC Kit NUC8i7HVK||Intel® NUC Kit NUC6i7KYK|
|1920x1080||3840x 2160||1920x1080||3840x 2160||1920x1080||3840x 2160||1920x1080|
1 Platform Configuration: Nvidia GeForce* GTX 1080: Windows® 10 64-bit Build 16299, Intel® Core™ i7-6950 processor @ 3.0Ghz, 16384MB RAM, Nvidia Geforce GTX 1080, 16221 MB Display Memory, Driver Version 18.104.22.16835, System Manufacturer and Model: MSI MS-7885. AMD Vega 64: Windows 10 64-bit Build 16299, Intel Core i7-6950 processor @ 3.0Ghz, 16384MB RAM, AMD Radeon RX Vega, 16264MB MB Display Memory, Driver Version 23.20.15033.5003. System Manufacturer and Model: MSI MS-7885. Intel® NUC Kit NUC8i7HVK: Windows 10 Pro-64-bit Intel Core i7-6950 processor @ 3.0Ghz, 16384MB RAM, AMD Radeon Vega M GH Graphics, 8277 MB Display Memory, Driver Version 22.214.171.12473. Intel® NUC Kit NUC6i7HVK: Windows 10 64-bit Build 16299, Intel Core i7-6950 processor @ 3.0Ghz, 16384 MB RAM, Iris® Pro graphics 580, 8264 MB Display Memory, Driver Version: 126.96.36.19999. Source code available at link in whitepaper. Performance is based on current engineering estimates and is subject to change.
Table 2. FXAA, CMAA2 and SMAA performance as measured on the Amazon Lumberyard Bistro scene, various GPUs at 1080p and 4k resolutions
The numbers reveal that CMAA2 is slightly slower than FXAA at lower resolutions but becomes as fast or faster with the resolution going up, depending on the hardware. SMAA is on average roughly 3 times as expensive when compared to FXAA and CMAA2 although there is high variability (2x to 4x) based on the content, which is similar to the findings from the "Decima Engine: Advances in Lighting and AA" presentation20 and the reason behind their decision to use FXAA.
It should be noted that all three approaches have a number of settings that can be used to trade off performance versus quality to a degree.
Last, we look at CMAA2 integration with MSAA. With regards to quality, it provides an additional option to a developer who is relying solely on MSAA by effectively reducing non-geometry aliasing such as inadequately filtered shadow maps or any other shader aliasing as well as enhancing the anti-aliasing of the geometry as shown in Figure 17.
Figure 19. Top left: no AA; Top right: CMAA2 alone; Bottom left: 4xMSAA alone; Bottom right: 4xMSAA+CMAA2. This scene uses exaggerated shadows to demonstrate shader aliasing and is not used for numerical analysis.
The slight downside is the loss of sharpness, as explained in Jimenez's SMAA: Enhanced Subpixel Morphological Antialiasing.12 However, image analysis (Table , PSNR from SS column) shows that in our scenario the overall benefit of applying CMAA per-MSAA plane extends all the way to 8xMSAA, with the 4xMSAA+CMAA2 being closer to ground truth than 8xMSAA alone.
|AA Technique||PSNR From SS||Added Cost, Milliseconds|
|Nvidia GeForce* GTX 1080, 4k||AMD Vega 64, 4k||Intel® NUC Kit NUC6i7KYK, 1080p|
Table 3. MSAA and CMAA2 combined – quality and performance overview
The current implementation of CMAA2 integration with MSAA is an early implementation, however it is still reasonably useful on some hardware such as the AMD Vega 64 and the Intel® NUC Kit NUC6i7KYK where 4xMSAA+CMAA2 is both closer to the ground truth reference and faster when compared to 8xMSAA. On Nvidia GTX 1080 the case is less clear as the MSAA overhead itself is proportionally lower.
It should be noted that the current MSAA+CMAA2 implementation does not yet exploit all potential benefits of avoiding processing individual MSAA samples on pixels where they are all identical. Also, for cases where it does, this information is currently extracted manually and stored in a separate texture for later use in the CMAA2 pass. However, on most graphics hardware this information already exists and in some cases there already are attempts at exposing it, for example: EXT_shader_samples_identical. The other significant source of inefficiency is the need to export MSAA surface samples after tone mapping via the Texture2DArray approach. Instead, if we were able to apply tone mapping to the MSAA surface in-place and then read from it directly we could significantly reduce the integration overhead, all of which is currently measured and shown as Added Performance Cost in Table 2.
In summary, we have described the significantly improved implementation of CMAA using Compute Shader Model 5.0 as well as provided performance and quality testing results. We have shown that it can be used to provide good anti-aliasing while at the same time preserving original image sharpness at a comparatively low performance cost, making it applicable to integrated and mobile GPU solutions in addition to discrete GPUs. In the future we are planning to implement or explore a number of enhancements:
Instructions to build and run a working DirectX11 Sample are made available at GitHub*. We also have an implementation integrated into the Microsoft DirectX 12 Mini-Engine for comparisons with FXAA and to ensure the Compute Shader 5.0 implementations would work unmodified in both versions of the DirectX API. The Compute Shader 5.0 implementation is structured in such a way that should be relatively easy to integrate but we welcome suggestions or contributions to improve the sample code. On the host side there will be a requirement to create and pass in the additional buffers mentioned in the previous sections as well as shown in the CPU-side sample code.
We want to acknowledge several people who supported this work. First, Leigh Davies and Axel Mamode for inspiring and helping bring the original CMAA effort to the light of day. Second, Anupreet Kalra, Sungye Kim, Javier Martinez, and others on the VPG Solutions Team for encouraging us to revisit the original CMAA implementation. Third, team members that remain supportive of the technical work and gave us the time to write-up the new results including Mike Burrows, Josh Doss, Stephen Junkins, and others. Also, we want to acknowledge the previous work by Alexander Reshetov for inspiring an entire new field of AA techniques, as well as the previous work by Jorge Jimenez and Timothy Lottes who made these techniques common in real-time graphics.
John Papadopoulos: DirectX 12 – Execute Indirect Command Further Improves Performance and Greatly Reduces CPU Usage. March 6, 2015.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804