OpenGL* Performance Tips: Power of Two Textures Have Better Performance

By Praveen K Kundurthy

Published: 07/04/2016   Last Updated: 07/04/2016

Download the Code Sample from Github*

Introduction

This article discusses how to improve OpenGL* performance by using textures whose dimensions are powers of two. It is accompanied by a C++ example application that shows the effect on rendering performance of using a texture with power-of-two dimensions versus one without. The article describes the technique, explains why it works, and shows how a game developer would use it in code.

Note: While this article refers to graphical game developers, the concepts apply to all applications that use OpenGL 4.3 and higher. The sample code is written in C++ and is designed for Windows* 8.1 and Windows® 10 devices.

Requirements

The following are required to build and run the example application:

  • A computer with a 6th Generation Intel® Core™ processor
  • OpenGL 4.3 or higher
  • Microsoft Visual Studio* 2013 or newer

6th Generation Intel® Core™ Processor Graphics

The 6th Generation Intel Core processors provide superior two- and three-dimensional graphics performance, reaching up to 1152 GFLOPS. Their multicore architecture improves performance and increases the number of instructions per clock cycle.

The 6th Generation Intel Core processors offer a number of new benefits over previous generations and provide significant boosts to overall computing horsepower and visual performance. Sample enhancements include a GPU that, coupled with the CPU's added computing muscle, provides up to 40 percent better graphics performance over prior Intel® Processor Graphics. The 6th Generation Intel Core processors have been redesigned to offer higher-fidelity visual output, higher-resolution video playback, and more seamless responsiveness for systems with lower power usage. With support for 4K video playback and extended overclocking, 6th Generation Intel Core processors are ideal for game developers.

GPU memory access includes atomic min, max, and compare-and-exchange for 32-bit floating point values in either shared local memory or global memory. The new architecture also offers a performance improvement for back-to-back atomics to the same address. Tiled resources include support for large, partially resident (sparse) textures and buffers. Reading unmapped tiles returns zero, and writes to them are discarded. There are also new shader instructions for clamping LOD and obtaining operation status. There is now support for larger texture and buffer sizes. (For example, you can use up to 128k x 128k x 8B mipmapped 2D textures.)

Bindless resources increase the number of dynamic resources a shader may use, from about 256 to 2,000,000 when supported by the graphics API. This change reduces the overhead associated with updating binding tables and provides more flexibility to programmers.

Execution units have improved native 16-bit floating point support as well. This enhanced floating-point support leads to both power and performance benefits when using half precision.

Display features further offer multiplane overlay options with hardware support to scale, convert, color correct, and composite multiple surfaces at display time. Surfaces can additionally come from separate swap chains using different update frequencies and resolutions (for example, full-resolution GUI elements composited on top of up-scaled, lower-resolution frame renders) to provide significant enhancements.

The graphics architecture supports GPUs with up to three slices (providing 72 EUs). It also offers increased power gating and clock domain flexibility, which are well worth taking advantage of.

Lesson 1: Power of Two Textures Have Better Performance

Before OpenGL 2.0, texture dimensions were required to be powers of two. Since the release of OpenGL 2.0, developers have had the option of using textures with arbitrary dimensions. There are still advantages to using textures with power-of-two dimensions, however:

  • Because interpolation of float numbers can be done very quickly with power-of-two textures, these textures will render faster than ones that are not a power of two. The amount of this difference varies depending upon the GPU, and with modern GPUs this difference may be small, but you can see for yourself using the accompanying application.
  • Non-power-of-two textures can waste RAM when they are padded up to the next power-of-two dimension, because they do not use the entire allocated space.
  • This padding may leave edging artifacts in the resulting image.
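
To get a feel for how much memory such padding could cost, the following is a minimal sketch, not part of the sample application. The helper names (isPow2, ceilPow2, paddedWasteBytes) are hypothetical, and the padding model (each dimension rounded up independently, 4 bytes per RGBA8 texel) is an assumption; actual drivers and GPUs may allocate differently or not pad at all.

```cpp
#include <cassert>
#include <cstdint>

// True when v is a power of two (1, 2, 4, 8, ...). Zero is not.
inline bool isPow2(std::uint32_t v) {
    return v != 0 && (v & (v - 1)) == 0;
}

// Smallest power of two >= v (returns 1 for v == 0).
// Assumes the result fits in 32 bits, i.e. v <= 2^31.
inline std::uint32_t ceilPow2(std::uint32_t v) {
    std::uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

// Bytes left unused if a w x h RGBA8 image were stored in an allocation
// padded to ceilPow2(w) x ceilPow2(h). Hypothetical padding model.
inline std::uint64_t paddedWasteBytes(std::uint32_t w, std::uint32_t h) {
    assert(w > 0 && h > 0);
    const std::uint64_t used   = std::uint64_t(w) * h * 4;
    const std::uint64_t padded = std::uint64_t(ceilPow2(w)) * ceilPow2(h) * 4;
    return padded - used;
}
```

Under this model, a 640 x 480 image would occupy a 1024 x 512 allocation, leaving 868,352 of 2,097,152 bytes unused, while a 512 x 512 image would waste nothing.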

The example application displays an image rendered using either a power-of-two or a non-power-of-two texture. The current performance for each, in milliseconds-per-frame and frames-per-second, is printed in the console window. Pressing the spacebar toggles between the two textures so you can compare the two approaches.
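
The two reported figures are two views of the same measurement. As a minimal sketch, the conversion from a frame count and an elapsed time in microseconds can be written as below; the FrameStats struct and frameStats function are hypothetical names, while the arithmetic follows the expressions used in the sample's printf call.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the two figures the sample prints.
struct FrameStats {
    double fps;         // frames per second
    double msPerFrame;  // milliseconds per frame
};

// Convert a frame count and elapsed microseconds into both figures.
// Both arguments must be non-zero.
inline FrameStats frameStats(std::uint64_t frames, std::uint64_t elapsedUs) {
    assert(frames > 0 && elapsedUs > 0);
    return { frames * 1000000.0 / elapsedUs,     // frames / seconds
             elapsedUs / (frames * 1000.0) };    // milliseconds / frames
}
```

For example, 125 frames rendered in 2,000,000 microseconds gives 62.5 frames-per-second and 16 milliseconds-per-frame. Milliseconds-per-frame is usually the better figure for comparing two approaches, because differences in it are linear in the work done per frame.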

Building and Running the Application

Follow these steps to compile and run the example application.

  1. Download the ZIP file containing the source code for the example application, and then unpack it into a working directory.
  2. Open the lesson1_pow2textures/lesson1.sln file in Microsoft Visual Studio 2013.
  3. Select <Build>/<lesson1_pow2textures> to build the application.
  4. Upon a successful build, you can run the example from within Visual Studio.

Once the application is running, a main window opens and you will see the power-of-two or non-power-of-two image, with the performance measurements in the Microsoft Visual Studio 2013 console window. Press the spacebar to toggle between the two images and compare the performance difference. Press ESC to exit the application.

Code Highlights

The code for this example is straightforward, but there are a few items to highlight.

In the initialization routine we turn off vsync to get a more accurate report of performance, compile a simple program to display the image, and load both a non-power-of-two and a power-of-two texture into VRAM.

void init()
{
    versionCheck();

    // turn off vsync; break into the debugger if the call fails
    if (!wglSwapIntervalEXT(0))                                                   __debugbreak();

    // compile and link the shaders into a program, make it active
    vShader = compileShader(vertexShader,   GL_VERTEX_SHADER);
    fShader = compileShader(fragmentShader, GL_FRAGMENT_SHADER);
    program = createProgram({ vShader, fShader });
    offset = glGetUniformLocation(program, "offset");                             GLCHK;
    texUnit = glGetUniformLocation(program, "texUnit");                           GLCHK;
    glUseProgram(program);                                                        GLCHK;

    // configure texture unit
    glActiveTexture(GL_TEXTURE0);                                                 GLCHK;
    glUniform1i(texUnit, 0);                                                      GLCHK;

    // create and configure the textures
    glGenTextures(_countof(texture), texture);                                    GLCHK;
    for (int i = 0; i < _countof(texture); ++i) {
        glBindTexture(GL_TEXTURE_2D, texture[i]);                                 GLCHK;
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);             GLCHK;
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);             GLCHK;
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);        GLCHK;
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);        GLCHK;
    }

    // load texture image (lodepng::decode returns non-zero on error)
    std::vector<GLubyte> img;
    if (lodepng::decode(img, w, h, "sample.png"))                                 __debugbreak();

    // upload the non-power of 2 image to vram
    glBindTexture(GL_TEXTURE_2D, texture[0]);                                     GLCHK;
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE,
                 &img[0]);                                                        GLCHK;

    // create a pow2 scaled version; pow2(v) returns the power of two strictly
    // greater than v, and gluScaleImage returns non-zero on error
    auto pow2 = [](unsigned v) { int p = 2; while (v >>= 1) p <<= 1; return p; };
    w2 = h2 = max(pow2(w), pow2(h)); std::vector<GLubyte> img2(w2 * h2 * 4);
    if (gluScaleImage(GL_RGBA, w, h, GL_UNSIGNED_BYTE, &img[0], w2, h2,
        GL_UNSIGNED_BYTE, &img2[0]))                                              __debugbreak();

    // upload the pow2 image to vram
    glBindTexture(GL_TEXTURE_2D, texture[1]);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w2, h2, 0, GL_RGBA, 
                 GL_UNSIGNED_BYTE, &img2[0]);   GLCHK;
}
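
One detail of the listing above is worth a closer look: the pow2 lambda returns the power of two strictly greater than its argument, so an image that is already 256 pixels wide would be scaled to 512. The sketch below reproduces that logic as a free function and adds a hypothetical variant, pow2AtLeast, that leaves exact powers of two unchanged. The function names, the unsigned return type, and the variant itself are illustrative assumptions, not part of the sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Mirrors the sample's pow2 lambda: the power of two strictly greater than v.
inline unsigned pow2Above(unsigned v) {
    unsigned p = 2;
    while (v >>= 1) p <<= 1;
    return p;
}

// Hypothetical variant: smallest power of two >= v, so a dimension that is
// already a power of two is left unchanged.
inline unsigned pow2AtLeast(unsigned v) {
    return v <= 1 ? 1u : pow2Above(v - 1);
}

// Side length of the square texture the sample scales to (w2 = h2).
inline unsigned scaledSide(unsigned w, unsigned h) {
    return std::max(pow2Above(w), pow2Above(h));
}

// Bytes needed for a side x side RGBA8 image (4 bytes per texel), matching
// the size of the img2 buffer in the listing.
inline std::size_t rgbaBytes(unsigned side) {
    assert(side > 0);
    return static_cast<std::size_t>(side) * side * 4;
}
```

With these helpers, a 640 x 480 source image maps to a 1024 x 1024 texture that needs 4,194,304 bytes, so the scaled texture carries considerably more texels than the original. That is worth keeping in mind when comparing the two timings.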

The idle function is called once per video frame. After an initial warm-up of about 512 calls, it measures performance over intervals of at least two seconds and prints the result in the console. It also checks whether the spacebar or ESC has been pressed: pressing the spacebar requests a texture swap, and pressing ESC exits the application. When the textures are swapped, the performance measurements are reset and the image animates for one second as a visual indicator that something has changed. Finally, the function requests the next frame.

// GLUT idle function.  Called once per video frame.  Calculate and print timing
// reports and handle console input.
void idle()
{
    // Calculate performance
    static unsigned __int64 skip;  if (++skip < 512) return;
    static unsigned __int64 start; if (!start &&
            !QueryPerformanceCounter((PLARGE_INTEGER)&start))
                               __debugbreak();
    unsigned __int64 now;  if (!QueryPerformanceCounter((PLARGE_INTEGER)&now))
                               __debugbreak();
    unsigned __int64 us = elapsedUS(now, start), sec = us / 1000000;
    static unsigned __int64 animationStart;
    static unsigned __int64 cnt; ++cnt;

    // We're either animating
    if (animating)
    {
        float sec = elapsedUS(now, animationStart) / 1000000.f; if (sec < 1.f) {
            animation = (sec < 0.5f ? sec : 1.f - sec) / 0.5f;
        } else {
            animating = false;
            selector ^= 1; skip = 0;
            cnt = start = 0;
            print();
        }
    }

    // Or measuring
    else if (sec >= 2)
    {
        printf("frames rendered = %I64u, uS = %I64u, fps = %f,  "
               "milliseconds-per-frame = %f\n", cnt, us,
               cnt * 1000000. / us, us / (cnt * 1000.));
        if (swap) {
            animating = true; animationStart = now; swap = false;
        } else {
            cnt = start = 0;
        }
    }

    // Get input from the console too.
    HANDLE h = GetStdHandle(STD_INPUT_HANDLE); INPUT_RECORD r[128]; DWORD n;
    if (PeekConsoleInput(h, r, 128, &n) && n)
        if (ReadConsoleInput(h, r, n, &n))
            for (DWORD i = 0; i < n; ++i)
                if (r[i].EventType == KEY_EVENT && r[i].Event.KeyEvent.bKeyDown)
                    keyboard(r[i].Event.KeyEvent.uChar.AsciiChar, 0, 0);

    // Ask for another frame
    glutPostRedisplay();
}
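
The animation value computed in the listing above is a triangle ramp over the one-second swap: it rises from 0 to 1 during the first half second and falls back to 0 during the second. A minimal sketch of that ramp as a standalone function follows; the name swapAnimation is hypothetical, and returning 0 once the second has elapsed is an assumption, since the sample simply stops updating the value at that point.

```cpp
#include <cassert>

// Triangle ramp for the one-second texture swap animation.
// sec is the time in seconds since the animation started.
inline float swapAnimation(float sec) {
    assert(sec >= 0.f);              // elapsed time is never negative
    if (sec >= 1.f) return 0.f;      // assumption: idle value after the swap
    return (sec < 0.5f ? sec : 1.f - sec) / 0.5f;
}
```

For instance, the ramp is 0.5 at a quarter second, peaks at 1 at half a second, and is back to 0.5 at three quarters of a second.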

Download Code Sample

Below is the link to the code samples on GitHub:

https://github.com/IntelSoftware/OpenGLBestPracticesfor6thGenIntelProcessor

Conclusion

Modern GPUs, such as those in 6th Generation Intel Core processors, have reduced the need for game developers to worry about whether their textures have power-of-two dimensions. Using power-of-two textures is still a good OpenGL practice, however, and it can help both rendering performance and RAM utilization, which any game developer should care about.

By combining this technique with the advantages of the 6th Generation Intel Core processors, graphic game developers can ensure their games perform the way they were designed.

Notes

[1] March, Meghana R., “An Overview of the 6th generation Intel® Core™ processor (code name Skylake).” March 23, 2016. https://software.intel.com/content/www/us/en/develop/articles/an-overview-of-the-6th-generation-intel-core-processor-code-named-skylake.html
