Visual Studio 2010 Built-in CPU Acceleration

While writing the sample code for this post, I was amazed to see how simple it was to reach over a 20x performance improvement with so little effort.

The motivation is a very heavy video processing algorithm created for HD TV. High resolution means many pixels to compute, and it also means that every object in a frame contains several times more pixels than in a regular video image. After getting a working C# implementation (you can try it here: http://www.asyncop.com/clearimage/), a C++ DirectShow codec implementation was needed. The C++ implementation worked very well at a 640 x 480 video resolution and 30 frames per second, which means 640 x 480 x 30 = 9,216,000 pixels to process every second. On top of that, the purpose of the algorithm is to produce a 'photoshop'-like effect by detecting important facial features and smoothing out non-important imperfections, which requires massive amounts of processing power. Moving to HD means at least 1280 x 720 x 30 = 27,648,000 pixels to process, but it also means that every object in the image is made of approximately 4 times more pixels and is surrounded by more pixels. Processing power requirements increase dramatically.

Here is an example of the output as generated by the less sophisticated C# online implementation:   

   Source: [CandleLight source image]

   Product: [CandleLight product image]

I will now survey the alternatives we considered, with pros and cons for each.

When I became an Intel Black Belt this year I received an impressive glass certification plate, licenses to Intel software products, and a brand new laptop: a Dell Latitude 6410 with a Core i7, running the 64-bit version of Windows 7. My old laptop had a dual-core CPU with a slow hard drive, so I didn't even try to install Visual Studio 2010 on it. The new laptop is a completely different story: it runs Visual Studio 2010 with no problems. In fact, an image of my old laptop is running as a Virtual PC inside my new laptop, and it executes much faster than the original device (easily done; email me and I'll tell you how).

The C++ DirectShow implementation of the algorithm was first tested on this new Core i7 machine. This finally allowed live encoding of 640 x 480 video input, but was far from enough for live encoding of HD ('beta' input) video. Here are the possible alternatives we considered:

Logical Optimizations of the Code: This means selecting the best effective internal resolution, color sensitivity, brightness compensation, linear adjustments, etc. Reasoning: This is a way to eliminate pointless calculations and data processing that is not really relevant to the viewer. Pros: This allows maximum performance for given hardware / infrastructure. Cons: Requires great effort to produce; a high risk of bugs requires a long QA period, and overlooking a scenario is very costly, which means a long design process. Product Quality: Very good. Estimated Cost: 6 months.

Parallel Software Design: The algorithm is written sequentially. A good parallel design will allow the best utilization of CPU cores with minimal dependencies and inter-locking. Reasoning: Every CPU has multiple cores; making the most of them can multiply performance several times over. Pros: CPUs are expected to grow in the number of cores, which will allow performance improvements simply by buying new hardware. A good parallel design will also prepare the software for use on multiple machines, for example as part of an HPC array. Cons: This process is sensitive to, and takes into account, the usage of internal data structures. This means that every change in the internal usage of data members and cache will force a redesign, so this stage must come last in development, after the mathematics and all other features are fully closed. It is also possible that this will not be enough, and this solution has a resolution barrier per number of CPU cores. Product Quality: Good, though not as good as with logical optimizations. Estimated Cost: 2 weeks for every fix to the algorithm.

Distribution: It is possible to take the implementation of the algorithm as is and distribute it with minimal parallel optimizations. Reasoning: The idea is to use enough computers to get enough processing power, even if the algorithm is not optimized at all. Higher resolution required? Just add more machines. Pros: Very easy to increase processing power. Cons: Requires the software to support distribution. It is also possible that there is a barrier due to the massive amounts of data exchanged between machines, and latency is not guaranteed. Product Quality: Poor; great increase in storage, power consumption, heat generation, system costs, and LAN infrastructure. Estimated Cost: 2 months of R&D and a dramatic increase in product costs.

At this point we went back to the drawing board. We tried to think of a way to enter the market with what we had, or with minimal R&D effort, until we had a more concrete opportunity. We moved our efforts to the business arena, waiting for a good reason to invest in R&D. As you can see above, this is not a small effort. But then...

Then I came across the Visual Studio 2010 built-in CPU acceleration support. The following steps are amazingly simple and fast to implement. Let's start by adding this option to the list:

Visual Studio 2010 Built-in CPU Acceleration: This option means that we try to get the best performance possible without having to go into the implementation details. Reasoning: You can take my implementation of the algorithm and, with minimal code modifications and project settings, try to squeeze out as much as possible. Pros: Fast, simple to implement, out-of-the-box; very little chance for bugs, and any bugs are relatively simple to detect. Cons: This is not a real optimization of the code. The code will still perform redundant operations, and the parallelization is far from optimal. Product Quality: As good as the latest implementation that we currently have. Estimated Cost: Faster than writing this blog post (less than a day).

To simplify the process I will now create a new project and write down the steps I am taking, one by one. I will not use the real algorithm code. Instead I will take a simple implementation of an operation. A sample video frame of 1280 x 720 pixels is used as input. For every pixel we calculate the square root, multiply the result by itself, and calculate the square root again, so for pixel P: P = sqrt( P ); P = P * P; P = sqrt( P );.

Here is the simple code:

#include <cmath>

void LoadImage(float* pixels);   // fills the buffer with pixel data (implementation not shown)

const int Width  = 1280;
const int Height = 720;
const int SEC    = 120;          // simulate 120 seconds of 30 fps video

int main()
{
   float* pixels = new float[Width * Height];
   LoadImage( pixels );

   for (int frames = 0; frames < 30 * SEC; frames++)
   {
      for (int y = 0; y < Height; y++)
      {
         int Y = y * Width;
         for (int x = 0; x < Width; x++)
         {
            pixels[Y + x] = sqrt( pixels[Y + x] );
            pixels[Y + x] = pixels[Y + x] * pixels[Y + x];
            pixels[Y + x] = sqrt( pixels[Y + x] );
         }
      }
   }

   delete[] pixels;
   return 0;
}

The code generates a call to a CRT function that performs the sqrt calculation; in fact there are several internal calls before the code returns. Total processing time for this is 57.361 sec (57 seconds and 361 milliseconds). Here is the assembly generated. The assembly code looks disorganized because this is a Release build rather than Debug, so compiler optimizations are in effect.

01231055 fld dword ptr [esi]
01231057 call _CIsqrt (1231910h)
0123105C fstp dword ptr [ebp-0Ch]
0123105F fld dword ptr [ebp-0Ch]
01231062 fmul st(0),st
01231064 fstp dword ptr [ebp-0Ch]
01231067 fld dword ptr [ebp-0Ch]
0123106A call _CIsqrt (1231910h)
0123106F fstp dword ptr [ebp-0Ch]
01231072 fld dword ptr [ebp-0Ch]
01231075 add esi,4
01231078 dec edi
01231079 fstp dword ptr [esi-4]
0123107C jne main+55h (1231055h)
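
The timings quoted throughout this post are simple wall-clock measurements around the frame loop. The exact mechanism is not important at this scale; a minimal harness along these lines (my sketch, using clock() from <ctime>, not necessarily the timer used for the original measurements) is enough:

#include <ctime>
#include <cstdio>

// Minimal timing harness (illustrative; any wall-clock timer will do).
clock_t start = clock();

// ... the frame-processing loop shown above ...

clock_t finish = clock();
printf("Total processing time: %.3f sec\n",
       (double)(finish - start) / CLOCKS_PER_SEC);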


 

First Optimization: OpenMP

Go to Project Properties / Configuration Properties / Language and set "Open MP Support" to "Yes".

Here is the modified code: 

for (int frames = 0; frames < 30 * SEC; frames++)
{
   // Split the rows of the frame across all available CPU cores.
   #pragma omp parallel for
   for (int y = 0; y < Height; y++)
   {
      int Y = y * Width;
      for (int x = 0; x < Width; x++)
      {
         pixels[Y + x] = sqrt( pixels[Y + x] );
         pixels[Y + x] = pixels[Y + x] * pixels[Y + x];
         pixels[Y + x] = sqrt( pixels[Y + x] );
      }
   }
}


All four CPU cores are continuously busy and we get a time of 32.276 sec. If we had optimized the code itself (as mentioned in the section above) we should be closer to 1/3 or even 1/4 of the original processing time.
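
If you want to confirm how many threads OpenMP actually uses, a quick sanity check like the following works (my addition, not part of the measured code; omp_get_max_threads and omp_get_num_threads come from <omp.h>):

#include <omp.h>
#include <cstdio>

// Report how many threads OpenMP will use by default.
printf("Max OpenMP threads: %d\n", omp_get_max_threads());

#pragma omp parallel
{
   #pragma omp master
   printf("Threads in this parallel region: %d\n", omp_get_num_threads());
}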

OpenMP is simple, and most of you have probably heard of it before. Now here is something even simpler that may surprise you.

 

Second Optimization: SSE2

Instead of adding SSE support to the OpenMP version, I will revert to the original code (before OpenMP) and add SSE support. SSE stands for Streaming SIMD Extensions, where SIMD is Single Instruction Multiple Data. There are several generations of these extensions, the latest of which today is called AVX. The idea is that most CPU (assembly) operations work on a single data variable: a float, byte, word, double word, etc. SIMD operations started with MMX and allow, for example, incrementing eight byte-sized variables by one with a single CPU instruction. This works by using special CPU registers, each of which can store more than a single variable.
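
To make the idea concrete, here is a tiny sketch (my illustration, not part of the algorithm) that adds 1.0 to four floats with a single SSE instruction:

#include <emmintrin.h>   // SSE/SSE2 intrinsics

// One SIMD instruction performs four additions at once.
float values[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

__m128 v   = _mm_loadu_ps(values);   // load 4 floats into one 128-bit register
__m128 one = _mm_set1_ps(1.0f);      // broadcast 1.0f into all 4 lanes
v = _mm_add_ps(v, one);              // 4 additions in a single instruction
_mm_storeu_ps(values, v);            // values is now { 2, 3, 4, 5 }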

Here is the procedure: go to Project Properties / Configuration Properties / Code Generation and set "Enable Enhanced Instruction Set" to "Streaming SIMD Extensions 2".

This activates the SIMD feature in the compiler. Rebuilding the application produces new code: the CRT function calls are replaced with single CPU instructions. Note that we are back to the original (pre-OpenMP) source code. The assembly now looks like this:

00D71055 movss xmm0,dword ptr [eax]
00D71059 cvtps2pd xmm0,xmm0
00D7105C sqrtsd xmm0,xmm0
00D71060 cvtpd2ps xmm0,xmm0
00D71064 cvtss2sd xmm0,xmm0
00D71068 mulsd xmm0,xmm0
00D7106C cvtpd2ps xmm0,xmm0
00D71070 cvtss2sd xmm0,xmm0
00D71074 sqrtsd xmm0,xmm0
00D71078 cvtpd2ps xmm0,xmm0
00D7107C movss dword ptr [eax],xmm0
00D71080 add eax,4
00D71083 dec ecx
00D71084 jne main+55h (0D71055h)

The part of the code that remained the same belongs to the for loop. We made zero modifications to the source code. Now let's see how the new code behaves. The original version took 57.361 sec. Running the same application with SSE2 turned on results in an amazing 15.320 sec!

Microsoft Visual Studio 2010 can also automatically generate AVX code. For this you need to cancel SSE2 support (set "Enable Enhanced Instruction Set" back to "Not Set"), go to Project Properties / Configuration Properties / Command Line, and manually add "/arch:AVX" to "Additional Options".

You can find out more about these commands here: /en-us/avx/

 

But wait... there is even more! The compiler does not apply all instructions automatically. What we just did above was ask the compiler to use the CPU acceleration for the square root computation. But the CPU can also perform this operation on several floating point variables at the same time, using the same CPU core.

Third Optimization: SSE Packed Operations

We add support for the CPU intrinsic functions by adding the header: #include <emmintrin.h>.

Now we modify the code so that the pixel buffer, an array of floating point values, is treated as an array of packed floating point values. Packed means that several different and unrelated values are packed into a single variable. Here is the modified code:

for (int frames = 0; frames < 30 * SEC; frames++)
{
    for (int y = 0; y < Height; y++)
    {
        int Y = y * Width;
        for (int x = 0; x < Width; x += 4)
        {
            // Treat each group of four floats as one packed __m128 value.
            // Note: dereferencing a __m128* requires the buffer to be
            // 16-byte aligned (e.g., allocated with _aligned_malloc).
            __m128* p = (__m128*)(pixels + Y + x);
            *p = _mm_sqrt_ps( *p );
            *p = _mm_mul_ps( *p, *p );
            *p = _mm_sqrt_ps( *p );
        }
    }
}

The type __m128 is recognized by the compiler as a packed type of four float values. The CPU now operates on four pixels per instruction instead of one, so we advance the loop variable x by 4 instead of 1. You can learn more about these intrinsic functions by pressing F1 in Visual Studio 2010 or by searching the MSDN Library; there is a full listing of the supported operations.

When we execute the code now we get the amazing result of 3.729 sec!
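
On AVX-capable hardware the same packed idea extends from four floats to eight, using the 256-bit registers mentioned earlier. A minimal sketch of the inner loop with AVX intrinsics (my illustration, assuming #include <immintrin.h>, the /arch:AVX switch, and an AVX-capable CPU) could look like this:

#include <immintrin.h>   // AVX intrinsics

// Process eight floats per instruction instead of four.
for (int x = 0; x < Width; x += 8)
{
   __m256 v = _mm256_loadu_ps(pixels + Y + x);   // load 8 floats
   v = _mm256_sqrt_ps(v);                        // 8 square roots at once
   v = _mm256_mul_ps(v, v);                      // 8 multiplications at once
   v = _mm256_sqrt_ps(v);
   _mm256_storeu_ps(pixels + Y + x, v);          // store 8 results
}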

Adding OpenMP support to the packed SSE version reduces the time even further, down to 2.246 sec. Such a reduction means that two CPU cores are enough, so we can tell OpenMP to use only two cores and still get the same result, freeing up two cores for other applications. (We do this by adding #include <omp.h> and calling omp_set_num_threads(2); before our code.)
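
Putting it all together, here is a minimal sketch of the final combined version (my reconstruction of the steps above, not a verbatim copy of the measured code):

#include <emmintrin.h>   // SSE2 packed intrinsics
#include <omp.h>         // OpenMP runtime functions

omp_set_num_threads(2);  // two cores are enough; leave the rest free

for (int frames = 0; frames < 30 * SEC; frames++)
{
   #pragma omp parallel for
   for (int y = 0; y < Height; y++)
   {
      int Y = y * Width;
      for (int x = 0; x < Width; x += 4)
      {
         __m128* p = (__m128*)(pixels + Y + x);  // four pixels at a time
         *p = _mm_sqrt_ps( *p );
         *p = _mm_mul_ps( *p, *p );
         *p = _mm_sqrt_ps( *p );
      }
   }
}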

 

Conclusion

Instead of doing a massive code redesign and optimization, which requires a good understanding of the logic behind the code and is very bug prone, we changed a few compiler switches and replaced C Runtime (CRT) Library calls, without even needing to understand what the code does, and got a performance increase that reduced the execution time in the example above from 57.361 seconds to 2.246 seconds. This can be done very quickly by a programmer who has very little knowledge of the code's implementation.

Fast and simple code changes resulted in a 25x increase in performance, with two CPU cores left free for other work. That is the equivalent of a cluster of 50 machines!

Were you looking to parallel programming to make your application perform? Don't underestimate AVX!

 
