Practical investigation of critical sections

Recently we have read the post "Managing Lock Contention: Large and Small Critical Sections" where the author touches upon the question of optimizing critical sections. I am not going to retell this post here but I would like to note that the author gives there some examples of how useful it is to unite sometimes several code fragments with critical sections into one large critical section. For example, instead of

Begin Thread Function ()
  Initialize ()
  BEGIN CRITICAL SECTION 1
    UpdateSharedData1 ()
  END CRITICAL SECTION 1
  DoFunc1 ()
  BEGIN CRITICAL SECTION 2
    UpdateSharedData2 ()
  END CRITICAL SECTION 2
  DoFunc2 ()
End Thread Function ()

you may write in this way

Begin Thread Function ()
  Initialize ()

  BEGIN CRITICAL SECTION 1
    UpdateSharedData1 ()
    DoFunc1 ()
    UpdateSharedData2 ()
  END CRITICAL SECTION 1

  DoFunc2 ()
End Thread Function ()

We decided to carry out some experiments and see if and how the theory meets the practice and what performance gains we may expect. For this purpose we made a test program that contains some functions with various synchronization implementations.

Here is what the program does: we calculate two sums (msum1, msum2). As this is performed in parallel, we need to protect these variables during the write. Like it was shown in the example from the article "Managing Lock Contention: Large and Small Critical Sections", we use two critical sections at first. Then we unite these sections into one critical section. Note that after this step the function of calculating the n-th root gets included into the section - it is a rather resource-intensive operation. Then we carry out some experiments.

Description of functions with various implementations of synchronization:

Foo1 - We use two critical sections:

#pragma omp critical
{
  msum1 += m1;
}
double m2 = pow(m1, 1.0 / h);
#pragma omp critical
{
  msum2 += m2;
}

Foo2 - We use one critical section. The n-th root is calculated inside the critical section:

#pragma omp critical
{
  msum1 += m1;
  double m2 = pow(m1, 1.0 / h);
  msum2 += m2;
}

Foo3 - We use one critical section. The n-th root is calculated outside the critical section:

double m2 = pow(m1, 1.0 / h);
#pragma omp critical
{
  msum1 += m1;
  msum2 += m2;
}

Foo4 - We use atomic operations (atomic directive):

#pragma omp atomic
msum1 += m1;

double m2 = pow(m1, 1.0 / h);

#pragma omp atomic
msum2 += m2;

Foo5 - We use reduction directive:

#pragma omp parallel for reduction(+:msum1, msum2)
...
msum1 += m1;
double m2 = pow(m1, 1.0 / h);
msum2 += m2;

The full text of the program is given below. You may also download the project for Visual Studio 2005.

#include <stdio.h>
#include <tchar.h>
#include <windows.h>
#include <math.h>
#include <omp.h>
#include <iostream>
#include <float.h>

class Timing {
public:
  void StartTiming();
  void StopTiming();
  double GetUserSeconds() const {
    return value;
  }
private:
  DWORD_PTR oldmask;
  double value;
  LARGE_INTEGER time1;
};

void Timing::StartTiming()
{         
  SetThreadAffinityMask(::GetCurrentThread(), 0);
  QueryPerformanceCounter(&time1);
}  

void Timing::StopTiming()
{  
  LARGE_INTEGER performance_frequency, time2;
  QueryPerformanceFrequency(&performance_frequency);
  QueryPerformanceCounter(&time2);  
  SetThreadAffinityMask(::GetCurrentThread(), oldmask);
  value = (double)(time2.QuadPart - time1.QuadPart);
  value /= performance_frequency.QuadPart;
}

double **CreateArray(ptrdiff_t w,
                     ptrdiff_t h)
{
  double **array = (double **)
    malloc(w * sizeof(double *));
  if (array == NULL)
    throw std::exception("Error allocate memory");
  for (ptrdiff_t x = 0; x != w; ++x)
  {
    array[x] =
      (double *)malloc(h * sizeof(double));
    if (array[x] == NULL)
      throw std::exception("Error allocate memory");

    for (ptrdiff_t y = 0; y != h; ++y)
      array[x][y] = (rand() % 100) / 100.0;
  }
  return array;
}

double Foo1(double **array,
            ptrdiff_t w, ptrdiff_t h)
{
  double msum1 = 0;
  double msum2 = 0;

  #pragma omp parallel for
  for (ptrdiff_t x = 0; x < w; x++)
  {
    double m1 = 1;
    for (ptrdiff_t y = 0; y != h; ++y)
      m1 *= array[x][y];

    #pragma omp critical
    {
      msum1 += m1;
    }
    double m2 = pow(m1, 1.0 / h);
    #pragma omp critical
    {
      msum2 += m2;
    }
  }

  return msum1 + msum2;
}

double Foo2(double **array,
            ptrdiff_t w, ptrdiff_t h)
{
  double msum1 = 0;
  double msum2 = 0;

  #pragma omp parallel for
  for (ptrdiff_t x = 0; x < w; x++)
  {
    double m1 = 1;
    for (ptrdiff_t y = 0; y != h; ++y)
      m1 *= array[x][y];

    #pragma omp critical
    {
      msum1 += m1;
      double m2 = pow(m1, 1.0 / h);
      msum2 += m2;
    }
  }

  return msum1 + msum2;
}

double Foo3(double **array,
            ptrdiff_t w, ptrdiff_t h)
{
  double msum1 = 0;
  double msum2 = 0;

  #pragma omp parallel for
  for (ptrdiff_t x = 0; x < w; x++)
  {
    double m1 = 1;
    for (ptrdiff_t y = 0; y != h; ++y)
      m1 *= array[x][y];

    double m2 = pow(m1, 1.0 / h);
    #pragma omp critical
    {
      msum1 += m1;
      msum2 += m2;
    }
  }

  return msum1 + msum2;
}

double Foo4(double **array,
            ptrdiff_t w, ptrdiff_t h)
{
  double msum1 = 0;
  double msum2 = 0;

  #pragma omp parallel for
  for (ptrdiff_t x = 0; x < w; x++)
  {
    double m1 = 1;
    for (ptrdiff_t y = 0; y != h; ++y)
      m1 *= array[x][y];

    #pragma omp atomic
    msum1 += m1;

    double m2 = pow(m1, 1.0 / h);

    #pragma omp atomic
    msum2 += m2;
  }

  return msum1 + msum2;
}

double Foo5(double **array,
            ptrdiff_t w, ptrdiff_t h)
{
  double msum1 = 0;
  double msum2 = 0;

  #pragma omp parallel for reduction(+:msum1, msum2)
  for (ptrdiff_t x = 0; x < w; x++)
  {
    double m1 = 1;
    for (ptrdiff_t y = 0; y != h; ++y)
      m1 *= array[x][y];

    msum1 += m1;
    double m2 = pow(m1, 1.0 / h);
    msum2 += m2;
  }

  return msum1 + msum2;
}

int _tmain(int, _TCHAR*)
{
  const ptrdiff_t W = 100;
  const ptrdiff_t H = 200;
  double **array = CreateArray(W, H);

  const ptrdiff_t TestsCount = 50000;
  Timing t;
  t.StartTiming();
  double result = 0;
  for (ptrdiff_t i = 0; i != TestsCount; ++i)
    result = Foo1(array, W, H);
  t.StopTiming();
  printf("Foo1 - two critical sections\n");
  printf("Foo1 return: %G\n", result);
  printf("Foo1 time = %.3G seconds.\n\n",
    t.GetUserSeconds());

  t.StartTiming();
  for (ptrdiff_t i = 0; i != TestsCount; ++i)
    result = Foo2(array, W, H);
  t.StopTiming();
  printf("Foo2 - one critical sections\n");
  printf("Foo2 return: %G\n", result);
  printf("Foo2 time = %.3G seconds.\n\n",
    t.GetUserSeconds());

  t.StartTiming();
  for (ptrdiff_t i = 0; i != TestsCount; ++i)
    result = Foo3(array, W, H);
  t.StopTiming();
  printf("Foo3 - one critical sections + optimize\n");
  printf("Foo3 return: %G\n", result);
  printf("Foo3 time = %.3G seconds.\n\n",
    t.GetUserSeconds());

  t.StartTiming();
  for (ptrdiff_t i = 0; i != TestsCount; ++i)
    result = Foo4(array, W, H);
  t.StopTiming();
  printf("Foo4 - atomic\n");
  printf("Foo4 return: %G\n", result);
  printf("Foo4 time = %.3G seconds.\n\n",
    t.GetUserSeconds());

  t.StartTiming();
  for (ptrdiff_t i = 0; i != TestsCount; ++i)
    result = Foo5(array, W, H);
  t.StopTiming();
  printf("Foo5 - reduction\n");
  printf("Foo5 return: %G\n", result);
  printf("Foo5 time = %.3G seconds.\n\n",
    t.GetUserSeconds());
  return 0;
}
ПРИМЕЧАНИЕ

 

The program measures the speed of each function. Here are the results we got when compiling the application with Intel® C++ Compiler 11.1.071 and launching it on a two-core processor:

Foo1 - two critical sections
Foo1 time = 2.21 seconds.

Foo2 - one critical sections
Foo2 time = 1.66 seconds.

Foo3 - one critical sections + optimize
Foo3 time = 1.48 seconds.

Foo4 - atomic
Foo4 time = 1.69 seconds.

Foo5 - reduction
Foo5 time = 0.863 seconds.

Summary

The results show that the code with two critical sections is the slowest (2.21 seconds). If you unite two critical sections and include the operation of the n-th root calculation into it, the time is reduced to 1.66 seconds. I repeat once again that this operation is rather expensive. If you take it out from the critical section, the speed increases by 1.66 - 1.48 = 0.18 seconds. Thus, we got in practice the evidence of a great speed gain from reducing the number of critical sections.

We also carried out an experiment using more efficient synchronization means. At first we supposed that atomic operations might provide an additional benefit. But in practice the speeds of atomic directive and one critical section almost coincide (the former - 1.69 seconds, the latter - 1.66 seconds). Moreover, if you take the n-th root calculation operation out of the critical section, it turns out that such a function is much more efficient than two atomic directives. The time of executing the function with the optimized critical section is 1.48 seconds while in the case of two atomic operations it is 1.69 seconds.

Reduction directive showed the best result (0.86 seconds).

Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.