Have your cake and eat it too - part 2

In the first part of this article I described how you can “have your cake and eat it too” with respect to using, or not using, Hyper Threading on a per-construct basis, by way of thread team selection attributes on the parallel_for construct in the QuickThread® programming tool kit.

In this part I will describe the test bed application and the results as run on an Intel Core i7 2600K Sandy Bridge (no overclocking).

It took me a while to work out how to write a single test bed application that could explore the performance-related issues across permutations of:

Hyper Threading On/Off
Turbo On/Off
Floating Point non-memory load intensive task
Integer L1 Instruction Cache Read intensive task
Integer L1 Data Cache Read intensive task
Integer L1 Data Cache Read/Write intensive task
Integer L1 Data Cache Write intensive task

Running integer and floating point serially, to establish a baseline for scaling.

Running integer and floating point in parallel, one loop after another, as you might with conventional programming.

Running the floating point task in parallel, isolated to one HT sibling within each core, concurrently with the integer task run in parallel, isolated to the other HT sibling within each core.


The iteration counts of the for loops were normalized such that each task class (floating point or integer) would consume approximately the same run time, with the serial execution time adjusted to approximately 30 seconds. The same loop parameters are then used in the parallel loops to obtain the relative performance information. Note that the parallel tasks are not a single parallel_for loop run once; rather, the parallel_for loop is run many times.
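
As a rough illustration of what that normalization amounts to (this is not the actual test bed code; the helper name and the use of std::chrono are my own), a calibration pass might look something like this:

#include <chrono>

// Hypothetical calibration sketch: grow the iteration count until one
// serial run of the task takes roughly the target time (~30 seconds here).
__int64 calibrateIterations(__int64 (*task)(__int64 iterations), double targetSeconds)
{
  __int64 iterations = 1;
  for(;;)
  {
    auto t0 = std::chrono::steady_clock::now();
    task(iterations); // e.g. a loop that calls SerialFib2(45) 'iterations' times
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - t0;
    if(elapsed.count() >= targetSeconds)
      return iterations;
    iterations *= 2; // double the work and try again
  }
}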

The intent of this test is to illustrate how the data access pattern affects integer code run concurrently with floating point code, in various combinations of HT and Turbo, on Sandy Bridge.

The integer tasks were derivatives of the serial code that computes the Fibonacci series to the n’th term:

__int64 SerialFib( __int64 n )
{
  if( n<2 )
    return n;
  else
    return SerialFib(n-1)+SerialFib(n-2);
}


The code as written above is not suitable for this test bench program because it is call-stack intensive; virtually all of the compute time is spent calling the function (writing and reading the return address on the stack). The Fibonacci series can also be written iteratively, which may run ~9 orders of magnitude faster than the call-intensive version:

__int64 SerialFib2( __int64 n )
{
  if( n<2 )
    return n;
  __int64 fib0 = 1; // most recent
  __int64 fib1 = 1; // 1 prior to most recent
  __int64 fib2 = 0; // 2 prior to most recent
  __int64 i;
  for(i=2; i<n; ++i)
  {
    fib2 = fib1;
    fib1 = fib0;
    fib0 = fib1 + fib2;
  }
  return fib0;
}


This code is not recursive, and all of the variables will be register-ized (kept in CPU registers).

The processing of the loop will be L1 Instruction Cache intensive:

;;; for(i=2; i<n; ++i)
mov edx, 2
mov r8, rax
jle .B43.6
.B43.4::
;;; {
;;; fib2 = fib1;
mov r9, r8
inc rdx
;;; fib1 = fib0;
mov r8, rax
;;; fib0 = fib1 + fib2;
add rax, r9
cmp rdx, rcx
jl .B43.4
.B43.6::
;;; }
;;; return fib0;
ret


A second derivative function was written to insert L1 Data Cache Read loads:

__int64 SerialFib3( __int64 n , volatile __int64* zero)
{
  if( n<2 )
    return n;
  __int64 fib0 = 1; // most recent
  __int64 fib1 = 1; // 1 prior to most recent
  __int64 fib2 = 0; // 2 prior to most recent
  __int64 i;
  for(i=2; i<n; ++i)
  {
    fib2 = fib1 + *zero;
    fib1 = fib0 + *zero;
    fib0 = fib1 + fib2 + *zero;
  }
  return fib0;
}


This function will be called with a pointer to a value containing 0. Note, however, that the pointed-to variable is volatile, which requires the compiler to generate code to actually read the memory location containing the 0 on every use.
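
For example (this call site is for illustration only; it is not shown in the original article):

volatile __int64 zero = 0;
__int64 f = SerialFib3(45, &zero); // the reads of *zero cannot be optimized away

And the assembler code: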

;;; for(i=2; i<n; ++i)
mov r8d, 2
mov r9, rax
jle .B44.6
.B44.4::
;;; {
;;; fib2 = fib1 + *zero;
mov r10, QWORD PTR [rdx]
inc r8
add r10, r9
;;; fib1 = fib0 + *zero;
mov r9, QWORD PTR [rdx]
add r9, rax
;;; fib0 = fib1 + fib2 + *zero;
mov rax, QWORD PTR [rdx]
add rax, r9
add rax, r10
cmp r8, rcx
jl .B44.4
.B44.6::
;;; }
;;; return fib0;


Note the addition of the

mov r9, QWORD PTR [rdx]


This loop now has 3 memory reads per 10 instructions.

The third derivative function is L1 Data Cache Write intensive:

__int64 SerialFib4( __int64 n , volatile __int64* temp)
{
  if( n<2 )
    return n;
  __int64 fib0 = 1; // most recent
  __int64 fib1 = 1; // 1 prior to most recent
  __int64 fib2 = 0; // 2 prior to most recent
  __int64 i;
  for(i=2; i<n; ++i)
  {
    *temp = fib2 = fib1;
    *temp = fib1 = fib0;
    *temp = fib0 = fib1 + fib2;
  }
  return fib0;
}


And the assembler code:

;;; for(i=2; i<n; ++i)
mov r8d, 2
mov r9, rax
jle .B45.6
.B45.4::
;;; {
;;; *temp = fib2 = fib1;
mov r10, r9
inc r8
mov QWORD PTR [rdx], r9
;;; *temp = fib1 = fib0;
mov r9, rax
mov QWORD PTR [rdx], rax
;;; *temp = fib0 = fib1 + fib2;
add rax, r10
mov QWORD PTR [rdx], rax
cmp r8, rcx
jl .B45.4
.B45.6::
;;; }
;;; return fib0;


Note the three stores of the form

mov QWORD PTR [rdx], rax


producing 3 writes to memory per 9 instructions in the loop.

Finally, the last integer function performs L1 Data Cache Read/Modify/Write operations:

__int64 SerialFib5( __int64 n , volatile __int64* temp)
{
  if( n<2 )
    return n;
  __int64 fib0 = 1; // most recent
  __int64 fib1 = 1; // 1 prior to most recent
  __int64 fib2 = 0; // 2 prior to most recent
  __int64 i;
  for(i=2; i<n; ++i)
  {
    *temp += fib2 = fib1;
    *temp += fib1 = fib0;
    *temp += fib0 = fib1 + fib2;
  }
  return fib0;
}


And the assembly code:

;;; for(i=2; i<n; ++i)
mov r8d, 2
mov r9, rax
jle .B46.6
.B46.4::
;;; {
;;; *temp += fib2 = fib1;
add QWORD PTR [rdx], r9
mov r10, r9
;;; *temp += fib1 = fib0;
add QWORD PTR [rdx], rax
mov r9, rax
;;; *temp += fib0 = fib1 + fib2;
add rax, r10
inc r8
add QWORD PTR [rdx], rax
cmp r8, rcx
jl .B46.4
.B46.6::
;;; }
;;; return fib0;


Note the three add-to-memory instructions (each performs a Read/Modify/Write):

add QWORD PTR [rdx], r9


While I could have done the same with the floating point function, I chose not to, and only looked at the L1 Instruction Cache intensive code.

This code is a derivative of the Riemann zeta(2) function, which can be used to approximate the value of Pi**2 / 6.
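
For reference, the series being summed is

zeta(2) = 1/1**2 + 1/2**2 + 1/3**2 + ... = Pi**2 / 6

so Pi can be recovered from the partial sum rz2 as sqrt(rz2 * 6.0), which is exactly what the floating point task shown later does.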

double Riemann_zeta_2(int n)
{
  double rz2 = 0.0;
  double d = 1.0;
  int i;
  for(int i = 0; i < n; ++i)
  {
    rz2 += 1.0 / (d * d);
    d = d + 1.0;
  }
  return rz2;
}


Or in an alternate form that sums from the smallest term to the largest, which produces one more digit of precision (accumulating the small terms first reduces the floating point rounding error):

double Riemann_zeta_2_b(int n)
{
  double rz2 = 0.0;
  double d = n;
  int i;
  for(int i = 0; i < n; ++i)
  {
    rz2 += 1.0 / (d * d);
    d = d - 1.0;
  }
  return rz2;
}


Because Sandy Bridge has AVX, I took the opportunity to write a derivative of the second form above using the AVX intrinsic instructions:

double Riemann_zeta_2_AVX(int n)
{
  if(n < 4)
    return Riemann_zeta_2_b(n);
  __m256d rz2x4 = {0.0, 0.0, 0.0, 0.0};
  __m256d dx4   = {(double)(n-3),
                   (double)(n-2),
                   (double)(n-1),
                   (double)(n)};
  __m256d incx4 = {-4.0, -4.0, -4.0, -4.0};
  __m256d onex4 = {1.0, 1.0, 1.0, 1.0};
  int nOver4 = n / 4;
  for(int i = 0; i < nOver4; ++i)
  {
    rz2x4 = _mm256_add_pd(rz2x4,
              _mm256_div_pd(onex4,
                _mm256_mul_pd(dx4, dx4)));
    dx4 = _mm256_add_pd(dx4, incx4);
  }
  // horizontal add (optimize later)
  double rz2 = rz2x4.m256d_f64[0]
             + rz2x4.m256d_f64[1]
             + rz2x4.m256d_f64[2]
             + rz2x4.m256d_f64[3];
  double d = dx4.m256d_f64[0];
  int i = n & 3;
  if(i)
  {
    d = (double)i;
    do
    {
      rz2 += 1.0 / (d * d);
      d = d - 1.0;
    } while(--i);
  }
  return rz2;
}


And the body of the for loop, in assembly code, is all L1 Instruction Cache resident code (no data memory references):

;;; {
;;; rz2x4 = _mm256_add_pd(rz2x4,
vmulpd ymm4, ymm3, ymm3
inc eax
;;; _mm256_div_pd(onex4,
;;; _mm256_mul_pd(dx4,dx4)));
;;; dx4 = _mm256_add_pd(dx4, incx4);
vaddpd ymm3, ymm3, ymm2
vdivpd ymm5, ymm0, ymm4
vaddpd ymm1, ymm1, ymm5
cmp eax, edx
jl .B53.4


The complete floating point task:

double someFloatingPointFunction()
{
  double pi = 0.0;
  for(int j=0;j<mFPIteratons;++j)
  {
    double rz2 = Riemann_zeta_2_AVX(20000);
    pi = sqrt(rz2 * 6.0);
  }
  return pi;
}

void aFloatingPointTask(int iBegin, int iEnd)
{
  double total = 0;
  for(int iIteration = iBegin; iIteration < iEnd; ++iIteration)
  {
    for(int i=0;i<FloatingPointLoad;++i)
      total += someFloatingPointFunction();
  }
  // assure compiler optimization does not remove above loop
  if(total == 0.0)
    std::cout << "bug " << total << std::endl;
}


Where mFPIteratons and FloatingPointLoad are set to produce the desired serial test runtime of ~30 seconds. The iBegin and iEnd arguments define the iteration space: for serial execution this will be the entire range; for parallel execution, each invocation receives a slice of the full iteration range.
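
Purely to illustrate that slicing (QuickThread's actual partitioning and thread scheduling are not shown here; this sketch runs the slices one after another on the calling thread):

// Illustrative only: split [0, nIterations) into nSlices contiguous slices
// and hand each slice to the task, as parallel_for does across its thread team.
void runSliced(void (*task)(int iBegin, int iEnd), int nIterations, int nSlices)
{
  for(int s = 0; s < nSlices; ++s)
  {
    int iBegin = (int)((__int64)nIterations * s / nSlices);
    int iEnd   = (int)((__int64)nIterations * (s + 1) / nSlices);
    task(iBegin, iEnd); // e.g. aFloatingPointTask(iBegin, iEnd)
  }
}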

The integer functions are similar. I will only list one:

__int64 someIntegerFunction_L1_I_Cache_Read()
{
  __int64 x = 0;
  for(__int64 j=0;j<mIntegerIteratons;++j)
    x += SerialFib2(45);
  return x;
}

void anIntegerTask_L1_I_Cache_Read(int iBegin, int iEnd)
{
  __int64 total = 0;
  for(int iIteration = iBegin; iIteration < iEnd; ++iIteration)
  {
    for(int i=0;i<IntegerLoad_L1_I_Cache_Read;++i)
      total += someIntegerFunction_L1_I_Cache_Read();
  }
  // assure compiler optimization does not remove above loop
  if(total == 0)
    std::cout << "bug" << std::endl;
}


And these functions are called by serial or parallel code:

t.begin();
aFloatingPointTask( 0, nIterations);
t.end();


or

t.begin();
anIntegerTask_L1_I_Cache_Read( 0, nIterations);
t.end();

or

t.begin();
qt::parallel_for(
  aFloatingPointTask, 0, nIterations);
t.end();


or

t.begin();
qt::parallel_for(
  anIntegerTask_L1_I_Cache_Read, 0, nIterations);
t.end();


or

t.begin();
qt::parallel_invoke(
  [&](){ qt::parallel_for(
    qt::OneEach_L1$,
    anIntegerTask_L1_I_Cache_Read, 0, nIterations); },
  [&](){ qt::parallel_for(
    qt::OneEach_L1$,
    aFloatingPointTask, 0, nIterations); });
t.end();


And other similar functions.

Where parallel_invoke is an n-way fork and join, in this case 2-way. The two tasks each perform a parallel_for that restricts its thread team to one HT sibling of each L1 cache (one hardware thread per core).
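
The t object used in these snippets is simply an elapsed-time helper; its implementation is not shown in the article, but a minimal stand-in (the name and the use of std::chrono are my own) could look like:

#include <chrono>
#include <iostream>

// Hypothetical stand-in for the timer used as t.begin()/t.end() above.
struct ElapsedTimer
{
  std::chrono::steady_clock::time_point start;
  void begin() { start = std::chrono::steady_clock::now(); }
  void end()
  {
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << elapsed.count() << " seconds" << std::endl;
  }
};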

Now on to some charts:

[Chart 1: parallel_for of the floating point task alone; run times for the four permutations of HT and Turbo]

This first chart compares the parallel_for of the floating point task alone under the permutations of HT and Turbo. Please note that the vertical scale runs from 7 to 8 seconds, as opposed to from 0 to 8 seconds, so that you can see the difference in run times.

This chart shows that when Turbo is Off, the runtime performance is approximately the same with HT On (8 threads) as with HT Off (4 threads). This is to be expected because this system has only 4 cores, and therefore only 4 sets of AVX (and SSE and FPU) execution resources.

When Turbo is On, interestingly, the chart shows that HT Off has a slight advantage. The difference amounts to about 1% of the HT On time.

Although 1% is hardly worth disabling HT for, note that this test code does not perform a high degree of cache or memory access during the floating point test. Expect that 1% (plus or minus) to vary depending on your application: some applications may show a somewhat larger improvement with HT Off, while others will show none.

Assume now that you are willing to keep HT On and to take a small performance hit for floating point. Is there anything you can do about this? (This is the “eat your cake too” part.)

Let’s enable HT and Turbo, then look at scheduling the floating point task to one HT sibling of each core (a 4-way slice of the parallel_for loop), while scheduling the integer task to the other HT sibling of each core. The floating point task will use 4 cores, and the integer task will use the same 4 cores. The code to do this is:

t.begin();
qt::parallel_invoke( // 2-way fork/join
  // first leg of fork
  [&](){ qt::parallel_for(
    qt::OneEach_L1$, // one thread per each L1 cache
    anIntegerTask_L1_I_Cache_Read, 0, nIterations); },
  // second leg of fork
  [&](){ qt::parallel_for(
    qt::OneEach_L1$, // one thread per each L1 cache
                     // (this is the other HT sibling)
    aFloatingPointTask, 0, nIterations); }); // join
t.end();


And the code that performs the typical parallel_for looks like:

t.begin();
qt::parallel_for(
  aFloatingPointTask, 0, nIterations);
qt::parallel_for(
  anIntegerTask_L1_I_Cache_Read, 0, nIterations);
t.end();


Let’s see what happens:

[Chart 2: conventional back-to-back parallel_for runs versus the HT-sibling partitioned runs, for each integer task variant paired with the floating point task]

The L1 Instruction Cache Read experiences a 40.7% improvement in performance.
The L1 Data Cache Read experiences a 21.6% improvement in performance.
The L1 Data Cache Read/Write experiences a 33.9% improvement in performance.
The L1 Data Cache Write experiences a 20.3% improvement in performance.

These are all significantly better than the ~1% gained for floating point alone by turning HT Off.

Yes, Virginia, you can have your cake and eat it too.

Your results will vary from this test bed application.

FYI, I am in the process of updating the software on the website. I have made some corrections to the code since the last posting. If you have any issues, please report them via the email address listed on the web site or the address below. QuickThread works on Windows and Ubuntu Linux systems, both 32-bit and 64-bit.

These tests were run on Windows 7 Professional x64 using Intel Parallel Studio 2011 XE, on an Intel Core i7 2600K.

Jim Dempsey
jim@quickthreadprogramming.com
