Countable loops in openMP

Some OpenMP-related documents state that in order for a loop to be handled by OpenMP it must be “countable”, and they give different definitions of what “countable” means:

  • the number of iterations in the loop must be expressible as an integer and the loop must use a fixed increment.
  • the loop count can be “determined” ( what does “determined” mean? )

Is this indeed a requirement of OpenMP? Or is it a requirement of a specific compiler implementation of OpenMP?

Can the following code ( which doesn't seem to be countable ) be parallelized by OpenMP? ( Note that the question is whether this code can be parallelized, not whether there is a way to create a parallel equivalent of the code. )
 

for ( i = 0; i < cnt; )
{
  x1 = 2.0 * x - 1.;
  if ( x1 < 1.0 )
  {
    i = i + 3;
    x = x*2.;
  }
  else // if ( x1 >= 1. )
  {
    i = i + 2;
    x = x/2.;
  }
}

Thank you,


David

www.dalsoft.com

 

 


You would need to make the parallel for run for the maximum required count. Then you could make the body of the loop conditional on i. With static scheduling, this would imply work imbalance, so you could work with schedule(runtime) and try various choices by environment variable, such as guided, auto, or dynamic. With dynamic, at least, you should try various chunk sizes. The best choices will vary with the number of cores, the total number of iterations, and even which OpenMP library is in use.
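A minimal sketch of that pattern, with hypothetical work_needed() and do_work() helpers standing in for the real guard and body (illustrative only, not code from this thread):

#include <omp.h>

// Sketch: run the worksharing loop over an upper bound and guard the body;
// the schedule is chosen at run time via OMP_SCHEDULE
// (e.g. OMP_SCHEDULE="dynamic,1024" or OMP_SCHEDULE="guided").
bool work_needed(int i);   // hypothetical guard
void do_work(int i);       // hypothetical body

void process(int max_count)
{
  #pragma omp parallel for schedule(runtime)
  for (int i = 0; i < max_count; ++i)
  {
    if (work_needed(i))
      do_work(i);
  }
}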

"Countable" generally means that the compiler can generate code that will compute the number of loop iterations without executing the loop.

Modifying the index variable outside of the increment expression in the "for" statement is often prohibited, though special cases can be countable (e.g., a simple unconditional increment of the index variable somewhere in the loop). 

In your case, the update(s) of the index variable are conditional, which is usually enough to prevent the loop from being countable.  To make it worse, the condition depends on a floating-point value, and that floating-point value is updated within the loop.   The number of iterations in such a case may depend on the floating-point rounding mode in effect.  Determining the number of iterations in general code is equivalent to solving the Halting Problem, which is not possible. https://en.wikipedia.org/wiki/Halting_problem
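For contrast, here is a minimal example (illustrative only, not from the original post) of a loop the compiler can count: the bound and the increment are fixed before the loop starts, so the trip count can be computed without executing the body.

#include <omp.h>

// Countable: n and the step are loop-invariant, so the number of iterations
// is known up front and can be divided among the threads.
void scale(double *y, const double *x, int n)
{
  #pragma omp parallel for
  for (int i = 0; i < n; i += 4)
    y[i] = 2.0 * x[i];
}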

"Dr. Bandwidth"

"compiler can generate code that will compute the number of loop iterations without executing the loop"

What kind of code? Would it be acceptable to slice the code of the original loop to extract the index generation and then loop over that code ( and not the original loop )?

 

That loop is not inherently parallelizable (except in the case where the initial value of x is <= 0.0; in that case, replace the loop with x=x*(2**cnt)). Otherwise, x has loop-order dependencies.

Jim Dempsey

 

Edit: <= 1.0 / (2**cnt)

Jim Dempsey

Although that is not the point of my post, perhaps in the case you mentioned the result should be x*(2**(cnt/3)).

Also, when you say that the loop is not parallelizable, I guess you mean "not parallelizable by OpenMP". I wrote an autoparallelizer ( see www.dalsoft.com ) that can parallelize this loop ( in fact, a much more complicated loop - the code in my post is a simplified version of it ).

 

It would help if you give a link to the direct page that illustrates how the loop in #1 is auto-parallelized.

Jim Dempsey

The loop that I referred to ( of which the code in my post is a simplified version ) is:

for ( i = 0; i < cnt; )
{
  x1 = 2.0 * x - 1.;
  if ( x1 < 1.0 )
  {
    b[i] = exp( x1 ) * cos( x1 );
    i = i + 3;
    x = x*2.;
  }
  else // if ( x1 >= 1. )
  {
    a[i] = sqrt( x1 ) * log( 1 / x1 );
    i = i + 2;
    x = x/2.;
  }
}

I will present the results at the "Compiler, Architecture, And Tools Conference", see

https://software.intel.com/en-us/event/compiler-conference/2018/schedule

After the presentation ( December 17 ) would be glad to show you the code.

 

Please add this to your calendar such that those here not attending the conference can see and comment.

While having a compiler auto-parallelize it is a different matter, the loop in #9 can be parallelized.

In the serial loop, i always increments, and thus will not produce duplicate indices for [i]. While it has not been disclosed, it may be a requirement that the initial x be > 0.0. Therefore the values inserted into b or a would not be 0.0.

This untested code may be effective:

#include <atomic>  // atomic<> below refers to std::atomic
using std::atomic;

atomic<double> fix_a[cnt], fix_b[cnt]; // cnt assumed here to be a compile-time constant
atomic<int> fill_a,fill_b;
...

#pragma omp parallel
{
  #pragma omp sections
  {
    for (int i = 0; i < cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_b[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_a = -1;
fill_b = -1;
... x = some initial value

#pragma omp parallel
{
  #pragma omp master
  {
    for (int i = 0; i < cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 <  1.0 )
      {
        fix_b[i] = x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_b = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 >= 1. )
      {
        fix_a[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_a = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i < cnt; )
    fill_a = cnt;
    fill_b = cnt;
  } // #pragma omp master
  // all threads here
  int empty_a = 0;
  int empty_b = 0;
  // until done
  for(;empty_a < cnt || empty_b < cnt;)
  {
    while(empty_a <= fill_a && empty_a < cnt)
    {
      if(fix_a[empty_a] != 0.0)
      {
        double x1 = fix_a[empty_a].exchange(0.0);
        if(x1)
        {
          a[empty_a] = sqrt( x1 ) * log( 1 / x1 );
        } // if(x1)
      } // if(fix_a[empty_a])
      ++empty_a;
    }
    while(empty_b <= fill_b && empty_b < cnt)
    {
      if(fix_b[empty_b] != 0.0)
      {
        double x1 = fix_b[empty_b].exchange(0.0);
        if(x1)
        {
          b[empty_b] = exp( x1 ) * cos( x1 );
        } // if(x1)
      } // if(fix_b[empty_b])
      ++empty_b;
    }
  } // for(;empty_a < cnt || empty_b < cnt;)
} // #pragma omp parallel

Depending on your needs, you may want to insert _mm_pause() when waiting for work.
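For example, a spin-wait helper along those lines might look like this (a sketch, assuming the std::atomic counters used above):

#include <immintrin.h>
#include <atomic>

// Sketch: poll the producer's published index, pausing between polls so the
// spinning hardware thread gives up resources to its sibling.
static void wait_for_fill(const std::atomic<int> &fill, int wanted)
{
  while (fill.load(std::memory_order_acquire) < wanted)
    _mm_pause();
}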

Keep in mind you may need to modify the code.

Also, the amount of work needs to be sufficient to amortize the overhead of starting/resuming the thread team. (IOW number of iterations is relatively large).

Jim Dempsey

It should be noted that the wipe of fix_a and fix_b need only be done once, since the pickers reset entries to 0.0 with exchange. As to what to do with a and b there is insufficient information in your postings.

You would want to assure that the threads performing the picking were on separate cores (IOW not with multiple threads within a core).

Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?

Jim Dempsey

As to what to do with a and b there is insufficient information in your postings.

a and b are assumed to be parameters of the routine that contains the loop.

Would it be safe to assume the sample code was taken from some actual code, and if so, what is typical of the iteration counts, path a and path b?

No, I came up with this in an attempt to show the functionality of the auto-parallelizer, specifically its ability to

  • calculate the loop count ( of a loop that is not countable and thus not supported by OpenMP )
  • resolve the memory dependencies for the writes to a and b
  • create code to calculate the values needed at the entry to a thread: x and i

If you would like, I will send you ( after the conference ) my presentation that explains the above.

As to the iteration counts: in the test suite I use, the iteration count cnt is set to 100000000 ( 8 zeros ) - which makes me wonder how practical your solution is where you introduce new arrays of size cnt.

Also note the use of transcendentals - this is done in order to give some weight to the loop; otherwise the overhead of doing the above would make auto-parallelization not worth it. I ran some tests adding more calls to "expensive" routines and saw how it improved the performance of the parallelized code. You may find the following article helpful to clarify that:

http://www.dalsoft.com/Calculating_number_of_cores_to_benefit_from_paral...

 

 

 
 
 


>>which makes me wonder how practical your solution is where you introduce new arrays of size cnt.

To reduce the additional arrays to one array: it is known that the x1 values generated are all > 0.0, so the sign can be used to indicate the path.

As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).

Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this. This said, one (you) could have the compiler identify this type of loop (something similar to a convergence loop), and produce a preamble for single path, then enter the flip/flop for the remainder of the convergence.

Simplified code of my prior post:

atomic<double> fix_x[cnt];
atomic<int> fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    for (int i = 0; i < cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_x[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma omp parallel
{
  #pragma omp master
  {
    for (int i = 0; i < cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 <  1.0 )
      {
        fix_x[i] = -x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 >= 1. )
      {
        fix_x[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i < cnt; )
    fill_x = cnt;
  } // #pragma omp master
  // all threads here
  int empty_x = 0;
  // until done
  for(;empty_x < cnt;)
  {
    while(empty_x <= fill_x)
    {
      if(fix_x[empty_x] != 0.0)
      {
        double x1 = fix_x[empty_x].exchange(0.0);
        if(x1)
        {
          if(x1 > 0.0)
            a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
          else
            b[empty_x] = exp( -x1 ) * cos( -x1 );
        } // if(x1)
      } // if(fix_x[empty_x])
      ++empty_x;
    }
  } // for(;empty_x < cnt;)
} // #pragma omp parallel

Perhaps you could compare the above with your auto-generated code.

Note, if the distribution of the modified cells is somewhat random, then this code may be better:

atomic<double> fix_x[cnt];
atomic<int> fill_x;
...
// once only
#pragma omp parallel
{
  #pragma omp sections
  {
    for (int i = 0; i < cnt; ++i)
      a[i] = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      b[i] = 0.0;
    #pragma omp section
    for (int i = 0; i < cnt; ++i)
      fix_x[i] = 0.0;
  } //#pragma omp end sections
} // #pragma omp parallel

fill_x = -1;
... x = some initial value

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  if(iThread == 0)
  {
    for (int i = 0; i < cnt; )
    {
      x1 = 2.0 * x - 1.;
      if ( x1 <  1.0 )
      {
        fix_x[i] = -x1; // b[i] = exp( x1 ) * cos( x1 );
        fill_x = i;
        i = i + 3;
        x = x*2.;
      }
      else  // if ( x1 >= 1. )
      {
        fix_x[i] = x1; // a[i] = sqrt( x1 ) * log( 1 / x1 );
        fill_x = i;
        i = i + 2;
        x = x/2.;
      }
    } // for (int i = 0; i < cnt; )
    fill_x = cnt;
  } // if(iThread == 0)
  if(nThreads > 1)
  {
    --iThread;
    --nThreads;
  }
  if(iThread >=0)
  {
    int empty_x = 0;
    // until done
    for(;empty_x < cnt;)
    {
      while(empty_x <= fill_x)
      {
        if(empty_x % nThreads == iThread)
        {
          double x1 = fix_x[empty_x];
          if(x1 != 0.0)
          {
            fix_x[empty_x] = 0.0;
            if(x1 > 0.0)
              a[empty_x] = sqrt( x1 ) * log( 1 / x1 );
            else
              b[empty_x] = exp( -x1 ) * cos( -x1 );
          } // if(x1 != 0.0)
        } // if(empty_x % nThreads == iThread)
        ++empty_x;
      }
    } // for(;empty_x < cnt;)
  }
} // #pragma omp parallel

Jim Dempsey

As to which is faster (your auto-gen code or my specific code), well that can be tested (by one that has both codes).

Your code wouldn't run on my machine, as a and b, with the cnt I specified, already take all the memory. The rule for writing parallel code is that everything should be done in place, with no huge memory allocations, as there may be no memory available.

Assuming x is unknown at compile time, it is not clear to me as to how you could parallelize this.

I fail to understand your preoccupation with the value of x. The autoparallelizer is not bothered by that at all. Of course, the user should be careful to use values of x that don't cause exception(s), but the same exception will occur in the sequential and the parallel codes. Also, if it were clear how to parallelize sequential code, I would be out of business.

 

 

 
 
 


Ok, then here is a simplified parallel loop in OpenMP:

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  int interval = 0;
  // all threads perform
   for (int i = 0; i < cnt; ++interval)
   {
    x1 = 2.0 * x - 1.; // * is relatively fast
    if ( x1 <  1.0 )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        b[i] = exp( x1 ) * cos( x1 ); //
      }
      i = i + 3;
      x = x*2.;
    }
    else  // if ( x1 >= 1. )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        a[i] = sqrt( x1 ) * log( 1 / x1 );
      }
      i = i + 2;
      x = x/2.;
     }
   }
}

Jim Dempsey

To the best of my knowledge, the loop in your last post is not canonical and therefore wouldn't be accepted by OpenMP; gcc should give a compilation error ( when using -fopenmp ) requesting an explicit loop increment.

Please disregard the last post - I missed the fact that "#pragma omp" is applied to a block ( and not to a loop ).

Sorry.

 

Here is the code that works; however, the compute time of the "DoWork" emulation is rather short.

// GoofyLoop.cpp
//

#include "stdafx.h"
#include <iostream>

#include <immintrin.h>
#include <math.h>
#include <omp.h>

const __int64 N = 100000000;
double* a;
double* b;
const double typicalX = 3.141592653589793;

void Serial(void)
{
	double x = typicalX;
	double x1;
	__int64 cnt = N;
	for (__int64 i = 0; i < cnt;)
	{
		x1 = 2.0 * x - 1.;
		if (x1 <  1.0)
		{
			b[i] = exp(x1) * cos(x1);
			i = i + 3;
			x = x*2.;
		}
		else  // if ( x1 >= 1. )
		{
			a[i] = sqrt(x1) * log(1 / x1);
			i = i + 2;
			x = x / 2.;
		}
	}
}

void Parallel(void)
{
#pragma omp parallel
	{
		int iThread = omp_get_thread_num();
		int nThreads = omp_get_num_threads();
		double x = typicalX;
		double x1;
		__int64 cnt = N;
		__int64 interval = 0;
		for (__int64 i = 0; i < cnt; ++interval)
		{
			x1 = 2.0 * x - 1.;
			if (x1 < 1.0)
			{
				if (interval%nThreads == iThread)
				{
					b[i] = exp(x1) * cos(x1);
				}
				i = i + 3;
				x = x*2.;
			}
			else  // if ( x1 >= 1. )
			{
				if (interval%nThreads == iThread)
				{
					a[i] = sqrt(x1) * log(1 / x1);
				}
				i = i + 2;
				x = x / 2.;
			}
		}
	}
}
int _tmain(int argc, _TCHAR* argv[])
{
	a = (double*)malloc(N * sizeof(double)); // new double[N];
	b = (double*)malloc(N * sizeof(double)); // new double[N];
#pragma omp parallel
	{
#pragma omp master
		{
			std::cout << "nThreads = " << omp_get_num_threads() << std::endl;
		}
	}
#pragma omp parallel for
	for (int i = 0; i < N; ++i)
	{
		a[i] = 0.0;
		b[i] = 0.0;
	}

	for (int rep = 0; rep < 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Serial();
		unsigned __int64 t1 = _rdtsc();
		std::cout << "Serial ticks = " << t1 - t0 << std::endl;
	}
	std::cout << std::endl;
	for (int rep = 0; rep < 3; ++rep)
	{
		unsigned __int64 t0 = _rdtsc();
		Parallel();
		unsigned __int64 t1 = _rdtsc();
		std::cout << "Parallel ticks = " << t1 - t0 << std::endl;
	}
	return 0;
}

On a 4 core w/ HT, Core i7 2700K, running 1 thread per core:

Threads = 4
Serial ticks = 3076054566
Serial ticks = 2543030436
Serial ticks = 2547671985

Parallel ticks = 2116263348
Parallel ticks = 2116889788
Parallel ticks = 2128250491

Marginal.

3 Threads:

Threads = 3
Serial ticks = 2585388714
Serial ticks = 2603521131
Serial ticks = 2602569400

Parallel ticks = 2098115233
Parallel ticks = 2108614224
Parallel ticks = 2098695600

Slightly better.

The DoWork section is a relatively small computation between memory writes; therefore it seems that, for this example, the degree of (productive) parallelization is dependent upon the memory subsystem.

Jim Dempsey

I created the following test program:

#include <stdio.h>
#include <math.h>

#define N 100000000

double a[N], b[N];

double foo( double *a, double *b, double x, unsigned int cnt )
 {
  double x1;
  unsigned int i;

asm( "#.dco_start" );

  for ( i = 0; i < cnt; i++ )
   {
    x1 = 2.0 * x - 1;
    if ( x1 <  1.0 )
     {
      x1 = exp( x1 ) * cos( x1 );
      b[i] = x1;
      i = i + 3;
      x = x*2.;
     }
    else  // if ( x1 >= 1. )
     {
      x1 = sqrt( x1 ) * log( 1. / x1 );
      a[i] = x1;
      i = i + 2;
      x = x/2.;
     }
   }

asm( "#.dco_end" );

  return x;

 }

// by Jim Dempsey
double foo_Jim( double *a, double *b, double x, unsigned int cnt )
 {
  double x1;
  unsigned int i;

#if 0

#pragma omp parallel
{
  int iThread = omp_get_thread_num();
  int nThreads = omp_get_num_threads();
  int interval = 0;
  // all threads perform
   for ( i = 0; i < cnt; ++interval)
   {
    x1 = 2.0 * x - 1.; // * is relatively fast
    if ( x1 <  1.0 )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        b[i] = exp( x1 ) * cos( x1 ); //
      }
      i = i + 3;
      x = x*2.;
    }
    else  // if ( x1 >= 1. )
    {
      if(interval%nThreads == iThread)
      {
        // distribute computation part
        a[i] = sqrt( x1 ) * log( 1 / x1 );
      }
      i = i + 2;
      x = x/2.;
     }
   }
}

#endif

return x;

}

int main()
 {
  double rslt, rslt1;
  unsigned int i;

  for ( i = 0; i < N; i++ )
   {
    a[i] = 0.;
    b[i] = 0.;
   }

  for( i = 0; i < 10; i++ )
   {
    rslt = foo( a, b, 3.1415, N - 100 );
//    rslt = foo_Jim( a, b, 3.1415, N - 100 );
   }

  rslt1 = 0.;
  for ( i = 0; i < N; i++ )
   {
    rslt1 += a[i];
    rslt1 += b[i];
   }

  printf( "rslt  %f   %f\n", rslt, rslt1 );

 }

and generated 3 executables ( altering code as necessary ):

serial code: loop

parallel code generated by my auto-parallelizer: loop_dco

your code: loop_Jim

 

Each program was executed 3 times on an E8600 @ 3.33GHz, 2-core Linux machine. The execution was done under the "time" command ( e.g. "time ./loop" ), and the reported time ( see below ) is the 'real' time produced by the "time" command that is neither the fastest nor the slowest of the 3 executions attempted.

The execution times ( in seconds ) are:

loop: 14.99

loop_dco: 11.75

loop_Jim: 10.48

 

As you can see, the program generates and prints checksums - for loop and loop_dco these checksums always ( for every invocation ) fully agreed; loop_Jim was generating different checksums for every separate run (?).

Few words about the code:

loop_Jim assumes that a and b are non-overlapping memory regions and therefore may be used in parallel code; loop_dco doesn't make such an assumption and generates code to verify that dynamically at run time - an overhead of up to 20%.
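For reference, the kind of run-time overlap test such disambiguation involves can be as simple as the following sketch (illustrative only - this is not dco's generated code):

#include <cstdint>
#include <cstddef>

// Illustrative only: report whether [a, a+n) and [b, b+n) intersect, so the
// caller can fall back to the serial version when they do.
static bool regions_overlap(const double *a, const double *b, std::size_t n)
{
  std::uintptr_t lo_a = reinterpret_cast<std::uintptr_t>(a);
  std::uintptr_t hi_a = lo_a + n * sizeof(double);
  std::uintptr_t lo_b = reinterpret_cast<std::uintptr_t>(b);
  std::uintptr_t hi_b = lo_b + n * sizeof(double);
  return lo_a < hi_b && lo_b < hi_a;
}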

loop_Jim doesn't preserve the value of x that should be returned.

 

If you like, I would be glad to send you my ( Linux ) executables.

 

 

Knowing this is now memory-access bound, performing aligned allocation and organizing the stores on cache lines yields a little more improvement:

const int CacheLineSize = 64;
const int doublesInCacheLine = CacheLineSize / sizeof(double);
...
	a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double[N];
	b = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double[N];
...
				if ((i / doublesInCacheLine)%nThreads == iThread)
...
				if ((i / doublesInCacheLine) % nThreads == iThread)
nThreads = 4
Serial ticks = 2665960090
Serial ticks = 2645705503
Serial ticks = 2551374815

Parallel ticks = 1949592124
Parallel ticks = 1957116967
Parallel ticks = 2024346334

Jim Dempsey

I get consistent results. The time command includes program load time, page file initialization time, etc., and is not too useful for loop analysis.

// GoofyLoop.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>

#include <immintrin.h>
#include <math.h>
#include <omp.h>

const __int64 N = 100000000;
double* a;
double* b;
const double typicalX = 3.141592653589793;
const int CacheLineSize = 64;
const int doublesInCacheLine = CacheLineSize / sizeof(double);

double Serial(void)
{
	double x = typicalX;
	double x1;
	__int64 cnt = N;
	for (__int64 i = 0; i < cnt;)
	{
		x1 = 2.0 * x - 1.;
		if (x1 <  1.0)
		{
			b[i] = exp(x1) * cos(x1);
			i = i + 3;
			x = x*2.;
		}
		else  // if ( x1 >= 1. )
		{
			a[i] = sqrt(x1) * log(1 / x1);
			i = i + 2;
			x = x / 2.;
		}
	}
	return x;
}

double Parallel(void)
{
	double rslt;
#pragma omp parallel
	{
		int iThread = omp_get_thread_num();
		int nThreads = omp_get_num_threads();
		double x = typicalX;
		double x1;
		__int64 cnt = N;
		
		for (__int64 i = 0; i < cnt;)
		{
			x1 = 2.0 * x - 1.;
			if (x1 < 1.0)
			{
				if ((i / doublesInCacheLine)%nThreads == iThread)
				{
					b[i] = exp(x1) * cos(x1);
				}
				i = i + 3;
				x = x*2.;
			}
			else  // if ( x1 >= 1. )
			{
				if ((i / doublesInCacheLine) % nThreads == iThread)
				{
					a[i] = sqrt(x1) * log(1 / x1);
				}
				i = i + 2;
				x = x / 2.;
			}
		}
		rslt = x;
	}
	return rslt;
}

void Wipe()
{
#pragma omp parallel for
	for (__int64 i = 0; i < N; ++i)
	{
		a[i] = 0.0;
		b[i] = 0.0;
	}
}

double Tally()
{
	double tally = 0.0;
	for (__int64 i = 0; i < N; ++i)
	{
		tally += (a[i]*i) + (b[i]*i);
	}
	return tally;
}
int _tmain(int argc, _TCHAR* argv[])
{
	a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double[N];
	b = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double[N];
#pragma omp parallel
	{
#pragma omp master
		{
			std::cout << "nThreads = " << omp_get_num_threads() << std::endl;
		}
	}

	for (int rep = 0; rep < 3; ++rep)
	{
		Wipe();
		unsigned __int64 t0 = _rdtsc();
		double rslt = Serial();
		unsigned __int64 t1 = _rdtsc();
		double tally = Tally();
		std::cout << "Serial rslt = " << rslt << " tally = " << tally << " ticks = " << t1 - t0 << std::endl;
	}
	std::cout << std::endl;
	for (int rep = 0; rep < 3; ++rep)
	{
		Wipe();
		unsigned __int64 t0 = _rdtsc();
		double rslt = Parallel();
		unsigned __int64 t1 = _rdtsc();
		double tally = Tally();
		std::cout << "Parallel rslt = " << rslt << " tally = " << tally << " ticks = " << t1 - t0 << std::endl;
	}
	return 0;
}

nThreads = 4
Serial rslt = 1.5708 tally = 3.74665e+014 ticks = 2737953921
Serial rslt = 1.5708 tally = 3.74665e+014 ticks = 2629847324
Serial rslt = 1.5708 tally = 3.74665e+014 ticks = 2616123552

Parallel rslt = 1.5708 tally = 3.74665e+014 ticks = 2057548183
Parallel rslt = 1.5708 tally = 3.74665e+014 ticks = 2159630511
Parallel rslt = 1.5708 tally = 3.74665e+014 ticks = 2201760969

Jim Dempsey

 

For a more accurate correctness test, change tally to:

tally += (a[i] * ((i & 15) + 1)) + (b[i] * ((i & 15) + 1));

The idea is to produce a hash that includes the location as well as the value... without having too many of the result bits truncated in the process. The earlier use of i directly would have caused this. For complete correctness, two sets of results (4 arrays) would be required for a direct comparison (placement as well as value).

Jim Dempsey

I fixed the sources and the checksums now fully agree, with the timings being:

loop - sequential code: 19.94

loop_dco - code generated by autoparallelizer: 17.78

loop_Jim: 13.62

 

The last post shows that your manually created code is 24% faster than the auto-parallelized version of the loop. The explanation is quite simple: the auto-parallelizer creates code to ensure memory disambiguation ( which introduces about 20% overhead ), something that is missing from the manually created version.

I created code that adds some weight to the real calculation of the loop, replacing the assignments to a and b with:

b[i] = exp(x1) * cos(x1) + cosh( x1 );
a[i] = sqrt(x1) * log(1 / x1) - sinh( x1 );

( not that it makes sense mathematically; it just uses transcendentals, which seem to be expensive ).

I also ran the code for 5 iterations instead of the 10 in the previous try. The execution times are now:

 

loop - sequential code: 20.05

loop_dco - code generated by auto-parallelizer: 12.96

loop_Jim: 11.48

 

this time your manually created code is 11% faster than the auto-parallelized version of the loop.

I don't know how much time it took you to create the code; running the auto-parallelizer took a few seconds. The 11% difference is the price to pay for the convenience ( no need to modify the source, no time spent creating the code ) and the protection/precaution ( all necessary checks performed ).

 

 

I agree that 11% might not be worth the effort of using manual work distribution, but please note that your test platform has only 2 cores. It is as yet unclear how the underlying thread teaming is performed, and whether your auto-parallelizer scales with more cores. What is important is the technique: how you make the determination of how to partition this particular example.

Did your verification assure that the placement of the results was the same as with the serial version? IOW, simply testing the end result of x is insufficient.

Jim

 

 

Not sure I understood what you mean by “teaming”, but currently every thread gets an equal number of iterations ( that is why it is so important to determine the loop count ). As to “scaling with more cores”, the auto-parallelizer doesn't care about the number of cores; however, not every algorithm scales nicely, see

 

www.dalsoft.com/Increasing_number_of_cores_is_not_always_beneficial.pdf

 

As to the verification, all the tests I run – the one in our conversation and many, many more – were checked to assure that the results generated by the parallel and serial codes are reasonably identical ( in the loop we discussed, the checksums were verified ).

Being able to run the verification on only one 2-core machine I consider a major flaw and a drawback. I would like to evaluate the tool on ( many ) multicore x86 systems. There are samples of the results of parallelization and instructions on how to execute them and how to report the results, see

 

www.dalsoft.com/documentation_auto_parallelization.html#howToRunProvided...

 

These executables run on an x86 Linux system, and some ( one ) need Fortran shared libraries to be installed. I am desperately looking for people to run these samples and to report the generated data ( as described in the above link ).

 

Thanks for the links. When I have time, I will see if I can run your executables on two of my systems:

Intel Xeon E5-2620 6-core/12-Thread
Intel Xeon Phi 5200 64-core/256-Thread

I will compile the source code for the non-dco samples as written.
Then I will edit the source code to reflect how I think a High Performance Computing programmer would address the same problem using OpenMP.
It would only be fair to return the modified sources back to you so you can remove the OpenMP directives and dco-ify the same code.

Jim Dempsey

 

Thank you.

Impressive hardware!

Note that there is no need for any development effort; the executables and the scripts to run them are provided at the link I specified - it is only necessary to run these scripts and to report the results so that I will be able to publish them on my site.

Having said that, I would like to add: I am not against another round of the "manual vs. automatic parallel code generation" duel between us.

 

 

The comparison of your manually created code to the code generated by the auto-parallelizer felt incomplete: the auto-parallelizer creates code to ensure memory disambiguation, which the manually created code doesn't. I modified the assembly file created by dco ( the auto-parallelizer ) and bypassed this code. The execution times are now:

 

loop - sequential code: 20.05

loop_dco - code generated by auto-parallelizer: 12.96

loop_dco_fake - code generated by auto-parallelizer with memory disambiguation being disabled: 11.92

loop_Jim: 11.48

this time your manually created code is less than 4% faster.

Actually I estimated the overhead of memory disambiguation to be about 10% ( 20% for the original code – I reported that ). The difference that still exists ( less than 4% ) is due to the overhead introduced by the calculation of the loop count ( done by dco ) as well as other checks that I didn’t touch.

 

 

Did your verification verify that the store indices i were juxtaposed correctly (IOW, not just that the final x was correct)?

Also, a test with 2 cores is not necessarily indicative of what will happen with 4, 8, ... cores.

Jim Dempsey

Did your verification verify that the store indices i were juxtaposed correctly (IOW, not just that the final x was correct)?

Yes of course, we discussed this already.

Also, a test with 2 cores is not necessarily indicative of what will happen with 4, 8, ... cores.

More than that – with more cores the overhead introduced by dco will become more pronounced, thus widening the gap between our implementations ( most of the code that introduces the overhead is executed by the master process and doesn't benefit from the cores available ).

However what seems to be important is

  1. the price/benefit ratio between the time spent to manually develop the code ( price ) and the improvement it provides over the code generated by the auto-parallelizer ( benefit ).
  2. how safe the manually developed code really is – for example, no memory disambiguation was attempted in the code you developed, which I don't know how to justify: you may have no control over how and where your code is executed, which makes the relation between memory access addresses unknown.

David Livshin

www.dalsoft.com

 

Your #1 is incomplete for evaluation purposes. One must look at the incremental cost amortized across the reduction in runtime over the expected lifetime * number of instances of the application. If the application were to be run only once, for a short duration (days), then your worth-it calculation is correct. However, for research and/or business applications where the number of runs is large and run times are long, then 5% - 10% faster is a significant benefit.

For #2, for a properly constructed piece of code it will be known in advance that either:

a) disambiguation isn't necessary (as conflicts cannot occur)
b) disambiguation is required (as conflicts can occur)

In the case of b), generally a simple test can be performed (e.g. overlapping array sections), and then an alternate form of a) can be selected (or resort to safe reduction methods).
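A sketch of that dispatch, assuming a hypothetical regions_overlap() test like the one shown earlier in the thread and hypothetical serial/parallel variants:

#include <cstddef>

// Hypothetical helpers: an overlap test and the two code paths.
bool regions_overlap(const double *a, const double *b, std::size_t n);
void combine_serial(double *a, const double *b, std::size_t n);
void combine_parallel(double *a, const double *b, std::size_t n);

// Case b): conflicts are possible, so test once and pick the safe path.
void combine(double *a, const double *b, std::size_t n)
{
  if (regions_overlap(a, b, n))
    combine_serial(a, b, n);     // overlapping: fall back to the serial form
  else
    combine_parallel(a, b, n);   // disjoint: the parallel form is safe
}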

In my sample code provided, the construction was such that conflict could not occur except for a meaningless mashing of pointers:

	a = (double*)_mm_malloc(N * sizeof(double), 64); // (double*)malloc(N * sizeof(double)); // new double[N];
	b = &a[1];

Jim Dempsey

 then 5% - 10% faster is a significant benefit.

I have been writing code optimizers for many, many years and have not seen a customer who was interested in a tool that gives a 5% improvement. I think such a person hasn't been born yet.

In my evaluations I always consider a 5% difference in code execution ( either way - a 5% improvement or otherwise ) to be nothing more than "noise".

For #2, for a properly constructed piece of code it will be known in advance that either:

a) disambiguation isn't necessary (as conflicts cannot occur)
b) disambiguation is required (as conflicts can occur)

In the case of b), generally a simple test can be performed (e.g. overlapping array sections), and then an alternate form of a) can be selected (or resort to safe reduction methods).

Disregarding the fact that such a crude way to perform disambiguation is, in my opinion, unacceptable, even that you wouldn't be able to do: remember that the loop we are discussing is not countable, so the loop count cannot be easily determined and therefore the array regions utilized by the loop cannot be calculated.

In your case you allocated the memory yourself, thus eliminating the need for disambiguation, but why do you think this represents a real-life situation? In real life, as I know it, routines are often called without any means to recreate the values of their parameters, thus making dynamic memory disambiguation necessary.

 

If you bothered to look at my code, you would notice that the partitioning is done after the fact of generating the output index. IOW, the input index is irrelevant; the output passes through a write-permitted filter based on the store index. Each thread has a specific output index range, and thus the threads cannot conflict.

The author of the code is responsible for defining any requirements on the input arguments (e.g. not permitted to overlap, can overlap where "B" follows "A" by at least vector width, can overlap where "A" follows "B" by at least vector width, or permitted to overlap, not allocated, allocated, not initialized, initialized, ...). Your position seems to be "don't know/care" and the code will test the situation and take appropriate action. Your position is valid for undisciplined programmers, which is alright if that is who you are dealing with.

My preference is if there is a possibility of conflicting usage, that I (the programmer) .not. rely on auto dependency testing with code path generation. Especially with multi threading. But this is my preference as I want to know, and control, what the code is doing.

Jim Dempsey

 

I did look at your code - I even had to fix it for it to run ( see posts #19, #23 ) - but I fail to understand how the way you arranged the calculation eliminates memory aliasing. While on the subject, what would you say if the line 'i = i + 2' were replaced by 'i = i - 2'?

The following is my presentation at the "Compiler, Architecture, And Tools Conference" ( see post # 9 ):

the full text of the presentation:

http://www.dalsoft.com/Autoparallelizer_for_a_multi_core_x86_architectur...

the slide show:

http://www.dalsoft.com/presentation_Autoparallelizer_for_a_multi_core_x8...

 

 

 

David,

What you have done for auto-parallelism is quite interesting, and once mature (bugs shaken out), very impressive. As to whether this technology can be incorporated into existing compilers (IP acquired) or sold as a stand-alone post-processor, I have some prior experience with this. I suspect that either nothing will happen or you might get some interest in IP acquisition. Selling or consulting for post-processing may attract only a few interested parties.

Back in the stone age (1990-1992) my minicomputer business faded away due to PCs disrupting the minicomputer business. Back in those days, a large portion of the market still ran MS-DOS in Real Mode or used a DOS Extender in a transitional Real/Protected mode. Windows (Protected Mode) was still coming of age. Looking to form a new PC-based software company, I asked myself what I could offer.

The first product was to port the Digital Equipment Corporation text editor TECO to the PC. This text editor is more capable than AWK, Brief and others at pattern matching and MUNGing up text. Mung is a self-referencing acronym: Mung Until No Good. Due to the steep learning curve (too many features) there was little interest in this product.

Then I thought, well, how can I get interest in this... ah... incorporate it in a tool for programmers. At that time I had gained considerable programming experience using Borland Turbo-C, and then Borland C++. During debugging, I'd often look at the disassembly code and noticed that although the Borland compilers produced optimized code, in places the code was non-optimal. I guess it was good enough for Borland, as their compiler was better than Microsoft's.

I thought to myself, hmm..., If I used the Borland compiler to produce assembler output, I could use my TECO editor to search out the non-optimal code sections and tidy them up. While the code patterns tended to be the same, the register assignments and branch labels would differ. TECO is capable of fuzzy match so I could search out a fuzzy large pattern (multiple lines) containing smaller fuzzy match patterns (registers and labels), then massage the fuzzy stuff and produce optimized replacement code. I called this the PeepHole Optimizer.

The above procedure was performed again for different non-optimal code sequences. After I thought I had found all, or enough of, the low-hanging fruit, I re-examined my output files only to discover that my tweaked-up code produced some sections that could benefit from further optimization using the same fuzzy match and rewrite macros. IOW, I could pass my output back through the same editing macros to yield further improvements (and add new macros to handle exception cases). A typical example of MUNG (without the NG).

The end process produced code that was typically 15% faster. I thought this would sell as a post-processor. I found little interest (maybe 100 sales). Some of my users asked if this would work on MS C/C++. So I gave it a try and, to my surprise, although the assembler code looked quite different between Borland and Microsoft, the fuzzy pattern matching found and corrected much of the code produced by the MS compiler and yielded similar performance improvements.

At this time the Intel 80386 CPU was quite popular. You had 16-bit MS-DOS, MS-DOS with DOS Extender (protected mode) and Windows (protected mode). IOW we had the MS-DOS/DOS-Extender market and the Windows market. MS-DOS could live in the low 1MB of RAM but was capable of using the additional RAM as a RAM-DISK via a driver. The DOS extenders could use the extra memory but only in 16-bit selector model. The CPU though was capable of 32-bit Flat Model mode but DOS could not run in that mode. Some of us programmers figured out that the CPU DS and ES selectors could be pre-conditioned to have Huge Granularity bit set permitting addressing indexed/based by 32-bit registers. The only way to perform the indexing was by modifying assembly code to instruct the assembler to emit the 32-bit address prefix byte. Borland TASM assembler would do this. At that time users could write code in assembler that would run 16-bit MS-DOS code with 32-bit addressing capability for data. The only problem was this required assembly coding.

Now then, I thought, I can optimize the assembly output of Borland and Microsoft C/C++ code by way of the assembler output, so with enhancement to the PeepHole optimizer I could add Flat Model programming. Side bar: at that time a 16-bit program (C/C++) could have a pointer, a 16-bit address off of the DS (Data Segment), a "far pointer" two 16-bit words one to be loaded into the ES (Extra Segment) register and the other to be used as a 16-bit index (from this pointer you could byte offset index +/- 32KB), and finally a "huge pointer", which has the two 16-bit words like the far pointer, but the compiler could generate code that the index would manipulate both the segment and offset thus permitting arrays larger than 32KB (but still confined within the 1MB box).

What I did was to pattern match for huge pointer manipulation, and convert this to 32-bit Flat Model, inclusive of using the CPU's SIB (Scale Index Base) capability. This made for a vast reduction in code that manipulated huge pointers and permitted 32-bit indexing. The next problem though was that the C Runtime Library heap manager was locked in the 1MB box (640KB or 720KB) as presented by MS-DOS. Because I could manipulate huge pointers, the only remaining issue was with the allocation. While it was an easy process to locate the malloc/free, identifying if the pointer receiving the malloc or supplied to free was a huge pointer or a far pointer was problematic. I knew the debugger had this information, so I required the C/C++ compiler to add the debug symbol table to the assembler output, and used text-editor macro code to search the debug symbol table to find out if the pointer was attributed as huge. If so, the allocation/free was redirected to my own memory manager whose heap could expand to all of physical RAM.

Now I had a product that could run in MS-DOS whose code, though restricted to the low 640KB, had access to all of physical RAM as data (4MB, 8MB, 16MB, ...) and executed 2x to 4x faster. Great, or so I thought. There were two marketing problems with this:

a) I was a little company and few businesses and/or research institute would deal with a small outfit (non-major vendor)
b) The product couldn't work inside Windows

The problem in Windows was a result of a poor decision on Intel's part when they implemented the Virtual 8086 Machine mode (V86) to run MS-DOS applications (this is a quasi Real Mode). While the memory management hardware supports a process mapped (via selectors) and, via the Page Tables, placed on arbitrary pages, someone at Intel thought it wise that any index that resulted in an address beyond 1MB from the base of the Selector (in V86 mode) was to generate a GP fault... regardless of whether that selector specified a size larger than 1MB. IMHO there was absolutely no reason for this, seeing that for a typical MS-DOS session the Selector could be allocated with a size of 1MB and any attempt to address outside of the box would generate a Page Fault. As such, my atypical MS-DOS session, while it could manipulate the granularity and potentially acquire the additional RAM for the Selector, was prohibited from addressing that memory.

What this long story is about is a cautionary tale: sometimes superior products fail.

Jim Dempsey

Jim,

 

Thank you very much for taking the time and sharing your story with me and all the readers of this forum – and what the story it is!

For me it might serve as a rude awakening, but I never really was asleep. I understand that a failure is a very possible outcome of my efforts, but instead of fretting about a likely flop I would like to focus on the possible success and how to make it happen. I don't know how much I can rely on your help and expect your assistance, but anything providing guidance for the marketing of my tool will be greatly appreciated.

In case this is going to be the last post, I would like to thank you for the most exciting and interesting dialog that we had and wish you Happy Holidays!

And, I can't resist one more request – wish me luck.

 

David Livshin

www.dalsoft.com

 

 

David,

I wish you all the luck in attaining some measure of reward for all your hard work. Personally I think your best possibility will be to see if one of the compiler vendors (read those that sell compilers) could be interested in acquiring your intellectual property. The Open Source community will want you to donate all your effort, and then provide support. I am not generally against Open Source, but someone has to pay your bills.

RE your text (.pdf)

I think it would help if you included some better examples and some charts, IOW a little more in-depth presentation (study) comparing OpenMP and dco. I think the major strength you have over OpenMP (which at times can be a weakness) is that with your technique the programmer need only specify that the loop is a candidate for parallelization, whereas with OpenMP the programmer may be required to supply clauses: private, shared, reduction, simd, loop count, etc. IOW, dco is somewhat the Parallel Programming for Dummies approach to yielding good results. Dco may benefit from additional directives that could eliminate some of your preamble code (e.g. determining whether the loop count is sufficient for parallelization).

In your pdf you showed:

do 140 i = 1, nk
  x1 = 2.d0 * x(2*i-1) - 1.d0
  x2 = 2.d0 * x(2*i) - 1.d0
  t1 = x1 ** 2 + x2 ** 2
  if (t1 .le. 1.d0) then
    t2 = sqrt(-2.d0 * log(t1) / t1)
    t3 = (x1 * t2)
    t4 = (x2 * t2)
    l = max(abs(t3), abs(t4))
    q(l) = q(l) + 1.d0
    sx = sx + t3
    sy = sy + t4
  endif
140 continue

as an example, and stated that the q(l) update has dependencies that make it hard or inefficient to parallelize using OpenMP. This is not true, and it should not be used as an exemplar. To wit:

!$omp parallel do private(x1,x2,t1,t2,t3,t4,l) reduction(+:q,sx,sy)
do i = 1, nk
  x1 = 2.d0 * x(2*i-1) - 1.d0
  x2 = 2.d0 * x(2*i) - 1.d0
  t1 = x1 ** 2 + x2 ** 2
  if (t1 .le. 1.d0) then
    t2 = sqrt(-2.d0 * log(t1) / t1)
    t3 = (x1 * t2)
    t4 = (x2 * t2)
    l = max(abs(t3), abs(t4))
    q(l) = q(l) + 1.d0
    sx = sx + t3
    sy = sy + t4
  endif
end do

The only change to the loop code was to modernize it by replacing the "do 140.../140 continue" with "do.../end do", then adding the single OpenMP directive to the modernized code. The reduction clause eliminates the race conditions for the shared objects without requiring atomic operations (or critical sections). It does add the expense of zeroing out the to-be-reduced data at the beginning of the loop, and of the combination code that performs the reduction at the end of the loop. As to how this compares with dco, that would be up to a test.
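For readers more used to C/C++, a minimal analogue of the same idea (illustrative only; the array-section reduction requires an OpenMP 4.5 compiler, and all names here are hypothetical):

#include <omp.h>
#include <cmath>
#include <algorithm>

// Illustrative only: each thread gets a private, zero-initialized copy of
// hist, sx and sy; the copies are combined when the loop ends. For a large
// hist, that per-thread copy is the memory cost raised later in the thread.
void tally(const double *t3, const double *t4, int n,
           double *hist, int nbins, double *sx_out, double *sy_out)
{
  double sx = 0.0, sy = 0.0;
  #pragma omp parallel for reduction(+ : hist[0:nbins], sx, sy)
  for (int i = 0; i < n; ++i)
  {
    int l = (int)std::max(std::fabs(t3[i]), std::fabs(t4[i]));
    if (l >= 0 && l < nbins)
      hist[l] += 1.0;
    sx += t3[i];
    sy += t4[i];
  }
  *sx_out = sx;
  *sy_out = sy;
}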

The example in the Utilized technology section was covered above in post #15. While I did have to modify the code, the modifications were slight and quite easy to introduce without bunging up the results. dco did not win on performance but it did win on ease of coding.

The cmppth example would be harder to manually code for use with OpenMP. Something like:

int cmppth (BIT *p, BIT *q, int ninputs )
{
  atomic<int> ret = INT_MAX;
  #pragma omp parallel reduction(min:ret)
  {
    int my_code = omp_get_thread_num() << 1; // 0, 2, 4, 6, ...
    #pragma omp for
    for (int i = 0; i < ninputs; i++)
    {
      if (p[i] != q[i])
      {
        if (p[i] < q[i])
        {
	  ret = my_code + 1; // +1 indicates return -1 for thread my_thread >> 1
          break;
        }
        else
        {
          ret = my_code;
          break;
        }
      } // !=
      if(ret < my_code)
        break;
    } // for
  } // parallel
  if(ret == INT_MAX)
     return (0);
  if(ret & 1)
     return(-1);
  return (1);
}

Jim Dempsey

Jim,

Thank you for your reply.

For the Fortran code, introducing a reduction of the array ‘q’ ( reduction(+:q) ) is unacceptable as a general solution – reduction creates a private copy of the array inside every thread, which, when ‘q’ is a "large" array, may be problematic, not to say fatal. dco came up with another solution that doesn't require memory allocation and is very efficient.

For ‘cmppth’, the code that you created has a ‘break’ inside the ‘omp for’ region, which is, as I understand it, not allowed. However, the problem is not that. I can easily generate a parallel version of this loop – actually I did, but couldn't verify it because the test cases for the “eqntott” benchmark I have contain a few tens of items, which is not enough to benefit from parallelization. The problem is not allowing iterations after the exit from the loop to be executed. It is not clear whether the code you proposed guarantees that. Also, ‘atomic ret’ may be expensive to implement - you set it once inside the loop ( which is fine ), but test it at every iteration ( which is something that I would try to avoid ).
Being on the subject, do you know of a “real” benchmark that has a critical loop with a break/return/goto inside it?
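For reference, the sanctioned OpenMP mechanism that approximates an early exit from a worksharing loop is cancellation (OpenMP 4.0 and later, enabled at run time with OMP_CANCELLATION=true). A minimal sketch, not tied to the cmppth code above:

#include <omp.h>

// Sketch only: threads stop taking new iterations once a hit is found;
// iterations already in flight still complete, so this is not an exact
// serial "break".
int find_first_hit(const int *v, int n)
{
  int hit = -1;
  #pragma omp parallel for shared(hit)
  for (int i = 0; i < n; ++i)
  {
    if (v[i] != 0)
    {
      #pragma omp critical
      {
        if (hit < 0 || i < hit)
          hit = i;
      }
      #pragma omp cancel for
    }
    #pragma omp cancellation point for
  }
  return hit;
}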

 

David Livshin

 

www.dalsoft.com

 


 

 

I agree that the cost of reduction for arrays may be too high for lightly loaded loops. I did/do not see any issue with break (C/C++) or exit (Fortran), but there would be an objection to a goto outside of the scope of the loop. The array reduction can be eliminated using the % filter as in #15, but this adds code and overhead. If dco performs better, this would be a good example to include.

The cmppth example is another bad example due to not being applicable to a real-world test case (unless you can find an in-use example with a very large ninputs).

Jim Dempsey
