Missing AVX-optimization of atan2f

  • Using Intel CC 14.0 under Visual Studio 2013 SP2

    • atan2f()

      • with AVX: 3.915 sec.
      • with SSE2: 0.800 sec.
    • atanf() is not affected
      • with AVX: 0.475 sec.
      • with SSE2: 0.626 sec.
  • atan2() is widely used when calculating with complex numbers (to get the phase); see the short sketch after this list.
  • Double precision seems to be affected too, but the numbers are not as clear as with single precision.
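A minimal sketch of the kind of phase-extraction loop meant in the list above (not from the original project; the function and array names are made up for illustration):

#include <math.h>

// Phase (argument) of the complex samples re[i] + i*im[i], written to ph[i].
void phase(const float* re, const float* im, float* ph, int n) {
  for (int i = 0; i < n; ++i) {
    ph[i] = atan2f(im[i], re[i]);
  }
}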

 

Simplified example code:

const int iterations = 100000;
const int size = 2048;
float* a = new float[size];
float* b = new float[size];
for (int i = 0; i < size; ++i) {
  a[i] = 1.1f;
  b[i] = 2.2f;
}


for (int j = 0; j < iterations; ++j) {
  for (int i = 0; i < size; ++i) {
    a[i] = atan2f(a[i], b[i]);
  }
}
for (int j = 0; j < iterations; ++j) {
  for (int i = 0; i < size; ++i) {
    a[i] = atanf(b[i]);
  }
}

 

Options (simplified from a real-world project):

  • using SSE:
    /GS /Qopenmp /Qrestrict /Qansi-alias /W3 /Qdiag-disable:"4267" /Qdiag-disable:"4251" /Zc:wchar_t /Zi /O2 /Ob2 /Fd"Release\64\vc120.pdb" /fp:fast  /Qstd=c++11 /Qipo /GF /GT /Zc:forScope /GR /Oi /MD /Fa"Release\64\" /EHsc /nologo /Fo"Release\64\" /Ot /Fp"Release\64\TestPlugin.pch"
  • using AVX:
    /Qopenmp /Qrestrict /Qansi-alias /W3 /Qdiag-disable:"4267" /Qdiag-disable:"4251" /Zc:wchar_t /Zi /O2 /Ob2 /Fd"Release\64\vc120.pdb" /fp:fast  /Qstd=c++11 /Qipo /GF /GT /Zc:forScope /GR /arch:AVX /Oi /MD /Fa"Release\64\" /EHsc /nologo /Fo"Release\64\" /Ot /Fp"Release\64\TestPlugin.pch" 

I see a significant increase in run time when adding /arch:CORE-AVX2 or /arch:AVX to the build options, with the 15.0 beta compiler as well (using VS2012).  I don't know if the differences are buried in the svml library, __svml_atan2f4 vs. __svml_atan2f8. A plain AVX box (which I don't have available) may be needed to see whether an svml performance regression appears with AVX as opposed to AVX2.

It would be preferable to try the cases with 32-byte alignment; otherwise results may not be consistent (even with SSE).  I didn't see this possible issue having an effect in my attempt.
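For what it's worth, a minimal sketch of allocating the test buffers with 32-byte alignment (the original test used plain new[]; _aligned_malloc is just one Windows-specific option):

#include <malloc.h>  // _aligned_malloc / _aligned_free in the MSVC/ICL CRT

const int size = 2048;
float* a = static_cast<float*>(_aligned_malloc(size * sizeof(float), 32));  // 32-byte aligned for AVX loads/stores
float* b = static_cast<float*>(_aligned_malloc(size * sizeof(float), 32));
// ... fill a and b and run the timing loops as before ...
_aligned_free(a);
_aligned_free(b);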

The AVX (or AVX2) version appears to perform about the same as with the Microsoft compiler.  So I will try to figure out whether the time is actually spent in __svml_atan2f8, or whether it fails to enter the vectorized loop version.

g++ (where there is no vectorization of atan) appears to run the case much faster, but I suspect it may be short-cutting timing loops.

I'm finding Windows 8.1 decidedly unfriendly as to how I might view the VTune screenshot.  Possibly that's because I had to remove NotePad due to other issues with it.

It shows me the following (Intel64 mode):

_svml_satan2_cout_rare 3.2s

_svml_atan2f8_l9             1.4s

_svml_atanf8                    0.2s

which appears to confirm that it frequently takes a non-vector branch inside svml_atan2f8. By contrast, the SSE version shows

_svml_atan2f4_h9     0.8s

_svml_atanf4_h9       0.5s

@Tim

>>>It shows me the following (Intel64 mode):

_svml_satan2_cout_rare 3.2s

_svml_atan2f8_l9             1.4s

_svml_atanf8                    0.2s>>>

Are these functions ordered in some caller-callee relationship?

I wonder if the compiler optimized the second loop (probably by removing it and calculating the atan2f values only once), since atan2f was operating on arrays filled with identical values.

 

Couldn't this be related to my report @ https://software.intel.com/en-us/forums/topic/516011 by chance?

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

Quote:

iliyapolak wrote:

@Tim

>>>It shows me the following (Intel64 mode):

_svml_satan2_cout_rare 3.2s

_svml_atan2f8_l9             1.4s

_svml_atanf8                    0.2s>>>

Are these functions ordered in some caller-callee relationship?

I believe those atan2 functions are called somewhere inside the svml library, as the reference in the compiled .obj is to the top-level svml entry point atan2f8().  There is no significant time spent in that entry-point function.

I think that the bulk of the computation is done by the "_svml_atan2f8_l9" function.

Quote:

iliyapolak wrote:

I wonder if the compiler optimized the second loop (probably by removing it and calculating the atan2f values only once), since atan2f was operating on arrays filled with identical values.

 

I think ICL is not short-cutting anything, but I do believe g++ does short-cut, as it runs fast (unless I set -O0) even though there is no vector math library.  By "second loop" you must not mean the atanf loop, which is certainly expected to be faster than atan2f.  As the original post implied, the AVX svml version of atan2f ought to be at least as fast as the SSE, where the AVX atanf could reasonably be close to double SSE speed and does show a good improvement.

Quote:

Marián "VooDooMan" Meravý wrote:

Couldn't this be related to my report @ https://software.intel.com/en-us/forums/topic/516011 by chance?

In this case, the compiler reports vectorization and builds in a call to the svml library, both for SSE2 and for AVX compilation, but the one AVX svml function turns out not to be vectorized effectively internally.  You could at least show us whether the compiler attempts to make a vectorized fmod() function call in your case and, if so, whether that fails to give a performance improvement.
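For reference, a minimal sketch of such a check (hypothetical file and function names, not from the thread): compile a small fmod loop with a vectorization report and look for a reported vectorized loop and an __svml_fmod* reference in the generated code.

// fmodtest.cpp -- hypothetical test case; compile with e.g.
//   icl /O2 /fp:fast /Qvec-report2 /Fa fmodtest.cpp
// then check the vectorization report and the generated .asm for an __svml_fmod* call.
#include <math.h>

void fmod_loop(float* a, const float* b, int n) {
  for (int i = 0; i < n; ++i) {
    a[i] = fmodf(a[i], b[i]);
  }
}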

Quote:

iliyapolak wrote:

I think that the bulk of the computation is done by the "_svml_atan2f8_l9" function.

That may be, but the overall effect shows no speedup over the Microsoft scalar function when the compiler builds in a call to the AVX vector function, while the SSE vector function works as expected.

Quote:

Tim Prince wrote:

Quote:

iliyapolak wrote:

I wonder if the compiler optimized the second loop (probably by removing it and calculating the atan2f values only once), since atan2f was operating on arrays filled with identical values.

 

I think ICL is not short-cutting anything, but I do believe g++ does short-cut, as it runs fast (unless I set -O0) even though there is no vector math library.  By "second loop" you must not mean the atanf loop, which is certainly expected to be faster than atan2f.  As the original post implied, the AVX svml version of atan2f ought to be at least as fast as the SSE, where the AVX atanf could reasonably be close to double SSE speed and does show a good improvement.

Sorry, I should have formulated my answer differently. My assumption was that the compiler would calculate the atan2f() call only once at compile time, probably by analysing the called function's arguments and understanding that their values would not change during the n loop iterations. I thought that the compiler could have gone further in its optimization efforts and simply eliminated the inner loop by calculating the atan2 values and filling the array at compile time.

I suppose that only the atan2f() calculation was done at compile time.

It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not.

The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), which I think should be directed at library implementation.  I hope the IPS web site may be available for a few days next week; the messages I received indicated no scheduled down time during the next 6 days.

>>>The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), >>>

Should we investigate at the assembly code level?

>>>It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not>>>

Yes, I agree with you. I think I will try to investigate the issue of compiler optimization.

So do you think that in the case of ICC both of the for-loops are preserved at runtime?

Quote:

iliyapolak wrote:

>>>The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), >>>

Should we investigate at the assembly code level?

I wouldn't suggest that.  I'll watch for the IPS premier web site to become available the next few days.

Quote:

iliyapolak wrote:

>>>It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not>>>

Yes, I agree with you. I think I will try to investigate the issue of compiler optimization.

So do you think that in the case of ICC both of the for-loops are preserved at runtime?

Yes, it appears that all the calls to svml functions are made, and the relative timings with icc should be meaningful.
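(One way to confirm that the svml calls are really present in the final binary is to generate a linker map file, e.g. with /link /MAP, and look for __svml_atan2f8 / __svml_atanf8 among the listed symbols; the VTune profile above already attributes time to those functions, which points the same way.)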

Quote:

Tim Prince wrote:

Quote:

iliyapolak wrote:

>>>The follow-up question in my mind is about whether anyone is interested enough to file a ticket for investigation of the SVML atan2f8(), >>>

Should we investigate at the assembly code level?

 

I wouldn't suggest that.  I'll watch for the IPS premier web site to become available the next few days.

OK, that seems a more reasonable thing to do.

Quote:

Tim Prince wrote:

Quote:

iliyapolak wrote:

>>>It's a legitimate concern whether simplistic timing tests like this are short-cut by the compiler seeing that results are discarded and need not be calculated, and I do think g++ -O does that in this case, but Intel C++ does not>>>

Yes, I agree with you. I think I will try to investigate the issue of compiler optimization.

So do you think that in the case of ICC both of the for-loops are preserved at runtime?

 

Yes, it appears that all the calls to svml functions are made, and the relative timings with icc should be meaningful.

I wonder if the function calls are present because of the array data type?

IIRC there is a second or even a third case where ICC did not remove function calls with constant arguments when an array type was present.

First of all, thanks to all for the quick confirmation.

Changing the second loop (the atanf at line 31) to be "self-referencing"

a[i] = atanf(a[i]);

should disable even gcc's loop-eliding optimizations. Optimizing away 100000 iterations of atan with known initial input is possible at compile time, but seems very unlikely to me. If that is happening (say, if the timing results are implausible), one may add some rand() initialization and print the average of the resulting array "a".

	const int iterations = 100000;
	const int size = 2048;
	float* a = new float[size];
	float* b = new float[size];

	for (int i = 0; i < size; ++i) {
		a[i] = rand();
		b[i] = rand();  // initialize b as well; the atan2f loop below reads it
	}

	for (int j = 0; j < iterations; ++j) {
		for (int i = 0; i < size; ++i) {
			a[i] = atan2f(a[i], b[i]);
		}
	}

	float averageA = 0.0f;
	for (int i = 0; i < size; ++i) {
		averageA += a[i];
	}
	averageA /= size;
	cout << "Average of array a: " << averageA << endl;


	for (int i = 0; i < size; ++i) {
		a[i] = rand();
		b[i] = rand();
	}

	for (int j = 0; j < iterations; ++j) {
		for (int i = 0; i < size; ++i) {
			a[i] = atanf(a[i]);
		}
	}

	averageA = 0.0f;
	for (int i = 0; i < size; ++i) {
		averageA += a[i];
	}
	averageA /= size;
	cout << "Average of array a: " << averageA << endl;

Using Intel CC 14.0, both loops (atanf and atan2f) are calling the SVML functions (__svml_atanf8 and __svml_atan2f8, respectively).

Thanks for mentioning IPS, I was not aware of it. I had thought this forum would be the appropriate way to file a bug.

Your IPS support account is the way to submit issues where you require security, or wish to be able to track the response without depending on a volunteer from the Intel team.  As this appears to be a library issue, it may not be the direct responsibility of Intel people who monitor this site regularly.

I'm still not getting a response from the SAVE step at IPS, and it's scheduled for down time at the end of the week.  I thought perhaps my input might be helpful since I set it up to verify on VTune. 

There seems to have been some sort of spam attack on Intel sites the last few days; why it's so important to some people to deny us the use of the sites beats me, if in fact there's a connection.

I have to say, I cannot reproduce the timing results for the SSE2 case anymore. Maybe that was a mistake on my part.

When using 64-bit code with AVX, the comparison between Intel CC and VC++ is interesting:

  • atan2f using VC++ is twice as fast as when using ICC (the missing optimization noted in my first post).
  • atan2f using VC++ is twice as fast as atanf when using VC++ (?! - I didn't notice that before, maybe related to SP3).

 

Using Intel CC 14.0, 64-bit, AVX (calling __svml_atanf8/__svml_atan2f8):

  • ATan:           0.443 GFLOPS ( 0.462 sec.)
  • ATan2:          0.052 GFLOPS ( 3.912 sec.)

Using VS2013SP3, 64-bit, AVX (calling atanf/atan2f):

  • ATan:           0.051 GFLOPS ( 3.991 sec.)
  • ATan2:          0.111 GFLOPS ( 1.847 sec.)
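(For reference, the throughput figures appear to be derived as (iterations × size) / time: 2048 × 100000 = 204.8 million atan evaluations per run, so e.g. 204.8e6 / 0.462 s ≈ 0.443 "GFLOPS", counting one evaluation as one operation.)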

 

 

OK, I needed to properly rebuild everything (I knew that but forgot).

Here are the results when using 32-bit with SSE2 (calling __svml_atanf4/__svml_atan2f4):

  • ATan:            0.333 GFLOPS ( 0.615 sec.)
  • ATan2:          0.280 GFLOPS ( 0.731 sec.)

 

Filed as issue 6000062158: "Missing AVX-optimization of atan2f (__svml_atan2f8)"
(Intel C++ Compiler for Windows, Medium, 08/11/2014)

>>>Optimizing away 100000 iterations of atan with known initial input is possible at compile time, but seems very unlikely to me. If that is happening (say, if the timing results are implausible), one may add some rand() initialization and print the average of the resulting array "a".>>>

I suppose that ICC could optimize away the inner for-loop by removing the call statements from the run-time code. Of course I do not expect further optimization, like compile-time array filling, which could eliminate the inner for-loop from the runtime code.

>>>I thought that the compiler could have gone further in its optimization efforts and simply eliminated the inner loop by calculating the atan2 values and filling the array at compile time>>>

I made a mistake in the quoted sentence. Of course the compiler will not fill in a dynamically allocated array at compile time, because of the new operator.

My Intel premier account is blocked.  I've been getting some help from the support team but still can't file this new issue.  The site is scheduled down tonight, so we will be waiting another week or two on this.

Submitted as Intel Premier issue 6000063006 during today's uptime between IPS site modifications.

Issue reported closed as a duplicate of another submission, without further comment.

Hello,

This issue is fixed with version 16.0.110.

AVX:
atan2(): 0.864195 seconds
atan(): 0.33743 seconds

SSE2:
atan2(): 0.93485 seconds
atan(): 0.457738 seconds

Bye,
Lars
