Auto-vectorization and std::vector

Auto-vectorization and std::vector

Hi,

I would like to understand why the Intel compiler happens to fail to vectorize some basic loops when using std::vector. The following code is testing 3 different "arrays" : C-style array, an homebrewed vector class, and the std::vector class. I am timing a loop that just does v[i] =i for all the elements of the array. The results are the following :

fayard@speed:Desktop$ icpc -std=c++11 -Ofast vector-simd.cpp -o main
fayard@speed:Desktop$ ./main
std::vector:	362787
999999
HomeMade:	164704
999999
C-array:	166045
999999

fayard@speed:Desktop$ g++-4.9 -std=c++11 -Ofast vector-simd.cpp -o main
fayard@speed:Desktop$ ./main
std::vector:	186377
999999
HomeMade:	179809
999999
C-array:	176598
999999

A quick look at the assembly code (or even with -vec-report2) proves that the Intel Compiler does not vectorize the loop with std::vector. As you can see, gcc 4.9 has no problem doing it. I would like to understand :

-  Why does my own vector class is fine for vectorization, and std::vector does not ? I would like to find some change in my class so that it prevents vectorization (in order to understand what happen with std::vector), but I can't. I've tried to move it to another file, and even put the getter in the .cpp file instead of the .h file, but careful compiling with -ipo still triggers the vectorization. Can anyone give me a hint ?

- Is the Intel Compiler the culprit or the standard library ? As far as I understand icpc uses the libc++ library from clang (I am on OSX), and g++-4.9 uses the libstdc++ library. I have tried to make icpc use the libstdc++ library and it does not vectorize, but it takes the library that is installed on OSX, not the one that I've compiled with gcc 4.9. Is there a way to make icpc use the standard library that I've compiled ?

- Can anyone find a code where my homemade class does not vectorize and the C-array does ?

Thanks for your help,

François

#include <iostream>
#include <vector>
#include <chrono>

class MyVector {
private:
    int n_elements;
    int* data;
public:
    MyVector(int in_n_elements) {
        n_elements = in_n_elements;
        data = new int[n_elements];
    }
    int& operator[](size_t i){
        return data[i];
    }
};


int main (int argc, char const *argv[])
{
    const int n_elements {1000000};
    const int n_iterations {1000};
    
    {
        std::vector<int> v(n_elements);
    
        std::chrono::steady_clock::time_point timeStart, timeEnd;
                timeStart = std::chrono::steady_clock::now();
        for(size_t i = 0; i < n_iterations; ++i)
        {
            for(size_t j = 0; j < n_elements; ++j)
            {
                v[j] = j;
            }
        }
        timeEnd = std::chrono::steady_clock::now();
        std::cout << "std::vector:\t" <<
            std::chrono::duration_cast<std::chrono::microseconds>(timeEnd -
            timeStart).count() << std::endl;
    
        std::cout << v[n_elements-1] << std::endl;
    }
    
    
    {
        MyVector v(n_elements);
    
        std::chrono::steady_clock::time_point timeStart, timeEnd;
                timeStart = std::chrono::steady_clock::now();
        for(size_t i = 0; i < n_iterations; ++i)
        {
            for(size_t j = 0; j < n_elements; ++j)
            {
                v[j] = j;
            }
        }
        timeEnd = std::chrono::steady_clock::now();
        std::cout << "HomeMade:\t" <<
            std::chrono::duration_cast<std::chrono::microseconds>(timeEnd -
            timeStart).count() << std::endl;
    
        std::cout << v[n_elements-1] << std::endl;
    }
    
    {
        int v[n_elements];
    
        std::chrono::steady_clock::time_point timeStart, timeEnd;
                timeStart = std::chrono::steady_clock::now();
        for(size_t i = 0; i < n_iterations; ++i)
        {
            for(size_t j = 0; j < n_elements; ++j)
            {
                v[j] = j;
            }
        }
        timeEnd = std::chrono::steady_clock::now();
               std::cout << "C-array:\t" <<
                   std::chrono::duration_cast<std::chrono::microseconds>(timeEnd -
                   timeStart).count() << std::endl;
    
        std::cout << v[n_elements-1] << std::endl;
    }
    
    return 0;
}

 

17 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

icpc normally in-lines code from the headers and then attempts to vectorize using its own mechanisms.  In your case, I see, as with g++, there appears to be source code appended, where additional vector code is generated.  I don't know whether clang operates like g++, but I don't expect icpc to rely on separate compilation of STL with any of them.

You should be able to verify where the source code is picked up and whether iit is copied verbatim by saving pre-processed source (and perhaps see what difference in STL source code may inhibit optimization).

In my observation, source code line 36-38 is not vectorized until I set #pragma simd (I suppose the cost analysis may be close) while the other 2 don't require a pragma (as you seem to indicate).

If you don't want vectorization you can add -no-vec to compiler options or set #pragma novector.

I might have thought that the locally defined large loop count, or a #pragma loop count, might sway cost analysis, but apparently they don't.. 

I'm a little surprised that the compiler doesn't appear to shortcut your first 2 cases, as it does (for me) with the last.

Hi Tim,

I am running MacOSX Mavericks and I have tried this test with Clang. Here are the results.

fayard@speed:Desktop$ clang++ --version
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
fayard@speed:Desktop$ clang++ -Ofast -std=c++11 main.cpp -o main
fayard@speed:Desktop$ ./main
std::vector:	163083
999999
HomeMade:	159202
999999
C-array:	160498
999999
fayard@speed:Desktop$ icpc --version
icpc (ICC) 14.0.3 20140415
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
fayard@speed:Desktop$ icpc -Ofast -std=c++11 main.cpp -o main
fayard@speed:Desktop$ ./main
std::vector:	355755
999999
HomeMade:	160455
999999
C-array:	169081
999999

I have tried to save the preprocessed file with

icpc -E -C -Ofast -std=c++11 main.cpp  > file.cpp

and it uses the STL from the system "/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/c++/v1/vector" which is the same used by clang. So I don't understand why icpc does not vectorize this.

Also, my experience is that icpc usually does a poor job at vectorizing std::vector. Is that expected ?

Best regards,

Francois

Best Reply

Hi velcia,

I reproduced your error. It looks like a real error. I've filed tracker.

--Sergey

15.0 compiler may remove dependency on -ansi-alias in vectorization of std::.

Thanks for your answers.

it would be nice to have a clear statement from Intel on what we loose using std::vector compared to C heap-allocated vectors. For instance all Intel performance documents (for instance the one on auto vectorization) use only C-arrays (pointers) even for "C++" code, which is not considered C++-style of doing things.

So, could we expect the same performance from std::vector and C-array in the near future, without using pragmas ?

 

 

Fixing on a small part of the issue you raised: 

It seems surprising that -fast options didn't set -ansi-alias.  This is important for those cases where the rules should enforce non-aliasing between data and other contents of a container.  No, I'm not good at C++-speak.

The plan of making -ansi-alias a default for linux was announced to be effective with icpc 15.0, as that is consistent with g++ default.  non-linux targets were announced as an exception to this change.  I haven't seen these changes settling down, but it seems a clear improvement, given the difficulty of explaining why Intel compilers are so aggressive by default with this one exception.

icpc prefers not to include peeling for alignment when vectorizing container implementations.  I don't know all reasons why it would be difficult to use the same rules as for plain C code, but it may be so.  The whole iterator scheme made things difficult for optimization (by calculation of loop count) with the only apparent motivation being the generalization to linked lists and the like.  Non-alignment (with otherwise effective vectorization) should be at most a 2x loss for the current CPUs which aren't excessively dependent on special-casing for alignment (as compilers for corei7-2 and -3 avoid 256-bit unaligned moves anyway).  Peeling for alignment also may be dropped when the compiler wants to Multiversion, as otherwise code expansion might be excessive.  Those small deficits do detract from using C++ code for bragging documents.

I don't see the Microsoft VS2012 compiler auto-vectorizing anything involving STL nor OpenMP nor achieving any alignment optimizations when it does auto-vectorize.  I suppose the Microsoft comparison is among the more important ones for C++.

I'm looking at a case where opportunities for interprocedural optimization involve the Intel compiler reporting Multiversioned and VECTORIZED, yet the container version doesn't achieve vectorized performance like plain C  (where I use #pragma GCC ivdep).  I'd tend to agree that these cases don't get the marketing attention which might be implied by including them in performance optimization documents.  If your position is like mine, I don't see value in achieving a VECTORIZED report without performance gain, as marketers appear to do.

My cases where macros are used to change names of pointers are showing similar issues with Multiversioned and VECTORIZED not gaining performance, even though g++ has no difficulty demonstrating full performance.  I suppose it's considered bad practice to rename stuff in STL containers that way.

So I've been able to verbalize more than you'd like on just this small part of the subject.

 

Hi Tim,

Thank you so much for this interesting explanation. For -ansi-alias, I would have thought that this is set by default. I'll add it to my "release" configuration.

I would be very interested to know why loop-peeling is not taken care of by the optimizer when using containers such as std::vector. I anyone from the C++ compiler team could give use a hint on why this is not done, it would be nice.

François

Part of the problem, as Tim states, is the opaqueness of the vector and the use of the iterator. When the vector does not alter during the loop, you can obtain the base of the buffer used by the vector (assuming the complete vector is contiguous) and obtain the .begin and .end of the iteration space. From there you can perform a parallel loop on what is inside the vector using the pre-fetched values. Often you can get vectorization by simply fetching the .begin and .end then using the []. But this depends on the T(s) stored in the vector.

Jim Dempsey

www.quickthreadprogramming.com

Hi Tim,

Thanks, the -ansi-alias option makes a big difference to some of my functions. For instance

void add_vector(std::vector<double>& a, const std::vector<double>& b, const std::vector<double>& c)
{
	for(int i = 0; i < a.size(); ++i)
	{
		a[i] += b[i] + c[i];
	}
}

does not get vectorized without the -ansi-alias option, and is correctly vectorized with the -ansi-alias option. But I can't figure out why.

My understanding is that -ansi-alias prevents two pointers of different types to point to the same memory location. On my machine, a std::vector<double> is made of 3 pointers to a double (for "begin", "end", and "reserved end"). So, behind the scene, I only have pointers to double here. So I don't understand why -ansi-alias helps.

Francois

-ansi-alias asserts that your program complies with the standard about not aliasing incompatible data types.  It's the same as gcc option -fstrict-aliasing, which has been a default for over a decade.   In the Microsoft tradition, it's necessary for a compiler to protect against all kinds of hidden aliasing regardless of whether they violate C and C++ standards, including the possibility that any updates inside a container may alter any other component of the container.  As mentioned above, Intel is just beginning to introduce compilers which don't need this option, at least not on linux.  Otherwise, each update of a component of a[] requires, in your example, another check of a.size(), and the loop can't be optimized by using a.size() to determine loop count before entering the loop.

Intel compilers seem less inclined to flag obvious violations of -ansi-alias than other compilers, but as long as there is a requirement to turn such cases which are undefined behavior in the standards into defined behavior, many important optimizations have to be avoided due to the impossibility of detecting many violations.

Thanks Tim,

Coming from Fortran, I am not that used to the tricky business of pointers. I hope I got it right now. Here is a sample program that explains what I think is going on:

class Vector {
private:
	double* data_;
	int n_;
public:
	inline Vector(int n)
	{
		n_ = n;
		data_ = new double[n];
	}
	inline int size() const
	{
		return n_;
	}
	inline double& operator[](int k)
	{
		return data_[k];
	}
	inline ~Vector()
	{
		delete[] data_;
	}
};

void func_carray(double* a, int n)
{
	for(int i = 0; i < n; ++i)
	{
		a[i] += 1.0;
	}
}

void func_carray_size_reference(double* a, int& n)
{
	for(int i = 0; i < n; ++i)
	{
		a[i] += 1.0;
	}
}

void func_vector(Vector& a)
{
	for(int i = 0; i < a.size(); ++i)
	{
		a[i] += 1.0;
	}
}

and here are the result of the compilation

fayard@speed:Ansi-Alias$ icpc -c -vec-report3 add.cpp -o add.o
add.cpp(26): (col. 2) remark: LOOP WAS VECTORIZED
add.cpp(34): (col. 2) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
add.cpp(43): (col. 23) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
fayard@speed:Ansi-Alias$ icpc -c -ansi-alias -vec-report3 add.cpp -o add.o
add.cpp(26): (col. 2) remark: LOOP WAS VECTORIZED
add.cpp(34): (col. 2) remark: LOOP WAS VECTORIZED
add.cpp(43): (col. 23) remark: LOOP WAS VECTORIZED

Without -ansi-alias, func_vector is not vectorized because the compiler might take into account the fact that a "funny guy" could have done :

data_ = (double*) (&n_);

which would change the size of a when writing to a[0], which obviously prevents vectorization. Of course, this program does not respect the strict aliasing rule which is why func_vector gets vectorized with the -ansi-alias keyword.

On the other hand func_carray gets vectorized right away as there is no way that the pointer a could point to n as this one has been created on the stack when the function is called.

On the other hand, func_carray_size_reference could not be vectorized right away as some funny guy could have done

a = (double*) (&n);

before calling the function. But it is vectorized with the -ansi-alias keyword.

I hope this explanation is correct.

Francois

 

PS : I have found a big flaw in my explanation. What does prevent us to do

a = (double*) (&a)

before calling the function func_carray in a program that does not respect strict aliasing ? This would change what a points to when writing "a[0] = something" and therefore prevents vectorization. But it does not as func_carray is vectorized even without -ansi-alias. Now, I am lost. Any help on this topic would be nice.

If you need verification that the subject isn't written up in a satisfactory way, you could start with the Wikipedia on type punning (where there are anonymous criticisms of the presentation).

a = (double*) (&a);

is already a violation of standard, which gcc should flag, at least when the right options are set.  I'm not certain there are options to make icc flag it.  As an old Fortranner myself, I have extra reasons not to pursue whether the "legal" alternatives to the above are any safer.

You may have seen the thread a few days ago where I wondered why icpc supports the common __restrict extension without satisfactory documentation, while the documented extension of C99 restrict in icpc isn't portable.

 

Quote:

Sergey Melnikov (Intel) wrote:

I reproduced your error. It looks like a real error. I've filed tracker.

Could you please give me a tracking number?

TIA!

 

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

Quote:

Tim Prince wrote:

a = (double*) (&a);

is already a violation of standard, which gcc should flag, at least when the right options are set. 

Chris Lattner, one of the leader of LLVM/Clang, has a blog article just about that which states that this should be accepted in a C without the strict aliasing rule : http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

So what the Intel compiler does by default is still unknow and undocumented because it should not have vectorized the C-Array version without -ansi-alias.

 

I don't understand why Lattner indicated that case should not optimize. -ansi-alias does come into play if it's changed to P [j][I] = 0.f
As then there are pointers which could be stepped on if floats were permitted to do so.

Hi Marián "VooDooMan" Meravý,

Tracker #359613.

--Sergey

Leave a Comment

Please sign in to add a comment. Not a member? Join today