__m128 + std::vector + TBB = ??

I have been learning/working with TBB and have made some good progress. I am now using a sorted std::vector to hold my data and getting reasonable speedup with a simple parallel_for loop. I am now taking a closer look at some of the numerical calculations being done within the loop and would like to use the SSE2 intrinsic functions. This is the first time I have used SSE2-type calls.

The problem is that std::vector does not work (at least for me) with __m128. Microsoft Visual Studio gives the following error:

Error 1  error C2719: '_Val': formal parameter with __declspec(align('16')) won't be aligned  c:\program files (x86)\microsoft visual studio 10.0\vc\include\vector  888

I am using several __m128 variables in the class I am trying to create a vector of.

class __declspec(align(16)) Element {
public:
    __m128 u, v, vold;
};

int _tmain(int argc, _TCHAR* argv[])
{
    std::vector<Element> simpleVector;

    return 0;
}

Now searching the web you find that std::vector does not guarantee 16-byte alignment, so it won't work here. You also find guidance to write your own container with a custom allocator to align the memory, etc.

Is there a better solution to use TBB with SSE2 and a std::vector without having to write your own containers and memory allocators?

Does switching to the Intel C++ compiler fix this issue?

Is there an Intel std::vector library that works with __m128?

Really appreciate any guidance.


Why not create a container of pointers to your (_aligned_malloc'd) objects? Or, if you prefer, make a class that contains a pointer to your object, then create a vector of these pointer objects. The purpose is to have the dtor of the pointer object delete the object it points to. Very much like a vector of CString.

Jim Dempsey
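A minimal sketch of that owning-pointer idea, using C++17 std::aligned_alloc as a portable stand-in for MSVC's _aligned_malloc, with the Element layout reduced to a hypothetical placeholder:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical 16-byte payload standing in for the poster's Element.
struct Element { float u[4], v[4], vold[4]; };

// Owning wrapper: the destructor frees the aligned block it points to,
// much like each CString in a vector of CString owns its buffer.
class ElementPtr {
public:
    ElementPtr()
        : p_(static_cast<Element*>(std::aligned_alloc(16, sizeof(Element)))) {
        if (!p_) throw std::bad_alloc();
        new (p_) Element();                    // placement-construct in aligned storage
    }
    ~ElementPtr() {
        if (p_) { p_->~Element(); std::free(p_); }
    }
    ElementPtr(ElementPtr&& o) noexcept : p_(o.p_) { o.p_ = nullptr; }
    ElementPtr(const ElementPtr&) = delete;    // keep ownership unique
    ElementPtr& operator=(const ElementPtr&) = delete;
    Element* get() const { return p_; }
private:
    Element* p_;
};
```

A std::vector<ElementPtr> then holds only pointers, so the vector itself needs no special alignment; only the pointed-to blocks do.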

Quoting - jimdempseyatthecove

Why not create a container of pointers to your (_aligned_malloc'd) objects? Or, if you prefer, make a class that contains a pointer to your object, then create a vector of these pointer objects. The purpose is to have the dtor of the pointer object delete the object it points to. Very much like a vector of CString.

Jim Dempsey

Jim, a good idea. Pointers to my objects, or a new class containing a pointer to my object, would be an option. My initial concern is that while aligned, the actual objects may not be contiguous in memory. How much of a performance hit will this cause?

Searching today I found this sort-of-similar forum thread:
http://software.intel.com/en-us/forums/showthread.php?t=61651

Here the discussion mentions the use of the cache_aligned_allocator as part of the TBB.

Is there a std::vector container available that will take the cache_aligned_allocator and work with the SSE calls?
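For what it's worth, std::vector takes an allocator as its second template parameter, so tbb::cache_aligned_allocator can be plugged in directly as std::vector<Element, tbb::cache_aligned_allocator<Element>>. Below is a minimal hand-rolled 16-byte-aligned allocator, sketched without TBB so it stands alone; note an allocator only fixes the heap storage - on VS2010 the C2719 error comes from pass-by-value parameters and can persist for types with __m128 members regardless of the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Minimal 16-byte-aligned allocator (a stand-in for
// tbb::cache_aligned_allocator, which plugs into std::vector the same way).
template <class T>
struct aligned16_allocator {
    using value_type = T;
    aligned16_allocator() = default;
    template <class U> aligned16_allocator(const aligned16_allocator<U>&) {}

    T* allocate(std::size_t n) {
        // aligned_alloc requires the byte count be a multiple of the alignment
        std::size_t bytes = (n * sizeof(T) + 15) / 16 * 16;
        void* p = std::aligned_alloc(16, bytes);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <class T, class U>
bool operator==(const aligned16_allocator<T>&, const aligned16_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const aligned16_allocator<T>&, const aligned16_allocator<U>&) { return false; }
```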

I have found the btAlignedObjectArray from http://www.bulletphysics.com. This would be a good start, but again they use their own allocator.

Your initial message expressed a concern about alignment issues for use with the SSE instruction set. From this I infer that you have a significantly sized array within your objects such that alignment for use with SSE is an issue. Should this not be the case then please describe your circumstances.

When using a significantly sized array within your objects, the overhead of one additional dereference is generally insignificant. It should be relatively easy for you to place a wrapper around a pointer and run a test to determine whether the overhead is significant.

When sizing your objects such that they are a multiple of the alignment requirement (pad if necessary), an aligned_malloc can be used to allocate numerous same-sized objects (say at initialization time) into a pool. This pool can be managed with an allocator of your own or, if you prefer, you can experiment with TBB's cache aligned allocator. Done in this manner, adjacent objects in your pointer vector will, with a high degree of probability, reside within the same page table. So some advantage comes from re-use of the limited number of TLBs (Translation Lookaside Buffers). A vector of pointers will put higher "pressure" on the TLBs, which will be more of a concern than the cache effect. Test runs (with your data) will be the only way to ascertain whether a vector of pointers to aligned objects is more or less effective than a vector of unaligned objects.

There is one other technique you could explore.

This will require some description.

You can create a loose bag container object within which the object of your design lies. Your object is aligned but the bag is not. The loose bag container object is conceptually a character array of size = sizeof(yourObject) + alignmentRequirement - 1, and the getYourObject member function returns the lowest address within the bag that meets your alignment requirement.

Although this sounds contrived, with optimizations the extra code injected should be distilled down to an AND with -alignmentRequirement. This comes at the cost of adjacent objects having a pad between them (and additional memory requirements), which in turn increases the pressure on the TLBs.
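A minimal sketch of the loose bag, assuming a power-of-two alignment (the name Bag is made up; copying is disabled because a raw byte copy would leave the contained object at the wrong offset inside the new bag):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// "Loose bag": deliberately unaligned storage big enough to carve an
// aligned T out of, per Jim's description.
template <class T, std::size_t Align>
class Bag {
    unsigned char raw_[sizeof(T) + Align - 1];   // sizeof(T) + alignment - 1
public:
    Bag() { new (get()) T(); }                   // construct T at the aligned spot
    ~Bag() { get()->~T(); }
    Bag(const Bag&) = delete;                    // raw copy would misplace T
    Bag& operator=(const Bag&) = delete;

    T* get() {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw_);
        // round up to the next multiple of Align - the AND of -Align Jim mentions
        return reinterpret_cast<T*>((p + Align - 1) & ~std::uintptr_t(Align - 1));
    }
};
```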

There is not an easy "one answer fits all" solution.

Jim Dempsey

Hi Jim,

Appreciate your time. I am working on updating some computationally intensive C code I wrote 12 years ago. As many people have pointed out, "the free lunch is over".

My objects (will) contain 1x int64, 9x __m128, 8x double, and 1x bool.

The algorithm is a fairly typical numerical simulation: a) an initial setup, b) followed by millions of lookups and calculations between those objects, c) a few additions and deletes, d) a re-sort of the array (if I am using a sorted vector), and e) back to step b. I've looked at std::multiset and boost::flat_multiset, and right now am also working with a simple sorted std::vector, which actually works quite nicely using tbb::parallel_sort to update.

I'm running on an Intel i7 machine and have been able to structure my code so I can use parallel_for safely. I am getting a speedup of about 4, which flattens out after 5 threads are in use.

Right now about 35% of the time is spent in the actual calculations, hence my interest in the SSE calls - I see great potential there. Another 35% is spent in the lower_bound function looking up the elements, the majority of which should fall within the division of the current task in the parallel_for. I'm guessing that the basic lower_bound function is jumping all over memory in its binary search.

The container needs to be able to look up elements insanely fast by a key value, work with __m128 Intel SSE calls, and also work with Intel TBB::parallel_for, and run on an Intel i7 machine.

Is it fair to summarize that if I am going to use the SSE* calls I cannot use the std library containers unless I use your trick with pointers mentioned above? And that any containers I do write need careful attention to memory allocation etc. to work optimally with the TBB parallel functions?

Regards

I assume your 1 x int64 is your key. The bool is likely there to indicate yes/no for process/don't process.

And I assume speed matters.

Prior to beginning you may have the means to calculate the peak number of objects, or at least be able to determine a good working set. With knowledge of the maximum number of objects you can construct a good key generation routine.

The type of key you generate depends on what you want to do with the key. Since you mention binary search and occasional sort, I will assume you are sorting these objects by key, then binary searching the array (which may or may not have duplicate keys).

I suggest that you consider a hash code system whereby the key you generate for an arbitrary set of relationships has a high probability of being unique. Use that hash code as an index into a non-fully-populated array. Usually the key is structured to have a dynamic range of 115% or more of the number of elements; since memory is "cheap", you might select a table size that is the power of 2 above 1.15x the number of data objects. When your keys have no duplicates there is no search - the key is the index to the data. Should you permit duplicate keys, you need to construct a collision resolution method. Design the key generation and the dynamic range of the key such that collisions are so rare that the overhead of collision resolution is insignificant.

The organization I would suggest for maximum performance (subject to change depending on how you manipulate the data):

int* KeyToIndex; // (or __int64*) allocate to dynamic range of key
// your hash key is used to index this table
// -1 = not used
// msb = your bool flag
// next to msb = flag to indicate hash key collision
// 0:n = one object generated key, this is its index

Then you need to explore using a single array of objects containing your 9x __m128 + 8x double, or two arrays of different component objects (one 9x __m128, the other 8x double), or 9 arrays of __m128 and 8 arrays of double, etc., with each array preallocated to the dynamic range of the key.

During initialization, objects are constructed in sequential positions of your data arrays. A key is generated and used to index the KeyToIndex table. If the entry is not used, insert the index of the newly constructed object; if it is used, or indicates a prior collision, then perform your collision resolution routine.
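A rough sketch of such a KeyToIndex table (names hypothetical; linear probing chosen as one simple collision-resolution method, and keys assumed unique for brevity):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Power-of-two table mapping a hashed key to an object index; -1 marks an
// unused slot. Keys are kept alongside so probing can verify a match.
struct KeyToIndex {
    std::vector<std::int64_t>  index;   // object index, or -1 if unused
    std::vector<std::uint64_t> key;     // valid only where index != -1
    std::size_t mask;

    explicit KeyToIndex(std::size_t sizePow2)
        : index(sizePow2, -1), key(sizePow2, 0), mask(sizePow2 - 1) {}

    // Mix the key so nearby grid keys spread across the table
    // (64-bit finalizer constant borrowed from MurmurHash3).
    std::size_t hash(std::uint64_t k) const {
        k ^= k >> 33;
        k *= 0xff51afd7ed558ccdULL;
        k ^= k >> 33;
        return static_cast<std::size_t>(k) & mask;
    }

    void insert(std::uint64_t k, std::int64_t objIndex) {
        std::size_t h = hash(k);
        while (index[h] != -1)              // collision: probe the next slot
            h = (h + 1) & mask;
        index[h] = objIndex;
        key[h]   = k;
    }

    std::int64_t find(std::uint64_t k) const {
        std::size_t h = hash(k);
        while (index[h] != -1) {
            if (key[h] == k) return index[h];
            h = (h + 1) & mask;
        }
        return -1;                          // not present
    }
};
```

With the key dynamic range sized well above the object count, the probe loops almost never iterate, matching Jim's point that collision-resolution overhead becomes insignificant.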

Entries in the data object array(s) that are not yet assigned must be flagged accordingly (e.g. zeroed, NaN'd, etc.).

You mentioned that one part of your millions of iterations randomly searches by key; this can now use the hash as an index to fetch the object index and/or test the bool flag.

You may have a different section of code that processes all data regardless of key. This section can process the large object array(s) directly without lookup.

Note, until you delete objects, this process loop can iterate without consideration for (conditional branches for) junk/deleted data. There is no object lookup, just fast simple SSE code.

When you delete objects, and if you precondition the data within the deleted object properly (e.g. zeroed or NaN'd), then your fast simple SSE code can compute through the deleted objects without test-and-branch control. Note, this may require you to insert into your fast SSE loop code that generates an SSE mask for the store (i.e. a mask that permits the store for a valid object and inhibits it for an invalid object). Again, without seeing your computation requirements it is hard to say whether a conditional branch will work better than executing through with a mask on store.
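A small illustration of that mask-on-store idea, assuming deleted slots are NaN'd, the array length is a multiple of four, and an x86 target with SSE (the function name is made up):

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>   // SSE intrinsics

// Overwrite every valid (non-NaN) lane with `value`, computing straight
// through deleted (NaN'd) slots with no conditional branches; the mask
// on the blended store keeps the NaN sentinel in deleted lanes.
void fill_valid(float* data, int n, float value) {
    __m128 v = _mm_set1_ps(value);
    for (int i = 0; i < n; i += 4) {              // n assumed a multiple of 4
        __m128 cur   = _mm_loadu_ps(data + i);
        __m128 valid = _mm_cmpord_ps(cur, cur);   // all-ones where cur is not NaN
        // new value in valid lanes, old (NaN) value elsewhere
        __m128 out = _mm_or_ps(_mm_and_ps(valid, v),
                               _mm_andnot_ps(valid, cur));
        _mm_storeu_ps(data + i, out);
    }
}
```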

Best Regards,

Jim Dempsey

Quoting - jimdempseyatthecove

I assume your 1 x int64 is your key. The bool is likely there to indicate yes/no for process/don't process.

And I assume speed matters.

Prior to beginning you may have the means to calculate the peak number of objects, or at least be able to determine a good working set. With knowledge of the maximum number of objects you can construct a good key generation routine.

The type of key you generate depends on what you want to do with the key. Since you mention binary search and occasional sort, I will assume you are sorting these objects by key, then binary searching the array (which may or may not have duplicate keys).
...
Best Regards,

Jim Dempsey

Hi Jim,

I'll need to read the rest of your post a few times before responding. Yes, speed matters.

The int64 is my key. It is calculated based on a 3d grid:
key = i*maxgrid*maxgrid + j*maxgrid + k;
I can have multiple items in each grid cell (i, j, k), and therefore with the same key value. I need to look up the items in grid cell (i, j, k) millions of times, so I calculate the key value, call lower_bound with the key, and then loop through the elements until the key value is no longer what I am looking for or I reach the end of the list. I avoid a call to upper_bound this way. At the end of each iteration the i, j, k values of each element are updated (the items move in space), so I calculate a new key value, re-sort the array, and start the calculations again.
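That lookup pattern might look roughly like this (Item, maxgrid, and the payload field are illustrative; the vector is assumed sorted by key):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Item {
    std::int64_t key;      // i*maxgrid*maxgrid + j*maxgrid + k
    float payload;
};

std::int64_t grid_key(std::int64_t i, std::int64_t j, std::int64_t k,
                      std::int64_t maxgrid) {
    return i * maxgrid * maxgrid + j * maxgrid + k;
}

// Visit every item in one grid cell: lower_bound finds the first match,
// then a linear walk continues while keys match (no upper_bound needed).
template <class F>
void for_each_in_cell(const std::vector<Item>& items, std::int64_t key, F visit) {
    auto it = std::lower_bound(items.begin(), items.end(), key,
        [](const Item& a, std::int64_t k) { return a.key < k; });
    for (; it != items.end() && it->key == key; ++it)
        visit(*it);
}
```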

Yes I have a good idea of the maximum number of items in the simulation at the start of the algorithm.

Best Regards,

Andrew

Andrew,

You may have problems using std::vector or a tbb:: type of vector, as they need some improvement at popping an arbitrary entry out of the vector (efficiently). You might be able to modify one of these vectors, but I think you can do a better job reworking something like CString.

I suggest you NOT store the objects in the vector/CString derivation. Why move an object when you can move a selector (index or pointer)? The elements stored in the vector/CString derivation should be as small as possible, sufficient to identify the object (or slices of the object), and contain your control flow. See if you can fit this within one __int64. The rework of vector/CString needs a fast extract-element-from-arbitrary-position without regard to the order of elements. IOW:

T extract(size_t i) {                // unordered removal: O(1)
    T result = data[i];
    if (i != count - 1)
        data[i] = data[count - 1];   // copy last element into the hole
    --count;                         // one fewer element; order not preserved
    return result;
}

Then the add-element operation on the vector/CString derivation uses the standard append.

Note, the key is not stored.

If you grow your vector/CString derivation in gulps then allocations will be infrequent, and they get less frequent further into the simulation. You could use the vector lengths developed during early runs as initial sizes for subsequent runs (or develop a function to derive a good initial size based on the number of cells and number of objects).

If your objects are only permitted to move to an adjacent cell, then the code to facilitate object movements can be lock-free if you schedule each thread to work paths separated by two cells (in all directions). Since you might experience localized bursts of movement, I suggest you consider using milestone markers. When a thread reaches a marker it sets a flag indicating it has completed up to that point; this setting can be done without a LOCK. The milestone flag is used later to indicate whether it is safe for a thread to traverse adjacent to the path protected by the milestone. Done correctly, you can keep all threads busy for all but the last few paths.
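One possible shape of that milestone idea, reduced to two adjacent paths (all names hypothetical; a real scheduler would cover many paths, but the publish/check pattern is the same):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The leader publishes its progress with a plain atomic store (no lock);
// the trailing thread only advances a cell once the leader is safely
// `gap` cells ahead (or finished), keeping the two paths separated.
void run_two_paths(int cells, int gap, int& leadDone, int& trailDone) {
    std::atomic<int> lead{0};
    std::thread leader([&] {
        for (int c = 0; c < cells; ++c)
            lead.store(c + 1, std::memory_order_release);  // milestone: done to c
        leadDone = cells;
    });
    int done = 0;
    for (int c = 0; c < cells; ++c) {
        for (;;) {                                         // spin until safe
            int l = lead.load(std::memory_order_acquire);
            if (l >= cells || l > c + gap) break;
        }
        ++done;                                            // work on cell c here
    }
    trailDone = done;
    leader.join();
}
```

In a real run the wait would rarely trigger; it only throttles the trailing thread during a localized burst, which is the case Jim describes.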

Jim Dempsey
