Memory alignment

bishgada's picture

Hi all,
I hope this is the right forum to ask about this issue.

I read in several articles (~1 year old) about how to align memory addresses so they fit cache lines. I was wondering why this is required. Doesn't the compiler know the best alignment for the addresses? Couldn't the dynamic allocator decide the best alignment at run time, depending on the specific HW it runs on? Is it possible for the processor, at run time, to detect frequent misses when addressing specific addresses in a small sequence of operations (i.e. detect a small and frequent "pattern" that really hurts performance) and decide to move the data to aligned addresses internally (completely transparent to the user)?

The second issue is how to detect the potential performance gain from memory alignment in a given piece of code. I assume that running VTune and seeing tons of cache misses might indicate a memory alignment issue. Is there a better or more accurate indication and solution for this issue?

thanks,
Guy.

mami's picture
Quoting - bishgada Hi all,
I hope this is the right forum to ask about this issue.

I read in several articles (~1 year old) about how to align memory addresses so they fit cache lines. I was wondering why this is required. Doesn't the compiler know the best alignment for the addresses? Couldn't the dynamic allocator decide the best alignment at run time, depending on the specific HW it runs on? Is it possible for the processor, at run time, to detect frequent misses when addressing specific addresses in a small sequence of operations (i.e. detect a small and frequent "pattern" that really hurts performance) and decide to move the data to aligned addresses internally (completely transparent to the user)?

The second issue is how to detect the potential performance gain from memory alignment in a given piece of code. I assume that running VTune and seeing tons of cache misses might indicate a memory alignment issue. Is there a better or more accurate indication and solution for this issue?

thanks,
Guy.

Really nice-to-have things you ask for...
However, not all language objects are relocatable at run time (C/C++ vs. C#/Java etc.).
Second, the memory access pattern can vary from time to time during the run of the program.
While keeping some kinds and instances of objects in close proximity would behave well cache-wise, in other circumstances it would cause "false sharing" as well. I think we are in a programmers-need-to-be-cache-aware era, heading towards compilers-and-OS-need-to-be-cache-aware. Finally and hopefully, CPUs will eventually cooperate cleverly with the OS and language runtime environments to achieve cache-friendly executions. Until then, we need to hunt for papers and implementations that care about caching and alignment.

bishgada's picture
Quoting - mgorkem@hotmail.com

Really nice-to-have things you ask for...
However, not all language objects are relocatable at run time (C/C++ vs. C#/Java etc.).
Second, the memory access pattern can vary from time to time during the run of the program.
While keeping some kinds and instances of objects in close proximity would behave well cache-wise, in other circumstances it would cause "false sharing" as well. I think we are in a programmers-need-to-be-cache-aware era, heading towards compilers-and-OS-need-to-be-cache-aware. Finally and hopefully, CPUs will eventually cooperate cleverly with the OS and language runtime environments to achieve cache-friendly executions. Until then, we need to hunt for papers and implementations that care about caching and alignment.

Thank you for the answer.

What about the second item? How do I detect alignment problems (or decide which of the thousands of allocations in my code to focus on)?

thanks,
Guy.

mami's picture
Quoting - bishgada

Thank you for the answer.

What about the second item? How do I detect alignment problems (or decide which of the thousands of allocations in my code to focus on)?

thanks,
Guy.

I have not used the Intel tools yet; soon I will check them for myself. However, AQTime as an external profiler, and VC++ 2008 itself, have support for watching cache-miss counters. Actually, to my understanding, alignment of addresses is the relatively easy part (by this I mean, for instance, that a 64-bit read of, say, an int64 is quickest if the address is a multiple of 8 bytes). Cache-friendliness is a totally different issue: it simply refers to avoiding scattered reads of objects from here and there, and similarly avoiding unintended invalidation of other objects while modifying the target one. Per-line or per-function cache-miss counting can greatly help pinpoint such cases. Just start profiling :)
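
For illustration, here is a minimal sketch of padding objects apart to avoid that unintended invalidation (false sharing). This is my own example, assuming a 64-byte cache line and VC++ __declspec(align) syntax; the struct name is illustrative only:

struct __declspec(align(64)) PaddedCounter
{
	volatile long value;            // updated by one thread
	char pad[64 - sizeof(long)];    // fill out the rest of the cache line
};

// counters[0] and counters[1] now live on separate cache lines, so writes
// by different threads do not keep invalidating each other's cached copy
PaddedCounter counters[2];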

bishgada's picture
Quoting - mami

I have not used the Intel tools yet; soon I will check them for myself. However, AQTime as an external profiler, and VC++ 2008 itself, have support for watching cache-miss counters. Actually, to my understanding, alignment of addresses is the relatively easy part (by this I mean, for instance, that a 64-bit read of, say, an int64 is quickest if the address is a multiple of 8 bytes). Cache-friendliness is a totally different issue: it simply refers to avoiding scattered reads of objects from here and there, and similarly avoiding unintended invalidation of other objects while modifying the target one. Per-line or per-function cache-miss counting can greatly help pinpoint such cases. Just start profiling :)

Well, thanks, I did start profiling. I use VTune to find L1 and L2 read cache misses. Unfortunately, however, there are a few functions in which the cache misses occur, and for some obscure reason these functions don't take up a long execution time (in percentage) of the whole run.

Anyway, as a beginner on this issue, I feel I'm starting to get some grip on it, and your response was also helpful.

Thanks again,
Guy.

sahasay's picture

Guy

I believe you are lucky if the cache misses happen in functions which don't take up a lot of execution time. Do you use a map / hashmap in your programme? Try to see what kind of cache misses you get while looking up data there, and see the difference when you replace it with a cache-friendly array.

Also, try to focus on the cache hit rate, that will tell you how efficiently your program is designed.

Cheers!

Sayandeep

Roman Dementiev (Intel)'s picture

Hi Guy,

cache-line aligned allocators allocate data items on cache line boundaries to avoid possible "false sharing". They also pad the data with an additional payload to fill the full cache line (64 bytes), so they have some overhead for smaller data allocations. The code lines that cause false sharing can be detected with VTune by sampling the MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM event on Core i7 or Xeon 5500 series processors. More info on VTune and cache miss event measurement is available here.
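
For illustration, a minimal sketch of such an allocation (assuming a 64-byte cache line; _mm_malloc/_mm_free are the usual intrinsics-header primitives):

#include <xmmintrin.h>  // _mm_malloc, _mm_free

float* p = (float*)_mm_malloc(1024 * sizeof(float), 64); // starts on a cache line boundary
// ... use p ...
_mm_free(p);  // memory from _mm_malloc must be released with _mm_free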

hope it helps,
Roman

bishgada's picture
Quoting - sahasay

Guy

I believe you are lucky if the cache misses happen in functions which don't take up a lot of execution time. Do you use a map / hashmap in your programme? Try to see what kind of cache misses you get while looking up data there, and see the difference when you replace it with a cache-friendly array.

Also, try to focus on the cache hit rate, that will tell you how efficiently your program is designed.

Cheers!

Sayandeep

Actually, I do use more than a few maps; I'm using the STL map implementation. When you mention cache-friendly arrays, do you mean to try and change the std::map addresses, or to use a "cache-friendly library" which also implements maps and hashes? If the latter, do you mean TBB? Can you refer me to such libraries?

Another thing: is it possible to limit the VTune/Amplifier analysis to only some specific code areas and increase the sampling resolution in those sections?
I'm asking since, currently, when I look at the code after VTune sampling, first I see a lot of modules that I have no idea what they have to do with my program, and second, most of the code lines don't get any timer or cache-miss samples at all. How can I focus/narrow the analysis?

One last thing: are there any rules of thumb for what good cache hit rates are and what bad ones are (and the same for cache misses)? I see the numbers, but I don't know what to expect...

Thanks a lot for the guidelines!!

Guy.

bishgada's picture
Quoting - Roman Dementiev (Intel) Hi Guy,

cache-line aligned allocators allocate data items on cache line boundaries to avoid possible "false sharing". They also pad the data with an additional payload to fill the full cache line (64 bytes), so they have some overhead for smaller data allocations. The code lines that cause false sharing can be detected with VTune by sampling the MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM event on Core i7 or Xeon 5500 series processors. More info on VTune and cache miss event measurement is available here.

hope it helps,
Roman

Hi Roman,
The article is very helpful.

Thanks,
Guy.

sahasay's picture
Quoting - bishgada

Actually, I do use more than a few maps; I'm using the STL map implementation. When you mention cache-friendly arrays, do you mean to try and change the std::map addresses, or to use a "cache-friendly library" which also implements maps and hashes? If the latter, do you mean TBB? Can you refer me to such libraries?

Another thing: is it possible to limit the VTune/Amplifier analysis to only some specific code areas and increase the sampling resolution in those sections?
I'm asking since, currently, when I look at the code after VTune sampling, first I see a lot of modules that I have no idea what they have to do with my program, and second, most of the code lines don't get any timer or cache-miss samples at all. How can I focus/narrow the analysis?

One last thing: are there any rules of thumb for what good cache hit rates are and what bad ones are (and the same for cache misses)? I see the numbers, but I don't know what to expect...

Thanks a lot for the guidelines!!

Guy.

Guy, if you are using STL maps and they are using the default STL allocators, I am afraid they are not at all cache friendly. They create serious problems with induced false sharing, memory bloating, etc. All of these will make your system slow, without any definite bounds.

If you need to use maps, hashmaps, etc., use the ones from Intel TBB, or at least try to use a good memory allocator like Hoard, which will scale your system for a multi-processor environment. The array is a very good data structure for performant code, as the compiler can do prefetching, and chances are the prefetched data is hot in the cache when you try to use it. But for a map, the elements are allocated dynamically, and the compiler has no idea which data should be prefetched next. A small sketch of the difference follows below.
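
To illustrate (a sketch only, with made-up function names and element types):

#include <map>
#include <vector>

// contiguous storage: the hardware prefetcher can stream the next elements in
long long SumArray(const std::vector<int>& v)
{
	long long s = 0;
	for (size_t i = 0; i < v.size(); ++i)
		s += v[i];
	return s;
}

// node-based storage: every step chases a pointer to a separate heap allocation
long long SumMap(const std::map<int, int>& m)
{
	long long s = 0;
	for (std::map<int, int>::const_iterator it = m.begin(); it != m.end(); ++it)
		s += it->second;
	return s;
}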

Hope that helps.

Cheers!
Sayandeep

jeff_keasler's picture

Hi,


One more thing to consider -- you can get an additional performance advantage above and beyond alignment itself. The instruction mix that the compiler selects can be improved if you add the restrict keyword to pointers, along with compiler directives such as __assume_aligned(). Just because your memory allocator aligns data for you doesn't mean the compiler knows the pointer is aligned. If you do everything right, the compiler will generate packed SSE instructions as opposed to scalar SSE (or x87) instructions. Packed SSE instructions can be more efficient simply because fewer instructions in a tight loop mean the loop executes faster. Also, a more efficient form of the memory read instruction can be used when the compiler knows about alignment.
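
A small sketch of what I mean (Intel compiler syntax; the function and the loop are just illustrative):

void Scale(float* __restrict dst, const float* __restrict src, int n)
{
	__assume_aligned(dst, 16);   // promise to the compiler: dst starts on a 16-byte boundary
	__assume_aligned(src, 16);
	for (int i = 0; i < n; ++i)
		dst[i] = 2.0f * src[i];  // now eligible for packed SSE (movaps/mulps) instead of scalar code
}
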
Roman Dementiev (Intel)'s picture
Quoting - bishgada

Another thing: is it possible to limit the VTune/Amplifier analysis to only some specific code areas and increase the sampling resolution in those sections?
I'm asking since, currently, when I look at the code after VTune sampling, first I see a lot of modules that I have no idea what they have to do with my program, and second, most of the code lines don't get any timer or cache-miss samples at all. How can I focus/narrow the analysis?

One last thing: are there any rules of thumb for what good cache hit rates are and what bad ones are (and the same for cache misses)? I see the numbers, but I don't know what to expect...

VTune samples all processes running on your system; this includes your program, the operating system, and all other programs running in parallel. Therefore, in the output, you also see modules that do not belong to your program. As far as I know, the technique behind sampling does not allow you to increase the sampling rate for certain functions, since the sampling period is a global parameter (i.e. a sample is taken every x ms, or after y processor events, where x and y are global constants). To increase the resolution of sampling, you can decrease x and y (this may increase the overhead of sampling), or you can increase the period of data collection (e.g. 20 -> 60 seconds); for the latter you may need to increase the input size of your program and/or run it several times. There is a forum on VTune where you can ask further questions or find answers in previous posts.

Slide 18 ("Cache Misses") of the above-mentioned slide deck gives an idea of how to estimate the impact of the observed cache miss ratio on the running time of the program or of a particular function.
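
As a rough back-of-the-envelope (all numbers made up for illustration): 10 million L2 misses at roughly 200 cycles of memory latency each cost about 2*10^9 cycles; on a 2 GHz core that is about one second, so in a ten-second run roughly 10% of the time would be attributable to those misses.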

--
Roman

Roman Dementiev (Intel)'s picture
Quoting - sahasay
Guy, if you are using STL maps and they are using the default STL allocators, I am afraid they are not at all cache friendly. They create serious problems with induced false sharing, memory bloating, etc. All of these will make your system slow, without any definite bounds.

If you need to use maps, hashmaps, etc., use the ones from Intel TBB, or at least try to use a good memory allocator like Hoard, which will scale your system for a multi-processor environment. The array is a very good data structure for performant code, as the compiler can do prefetching, and chances are the prefetched data is hot in the cache when you try to use it. But for a map, the elements are allocated dynamically, and the compiler has no idea which data should be prefetched next.

Please note that STL maps and STL hash maps are not thread-safe (modifying the data structure from different threads can corrupt it and also crash the program). They may work in a many-thread, read-only scenario, but you might still get false-sharing performance problems for the reasons already mentioned.

You can try the TBB concurrent_hash_map if you need a scalable, thread-safe map data structure. TBB data structures use the internal TBB scalable memory allocator by default. The TBB scalable allocator can also be used separately in your custom data structures; consult the TBB manual and/or the TBB forum for details.
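
A minimal usage sketch (the key/value types and the function are just an example):

#include <tbb/concurrent_hash_map.h>
#include <tbb/scalable_allocator.h>
#include <vector>

typedef tbb::concurrent_hash_map<int, int> Table;

void Example(Table& table)
{
	Table::accessor a;    // holds a write lock on the element while it is alive
	table.insert(a, 42);  // inserts the key if it is not there yet
	a->second = 7;
}

// the scalable allocator can also be plugged into ordinary containers:
typedef std::vector<int, tbb::scalable_allocator<int> > IntVector;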

--
Roman

bishgada's picture
Sayandeep, Jeff and Roman,

Thank you for your explanations. I'd like to refer to a concrete example in my code...

In part of my code I have to extract one plane from a 4-plane interleaved image.
I copy elements from buffer to buffer.
I use a class that I wrote in order to avoid re-allocating a buffer if it already has the required size.

Here is the old code.

template <typename T>
class DataBuffer
{
	bool bNeedDelete;
public:
	void Clear()
	{
		if(data != NULL && bNeedDelete)
		{
			delete[] data;
			data = NULL;
		}
		size = 0;
		bNeedDelete = false;
	}
	DataBuffer(int nsize) : data(NULL), size(0), bNeedDelete(false)
	{
		AllocateBuffer(nsize);
	}
	DataBuffer() : data(NULL), size(0), bNeedDelete(false)
	{
	}
	~DataBuffer() // can't be virtual since it's a template
	{
		Clear();
	}
	T* GetPtr()
	{
		return data;
	}
	void AllocateBuffer(int nsize)
	{
		if(size != nsize ||  // different size 
			!bNeedDelete)	 // or the pointer don't belong to the class
		{
			Clear();
			data = new T[nsize];
			size = nsize;
		}
		bNeedDelete = true;
	}
	void CopyBuffer(const void* in, int nsize)
	{
		AllocateBuffer(nsize);
		memcpy(data, in, nsize*sizeof(T));
	}
	DataBuffer& operator=(const DataBuffer& buf)
	{
		Set(buf.data, buf.size);
		return *this;
	}
	void Set(T * bdata, int bsize)
	{
		Clear();
		data = bdata;
		size = bsize;
	}
	inline T operator[](const int& location) const
	{
		return data[location];
	}
	int size;
	T * data;
};

Then I tried to change it to cache-friendly code (as far as I understand it), and the class became:

template <typename T>
class DataBuffer
{
	bool bNeedDelete;
public:
	void Clear()
	{
		if(dataToDelete != NULL && bNeedDelete)
		{
			delete[] dataToDelete;
		}
		dataToDelete = NULL;
		data = NULL;
		size = 0;
		bNeedDelete = false;
	}
	DataBuffer(int nsize) : data(NULL), dataToDelete(NULL), size(0), bNeedDelete(false)
	{
		AllocateBuffer(nsize);
	}
	DataBuffer() : data(NULL), dataToDelete(NULL), size(0), bNeedDelete(false)
	{
	}
	~DataBuffer() // can't be virtual since it's a template
	{
		Clear();
	}
	T* GetPtr()
	{
		return data;
	}
	void AllocateBuffer(int nsize)
	{
		if(size != nsize ||  // different size 
			!bNeedDelete)	 // or the pointer don't belong to the class
		{
			Clear();
			dataToDelete = new unsigned char[sizeof(T)*nsize + 127];
			// round up to the next 128-byte boundary; size_t avoids pointer truncation on 64-bit
			data = (T*)( (((size_t)dataToDelete) + 127) & ~(size_t)127 );
			size = nsize;
		}
		bNeedDelete = true;
	}
	void CopyBuffer(const void* in, int nsize)
	{
		AllocateBuffer(nsize);
		memcpy(data, in, nsize*sizeof(T));
	}
	DataBuffer& operator=(const DataBuffer& buf)
	{
		Set(buf.data, buf.size);
		return *this;
	}
	void Set(T * bdata, int bsize)
	{
		Clear();
		data = bdata; // not necessarily aligned
		size = bsize;
	}
	inline T operator[](const int& location) const
	{
		return data[location];
	}
	int size;
	T * data; // 128-byte aligned pointer (supposed to be cache friendly)
	unsigned char * dataToDelete; // the actual pointer that was allocated
};


The attached images show the VTune results for the cache-unfriendly code and for the cache-friendly code. They show that there is almost no difference.
The percentage of load misses retired is almost the same, and the hit rate and miss rate are almost the same. The difference is that the performance impact changed from 8 to 3+, but from what I read, everything above "2" is considered a high performance impact.
I also checked the frame rate, and it changed from 60 Hz to 62 Hz; even though I ran it over thousands of frames, that isn't necessarily due only to this change.
So, what do you think?
I'm going to try those compiler directives... *crossing fingers* (have you ever tried to write code with your fingers crossed?)
Thanks again,
Guy.


Attachments:
cache-friendly.JPG (110.42 KB)
cache-unfriendly.JPG (125.57 KB)
bishgada's picture
Quoting - jeff_keasler Hi,
One more thing to consider -- you can get an additional performance advantage above and beyond alignment itself. The instruction mix that the compiler selects can be improved if you add the restrict keyword to pointers, along with compiler directives such as __assume_aligned(). Just because your memory allocator aligns data for you doesn't mean the compiler knows the pointer is aligned. If you do everything right, the compiler will generate packed SSE instructions as opposed to scalar SSE (or x87) instructions. Packed SSE instructions can be more efficient simply because fewer instructions in a tight loop mean the loop executes faster. Also, a more efficient form of the memory read instruction can be used when the compiler knows about alignment.

The restrict keyword had no influence :(

But thanks for the remarks.
Guy.

jeff_keasler's picture
Quoting - bishgada

The restrict keyword had no influence :(

But thanks for the remarks.
Guy.

In your example code, all the work occurs down in memcpy(). The suggestion of using restrict and __assume_aligned() only applies where you manipulate the individual elements of the arrays in code you write yourself.

Sorry I couldn't help for this case.

-Jeff
