suspect false sharing and LLC cache misses

suspect false sharing and LLC cache misses

Hello,

I have spotted a memory related performance issue with Vtune in my multi-threaded application compiled with ICC (and optimization switched on) running on a Xeon E5 processor. In particular, such performance degradation only occurs when enabling a feature in my application that makes the threads concurrently execute many simple operations on a dynamically allocated array of the following data structure:

struct bitstream {
int32_t count;
uint8_t *bytes;
};

Vtune reveals that the functions performing the operations on such data structure have unexpectedly high MEM_LOAD_UOPS_RETIRED.LLC_MISS and the whole application a high number of "LLC loads misses serviced by remote DRAM".

At first sight, it looks like a false sharing given that all array elements would be stored in the same cache line and that every thread modification to an array element would cause all the other threads caches to be invalidated and reloaded. Exactly as fully explained here: http://software.intel.com/en-us/articles/avoiding-and-identifying-false-...

Following the guide advice, I allocated the array so that each element is aligned to the cache line size (64 bytes). However, that doesn't seem to sort the issue.

I was wondering if anyone could suggest how to further dig into this issue.

Thanks

18 posts / novo 0
Último post
Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.

Does your application works on AoS?

Maybe  your code has gather operations when accessing AoS it could prevent efficient vectorization of the code.

Well, I guess the basic question is, did your excessive last level cache misses go away?  You showed a data structure but not how you used it or how you modified it.  Were the excessive cache misses on the data in the structure you showed, or in the array it points to?  What are the patterns of data access that the threads employ on this data structure?  

I have actually noticed that the false sharing problem was sorted by forcing the data structure in itsown cacheline as suggested in the guide about false sharing. The metric I should have looked at in the General Exploration Analysis is the number of contested accesses.

The performance bottleneck highlitghted by Vtune is now the L2 cache replacement in the General Exploration.

The function that operated on that data struct is absolutely trivial. It just adds one or more bytes to the dinamically allocated area of memory pointed by data and increases count accordingly. As you can imagine this very simple operation is repeated many times in a short interval by the same thread after some heavy computations that produces the bytes. The whole sequence (heavy computation, bitstream operations) are repeated several times.

to try to indentify the issue I have commented out the array data, so that the function that handle the data structure only counts bytes. The results is that the L2 cache replacement still are high and all related to a trivial line of code such as:

bitstream->count += n;

Thanks for your help, by the way.

As I understand your code is accessing count element of Array of bitstream structures.L2 cache is shared between logical CPUs and If I am not wrong you have more than two threads to operate on the data.It could be quite possible that OS scheduler scattered your threads on various cores so you can still have false sharing issues.

Each thread is bound to a specific physical core. Also, when such specific feature is not enabled, the application has pretty good performance. The access to that specific data structure somehow is the bottleneck.

I am slightly confused by your answer.Do you mean that when the CPU affinity is disabled application has a better performance in comparison to affinity being enabled?

Sorry, I didn't explain well. The application performs fairly well, but when I switch on a feature called bitstream generation. The bitstream generation encode some data and then calls the low-level routine that adds one or more bytes of encoded data at a time to the data structure bitstream (defined above). The low-level function is as simple as :

1. void put_bytes(struct bitstream *stream, uint8_t *data, uint32_t n){
2. for (i = 0; i < n; i++){
3. stream->bytes[stream->count] = data[i];
4. stream->count++;
5. }
6. }

Even commenting out line 3 (so that it only counts the bytes) doesn't achieve good performance, which sounds very strange. As I said before, Vtune reports a high L2 cache replacement percentage on line 4.

I reckon the false sharing was sorted by adding __attribute__((aligned(64))) to the data structure bitstream as the number of contested accesses for put_bytes() are zero.

I still suspect an AOS gather accesses as a culprit.Does put_bytes function operate on Array of structures?

Yes, each thread operates on one or more elements of a dynamic allocated array of bitstream structs. Being n the number of threads, thread 0 works on bitstream elements 0, n, 2n, etc. thread 1 on bistream elements 1, n+1, 2n+1, etc. , thread 2 on bitstream elements 2, n+2, 2n+2, etc. However, the code that calls the bits_put is inherently sequential as well as put_bits. Can you explain why that would matter?

Are multiple threads updating your stream counts?  That in itself would explain a high L2 miss rate and might also suggest the possiblity of races in your count update code, unless that put_bytes function is protected with a critical section, which would further serialize and lower performance on the sample code.

And your "dynamic allocated array(s)": are they being allocated dynamically as an initialization, then remain at that fixed size and location for the duration of your computation?  Or are they being grown in response to multiple put_byte requests?  Are those buffers being reallocated?  More possibilities for slow code.

If you've got n threads, each touching a bitstream element (byte?) at offsets that are mutlipliers of the number of threads, this is not a false shareing condition; this is a real cache conflict situation, as those cache lines are copied back to memory and thrashing through cache to satisfy such a chaotic access pattern.  Thread 0 comes in, writes byte 0, forcing the read for ownership and write-back of the cache line owner, whichever thread was the last to write a byte in the first 64 of this bitstream, a situation that could be rife with contention and "L2 cache replacement."

How much do you know about races, locks, and programming in a multi-thread environment?

I didn't explain well (again), sorry about that.

At the initialization stage, an array of M bitstream data structures is dynamically allocated.

struct bitstream *bitstreams = malloc_aligned( M * sizeof(struct bitstream __attribute__((aligned(64)))) );

malloc_aligned(size){
    void *p=NULL;
    if (0 == posix_memalign(&p, 64, size))
        return p;
    else return NULL;
}

(the 64-byte alignment is meant to avoid false sharing, being the cache line 64 byte wide).

All (N) threads operate on such array of bitstreams. Each thread has itsown set of bitstreams that it operates on. Thread 0 operates on the bitstreams set {0, n, 2n}, thread 1 on {1, n+1, 2n+1} etc. Each bitstream in the array is only handled by one thread. Therefore there is no race condition in accessing the bitstreams.

(for M = 8 and N = 3)
bitstreams   0   1   2   3   4   5   6   7
threads        0   1   2   0   1   2   0   1

The area pointed to by each bitstream struct is allocated only at the initialization stage and not moved/reallocated during the operations.

The core is fully functionally working, but not performing well due something that has to do with the bistream struct and/or put_bytes function.

Wow, i've never seen a C syntax like that before, "sizeof(struct bitstream __attribute__((aligned(64))))"  Does that actually work?  Do you actually get an index or pointer into an array of "__attribute__((aligned(64)))" bitstream structures?  Or do you get pointer arithmetic assuming the nonpadded bitstream structure size, with some extra space at the end?  The resulting cache behavior you observe would be consistent with the latter.

Quote:

iliyapolak wrote:

I still suspect an AOS gather accesses as a culprit.Does put_bytes function operate on Array of structures?

Yes, it operates on cacheline-aligned array of structures.

struct coder {
    struct bitstream bits;
    struct cabac
} __attribute__((aligned(64),packed));

coder = mm_malloc(M * sizeof(struct coder));

As I was saying in some previous posts, the threads acccess such AOS with the following pattern:
thread 0: 0, N, 2N...
thread 1: 1, N+1, 2N+1, ...
thread 2: 2, N+2, 2N+2, ...
...
where N=8, M=45.

Why should that be a problem?

Thanks.

[corrected version]

Yes, it operates on cacheline-aligned array of structures.

struct coder_t {
    struct bitstream_t bitstream;
    struct cabac_t cabac;
} __attribute__((aligned(64),packed));

struct coder_t *coder = mm_malloc(M * sizeof(struct coder), 64);

(64B being the cacheline size)

As I was saying in some previous posts, the threads acccess such AOS with the following pattern:
thread 0: 0, N, 2N...
thread 1: 1, N+1, 2N+1, ...
thread 2: 2, N+2, 2N+2, ...
...
where N=8, M=45.

Why should that be a problem?

Thanks.

Let me just specify that the threads access an array element for a big chunk of computation before moving to the next. In other words, the threads are not jumping across the array all the time.

It boils down to better packing a data which exhibits a cache  locality in the case of SoA(Structure of Arrays) and also there is also an improvement in case of vectorization of SoA when compared to AoS.So I think that it is better to have data in close proximity than access a less data more scattered .I do not know what impact could have AoS on your performance.Data array has linear and predictable indexing so it could be hardware prefetched maybe after few firs iterations or after two consecutive misses.

Deixar um comentário

Faça login para adicionar um comentário. Não é membro? Inscreva-se hoje mesmo!