DTLB_LOAD_MISSES.WALK_DURATION

Rafael Silva

Hello guys, I'm having performance problems due to DTLB misses, and I'm using the counter DTLB_LOAD_MISSES.WALK_DURATION to measure them. To reduce TLB pressure I'm passing MAP_HUGETLB as an mmap flag on Linux. I've created the pool of huge pages, and I can see in /proc/meminfo that these 2 MB pages are being allocated but, surprisingly, the counter increases according to the VTune analysis. I'm analysing a specific part of the code that accesses arrays allocated in this huge-page memory, and I see a big increase in DTLB misses. This seems very strange to me. I would like to know if I am misunderstanding the behavior of this counter. Do you have any experience with MAP_HUGETLB? Any suggestions? Is it possible I'm doing something wrong?
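
For readers who want to reproduce the setup described above, here is a minimal sketch in C of an anonymous mapping that asks for huge pages. It is not the poster's code: the helper name alloc_prefer_huge and the fallback to ordinary pages are assumptions added for illustration. Only mmap and the MAP_HUGETLB flag come from Linux itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper (not from the thread): try to map `bytes` of
 * anonymous memory backed by huge pages; if the kernel refuses (for
 * example because the huge page pool is empty), fall back to ordinary
 * pages. *got_huge is set to 1 only when the MAP_HUGETLB mapping
 * succeeded. Returns NULL if both attempts fail. */
static void *alloc_prefer_huge(size_t bytes, int *got_huge)
{
    assert(bytes > 0 && got_huge != NULL);

    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *got_huge = (p != MAP_FAILED);

    if (p == MAP_FAILED) {
        /* Fallback is an assumption of this sketch, not the poster's setup. */
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
    }
    return p;
}
```

Checking got_huge matters because a silent fallback to 4 kB pages would look like "huge pages do not help" in the profiler. Comparing HugePages_Free and HugePages_Rsvd in /proc/meminfo before and after the call is one more way to confirm that the mapping really came from the pool.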

Hussam Mousa (Intel)

Hi Rafael,

We will need more information to better understand the issue. What is the program doing? How big are the allocated memory and the active region? What does the vmstat output say? What do the other VTune counters look like? What is the processor, how much memory does the machine have, what is the OS, etc.?

Please provide some more details, and I will be happy to have a look at your issue.
Regards,
Hussam

Rafael Silva

Hello,

This part of the program is a loop like this (in Fortran):

    !$OMP PARALLEL DO
    do k=nmin3,nmax3
      do j=nmin2,nmax2
        do i=nmin1,nmax1
          ux(i,j,k,5) = (20.*ux(i,j,k,4) - 6.*ux(i,j,k,3) - 4.*ux(i,j,k,2) &
                         + ux(i,j,k,1) + 12.*ux(i,j,k,5)*dt2)/11.
          uy(i,j,k,5) = (20.*uy(i,j,k,4) - 6.*uy(i,j,k,3) - 4.*uy(i,j,k,2) &
                         + uy(i,j,k,1) + 12.*uy(i,j,k,5)*dt2)/11.
          uz(i,j,k,5) = (20.*uz(i,j,k,4) - 6.*uz(i,j,k,3) - 4.*uz(i,j,k,2) &
                         + uz(i,j,k,1) + 12.*uz(i,j,k,5)*dt2)/11.
        end do
      end do
    end do

These are three 640x586x536x4 arrays of floats (4 bytes each). The memory requested from mmap is exactly the size of the array, but I know mmap sometimes allocates more, because it allocates a whole number of memory pages.

The machine configuration is:
- Xeon E5-2690 2.9GHz
- 128 GB RAM
- Linux CentOS 6.0

The vmstat output is:

    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy  id wa st
     1  0      0 82434112  75888 4944924    0    0     0     2    2    3  0  0 100  0  0

I'm using the Intel Fortran Compiler (latest version) with O3 optimization.

What do you mean by "other VTune counters looking like"? Which counters do you need the values of?
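
The remark about mmap allocating a whole number of pages can be made concrete with a little arithmetic. The sketch below, in C, computes the footprint of one array from the dimensions quoted above and rounds it up to a multiple of the huge page size. The helper names and the 2 MB constant are assumptions for illustration. The rounding is worth doing explicitly because the mmap(2) man page requires the length passed to munmap to be a multiple of the huge page size for huge-page mappings.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 MB huge pages, as reported in the thread. */
#define HUGE_PAGE_BYTES (2u * 1024u * 1024u)

/* Size in bytes of a 4-D array with the given extents and element size. */
static uint64_t array_bytes(uint64_t n1, uint64_t n2, uint64_t n3,
                            uint64_t n4, uint64_t elem_bytes)
{
    return n1 * n2 * n3 * n4 * elem_bytes;
}

/* Round `bytes` up to the next multiple of `page` (page must be nonzero). */
static uint64_t round_up_to_page(uint64_t bytes, uint64_t page)
{
    assert(page != 0);
    return (bytes + page - 1) / page * page;
}
```

For 640x586x536x4 elements of 4 bytes, array_bytes gives 3,216,343,040 bytes (roughly 3 GiB), and round_up_to_page brings that to 1534 pages of 2 MB. Three such arrays therefore need a pool of at least 4602 huge pages under these assumptions.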

John D. McCalpin

The code is set up with unit strides on all accesses, so the page tables should be used efficiently.
Of course the arrays are huge, so you will be rolling the TLBs, but you should only get one TLB miss for every 4kB of memory traffic, which should not be a performance issue.
How big was DTLB_LOAD_MISSES.WALK_DURATION compared to the total execution time?
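
One way to answer that question is to divide the walk-duration count by a total cycle count taken over the same region. The C sketch below assumes that both numbers are in core clock cycles and that the total comes from an unhalted-cycles event such as CPU_CLK_UNHALTED.THREAD. The function name is hypothetical and no measured values from the thread are implied.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of cycles spent in page walks for loads.
 * Assumption: both arguments are cycle counts collected over the same
 * code region and on the same threads. Returns 0 when there is no
 * cycle count to compare against. */
static double walk_fraction(uint64_t walk_cycles, uint64_t total_cycles)
{
    if (total_cycles == 0)
        return 0.0;
    return (double)walk_cycles / (double)total_cycles;
}
```

A result of a few percent or less would suggest that the page walks, although more numerous, are not where the time goes.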

The only other way to get significant TLB misses is via DTLB address conflicts. If I am reading the CPUID information correctly on my Xeon E5 systems, the level 0 DTLB is 4-way set associative for either 4kB pages or 2MB pages. The loop nest above will access at least 15 pages at a time (one for each of the last indices of each of the three arrays), so it is at least possible that there is a TLB addressing conflict.

The easiest way to check this is to simply split the loop nest into three separate loop nests -- one for ux, one for uy, one for uz. This will reduce the number of pointers to ~5 per loop, which should eliminate any conflicts.
If there is a systematic TLB conflict, this should improve the execution time as well as the TLB miss counts.
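
The split can be sketched as follows. This is a C restatement of the Fortran loop, not the original code. It assumes each field is stored as five contiguous slices of n floats (matching the last index 1..5), it ignores the nmin/nmax sub-ranges and the OpenMP directive, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Update the fifth slice of one field from its other four slices.
 * Slice s (0-based) starts at u + s*n. Only 5 address streams are live. */
static void update_field(float *u, size_t n, float dt2)
{
    const float *u1 = u, *u2 = u + n, *u3 = u + 2 * n, *u4 = u + 3 * n;
    float *u5 = u + 4 * n;
    for (size_t i = 0; i < n; ++i)
        u5[i] = (20.f * u4[i] - 6.f * u3[i] - 4.f * u2[i] + u1[i]
                 + 12.f * u5[i] * dt2) / 11.f;
}

/* Fused form, as in the original loop nest: 15 address streams at once. */
static void update_fused(float *ux, float *uy, float *uz, size_t n, float dt2)
{
    for (size_t i = 0; i < n; ++i) {
        ux[4*n+i] = (20.f*ux[3*n+i] - 6.f*ux[2*n+i] - 4.f*ux[n+i] + ux[i]
                     + 12.f*ux[4*n+i]*dt2) / 11.f;
        uy[4*n+i] = (20.f*uy[3*n+i] - 6.f*uy[2*n+i] - 4.f*uy[n+i] + uy[i]
                     + 12.f*uy[4*n+i]*dt2) / 11.f;
        uz[4*n+i] = (20.f*uz[3*n+i] - 6.f*uz[2*n+i] - 4.f*uz[n+i] + uz[i]
                     + 12.f*uz[4*n+i]*dt2) / 11.f;
    }
}

/* Split form: three passes, each touching one field only. */
static void update_split(float *ux, float *uy, float *uz, size_t n, float dt2)
{
    update_field(ux, n, dt2);
    update_field(uy, n, dt2);
    update_field(uz, n, dt2);
}
```

Because the three fields are independent, both forms compute the same values, so any change in run time or in the walk counters between them can be attributed to the access pattern.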

Note that the array sizes don't appear to be "bad" for conflicts -- 640x586x536 for the first three dimensions does not end up close to an even multiple of 2MB, but the code fragment does not show the relative alignment of the three arrays, so it is not possible to rule out conflicts.
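
Given the actual base addresses of the three arrays, the conflict question can be checked with a small model. The C sketch below assumes the TLB set is simply the virtual page number modulo the number of sets. Real hardware may index differently, so treat the result as a hint only. The function names are hypothetical, and the choice of 8 sets in the usage note (a 32-entry, 4-way structure for 2 MB pages) is an assumption, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_SETS = 64 };

/* Write the start address of every last-index slice of every array into
 * addr[] and return how many addresses were written (arrays * slices). */
static size_t stream_addresses(const uint64_t *base, size_t arrays,
                               uint64_t slice_bytes, size_t slices,
                               uint64_t *addr)
{
    size_t n = 0;
    for (size_t a = 0; a < arrays; ++a)
        for (size_t s = 0; s < slices; ++s)
            addr[n++] = base[a] + (uint64_t)s * slice_bytes;
    return n;
}

/* Largest number of streams whose page lands in the same set, under the
 * assumed mapping set = (address / page_bytes) % nsets. */
static unsigned max_set_occupancy(const uint64_t *addr, size_t count,
                                  uint64_t page_bytes, unsigned nsets)
{
    unsigned hits[MAX_SETS] = {0};
    unsigned worst = 0;
    assert(nsets > 0 && nsets <= MAX_SETS && page_bytes > 0);
    for (size_t i = 0; i < count; ++i) {
        unsigned set = (unsigned)((addr[i] / page_bytes) % nsets);
        if (++hits[set] > worst)
            worst = hits[set];
    }
    return worst;
}
```

For the thread's arrays, slice_bytes would be 640*586*536*4 = 804,085,760, which is about 383.4 pages of 2 MB. Calling stream_addresses with the three real base addresses and 5 slices, then max_set_occupancy with nsets = 8, shows whether more than 4 streams share a set. In this model that is the point at which a 4-way TLB can no longer hold them all at once.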

John D. McCalpin, PhD "Dr. Bandwidth"
