Hello guys, I'm having performance problems due to DTLB misses, and I'm using the counter DTLB_LOAD_MISSES.WALK_DURATION to measure them. To reduce TLB pressure I'm passing MAP_HUGETLB as an mmap flag on Linux. I've created the pool of huge pages, and I can see in /proc/meminfo that these 2M pages are being allocated but, surprisingly, this counter increases according to the VTune analysis. I'm analysing a specific part of the code that accesses arrays allocated in this huge-page memory, and I can see a big increase in DTLB misses. This seems very strange to me. I would like to know if I am misunderstanding the behavior of this counter. Do you have any experience with MAP_HUGETLB? Any suggestions? Is it possible I'm doing something wrong?


Hi Rafael,

We will need more information to better understand the issue. What is the program doing? How big are the allocated memory and the active region? What does the vmstat output say, and what do the other VTune counters look like? What is the processor, how much memory does the machine have, and what is the OS?

Please provide some more details, and I will be happy to have a look at your issue.

Hello,

This part of the program is a loop like this (in Fortran):

    !$OMP PARALLEL DO
    do k=nmin3,nmax3
      do j=nmin2,nmax2
        do i=nmin1,nmax1
          ux(i,j,k,5) = (20.*ux(i,j,k,4) - 6.*ux(i,j,k,3) - 4.*ux(i,j,k,2) + ux(i,j,k,1) + 12.*ux(i,j,k,5)*dt2)/11.
          uy(i,j,k,5) = (20.*uy(i,j,k,4) - 6.*uy(i,j,k,3) - 4.*uy(i,j,k,2) + uy(i,j,k,1) + 12.*uy(i,j,k,5)*dt2)/11.
          uz(i,j,k,5) = (20.*uz(i,j,k,4) - 6.*uz(i,j,k,3) - 4.*uz(i,j,k,2) + uz(i,j,k,1) + 12.*uz(i,j,k,5)*dt2)/11.
        end do
      end do
    end do

They are three 640x586x536x4 arrays of floats (4 bytes each). The memory requested from mmap is exactly the size of the array, but I know mmap sometimes allocates more, because it allocates a whole number of memory pages.

The machine configuration is:
- Xeon E5-2690 2.9GHz
- 128 GB RAM
- Linux CentOS 6.0

The vmstat output is:

    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy  id wa st
     1  0      0 82434112  75888 4944924    0    0     0     2    2    3  0  0 100  0  0

I'm using the Intel Fortran Compiler (latest version) with O3 optimization. What do you mean by "other Vtune counters looking like"? Which counters do you need the values for?
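For readers following along, the kind of mmap request described above can be sketched in C roughly as follows. This is not the poster's actual code: the function names, the 2 MB page size constant, and the fallback to normal pages are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed huge page size (2 MB, as reported in the thread). */
#define HUGE_PAGE_BYTES ((size_t)2 * 1024 * 1024)

/* Round len up to the next multiple of the huge page size. */
size_t round_up_to_huge_page(size_t len)
{
    return (len + HUGE_PAGE_BYTES - 1) / HUGE_PAGE_BYTES * HUGE_PAGE_BYTES;
}

/* Try an anonymous huge-page mapping first; if the huge page pool cannot
 * satisfy it, fall back to normal pages. *used_huge is set to 1 when the
 * huge-page mapping succeeded and 0 otherwise. Returns NULL if both
 * attempts fail. (Hypothetical helper, not from the original post.) */
void *alloc_prefer_huge(size_t len, int *used_huge)
{
    assert(used_huge != NULL);
    size_t mapped = round_up_to_huge_page(len);
    void *p = MAP_FAILED;
#ifdef MAP_HUGETLB
    p = mmap(NULL, mapped, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED) {
        p = mmap(NULL, mapped, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}
```

Reporting whether the huge-page attempt succeeded makes a silent fallback to 4 kB pages visible, which is one of the things worth ruling out before interpreting the counter.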

The code is set up with unit strides on all accesses, so the page tables should be used efficiently.
Of course the arrays are huge, so you will be rolling the TLBs, but you should only get one TLB miss for every 4kB of memory traffic, which should not be a performance issue.
How big was DTLB_LOAD_MISSES.WALK_DURATION compared to the total execution time?
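To put rough numbers on the page-count argument, here is a small sketch of the arithmetic in C. It assumes the array shape quoted earlier in the thread (640x586x536x4 four-byte elements) and a purely sequential sweep; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages of size page_bytes needed to cover total_bytes,
 * rounded up to a whole page. */
uint64_t pages_spanned(uint64_t total_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return (total_bytes + page_bytes - 1) / page_bytes;
}

/* Bytes in one array as described in the thread:
 * 640 x 586 x 536 x 4 elements of 4 bytes each. */
uint64_t array_bytes(void)
{
    return 640ULL * 586ULL * 536ULL * 4ULL * 4ULL;
}
```

With these inputs one array is 3,216,343,040 bytes, which spans 785,240 pages of 4 kB but only 1,534 pages of 2 MB. A unit-stride sweep therefore needs about 512 times fewer page translations with 2 MB pages, so a rise in walk duration is not what the access pattern alone would predict.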

The only other way to get significant TLB misses is via DTLB address conflicts. If I am reading the CPUID information correctly on my Xeon E5 systems, the level 0 DTLB is 4-way set associative for either 4kB pages or 2MB pages. The loop nest above will access at least 15 pages at a time (one for each of the last indices of each of the three arrays), so it is at least possible that there is a TLB addressing conflict.

The easiest way to check this is to simply split the loop nest into three separate loop nests -- one for ux, one for uy, one for uz. This will reduce the number of pointers to ~5 per loop, which should eliminate any conflicts.
If there is a systematic TLB conflict, this should improve the execution time as well as the TLB miss counts.
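The split can be sketched as below. The sketch is in C rather than Fortran, drops the OpenMP directive, and assumes each field is stored as five consecutive time levels of n points each, which mirrors Fortran's column-major layout with the time level as the last index. The function names are hypothetical; in the Fortran code the same effect comes from writing three separate loop nests.

```c
#include <assert.h>
#include <stddef.h>

/* Update the newest time level of one field from the four older levels.
 * u holds 5 time levels of n points each; level t (1-based) starts at
 * u + (t-1)*n. Only 5 address streams are live inside this loop. */
void update_field(float *u, size_t n, float dt2)
{
    assert(u != NULL);
    const float *u1 = u;
    const float *u2 = u + n;
    const float *u3 = u + 2 * n;
    const float *u4 = u + 3 * n;
    float *u5 = u + 4 * n;
    for (size_t i = 0; i < n; ++i) {
        u5[i] = (20.0f * u4[i] - 6.0f * u3[i] - 4.0f * u2[i] + u1[i]
                 + 12.0f * u5[i] * dt2) / 11.0f;
    }
}

/* Split version: one full pass per field instead of one fused pass that
 * touches all 15 streams (3 fields x 5 time levels) at once. */
void update_split(float *ux, float *uy, float *uz, size_t n, float dt2)
{
    update_field(ux, n, dt2);
    update_field(uy, n, dt2);
    update_field(uz, n, dt2);
}
```

The three fields are independent of each other in this update, so the split produces the same values as the fused loop. Only the number of pages touched concurrently changes.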

Note that the array sizes don't appear to be "bad" for conflicts -- 640x586x536 for the first three dimensions does not end up close to an even multiple of 2MB, but the code fragment does not show the relative alignment of the three arrays, so it is not possible to rule out conflicts.
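The size check can be reproduced with a few lines of C. This is only a sketch of the arithmetic: it assumes 4-byte elements and 2 MB pages as stated in the thread, and the function names are made up for illustration. It stops at page numbers and offsets, because the thread does not say how the hardware maps page numbers to TLB sets.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_BYTES (2ULL * 1024 * 1024)

/* Byte distance between consecutive values of the last array index:
 * one 640 x 586 x 536 slice of 4-byte elements. */
uint64_t slice_stride_bytes(void)
{
    return 640ULL * 586ULL * 536ULL * 4ULL;
}

/* Index of the page of size page_bytes that contains byte address addr. */
uint64_t page_number(uint64_t addr, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return addr / page_bytes;
}

/* Offset of byte address addr inside its page of size page_bytes. */
uint64_t offset_in_page(uint64_t addr, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return addr % page_bytes;
}
```

The slice stride is 804,085,760 bytes, that is 383 full 2 MB pages plus 876,544 bytes, or roughly 0.42 of a page past a 2 MB boundary. Applying offset_in_page to the base addresses returned by mmap for ux, uy and uz (cast through uintptr_t) would show the relative alignment that the code fragment leaves out.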

John D. McCalpin, PhD "Dr. Bandwidth"
