We develop cache-aware iterative solvers.
This group of algorithms is mainly limited
by memory access. With 3D problems we see
an increase in data TLB misses. One possible
solution seemed to be the use of large (4M) pages.
We observed several strange effects after switching
to large pages. Bottom line: D-TLB misses and L2
data cache misses dropped, but runtime roughly
doubled, i.e. performance was cut in half.
To pin down the problem I ran two tests:
1) I have several assembler implementations of memcpy.
One uses software prefetching (a mov into a register
to touch data ahead of the copy loop). The version
with this "hand prefetching" showed exactly the same
performance with 4k and 4M pages, while the standard
version slowed down considerably with 4M pages.
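For illustration, a C sketch of the "hand prefetching" idea:
a dummy load touches a cache line some fixed distance ahead
of the copy loop, so the data is already in cache when the
loop reaches it. The function name, the prefetch distance,
and the use of a plain volatile load instead of a dedicated
prefetch instruction are assumptions, not my actual assembler
code.

```c
#include <stddef.h>
#include <stdint.h>

/* Prefetch distance in bytes (assumed value for illustration). */
#define PF_DIST 256

static void memcpy_prefetch(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t i;

    for (i = 0; i + PF_DIST < n; i += 64) {
        /* "mov into a register": a dummy load a few cache
         * lines ahead; volatile keeps the load from being
         * optimized away. */
        volatile uint8_t pf = s[i + PF_DIST];
        (void)pf;
        /* copy one 64-byte cache line */
        for (size_t j = 0; j < 64 && i + j < n; j++)
            d[i + j] = s[i + j];
    }
    /* tail: remaining bytes without prefetching */
    for (; i < n; i++)
        d[i] = s[i];
}
```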
2) I checked the vector triad with an SSE2 assembler
implementation, so in this case software prefetching
should play no role. With 4M pages I get roughly half
the performance I get with 4k pages.
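The triad kernel is a(i) = b(i) + c(i) * d(i). A C sketch
with SSE2 intrinsics, as a stand-in for the assembler
version (function name assumed; arrays must be 16-byte
aligned and n a multiple of 2 doubles):

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Vector triad: a[i] = b[i] + c[i] * d[i],
 * two doubles per SSE2 instruction. */
static void triad_sse2(double *a, const double *b,
                       const double *c, const double *d,
                       size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d vb = _mm_load_pd(&b[i]);
        __m128d vc = _mm_load_pd(&c[i]);
        __m128d vd = _mm_load_pd(&d[i]);
        _mm_store_pd(&a[i], _mm_add_pd(vb, _mm_mul_pd(vc, vd)));
    }
}
```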
I have two questions:
* Is the hardware prefetching disabled for large pages?
* Is there any issue with SSE2 instructions and large pages?
Is there any other point I didn't consider?
Thanks in advance for your help,
Just for completeness:
The codes are exactly the same for different page sizes.
We use the mmap call to allocate memory
from a hugetlbfs mount on Linux. To use it, we
override malloc and LD_PRELOAD the library that
implements this.
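The allocation path looks roughly like this sketch; the
mount point "/mnt/huge", the file name, and the function
name are assumptions for illustration, not our actual
library:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate large-page memory by mmap'ing a file on a
 * hugetlbfs mount (assumed to be at /mnt/huge). */
static void *huge_alloc(size_t len)
{
    int fd = open("/mnt/huge/buf", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;  /* no hugetlbfs mount (or no permission) */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```

The benchmark itself is then started unchanged, e.g. with
something like LD_PRELOAD=./libhugemalloc.so ./benchmark
(the library name here is made up).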
The operating system is Linux with a 2.6.5 Kernel.
As all benchmarks are written in assembler, the
compiler is not an issue. But these effects can
also be seen with C code.