Odd cache results

Hi all, I'm trying to maximise the use of the cache in my matrix codes, so I'm testing some of them with both IPC (Intel PCM) and PAPI. The problem is that the results obtained are very different. I'm measuring the L2 and L3 hit ratios with 4 programs:

             PAPI                      IPC
             L2          L3            L2            L3
matrix0      0.000257    0.000394      0.0108473     0.0274595
matrix1      0.001641    0.590435      0.00420045    0.0081431
matrix2      0.001943    0.641179      0.00416087    0.00807843
matrix3      0.001849    0.942466      0.00388092    0.0484803

The L3 results are especially significant. With PAPI I obtained an L3 hit ratio of 60~90%, but when measured with IPC I obtained 0~4%. The routines measured are the same, so I don't understand the results. Is IPC measuring wrong?

For example, the code for matrix1 (accessing the matrix by columns):

With PAPI:

[bash]/**
 * Execution of the normal matrix code, multithreaded with OpenMP using
 * the OpenMP for loop. Events measured with PAPI.
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "timer.h"

#include "papi.h"

#define NUM_EVENTS 5

int Events[NUM_EVENTS] = {PAPI_TOT_CYC, PAPI_L2_TCA, PAPI_L2_TCM, PAPI_L3_TCA, PAPI_L3_TCM};
long long values[NUM_EVENTS];
long long start_usec, end_usec, start_v_usec, end_v_usec, start_cycles, end_cycles;
int EventSet = PAPI_NULL;
int num_counters;

const PAPI_hw_info_t *hwinfo = NULL;

int main(int argc, char* argv[])
{
    int n;

    if ((n = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT) {
        printf("\n PAPI version (%d) differs from %d \n", n, PAPI_VER_CURRENT);
    }

    /* Get the hardware information */
    if ((hwinfo = PAPI_get_hardware_info()) == NULL) {
        printf("\n PAPI: Error, PAPI_get_hardware_info returned NULL\n");
    }
    else {
        printf("\n%d CPU at %f Mhz.\n", hwinfo->totalcpus, hwinfo->mhz);
    }

    // CODE TO MEASURE

    #define nth 4
    #define F 17000
    #define C 17000
    #define VAR double

    int i, j;

    // Select the number of execution threads
    omp_set_num_threads(nth);

    // Allocate the matrix
    VAR** m;
    m = (VAR**)malloc(F*sizeof(VAR*));
    for(i = 0; i < F; i++)
        m[i] = (VAR*)malloc(C*sizeof(VAR));
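    /*
     * The rest of the original listing was lost in the forum formatting.
     * What follows is a sketch of how the measured section could look with
     * PAPI's low-level event-set API; it is a reconstruction under
     * assumptions, not the exact original code (column-wise loop as quoted
     * later in the thread, hit ratio computed as 1 - misses/accesses).
     */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK ||
        PAPI_add_events(EventSet, Events, NUM_EVENTS) != PAPI_OK ||
        PAPI_start(EventSet) != PAPI_OK) {
        printf("PAPI event set setup failed\n");
        return 1;
    }

    // Column-wise access: the row index j varies fastest
    #pragma omp parallel for shared(m) private(i,j)
    for (i = 0; i < F; i++)
        for (j = 0; j < C; j++)
            m[j][i] = (VAR)sqrt(m[j][i]);

    // Note: a PAPI event set counts events for the thread that started it
    if (PAPI_stop(EventSet, values) != PAPI_OK) {
        printf("PAPI_stop failed\n");
        return 1;
    }

    printf("L2 hit ratio: %f\n", 1.0 - (double)values[2] / (double)values[1]);
    printf("L3 hit ratio: %f\n", 1.0 - (double)values[4] / (double)values[3]);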
    return 0;
}[/bash]

With IPC:

[bash]#include "cpucounters.h"

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <omp.h>

#define nth 4
#define F 17000
#define C 17000

using namespace std;

int main(){
    cout << "Testing Intel PCM" << endl;

    PCM * ipc = PCM::getInstance();
    if(ipc->program() != PCM::Success){
        printf("Error Code: %d\n", ipc->program());
        return -1;
    }

    // Begin of custom code

    int i, j;

    // Allocate the matrix
    double** m;
    m = (double**)malloc(F*sizeof(double*));
    for(i = 0; i < F; i++)
        m[i] = (double*)malloc(C*sizeof(double));
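    /*
     * The measured section of the original listing was lost in the forum
     * formatting. This is a sketch of how it could look with the PCM API
     * (system counter state before/after the loop, hit ratios from the PCM
     * helper functions getL2CacheHitRatio / getL3CacheHitRatio); it is a
     * reconstruction under assumptions, not the exact original code.
     */
    omp_set_num_threads(nth);

    SystemCounterState before_sstate = getSystemCounterState();

    // Column-wise access, as in the PAPI version
    #pragma omp parallel for shared(m) private(i,j)
    for(i = 0; i < F; i++)
        for(j = 0; j < C; j++)
            m[j][i] = sqrt(m[j][i]);

    SystemCounterState after_sstate = getSystemCounterState();

    cout << "L2 cache hit ratio: " << getL2CacheHitRatio(before_sstate, after_sstate) << endl;
    cout << "L3 cache hit ratio: " << getL3CacheHitRatio(before_sstate, after_sstate) << endl;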
    ipc->cleanup();

    return 0;
}[/bash]

Am I measuring wrong? The results I obtain with PAPI make more sense to me (at least for L3). Thanks in advance.

Roman Dementiev (Intel):

korso, what are the sizes of your matrices (0, 1, 2, 3) and what hardware configuration are you running (number of sockets, processor type, etc.)?

Thanks,
Roman

Hi Roman,

The matrices are 17000x17000 in all codes, C double type. That size was chosen so as not to reach the RAM limit (and to avoid using virtual memory).

My processor is an i7 CPU 860 @ 2.80GHz. It has a 3-level cache:

L1 -> C=64; L=8; W=64 -> 32K instructions, 32K data (per core)
L2 -> C=512; L=8; W=64 -> 256K (per core)
L3 -> C=8192; L=16; W=64 -> 8192K (unified)

Sockets -> 1
Cores -> 4
RAM -> 4GB

If you need any other information, just ask for it. Thanks.

Roman Dementiev (Intel):

Korso, do you know how PAPI maps its "virtual" PAPI_L3_TCA and PAPI_L3_TCM events to real hardware events, and what those are?

A 17K x 17K x 8-byte matrix implies a data size >= 2 GByte, and the L3 cache size is only 8 MByte. Your access pattern (by column, increasing j index) is not sequential. Why do you expect an L3 hit rate > 60%?

for(i=0; i<F; i++){
    for(j=0; j<C; j++){
        m[j][i] = (VAR)sqrt(m[j][i]);
    }
}

Thanks,
Roman

Hi Roman,

Well, each program is different, and the code I posted is the worst case scenario. Let me explain a bit.

The matrix0 code is a sequential access to the matrix; the code measured is:

[bash]//Begin of measures
SystemCounterState before_sstate = getSystemCounterState();

/**
 * Execution of the bad matrix with parallel for and simple padding.
 */

#pragma omp parallel for shared(m) private(i,j)
for(i=0; i<F; i++){
    for(j=0; j<C; j++){
        m[i][j] = (VAR)sqrt(m[i][j]);
    }
}
// ... (rest of the measured section lost in the forum formatting)
[/bash]

The matrix1 code is a non-sequential access to the matrix, and the code is the same I posted before:

[bash]//Begin of measures
SystemCounterState before_sstate = getSystemCounterState();

/**
 * Execution of the bad matrix with parallel for and simple padding.
 */

#pragma omp parallel for shared(m) private(i,j)
for(i=0; i<F; i++){
    for(j=0; j<C; j++){
        m[j][i] = (VAR)sqrt(m[j][i]);
    }
}
// ... (rest of the measured section lost in the forum formatting)
[/bash]

The matrix3 code is an experimental method that uses array padding to access the matrices by columns, so the cache only needs to store a single column of the matrix (the L2 line W is 64 bytes/block, so, since a double value is 8 bytes long, an access to m[0][0] will produce a cache miss and store the block of cells m[0][0] to m[0][7] into L3).

My algorithm guarantees that the cache capacity is used to the maximum, so if the cache can store num_threads*num_files*8 matrix cells, the hit ratio should be nearly the same as with a sequential access. But even if my algorithm were bad, matrix0 is a sequential access, so I expect a higher hit ratio in both caches. In fact, I use these large matrices so I can obtain larger differences between the worst case scenario and my algorithm.

About the PAPI question: I use PAPI_L3_TCA and PAPI_L3_TCM for total cache accesses and misses, and I obtain the hit ratio as hit_ratio = 1 - (misses/accesses). Both events are available and native on my processor, but I don't know any more details.

I could be using PAPI badly, but the L3 results make more sense to me (I can't understand the low L2 hit ratio, and that was the main reason for me to change from PAPI to IPC).

Thanks
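As a quick sanity check of the cache-line arithmetic above, here is a small stand-alone sketch, assuming the 64-byte lines, 8-byte doubles and 17000x17000 size discussed in this thread:

[bash]#include <cstdio>

int main() {
    const long F = 17000, C = 17000;            // matrix dimensions from the thread
    const long line_bytes = 64, elem_bytes = 8; // cache line and double size
    const long elems_per_line = line_bytes / elem_bytes;   // 8 doubles per line

    // Distinct cache lines touched when traversing one row vs. one column
    printf("one row    : %ld cache lines\n", C / elems_per_line);  // 2125
    printf("one column : %ld cache lines\n", F);                   // 17000, one per element

    // Whole-matrix footprint vs. the 8 MB L3 of the i7 860
    printf("matrix size: %.2f GB\n", (double)F * C * elem_bytes / 1e9);  // ~2.31 GB
    return 0;
}[/bash]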

Replying to reupload the post...

Roman Dementiev (Intel):

Hi korso,

Do you know how the PAPI_L3_TCA and PAPI_L3_TCM generic PAPI events are mapped to the low-level Intel event (names)? As far as I understand, the PAPI mappings could depend on the PAPI version and also on the underlying CPU architecture. Is there a utility in PAPI that can output such a mapping on your particular system? Or any documentation?

It would also be useful to see and compare the absolute counts of L2/L3 cache hits and misses in PCM and PAPI. Could you post them here?

Thank you,
Roman

Hi korso,

Assuming your OS is Linux, you can use the "perf" utility as a third method to check, without having to modify the source code. Simply run, on the command line:

> sudo perf stat -e rXXXX,rYYYY,rZZZZ,... ./ ...

where rXXXX etc. are the hex codes formed from the Umask and EventCode of the relevant cache events. The Intel programming guide (Volume 3B), Chapters 18 and 19 on performance counters, gives the event codes for your processor (Core i7, Nehalem).
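For example, counting last-level-cache references and misses with the architectural LLC events (EventCode 0x2E with Umask 0x4F for references and 0x41 for misses; "./matrix1" here just stands in for whatever your binary is called):

> sudo perf stat -e r4f2e,r412e ./matrix1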

Or, have you already done it?

Sanath
