Question about unexpected performance when using icc for TSVC benchmark


susangao

Hi all,

I got the TSVC benchmark from http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz. It is used to evaluate how well compilers vectorize. I am trying to test icc with it, so I changed the makefile to use icc and I am looking at the timings.

The code is simple: there are 151 functions, each with a typical loop and its own initialization. Each one prints its timing and a correctness-check result. main() calls these 151 functions.

I found something confusing with function s162(). If I comment out all the function calls after s162() in main(), its timing is 2.x sec on an E5-2680; if I leave them in and use the original code, its timing is 4 sec. This is repeatable, not random. It doesn't matter whether I keep or comment out the function calls before s162(). That is what confuses me: only the part after s162() matters. I compared the .s output for s162() in both cases, and there seems to be no difference in the instructions. (The correctness-check results are the same in both cases.)

The platform I am using is:
Chip: Xeon E5-2680
OS: GNU/Linux
ICC: icc (ICC) 12.1.3 20120212

This also happens when I run it on a Xeon 5660, where the timings are 3.x and 6.x. The platform info is:
Chip: Xeon 5660
OS: GNU/Linux
ICC: icc (ICC) 12.1.2 20111128

Attached are the original package and the makefiles I used for the two platforms above. I would appreciate your advice on what the reason is. It may look like a silly question, but I hope to learn from you and figure it out.

Thank you for reading.

Best Regards,
Susan

Attachment: TSVC-s162.tar.gz (13.53 KB)
Tim Prince

This benchmark was not intended to be compiled with the driver and the test function in a single source file. Even 22 years ago (when the original was published), compilers could short-cut artificial benchmarks by inter-procedural optimization.

clock() is not a satisfactory timer for this benchmark. Whoever produced this modified version had to make it repeat far more than the original just to make it run long enough to be timed by clock(). Unfortunately, Standard C doesn't include a satisfactory timer function.

The point of s162() is to see whether the compiler recognizes the direction of data overlap, seeing that the loop will never be executed with a negative overlap. It's possible that when interprocedural optimization succeeds, the compiler sees that the overlap is a compile-time constant and can eliminate the conditional as well as propagate the constant overlap into the code. There are other tests in this suite intended to concentrate on that.
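For readers without the source at hand, here is a minimal sketch of the kind of loop being described. It is not copied from the benchmark: the names `s162_sketch` and `init_sketch`, the array length, and the initial values are assumptions chosen for illustration. The guard `k > 0` means the read of `a[i + k]` always runs ahead of the write to `a[i]`, so the loop is safe to vectorize if the compiler can use that fact.

```c
/* Hypothetical array length for illustration; the benchmark defines its own. */
#define SKETCH_LEN 32000

static float a[SKETCH_LEN], b[SKETCH_LEN], c[SKETCH_LEN];

/* Fill the arrays with simple, exactly representable values. */
void init_sketch(void)
{
    for (int i = 0; i < SKETCH_LEN; i++) {
        a[i] = (float)i;
        b[i] = 1.0f;
        c[i] = 2.0f;
    }
}

/* The loop body runs only for a positive overlap k, so every read of
   a[i + k] touches an element that has not been written yet. */
void s162_sketch(int k)
{
    if (k > 0) {
        for (int i = 0; i + k < SKETCH_LEN; i++) {
            a[i] = a[i + k] + b[i] * c[i];
        }
    }
}
```

When the driver and this function sit in one source file and `k` is passed as a literal, the compiler may fold the `k > 0` test away entirely, which is one way the timing can differ between builds.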

susangao

Got it. Thank you very much for your helpful and kind reply.

david m.

The link to http://software.intel.com/en-us/system/files/TSVC-s162.tar.gz is broken.

Can you provide an updated link?

Sergey Kostrov

>>...I found something confusing with function s162(). If I comment out all the function calls after s162() in main(),
>>its timing is 2.x sec on an E5-2680; if I leave them in and use the original code, its timing is 4 sec...

It could be an alignment issue (this needs to be investigated). I've seen a similar issue with two of my performance evaluation tests for some SSE2 and AVX instructions. It was almost the same thing but in the opposite direction: performance got better when I commented out some pieces of code.

I really disagree with Tim's comment regarding the CRT function clock(), since it provides satisfactory accuracy, down to milliseconds, if a test runs for more than a couple of seconds. Of course, if somebody tries to use clock() to measure a time interval with microsecond or nanosecond accuracy, it won't provide reliable numbers.
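As a rough illustration of how clock() is typically read, here is a small sketch using only Standard C. The helper names `cpu_seconds` and `clock_tick_seconds` are made up for this example. Note that clock() reports processor time used by the program, and that 1/CLOCKS_PER_SEC is only the nominal unit: the value may be updated in coarser steps on a given system.

```c
#include <time.h>

/* Processor time used by this program so far, in seconds, as reported by clock(). */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Nominal size of one clock() unit in seconds; the real update step may be coarser. */
double clock_tick_seconds(void)
{
    return 1.0 / (double)CLOCKS_PER_SEC;
}
```

Taking the difference of two `cpu_seconds()` readings around a test that runs for a few seconds is the usage being defended here.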

Tim Prince

In the original version of this benchmark (http://www.netlib.org/benchmark/vectord), loops of length 1000 are timed without extra repetitions. The shorter loops are repeated so as to process as much data as the longer ones, but they repeat over the same cached data. Many of these tests run in around 100 microseconds on a 2.6 GHz Core i7-2, so a timer with microsecond resolution is needed.

I ran some tests this week on Linux with various timers and got reports of microsecond resolution with Intel OpenMP omp_get_wtime(). I believe it's nearly that good on Windows. On Linux, gettimeofday() is expected to work as well.
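A minimal sketch of a wall-clock timer built on gettimeofday() might look like the following, assuming a POSIX system. The helper names `wall_seconds`, `time_kernel`, and `dummy_kernel` are invented for this example, and `dummy_kernel` is only a stand-in for one of the test loops.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds; gettimeofday() reports seconds plus microseconds. */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Elapsed wall-clock time of a single call to kernel(). */
double time_kernel(void (*kernel)(void))
{
    double t0 = wall_seconds();
    kernel();
    return wall_seconds() - t0;
}

/* Stand-in workload; the volatile store keeps the loop from being optimized away. */
static volatile double sink;

void dummy_kernel(void)
{
    double s = 0.0;
    for (int i = 0; i < 100000; i++) {
        s += i * 0.5;
    }
    sink = s;
}
```

Calling `time_kernel(dummy_kernel)` returns the elapsed time in seconds. In a build with OpenMP enabled, omp_get_wtime() could take the place of `wall_seconds()`.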

Sergey Kostrov

>>In the original version of this benchmark http://www.netlib.org/benchmark/vectord loops of length 1000 are timed
>>without extra repetitions. The shorter loops are repeated so as to process as much data as the longer one, but repeating
>>over the same cached data. Many of these tests run around 100 microseconds on 2.6 GHz Core i7-2, so a timer with
>>microsecond resolution is needed.

Thanks, Tim, for these details.

