Question about unexpected performance when using icc for TSVC benchmark


Hi all,

I got the TSVC benchmark from http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz. It is used to evaluate the vectorization capability of compilers, and I am trying to test icc with it. I changed the makefile to use icc and am looking at the timings.

The code is simple: there are 151 main functions, each with a typical loop and its own initialization. Each one prints its timing and a correctness-check result. The main() function calls these 151 functions.

I found something confusing with function s162(). If I comment out all the function calls after s162() in main(), its timing is 2.x sec on the E5-2680; if I leave them in and just use the original code, its timing is 4 sec. This is repeatable, not random. It doesn't matter whether I keep or comment out the function calls before s162(). That is the point that confuses me: only the part after s162() matters.

I compared the .s files for s162() in the two cases, and there seems to be no difference in the instructions. (The correctness-check results are the same in both cases.)

The platform I am using is:
Chip: Xeon E5-2680
OS: GNU/Linux
ICC: icc (ICC) 12.1.3 20120212

This also happens when I run it on a Xeon 5660, where the timings are 3.x and 6.x sec. The platform info is:
Chip: Xeon 5660
OS: GNU/Linux
ICC: icc (ICC) 12.1.2 20111128

Attached are the original package and the makefiles I used for the two platforms above.

I would appreciate your advice on what the reason is. It may look like a silly question, but I hope to learn from you and figure it out.

Thank you for reading.

Best Regards,
Susan

Attachment: TSVC-s162.tar.gz (13.53 KB)

This benchmark was not intended to be compiled with the driver and the test function in a single source file. Even 22 years ago (when the original was published), compilers could short-cut artificial benchmarks by inter-procedural optimization.

clock() is not a satisfactory timer for this benchmark. Whoever produced this modified version had to make it repeat far more than the original just to make it run long enough to be timed by clock(). Unfortunately, Standard C doesn't include a satisfactory timer function.

The point of s162() is to see whether the compiler recognizes the direction of data overlap, seeing that it will never be executed with a negative overlap. It's possible that when interprocedural optimization succeeds, the compiler sees that the overlap is a compile-time constant and can eliminate the conditional as well as propagate the constant overlap into the code. There are other tests in this suite intended to concentrate on that.
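For readers without the sources at hand, here is a minimal sketch of the kind of loop s162() tests. It is a simplified illustration, not the benchmark's exact code: the function name s162_sketch, the pointer parameters, and the loop bound are assumptions made so the example stands alone.

```c
#include <stddef.h>

/* Simplified illustration of an s162-style kernel (names and bounds are
   assumptions, not the benchmark source). Each a[i] is computed from
   a[i + k]. When k > 0 every read is ahead of the corresponding write,
   so the loop carries no dependence that blocks vectorization. */
void s162_sketch(float *a, const float *b, const float *c, size_t len, int k)
{
    if (k > 0) {
        /* Stop early enough that a[i + k] stays inside the array. */
        for (size_t i = 0; i + (size_t)k < len; i++) {
            a[i] = a[i + k] + b[i] * c[i];
        }
    }
}
```

If the compiler can see the call site and k is a constant there, it can drop the `if` and specialize the loop for that offset, which is the interprocedural effect described above.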

Got it. Thank you very much for your helpful and kind reply.

The link to http://software.intel.com/en-us/system/files/TSVC-s162.tar.gz is broken.

Can you provide an updated link?

>>...I found something confusing happened for function s162(). In main() function, if I comment out all function calls behind s162(),
>>then it's timing is 2.x sec on e5-2680; but if I do not comment them and just use the original code, it timing is 4 sec...

It could be an alignment issue (this needs to be investigated). I've seen a similar issue with two of my performance evaluation tests for some SSE2 and AVX instructions. It is almost the same thing, though reversed: it gets better if I comment out some pieces of code.

I really disagree with Tim's comment regarding the CRT function clock(), since it provides satisfactory accuracy, down to milliseconds, if a test runs for more than a couple of seconds. Of course, if somebody tries to use clock() to measure a time interval with microsecond or nanosecond accuracy, it won't provide reliable numbers.

In the original version of this benchmark, http://www.netlib.org/benchmark/vectord, loops of length 1000 are timed without extra repetitions. The shorter loops are repeated so as to process as much data as the longer one, but repeating over the same cached data. Many of these tests run around 100 microseconds on a 2.6 GHz Core i7-2, so a timer with microsecond resolution is needed.

I ran some tests this week on Linux with various timers and got reports of microsecond resolution with Intel OpenMP omp_get_wtime(). I believe it's nearly that good on Windows. On Linux, gettimeofday() is expected to work as well.
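As a rough sketch of the timer discussion, the following shows a gettimeofday()-based wall-clock timer next to clock(). It assumes a POSIX system; the helper names wall_seconds, cpu_seconds, and time_kernel, and the sample_kernel loop, are made up for illustration and are not part of the benchmark.

```c
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock time in seconds with microsecond granularity (POSIX). */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* CPU time in seconds from Standard C clock(); its effective
   resolution is implementation-dependent and often much coarser. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Hypothetical short kernel over arrays of length 1000. */
float x[1000], y[1000];

void sample_kernel(void)
{
    for (int i = 0; i < 1000; i++) {
        y[i] = y[i] + 2.0f * x[i];
    }
}

/* Average wall-clock seconds per call over reps calls. */
double time_kernel(void (*kernel)(void), int reps)
{
    double start = wall_seconds();
    for (int r = 0; r < reps; r++) {
        kernel();
    }
    return (wall_seconds() - start) / (double)reps;
}
```

Calling `time_kernel(sample_kernel, 1)` times a single pass, which is the short-interval case where timer resolution matters; raising `reps` trades that for repeated runs over the same cached data.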

>>In the original version of this benchmark http://www.netlib.org/benchmark/vectord loops of length 1000 are timed
>>without extra repetitions. The shorter loops are repeated so as to process as much data as the longer one, but repeating
>>over the same cached data, Many of these tests run around 100 microseconds on 2.6Ghz coreI7-2, so a timer with
>>microsecond resolution is needed.

Thanks, Tim for these details.

