I have just finished a study titled "A Comparison of Compilers and Operating Systems for High-Performance Low-Cost Computations".
The details of the work can be found at
I welcome comments and suggestions.
Rajan, welcome, and thanks for posting your work here. I have not had a chance to review it in detail yet, but thank you for contributing.
That was a great paper - it made for some really instructive reading. Here are a few comments:
The benchmarks used are good for measuring the performance of the system (hardware/OS/compiler/data structures) as a whole. Since identical hardware was used, we have three dimensions to consider. The benchmarks do a great job of differentiating performance along the OS axis and the data-structure axis.
However, I have some suggestions for measuring and comparing compiler performance. Perhaps all the compilers really do perform in the same ballpark, as the paper suggests. But it is also possible that a different test methodology could more accurately contrast the performance of the compilers used.
With the current method,
- Insufficient memory leading to swapping can cause wide variance in the results. Swapping activity is not very predictable.
- The benchmark uses "elapsed time" or wallclock time. For compiler performance, "user time" is a more appropriate metric since it discounts time spent inside the kernel and I/O wait times which might vary significantly between runs. This may cause large constant factors to creep into the timings, reducing the visibility of differences due to compiler code quality.
- All daemons like crond must be stopped before taking performance runs so that they don't compete for CPU time. Presumably this was done, but not explicitly mentioned in the paper.
For measuring compiler performance, it is preferable to run a workload which:
- causes minimal I/O,
- fits completely in memory, avoiding the measurement uncertainty introduced by swapping,
- sustains very high (>95%) average CPU utilization (as measured by sar), and
- gives completely repeatable (within +/- 5%) results.
Thanks for your comments. Here are my specific responses.
(1) The Vector Performance Tests and the Finite Element Analysis Tests included problems that would fit in available RAM as well as problems that would not. You are right in pointing out that swapping causes a wide variance in the results. That was one of the underlying motivations in conducting the study - how well the OSes handle swapping. A number of other studies cited in the paper compare performance along the lines you have suggested: minimal I/O, workloads that fit completely in memory (avoiding the measurement uncertainty introduced by swapping), very high (>95%) average CPU utilization (measured by sar), and completely repeatable (within +/- 5%) results.
(2) The number of background services/daemons was reduced to a bare minimum.
(3) Finite element (FE) analysis programs run on a wide variety of computer systems - from systems with very little memory to systems that are maxed out. Software developers have little control over the composition of the hardware. Moreover, the amount of physical memory available unfortunately can never keep up with the size of the problems users want to solve. There was a time when a problem involving 100,000 unknowns was considered too large. Today, problems with a few million unknowns are routinely solved on anything from a cluster of PCs (running a 32-bit OS) to supercomputers (running a 64-bit OS). Commercial FE programs on 32-bit OSes use out-of-core equation solvers to solve very large problems that would not fit within the 2 GB memory restriction. My next study involves how well Intel's Itanium system (64-bit OS) performs when the same test software is compiled and executed, and possibly the effects of using OpenMP directives.
A realistic comparison between different options (compilers, OS) would have to involve wall clock time. This is the time that a user has to wait from the time the run is initiated to the time the results are available. As the results show, there are differences (though minor) between the two OSes.
Once again, thanks for your comments.
Barnali, welcome! Thanks for reading and contributing to this forum!
Thanks for posting your paper for discussion.
One point that comes to mind when you discuss code that is portable across C/C++ and Fortran is 2D arrays. A very important difference between C/C++ and Fortran is row-major vs. column-major 2D array layout. Unit-stride access across a 2D array is critical for efficient cache usage and application performance. This matters because HPC applications often deal with matrices implemented as 2D arrays.
I'd also agree that it's useful to have the user time reported in addition to wall clock time. User time shouldn't vary much run to run.
For the FEA programs, what are the functions that the Intel Linux compilers do not support? Is this just an issue of running under Red Hat 8.0? (That is not a supported combination of glibc and kernel for the 7.0 Intel Compilers for Linux.)
As a final FYI: in August 2001, the Visual Fortran product development team joined Intel and immediately began work on a "best of both worlds" product line, combining the Visual Fortran feature set and Fortran-specific optimizations with the Intel optimizer and code generators - details available at http://www.intel.com/software/products/compilers/techtopics/PortingCVF1.htm
Thank you for your comments. You are absolutely right that multi-dimensional arrays require very special treatment (this is perhaps where FORTRAN compilers are smarter than C/C++ compilers). The FEA program uses two-dimensional arrays sparingly. In a few weeks I will let you know how different implementations of dynamically allocated 2D matrices in C++ perform against FORTRAN code and compilers.
The reason I was unable to build the FEA program under Red Hat Linux 8.0 is that Intel's C++ compiler is not completely compatible with the gcc library. I am having difficulty compiling a file that calls the ntohl function. I get the following error message:
error: identifier "__bswap_32" is undefined.
Your technical support people concurred with me that this is a compatibility problem. Perhaps you can shed some light on when Intel's C++ compiler will be completely supported under Red Hat Version 8.0.
You can examine the /usr/include/bits/byteswap.h file and the /usr/include/netinet/in.h file to see what ntohl() is trying to use.
Also, a workaround that works on other versions of Linux is to put the following *after* your #include of in.h:
#undef htons
#undef ntohs
#undef htonl
#undef ntohl
#define htons(x) __bswap_constant_16(x)
#define ntohs(x) __bswap_constant_16(x)
#define htonl(x) __bswap_constant_32(x)
#define ntohl(x) __bswap_constant_32(x)
Of course, that will only work if those macros are defined on Red Hat 8.0. bits/byteswap.h should tell you which macros you can use instead.
Thanks for your suggestions on fixing the problem with in.h. They worked.
I have updated the article with the Intel C++ Linux timing values for the finite element test problems.