# How to evaluate flops of MIC card?

Hi everybody,

I need to count the flops of a code that runs on a MIC card in native mode, but I don’t know the correct way to evaluate them. I found a document on the internet, and the link is given as follows,

which provides a way to evaluate the flops using VTune. But the value I got with this method seems too small.

Are there any other methods that give reasonable flops for MIC?

Thanks a lot!

Shaohua

I would suggest looking at your application/benchmark in an abstract way: by hand, or with a spreadsheet, compute the number of floating point operations necessary to complete the section under test. Then run the program, preferably well optimized, and time the run. Then divide the estimated number of operations by the runtime. This will produce a practical FLOPS figure for that application/benchmark, including overhead for loop control and other non-FLOP compute sections of the program.
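A minimal sketch of this procedure, in Python for brevity. The function names and the dot-product "section under test" are hypothetical and only illustrate the bookkeeping: count the operations by hand, time the section, then divide.

```python
import time


def achieved_flops(flop_count, seconds):
    """Hand-counted floating point operations divided by measured runtime."""
    if seconds <= 0:
        raise ValueError("runtime must be positive")
    return flop_count / seconds


def time_section(func, *args):
    """Run func(*args) once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def dot(x, y):
    """Hypothetical section under test: n multiplies and n - 1 adds."""
    total = x[0] * y[0]
    for a, b in zip(x[1:], y[1:]):
        total += a * b
    return total


if __name__ == "__main__":
    n = 100_000
    x = [1.0] * n
    y = [2.0] * n
    _, elapsed = time_section(dot, x, y)
    hand_count = 2 * n - 1  # counted from the algorithm, not from instructions
    print(f"{achieved_flops(hand_count, elapsed) / 1e6:.1f} MFLOPS")
```

The hand count stays the same on every platform; only the measured time changes.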

The number of FLOPS that you can attain is more a measure of your ability to code efficiently than it is of the raw computational ability of the MIC processor.

Jim Dempsey

Jim is absolutely correct. It is the root of a lot of misinformation concerning the achievable performance of a piece of HW given a certain algorithm.

Even doing that, your analysis of the number of floating point operations performed by your code isn't clear cut. The compiler and libraries you use likely change your algorithm unbeknownst to you by doing optimizations that change the number of floating point operations your code needs to execute. For example, it may identify and change your original algorithm to account for a common special case that requires fewer floating point operations. All this makes your analysis less useful.

All this is part of the reason why there is and has been so much debate as to the usefulness of using floating point operations per second as a performance metric.

Regards
---
Taylor

Dear Jim and Taylor,

I tried to evaluate the flops with the method Jim mentioned, but I got a much smaller value, which cannot be the actual computational ability of MIC, as Jim has commented. For my application, though, I have to evaluate the percentage of peak performance when it runs on the MIC card, so I have to get the actual flops of MIC rather than the flops obtained by manually counting the code.

I have ported my algorithm on the MIC card and it runs about two times faster than two Xeon E5-2650 CPUs. This algorithm will also be tested on GPU and the results will be compared with MIC. I think comparing the flops is a better way to evaluate the performance of these two different devices. But I feel depressed because I can’t find a way to get reasonable flops for MIC.

Regards,

Shaohua

The MIC flop rate can vary from 1x to 16x depending on how well your application/benchmark is amenable to vectorization .AND. you as a programmer and the compiler writer's skill at utilizing the 16-wide vectors (floats), or 8-wide doubles. Further, if your code is amenable to Fused Multiply and Add/Subtract, you might see another 1.25x. Therefore, the FLOPS rating could vary from 1x to 20x depending on how well you (or the author of a benchmark) write code that optimally uses the vector capabilities. Other factors, such as cache and memory latencies, can add to or subtract from the FLOPS rating. There is too much leeway here to be specific about an arbitrary FLOPS rating. That said, for a specific application where you know (or can closely estimate) the number of floating point operations, you can determine the FLOPS rating for that application.
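The arithmetic behind the 1x to 20x range can be written out as a short sketch. The function name is hypothetical, and the 1.25 factor is the rough estimate from the post above, not a hardware constant.

```python
FLOAT_LANES = 16    # 512-bit vector register / 32-bit float
DOUBLE_LANES = 8    # 512-bit vector register / 64-bit double
FMA_FACTOR = 1.25   # rough estimate quoted in the post, not a measured value


def relative_flops_ceiling(vector_lanes, fma_factor=1.0):
    """Best-case rate as a multiple of the scalar, non-fused 1x baseline."""
    return vector_lanes * fma_factor


if __name__ == "__main__":
    print("floats :", relative_flops_ceiling(FLOAT_LANES, FMA_FACTOR))
    print("doubles:", relative_flops_ceiling(DOUBLE_LANES, FMA_FACTOR))
```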

>>I have ported my algorithm on the MIC card and it runs about two times faster than two Xeon E5-2650 CPUs

Which indicates the algorithm you chose or implemented is having good vectorization.

>>This algorithm will also be tested on GPU and the results will be compared with MIC.

Be sure to run the same test using doubles as well as floats.

>>I think comparing the flops is a better way to evaluate the performance of these two different devices.

Again, I strongly disagree. There are too many additional factors in an application other than FLOPS in determining which computing engine is best. For an example, the automobile would have a related measure of redline RPMs. How can this be used to determine its performance at Le Mans, or Bonneville Salt Flats, or drag strip? You cannot look at one factor to assess the performance.

If your application does not vectorize well, then it might not be suitable for MIC.
If your application has conditional flow control combined with good vectorization then it might not be suitable for GPU.
If your application is dependent on doubles (IOW engineering as opposed to gaming) then it might not be suitable for GPU, but suitable for MIC.

Too many factors depend on the programmer's ability to make full use of the computational capabilities of the system for an abstract FLOPS rating to be of meaningful use (be it on a GPU or MIC).

Look in the blogs section (top of page, Development | Tools | Resources | Blogs) and search for "chronicles of phi". This is a series of articles I wrote showing the same functional application written several ways, with the progression of FLOPS (on MIC) from an OpenMP-parallel but scalar baseline program, referenced as 1x, to simple OpenMP vectorized at ~5.8x, loop peeling added at ~7.25x, tiled at ~10x, and then three additional strategies that attained ~11x, 12.1x, and ~14.5x that of the original baseline program.

Note, all of these programs were running on the same "processor" with the same FLOPS rating. Ask yourself: how well does a FLOPS rating factor into assessing the performance of your application, when programming skill can vary the results so widely?

Jim Dempsey

"Percentage of Peak" is much more difficult to define than one might think.

• What is the "peak" if your code requires the execution of double-precision divide or square root instructions? Xeon Phi does these in software, and the exact number of operations performed by the iterative algorithm can vary significantly depending on the degree of accuracy required.  Of course the exact number of operations performed will also be quite different than what you would see on a system with native double-precision divide and/or square root instructions.

• In one high-profile case "System A" solved a required benchmark 10% faster than "System B", but "System A" was considered to have "failed" the benchmark test while "System B" passed. Why? "System A" used hardware divide instructions which were fast, but which were only counted as 4 operations each. "System B" used a reciprocal approximation followed by a sequence of 13 fused multiply-add instructions, so they got to take credit for executing 27 floating-point instructions. This gave "System B" more "GFLOPS" than "System A", despite being slower at solving the problem.
• Most computational codes are limited by the ability to load and store data, rather than by the ability to perform arithmetic.  Most current high-performance processors have a "peak" performance of 1/3 of "peak" when running the DAXPY kernel from L1 cache, and a "peak" performance of 1/2 of "peak" when running the DDOT kernel from L1 cache.   Which "peak" makes sense?
• I once worked through an example for a community atmospheric model, showing that the time required to execute all the "load" and "store" instructions (assuming the maximum load/store issue rate, and all loads and stores hitting in the L1 cache) would result in a floating-point operation rate of 18% of "peak" -- even if all of the floating-point instructions were removed from the code (or otherwise took zero time).   Based on my experience this is a typical result.
• Few computational codes have an exact balance of floating-point addition and multiplication operations, and fewer still have exactly balanced floating-point addition and multiplication operations that can all be implemented using the fused multiply-add instruction that provides the basis for the most common "peak" performance estimates.
• Note that the fused multiply-add is less general than separate add and multiply operations, so some algorithms will have the same "peak" performance with either approach, while other algorithms will have up to twice the "peak" performance with one adder and one multiplier compared to one fused multiply-add unit.
• Many algorithms perform variable amounts of work to achieve a result -- for example iterative solvers.   The rate at which arithmetic operations are done can easily be completely uncorrelated with the rate at which completed results are provided.   In this case, the increased accuracy of the fused multiply-add operation on Xeon Phi could allow convergence in fewer operations than what you would obtain using SSE or AVX arithmetic.
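One way to arrive at the 1/3 and 1/2 figures in the DAXPY/DDOT bullet above is a simple issue-rate model. The sketch below assumes a core that can issue at most one fused multiply-add and one load or store per cycle. Those limits, and the function name `fraction_of_peak`, are assumptions of the sketch rather than figures stated in the post.

```python
from fractions import Fraction


def fraction_of_peak(mem_ops, fma_ops, mem_per_cycle=1, fma_per_cycle=1):
    """Fraction of peak FMA rate reachable when loads/stores also need issue slots.

    mem_ops and fma_ops are per loop iteration; all data is assumed to hit in L1.
    """
    mem_cycles = Fraction(mem_ops, mem_per_cycle)
    fma_cycles = Fraction(fma_ops, fma_per_cycle)
    # The iteration takes as long as the busier of the two issue resources.
    return fma_cycles / max(mem_cycles, fma_cycles)


# DAXPY: y[i] = a * x[i] + y[i] -> 2 loads + 1 store, 1 FMA per element
DAXPY = fraction_of_peak(mem_ops=3, fma_ops=1)
# DDOT: s += x[i] * y[i] -> 2 loads, 1 FMA per element
DDOT = fraction_of_peak(mem_ops=2, fma_ops=1)

if __name__ == "__main__":
    print("DAXPY:", DAXPY, " DDOT:", DDOT)
```

Under this model the arithmetic units sit idle most of the time even with perfect L1 hits, which is the point of asking which "peak" makes sense.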

Despite these concerns, it is often of value to interpret performance in terms of "rate" rather than "time", so we do need a way to determine the numerator in the equation: rate = work / time.   Sometimes it is relatively easy to determine what "work" value to use -- often by manual counting of the nominal number of operations specified by the code.  Sometimes it is impractical to take this approach -- codes performing adaptive refinement come to mind.    In these cases it is still better to understand the problem and define "work" in a way that is related to the current problem and its parameters.  The "work" should be independent of the details of how the compiler chooses to implement the arithmetic, and the "work" should be independent of iteration counts in cases where that might vary.

Of course not all applications are subject to confounding subtleties in counting arithmetic.  For the large number of applications that are "well-behaved" it would be nice to be able to request that the compiler generate code to count the actual number of operations performed at run time (this might include arithmetic operations of various types and precisions, load operations, store operations, or perhaps other specific categories of operations).  Doing this in the compiler (rather than in a post-processing binary-instrumentation step) would allow the high-level and back-end optimizers to eliminate most of the overhead involved in tracking these operations, leaving (typically) an integer multiplication by a constant and an integer accumulation for each basic block as the maximum overhead expected.   The "exit" functionality would need to be patched to print or otherwise save these values at program termination.
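The per-basic-block accounting described above can be mimicked by hand. The following sketch is hypothetical throughout: no existing compiler option is implied, and the block names and operation tables are invented for illustration. It shows only the bookkeeping, a static per-block operation table multiplied by how often each block ran.

```python
from collections import Counter

# Hypothetical static table: operations performed by one execution of each block.
STATIC_OPS = {
    "loop_body": {"fp_add": 1, "fp_mul": 1, "load": 2, "store": 1},
    "epilogue": {"fp_add": 3},
}


def total_ops(static_ops, executions):
    """Multiply each block's static counts by its execution count and accumulate."""
    totals = Counter()
    for block, times in executions.items():
        for op, per_execution in static_ops[block].items():
            totals[op] += per_execution * times
    return totals


if __name__ == "__main__":
    # Execution counts would be gathered at run time and saved at program exit.
    print(dict(total_ops(STATIC_OPS, {"loop_body": 1000, "epilogue": 1})))
```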

Of course this would certainly not satisfy all possible requirements for how one might define "work" -- the results would still vary with compiler optimization level and code generation choices -- but it would be useful for a large number of applications and would not depend on changing the hardware performance monitoring facilities to meet the specific monitoring requirements of a user.

"Dr. Bandwidth"

Dear Jim,

>> The MIC flop rate can vary from 1x to 16x depending on how well your application/benchmark is amenable to vectorization .AND. you as a programmer and the compiler writer's skill at utilizing the 16-wide vectors (floats), or 8-wide doubles…

Yes, I totally agree with you: the performance of an application largely depends on the optimization you have done. There are many things that affect the flops, such as the compiler and the optimization of the code. But I still think flops have real meaning, because they can show you how fast your code is on a specific device, at least qualitatively.

>> >>I have ported my algorithm on the MIC card and it runs about two times faster than two Xeon E5-2650 CPUs

>> Which indicates the algorithm you chose or implemented is having good vectorization.

Actually, besides the vectorization, my algorithm also has good scalability and high cache usage. I have seen some applications that vectorize very well but still perform badly.

>> Be sure to run the same test using doubles as well as floats.

My algorithm only uses floats, so I just run the test with floats.

>> Again, I strongly disagree. There are too many additional factors in an application other than FLOPS in determining which computing engine is best. For an example, the automobile would have a related measure of redline RPMs. How can this be used to determine its performance at Le Mans, or Bonneville Salt Flats, or drag strip? You cannot look at one factor to assess the performance.

It is known that the Intel compiler does a lot to optimize code, so the performance of code running on an Intel product reflects not only the code itself but many other things as well. But the flops do provide overall performance information, at least qualitatively, for a code on a specific platform such as MIC with Intel’s compiler.

Whether flops are suitable for evaluating performance is a huge topic, but that’s not really what I sought in this post. Could you please be so kind as to tell me how to evaluate it on MIC?

Regards,

Shaohua

Dear John,

Many thanks for your kind advice; it does deepen my knowledge of flops and of application performance evaluation.

The flops figure is still a key parameter in the current state of the art, because some well-known venues, such as the Gordon Bell Prize at the annual SC (Supercomputing) conference, require people to provide flops in their papers. Flops is also a key parameter for evaluating the actual performance of TOP500 supercomputers, such as the 33.86 petaflop/s (quadrillions of calculations per second) that Tianhe-2 achieved on the Linpack benchmark. That’s why I want to count the flops of my application on MIC.

Could you please be so kind as to tell me how to evaluate it on MIC?

Regards,

Shaohua

Shaohua,

In the Gordon Bell-like SC conferences, the FLOPS figures given are:

(the abstract number of operations) / time

With a side note that math functions (sin, cos, sqrt, ...) are each treated as one operation, as opposed to the equivalent number of abstract operations used by, say, Newton's method for finding a square root.

For example, for a matrix multiply you compute the number of operations you would perform by hand. As far as I know, nobody performs a fused multiply and add by hand. Also, on the flip side, the by-hand calculation does not count the loads and stores as operations.
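As a sketch of the by-hand count for a matrix multiply: each element of the result takes k multiplies and k - 1 adds, counted separately, and loads and stores are not counted. The function names are hypothetical, and the naive loop is there only to check the formula against an explicit tally.

```python
def matmul_flops(m, n, k):
    """Hand count for C (m x n) = A (m x k) * B (k x n)."""
    return m * n * (2 * k - 1)


def matmul_counted(a, b):
    """Naive matrix multiply that also tallies the multiplies and adds it performs."""
    m, k, n = len(a), len(b), len(b[0])
    flops = 0
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = a[i][0] * b[0][j]      # first multiply, no add yet
            flops += 1
            for p in range(1, k):
                acc += a[i][p] * b[p][j]  # one multiply and one add
                flops += 2
            c[i][j] = acc
    return c, flops


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c, counted = matmul_counted(a, b)
    print(c, counted, matmul_flops(2, 2, 2))
```

For large square matrices this is the familiar figure of roughly 2n^3 operations.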

To follow John's advice, I suggest you instrument your code using #define macros that take a FLOP weight value for the preceding or following statement. Then, in a Debug build (or Instrument build), the program accumulates the FLOP weights (and any other weights you might want to collect). Make an Instrumented run to get your FLOP count, and then make a fully optimized run to get the time. While you can use the performance counters as described in the link mentioned in #1, you will find that they give you a different number than the one you calculate by hand (or with the Instrumented build).
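The bookkeeping can be sketched as follows. This is a Python analogue rather than the #define macros suggested above, so a module-level flag stands in for the Debug/Instrument build switch. The names `INSTRUMENT`, `COUNTS`, `flops` and `norm` are hypothetical.

```python
import math

INSTRUMENT = True        # stands in for a Debug/Instrument build switch
COUNTS = {"flops": 0}    # accumulated FLOP weights


def reset():
    """Clear the accumulated weights before an instrumented run."""
    COUNTS["flops"] = 0


def flops(weight):
    """Add the FLOP weight of the preceding statement; no-op when not instrumenting."""
    if INSTRUMENT:
        COUNTS["flops"] += weight


def norm(v):
    """Euclidean norm with hand-assigned FLOP weights."""
    s = 0.0
    for x in v:
        s += x * x
        flops(2)         # one multiply, one add
    r = math.sqrt(s)
    flops(1)             # sqrt treated as a single operation
    return r


if __name__ == "__main__":
    reset()
    norm([3.0, 4.0])
    print("instrumented FLOP count:", COUNTS["flops"])
```

The count from the instrumented run is then divided by the time of a separate, fully optimized run with the instrumentation switched off.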

The authors of these SC papers likely have a footnote or appendix on how they computed the FLOPS. Not all FLOPS are equal.

Jim Dempsey

John hinted at an interesting point.   Giving iterative algorithms for divide and sqrt credit for the number of individual flops seems to distort priorities.  For example, the Intel Haswell architecture should do fine in terms of actual performance as well as accuracy, when using the IEEE divide and sqrt (Intel options -prec-div -prec-sqrt), although the flops rate may suffer.  The MIC implementation of -prec-div must use even more fp instructions than -no-prec-div, so, if your objective is to maximize the rate of instructions used, it might be attractive.

The only way I know of to count FLOPs on the Xeon Phi (or any other Intel IA-32 or x64 processor) is to do it by hand.
This is almost always approximate, since you usually don't know whether the compiler will be able to reduce operation count by common subexpression elimination, or how much it will increase operation count for divides, square roots, and transcendental functions, but it does provide a "nominal" operation count that can (and should) be used on all platforms.  Jim Dempsey provided an example of how to encode this in your program.  I usually add a discussion of the operation counts in the comments and then include operation count formulas that are computed at runtime based on the parameters of the problem.

Two approaches that would be nice, but which don't seem to work here:

1. The hardware performance counters don't know how to differentiate between arithmetic vector instructions and other vector instructions.
2. On other Intel processors it should be possible to count various operation types using the "PIN" binary instrumentation program -- see:
https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumen...
Unfortunately this won't run on Xeon Phi -- it requires a slightly newer core -- and you can't run the Xeon Phi binary on any other platform.

You could get *close* to the right answer by compiling the same code using AVX or SSE and running with instrumentation on the host, but it won't be exact because of software divides and square roots on Xeon Phi and because of (sometimes subtle) differences in common subexpression elimination that are possible with different instruction set architectures.

"Dr. Bandwidth"

Dear Jim and John,

Regards,

Shaohua.

Dear Tim,

I think the method you mentioned counts flops based on instructions, like the method described in the following link,

But now I strongly doubt the accuracy of this method.

I will follow Jim and John’s advice to count FLOPs manually and treat each math function as one operation. With this method I will get an absolute FLOP count that is not affected by any hardware or software. Then I will use this value to evaluate the flops of my code on different platforms such as MIC and GPU. I think this is a more meaningful way to compare performance among different devices.

Regards,

Shaohua

Hi all,

Whether using FLOPS (floating point operations per sec) is a useful metric or not is a controversy that has been around for over 3 (and probably more) decades. I've seen its use go in and out of fashion over the years. It seems to be in fashion right now.

I've been trying to find a good discussion of the controversy and haven't been able to. I know it has to be out there, and there are probably dozens of them. So far, the best discussion I've found is the one above.

Can any of you point me to one?

Regards
--
Taylor

You're probably aware of this http://www.hpcwire.com/2013/11/22/new-benchmark-shake-top500-rankings/

even though I haven't seen any Intel recommendation on how to run that benchmark.  It's definitely intended to provide an alternative from the past proponents of the major flops ratings.  It seems to provide an opportunity to demonstrate improvements in gather implementation.

Tim,

With the little reading I've done of the paper I do have a cautionary caveat regarding:

>>However, HPL only stresses Type 1 patterns and, as a metric, is incapable of measuring Type 2 patterns. With the emergence of accelerators, which are extremely effective with Type 1 patterns relative to CPUs, but much less so with Type 2 patterns, HPL results show a skewed picture relative to real application performance.

Where Type 1 is highly vectorizable code and Type 2 is not, or is less so.

My gripe is that if you code (or benchmark) poorly for vectorization when you could code for vectorization, then the new test criteria can mislead you into believing accelerators are of little use.

I think it is important to read between the lines. Under Requirements:

>>Accurately predict system rankings for a target suite of applications:...

The "between the lines" parts are: Does the suite of applications match your application's requirements? And, should one or some of the applications in the benchmark align with your applications, is the source available so that you can reorganize the test program to take advantage of the accelerator? Without this capability, you won't know if the benchmark is expressly written (chosen) to disparage coprocessors.

Then look at note 2 on page 9 with the emphasis on C++ object oriented features. OO is typically tailored to scalar operations, not vector (IOW it tends to be AOS oriented as opposed to SOA oriented). Don't get me wrong by assuming my position is that C++ is no good for HPC. Instead, my position is that there are numerous excellent features of C++ (ctor, dtor, templates, etc.); my point is to pay attention to data layout that favors vectorization (don't do fine-grained OO classes). Use the C++ features to aid you in hiding (by encapsulation) the nasties of using SOA.
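The AOS versus SOA distinction can be sketched briefly. This is a Python illustration of the data layout only: the class names are hypothetical, and Python itself does not vectorize these loops, so the sketch shows how encapsulation can hide an SOA layout rather than any performance effect.

```python
from array import array


class ParticleAoS:
    """Array-of-structures element: one small object per particle."""

    def __init__(self, x, v):
        self.x = x
        self.v = v


def advance_aos(particles, dt):
    """Fields of different particles are scattered across separate objects."""
    for p in particles:
        p.x += p.v * dt


class ParticlesSoA:
    """Structure-of-arrays container: one contiguous array per field."""

    def __init__(self, xs, vs):
        self.x = array("f", xs)
        self.v = array("f", vs)

    def __len__(self):
        return len(self.x)

    def advance(self, dt):
        """Same update, but each field is walked as a contiguous array."""
        for i in range(len(self.x)):
            self.x[i] += self.v[i] * dt


if __name__ == "__main__":
    soa = ParticlesSoA([0.0, 1.0], [2.0, 4.0])
    soa.advance(0.5)
    print(list(soa.x))
```

Callers use `advance` without knowing how the fields are stored, which is the kind of hiding the post recommends.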

Jim Dempsey