Xeon Phi Performance Question


We have a software application that we currently run only a single instance of at a time.  It requires no user input; it simply runs for a pre-defined amount of time, essentially navigating through a flowchart and spitting out data.  However, we would like to run multiple (100+) instances of it at once.  It is a relatively complicated program, with a decent amount of branching as it decides what to do with each node as it reaches it.  There are no memory bandwidth issues at present, either.  However, it is technically running the same code in parallel... does that mean it can take advantage of SIMD? Or does SIMD only apply to simple instructions, like multiplication and addition?


It seems unlikely that you will benefit from SIMD, since you say that you have a lot of control-flow deviation. SIMD is useful if you have the same operation (instruction) operating on multiple pieces of data at the same time (for instance, adding two vectors together). It is not easy to apply to code that is conditional-branch intensive and operating on scalar data. (No doubt Wikipedia has an article on SIMD if you want more details.)
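To make the distinction concrete, here is a minimal C++ sketch (names are illustrative, not from the application in question): the first loop applies the same operation to every element and vectorizes easily; the second is data-dependent branching of the kind described in the question, which SIMD handles poorly.

```cpp
#include <cstddef>

// Vectorizable: the same operation on every element, no branches,
// so the compiler can process several elements per SIMD instruction.
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Branch-heavy scalar code: each step's path depends on the data,
// so SIMD lanes would diverge and the loop stays scalar.
int walk_flowchart(const int* nodes, std::size_t n) {
    int state = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (nodes[i] > state)        state += nodes[i];
        else if (nodes[i] < -state)  state -= nodes[i];
        else                         state ^= nodes[i];
    }
    return state;
}
```

Compiling with optimization and asking the compiler for a vectorization report will typically show the first loop vectorized and the second left scalar.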

The simplest way to run your code would seem to be just to do exactly what you said: "run multiple (100+) copies at once". You could do this from the shell on the MIC in the obvious way. (Since the MIC is running Linux, all of your code will be shared, even though you have multiple processes.)

Quote:

James Cownie (Intel) wrote:

It seems unlikely that you will benefit from SIMD, since you say that you have a lot of control-flow deviation. SIMD is useful if you have the same operation (instruction) operating on multiple pieces of data at the same time (for instance, adding two vectors together). It is not easy to apply to code that is conditional-branch intensive and operating on scalar data. (No doubt Wikipedia has an article on SIMD if you want more details.)

The simplest way to run your code would seem to be just to do exactly what you said: "run multiple (100+) copies at once". You could do this from the shell on the MIC in the obvious way. (Since the MIC is running Linux, all of your code will be shared, even though you have multiple processes.)

Appreciate the quick reply.  I was able to run the code from the shell on the MIC, however it took 4 times as long to run 106 instances in parallel on the MIC than it took to run 106 instances serially on a normal Xeon processor. That surprised me enough to ask the above question. I would think that even though it would run somewhat slower per instance, as the MIC cores are only about 1 GHz, it would still be at least 50x faster overall in parallel than serially.  If a program is conditional branch intensive, without any simple algorithmic for loops, is the Xeon Phi simply not a good card to run on, even if you have 100+ threads going at once?

Serial, integer, branchy code is likely the worst case for the MIC. Remember that as well as being clocked more slowly, the MIC processor is a small, in-order core, which you're comparing with a large, out-of-order one on the host. (Think of it this way: if we could put >60 big out-of-order cores on a chip we would, but we can't, so we've put >60 small in-order ones :-)).

I would, though, check the throughput scaling. There is a difference between running one thread/core and running two or more, so the scaling may change as you pass that point.

There may, also, be other limitations. Even if the code is not BW limited on the Xeon, the MIC doesn't have 60x the memory bandwidth, so even though the cores are slower, you could be BW bound when you're running lots of instances in parallel.

You haven't told us the simplest measurement: What is the relative performance of one instance on KNC vs one instance on Xeon?

If we knew that we'd be able to see whether the problem is the serial code or some interference between your parallel instances.

Running on a 2.80 GHz Xeon Quad Core, the single instance took 26 ms.  Running on a Xeon Phi 3120, the single instance took 435 ms.  We were hoping that there wouldn't be much overhead involved in running 100+ instances (we tried 106 instances), but that took approximately 4 seconds.  When we ran multiple instances in parallel on the Quad Core, the performance was core-bound, with very little overhead; i.e. 1-4 instances took between 26-30 ms, 5-8 instances took between 55-62 ms, etc.  However, this was not at all what we found with the Phi. Again, thank you for your prompt responses; they have already answered a lot of questions I had.

Hmm, 435/26 = 17x slowdown which is more than I would have expected. (About 10x is more plausible). 

You prompted me to measure the time for fork/exec, since it seemed as though that might be non-trivial compared with the small amount of work you are doing. To fork/exec a null process, I see it taking ~15M clocks on a 3 GHz Xeon (so ~5 ms), 10.3M clocks (so ~10.3 ms) on the KNC when loading the image from a host file system, and 9.4M clocks (so ~9.4 ms) when loading from /tmp on the device.
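For reference, a minimal sketch of how such a fork/exec measurement can be made (assuming Linux/POSIX; /bin/true stands in for a "null" process - any tiny binary that exits immediately will do):

```cpp
#include <chrono>
#include <sys/wait.h>
#include <unistd.h>

// Measures the wall-clock cost of fork+exec+wait of a trivial process.
// Returns the elapsed time in microseconds.
long fork_exec_cost_us() {
    auto t0 = std::chrono::steady_clock::now();
    pid_t pid = fork();
    if (pid == 0) {                        // child: replace image, exit fast
        execl("/bin/true", "true", (char*)nullptr);
        _exit(127);                        // only reached if exec failed
    }
    int status = 0;
    waitpid(pid, &status, 0);              // parent: wait for the child
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Repeating the call a few hundred times and averaging gives a more stable figure than a single shot.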

Assuming that the times you're quoting are (as if) measured by "% time test_case", not measured inside the code, that suggests that the cost of fork/exec may be ~25% of the total time on the host! Which really says that you should be batching more of these samples into a single execution.
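The batching idea can be sketched like this (run_one_sample is a hypothetical stand-in for one flowchart traversal; the LCG step is placeholder work):

```cpp
// Hypothetical stand-in for one flowchart traversal producing a result.
unsigned long long run_one_sample(unsigned long long seed) {
    unsigned long long state = seed;
    for (int i = 0; i < 1000; ++i)  // placeholder work (LCG step)
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return state;
}

// Batching: one process runs `count` samples, paying the process
// start-up cost (fork/exec, loader, runtime init) once, not `count` times.
unsigned long long run_batch(int count) {
    unsigned long long acc = 0;
    for (int i = 0; i < count; ++i)
        acc ^= run_one_sample(i);
    return acc;
}
```

One process doing 100 samples pays the several-millisecond start-up cost once rather than 100 times.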

435 ms is rather a small time to do performance analysis over with VTune (since the number of samples taken probably won't be enough to tell you anything useful). If you can extend the code to run for something nearer four or five seconds, then running VTune would definitely be my next step.

One other thought... you said "... spitting out data."

How much data? Maybe you are I/O bound on the KNC?

Just to clarify, KNC == Knight's Corner == MIC, correct?

Also, we essentially disabled all I/O for the testing just to get raw performance results. It iterates through the data, but does not print it out (until the end, when it prints out the time taken).

The times I was quoting were measured internally for the single instances (435 ms and 26 ms, respectively, for the Phi and the Xeon Quad-core), and that involved no forking. Any thoughts as to why the non-forking single instance might be 17x slower rather than 10x slower?

However, the time was measured via "% time test_case" for the 106 parallel executions, so that certainly seems to be a major cause of slowdown there, since it takes twice as long to fork on the Phi as on the Xeon Quad-core.  However, I am not sure if batching the executions will be beneficial, since we will rarely have more parallel executions than that, and everything I have read on the Phi says that unless you are using at least two threads per core, you probably aren't getting much benefit. However, I will try batching and see what happens.

Quote:

Jonathan J. wrote:
Just to clarify, KNC == Knight's Corner == MIC correct? 

Spot on.

Quote:

Jonathan J. wrote:
Any thoughts as to why the non-forking single instance might be 17x slower rather than 10x slower?

Not without looking at what VTune tells you... 

Batching into groups of 4 lowered the runtime from 4 seconds to 3.67 seconds, so that certainly helped. However, still very slow compared to simply running concurrently on the Quad core. Can you envision any scenario, configuration, or library (I am using TBB to thread the instances) that would enable 100+ concurrent instances of a complex, yet very quick, program to perform better on the Phi?

Also, how could I determine whether I was BW bound on the Phi while running many instances in parallel?

It would be good to investigate why you're seeing such a big slowdown, which is why I was suggesting looking at the code with VTune. Maybe something will show up there that gives you an "Ah ha" moment.

VTune should also be the answer for the BW-bound problem, though you might need to invoke a VTune expert (which I am not) to work out how to do the experiment. (It seems as though you need to set up a constant background load, then look at one instance.)
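As a crude complement to the VTune approach, a STREAM-style "triad" kernel can indicate bandwidth saturation: time it over arrays much larger than cache, run more and more instances side by side, and watch whether the aggregate GB/s stops scaling. A minimal sketch (illustrative names, timing harness omitted):

```cpp
#include <cstddef>
#include <vector>

// One STREAM-style triad pass: a[i] = b[i] + s * c[i].
// Returns a checksum so the compiler cannot elide the loop.
// Use n large enough that the arrays far exceed the cache.
double triad_checksum(std::size_t n, double s) {
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

If per-instance time rises sharply as instances are added while the aggregate bytes moved per second plateaus, the workload is bandwidth bound.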

OK. I will attempt VTune.  Thank you very much. I will post when I have details from that.

Running the test through VTune, it takes 2.569s of Elapsed time (4.163s of CPU time).  Of that, 1.654s is [vmlinux], and about 2 seconds is the actual test instance, and the rest is [libcrypto], [libc], and [libstdc++]... what is [vmlinux]?

[vmlinux] is time in the Linux kernel. With the appropriate incantations you should be able to have VTune use a kernel symbol table and break that down in more detail.

http://software.intel.com/en-us/articles/a-new-user-interface-for-vtune-amplifier-search-directories seems to contain the magic you need.

I have seen very similar behaviour with my application. I think what is happening is that the Linux kernel and (likely) the system libraries were NOT compiled with the Intel C compiler, but with gcc, which is limited to scalar code only.

This is a problem because the scalar unit of Xeon Phi is severely underpowered - both in itself and in comparison to modern CPUs. For example, if one makes a simple loop like:

for (i = 0; i < N; i++) single_vector_instruction;

then it won't run at full speed; one needs more vector instructions per loop iteration. This is just an illustration, of course, as one can use loop unrolling. As the scalar unit is shared between 4 threads, it (and especially the LEA instruction) can become a bottleneck that prevents issuing enough vector instructions to keep the FPU busy, even if one or two of the 4 threads are running more efficient code. (I am not 100% sure about this, but I think if one thread has a stalled instruction, then this stalls all the other threads on the core as well.)

The reason I mention LEA is that one usually needs one LEA or integer ADD per load/store to compute the array pointer. So add any more integer instructions (say, for control flow or for double-indexed arrays) and you have more instructions for the scalar unit than for the vector unit, and the FPU is not kept busy. Fortunately there are a lot of FPU registers, so one can load constants, etc., and then pack loops with vector arithmetic that does not need extra scalar instructions.
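One way to trim the scalar address arithmetic described above is to replace indexed accesses with pointer bumps; a small illustrative sketch (whether it actually helps depends on what the compiler already emits):

```cpp
#include <cstddef>

// Indexed form: each access implies an address computation (LEA/ADD)
// from the base pointer and the index.
long sum_indexed(const long* a, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Pointer-bump form: one pointer increment per iteration, fewer scalar
// address computations competing with the vector unit.
long sum_bumped(const long* a, std::size_t n) {
    long s = 0;
    for (const long* end = a + n; a != end; ++a)
        s += *a;
    return s;
}
```

Both compute the same result; the difference is only in how much work lands on the scalar unit.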

The solution was to minimize calls to system libraries, which includes (especially) calloc/new/free/delete - they are extremely slow and cause ridiculous contention between threads. Where a conventional application with lots of new/delete calls in its initialization loop scales fine to 64 cores (on an AMD system), Xeon Phi barely makes it to a few dozen. With all 240 threads I saw the unmodified code spend more time in kernel calls waiting for a spinlock than in user land.
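The allocation point can be sketched like this (illustrative names; in a real application the reused buffer would be per-thread):

```cpp
#include <cstddef>
#include <vector>

// Per-call allocation: new/delete on every invocation contends on the
// allocator lock when hundreds of threads do it simultaneously.
long work_alloc_each_time(std::size_t n) {
    std::vector<long>* buf = new std::vector<long>(n, 1);  // hot-path allocation
    long s = 0;
    for (long v : *buf) s += v;
    delete buf;
    return s;
}

// Reused buffer: allocate once per thread, then assign/resize in place,
// so the hot path makes no allocator calls after warm-up.
long work_reuse_buffer(std::vector<long>& buf, std::size_t n) {
    buf.assign(n, 1);   // reuses existing capacity once it has grown
    long s = 0;
    for (long v : buf) s += v;
    return s;
}
```

Hoisting the allocation out of the hot path removes the spinlock contention the paragraph above describes.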

Things will improve when gcc catches up.

Or if a new model of Xeon Phi with a faster scalar core is released ;) Just making pointer arithmetic faster will be a big improvement.

Thank you Vladimir, this makes sense. We make a decent amount of new/delete calls; we will attempt to minimize that as well as calls to system libraries.
