what is the relation between "hardware thread" and "hyperthread"?

Dear Forum,

One of the Intel TBB webpages states that "a typical Xeon Phi coprocessor has 60 cores, and 4 hyperthreads/core". But this blog from Intel emphasizes that "The Xeon Phi co-processor utilizes multi-threading on each core as a key to masking the latencies inherent in an in-order micro-architecture. This should not be confused with hyper-threading on Xeon processors that exists primarily to more fully feed a dynamic execution engine."

I'm confused by these two seemingly conflicting statements. Could anyone explain the difference/similarity between a hyperthread and a hardware thread?

Also, the software developer's guide says the MIC implements hardware multithreading by replicating the complete architectural state 4 times (is this the same mechanism used in Xeon's hyper-threading, where one physical core is seen as two logical cores?), and that the MIC further implements a "smart" round-robin multithreading. Could you explain the relation between these two multithreading techniques?

Thanks a lot!

 

When we split words, it can be confusing, can't it? The TBB documentation is wrong; thank you for pointing that out.

We choose NOT to call the hardware threads on the current Intel Xeon Phi Coprocessor (previously known by the code name Knights Corner) "hyper-threads." The most important thing to know is that you'll usually need more Knights Corner threads per core to hit your best performance than you would with hyperthreading. That's consistent with the "highly parallel" optimized nature of an Intel Xeon Phi Coprocessor. The difference between these hardware threading techniques is instructive, so I'll try to give an explanation of it that makes sense.

Regardless of what Intel device we talk about, a processing core will have one or more "hardware threads" per core. We use "hardware threads" as a very generic term that refers to multithreading achieved mostly by duplicating thread state and sharing most everything else in a processing core. Multithreading achieved by duplicating most everything, the whole "core," is what multicore and many-core designs are all about. Processors and coprocessors can have both "hardware threads" and lots of cores. "Hyperthreading" is a very specific implementation of a "hardware thread" that is only found on dynamic (a.k.a. out-of-order) execution engines.

This highlights a difference between the Knights Corner microarchitecture and an Intel Xeon processor microarchitecture. The Knights Corner microarchitecture uses "in order" execution, so the hardware threads do a relatively simple round robin to feed the dual execution pipes of the microarchitecture. In this design, you can execute two vector (SIMD) instructions in parallel, but they need to come from different threads. This is why we advise programmers to use at least two threads per core on Intel Xeon Phi Coprocessors. If you do not, the floating point (FP) performance will peak at about half of what is possible. For most programmers, this is simply a matter of making sure OpenMP or TBB uses at least 122 threads on a 61-core device.

Many of us are in the habit of limiting FP-intensive code to threads=cores on hyperthreaded machines. This is because on a hyperthreaded machine we find a microarchitecture with an out-of-order execution engine. In those designs, the full FP potential may be realized with a single thread. Additional threads on any device will put more pressure on caches and ask for more memory bandwidth. If your algorithm is already hitting peak FP usage, additional threads are not helpful unless they help with latency hiding. For the most part, out-of-order execution engines take care of latency hiding in a way an in-order design cannot. Therefore, with hyperthreading on an Intel Xeon processor you may hit peak performance with threads=cores. With the in-order execution design in the Knights Corner microarchitecture, at least two threads are needed to hit peak, and latency hiding is often enhanced with even more threads. Many algorithms find three threads per core is their sweet spot, while others prefer two or four.
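As a back-of-the-envelope illustration (generic numbers only, not a spec for any particular part), the double-precision peak of such a card is roughly

\[ \text{Peak}_{DP} \approx N_{\text{cores}} \times 8\ (\text{DP SIMD lanes}) \times 2\ (\text{FMA}) \times f_{\text{clock}}, \]

and because a single Knights Corner hardware thread cannot issue a vector instruction on back-to-back cycles, running only one thread per core tops out at roughly half of that figure.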

In teaching programming for the Intel Xeon Phi coprocessor, we found that it was helpful to speak of this distinction, mostly to encourage us all to experiment with how many threads per core serve our applications best. Using OpenMP or TBB, this is as simple as setting a different parameter or environment variable and running several times to compare. No changes to the program are needed.
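For example, a minimal sketch of such an experiment with OpenMP (the kernel and the thread counts below are placeholders; on a 61-core Knights Corner card the interesting counts are 61, 122, 183, and 244, i.e. 1 to 4 threads per core):

// Minimal sketch (not from the original post): time the same OpenMP kernel
// at several thread counts to find the per-core sweet spot.
// Equivalently, leave the code alone and set OMP_NUM_THREADS before each run.
#include <omp.h>
#include <cstdio>
#include <vector>

// Placeholder kernel: stands in for whatever parallel work the application does.
void run_kernel(std::vector<double>& v) {
    #pragma omp parallel for
    for (long i = 0; i < (long)v.size(); ++i)
        v[i] = v[i] * 1.000001 + 0.5;
}

int main() {
    std::vector<double> v(1 << 24, 1.0);
    const int counts[] = {61, 122, 183, 244};   // 1, 2, 3, 4 threads per core on a 61-core card
    for (int threads : counts) {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();
        run_kernel(v);
        double t1 = omp_get_wtime();
        std::printf("%3d threads: %.3f s\n", threads, t1 - t0);
    }
    return 0;
}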

If we are used to always running threads=cores on a hyperthreaded machine, then it is useful to know that Knights Corner is not using hyperthreads, and we should (almost always) use at least two threads per core to get the best performance.

That said, today's hyperthreading is much more advanced than it was a decade ago. If we've not ventured to test the performance of our applications with hyperthreads recently, we should consider running some performance tests. If you are surprised how much better it is with hyperthreading than it used to be, please don't tell our marketing people... or they'll want to call them "hyper-thread PRO" or something else I'll have to explain in a future blog. ;)

I hope this clears everything up.

Thank you for pointing out our error in the TBB documents - I'll look to correct them.

 

 

James,

Thanks for the great explanation of the difference between the terms "hyperthreads" and "hardware threads". This post will serve as a good reference.

Richard Gerber

Intel forums need a function to vote up/down the answers/discussions and/or thank the authors, to make important/interesting/informative discussions like this one easy to find on the board.

It would also, perhaps, improve their visibility in the general search on intel.com.

I wonder how the above guidance/terminology has changed from Knights Corner to Knights Landing?

We have a KNL-based system. I also found the terminology "hardware" vs. "hyper" to be confusing, and this thread has been helpful in that regard. I found that using 2 threads per core on KNL slows things down considerably for me. KNL has out-of-order execution, but retains 4 "h" threads per core (are these still hardware threads or are they now hyper-threads?). Given that KNL has out-of-order execution, what's the benefit of 4 "h" threads? Should we still expect to get the best performance by using them?

 

For many applications, KNL is fastest using 1 hardware thread per core.

Generally an application which is more latency-bound will benefit from hyper-threading, but the effect is difficult to predict.

Best to try all of 1, 2, and 4 threads per core.

 

Using Intel® Hyper-Threading Technology

The Knights Landing microarchitecture supports 4 logical processors with each processor core. There are choices that highly-threaded software may need to consider with respect to:
• Maximizing per-thread performance by providing maximum per-core resources to one logical processor per core.
• Maximizing per-core throughput by allowing multiple logical processors to execute on a processor core.

As thread count per core grows to 2 or 4, some applications will have higher per core performance, but lower per thread performance. If an application can perfectly scale its performance to an arbitrary number of threads, 4 threads per core is likely to have the highest instruction throughput. Practical limitations on memory capacity or parallelism may limit the number of threads per core.

In the Knights Landing microarchitecture, some per-core resources (like the ROB or scheduler) are partitioned, with one partition for each of the 4 logical processors. Because of this, a 3-thread configuration will have fewer aggregate resources available to it than a 1-, 2-, or 4-thread configuration. Placing 3 threads on a processor core is unlikely to perform better than 2 or 4 threads per core.

 

When doing this kind of comparison, please be careful about how you do it and how you plot your results. Blowing my own trumpet :-), my blog ("How to plot OpenMP scaling results") gives some advice on this, which is actually applicable whether or not you're using OpenMP.

Greg,

>>For many applications, KNL is fastest using 1 hardware thread per core...

While this statement is true, it is also somewhat misleading. The important metric, IMHO, is not the performance of a specific thread, but rather the throughput of the application using the available resources. Some of us may wish to measure in terms of least cost.

Note, it is possible on KNL, for a specific piece of code, to identify the fastest tile, then within that tile the fastest core, then within that core the fastest thread... and then note that if you put any of the other threads, cores, or tiles to work, the performance of "the fastest thread" diminishes. What I am trying to say is too many of the readers are taught "scaling performance" in a CS course and remain focused only on the slope of the line, primarily on the left hand side (1, 2, 4, 8, .. threads), and IMHO, are never taught to get the most productivity out of the available resources.

Jim Dempsey

Quote:

jimdempseyatthecove wrote:

What I am trying to say is too many of the readers are taught "scaling performance" in a CS course and remain focused only on the slope of the line, primarily on the left hand side (1, 2, 4, 8, .. threads), and IMHO, are never taught to get the most productivity out of the available resources.

Indeed, which is why, as engineers, we should be using parallel efficiency, not speedup. A sentiment with which Wikipedia agrees (see the "Additional Details" section from "In engineering contexts" onwards; this may be because no-one has changed it since I added that text :-) ).
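For reference, with $T_1$ the single-thread elapsed time and $T_p$ the elapsed time on $p$ threads, the usual definitions are

\[ S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}, \]

so going from $p$ to $2p$ threads with no change in elapsed time leaves $S$ unchanged but halves $E$.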

Let me restate for clarity.  For many applications KNL yields the best elapsed time using 1 hardware thread per core.

Here is an example, a seismic imaging 3D finite difference kernel, using all KNL cores.

1 hardware thread per core, 9.6 seconds

2 hardware threads per core, 11.1 seconds

4 hardware threads per core, 16.6 seconds

I might venture to guess that if the 3D finite difference kernel is written well, it is memory/cache bandwidth limited. If so, then you may find, for KNL, that one thread per tile yields better performance (IOW, 1 thread within the 2 cores that share a tile). An actual test would have to be made using your code. John McCalpin may have run such experiments and would be a better source for an opinion.

Jim Dempsey

1 hardware thread per tile, 16.5 seconds

2 hardware threads on 1 core per tile, 14.5 seconds

4 hardware threads on 1 core per tile, 16.2 seconds

(Shameless plug for my article on how to select 1, 2, 4, threads per core, 1 thread per tile, and so on:  https://software.intel.com/en-us/articles/process-and-thread-affinity-for-intel-xeon-phi-processors-x200 )

Thanks for the additional info.

Jim Dempsey

It is common for bandwidth-limited codes to run slightly faster using no more than one thread per physical core.  

There are two common reasons:

  1. Using more than 1 thread per core decreases the amount of cache per thread, which increases the cache miss rate, which increases competition for the bandwidth that is already limiting performance.
  2. Using more than 1 thread per core increases the number of "memory access streams", which increases the number of DRAM pages that are being accessed concurrently.  If the number of memory access streams is larger than the number of DRAM banks, the memory controller has to repeatedly close and re-open the banks (because it can't reorder operations enough to process *all* of the loads from one stream before processing all of the loads from another stream).  This increase in page open/close rates adds latency and often triggers stall conditions in the DRAMs (required to minimize maximum power consumption).

The first case can have a performance impact that is anywhere from very small to very large.

The second case is typically a fairly small impact (single-digit percentages), but I have seen performance degradation of 25% or more on systems with dumb DRAM configurations (e.g., one single-rank DIMM per channel).
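As an illustrative sketch only (my own example, not the STREAM benchmark and not code from this thread), a bandwidth-bound triad loop whose elapsed time can be compared at 1, 2, and 4 threads per core might look like this; the array size and the affinity settings you pair it with are placeholder assumptions:

// Bandwidth-bound triad loop for comparing thread-per-core configurations.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long n = 1L << 27;                       // three ~1 GB arrays, far larger than any cache
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + 3.0 * c[i];                  // 2 loads + 1 store per iteration
    double t1 = omp_get_wtime();

    // Rough bandwidth estimate: 3 arrays x 8 bytes per element moved.
    std::printf("threads=%d  time=%.3f s  ~%.1f GB/s\n",
                omp_get_max_threads(), t1 - t0, 24.0 * n / (t1 - t0) / 1e9);
    return 0;
}

Running such a loop with 1 thread per core and then with 2 or 4 (pinned appropriately) typically shows little or no improvement once bandwidth is saturated, and sometimes the slowdowns described above.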

"Dr. Bandwidth"

>>Here is an example, a seismic imaging 3D finite difference kernel, using all KNL cores.
>>
>>1 hardware thread per core, 9.6 seconds
>>2 hardware threads per core, 11.1 seconds
>>4 hardware threads per core, 16.6 seconds

Since "performance saturation" is not achieved in these tests for the cases with 2 and 4 hardware threads, there is an implementation problem with the 3D finite difference kernel code.

In a perfect case it should be as follows:
...
1 hardware thread per core, 9.6 seconds
2 hardware threads per core, 9.8 seconds
4 hardware threads per core, 9.9 seconds
...
and that is a completely different situation: there is No performance improvement (!) and there is No performance degradation (!) when more than 1 hardware thread is used on a KNL system with 4 hardware threads per core.

>>In a perfect case it should be as follows:
>> ...
>>1 hardware thread per core, 9.6 seconds
>>2 hardware threads per core, 9.8 seconds - slower by ~2% when compared with case 1
>>4 hardware threads per core, 9.9 seconds - slower by ~3% when compared with case 1

No implementation problem -- this is a "ninja"-tuned kernel reviewed by KNL architects. However, the seismic imaging kernel does depend on cache for performance and is affected by the decreased cache per thread described by John McCalpin.

Another example, a CG iterative solver, has a very low flop to byte ratio and gets little help from cache.

1 hardware thread per core, 8.4 seconds

2 hardware threads per core, 8.7 seconds

4 hardware threads per core, 9.2 seconds

>>Another example, a CG iterative solver, has a very low flop to byte ratio and gets little help from cache.
>>
>>1 hardware thread per core, 8.4 seconds
>>
>>2 hardware threads per core, 8.7 seconds
>>
>>4 hardware threads per core, 9.2 seconds

It looks good, and there is no significant performance degradation for the last case (with 4 hw threads). I agree that it is impossible to theoretically predict the performance of an algorithm when it needs to be scaled up to use more threads.
