Should we be switching HT on or off for a Nehalem E55XX for best MKL performance?

We have just obtained a machine with a Nehalem E55XX (E5520 chip) processor (4 quad-core processors). The MKL documentation states we should turn Hyper-Threading OFF for best MKL performance; yet our sysadmin has read that it should be turned ON to get the best overall performance for this machine. So what should we do, please?


Quoting - Tony Garratt

We have just obtained a machine with a Nehalem E55XX (E5520 chip) processor (4 quad-core processors). The MKL documentation states we should turn Hyper-Threading OFF for best MKL performance; yet our sysadmin has read that it should be turned ON to get the best overall performance for this machine.

If you are using an older version of MKL, you might heed the documentation, or test it yourself. The current compilers and MKL recognize the KMP_AFFINITY=physical option, which will spread the threads correctly across cores, if you don't request more threads than cores. This is not well documented, and I think first available with 11.1 compilers. With older MKL, you might find that you have to specify number of threads and mapping to cores, if you choose to leave HT enabled, and you have fewer opportunities to lose performance if HT is disabled.
According to a recent post on this forum, the current MKL takes account of HT automatically and runs at most 1 thread per core by default, so you may not find a measurable difference. Then you could leave HT on, in case other applications may gain.
Your BIOS setting for NUMA enabled or not will have influence as well. You have opportunities to gain performance with NUMA enabled, if your arrays are initialized with the same OpenMP parallel schedule as MKL will use.
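As a concrete illustration of the affinity advice above, here is a minimal Python sketch of a launcher that sets the variables Tim mentions before any MKL-linked code is loaded. KMP_AFFINITY and MKL_NUM_THREADS are the real environment variables discussed in this thread; the value "8" and the commented import are placeholders you would adapt to your own core count and application.

```python
import os

# Sketch of a launcher applying the settings discussed above. These
# variables must be set before any MKL-linked library is loaded.
os.environ["KMP_AFFINITY"] = "physical,verbose"  # one thread per physical core; echo the mapping
os.environ["MKL_NUM_THREADS"] = "8"              # do not exceed the physical core count

# import numpy  # or whatever loads your MKL-linked code, after the variables are set
```

The `verbose` modifier makes the OpenMP runtime print the thread-to-core mapping at startup, which is the easiest way to confirm the pinning actually took effect on a given BIOS.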

Quoting - tim18

If you are using an older version of MKL, you might heed the documentation, or test it yourself. The current compilers and MKL recognize the KMP_AFFINITY=physical option, which will spread the threads correctly across cores, if you don't request more threads than cores. This is not well documented, and I think first available with 11.1 compilers. With older MKL, you might find that you have to specify number of threads and mapping to cores, if you choose to leave HT enabled, and you have fewer opportunities to lose performance if HT is disabled.
According to a recent post on this forum, the current MKL takes account of HT automatically and runs at most 1 thread per core by default, so you may not find a measurable difference. Then you could leave HT on, in case other applications may gain.
Your BIOS setting for NUMA enabled or not will have influence as well. You have opportunities to gain performance with NUMA enabled, if your arrays are initialized with the same OpenMP parallel schedule as MKL will use.

Thanks Tim for your fast reply. I've been trying to understand the 10.1 MKL documentation and I agree it is not well documented! However, what you are saying is that if I upgrade to the Fortran 11.1 compiler (which has the latest MKL included within it), then all this will be handled automatically by the software at run time.

(Our customers use our software and we do not know what multi-core architecture they will be running it on, so having MKL automatically figure out the correct number of threads at run time is extremely appealing to us.)

Quoting - tim18

If you are using an older version of MKL, you might heed the documentation, or test it yourself. The current compilers and MKL recognize the KMP_AFFINITY=physical option, which will spread the threads correctly across cores, if you don't request more threads than cores. This is not well documented, and I think first available with 11.1 compilers. With older MKL, you might find that you have to specify number of threads and mapping to cores, if you choose to leave HT enabled, and you have fewer opportunities to lose performance if HT is disabled.
According to a recent post on this forum, the current MKL takes account of HT automatically and runs at most 1 thread per core by default, so you may not find a measurable difference. Then you could leave HT on, in case other applications may gain.
Your BIOS setting for NUMA enabled or not will have influence as well. You have opportunities to gain performance with NUMA enabled, if your arrays are initialized with the same OpenMP parallel schedule as MKL will use.

Can you point me to the recent post you mention please, if possible :-)

Quoting - Tony Garratt

Can you point me to the recent post you mention please, if possible :-)

http://software.intel.com/en-us/forums/showthread.php?t=67622
gives information on how current MKL deals with HyperThreading/SMT

Quoting - tim18

http://software.intel.com/en-us/forums/showthread.php?t=67622
gives information on how current MKL deals with HyperThreading/SMT

Thanks Tim. Is there anyone at Intel I can speak with who can confirm that with Fortran 11.1, MKL automatically chooses the optimal number of threads for multi-core/HT architectures?

We would rather go the compiler upgrade route (from our existing 10.1 installation to 11.1) rather than deal with all the issues of us and our customers setting the affinity, but want to be 100% sure doing that will actually achieve this.

Quoting - Tony Garratt

We have just obtained a machine with a Nehalem E55XX (E5520 chip) processor (4 quad-core processors). The MKL documentation states we should turn Hyper-Threading OFF for best MKL performance; yet our sysadmin has read that it should be turned ON to get the best overall performance for this machine. So what should we do, please?

Intel MKL optimizations maximize use of the physical cores, and running with HT may result in a marginal performance overhead. For MKL usage alone, we recommend that HT be disabled. However, other applications or parts of your code may benefit from HT. If you want HT on, make sure the number of threads set does not exceed the number of physical cores. MKL_DYNAMIC=TRUE (it is TRUE by default; see the Intel MKL User's Guide for details) allows MKL to reduce the number of threads whenever it exceeds the number of cores, which could easily happen if HT is on.

Quoting - Tony Garratt

Thanks Tim. Is there anyone at Intel I can speak with who can confirm that with Fortran 11.1, MKL automatically chooses the optimal number of threads for multi-core/HT architectures?

We would rather go the compiler upgrade route (from our existing 10.1 installation to 11.1) rather than deal with all the issues of us and our customers setting the affinity, but want to be 100% sure doing that will actually achieve this.

Let's consider an example. Suppose you have a 2-socket machine with quad-core processors that have HT enabled. That's 8 physical cores, and 16 threads with HT. Now suppose you ran MKL on this, and didn't go out of your way to change MKL_DYNAMIC from its default of TRUE to FALSE. Perhaps a user just wants to maximize the threads, but a request of 16 isn't as good as 8 in our hypothetical situation. In this situation, the compiler might also give you 16 threads by default. Regardless of how MKL gets a request for 16 threads, MKL will recognize this number is too high and back it off to 8 (if you really want 16 threads, then set MKL_DYNAMIC to FALSE first). If, however, you were to ask for only 4 threads, MKL has insufficient information to know whether you want 4 threads or some other number. So MKL only pays attention to HT in the cases where the number of threads requested exceeds the number of physical cores. In those cases, MKL will attempt to set the optimal number of threads.
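The capping rule in the example above can be sketched in a few lines. This is only an illustration of the behavior described, not MKL's actual implementation; the function name and signature are hypothetical.

```python
def effective_threads(requested, physical_cores, mkl_dynamic=True):
    """Illustration of the capping behavior described above (not MKL code):
    with MKL_DYNAMIC=TRUE, a request above the physical core count is
    reduced to the core count; smaller requests pass through unchanged."""
    if mkl_dynamic and requested > physical_cores:
        return physical_cores
    return requested

# The 2-socket quad-core example: 8 physical cores, 16 HT threads.
print(effective_threads(16, 8))                     # capped to 8
print(effective_threads(4, 8))                      # honored as-is: 4
print(effective_threads(16, 8, mkl_dynamic=False))  # not capped: 16
```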

Quoting - Gregory Henry (Intel)

Let's consider an example. Suppose you have a 2-socket machine with quad-core processors that have HT enabled. That's 8 physical cores, and 16 threads with HT. Now suppose you ran MKL on this, and didn't go out of your way to change MKL_DYNAMIC from its default of TRUE to FALSE. Perhaps a user just wants to maximize the threads, but a request of 16 isn't as good as 8 in our hypothetical situation. In this situation, the compiler might also give you 16 threads by default. Regardless of how MKL gets a request for 16 threads, MKL will recognize this number is too high and back it off to 8 (if you really want 16 threads, then set MKL_DYNAMIC to FALSE first). If, however, you were to ask for only 4 threads, MKL has insufficient information to know whether you want 4 threads or some other number. So MKL only pays attention to HT in the cases where the number of threads requested exceeds the number of physical cores. In those cases, MKL will attempt to set the optimal number of threads.

Thanks for your detailed reply, Gregory. I don't quite follow your statement "If, however, you were to ask for only 4 threads, MKL has insufficient information to know if you want 4 threads, or some other number" - why does MKL have insufficient information?

We are not giving MKL any hints about the number of threads to use (i.e. we do not set MKL_NUM_THREADS etc.), since our software is used by our customers and we don't know up front what their architecture will be. However, we do allow them to set MKL_NUM_THREADS to a specific value if they so desire, but that is packaged in our software as an "Advanced" feature.

Do you know if Tim's comment about v11 is correct? I.e., if we stick with Fortran/MKL 10.1, then we and our customers DO have to think carefully about setting some of the MKL environment variables on a multi-core or HT machine to get the best MKL performance; but if we upgrade to v11, then we can simply forget all about that, since MKL will figure it all out for itself to get the best performance?

Many thanks for both yours and Tim's replies.

Greg is on the MKL team, so his replies are authoritative, while mine aren't. I read Greg's replies as confirming that current MKL should recognize HT and limit threads to 1 per core by default. I'm a little more concerned about whether KMP_AFFINITY settings may still be needed to optimize performance on a 4 socket platform (with NUMA BIOS option), which most of us have never seen.

Quoting - tim18
Greg is on the MKL team, so his replies are authoritative, while mine aren't. I read Greg's replies as confirming that current MKL should recognize HT and limit threads to 1 per core by default. I'm a little more concerned about whether KMP_AFFINITY settings may still be needed to optimize performance on a 4 socket platform (with NUMA BIOS option), which most of us have never seen.

I did some tests on an 8-physical-core machine with HT turned on (so it showed 16 cores) using MKL. MKL never used more than 8 threads. This was with version 11 of MKL; in this case we were using LAPACK and BLAS library routines. Does that mean that, for this use case at least, there is no issue when using MKL on HT machines? Everywhere else in your documentation (both supplied with MKL and on the web) you tell us explicitly to turn HT OFF when using MKL. Is it just that your documentation is out of date?

With the default options in current MKL, it should make little difference to MKL performance whether HT is enabled. You might set KMP_AFFINITY with a verbose option to get the screen echo, so as to confirm whether your platform correctly assigns 1 thread per core, regardless of the HT option. Chances are it works as intended on your platform, but there are so many BIOS variations that you may wish to make sure.

Best Reply

Quoting - Tony Garratt

I did some tests on an 8-physical-core machine with HT turned on (so it showed 16 cores) using MKL. MKL never used more than 8 threads. This was with version 11 of MKL; in this case we were using LAPACK and BLAS library routines. Does that mean that, for this use case at least, there is no issue when using MKL on HT machines? Everywhere else in your documentation (both supplied with MKL and on the web) you tell us explicitly to turn HT OFF when using MKL. Is it just that your documentation is out of date?

Intel Hyper-Threading technology is an exciting resource that in many cases allows applications to see performance benefits. It works by allowing the processor to make more efficient use of its resources. In some parts of MKL, DGEMM for instance, we are already able to achieve 95+% of theoretical peak for some typical inputs. In this case, we are exploiting every resource in the processor we can, and the additional thread given by HT does not yield additional benefits. However, the architects have taken measures to reduce the overheads of switching between the threads in an HT environment. With the exception of the affinity problems already mentioned, this means that the overheads of using HT may not be so large. For someone like myself, interested in getting our users that last tenth of a percent of performance, it is noticeable. MKL is a library. It is used in combination with an application, and the application may see benefits from HT where MKL does not. I suspect that some functions might run faster in MKL with HT. But if you ask a developer, in the absence of any other data, whether HT should be on or off, we may advise you to turn it off, because we've seen our highest performance numbers with it off. So the MKL documentation isn't wrong. I've seen some problems run faster with HT and some run slower (like our most tuned BLAS and LAPACK functions). On the whole, most things run faster. Thus the advice to turn it on (given outside the MKL documentation) is also valid. In the end, you should try and see for yourself: your specific application and usage model may be unique. If you see a 1% loss of performance in a few MKL functions, you might not care if everything else is running 10% faster.

Quoting - Gregory Henry (Intel)

Intel Hyper-Threading technology is an exciting resource that in many cases allows applications to see performance benefits. It works by allowing the processor to make more efficient use of its resources. In some parts of MKL, DGEMM for instance, we are already able to achieve 95+% of theoretical peak for some typical inputs. In this case, we are exploiting every resource in the processor we can, and the additional thread given by HT does not yield additional benefits. However, the architects have taken measures to reduce the overheads of switching between the threads in an HT environment. With the exception of the affinity problems already mentioned, this means that the overheads of using HT may not be so large. For someone like myself, interested in getting our users that last tenth of a percent of performance, it is noticeable. MKL is a library. It is used in combination with an application, and the application may see benefits from HT where MKL does not. I suspect that some functions might run faster in MKL with HT. But if you ask a developer, in the absence of any other data, whether HT should be on or off, we may advise you to turn it off, because we've seen our highest performance numbers with it off. So the MKL documentation isn't wrong. I've seen some problems run faster with HT and some run slower (like our most tuned BLAS and LAPACK functions). On the whole, most things run faster. Thus the advice to turn it on (given outside the MKL documentation) is also valid. In the end, you should try and see for yourself: your specific application and usage model may be unique. If you see a 1% loss of performance in a few MKL functions, you might not care if everything else is running 10% faster.

Thank you for your detailed answer - that is very valuable. We produce simulation software that our customers use, and we have no direct control over the setup of the machines they will run it on - from the number of cores to HT on or off. We also do not tell them explicitly that we use MKL, and we do not really want them to have to be bothered about MKL. So I was looking for general guidelines for what we should do when building our software, and whether we should recommend to our users that HT be on or off.

Thank you very much!
Tony

Quoting - Gregory Henry (Intel)

Intel Hyper-Threading technology is an exciting resource that in many cases allows applications to see performance benefits. It works by allowing the processor to make more efficient use of its resources. In some parts of MKL, DGEMM for instance, we are already able to achieve 95+% of theoretical peak for some typical inputs. In this case, we are exploiting every resource in the processor we can, and the additional thread given by HT does not yield additional benefits. However, the architects have taken measures to reduce the overheads of switching between the threads in an HT environment. With the exception of the affinity problems already mentioned, this means that the overheads of using HT may not be so large. For someone like myself, interested in getting our users that last tenth of a percent of performance, it is noticeable. MKL is a library. It is used in combination with an application, and the application may see benefits from HT where MKL does not. I suspect that some functions might run faster in MKL with HT. But if you ask a developer, in the absence of any other data, whether HT should be on or off, we may advise you to turn it off, because we've seen our highest performance numbers with it off. So the MKL documentation isn't wrong. I've seen some problems run faster with HT and some run slower (like our most tuned BLAS and LAPACK functions). On the whole, most things run faster. Thus the advice to turn it on (given outside the MKL documentation) is also valid. In the end, you should try and see for yourself: your specific application and usage model may be unique. If you see a 1% loss of performance in a few MKL functions, you might not care if everything else is running 10% faster.

This might be off topic. I am also interested in figuring out what to do with hyper-threading. We are using MKL but with OpenMP off. If I understood it correctly, efficient parallelization in MKL is limited to very few LAPACK routines. Therefore we opt to parallelize our computations at a higher level (unfortunately the efficiency also decreases when matrix sizes get very large). I ran some tests on a system with hyper-threading on, and found that if I assign the tasks to CPUs with even IDs only, the efficiency is quite good (on Windows XP), since, as I understand it, the OS assigns two CPUs to each core sequentially. Is there a routine to detect if the system has hyper-threading?

Quoting - Hanyou Chu

This might be off topic. I am also interested in figuring out what to do with hyper-threading. We are using MKL but with OpenMP off. If I understood it correctly, efficient parallelization in MKL is limited to very few LAPACK routines. Therefore we opt to parallelize our computations at a higher level (unfortunately the efficiency also decreases when matrix sizes get very large). I ran some tests on a system with hyper-threading on, and found that if I assign the tasks to CPUs with even IDs only, the efficiency is quite good (on Windows XP), since, as I understand it, the OS assigns two CPUs to each core sequentially. Is there a routine to detect if the system has hyper-threading?

Not at all off topic.
Unfortunately, there is no uniform BIOS numbering scheme for HyperThreading logicals. Even/odd is only one of the common variations. The OS doesn't control this.
As you say, it is intended that most applications which don't take advantage of HyperThreading will run well with threads assigned 1 per core. MKL threaded library is usually successful at implementing this by default.
Threading your application at a higher level and using MKL sequential often is an excellent strategy. Unfortunately, I don't know a way to foresee the various BIOS numbering schemes and set optimum parameters for KMP_AFFINITY.
Windows 7 is expected to do a better job than XP at distributing threads automatically across cores, reducing the dependency on KMP_AFFINITY, when you limit the number of threads appropriately.
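On the detection question: there is no portable stdlib routine, but on Linux one common heuristic (a sketch, not an official API) is to compare the "siblings" and "cpu cores" fields of /proc/cpuinfo; HT is enabled when a physical package reports more logical siblings than physical cores. The function below is hypothetical illustration code, and this heuristic does not apply to Windows XP.

```python
def ht_enabled(cpuinfo_text):
    """Return True/False if HT status can be inferred from /proc/cpuinfo
    content, or None if the fields are absent. Linux-only heuristic:
    'siblings' counts logical CPUs per package, 'cpu cores' physical ones."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "siblings":
            siblings = int(value)
        elif key.strip() == "cpu cores":
            cores = int(value)
        if siblings is not None and cores is not None:
            return siblings > cores
    return None  # fields not found; cannot determine

# Usage on a real Linux system:
# with open("/proc/cpuinfo") as f:
#     print(ht_enabled(f.read()))
```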
