Have your cake and eat it too

This English idiomatic proverb, or figure of speech, dates back to 1546 and describes situations where “you can’t have it both ways.” This blog is posted on a website that, barring a few exceptions, is devoted predominantly to programming and computers. So where does this idiom fit in with the typical content found here?

A few days ago (October 12, 2011), Andy made a post to the Threading on Intel® Parallel Architectures forum titled:

Effect of turning off HT for SPMD style HPC applications (sandybridge and openmp)

In it, the poster (Andy) assumed his situation was one where “you can’t have it both ways.”

The Sandy Bridge processor is a 4-core design with Hyper-Threading, providing 8 hardware threads, but with one floating-point execution resource per core rather than one per hardware thread.

The poster’s dilemma, the “you can’t have it both ways” part, was that the preponderance of his computations are floating-point operations, and some of his test programs, and (I assume) test runs of his main application, performed better when Hyper-Threading was disabled on the processor (via a BIOS setting). There are various reasons why improved performance might be observed; not having his application for inspection (say, with VTune), one can only speculate as to the specific cause. My speculation is that with HT enabled the system behaved as if the resources were oversubscribed: 8 threads competing for 4 floating-point resources, as well as for the pipelines, the cache, and the memory controller feeding those resources. It is not unusual on an HT-capable system to observe an improvement in performance for some applications by disabling HT, although this is not true of most applications.

What the poster found was the dilemma of being forced to sacrifice integer performance for improvement in floating point performance.

It should be understood that the computer will likely run other applications, at different times or concurrently, which are not necessarily floating-point intensive. Additionally, the operating system and system tasks are invariably running, and these consist almost entirely of integer operations. Therefore, by disabling HT, these other applications suffer, as does his main application whenever the operating system is called, system tasks run, or any non-user application runs concurrently with it.

Now to the “Have your cake and eat it too.” In an OpenMP application (depending on compiler, library, and operating system) you can exercise some degree of control using the KMP_AFFINITY environment variable, as well as by incorporating calls to kmp_set_affinity_mask_proc and related functions into the application. KMP_AFFINITY sets an application-wide threading attribute, whereas the kmp_get_... and kmp_set_... library calls operate on a call-by-call basis. Information on these calls is listed in the Intel C++ compiler documentation. Using them means programming closer to the metal (writing code that is more cognizant of the runtime environment).
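For the environment-variable route, a typical setup might look like the following. These are real settings of Intel's OpenMP runtime, but the thread counts assume Andy's 4-core/8-thread part, and which combination wins is application-dependent:

```shell
# For a floating-point-heavy run: one thread per physical core,
# spread across cores so HT siblings stay idle.
export OMP_NUM_THREADS=4
export KMP_AFFINITY=verbose,granularity=fine,scatter

# For an integer-heavy run you would instead use all 8 hardware
# threads, packed core by core:
#   export OMP_NUM_THREADS=8
#   export KMP_AFFINITY=verbose,granularity=fine,compact
```

Note the limitation the rest of this post addresses: the setting applies to the whole application run, not to individual loops.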

What if there were ways that you could attribute your loop control statements to indicate core, cache, or socket relationships that produce better performance?

For the C++ programmer there is the QuickThread Parallel Programming Toolkit (available for free download at www.quickthreadprogramming.com). One of the benefits of the QuickThread approach is on a (parallel) loop-by-loop basis you can choose the hardware thread associatively.

Consider Andy’s dilemma on his Sandy Bridge.

Assume that in part of his program he has a floating-point-intensive parallel for loop that he wishes to schedule restricting the thread team to one thread per core (4 cores). Also assume that following this loop he has an integer-intensive for loop that works best using all HT threads (8 hardware threads). Assume further that next year Andy will update his processor to a next-gen Sandy Bridge, possibly with 8 cores/16 threads, and that he would like not to re-code his program for the processor change. Using QuickThread, the same code works on both systems:

   // using only one thread per core (one thread per L1 cache)
   parallel_for(
      OneEach_L1$,
      aFloatingPointFunction, iBegin, iEnd [,optional args]);

   // now using all threads
   parallel_for(
      AllThreads$,
      anIntegerFunction, iBegin, iEnd [,optional args]);

AllThreads$ is the default and can be omitted from the argument list. The optional args can be one or more additional function arguments (arrays, etc.).

Assume that later you have an Ivy Bridge system with 4 processors, each with 8 cores and 16 threads. Although the above code works well on this new configuration, there are some programming situations where it may be beneficial to partition the work by socket (threads within a socket share the L3 cache).

Suppose you want to partition the work by processor, and then, within each processor’s partition, run a parallel for using a team of one thread per core within that processor. For clarity, the code description below uses the alternate lambda-function format.

  // slice rows by socket
  parallel_for(
    OneEach_L3$, iRowBegin, iRowEnd,
    [&](int rowBegin, int rowEnd) {
      // slice this socket's col's by core
      parallel_for(
        OneEach_Within_L3$ + L1$,
        aColumnFunction,   // the worker for this sub-range
        iColBegin, iColEnd, rowBegin, rowEnd [,optional args]);
    });

Note, the above will work on a 1 socket system too.

The OneEach_L3$ is a teaming-attribute hint to the QuickThread task manager. It specifies that, for however many L3 caches exist (equivalent to sockets), the loop is sliced into that number of pieces and enqueued such that only one thread from each L3 cache is permitted to take a slice. Note that if the code runs on a CPU without an L3 cache (e.g., the Intel Q6600), the selection filter falls back to OneEach_L2$.

In the nested parallel_for loop, where each slice runs on one of the threads within a socket, the qualification attribute is specified as “OneEach_Within_L3$ + L1$”. This attribute expression results in a thread team selected as one thread per L1 cache from among the threads sharing the current thread’s L3 cache. This is relatively easy to do using QuickThread – all the hard work has been taken care of.

The above programming sketches are only a taste of the tasking capabilities available to the programmer using QuickThread. While there may be little use for these features on a dual-core non-HT laptop, the design trend for high-performance systems is more cores, more sockets, and, coming soon, heterogeneous programming (Intel Many Integrated Core). While I haven’t had the pleasure of having an Intel Knights Corner MIC, I have every reason to expect that, with the help of some internal information and technology sharing, QuickThread could provide a fully integrated task-model programming environment. An example of the simplification this could provide to the programmer is:

Consider a system with dual or quad processors, each processor having 8 cores/16 threads, and the system containing dual Knights Corner MICs.

   // using only one thread per core
   // in all of the processors and all of the MIC’s
   parallel_for(
      OneEach_L1$,
      aFloatingPointFunction, iBegin, iEnd [,optional args]);

Note that there is no code change to Andy’s program. Andy’s multi-socket code would run as well.

*** Attention to unexpressed detail in above code ***

As of IDF 2011, two programming models for Knights Corner have been disclosed:

a) You program using kernels similar to GPGPU design (ARBB can hide some of the details).

b) You program similar to a cluster (systems connected via controller or bus).

The potential for QuickThread is to provide a third programming model.

c) You program as if on an SMP system, with or without considerations for thread placement (CPU, MIC, cache, …).

QuickThread will likely gain thread-team qualification selectors that distinguish CPU from MIC, as well as among multiple MICs.

FYI, I am in the process of updating the software on the website. If you have any issues, please report via the email address listed on the web site. QuickThread works on Windows and Ubuntu Linux systems both 32-bit and 64-bit.

Here's a link to Part 2 of this blog: /en-us/blogs/2011/11/22/have-your-cake-and-eat-it-too-part-2

Jim Dempsey

For more complete information about compiler optimizations, see our Optimization Notice.