The CnC concepts allow tuning the program independently of the actual program core. Without touching the code in steps and only adding a few declarations in the collection definitions the tuning interface allows providing tuning hints. In the following you will learn about the features of the modular, flexible and easy-to-use tuning interface.
Supporting the full generality of the CnC programming model obviously comes at some cost in the runtime. In order to accelerate the evaluation of a specific application, the API provides capabilities to influence the execution performance. Most importantly it allows the specification of a "tuner" for each collection. Through the tuners you can provide various hints to the CnC runtime. The tuner-type is an optional template argument to the collection classes. Here is an example of how the tuner type "fib_tuner" is assigned to the step-collection:
To define a tuner, you should derive your tuner-class from the default implementations which are provided by the CnC runtime. Otherwise you will need to provide the entire tuner interface even if you intend to use only parts of the interface. The appropriate class for a step-collection tuner is CNC::step_tuner<>:
The funky "<>" stems from the way C++ handles default template paramemters. This default parameter is only needed for the more advanced use of ranges (Tuning Ranges).
Besides feautures for distributed memory (Running CnC applications on distributed memory) probably the most relevant feature of a step-tuner is the ability to pre-declare item-dependencies. Upon putting a tag normal CnC execution will create a step-instance and hand it over to the underlying scheduler which will eventually launch the step. Only during the actual execution the runtime will observe the unavailability of items (e.g. if a get fails) and needs to re-schedule the step. If the tuner pre-declares items the step-instances will not be scheduled before those items are actually available. To pre-declare item-dependencies you need to provide a template method named "depends", which accepts the same arguments as the step::execute method and an additional parameter. Here's how this would look like in our Fibonacci example:
You can find the full code here: fib_tuner.cpp
A similar effect can be achieved by pre-scheduling a step, which executes a step on the same thread which puts the prescribing tag until the first unavailable item was accessed. Obviously this mechanism will not exploit parallelism if all items are available, because the entire step will be executed. Still, this mechanism is much simpler and can yield performance improvements.
Pre-scheduling is enabled by providing a tuner which provides the method "pre-schedule" returning true (fib_preschedule.cpp):
In combination with the unsafe version of get and context::flush_gets() pre-scheduling allows interesting things. For example you can inhibit step execution beyond a call to context::flush_gets() within the pre-scheduling phase, which might increase parallelism. Just uncomment
The step_tuner interface provides more advanced capabilities:
They all work on individual step instances (identified by their control tag) and are used similar to the above.
Without additional information the runtime cannot decide when no more gets to a data item will be issued. However, Without this information it is not possible to remove a data item from the internal storage. As the CnC programming model never declares "when" something happens, it is also not generally possible to let the user indicate which is the last get of an item.
As keeping data items in memory for ever can quickly (and unnecessarily) blow the available memory a mechanism preventing this is needed. When putting an data item, the CnC C++ API lets you declare the number of gets that will ever be issued. When this "get-count" number of gets has been reached the CnC runtime will remove the item from its internal store and free the memory for other use.
In our Fibonacci example we know that each intermediate item 'x' will be accessed exactly twice: once by fib(x+1) and once again by fib(x+2). Hence the get-count for every Fibonacci item is exactly "2".
The get-count is an attribute of an item, hence the necessary functionality is located in the item-tuners. Declaration of the get-count is straight-forward by providing a const method actually putting the item:
Like with tag- and step-collections, the tuner needs to made available in the item-collection definition:
Each data item will now be freed once it was accessed the second time. The full code (fib_getcount.cpp, fib.h) also accounts for the corner case 0 but also ignores the smaller real get-count for n and n-1.
The get-count features helps a lot keeping the memory-footprint under control and hence can significantly improve application performance.
As the execution of step-instances is state-less, e.g. functional, executing the same function more than once does not alter the result; it only adds unnecessary overhead. In Fibonacci, most of the tags which prescribe steps are put many times. By default, the CnC runtime will execute the corresponding steps as many times as the same tag is put. By providing a simple hint to the runtime, it will automatically elide duplicate steps. This feature is not enabled by default, because duplicate tags are not common and the automatic memoization apparently adds some overhead.
The memoization is a matter of the tag-collections, hence the corresponding tuning feature is provided by assigning a tuner to the tag-collection. The memoization is enabled by requesting to preserve tags. CnC provides special base-class for tag-tuners to make this extremely convenient:
That's all what's needed to let the runtime memoize step executions. In the Fibonacci example this feature has an amazing effect on performance as it elides most of the computation, because due to the nature of the algorithm it stems from redundant computation (also apparent when comparing scheduling statistics).