In principle, every clean CnC program should be immediately applicable for distributed memory systems. With only a few trivial changes most CnC programs can be made distribution-ready. You will get a binary that runs on shared and distributed memory. Most of the mechanics of data distribution etc. is handled inside the runtime and the programmer does not need to bother about the gory details. Of course, there are a few minor changes needed to make a program distribution-ready, but once that's done, it will run on distributed CnC as well as on "normal" CnC (decided at runtime).
Support for distributed memory is part of the "normal" CnC distribution, e.g. it comes with the necessary communication libraries (cnc_socket, cnc_mpi). The communication library is loaded on demand at runtime, hence you do not need to link against extra libraries to create distribution-ready applications. Just link your binaries like a "traditional" CnC application (explained in the CnC User Guide, which can be found in the doc directory).
Even though it is not a separate package or module in the CNC kit, in the following we will refer to features that are specific for distributed memory with "distCnC".
As a distributed version of a CnC program needs to do things which are not required in a shared memory version, the extra code for distCnC is hidden from "normal" CnC headers. To include the features required for a distributed version you need to
. If you want to be able to create optimized binaries for shared memory and distributed memory from the same source, you might consider protecting distCnC specifics like this:
In "main", initialize an object CnC::dist_cnc_init< list-of-contexts > before anything else; parameters should be all context-types that you would like to be distributed. Context-types not listed in here will stay local. You may mix local and distributed contexts, but in most cases only one context is needed/used anyway.
Even though the communication between process is entirely handled by the CnC runtime, C++ doesn't allow automatic marshalling/serialization of arbitrary data-types. Hence, if and only if your items and/or tags are non-standard data types, the compiler will notify you about the need for serialization/marshalling capability. If you are using standard data types only then marshalling will be handled by CnC automatically.
Marshalling doesn't involve sending messages or alike, it only specifies how an object/variable is packed/unpacked into/from a buffer. Marshalling of structs/classes without pointers or virtual functions can easily be enabled using CNC_BITWISE_SERIALIZABLE( type ); others need a "serialize" method or function. The CnC kit comes with an convenient interface for this which is similar to BOOST serialization. It is very simple to use and requires only one function/method for packing and unpacking. See Serialization for more details.
This is it! Your CnC program will now run on distributed memory!
The communication infrastructure used by distCnC is chosen at runtime. By default, the CnC runtime will run your application in shared memory mode. When starting up, the runtime will evaluate the environment variable "DIST_CNC". Currently it accepts the following values
Please see Using Intel(R) Trace Analyzer and Collector with CnC on how to profile distributed programs
On application start-up, when DIST_CNC=SOCKETS, CnC checks the environment variable "CNC_SOCKET_HOST". If it is set to a number, it will print a contact string and wait for the given number of clients to connect. Usually this means that clients need to be started "manually" as follows: set DIST_CNC=SOCKETS and "CNC_SOCKET_CLIENT" to the given contact string and launch the same executable on the desired machine.
If "CNC_SOCKET_HOST" is not a number it is interpreted as a name of a script. CnC executes the script twice: First with "-n" it expects the script to return the number of clients it will start. The second invocation is expected to launch the client processes.
There is a sample script "misc/start.sh" which you can use. Usually all you need is setting the number of clients and replacing "localhost" with the names of the machines you want the application(-clients) to be started on. It requires password-less login via ssh. It also gives some details of the start-up procedure. For windows, the script "start.bat" does the same, except that it will start the clients on the same machine without ssh or alike. Adjust the script to use your preferred remote login mechanism.
CnC comes with a communication layer based on MPI. You need the Intel(R) MPI runtime to use it. You can download a free version of the MPI runtime from http://software.intel.com/en-us/articles/intel-mpi-library/ (under "Resources"). A distCnC application is launched like any other MPI application with mpirun or mpiexec, but DIST_CNC must be set to MPI:
Alternatively, just run the app as usually (with DIST_CNC=MPI) and control the number (n) of additionally spawned processes with CNC_MPI_SPAWN=n. If host and client applications need to be different, set CNC_MPI_EXECUTABLE to the client-program name. Here's an example:
It starts your host executable "cnc_host" and then spawns 3 additional processes which all execute the client executable "cnc_client".
for CnC a MIC process is just another process where work can be computed on. So all you need to do is
Step instances are distributed across clients and the host. By default, they are distributed in a round-robin fashion. Note that every process can put tags (and so prescribe new step instances). The round-robin distribution decision is made locally on each process (not globally).
If the same tag is put multiple times, the default scheduling might execute the multiply prescribed steps on different processes and the preserveTags attribute of tag_collections will then not have the desired effect.
The CnC tuning interface provides convenient ways to control the distribution of work and data across the address spaces. The tuning interface is separate from the actual step-code and its declarative nature allows flexible and productive experiments with different distribution strategies (Tuning for distributed memory).
You can specify the work distribution across the network by providing a tuner to step-collections (the second template argument to CnC::step_collection, see The Tuners). Each step-collection is controlled by its own tuner which maximizes flexibility. By deriving a tuner from CnC::step_tuner it has access to information specific for distributed memory. CnC::tuner_base::numProcs() and CnC::tuner_base::myPid() allow a generic and flexible definition of distribution functions.
To define the distribution of step-instances (the work), your tuner must provide a method called "compute_on", which takes the tag of the step and the context as arguments and has to return the process number to run the step on. To avoid the afore-mentioned problem, you simply need to make sure that the return value is independent of the process it is executed on. The compute_on mechanism can be used to provide any distribution strategy which can be computed locally.
CnC provides special values to make working with compute_on more convenient, more generic and more effective: CnC::COMPUTE_ON_LOCAL, CnC::COMPUTE_ON_ROUND_ROBIN, CnC::COMPUTE_ON_ALL, CnC::COMPUTE_ON_ALL_OTHERS.
By default, the CnC runtime will deliver data items automatically to where they are needed. In its current form, the C++ API does not express the dependencies between instances of steps and/or items. Hence, without additional information, the runtime does not know what step-instances produce and consume which item-instances. Even when the step-distribution is known automatically automatic distribution of data requires global communication. Apparently this constitutes a considerable bottleneck. The CnC tuner interface provides two ways to reduce this overhead.
The ideal, most flexible and most efficient approach is to map items to their consumers. It will convert the default pull-model to a push-model: whenever an item becomes produced, it will be sent only to those processes, which actually need it without any other communication/synchronization. If you can determine which steps are going to consume a given item, you can use the above compute_on to map the consumer step to the actual address spaces. This allows changing the distribution at a single place (compute_on) and the data distribution will be automatically optimized to the minimum needed data transfer.
The runtime evaluates the tuner provided to the item-collection when an item is put. If its method consumed_on (from CnC::item_tuner) returns anything other than CnC::CONSUMER_UNKNOWN it will send the item to the returned process id and avoid all the overhead of requesting the item when consumed.
As more than one process might consume the item, you can also return a vector of ids (instead of a single id) and the runtime will send the item to all given processes.
Note that consumed_on can return CnC::CONSUMER_UNKOWN for some item-instances, and process rank(s) for others.
Sometimes the program semantics make it easier to think about the producer of an item. CnC provides a mechanism to keep the pull-model but allows declaring the owner/producer of the item. If the producer of an item is specified the CnC-runtime can significantly reduce the communication overhead because it on longer requires global communication to find the owner of the item. For this, simply define the depends-method in your step-tuner (derived from CnC::step_tuner) and provide the owning/producing process as an additional argument.
The push-model consumed_on smoothly cooperates with the pull-model as long as they don't conflict.
For a more productive development, you might consider implementing consumed_on by thinking about which other steps (not processes) consume the item. With that knowledge you can easily use the appropriate compute_on function to determine the consuming process. The great benefit here is that you can then change compute distribution (e.g. change compute_on) and the data will automatically follow in an optimal way; data and work distribution will always be in sync. It allows experimenting with different distribution plans with much less trouble and lets you define different strategies at a single place. Here is a simple example code which lets you select different strategies at runtime. Adding a new strategy only requires extending the compute_on function: blackscholes.h A more complex example is this one: cholesky.h
Many algorithms require global data that is initialized once and during computation it stays read-only (dynamic single assignment, DSA). In principle this is aligned with the CnC methodology as long as the initialization is done from the environment. The CnC API allows global DSA data through the context, e.g. you can store global data in the context, initialize it there and then use it in a read-only fashion within your step codes.
The internal mechanism works as follows: on remote processes the user context is default constructed and then de-serialized/un-marshalled. On the host, construction and serialization/marshalling is done in a lazy manner, e.g. not before something actually needs being transferred. This allows creating contexts on the host with non-default constructors, but it requires overloading the serialize method of the context. The actual time of transfer is not statically known, the earliest possible time is the first item- or tag-put. All changes to the context until that point will be duplicated remotely, later changes will not.
Here is a simple example code which uses this feature: blackscholes.h