dist_cnc.h
1 //********************************************************************************
2 // Copyright (c) 2007-2013 Intel Corporation. All Rights Reserved. **
3 // **
4 // The source code contained or described herein and all documents related to **
5 // the source code ("Material") are owned by Intel Corporation or its suppliers **
6 // or licensors. Title to the Material remains with Intel Corporation or its **
7 // suppliers and licensors. The Material contains trade secrets and proprietary **
8 // and confidential information of Intel or its suppliers and licensors. The **
9 // Material is protected by worldwide copyright and trade secret laws and **
10 // treaty provisions. No part of the Material may be used, copied, reproduced, **
11 // modified, published, uploaded, posted, transmitted, distributed, or **
12 // disclosed in any way without Intel's prior express written permission. **
13 // **
14 // No license under any patent, copyright, trade secret or other intellectual **
15 // property right is granted to or conferred upon you by disclosure or delivery **
16 // of the Materials, either expressly, by implication, inducement, estoppel or **
17 // otherwise. Any license under such intellectual property rights must be **
18 // express and approved by Intel in writing. **
19 //********************************************************************************
20 
21 // ===============================================================================
22 // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
23 //
24 // INCLUDE THIS FILE ONLY TO MAKE YOUR PROGRAM READY FOR DISTRIBUTED CnC
25 //
26 // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
27 // ===============================================================================
28 
29 #ifndef __DIST_CNC__H_
30 #define __DIST_CNC__H_
31 
32 /**
33 \page distcnc Running CnC applications on distributed memory
34 
35 In principle, every clean CnC program should be immediately
36 applicable for distributed memory systems. With only a few trivial
37 changes most CnC programs can be made distribution-ready. You will
38 get a binary that runs on shared and distributed memory. Most of
39 the mechanics of data distribution etc. are handled inside the
40 runtime, so the programmer does not need to bother about the gory
41 details. Of course, there are a few minor changes needed to make a
42 program distribution-ready, but once that's done, it will run on
43 distributed CnC as well as on "normal" CnC (decided at runtime).
44 
45 \section dc_comm Inter-process communication
46 Conceptually, CnC allows data and computation distribution
47 across any kind of network; currently CnC supports SOCKETS and MPI.
48 
49 \section dc_link Linking for distCnC
50 Support for distributed memory is part of the "normal" CnC
51 distribution, i.e. it comes with the necessary communication
52 libraries (cnc_socket, cnc_mpi). The communication library is
53 loaded on demand at runtime, hence you do not need to link against
54 extra libraries to create distribution-ready applications. Just
55 link your binaries like a "traditional" CnC application (explained
56 in the CnC User Guide, which can be found in the doc directory).
57 \note A distribution-ready CnC application binary has no dependencies
58  on an MPI library; it can be run on shared memory or over SOCKETS
59  even if no MPI is available on the system.
60 
61 Even though it is not a separate package or module in the CnC kit,
62 in the following we will refer to features that are specific to
63 distributed memory as "distCnC".
64 
65 \section dc_prog Making your program distCnC-ready
66 As a distributed version of a CnC program needs to do things which
67 are not required in a shared memory version, the extra code for
68 distCnC is hidden from "normal" CnC headers. To include the
69 features required for a distributed version you need to
70 \code #include <cnc/dist_cnc.h> \endcode
71 instead of \code #include <cnc/cnc.h> \endcode .
72 If you want to be able to create optimized binaries for shared
73 memory and distributed memory from the same source, you might
74 consider protecting distCnC specifics like this:
75  @code
76  #ifdef _DIST_
77  # include <cnc/dist_cnc.h>
78  #else
79  # include <cnc/cnc.h>
80  #endif
81  @endcode
82 
83 In "main", initialize an object CnC::dist_cnc_init< list-of-contexts >
84 before anything else; parameters should be all context-types that
85 you would like to be distributed. Context-types not listed in here
86 will stay local. You may mix local and distributed contexts, but
87 in most cases only one context is needed/used anyway.
88  @code
89  #ifdef _DIST_
90  CnC::dist_cnc_init< my_context_type_1 //, my_context_type_2, ...
91  > _dinit;
92  #endif
93  @endcode
94 
95 Even though the communication between processes is entirely handled
96 by the CnC runtime, C++ doesn't allow automatic
97 marshalling/serialization of arbitrary data-types. Hence, if and
98 only if your items and/or tags are non-standard data types, the
99 compiler will notify you about the need for
100 serialization/marshalling capability. If you are using standard
101 data types only then marshalling will be handled by CnC
102 automatically.
103 
104 Marshalling doesn't involve sending messages or the like; it only
105 specifies how an object/variable is packed/unpacked into/from a
106 buffer. Marshalling of structs/classes without pointers or virtual
107 functions can easily be enabled using CNC_BITWISE_SERIALIZABLE( type );
108 others need a "serialize" method or function. The CnC kit
109 comes with a convenient interface for this which is similar to
110 BOOST serialization. It is very simple to use and requires only
111 one function/method for packing and unpacking. See \ref
112 serialization for more details.
113 
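The idea of a single function that describes both packing and unpacking
can be sketched without the CnC headers. The class mini_serializer below
is a hypothetical stand-in written only for this illustration; it is not
the CnC serializer and handles only trivially copyable values. The type
point and all member names are assumptions as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a serializer (NOT the CnC class).
// It either appends values to a byte buffer or reads them back.
class mini_serializer
{
public:
    enum mode_type { PACK, UNPACK };

    explicit mini_serializer( mode_type m ) : m_mode( m ), m_pos( 0 ) {}

    // Switch to unpacking and start reading at the beginning of the buffer.
    void rewind_for_unpack() { m_mode = UNPACK; m_pos = 0; }

    // Pack or unpack one trivially copyable value, depending on the mode.
    template< typename T >
    mini_serializer & operator&( T & value )
    {
        static_assert( std::is_trivially_copyable< T >::value, "bitwise copy only" );
        if( m_mode == PACK ) {
            const unsigned char * p = reinterpret_cast< const unsigned char * >( &value );
            m_buf.insert( m_buf.end(), p, p + sizeof( T ) );
        } else {
            assert( m_pos + sizeof( T ) <= m_buf.size() );
            std::memcpy( &value, m_buf.data() + m_pos, sizeof( T ) );
            m_pos += sizeof( T );
        }
        return *this;
    }

private:
    mode_type m_mode;
    std::size_t m_pos;
    std::vector< unsigned char > m_buf;
};

// Hypothetical user type: one method covers both directions because
// the serializer decides whether it reads or writes the members.
struct point
{
    int x;
    double y;
    void serialize( mini_serializer & ser ) { ser & x & y; }
};
```

Packing a point, rewinding the serializer and calling serialize on a
second point restores the original member values in the second object.
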
114 <b>This is it! Your CnC program will now run on distributed memory!</b>
115 
116 \attention Global variables are evil and must not be used within
117  the execution scope of steps. Read \ref dist_global
118  about how CnC supports global read-only data.
119  In effect, pointers are nothing but global
120  variables and hence need special treatment in distCnC
121  (see \ref serialization).
122 
123 \section dc_run Running distCnC
124 The communication infrastructure used by distCnC is chosen at
125 runtime. By default, the CnC runtime will run your application in
126 shared memory mode. When starting up, the runtime will evaluate
127 the environment variable "DIST_CNC". Currently it accepts the
128 following values:
129 - SHMEM : shared memory (default)
130 - SOCKETS : communication through TCP sockets
131 - MPI : using Intel(R) MPI
132 
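As a rough illustration of such a runtime selection, the following
sketch maps the value of DIST_CNC to a mode. The enum comm_mode and both
function names are hypothetical, and treating unrecognized values as
SHMEM is an assumption of this sketch, not documented behavior.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical enum; the runtime's internal representation is not shown here.
enum class comm_mode { shmem, sockets, mpi };

// Map the value of DIST_CNC to a mode. A null pointer (variable not set)
// selects shared memory, matching the documented default.
// Assumption: unrecognized values also fall back to shared memory.
inline comm_mode parse_dist_cnc( const char * value )
{
    if( value == nullptr ) return comm_mode::shmem;
    const std::string v( value );
    if( v == "SOCKETS" ) return comm_mode::sockets;
    if( v == "MPI" ) return comm_mode::mpi;
    return comm_mode::shmem;
}

// Convenience wrapper that reads the process environment.
inline comm_mode mode_from_environment()
{
    return parse_dist_cnc( std::getenv( "DIST_CNC" ) );
}
```
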
133 Please see \ref itac on how to profile distributed programs.
134 
135 \subsection dc_sockets Using SOCKETS
136 On application start-up, when DIST_CNC=SOCKETS, CnC checks the
137 environment variable "CNC_SOCKET_HOST". If it is set to a number,
138 it will print a contact string and wait for the given number of
139 clients to connect. Usually this means that clients need to be
140 started "manually" as follows: set DIST_CNC=SOCKETS and
141 "CNC_SOCKET_CLIENT" to the given contact string and launch the
142 same executable on the desired machine.
143 
144 If "CNC_SOCKET_HOST" is not a number it is interpreted as a
145 name of a script. CnC executes the script twice: first with "-n",
146 where it expects the script to return the number of clients it will
147 start. The second invocation is expected to launch the client
148 processes.
149 
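The distinction between a client count and a script name could be made
along the following lines. This is only a sketch: the function name is
hypothetical, and the rule "digits only means number" is an assumption
about how a value might be classified, not the runtime's actual check.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Returns true and stores the count if 'value' consists only of decimal
// digits. Otherwise returns false, meaning the value would be treated
// as the name of a start-up script.
inline bool host_value_is_count( const std::string & value, long & count )
{
    if( value.empty() ) return false;
    for( char c : value ) {
        if( ! std::isdigit( static_cast< unsigned char >( c ) ) ) return false;
    }
    count = std::strtol( value.c_str(), nullptr, 10 );
    return true;
}
```
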
150 There is a sample script "misc/start.sh" which you can
151 use. Usually all you need is setting the number of clients and
152 replacing "localhost" with the names of the machines you want the
153 application(-clients) to be started on. It requires password-less
154 login via ssh. It also gives some details of the start-up
155 procedure. For Windows, the script "start.bat" does the same,
156 except that it will start the clients on the same machine without
157 ssh or the like. Adjust the script to use your preferred remote login
158 mechanism.
159 
160 \subsection dc_mpi MPI
161 CnC comes with a communication layer based on MPI. You need the
162 Intel(R) MPI runtime to use it. You can download a free version of
163 the MPI runtime from
164 http://software.intel.com/en-us/articles/intel-mpi-library/ (under
165 "Resources"). A distCnC application is launched like any other
166 MPI application with mpirun or mpiexec, but DIST_CNC must be set
167 to MPI:
168 \code
169 env DIST_CNC=MPI mpiexec -n 4 my_cnc_program
170 \endcode
171 Alternatively, just run the app as usual (with DIST_CNC=MPI) and
172 control the number (n) of additionally spawned processes with
173 CNC_MPI_SPAWN=n. If host and client applications need to be
174 different, set CNC_MPI_EXECUTABLE to the client-program
175 name. Here's an example:
176 \code
177 env DIST_CNC=MPI env CNC_MPI_SPAWN=3 env CNC_MPI_EXECUTABLE=cnc_client cnc_host
178 \endcode
179 It starts your host executable "cnc_host" and then spawns 3 additional
180 processes which all execute the client executable "cnc_client".
181 
182 \subsection dc_mic Intel Xeon Phi(TM) (MIC)
183 For CnC, a MIC process is just another process on which work can be computed. So all you need to do is
184 - Build your application for MIC (see http://software.intel.com/en-us/articles/intel-concurrent-collections-getting-started)
185 - Start a process with the MIC executable on each MIC card just like on a CPU. Communication and start-up work the same way as on Intel64 (\ref dc_mpi and \ref dc_sockets).
186 \note Of course the normal mechanics for MIC need to be considered (like getting applications and dependent libraries to the MIC first). You'll find documentation about this on IDZ, like <A HREF="http://software.intel.com/en-us/articles/how-to-run-intel-mpi-on-xeon-phi">here</A> and/or <A HREF="http://software.intel.com/en-us/articles/using-the-intel-mpi-library-on-intel-xeon-phi-coprocessor-systems">here</A>.
187 \note We recommend starting only 2 threads per MIC core, e.g. if your card has 60 cores, set CNC_NUM_THREADS=120.
188 \note To start different binaries with one mpirun/mpiexec command you can use a syntax like this:<br> mpirun -genv DIST_CNC=MPI -n 2 -host xeon xeonbinary : -n 1 -host mic0 -env CNC_NUM_THREADS=120 micbinary
189 
190 
191 \section def_dist Default Distribution
192 Step instances are distributed across clients and the host. By
193 default, they are distributed in a round-robin fashion. Note that
194 every process can put tags (and so prescribe new step instances).
195 The round-robin distribution decision is made locally on each
196 process (not globally).
197 
198 If the same tag is put multiple times, the default scheduling
199 might execute the multiply prescribed steps on different processes
200 and the preserveTags attribute of tag_collections will then not
201 have the desired effect.
202 
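The effect can be illustrated with a small model. It assumes, purely for
illustration, that each process keeps its own round-robin counter
starting at 0; the struct local_round_robin and the function
place_by_tag are hypothetical names and do not describe the runtime's
implementation.

```cpp
#include <cassert>

// Hypothetical model of one process: it owns a private round-robin counter.
struct local_round_robin
{
    int num_procs;  // total number of processes
    int next;       // next target process, known only to this process

    explicit local_round_robin( int procs ) : num_procs( procs ), next( 0 )
    {
        assert( procs > 0 );
    }

    // The tag is ignored: the placement depends only on how many tags
    // this process has placed before.
    int place( int )
    {
        const int target = next;
        next = ( next + 1 ) % num_procs;
        return target;
    }
};

// A placement that depends only on the tag gives the same answer on
// every process.
inline int place_by_tag( int tag, int num_procs )
{
    assert( tag >= 0 && num_procs > 0 );
    return tag % num_procs;
}
```

If one modeled process has already placed another tag and a second one
has not, placing the same tag on both yields two different targets,
whereas place_by_tag returns the same process everywhere.
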
203 The CnC tuning interface provides convenient ways to control the
204 distribution of work and data across the address spaces. The
205 tuning interface is separate from the actual step-code and its
206 declarative nature allows flexible and productive experiments with
207 different distribution strategies (\ref dist_tuning).
208 
209 \section dist_tuning Tuning for distributed memory
210 \subsection dist_work Distributing the work
211 You can specify the work distribution across the network by providing
212 a tuner to step-collections (the second template argument to
213 CnC::step_collection, see \ref tuning). Each step-collection is
214 controlled by its own tuner, which maximizes flexibility. A tuner
215 derived from CnC::step_tuner has access to information
216 specific to distributed memory. CnC::tuner_base::numProcs() and
217 CnC::tuner_base::myPid() allow a generic and flexible definition of
218 distribution functions.
219 
220 To define the distribution of step-instances (the work), your
221 tuner must provide a method called "compute_on", which takes the
222 tag of the step and the context as arguments and has to return the
223 process number to run the step on. To avoid the problem described
224 in \ref def_dist (the same tag placed on different processes), make sure the return value is
225 independent of the process it is executed on. The compute_on
226 mechanism can be used to provide any distribution strategy which
227 can be computed locally.
228 
229  @code
230  struct my_tuner : public CnC::step_tuner<>
231  {
232  int compute_on( const tag_type & tag, context_type & ) const { return tag % numProcs(); }
233  };
234  @endcode
235 
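Other locally computable strategies follow the same pattern. The two
free functions below are a sketch with hypothetical names; they assume
non-negative integer tags and take the number of processes as a
parameter (inside a tuner that value would come from numProcs()). Both
depend only on their arguments, so every process computes the same
result for a given tag.

```cpp
#include <cassert>

// Block distribution: tags [0, num_tags) are split into contiguous
// ranges of at most ceil(num_tags / num_procs) tags each.
inline int block_owner( int tag, int num_tags, int num_procs )
{
    assert( num_procs > 0 && num_tags > 0 && tag >= 0 && tag < num_tags );
    const int block = ( num_tags + num_procs - 1 ) / num_procs; // ceiling division
    return tag / block;
}

// Block-cyclic distribution: blocks of 'block' consecutive tags are
// dealt out to the processes in a cyclic fashion.
inline int block_cyclic_owner( int tag, int block, int num_procs )
{
    assert( tag >= 0 && block > 0 && num_procs > 0 );
    return ( tag / block ) % num_procs;
}
```
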
236 CnC provides special values to make working with compute_on more
237 convenient, more generic and more effective:
238 CnC::COMPUTE_ON_LOCAL, CnC::COMPUTE_ON_ROUND_ROBIN,
239 CnC::COMPUTE_ON_ALL, CnC::COMPUTE_ON_ALL_OTHERS.
240 
241 \subsection dist_data Distributing the data
242 By default, the CnC runtime will deliver data items automatically
243 to where they are needed. In its current form, the C++ API does
244 not express the dependencies between instances of steps and/or
245 items. Hence, without additional information, the runtime does not
246 know what step-instances produce and consume which
247 item-instances. Even when the step-distribution is known,
248 automatic distribution of data requires
249 global communication. Obviously, this constitutes a considerable
250 bottleneck. The CnC tuner interface provides two ways to reduce
251 this overhead.
252 
253 The ideal, most flexible and most efficient approach is to map
254 items to their consumers. It will convert the default pull-model
255 to a push-model: whenever an item is produced, it will be
256 sent only to those processes that actually need it, without any
257 other communication/synchronization. If you can determine which
258 steps are going to consume a given item, you can use the above
259 compute_on to map the consumer step to the actual address
260 spaces. This allows changing the distribution at a single place
261 (compute_on) and the data distribution will be automatically
262 optimized to the minimum needed data transfer.
263 
264 The runtime evaluates the tuner provided to the item-collection
265 when an item is put. If its method consumed_on (from
266 CnC::item_tuner) returns anything other than CnC::CONSUMER_UNKNOWN
267 it will send the item to the returned process id and avoid all the
268 overhead of requesting the item when consumed.
269  @code
270  struct my_tuner : public CnC::item_tuner< tag_type, item_type >
271  {
272  int consumed_on( const tag_type & tag )
273  {
274  return my_step_tuner::compute_on( consumer_step ); // consumer_step: placeholder for the tag of the step consuming this item
275  }
276  };
277  @endcode
278 
279 As more than one process might consume the item, you
280 can also return a vector of ids (instead of a single id) and the
281 runtime will send the item to all given processes.
282  @code
283  struct my_tuner : public CnC::item_tuner< tag_type, item_type >
284  {
285  std::vector< int > consumed_on( const tag_type & tag )
286  {
287  std::vector< int > consumers;
288  foreach( consumer_step of tag ) { // pseudocode: iterate over all steps consuming item 'tag'
289  int _tmp = my_step_tuner::compute_on( consumer_step );
290  consumers.push_back( _tmp );
291  }
292  return consumers;
293  }
294  };
295  @endcode
296 
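A concrete, self-contained version of this idea is sketched below for a
hypothetical one-dimensional stencil in which item i is read by steps
i-1, i and i+1. The free functions compute_on and consumed_on only mimic
the tuner methods of the same name; their signatures, the integer tags
and the modulo distribution are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical work distribution, standing in for a step tuner's compute_on.
inline int compute_on( int step_tag, int num_procs )
{
    return step_tag % num_procs;
}

// Item 'item_tag' is read by steps item_tag-1, item_tag and item_tag+1,
// as far as they exist (valid steps are 0 .. num_steps-1). The consumers
// are derived from compute_on, so changing the work distribution changes
// the data distribution with it.
inline std::vector< int > consumed_on( int item_tag, int num_steps, int num_procs )
{
    assert( num_steps > 0 && num_procs > 0 );
    std::vector< int > consumers;
    for( int step = item_tag - 1; step <= item_tag + 1; ++step ) {
        if( step < 0 || step >= num_steps ) continue;
        consumers.push_back( compute_on( step, num_procs ) );
    }
    // Report each process only once.
    std::sort( consumers.begin(), consumers.end() );
    consumers.erase( std::unique( consumers.begin(), consumers.end() ), consumers.end() );
    return consumers;
}
```
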
297 Like for compute_on, CnC provides special values to facilitate and
298 generalize the use of consumed_on: CnC::CONSUMER_UNKNOWN,
299 CnC::CONSUMER_LOCAL, CnC::CONSUMER_ALL and
300 CnC::CONSUMER_ALL_OTHERS.
301 
302 Note that consumed_on can return CnC::CONSUMER_UNKNOWN for some
303 item-instances, and process rank(s) for others.
304 
305 Sometimes the program semantics make it easier to think about the
306 producer of an item. CnC provides a mechanism to keep the
307 pull-model but allows declaring the owner/producer of the item. If
308 the producer of an item is specified the CnC-runtime can
309 significantly reduce the communication overhead because it no
310 longer requires global communication to find the owner of the
311 item. For this, simply define the depends-method in your
312 step-tuner (derived from CnC::step_tuner) and provide the
313 owning/producing process as an additional argument.
314 
315  @code
316  struct my_tuner : public CnC::step_tuner<>
317  {
318  int produced_on( const tag_type & tag ) const
319  {
320  return producer_known ? my_step_tuner::consumed_on( tag ) : tag % numProcs();
321  }
322  };
323  @endcode
324 
325 Like for consumed_on, CnC provides special values
326 CnC::PRODUCER_UNKNOWN and CnC::PRODUCER_LOCAL to facilitate and
327 generalize the use of produced_on.
328 
329 The push-model consumed_on smoothly cooperates with the
330 pull-model as long as they don't conflict.
331 
332 \section dist_sync Keeping data and work distribution in sync
333 For a more productive development, you might consider implementing
334 consumed_on by thinking about which other steps (not processes)
335 consume the item. With that knowledge you can easily use the
336 appropriate compute_on function to determine the consuming process.
337 The great benefit here is that you can then change compute distribution
338 (e.g. change compute_on) and the data will automatically follow
339 in an optimal way; data and work distribution will always be in sync.
340 It allows experimenting with different distribution plans with much less
341 trouble and lets you define different strategies at a single place.
342 Here is a simple example code which lets you select different strategies
343 at runtime. Adding a new strategy only requires extending the compute_on function:
344 \ref bs_tuner
345 A more complex example is this one: \ref cholesky_tuner
346 
347 \section dist_global Using global read-only data with distCnC
348 Many algorithms require global data that is initialized once and
349 stays read-only during computation (dynamic single assignment,
350 DSA). In principle this is aligned with the CnC methodology as
351 long as the initialization is done from the environment. The CnC
352 API allows global DSA data through the context, i.e. you can store
353 global data in the context, initialize it there and then use it in
354 a read-only fashion within your step codes.
355 
356 The internal mechanism works as follows: on remote processes the
357 user context is default constructed and then
358 de-serialized/un-marshalled. On the host, construction and
359 serialization/marshalling is done in a lazy manner, i.e. not
360 before something actually needs to be transferred. This allows
361 creating contexts on the host with non-default constructors, but
362 it requires overloading the serialize method of the context. The
363 actual time of transfer is not statically known; the earliest
364 possible time is the first item- or tag-put. All changes to the
365 context until that point will be duplicated remotely; later
366 changes will not.
367 
368 Here is a simple example code which uses this feature:
369 \ref bs_tuner
370 
371 **/
372 
373 #ifdef _CnC_H_ALREADY_INCLUDED_
374 #warning dist_cnc.h included after cnc.h. Distribution capabilities will not be activated.
375 #endif
376 
377 #ifndef _DIST_CNC_
378 # define _DIST_CNC_
379 #endif
380 
381 #include <cnc/internal/dist/dist_init.h>
382 
383 namespace CnC {
384  namespace Internal {
385  class void_context;
386  }
387 
388  /// To enable remote CnC you must create one such object right after entering main.
389  /// The object must persist throughout main.
390  /// All context classes ever used in the program must be referenced as template arguments.
391  /// All contexts must have all collections they use as members and must be default-constructible.
392  /// Pointers as tags are not supported by distCnC.
393  template< class C1, class C2 = Internal::void_context, class C3 = Internal::void_context,
394  class C4 = Internal::void_context, class C5 = Internal::void_context >
395  struct /*CNC_API*/ dist_cnc_init : public Internal::dist_init< C1, C2, C3, C4, C5 >
396  {
397  dist_cnc_init() : Internal::dist_init< C1, C2, C3, C4, C5 >() {}
398  };
399 
400 } // namespace CnC
401 
402 #include <cnc/cnc.h>
403 
404 #endif // __DIST_CNC__H_