Cluster OpenMP* Frequently Asked Questions

Number of Threads

Question: I set the environment variable OMP_NUM_THREADS to 4. Why did my program run with only 1 thread?

Answer: As with many OpenMP* implementations, there is a maximum number of OpenMP threads that can be used. For a Cluster OpenMP program, this maximum is determined by values set in the initialization file (see section 5.2 of the User’s Guide), and it defaults to 1 thread. Setting OMP_NUM_THREADS or calling omp_set_num_threads() can reduce the number of threads below this maximum, but can never increase it. If the maximum is not large enough to accommodate the number of threads you want, increase the values in the kmp_cluster.ini file.
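For example (a sketch, assuming the kmp_cluster.ini file already allows 8 threads):

# OMP_NUM_THREADS can lower the thread count below the initialization
# file's maximum; setting it above 8 would have no effect here.
$ export OMP_NUM_THREADS=4
$ ./a.out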

Question: I increased the number of threads per node (using --process_threads), so why does my program run slower?

Answer: It is usually best to use a number of OpenMP threads equal to the number of processors in a node multiplied by the number of nodes. Running more threads per node than there are processors oversubscribes the processors, and the resulting contention typically slows the program down.
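For example, on a cluster of four nodes with two processors each, 4 x 2 = 8 threads is the natural choice. A sketch of the matching kmp_cluster.ini options (host names are placeholders, and the --hostlist spelling should be verified against the User’s Guide):

--hostlist=node1,node2,node3,node4 --processes=4 --process_threads=2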

Program Crash

Question: Why does my program fail with a ’Killed’ message or an internal error immediately on startup?

Answer: Your program may be using too much stack space. Try setting KMP_STACKSIZE to a larger value (default value is 1 megabyte). This controls the stack size allocated to each thread that is not the master thread. The master thread’s stack size is controlled by the stack size limit of the shell used to run the program. You can check and set the shell stack size limit with "ulimit -s" (for the sh or bash shell) or "limit stacksize" (for the csh or tcsh shells).
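A minimal sketch for sh/bash (the 16m size is an arbitrary placeholder):

# Raise the master thread's stack limit, then the size used by the other threads.
$ ulimit -s unlimited
$ export KMP_STACKSIZE=16m
$ ./a.out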

Question: Why does my program fail with an internal segmentation fault after running for a short time?

Answer: Most likely at least one variable that is shared in at least one parallel region is not marked sharable. The compiler can make some types of variables sharable automatically, but not others (see section 7.3 of the User’s Guide).

Question: Why does my program crash when I open a file in a serial region and then access it within a parallel region?

Answer: Files (accessed through C/C++ file descriptors or Fortran unit numbers) are not automatically shared across nodes of a Cluster OpenMP program; each node has its own view of the file system. Opening a file in a serial region opens it only on the master node, not on the remote nodes. Therefore, the failure is due to attempting to read the file on a remote node without having opened it first. In order to read the file on all nodes, it must be opened on all nodes first. One way to do this is to open the file within a parallel region.
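For illustration, a sketch in C (the file name and the processing are placeholders):

#include <stdio.h>

void read_on_all_nodes(void)
{
    #pragma omp parallel
    {
        /* Each thread opens its own descriptor, so the file is opened
           on every node rather than only on the master node. */
        FILE *fp = fopen("input.dat", "r");
        if (fp != NULL) {
            /* ... read whatever this thread needs ... */
            fclose(fp);
        }
    }
}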

Question: Why does my program still fail after I made my variables sharable?

Answer: Does the program use dynamic memory allocation (e.g., C++ new, malloc) to allocate shared variables? If so, make sure that you use kmp_sharable_malloc to allocate sharable memory and kmp_sharable_free to free it (see section 10.7 of the User’s Guide).
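A sketch in C (the prototypes are written out by hand here; see section 10.7 of the User’s Guide for the exact declarations):

#include <stddef.h>

extern void *kmp_sharable_malloc(size_t size);
extern void  kmp_sharable_free(void *ptr);

double *data;   /* the pointer itself must also be sharable, e.g. via a sharable directive */

void setup(int n)
{
    /* Allocated in sharable memory, so every node sees the same data. */
    data = kmp_sharable_malloc(n * sizeof(double));
}

void teardown(void)
{
    kmp_sharable_free(data);
}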

Question: Why does my program run out of memory?

Answer: If your program is running out of stack space, increase the stack space allocated to the threads: raise the shell stack limit for the master thread, and set KMP_STACKSIZE and KMP_SHARABLE_STACKSIZE for the other threads. In addition, you may need to increase the number of virtual memory maps allowed (/proc/sys/vm/max_map_count), you may need to use a separate swap file to remove the twins from system-managed memory (--backing_store in the kmp_cluster.ini file), or you may simply need more physical memory. Cluster OpenMP programs can use more memory than regular OpenMP programs because some sharable memory pages must be represented by more than one page.
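A sketch of the additional knobs (the 16m size and the map count are placeholders; the stack settings themselves are shown in the example above):

$ export KMP_SHARABLE_STACKSIZE=16m
# As root, raise the kernel's per-process limit on memory maps:
$ echo 1000000 > /proc/sys/vm/max_map_count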

Sharable Memory

Question: What does sharable mean?

Answer: Sharable memory is memory allocated in a special part of the program’s address space. Sharable memory has special attributes that allow variables stored in it to be shared by multiple threads in a parallel region. A sharable variable is a variable that resides in sharable memory. Variables accessed by different processors must be placed in sharable memory. The compiler can place some variables in sharable memory automatically, but often you must place variables explicitly in sharable memory with the sharable directive (see section 7.1 of the User’s Guide).
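For example, a sketch in C (the directive spelling follows the form used in the User’s Guide; verify it against section 7.1):

double table[1000];                 /* accessed by every thread in the parallel region */
#pragma intel omp sharable(table)   /* place the global in sharable memory */

double sum_table(void)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += table[i];
    return sum;
}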

Question: Why aren't global variables declared extern, or variables in Fortran common blocks, treated as sharable?

Answer: Global variables must be declared sharable, either by listing them in a sharable directive or (for Fortran common blocks) by using the -clomp-sharable-commons compiler option. In Fortran, it is always better to use a sharable directive than -clomp-sharable-commons, because the option may make variables sharable that need not be.

Question: Why doesn't the code work after I used the OpenMP shared clause to indicate all my sharable variables?

Answer: The shared clause is not enough to make a variable sharable. Cluster OpenMP requires that all variables shared in a parallel region, whether through a shared clause or under the OpenMP default rules, be made sharable (see section 7.1 of the User’s Guide). You most likely need one or more sharable directives in your code.

Incorrect Execution

Question: Why does my program work with regular OpenMP but fail with Cluster OpenMP?

Answer: You may have failed to declare some variables sharable, or you may have an error in your program logic that fails silently on a true shared-memory system but fails more drastically with Cluster OpenMP. Try -clomp-sharable-propagation to find more variables that need to be made sharable. If the program still fails, review chapter 7 of the User’s Guide for techniques to find the rest of the variables that need to be made sharable.

Configuration Questions

Question: Why isn't my program using the options I expected?

Answer: Set the environment variable KMP_CLUSTER_SETTINGS to 1 and run the program. The settings of all options will be dumped out. This may help you figure out the problem.

Question: How can I get a list of all the available options to use when executing my program?

Answer: Set the environment variable KMP_CLUSTER_HELP and execute the program. The table of all options will be dumped out.
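For example (a sketch; whether these variables require the value 1 is an assumption here):

$ export KMP_CLUSTER_SETTINGS=1   # dump the settings of all options at startup
$ ./a.out
$ export KMP_CLUSTER_HELP=1       # dump the table of available options
$ ./a.out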

Debugging

Question: I’m trying to debug a Cluster OpenMP program that prints a segmentation fault message. How do I do this?

Answer: First review the use of command-line debuggers with Cluster OpenMP (User’s Guide chapter 8). Then, try setting a breakpoint in the special routine __itmk_segv_break. Since this symbol is in a shared library, it may not be recognized by the debugger until the shared library is loaded. Some newer versions of gdb will ask if you want to defer setting the breakpoint until a shared library is loaded, then will set the breakpoint when the shared library for Cluster OpenMP is loaded. When the breakpoint is reached, you should be able to get a traceback and thereby find the point in the program that caused the SEGV.
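For example, with gdb (a sketch):

$ gdb ./a.out
(gdb) break __itmk_segv_break
# gdb may offer to make the breakpoint pending on a future shared library load; accept.
(gdb) run
# When the breakpoint is hit, get the traceback:
(gdb) backtrace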

Question: I run my program under a debugger and it SEGVs before it even gets into main. Why does this happen?

Answer: You have probably forgotten to tell the debugger to forward SEGV signals to the process. In a Cluster OpenMP* program, SEGV is used by the inter-node memory consistency protocol, so a SEGV may not represent a bug in the program.

If you are using gdb, you can use the command "handle SIGSEGV nostop noprint". If you are using idb, you can use the command "ignore segv". If you are using the TotalView* debugger, you can insert the line "dset TV::signal_handling_mode {Resend=SIGSEGV}" in your ~/.tvdrc file.

Program Runs too Slowly

Question: Why is my program slower when "--processes=1 --process_threads=1" in the kmp_cluster.ini file than when run using regular OpenMP on a single processor?

Answer: Cluster OpenMP introduces some extra overhead for its inter-node memory consistency mechanism even in this configuration, so a modest slowdown is not unexpected.

Question: My program uses dynamic scheduling, atomic constructs, critical sections, or lots of OpenMP locks, and is very slow with Cluster OpenMP. What causes this?

Answer: Any program that relies on the performance of locks is likely to run much slower with Cluster OpenMP, since locking and unlocking require very expensive network operations. Some built-in OpenMP operations, such as dynamic or guided scheduling, critical sections, and atomic statements, use locks and may thus be much slower with Cluster OpenMP than with regular OpenMP. The program, as written, may not be appropriate for use with Cluster OpenMP. However, it may be possible to change parts of the program to improve its performance: using static scheduling instead of dynamic scheduling, replacing file I/O with a mapped file to reduce overhead, or replacing atomic constructs and critical sections with hand-written reductions.
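For instance, a sketch in C of the scheduling change (work() and n are hypothetical placeholders):

extern void work(int i);   /* hypothetical loop body */

void run(int n)
{
    /* schedule(dynamic) hands out each chunk of iterations under a lock,
       which requires expensive network operations on a cluster.
       schedule(static) divides the iterations up front, with no locking. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        work(i);
}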

Question: Why does the first pass through my parallel loop execute much more slowly than the subsequent iterations?

Answer: The first time a node references a sharable page of memory, it may need to fetch that page from the master node. Subsequent references to the page will typically execute much more quickly. So, this behavior is not unexpected.

Using SSH

Question: How do I use ssh instead of rsh to launch a Cluster OpenMP program?

Answer: Set the --launch=ssh option in the kmp_cluster.ini file. The default is to use rsh. See the User’s Guide for additional information.
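For example, the options line in kmp_cluster.ini might read (host names are placeholders, and the --hostlist spelling should be verified against the User’s Guide):

--hostlist=node1,node2 --launch=ssh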

Question: How do I configure ssh to allow Cluster OpenMP programs to run?

Answer: If the nodes you will use to run your Cluster OpenMP program share the same home directory, the following commands will correctly set up ssh.

# enter ssh-keygen & keep hitting <cr> to accept the defaults.
$ ssh-keygen -t rsa
$ cd ~/.ssh
$ cp id_rsa.pub authorized_keys2
$ chmod 700 .
$ chmod 0600 id_rsa
# ssh into the remote node(s) specified in the kmp_cluster.ini file to verify ssh works without a password
$ ssh node1
$ ssh node2
# ...and so on for each remaining node
# Now build sample Cluster OpenMP program
$ icc -cluster-openmp clomp.c
clomp.c(4) : (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

# run Cluster OpenMP program and verify it is able to run on all nodes specified in the kmp_cluster.ini file.

$ ./a.out
