-parallel switch and TCL wrapper...

I have a library written in C++ which is loaded into TCL in the usual way. I tried to compile said library with "-parallel -par-threshold0 -par-report2" and got it to work -- some loops were parallelised, others not.

However, if I try to load the library within TCL, I get the following error:

$ failed to load overlayMeasurement.so: undefined symbol: __kmpc_global_thread_num

Now, I suspect that this is because TCL does not understand that symbol even though TCL was compiled with ICC. But if anyone else here has had the misfortune of using TCL, could you either confirm my predicament or give me a solution?
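
For reference, "the usual way" here means the standard Tcl extension entry point. A minimal sketch of the shape of the wrapper (command and package names are illustrative, not the real library):

// overlayMeasurement.cpp -- minimal Tcl-loadable C++ extension.
// Tcl's [load] resolves <Pkgname>_Init from the file name, so
// "overlayMeasurement.so" needs Overlaymeasurement_Init with C linkage.
#include <tcl.h>

static int MeasureCmd(ClientData, Tcl_Interp *interp,
                      int objc, Tcl_Obj *const objv[]) {
    // Real measurement code would run here; return a dummy value.
    Tcl_SetObjResult(interp, Tcl_NewDoubleObj(0.0));
    return TCL_OK;
}

extern "C" int Overlaymeasurement_Init(Tcl_Interp *interp) {
    Tcl_CreateObjCommand(interp, "measure", MeasureCmd, NULL, NULL);
    return Tcl_PkgProvide(interp, "overlayMeasurement", "1.0");
}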

Thanks.

-- Not to know what happened before you were born is to remain forever a child. For what is a human lifetime, unless it is woven together with the age of our ancestors by the memory of earlier events?
Best Reply

Quoting - nanometrics

"-parallel -par-threshold0 -par-report2"

undefined symbol: __kmpc_global_thread_num

True, that function has nothing to do with tcl. It's an Intel OpenMP run-time symbol, presumably defined in libguide and libiomp. -parallel creates a dependence on the OpenMP library. If you used icc -parallel to link, an appropriate library would be included implicitly.
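
A minimal sketch of that fix, with illustrative file names:

// measure.cpp -- a loop -parallel will thread. The key point is that
// -parallel must also appear at link time, e.g.
//   icc -parallel -fPIC -c measure.cpp
//   icc -parallel -shared measure.o -o overlayMeasurement.so
// so that icc implicitly links the OpenMP runtime that defines
// __kmpc_global_thread_num.
void scale(double *__restrict out, const double *__restrict in, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = 2.0 * in[i];   // independent iterations: auto-parallelizable
    }
}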

Lowering par-threshold to 0 basically tells the compiler to parallelize loops even when it is 99% probable this will slow them down. It may be that the automatic parallelization misses a loop interchange or firstprivate/lastprivate which would make parallelization effective, or the parallelizer judges the loop too short for effective parallelization.
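
As a hedged illustration of the firstprivate/lastprivate point (names invented; build with -openmp): a scalar whose final value is needed after the loop can block auto-parallelization, but an explicit OpenMP clause expresses the intent directly.

double last_square(const double *v, int n) {
    double last = 0.0;
    // lastprivate gives each thread a private copy of "last" and writes
    // back the value from the logically final iteration after the loop.
    #pragma omp parallel for lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = v[i] * v[i];
    }
    return last;
}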

Quoting - tim18

True, that function has nothing to do with tcl. It's an Intel OpenMP run-time symbol, presumably defined in libguide and libiomp. -parallel creates a dependence on the OpenMP library. If you used icc -parallel to link, an appropriate library would be included implicitly.

Lowering par-threshold to 0 basically tells the compiler to parallelize loops even when it is 99% probable this will slow them down. It may be that the automatic parallelization misses a loop interchange or firstprivate/lastprivate which would make parallelization effective, or the parallelizer judges the loop too short for effective parallelization.

Thanks, adding -parallel to the linker line was what was missing. Thank you for the clarification on -par-threshold as well.

-- Not to know what happened before you were born is to remain forever a child. For what is a human lifetime, unless it is woven together with the age of our ancestors by the memory of earlier events?

What does ldd report on your shared object?

One other solution would be to try linking against the static OpenMP library, where the kmpc... symbol is defined (libiomp5.a from the lib directory under the path to your icc compiler), and see what the outcome is.

Quoting - Nicolae Popovici (Intel)

What does ldd report on your shared object?

One other solution would be to try linking against the static OpenMP library, where the kmpc... symbol is defined (libiomp5.a from the lib directory under the path to your icc compiler), and see what the outcome is.

$ ldd testharness
linux-gate.so.1 => (0x00110000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x077b3000)
libimf.so => /opt/intel/Compiler/11.0/069/lib/ia32/libimf.so (0x00112000)
libm.so.6 => /lib/libm.so.6 (0x00d5f000)
libz.so.1 => /lib/libz.so.1 (0x00dad000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00d91000)
libtcl.so => /usr/local/climetpackages/tcltk8.4/lib/libtcl.so (0x00349000)
libgsl.so.0 => /usr/lib/libgsl.so.0 (0x00425000)
libgslcblas.so.0 => /usr/lib/libgslcblas.so.0 (0x005f8000)
libiomp5.so => /opt/intel/Compiler/11.0/069/lib/ia32/libiomp5.so (0x00630000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00baf000)
libc.so.6 => /lib/libc.so.6 (0x00be9000)
libdl.so.2 => /lib/libdl.so.2 (0x00d8a000)
/lib/ld-linux.so.2 (0x00bc4000)
libsvml.so => /opt/intel/Compiler/11.0/069/lib/ia32/libsvml.so (0x006ae000)
libintlc.so.5 => /opt/intel/Compiler/11.0/069/lib/ia32/libintlc.so.5 (0x00782000)

I got it working fine by adding -parallel to the linker line. Ultimately, there is a lot of work to do on the code, as all instances of possible parallelisation were reported as not possible for a wide range of reasons. The option -par-report3 gave me plenty to play with. Sadly, this code is so old, hacked, and undocumented that changing things kind of scares me. Of course, it is also vital to what I do.

Most of the messages were like the following three lines:

Measure/profile.cpp(51): (col. 19) remark: parallel dependence: assumed FLOW dependence between (unknown) line 51 and (unknown) line 51.
Measure/profile.cpp(51): (col. 19) remark: parallel dependence: assumed ANTI dependence between (unknown) line 51 and (unknown) line 51.
Measure/profile.cpp(51): (col. 19) remark: parallel dependence: assumed OUTPUT dependence between (unknown) line 51 and (unknown) line 51.
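
For what it's worth, a store the compiler cannot disambiguate provokes all three assumed-dependence kinds on the same line, much like the remarks above. A hypothetical sketch (not the actual profile.cpp code):

void accumulate(double *out, const int *idx, const double *v, int n) {
    for (int i = 0; i < n; ++i) {
        // idx values are unknown at compile time, so the compiler must
        // assume FLOW (read-after-write), ANTI (write-after-read), and
        // OUTPUT (write-after-write) dependences between iterations.
        out[idx[i]] += v[i];
    }
}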

-- Not to know what happened before you were born is to remain forever a child. For what is a human lifetime, unless it is woven together with the age of our ancestors by the memory of earlier events?

Quoting - nanometrics

I got it working fine by adding -parallel to the linker line. Ultimately, there is a lot of work to do on the code, as all instances of possible parallelisation were reported as not possible for a wide range of reasons. The option -par-report3 gave me plenty to play with. Sadly, this code is so old, hacked, and undocumented that changing things kind of scares me. Of course, it is also vital to what I do.

Most of the messages were like the following three lines:

Measure/profile.cpp(51): (col. 19) remark: parallel dependence: assumed FLOW dependence between (unknown) line 51 and (unknown) line 51.
Measure/profile.cpp(51): (col. 19) remark: parallel dependence: assumed ANTI dependence between (unknown) line 51 and (unknown) line 51.
Measure/profile.cpp(51): (col. 19) remark: parallel dependence: assumed OUTPUT dependence between (unknown) line 51 and (unknown) line 51.

If your code is incorrect to the point where you can't set -ansi-alias, or maybe even -alias-const, that will certainly impede parallelization. Typically, it is necessary to use restrict as well, if your loop uses more than one pointer. Setting -fargument-noalias-global, even if your program doesn't live up to it, could be used to confirm places where aliasing concerns are preventing parallelization or vectorization (why didn't you allow vectorization?). -O3, together with appropriate -x level, permits the compiler to consider loop interchanges.
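
A minimal sketch of the restrict point (names invented; icc needs -restrict to accept the keyword, or spell it __restrict): promising the compiler that the pointers do not overlap removes the assumed dependences.

void axpy(double *restrict y, const double *restrict x, double a, int n) {
    // y and x are promised not to alias, so each iteration is provably
    // independent and the loop can be parallelized or vectorized.
    for (int i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}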

The 10.x and later compilers prefer auto-parallelization in some situations where vectorization should take priority, as it did with the 9.1 compilers. The same is true of programmers who set a goal of maximizing threaded performance scaling, as opposed to optimizing performance.

Quoting - tim18

If your code is incorrect to the point where you can't set -ansi-alias, or maybe even -alias-const, that will certainly impede parallelization. Typically, it is necessary to use restrict as well, if your loop uses more than one pointer. Setting -fargument-noalias-global, even if your program doesn't live up to it, could be used to confirm places where aliasing concerns are preventing parallelization or vectorization (why didn't you allow vectorization?). -O3, together with appropriate -x level, permits the compiler to consider loop interchanges.

The 10.x and later compilers prefer auto-parallelization in some situations where vectorization should take priority, as it did with the 9.1 compilers. The same is true of programmers who set a goal of maximizing threaded performance scaling, as opposed to optimizing performance.

The options passed to icc are: "-w -ansi -w1 -O3 -no-prec-div -xP -ip -unroll -ansi-alias -alias-const -fargument-noalias-global -parallel -par-report3". The code is a legacy system, most of it written 15 years ago. I would be surprised if there were any way to parallelise it automatically at all, but it was worth a try to confirm that there is not.

The next step is to look at the whole code via VTune and see where it can be improved. If this shows that it would benefit from parallelisation, then a refactoring of the code could be on the table.

Nonetheless, thanks for your help.

-- Not to know what happened before you were born is to remain forever a child. For what is a human lifetime, unless it is woven together with the age of our ancestors by the memory of earlier events?
