The checking addresses two different concerns:
Finding programming mistakes in the application that need to be fixed by the application developer. These include potential portability problems and violations of the MPI standard which do not immediately cause problems, but might when switching to different hardware or a different MPI implementation.
Detecting errors in the execution environment. This is typically done by users of ISV codes or system administrators who just need to know whom they have to ask for help.
In the former case, correctness checking is most likely done interactively on a smaller development cluster, but it might also be included in automated regression testing. In the latter case, the check must run on the hardware and software stack of the system that is to be checked.
When doing correctness checking, one has to distinguish between error detection, which is done automatically by tools, and error analysis, which is done by the user to determine the root cause of the error and eventually fix it.
The error detection in Intel® Trace Collector is implemented in a special library, libVTmc, which always performs online error detection while the application is running. To cover both of the scenarios mentioned above, libVTmc supports recording error reports for later analysis as well as interactive debugging at runtime. By default, libVTmc does not write a trace file. Set the CHECK-TRACING option to store correctness and performance information in the trace. Use the Intel® Trace Analyzer to view correctness checking events. Take into account that correctness checking consumes resources; do not use the resulting trace for performance analysis.
In some cases libVTmc requires special features of Intel® MPI Library. Therefore, this is currently the only MPI implementation for which libVTmc is provided.
Errors are printed to stderr as soon as they are found. Interactive debugging is done with the help of a traditional debugger: if the application is already running under debugger control, the debugger can stop a process when an error is found.
Currently it is necessary to set a breakpoint manually in the function MessageCheckingBreakpoint(). This function and its debug information are contained in the Intel® Trace Collector library. It is therefore possible to set the breakpoint and, once a process has stopped, to inspect the function parameters, which describe the error that occurred. In later versions it will also be possible to start a debugger at the time when the error is found.