I'm profiling different variants of a network benchmark (NetPIPE) while running it in loopback. One version is single-threaded and uses standard blocking sockets; the other versions are multi-threaded (4 threads) and use non-blocking sockets. The performance difference between the threaded and unthreaded versions is large, which was to be expected, but I'm having trouble accounting for all the time in all cases.
One of the multi-threaded versions shows 3x the latency of the non-threaded one. This version uses pthread_cond_timedwait() to coordinate between threads. However, when I complete a run using sampling, both this and the non-threaded version report very nearly the same number of Clockticks. I would have expected the Clockticks in the Event Summary for the process view to be 3x larger, with the extra time going to the Idle Pid if nothing else. The other multi-threaded versions, which use a polling method instead of pthread_cond_timedwait(), do come up with the expected larger Clocktick counts. Assuming the benchmark's timing mechanism (which is the same in all versions) is accurate, can you speculate as to why VTune would report virtually the same number of Clockticks when the run takes 3x longer with pthread_cond_timedwait()?
P.S. This is being run on Red Hat 7.3 (2.4.18 SMP kernel) with VTune 2.0.