Unnecessary malloc when nthreads==1 costs 700ns on Xeon Phi

I've observed that OpenMP overhead is unexpectedly high on a Xeon Phi when omp_set_num_threads(1) is used. I've determined that this is due, at least in part, to an unnecessary allocation each time __kmpc_serialized_parallel() is invoked when nthreads==1. On each invocation a new dispatch_private_info_t is allocated for the serial team's th_disp_buffer. The buffer is freed by __kmpc_end_serialized_parallel().

malloc() seems to be particularly expensive on Phi -- about 700ns per call vs. 70ns per call on a Xeon X5570. If this buffer were retained between calls, the OpenMP overhead of each serialized parallel region would drop by roughly 700ns.
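To make the "retain the buffer" idea concrete, here is a minimal sketch of the general pattern. It is not the OpenMP runtime's code: serial_team_t, disp_buf_t, serial_begin(), serial_end() and serial_destroy() are hypothetical names invented for illustration, the buffer size is arbitrary, and nested serialized regions are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-team dispatch buffer (size is arbitrary). */
typedef struct { unsigned char bytes[192]; } disp_buf_t;

/* Hypothetical stand-in for the serial team's state. */
typedef struct {
    disp_buf_t *disp;    /* buffer used by the current serialized region */
    disp_buf_t *cached;  /* buffer kept from the previous region, or NULL */
    size_t mallocs;      /* count of real allocations, for illustration only */
} serial_team_t;

/* Region entry: reuse the cached buffer if there is one, else allocate.
   Returns 0 on success, -1 if malloc fails. */
static int serial_begin(serial_team_t *t) {
    assert(t != NULL && t->disp == NULL);  /* no nesting in this sketch */
    if (t->cached != NULL) {
        t->disp = t->cached;
        t->cached = NULL;
    } else {
        t->disp = malloc(sizeof *t->disp);
        if (t->disp == NULL) return -1;
        t->mallocs++;
    }
    return 0;
}

/* Region exit: park the buffer for the next region instead of freeing it. */
static void serial_end(serial_team_t *t) {
    assert(t != NULL && t->disp != NULL);
    t->cached = t->disp;
    t->disp = NULL;
}

/* Shutdown: release the parked buffer, if any. */
static void serial_destroy(serial_team_t *t) {
    free(t->cached);
    t->cached = NULL;
}
```

With this pattern, a zero-initialized serial_team_t that goes through many serial_begin()/serial_end() pairs calls malloc() only once, at the cost of holding one buffer until serial_destroy() is called.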

Thanks for the report. Could you provide a small reproducer (a test case) so the allocation times can be verified on different platforms (with different CPUs)? Thanks in advance.

This may be a misfeature, but it doesn't seem very serious. Running OpenMP on KNC with one thread seems a generally futile thing to be doing. Can you explain why this actually matters to any real code?

Sergey: I've attached a simple test which measures the time to call malloc(192) in two scenarios: many mallocs in a row, and malloc/free cycling (which benefits from cache locality, among other things). I measure 0.06us/0.04us on an X5570 and 0.68us/0.49us on a Phi (i.e. malloc is more than 10X slower on the Phi).
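For readers without the attachment, the two scenarios can be sketched roughly as follows. This is not the attached malloc-test.c: the function names are hypothetical, and clock() (CPU time) is an assumed choice of timer, picked only because it is standard C.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define ALLOC_SIZE 192  /* request size quoted in the post */

/* Convert a clock() interval to microseconds. */
static double elapsed_us(clock_t start, clock_t end) {
    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}

/* Scenario 1: n mallocs in a row; blocks are freed only after timing stops.
   Returns the average microseconds per malloc, or -1.0 if allocation fails. */
static double time_malloc_burst(size_t n) {
    assert(n > 0);
    void **ptrs = malloc(n * sizeof *ptrs);
    if (ptrs == NULL) return -1.0;

    size_t done = 0;
    clock_t start = clock();
    for (; done < n; done++) {
        ptrs[done] = malloc(ALLOC_SIZE);
        if (ptrs[done] == NULL) break;
    }
    clock_t end = clock();

    double result = (done == n) ? elapsed_us(start, end) / (double)n : -1.0;
    for (size_t i = 0; i < done; i++) free(ptrs[i]);
    free(ptrs);
    return result;
}

/* Scenario 2: each malloc is immediately followed by free.
   Returns the average microseconds per malloc/free pair, or -1.0 on failure. */
static double time_malloc_free_cycle(size_t n) {
    assert(n > 0);
    clock_t start = clock();
    for (size_t i = 0; i < n; i++) {
        /* volatile keeps the compiler from eliding the malloc/free pair */
        void *volatile p = malloc(ALLOC_SIZE);
        if (p == NULL) return -1.0;
        free(p);
    }
    clock_t end = clock();
    return elapsed_us(start, end) / (double)n;
}
```

Calling each function with something like n = 1000000 and printing the result gives a per-call figure comparable in kind to the numbers above, although the exact values depend on the allocator, the compiler flags and the timer.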

James: I suspect it doesn't matter much in real code. I've been trying to measure OpenMP overhead to understand how small a task can be and still benefit from OpenMP parallelization. I measured overhead with different numbers of threads and found that single-threaded overhead was surprisingly high. I investigated it in the hope that I might be able to reduce overhead in general, but it turns out to be a special case. I could imagine it affecting code which tunes parallelization to the size of the task. If the task is small enough that parallelization is unwarranted, it seems undesirable that OpenMP would still introduce significant overhead.


Attachment: malloc-test.c (text/x-csrc, 1.07 KB)