I've observed that OpenMP overhead is unexpectedly high on a Xeon Phi when omp_set_num_threads(1) is used. I've determined that this is due, at least in part, to an unnecessary allocation each time __kmpc_serialized_parallel() is invoked when nthreads==1. On each invocation a new dispatch_private_info_t is allocated for the serial team's th_disp_buffer. The buffer is freed by __kmpc_end_serialized_parallel().
malloc() seems to be particularly expensive on the Phi: about 700 ns per call vs. about 70 ns per call on a Xeon X5570. If this buffer were retained between calls, the OpenMP overhead would drop by roughly 700 ns per serialized parallel region.
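To illustrate the idea of retaining the buffer, here is a minimal sketch in plain C. It is not the runtime's code: disp_buffer_t, serial_team_t, enter_serialized(), exit_serialized() and destroy_team() are hypothetical stand-ins for dispatch_private_info_t, the serial team, and the __kmpc_serialized_parallel() / __kmpc_end_serialized_parallel() pair, and the 256-byte payload is an arbitrary placeholder. It also assumes serialized regions are not nested; a real fix would have to handle nesting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for dispatch_private_info_t; the real layout
   is not reproduced here and the size is an arbitrary placeholder. */
typedef struct disp_buffer {
    char payload[256];
} disp_buffer_t;

/* Hypothetical stand-in for the serial team. */
typedef struct serial_team {
    disp_buffer_t *disp_buffer; /* retained across serialized regions */
    int active;                 /* nonzero while inside a region */
    size_t alloc_count;         /* malloc() calls made, for illustration */
} serial_team_t;

/* Entering a serialized region: allocate only if no buffer is retained.
   Returns 0 on success, -1 if the allocation fails. */
static int enter_serialized(serial_team_t *team)
{
    assert(team != NULL);
    if (team->disp_buffer == NULL) {
        team->disp_buffer = malloc(sizeof *team->disp_buffer);
        if (team->disp_buffer == NULL)
            return -1;
        team->alloc_count++;
    }
    team->active = 1;
    return 0;
}

/* Leaving a serialized region: keep the buffer instead of freeing it. */
static void exit_serialized(serial_team_t *team)
{
    assert(team != NULL);
    team->active = 0;
}

/* Release the retained buffer when the team itself is torn down. */
static void destroy_team(serial_team_t *team)
{
    assert(team != NULL);
    free(team->disp_buffer);
    team->disp_buffer = NULL;
}
```

With a zero-initialized serial_team_t, repeated enter_serialized() / exit_serialized() pairs call malloc() only once; every later region reuses the same buffer, and destroy_team() frees it at shutdown. The cost is that the buffer stays allocated for the lifetime of the team.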