Now that OpenMP 4 is fairly well supported on the Fortran side (except for simd reduction), I've been able to set up an example which can run in host, MIC native, and offload mode, simply by changing compile options.
I've arranged the benchmark to minimize the share of time spent on data transfers between host and coprocessor by running the test loop thousands of times between transfers, yet the offload performance doesn't approach MIC native performance. A small part of the problem is that the offload mode peaks at 59 threads (about double the performance of the default number of threads), while the native mode shows gains up to 177 threads.
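For illustration, here is a minimal C++ sketch of the "many passes between transfers" arrangement described above. The function name run_benchmark and the kernel itself are hypothetical stand-ins, not the actual benchmark; the idea is only that the arrays are mapped once by target data and every pass reuses the device copies. If no offload target is available (or OpenMP is disabled), the same code simply runs on the host.

```cpp
#include <vector>

// Hypothetical stand-in for the test loop: a simple vector update.
// The data is mapped once, then `repeats` passes run without further transfers.
double run_benchmark(int n, int repeats) {
    std::vector<float> a_store(n, 1.0f), b_store(n, 2.0f);
    float *a = a_store.data();
    float *b = b_store.data();

    // One transfer in (a and b) and one transfer out (a) for the whole region.
    #pragma omp target data map(tofrom: a[0:n]) map(to: b[0:n])
    {
        for (int r = 0; r < repeats; ++r) {
            // The arrays are already present on the device, so these map
            // clauses do not trigger another copy.
            #pragma omp target map(tofrom: a[0:n]) map(to: b[0:n])
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                a[i] = 0.5f * a[i] + 0.25f * b[i];
        }
    }

    // Host-side checksum after the single copy back.
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

With these initial values each element stays at 1.0, so the checksum equals n regardless of the number of passes, which makes it easy to confirm that the host, native, and offload builds agree.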
Another small part of the problem is that compiler directives surrounded by #ifdef __MIC__ take effect only for -mmic compilation. I would use them for offload target mode as well if I knew the incantation. I mean directives like !dir$ novector, for the case where vectorization is slow due to ineffective software prefetch, and !dir$ unroll(0).
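As a sketch of what such guarded directives look like, here is the C++ analogue using the Intel-specific pragmas novector and unroll(0). The function gather_add and its loop are hypothetical. This shows only the native (-mmic) form; it does not answer the question of how to make the directives apply in offload target mode.

```cpp
#include <cstddef>

// Hypothetical loop with an indirect load, the kind of case where
// vectorization may not pay off on the coprocessor.
void gather_add(float *out, const float *in, const int *idx, std::size_t n) {
#ifdef __MIC__
    // Intel compiler pragmas, seen only when __MIC__ is defined.
    #pragma novector
    #pragma unroll(0)
#endif
    for (std::size_t i = 0; i < n; ++i)
        out[i] += in[idx[i]];
}
```

On any other build the guard removes both pragmas, so the loop compiles with the compiler's default vectorization and unrolling choices.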
According to the vecanalysis tool, not only are the conditional directives not used for omp target mode, but in every test some vector operations reported as "lightweight" in native mode are reported as "medium" for omp target, and some "medium" ones are promoted to "heavyweight." Examination of the .s files doesn't show any difference that would account for the differing reports. The MIC .s files are difficult to read, as there appears to be no way to suppress the debug symbol shown before each instruction.
Also, for native mode compilation, vecanalysis reports no peeled vectorized loops and several vectorized remainder loops, the opposite of what it reports for omp target mode. That's another problem which should be only minor.
My C++ version gives similar performance to the Fortran in MIC native mode, but isn't sufficiently stable in omp target mode. The old problem of buffer overlaps being reported when explicitly transferring more than 64MB remains, among others. The only suggestion I've received about that is that the current MPSS may not support the earlier KNC coprocessors (apparently all current production models have more than 4GB RAM). It's strange that the problem is solved for ifort but not for icpc.
There's also a remaining conflict in icpc between target map and target update, where both are needed in the same application; this was fixed in ifort. gcc seems to be copying the lack of support for target update. I filed a bug report with gcc about omp simd reduction being accepted (even where icc rejects it) but killing the optimization which occurred without the directive. The bug report was verified.
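For reference, here is a minimal C++ sketch of the two constructs in question: a target data map region that also needs a target update in the middle, and a loop marked omp simd reduction. The names update_demo and dot are hypothetical and the loops are trivial; this is not the failing code, only the general shape of the usage.

```cpp
#include <vector>

// Map once, then refresh the host copy mid-region with target update.
float update_demo(int n, float *snapshot) {
    std::vector<float> store(n, 1.0f);
    float *a = store.data();

    #pragma omp target data map(tofrom: a[0:n])
    {
        #pragma omp target map(tofrom: a[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;

        // Bring the current device values back without closing the region.
        #pragma omp target update from(a[0:n])
        *snapshot = a[0];

        #pragma omp target map(tofrom: a[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;
    }
    return a[0];  // value copied back when the data region ends
}

// A dot product with an explicit simd reduction clause.
float dot(const float *x, const float *y, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Comparing the vectorization report for dot with and without the simd pragma is a quick way to see whether a given compiler loses an optimization when the directive is present.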