One of the big new features introduced in the Intel MKL 11.2 is the greatly improved performance for small problem sizes. In 11.2, this improvement focuses on xGEMM functions (matrix multiplication). Out of the box, there is already a version-to-version improvement (from Intel MKL 11.1 to Intel MKL 11.2). But on top of it, Intel MKL introduces a new control that can lead to further significant performance boost for small matrices. Users can enable this control when linking with Intel MKL by specifying "-DMKL_DIRECT_CALL" or "-DMKL_DIRECT_CALL_SEQ".
The time required by the first MKL call should be ignored for the perfromance measurements. The first MKL call has overhead due to buffer allocation and thread initialization. Ignoring the first MKL call gives more consistent times for small problems.