The time required by the first MKL call should be ignored for the perfromance measurements. The first MKL call has overhead due to buffer allocation and thread initialization. Ignoring the first MKL call gives more consistent times for small problems.
