'a' is a double array of size 10000 (about 80 KB). It is small enough to reside in the cache of a single core. The following operations run on a single core of a Xeon Phi. The operations may be parallelized across 1, 2, 3, or 4 threads.
Is it possible to improve 'Operation A' so that it does not perform worse than 'Operation B'?