(This work was done by Vivek Lingegowda during his internship at Intel.)
Most multi-threaded software uses locking. Lock optimization has traditionally aimed to reduce lock contention, that is, to make the critical regions smaller.
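As a rough illustration of what "making the critical region smaller" can look like, here is a minimal sketch. It is not taken from the work described above; the function names (`count_words_coarse`, `count_words_fine`, `run`) and the word-counting task are hypothetical and chosen only for illustration. The first worker holds the lock while it parses its input, and the second parses into a private dictionary and takes the lock only to merge the result. In CPython the global interpreter lock limits the actual speedup, so the sketch shows the structure of the change rather than a measured gain.

```python
import threading


def count_words_coarse(chunks, totals, lock):
    # Coarse critical region: the lock is held while each chunk is parsed,
    # so other threads wait even though parsing touches no shared state.
    for chunk in chunks:
        with lock:
            for word in chunk.split():
                totals[word] = totals.get(word, 0) + 1


def count_words_fine(chunks, totals, lock):
    # Smaller critical region: parse into a thread-private dictionary first,
    # then hold the lock only for the short merge into the shared totals.
    local = {}
    for chunk in chunks:
        for word in chunk.split():
            local[word] = local.get(word, 0) + 1
    with lock:
        for word, n in local.items():
            totals[word] = totals.get(word, 0) + n


def run(worker, data, num_threads=4):
    # Hypothetical driver: split the data across threads and run one worker each.
    totals = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(data[i::num_threads], totals, lock))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Both workers produce the same totals; the difference is that the second acquires the lock once per thread instead of once per chunk, and holds it only for the merge.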
Tim Mattson (Intel) has authored an extensive series of excellent videos as an introduction to OpenMP*.
The Intel® Compiler provides SIMD intrinsic APIs for the Short Vector Math Library (SVML) and starting with Intel® Advanced Vector Extensions