It finally happened!
While writing the sample code for this post, I was amazed to see how simple it was to achieve a performance improvement of more than 20 times with so little effort.
(This work was done by Vivek Lingegowda during his internship at Intel.)
Most multi-threaded software uses locking. Lock optimization has traditionally aimed to reduce lock contention, that is, to make the critical regions smaller.
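To make "smaller critical regions" concrete, here is a minimal sketch in Python. All names in it (`expensive_transform`, `worker_coarse`, `worker_fine`, `run`) are hypothetical and chosen only for illustration; they do not come from the original post. Both workers produce the same result, but the second holds the lock only long enough to publish its output.

```python
import threading


def expensive_transform(x):
    # Hypothetical stand-in for work that does not touch shared state.
    return x * x


def worker_coarse(chunk, results, lock):
    # Large critical region: the computation runs while the lock is held,
    # so other threads wait even though they do not need the shared list yet.
    for x in chunk:
        with lock:
            results.append(expensive_transform(x))


def worker_fine(chunk, results, lock):
    # Smaller critical region: compute into a thread-local list first,
    # then take the lock once, only to publish the results.
    local = [expensive_transform(x) for x in chunk]
    with lock:
        results.extend(local)


def run(worker, items, n_threads=4):
    # Split the items across threads and collect everything in one shared list.
    results = []
    lock = threading.Lock()
    chunks = [items[i::n_threads] for i in range(n_threads)]
    threads = [
        threading.Thread(target=worker, args=(chunk, results, lock))
        for chunk in chunks
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Calling `run(worker_coarse, data)` and `run(worker_fine, data)` returns the same sorted list. The difference is only in how long each thread holds the lock, which is what determines how much contention the other threads see.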
Tim Mattson (Intel) has authored an extensive series of excellent videos as an introduction to OpenMP*.