The netlib version is distributed as a shar (shell archive) file. On Windows, it can be unpacked with the Cygwin sharutils package.
I show examples of how to optimize, using compiler vectorization and OpenMP where appropriate, in Fortran, C, and C++ (all sharing the same Fortran driver).
In a few cases, threading is possible only at the cost of less optimized code within each thread, and more than 4 threads may be needed to make up the deficit, even at the longest loop length. For those cases, both the threaded and the non-threaded versions are shown.
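As a rough illustration of the two styles in C, here is a sum reduction written once for vectorization alone and once for threading combined with vectorization. This is a sketch, not code taken from the benchmark: the function names dot_simd and dot_threaded are made up for this example, and it assumes a compiler that accepts the OpenMP 4.0 simd directive.

```c
/* Hypothetical kernels for illustration; not part of the benchmark sources. */

/* Single-thread version: asks the compiler to vectorize the reduction. */
float dot_simd(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int i;
#pragma omp simd reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Threaded version: the iterations are divided among the threads,
   and each thread's share is still a candidate for vectorization. */
float dot_threaded(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int i;
#pragma omp parallel for simd reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

If OpenMP is not enabled at compile time, the pragmas are ignored and both functions run as ordinary serial loops. Which version is faster depends on the loop length and the number of threads, so it is worth timing both.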
Makefiles are provided for Intel icpc/ifort (Linux), ifort/ICL/MSVC (Windows), and gcc/gfortran. The Intel libiomp5 libraries are recommended, regardless of the compiler chosen.
For testing on machines with more than one last-level cache, or with Hyper-Threading enabled, setting the affinity environment variable KMP_AFFINITY or GOMP_CPU_AFFINITY is strongly recommended, so that adjacent threads are placed on the same cache and only 1 thread runs per core. Note that libiomp5 honors either environment variable, gcc honors only the latter, and MSVC honors neither.
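Since a forgotten affinity setting is easy to miss, a test program could check for one at startup. The following C helper is only a sketch under that assumption: the function names affinity_variable_in_use and check_affinity are hypothetical and do not appear in the benchmark. It also only reports whether a variable is set; it cannot tell whether the OpenMP runtime in use actually honors it.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns the name of the first affinity variable
   that is set and non-empty, or NULL if neither is set. */
const char *affinity_variable_in_use(void)
{
    static const char *const names[] = { "KMP_AFFINITY", "GOMP_CPU_AFFINITY" };
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; ++i) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return names[i];
    }
    return NULL;
}

/* Reports the setting on stderr. Returns 1 if an affinity variable
   was found, 0 otherwise. */
int check_affinity(void)
{
    const char *name = affinity_variable_in_use();
    if (name == NULL) {
        fprintf(stderr, "warning: neither KMP_AFFINITY nor "
                        "GOMP_CPU_AFFINITY is set; timings may vary\n");
        return 0;
    }
    fprintf(stderr, "affinity: %s=%s\n", name, getenv(name));
    return 1;
}
```

Calling check_affinity() once before the timed loops is enough. The right value for either variable depends on how the machine numbers its cores and caches, so none is suggested here.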