Using Enhanced Intel SpeedStep® features in HPC Clusters

Summary:

Enhanced Intel SpeedStep® Technology (EIST) enables modern computers to adjust CPU frequency and power consumption according to CPU usage. With the addition of Nehalem Turbo Boost, it became quite cumbersome to use all these features in the benchmarking data center Intel runs in DuPont, Washington.

As the focus is on benchmarking, users (both Intel engineers and external customers) want to run the systems in various configurations. Some want the best performance, others do not want to enable Turbo Boost, and a few even want to clock down the CPU to simulate the behavior of slower CPUs or to evaluate how programs scale with clock speed.

Supporting these requirements by hand became impossible for the small administrator staff, so an automatic solution had to be found. This paper presents a way to switch the CPU frequency on a PER JOB basis in our PBS Pro managed HPC cluster.

Technical Background – EIST:

The following is an excerpt from a far more complete overview on “Enhanced Intel SpeedStep® Technology and Demand-Based Switching on Linux” by Venkatesh Pallipadi to be found here.

The following figure depicts the 2.6.8 kernel cpufreq infrastructure at a high level:


The cpufreq module of the Linux kernel provides a framework to support frequency and voltage changes. It depends on hardware-specific drivers (such as acpi and speedstep-centrino) and provides a hardware-independent interface to the so-called governors. These governors can either reside in the kernel or be completely user controlled via the /proc or /sys file systems; the latter is the interface used in our approach.

For every CPU found in the system, including logical CPUs implemented via Hyper-Threading, the Linux kernel creates a subdirectory /sys/devices/system/cpu/cpu?/cpufreq.

[root]# ls /sys/devices/system/cpu
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7  sched_mc_power_savings
[root]# ls /sys/devices/system/cpu/cpu0/cpufreq
affected_cpus     cpuinfo_max_freq  scaling_available_frequencies  scaling_cur_freq  scaling_governor  scaling_min_freq
cpuinfo_cur_freq  cpuinfo_min_freq  scaling_available_governors    scaling_driver    scaling_max_freq  scaling_setspeed

These files can be read and written using standard Unix methods. In a shell the system administrator can use cat {filename} and echo {value} > {filename} to make all necessary changes.
Readable files include:

  • scaling_driver: low-level CPU-specific driver currently in use
  • scaling_available_frequencies: list of all the frequencies supported on this processor (all frequency values are in kHz)
  • scaling_available_governors: lists all the governors that can be used in this system

The following files allow read and write access. While read gives the current settings, only specific values are allowed for write. Allowed values can be found by reading from the files in the previous paragraph.

  • scaling_governor: current policy governor being used
  • scaling_cur_freq: reports the current frequency (this file is read only)
  • scaling_max_freq: limits maximum frequency that can be set by the governor
  • scaling_min_freq: limits minimum frequency that can be set by the governor
  • scaling_setspeed: available only if governor is set to userspace; if set, writing a value from scaling_available_frequencies will change the CPU frequency accordingly.

Redhat Enterprise 5 employs this interface (userspace governor and scaling_setspeed file) to control demand-based frequency switching with the user-level daemon cpuspeed.
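To make the file interface above more concrete, the following is a minimal sketch that prints the main cpufreq settings of one CPU. The function name cpufreq_summary and its optional directory argument are illustrative assumptions and are not part of the scripts described in this paper.

```shell
#!/bin/sh
# Print selected cpufreq settings of one CPU as name=value lines.
# The optional argument is the cpufreq directory to inspect
# (an illustrative parameter); it defaults to cpu0.
cpufreq_summary() (
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    for name in scaling_driver scaling_governor scaling_cur_freq \
                scaling_min_freq scaling_max_freq
    do
        # Skip files the loaded driver does not provide.
        [ -r "$dir/$name" ] || continue
        printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
    done
)
```

Calling cpufreq_summary without an argument inspects cpu0; passing another cpufreq directory inspects that CPU instead.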

The batch scheduling system:

Note: although the description mentions specifically Altair PBS Pro, the described prologue/epilogue feature is available in most batch scheduling software.

To allow parallel usage of our clusters we employ Altair’s PBS Pro batch management solution. In our configuration PBS Pro ensures that at any given time every node is usable by exactly ONE job and user. While a user’s job has control over a node, the user can access it with remote commands such as ssh. All other processes with a user ID greater than 1000 are automatically detected by PBS and killed.

Once PBS schedules a job to a number of nodes, on the first node the script prologue is executed (default location /var/spool/PBS/mom_priv) with an effective UID of 0 (aka run as root). In our environment this is a shell script used for a couple of reasons:

  • Checking consistency of all nodes reserved for a job.
  • Ensuring all file systems report properly
  • Ensuring no processes from previous jobs have been left behind
  • Setting on ALL nodes associated to a job special configurations as requested by a user. One of these items is CPU frequency.
  • Printing a report on important characteristics of the nodes in a job. This includes the kernel and OFED versions in use, the versions of the motherboard BIOS and IB-card firmware, and so on

After the job is done an epilogue script is run (in effect we use the same script that is executed under different names). Again the nodes are checked, and any special configurations returned to their default states.

Additional tools used

It is important to note that these two scripts are only executed on the first node associated with a job (the head node). To process various commands in parallel on all nodes within a job, we use the program pdsh. A typical command might look like

[root]# pdsh -w en[001,003-004] -u 3 pwd

This executes the command pwd in PARALLEL on the nodes en001, en003 and en004. Often the output is then piped through dshbak, which combines identical output into a format that is easier to read.

Methodology

EIST must be enabled within the BIOS and supported by the Linux kernel. We use Redhat 5.3 within CRT-DC which supports all required features and can use Linux capabilities to switch between various states. Our methodology consists of:

  1. ensure all necessary drivers are loaded and all files in /sys have been created
  2. a user submits a job and requests via a PBS resource that all nodes on this job are set to a specified CPU frequency
  3. during PBS pre-execution the prologue script parses the user request and takes appropriate measures to configure the nodes accordingly. If a node is found wanting, it is taken offline, the prologue script exits with an error, and the job is automatically requeued by PBS.
  4. job executes
  5. after the job has finished, the epilogue script ensures all nodes are returned to the standard configuration. If a node is found wanting, it is taken offline.

Preparation of nodes

One has to ensure that all necessary drivers are loaded and all files in /sys have been created. We found the easiest way to do this under Redhat 5 Linux is to execute

/etc/init.d/cpuspeed start; sleep 1; /etc/init.d/cpuspeed stop 

on our compute nodes. After cpuspeed is stopped the system will remain in the highest frequency available. Checking

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

on our NHM systems gives

2794000 2793000 2660000 2527000 2394000 2261000 2128000 1995000 1862000 1596000

with 2794000 indicating "Turbo Mode" (notice it's only a single step above the next lower frequency). As our default behavior is "Turbo Off", we next force the system to switch to 2793000 kHz.

speed=2793000
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed
do
echo "$speed" > $file
done

At this point the system will run fixed at the design speed. One can check via

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

or

grep MHz /proc/cpuinfo

We run these steps during our validation process. Each time before a system is handed over to users, an elaborate script runs to ensure consistency of all compute nodes across the cluster.
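Instead of hard-coding 2793000, the rated frequency could be derived from the frequency list itself. The following is a minimal sketch, assuming (as on the NHM systems described here) that the highest entry denotes Turbo and the second-highest entry is the rated frequency; on CPUs without Turbo this assumption does not hold. The function name rated_freq is hypothetical.

```shell
#!/bin/sh
# Print the rated (non-Turbo) frequency from a whitespace-separated list
# such as the content of scaling_available_frequencies.
# Assumption: the highest value is Turbo, the second highest is rated.
rated_freq() {
    # $1 is left unquoted on purpose so the list is split into words.
    # One value per line, sorted numerically descending, duplicates
    # dropped, then the second line is printed.
    printf '%s\n' $1 | sort -rn | uniq | sed -n '2p'
}
```

A possible use would be speed=$(rated_freq "$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies)").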

Job Submission by User

To allow users to change speed settings (and Turbo mode) at run time we use features of PBS, the batch processing and queuing system managing our clusters. Users indicate at the time of job submission via resource EIST what speed they want to use during their run, and if Turbo should be enabled or not.

Turbo can be activated in 2 ways - either by setting

echo FREQ > /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed

to the highest possible frequency (2794000 in the example above), or by starting the cpuspeed daemon. We recommend the latter option, as it also ensures that CPU frequencies on cores not needed are kept low. This optimizes both power consumption and Turbo behavior.

Within our environment a user can change the behavior of all nodes via the resource EIST. Valid options are:

EIST=0   CPU speed fixed at 2.793 GHz, no cpuspeed daemon (default)
EIST=1   cpuspeed daemon started; Turbo active
EIST=X   CPU speed set to the value closest to X, given in kHz (X>0)

Some examples:

  qsub … -l EIST=1 ./test.sh         switches the cpuspeed daemon on
  qsub … -l EIST=2261000 ./test.sh   sets the CPU speed to 2261000 kHz
  qsub … -l EIST=1596000 ./test.sh   sets the CPU speed to 1596000 kHz
  qsub … -l EIST=0 ./test.sh         switches the cpuspeed daemon off (disables Turbo mode, default)
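The "value closest to X" rule can be illustrated with a short sketch. This is not taken from the prologue script shown later, which writes the requested value directly to scaling_setspeed; the function name closest_freq and its arguments are assumptions made for illustration, and the special values 0 and 1 are not handled here.

```shell
#!/bin/sh
# Print the entry of a whitespace-separated frequency list that is
# numerically closest to the requested value (both in kHz).
# $1 = requested frequency, $2 = list of available frequencies
closest_freq() {
    # $2 is left unquoted on purpose so the list is split into words.
    printf '%s\n' $2 | awk -v want="$1" '
        {
            diff = $1 - want
            if (diff < 0) diff = -diff
            # Keep the first entry with the smallest distance.
            if (NR == 1 || diff < best) { best = diff; pick = $1 }
        }
        END { if (NR > 0) print pick }'
}
```

With the frequency list of the NHM systems above, a request of 2300000 would be mapped to 2261000, the nearest supported step.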

The PBS prologue script

Under PBS, before a job runs, on the headnode (the first node used in a job), the script /var/spool/PBS/mom_priv/prologue is executed under root privileges. During this script we evaluate the resources requested, and set the frequency accordingly.

Unfortunately the author did not find a more direct way to query PBS (version 8) for the requested resources. So we use "qstat -f", transform the output with a fairly complicated sed pipeline, and "eval" the resulting string. In our environment a request "-l EIST=1596000" will therefore create a shell variable Resource_List_EIST with the value 1596000.

RETURN=`qstat -f $JOBID | sed -e 's/\t//g' -e 's/Job Id:/Job_Id =/' | \
sed -e ':a' -e '$!N; s/\n//; ta' -e 's/ /\n/g' |\
sed -e 's/ = /="/' -e 's/$/";/g' -e 's/resources_used./resources_used_/' -e 's/Resource_List./Resource_List_/' \
-e 's/^/export /'`
eval $RETURN
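When only a single attribute is needed, a smaller filter may be easier to follow than the eval approach above. The following sketch is an alternative, not the script used in our environment. It assumes that each attribute appears on one line in the form "name = value" and it ignores values that qstat wraps onto continuation lines. The function name job_attr is hypothetical.

```shell
#!/bin/sh
# Read "qstat -f" style text on stdin and print the value of one attribute.
# $1 = attribute name, for example Resource_List.EIST
# Assumption: the attribute and its value sit on a single line.
job_attr() {
    awk -v key="$1" '
        {
            # Split the line at the first " = " into name and value.
            pos = index($0, " = ")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            gsub(/^[ \t]+/, "", name)
            if (name == key) { print substr($0, pos + 3); exit }
        }'
}
```

A possible use in the prologue would be Resource_List_EIST=$(qstat -f "$JOBID" | job_attr Resource_List.EIST), which leaves the variable empty if the user did not request the resource.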

In the configuration part of the script we set EIST to the default frequency. We also use this variable to switch this option on a clusterwide level.

EIST=2793000

The code that evaluates the user-set resource and sets the frequency is shown below. Keep in mind that the script is only executed on the headnode. We use "pdsh" to distribute the settings to all nodes used in this job.

# check if this feature is currently enabled on the cluster

if [ "${EIST}" -gt 0 ]
then

# if the user set "-l EIST=0" we are going to use 
#   the default frequency
  if [ "${Resource_List_EIST}" = 0 ]
  then
    Resource_List_EIST=${EIST}
  fi
 
  # during prologue, apply the value requested by the user (if any)
  if [ "$prologue" -a -n "${Resource_List_EIST}" ]
  then
    EIST=${Resource_List_EIST}
  fi
 
  # if EIST is still set to one, we only start cpuspeed
  if [ "${EIST}" = 1 ]
  then
    pdsh -w "${NODES}" -x "${HEADNODE}"  -u 5 \
      /etc/init.d/cpuspeed start | dshbak -c
  else
    # ensure cpuspeed is stopped, and then the requested speed
    #   is set on all nodes
    pdsh -w "${NODES}" -x "${HEADNODE}"  -u 5 \
        /etc/init.d/cpuspeed stop | dshbak -c
    for I in `seq 0 ${MAXCORES}`
    do
       FILE=/sys/devices/system/cpu/cpu${I}/cpufreq/scaling_setspeed
       pdsh -w "${NODES}" -x "${HEADNODE}"  -u 5 \
        "[ -f ${FILE} ] && echo ${EIST} > ${FILE};exit 0"
    done
  fi
fi

At the end of the script we inform the user about the current settings. This will show up in the standard output file of each job before any other user output.

echo "speedstep setting: EIST=${EIST}"
pdsh -w "$NODES" -x "$HEADNODE" -u 5 \
    '/etc/init.d/cpuspeed status;exit 0' | dshbak -c
pdsh -w "${NODES}" -x "${HEADNODE}"  -u 5 \
    "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq" \
    | dshbak -c
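Beyond printing the settings, a node can also be checked automatically before it is handed to the job. The following is a minimal sketch of such a check for the fixed-frequency case; it is not meaningful when the cpuspeed daemon is running, because frequencies then vary with load. The function name check_cur_freq and the optional root directory argument are illustrative assumptions.

```shell
#!/bin/sh
# Return 0 if every CPU below the given root reports the expected
# frequency in scaling_cur_freq, and a non-zero status otherwise.
# $1 = expected frequency in kHz
# $2 = CPU root directory (defaults to /sys/devices/system/cpu)
check_cur_freq() (
    expected=$1
    root=${2:-/sys/devices/system/cpu}
    status=0
    seen=0
    for file in "$root"/cpu*/cpufreq/scaling_cur_freq
    do
        [ -r "$file" ] || continue
        seen=$((seen + 1))
        cur=$(cat "$file")
        if [ "$cur" != "$expected" ]
        then
            echo "mismatch: $file reports $cur, expected $expected" >&2
            status=1
        fi
    done
    # Treat a node without any cpufreq files as a failure as well.
    if [ "$seen" -eq 0 ]
    then
        echo "no scaling_cur_freq files found below $root" >&2
        status=1
    fi
    exit "$status"
)
```

A non-zero return status could then be used as the trigger for taking the node offline, as described in the methodology.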

Beyond the fence – SuSE SEL11

SuSE systems are a little bit different. As with Redhat, some items depend on the configuration. Using a default desktop config the author found that power management was directed by the “Gnome Power Management” utility. SEL11 also comes with the powersaved package, which contains a userspace daemon to control the CPU frequency. Please take a look at your configuration and the documentation provided by Novell.

Using SEL11 in runlevel 3 (no X windows; typical for HPC server farms) the author found that the kernel-provided ondemand governor was regulating CPU frequencies, and without load the CPU would run at the lowest frequency:

# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
ondemand
…
ondemand
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
1596000
…
1596000

Nevertheless, a small benchmark program revealed EIST was working, and the CPU would go into Turbo mode as soon as load was applied:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 9.07 sec

Not surprisingly the same possible frequency range as found under Redhat was seen again:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2794000 2793000 2660000 2527000 2394000 2261000 2128000 1995000 1862000 1596000

Remember – the highest frequency denotes Turbo mode, the second highest value gives the rated CPU frequency.

To customize frequencies by hand one first has to switch the governor to userspace (at that point the CPU frequency remains unchanged from its current state):

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor;  \
    do echo userspace > $I; done
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
userspace
…
userspace
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
1596000
…
1596000

The Black-Scholes benchmark now takes slightly more than twice as long as before:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 18.20 sec

Again one can easily set the CPU to its highest standard value WITHOUT Turbo:

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed; \
   do echo 2793000 > $I; done
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
2793000
…
2793000

The Black-Scholes benchmark regains almost its original speed:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 10.36 sec

And lastly one can enable Turbo mode (highest available frequency):

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed; \
   do echo 2794000 > $I; done
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
2794000
...
2794000

The Black-Scholes test now runs as fast as before; the small difference of 0.01 s should not be taken seriously:

> ./bin/blackscholes 1 100000000
The integral of BS(T) over [0,1] with 100000000 steps (1 threads) is 0.770042642388
Time Elapsed: 9.06 sec

For this specific SuSE installation the only change necessary in the PBS prologue script would be to replace the lines

/etc/init.d/cpuspeed start

with

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; \
   do echo ondemand > $I; done

so that demand-based switching (including Turbo) is handled by the kernel governor, and the lines

/etc/init.d/cpuspeed stop

with

# for I in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; \
   do echo userspace > $I; done

because the userspace governor must be active before a fixed frequency can be written to scaling_setspeed.
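The governor-switching loops used in this section can be wrapped in a small helper that first checks whether the requested governor is offered by the kernel. This is a minimal sketch; the function name set_governor and the optional root directory argument are assumptions for illustration and do not appear in our prologue script.

```shell
#!/bin/sh
# Switch all CPUs below the given root to the requested governor.
# $1 = governor name, for example userspace or ondemand
# $2 = CPU root directory (defaults to /sys/devices/system/cpu)
# Stops with a non-zero status at the first CPU that does not list
# the governor in scaling_available_governors.
set_governor() (
    governor=$1
    root=${2:-/sys/devices/system/cpu}
    for dir in "$root"/cpu*/cpufreq
    do
        [ -d "$dir" ] || continue
        # Pad with spaces so only whole governor names match.
        case " $(cat "$dir/scaling_available_governors") " in
            *" $governor "*)
                echo "$governor" > "$dir/scaling_governor" ;;
            *)
                echo "$dir: governor $governor not available" >&2
                exit 1 ;;
        esac
    done
)
```

On the nodes this would be called as set_governor userspace or set_governor ondemand, for example through pdsh in the same way as the other commands in the prologue script.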


Summary

This paper explained in detail how to use Enhanced Intel SpeedStep® Technology and Nehalem Turbo Mode within the constraints of a multi-user HPC cluster. The author hopes this whitepaper helps readers use Intel technology to their best advantage. He can be reached via e-mail at Michael.hebenstreit@intel.com.

3 comments

anonymous:

why scaling_available_frequencies show speed over normal speed only 1000 Hz
In technology sheet by Intel it show varies between 2.66 GHz and 2.93 GHz

How can I seen that?

Michael Hebenstreit (Intel):

voltage should be adjusted automatically by the hardware - this is using the system as designed - we do not try to run it outside well defined (and tested) parameter, we do not overclock - so adjusting voltage by hand is not necessary

anonymous:

Using the cpufreq can we even modify the voltage? The /sys entries only specify the
available frequencies and not the available voltages, then how does it vary the voltage?
