Intel® Xeon Phi™ coprocessor Power Management Configuration: Using the micsmc command-line Interface

Previous blogs on power management, along with a host of other power management resources, are listed in “List of Useful Power and Power Management Articles, Blogs and References” at http://software.intel.com/en-us/articles/list-of-useful-power-and-power-management-articles-blogs-and-references.

INTRODUCTION: TEMPERATURE SENSORS AND THE COPROCESSOR

Figure SENSORS is borrowed from the Intel® Xeon Phi™ Coprocessor Datasheet, dated June 2013. This image shows the location of various sensors and components on the coprocessor’s printed circuit board (PCB).

Figure SENSORS. The front and back of a representative coprocessor printed circuit board showing the position of thermal sensors and major components.

The descriptions of the micsmc command-line options below explain the purpose and interpretation of the data obtained from these sensors. Because the coprocessor comes in both passively and actively cooled SKUs, it has inlet, outlet, and fan-related sensors. These sensors exist on both types of coprocessor, but their readings are only directly meaningful on the actively cooled versions. On passively cooled versions, the meaning and usefulness of these readings depend on the cooling provided by the chassis of the host containing the coprocessors.

COMMAND LINE USAGE: MEASURING POWER

As you might expect, the command-line version can do everything the graphical tool can, and more. For a full list of commands, see Table FULL. Table POWER, which follows it, lists the options most relevant to power.

twkidd@knightscorner1:~> micsmc --help

Intel(R) Xeon Phi(TM) Coprocessor Platform Status Panel
VERSION: 3.1-0.1.build0
Developed by Intel Corporation. Intel, Xeon, and Intel Xeon Phi are trademarks
of Intel Corporation in the U.S and/or other countries.

This application monitors device performance, including driver info,
temperatures, core usage, etc.

This program is based in part on the work of the Qwt project
(http://qwt.sf.net).

The Status Panel User Guide is available in all supported languages, in PDF and HTML formats, at:

   "/opt/intel/mic/sysmgmt/docs"+

USAGE:
======
   -a, --all [[device] <device_list>]
         Displays all/selected device status data. Equivalent to: -i -t -f -m
         -c.
   -c, --cores [[device] <device_list>]
         Displays the average and per core utilization levels for all/selected
         devices.
   -f, --freq [[device] <device_list>]
         Displays the clock frequency and power levels for all/selected
         devices.
   -i, --info [[device] <device_list>]
         Displays general system information for all/selected devices.
   -l, --lost
         Displays all Intel(R) Xeon Phi(TM) Coprocessors in the system and
         whether they are currently in the Lost Node condition.
   --online
         Displays all Intel(R) Xeon Phi(TM) Coprocessors in the system that are
         currently online.
   --offline
         Displays all Intel(R) Xeon Phi(TM) Coprocessors in the system that are
         currently offline, lost, or otherwise unavailable.
   -m, --mem [[device] <device_list>]
         Displays the memory utilization data for all/selected devices.
   -t, --temp [[device] <device_list>]
         Displays the temperature levels for all/selected devices.
   --ecc [status | enable | disable] [[device] <device_list>]
         Optional arguments:
            enable  - enables ECC Mode
            disable - disables ECC Mode
            status  - displays the ECC Mode
         Enables, disables or displays the ECC Mode for all/selected devices.
         NOTE: If no arguments are provided, status is displayed.
   --turbo [status | enable | disable] [[device] <device_list>]
         Optional arguments:
            enable  - enables Turbo Mode
            disable - disables Turbo Mode
            status  - displays Turbo Mode status
         Enables, disables or displays the Turbo Mode for all/selected devices.
         NOTE: If no arguments are provided, status is displayed.
   --led [status | enable | disable] [[device] <device_list>]
         Optional arguments:
            enable  - enables LED Alert
            disable - disables LED Alert
            status  - displays LED Alert status
         Enables, disables or displays the LED Alert for all/selected devices.
         NOTE: If no arguments are provided, status is displayed.
   --pthrottle [[device] <device_list>]
         Displays the Power Throttle State for all/selected devices.
   --tthrottle [[device] <device_list>]
         Displays the Thermal Throttle State for all/selected devices.
   --pwrenable [cpufreq | corec6 | pc3 | pc6 | all] [[device] <device_list>]
         Optional arguments:
            cpufreq - enables the cpufreq power management feature
            corec6  - enables the corec6 power management feature
            pc3     - enables the pc3 power management feature
            pc6     - enables the pc6 power management feature
            all     - enables all four power management features
         Enables/disables the Power Management Features for all/selected
         devices.
         NOTE: Each feature not specified will automatically be disabled. If no
         features are specified, then all Power Management Features are
         disabled.
   --pwrstatus [[device] <device_list>]
         Displays the Power Management Feature status for all/selected devices.
   --timeout <value>
         Required argument:
            value - integer timeout value in seconds.
         Sets the sub-process timeout value for the current invocation. Affects
         only command option(s) requiring sub-process execution.
   -h, --help [<options_list>]
         Displays full/selected usage information and then exits.
   -v, --version
         Displays the tool version and then exits.
=======================================
Common Argument: [[device] device_list]
   Specifies the device name arguments for a given command option. The
   'device_list' specifies one or more 'micN' values where 'N' is the device
   number: 'mic2 mic5 ...' When no device names are specified, the option
   operates on all devices in the system.
twkidd@knightscorner1:~>

Table FULL. Full list of micsmc command line options.

   -f, --freq [[device] <device_list>]
         Displays the clock frequency and power levels for all/selected
         devices.
   -t, --temp [[device] <device_list>]
         Displays the temperature levels for all/selected devices.
   --turbo [status | enable | disable] [[device] <device_list>]
         Optional arguments:
            enable  - enables Turbo Mode
            disable - disables Turbo Mode
            status  - displays Turbo Mode status
         Enables, disables or displays the Turbo Mode for all/selected devices.
         NOTE: If no arguments are provided, status is displayed.
   --pthrottle [[device] <device_list>]
         Displays the Power Throttle State for all/selected devices.
   --tthrottle [[device] <device_list>]
         Displays the Thermal Throttle State for all/selected devices.
   --pwrenable [cpufreq | corec6 | pc3 | pc6 | all] [[device] <device_list>]
         Optional arguments:
            cpufreq - enables the cpufreq power management feature
            corec6  - enables the corec6 power management feature
            pc3     - enables the pc3 power management feature
            pc6     - enables the pc6 power management feature
            all     - enables all four power management features
         Enables/disables the Power Management Features for all/selected
         devices.
         NOTE: Each feature not specified will automatically be disabled. If no
         features are specified, then all Power Management Features are
         disabled.
   --pwrstatus [[device] <device_list>]
         Displays the Power Management Feature status for all/selected devices.

Table POWER. List of micsmc options most relevant to power management.

As is often the case, the documentation for many of these options is sparse. This is not a criticism, just an acknowledgment that there are always fires that need stamping out.

DETAILS FOR THE OPERANDS THAT APPLY SPECIFICALLY TO POWER

“--freq”

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --freq mic0

mic0 (freq):
   Core Frequency: .......... 1.10 GHz
   Total Power: ............. 107.00 Watts
   Low Power Limit: ......... 315.00 Watts
   High Power Limit: ........ 375.00 Watts
   Physical Power Limit: .... 395.00 Watts
twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US>

Measurement            Definition
Core Frequency         Current clock frequency of the coprocessor cores
Total Power            Rate of energy use, in joules per second (watts)
Low Power Limit        Power level above which power management (PM) starts basic cooling actions, such as increasing the fan speed
High Power Limit       Power level above which PM takes aggressive cooling actions, such as throttling the cores and maximizing the fan speed
Physical Power Limit   Also called the “shutdown limit”. Above this limit, PM starts shutting down the coprocessor; a warning may precede the shutdown.

Table FREQ. Explanation of the output of “micsmc --freq”
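Because these values come back as plain text, they are easy to post-process. Below is a minimal bash sketch that reads the output of “micsmc --freq” for a single device and reports how much headroom the current Total Power leaves below each limit. The function name freq_headroom is hypothetical, and the parsing assumes the layout shown above (label, a run of dots, value, unit), which may change between MPSS releases.

```shell
#!/usr/bin/env bash
# freq_headroom (hypothetical helper): reads "micsmc --freq" output for ONE
# device on stdin and prints the margin, in watts, between Total Power and
# each of the three power limits. Assumes the layout shown above:
#   "   Total Power: ............. 107.00 Watts"
freq_headroom() {
  awk '
    # The numeric value is the next-to-last field; the last is the unit.
    /Total Power:/          { total = $(NF-1) + 0 }
    /Low Power Limit:/      { low   = $(NF-1) + 0 }
    /High Power Limit:/     { high  = $(NF-1) + 0 }
    /Physical Power Limit:/ { phys  = $(NF-1) + 0 }
    END {
      printf "low: %.2f W\n",      low  - total
      printf "high: %.2f W\n",     high - total
      printf "physical: %.2f W\n", phys - total
    }'
}
```

For example, “micsmc --freq mic0 | freq_headroom”. With the sample output above (107 W against limits of 315, 375 and 395 W), it reports 208.00, 268.00 and 288.00 W of headroom.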

“--temp”

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --temp mic0

mic0 (temp):
   Cpu Temp: ................ 49.00 C
   Memory Temp: ............. 36.00 C
   Fan-In Temp: ............. 30.00 C
   Fan-Out Temp: ............ 36.00 C
   Core Rail Temp: .......... 35.00 C
   Uncore Rail Temp: ........ 36.00 C
   Memory Rail Temp: ........ 36.00 C
twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US>

Measurement        Definition
Cpu Temp           Temperature of the coprocessor die
Memory Temp        Temperature of the memory
Fan-In Temp        Temperature at the fan inlet sensor (meaningful on actively cooled coprocessors)
Fan-Out Temp       Temperature at the fan outlet sensor (meaningful on actively cooled coprocessors)
Core Rail Temp     Temperature of the power rail feeding the coprocessor chip
Uncore Rail Temp   Temperature of the power rail feeding all other circuitry except memory
Memory Rail Temp   Temperature of the power rail feeding memory

Table TEMP. Explanation of the output of “micsmc --temp”
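The temperature readings can be summarized the same way. This sketch (the function name temp_summary is hypothetical) picks the hottest sensor and computes the rise from Fan-In Temp to Fan-Out Temp, which, per the sensor discussion above, is only really meaningful on an actively cooled card. It assumes output for a single device in the layout shown above.

```shell
#!/usr/bin/env bash
# temp_summary (hypothetical helper): reads "micsmc --temp" output for ONE
# device on stdin, then prints the hottest sensor and the Fan-In to Fan-Out
# temperature rise. Assumes every reading line ends in " C".
temp_summary() {
  awk '
    / C$/ {
      # The label is the text before the first colon, minus leading spaces.
      label = $0
      sub(/:.*/, "", label)
      sub(/^ +/, "", label)
      value = $(NF-1) + 0
      if (!seen || value > max) { max = value; hottest = label; seen = 1 }
      if (label == "Fan-In Temp")  fan_in  = value
      if (label == "Fan-Out Temp") fan_out = value
    }
    END {
      printf "hottest: %s (%.2f C)\n", hottest, max
      printf "inlet-to-outlet rise: %.2f C\n", fan_out - fan_in
    }'
}
```

Run it as “micsmc --temp mic0 | temp_summary”. For the sample above it reports Cpu Temp at 49.00 C as the hottest sensor and a 6.00 C inlet-to-outlet rise.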

“--pthrottle”

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --pthrottle mic0

mic0 (pthrottle):
   Throttle state: ......... inactive
   Current throttle time: .. 0 msec
   Throttle event count: ... 0
   Total throttle time: .... 0 msec
twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US>

Measurement             Explanation
Throttle state          Whether the coprocessor is currently power throttled; power throttling occurs when coprocessor power exceeds a threshold
Current throttle time   If currently power throttled, how long the throttling has lasted so far
Throttle event count    Number of times the coprocessor has been throttled during the current interval
Total throttle time     Total time spent throttled during the current interval

Table PTHROTTLE. Explanation of the output of “micsmc --pthrottle”
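For monitoring scripts, the Throttle state line is the one to watch. The following sketch (check_throttle is a hypothetical name) prints the state and event count and returns a nonzero exit status when the device is throttled. Only the value “inactive” appears in the output above, so the sketch assumes, conservatively, that any other value means the coprocessor is throttled.

```shell
#!/usr/bin/env bash
# check_throttle (hypothetical helper): reads "micsmc --pthrottle" (or
# "--tthrottle") output for ONE device on stdin and prints the state and
# event count. Exit status: 0 if the state is "inactive", 1 otherwise.
# Assumption: any state other than "inactive" means throttling.
check_throttle() {
  awk '
    /Throttle state:/       { state  = $NF }
    /Throttle event count:/ { events = $NF }
    END {
      printf "state=%s events=%s\n", state, events
      exit (state == "inactive" ? 0 : 1)
    }'
}
```

Something like “micsmc --pthrottle mic0 | check_throttle || echo "mic0 is power throttled"” could be dropped into a cron job. The same function works on “--tthrottle” output, since the two reports share a layout.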

“--tthrottle”

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --tthrottle mic0

mic0 (tthrottle):
   Throttle state: ......... inactive
   Current throttle time: .. 0 msec
   Throttle event count: ... 0
   Total throttle time: .... 0 msec
twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US>

Measurement             Explanation
Throttle state          Whether the coprocessor is currently thermally throttled; thermal throttling occurs when the coprocessor die temperature exceeds a threshold
Current throttle time   If currently thermally throttled, how long the throttling has lasted so far
Throttle event count    Number of times the coprocessor has been throttled during the current interval
Total throttle time     Total time spent throttled during the current interval

Table TTHROTTLE. Explanation of the output of “micsmc --tthrottle”
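Since both throttle reports give a Throttle event count and a Total throttle time over the same interval, dividing one by the other gives the average length of a throttling episode. A minimal sketch, with the hypothetical function name throttle_avg and the assumption that Total throttle time is always reported in msec, as in the output above:

```shell
#!/usr/bin/env bash
# throttle_avg (hypothetical helper): reads "micsmc --tthrottle" (or
# "--pthrottle") output for ONE device on stdin and prints the average
# length of a throttling episode: Total throttle time / Throttle event count.
# Assumes Total throttle time is reported in msec, as in the sample output.
throttle_avg() {
  awk '
    /Throttle event count:/ { events = $NF + 0 }
    /Total throttle time:/  { total  = $(NF-1) + 0 }  # value precedes "msec"
    END {
      if (events == 0)
        print "no throttle events"
      else
        printf "average episode: %.1f msec over %d events\n", total / events, events
    }'
}
```

For example, “micsmc --tthrottle mic0 | throttle_avg”. With the all-zero sample above it prints “no throttle events”; four events totaling 1000 msec would give an average of 250.0 msec.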

“--pwrstatus”

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --pwrstatus mic0

mic0 (pwrstatus):
   cpufreq power management feature: .. enabled
   corec6 power management feature: ... enabled
   pc3 power management feature: ...... enabled
   pc6 power management feature: ...... enabled
twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US>

Measurement   Explanation
cpufreq       Enabled/disabled status of P-states
corec6        Enabled/disabled status of Core C6
pc3           Enabled/disabled status of Package C3
pc6           Enabled/disabled status of Package C6

Table STATUS. Explanation of the output of “micsmc --pwrstatus”
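When scripting, it is handy to turn this report into a plain list of the enabled features. This sketch (enabled_features is a hypothetical name) assumes the “<feature> power management feature: ... enabled|disabled” layout shown above and prints the enabled feature names on one line, in the same spelling that “--pwrenable” accepts.

```shell
#!/usr/bin/env bash
# enabled_features (hypothetical helper): reads "micsmc --pwrstatus" output
# for ONE device on stdin and prints the enabled feature names (cpufreq,
# corec6, pc3, pc6) separated by spaces.
enabled_features() {
  awk '
    # Lines look like: "   pc3 power management feature: ...... enabled"
    /power management feature:/ && $NF == "enabled" {
      printf "%s%s", sep, $1
      sep = " "
    }
    END { print "" }'
}
```

For the sample above, “micsmc --pwrstatus mic0 | enabled_features” prints “cpufreq corec6 pc3 pc6”.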

COMMAND LINE USAGE: CONFIGURING POWER

“--turbo”

twkidd@knightscorner5:~> micsmc --turbo status

mic0 (turbo):
   Turbo mode is enabled

mic1 (turbo):
   Turbo mode is disabled
twkidd@knightscorner5:~>

Measurement   Definition
Turbo mode    Whether turbo mode is enabled, disabled, or not supported

Table TURBO. Explanation of the output of “micsmc --turbo status”

An important note: enabling or disabling turbo does not require rebooting or restarting the card.

“--pwrenable”

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --pwrenable [cpufreq | corec6 | pc3 | pc6 | all] [[device] <device_list>]

Setting   Explanation
cpufreq   Enables the use of P-states
corec6    Enables the cores to enter Core C6
pc3       Enables the specified coprocessor to enter package C-state PC3
pc6       Enables the specified coprocessor to enter package C-state PC6 (the lowest-power idle state)++
all       Enables all four power management features

Table ENABLE. Explanation of the options of “micsmc --pwrenable”
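Remember the NOTE in the help text: every feature you do not list is disabled. Turning off a single feature therefore means listing all the others. The sketch below builds that command line from the current “--pwrstatus” output but only prints it, so you can check it before running it. The function name pwrenable_cmd_without is hypothetical, and the assumption that several feature names can be listed in one invocation is inferred from that NOTE rather than from a documented example.

```shell
#!/usr/bin/env bash
# pwrenable_cmd_without (hypothetical helper): reads "micsmc --pwrstatus"
# output for ONE device on stdin and PRINTS (does not run) a
# "micsmc --pwrenable" command that keeps every currently enabled feature
# except the one named in $2. Features left off the list are disabled.
# Usage: pwrenable_cmd_without <device> <feature-to-disable>
pwrenable_cmd_without() {
  local device=$1 drop=$2
  local features
  features=$(awk -v drop="$drop" '
    /power management feature:/ && $NF == "enabled" && $1 != drop {
      printf "%s%s", sep, $1
      sep = " "
    }')
  # ${features:+...} adds the list and a trailing space only if non-empty.
  echo "micsmc --pwrenable ${features:+$features }$device"
}
```

For example, “micsmc --pwrstatus mic0 | pwrenable_cmd_without mic0 pc6” prints “micsmc --pwrenable cpufreq corec6 pc3 mic0” for the sample status above. As discussed under ERROR MESSAGES, the change is written to the boot configuration and takes effect only after the coprocessor is rebooted.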

“--all”

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --all [[device] <device_list>]

This operand is equivalent to specifying “--info --temp --freq --mem --cores” (that is, “-i -t -f -m -c”) for the specified devices.

ERROR MESSAGES

If you do not have sufficient permissions (for example, if you are not running as root), you will likely get the message “Error: mic0: unable to set power management configuration: unable to open configuration file: /etc/mpss/mic0.conf”. As that message suggests, “--pwrenable” works by updating the coprocessor boot configuration file micN.conf, where N is the coprocessor number. The Intel® Manycore Platform Software Stack (MPSS) applies these changes only when the specified coprocessor is rebooted. (This does not apply to turbo, which takes effect without a reboot.)
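To avoid that error, a script can check up front whether it is allowed to write the configuration file. A minimal sketch: the function name can_write_mic_conf is hypothetical, and the default directory /etc/mpss comes from the error message above.

```shell
#!/usr/bin/env bash
# can_write_mic_conf (hypothetical helper): before running
# "micsmc --pwrenable", checks whether the current user can write the
# micN.conf boot configuration file that the command updates.
# Usage: can_write_mic_conf <device> [config-dir, default /etc/mpss]
can_write_mic_conf() {
  local device=$1 confdir=${2:-/etc/mpss}
  local conf="$confdir/$device.conf"
  if [ -w "$conf" ]; then
    echo "ok: $conf is writable"
    return 0
  fi
  echo "cannot write $conf; try again with root privileges (for example, via sudo)" >&2
  return 1
}
```

For example, “can_write_mic_conf mic0 && micsmc --pwrenable all mic0” only attempts the change when the file can actually be written.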

Only certain SKUs support turbo. If your card does not, you will get a warning similar to the following.

twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US> micsmc --turbo status
Warning: mic0: Turbo mode not supported by this device:
       Device ID: 0x225d, stepping: 0x2, substepping: 0x0
Warning: mic1: Turbo mode not supported by this device:
       Device ID: 0x225d, stepping: 0x2, substepping: 0x0
twkidd@knightscorner1:/usr/share/doc/sysmgmt/en_US>
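If you manage a mix of cards, a small filter can normalize both forms of turbo output into one line per device. This is a sketch under two assumptions: the function name turbo_report is hypothetical, and because it is not clear whether the “not supported” warnings go to standard output or standard error, the usage example merges the two with 2>&1.

```shell
#!/usr/bin/env bash
# turbo_report (hypothetical helper): reads "micsmc --turbo status" output
# on stdin and prints one line per device: "micN enabled", "micN disabled"
# or "micN unsupported". Handles both output forms shown above.
turbo_report() {
  awk '
    # Supported devices: a "micN (turbo):" header, then a status line.
    /^mic[0-9]+ \(turbo\):/  { dev = $1 }
    /Turbo mode is enabled/  { print dev, "enabled" }
    /Turbo mode is disabled/ { print dev, "disabled" }
    # Unsupported devices: "Warning: micN: Turbo mode not supported ..."
    /Turbo mode not supported/ {
      name = $2
      sub(/:$/, "", name)
      print name, "unsupported"
    }'
}
```

For example, “micsmc --turbo status 2>&1 | turbo_report” prints “mic0 enabled” and “mic1 disabled” for the first turbo sample above, and “mic0 unsupported” and “mic1 unsupported” for the second. Devices reported as disabled are the candidates for “micsmc --turbo enable micN”, which, as noted earlier, needs no reboot.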

+As of January 2014, this path in the help text is wrong: the directory should be /usr/share/doc/sysmgmt. Hopefully this has been corrected by the time you, my humble reader, read this blog.

++Some of the earliest SKUs do not have PC6 capability.
