Rangeley Uncore Performance Monitoring Events

Using the Rangeley SoC Performance Monitoring Events

This article focuses on the uncore performance monitoring events for the Rangeley SoC. For an introduction to SoC uncore performance monitoring, please see this article:

Introduction to the Rangeley SoC

Below is a block diagram illustrating the typical Rangeley layout. The green arrows connecting the blocks represent interfaces that can be monitored for requests in order to calculate bandwidth. The gray arrows in the south cluster represent interfaces that are not supported for uncore performance monitoring. As the diagram shows, all analysis focuses on the north cluster.

Available Groups

The following table documents the available groups for Rangeley. The group name refers to the predetermined set of events that the software monitoring tools will program. The Events column gives the number of events in the group. The Clock column indicates whether the group includes counts from the SoC source clock.

Rangeley Uncore Event Group Table

Group Name Events Clock Description
UNC_SOC_Memory_DDR_BW 8 No Counts requests of size 32 bytes and 64 bytes to memory, for memory channel 0 and 1. Determine memory bandwidth by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Memory_DDR0_BW 5 Yes Counts requests of size 32 bytes and 64 bytes to memory, for memory channel 0. Determine memory bandwidth by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Memory_DDR1_BW 5 Yes Counts requests of size 32 bytes and 64 bytes to memory, for memory channel 1. Determine memory bandwidth by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_DDR_Self_Refresh 3 Yes Counts the number of cycles that memory channel 0 and 1 are in self-refresh.
UNC_SOC_All_Reqs 7 Yes Counts the number of requests per memory agent. Counts can be used to identify high-demand agents or to estimate bandwidth for an agent by multiplying its request count by 64 bytes.
UNC_SOC_Module0_BW 7 Yes Counts bandwidth events for Silvermont module 0. Determine bandwidth between Silvermont module 0 and memory by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Module1_BW 7 Yes Counts bandwidth events for Silvermont module 1. Determine bandwidth between Silvermont module 1 and memory by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Module2_BW 7 Yes Counts bandwidth events for Silvermont module 2. Determine bandwidth between Silvermont module 2 and memory by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Module3_BW 7 Yes Counts bandwidth events for Silvermont module 3. Determine bandwidth between Silvermont module 3 and memory by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Module0_1_BW 8 No Counts bandwidth events for Silvermont module 0 and module 1. Determine bandwidth between Silvermont module 0, module 1 and memory by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Module2_3_BW 8 No Counts bandwidth events for Silvermont module 2 and module 3. Determine bandwidth between Silvermont module 2, module 3 and memory by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_Module0_1_Snoops 5 Yes Counts the number of snoop requests and replies for Silvermont module 0 and 1.
UNC_SOC_Module2_3_Snoops 5 Yes Counts the number of snoop requests and replies for Silvermont module 2 and 3.
UNC_SOC_Module0_1_2_3_Snoops 8 No Counts the number of snoop requests and replies for Silvermont module 0, 1, 2 and 3.
UNC_SOC_LowSpeedPF_BW 7 Yes Counts aggregate bandwidth events for the low speed peripheral fabric. Determine the aggregate memory bandwidth by multiplying event count by the request size (32 or 64 bytes).
UNC_SOC_HighSpeedPF_BW 7 Yes Counts aggregate bandwidth events for the high speed peripheral fabric. Determine the aggregate memory bandwidth by multiplying event count by the request size (32 or 64 bytes).

UNC_SOC_Memory_DDR_BW

The UNC_SOC_Memory_DDR_BW group provides counts for computing the total memory bandwidth as seen by the SoC memory controller. The events break requests down per channel and by request size (32 and 64 bytes). Although these events do not show which agent is demanding memory, they are the most accurate way to determine how much memory bandwidth is actually being consumed.

The number of memory channels on Rangeley is SKU-dependent: either one or two. If there is only one channel, all counts associated with the second channel will be zero. For more information on memory channel architecture, please see: http://en.wikipedia.org/wiki/Multi-channel_memory_architecture.

The groups UNC_SOC_Memory_DDR0_BW and UNC_SOC_Memory_DDR1_BW are subsets of this group that collect only memory channel 0 or channel 1, respectively.

The image below illustrates the traffic flow being monitored for this group.

The table below documents the events contained in the UNC_SOC_Memory_DDR_BW group.

UNC_SOC_Memory_DDR_BW
Name Counter Description
DDR_Chan0_Read32B 0 Counts memory read requests of size 32 bytes to memory channel 0.
DDR_Chan0_Read64B 1 Counts memory read requests of size 64 bytes to memory channel 0.
DDR_Chan0_Write32B 2 Counts memory write requests of size 32 bytes to memory channel 0.
DDR_Chan0_Write64B 3 Counts memory write requests of size 64 bytes to memory channel 0.
DDR_Chan1_Read32B 4 Counts memory read requests of size 32 bytes to memory channel 1.
DDR_Chan1_Read64B 5 Counts memory read requests of size 64 bytes to memory channel 1.
DDR_Chan1_Write32B 6 Counts memory write requests of size 32 bytes to memory channel 1.
DDR_Chan1_Write64B 7 Counts memory write requests of size 64 bytes to memory channel 1.

Analyzing Results

Bandwidth in MB/s (1 MB = 1,000,000 bytes) can be calculated for the 64-byte events listed above as follows:

Event metric formula: event_count / seconds_sampled * 64 bytes / 1,000,000 = MB/s

For 32-byte events, replace 64 bytes with 32 bytes:

Event metric formula: event_count / seconds_sampled * 32 bytes / 1,000,000 = MB/s

Events can be summed together to form the desired metric, for example:

  • Total memory bandwidth = sum of all event MB/s
  • Total read bandwidth = sum of all read event MB/s
  • Channel 0 bandwidth = sum of channel 0 event MB/s
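The per-event formula and the sums above can be sketched in Python as follows. This is a minimal illustration, not part of any monitoring tool: it assumes the eight event counts and the sampling interval in seconds have already been collected, and the function names and sample counts are hypothetical.

```python
from typing import Dict, Mapping

BYTES_PER_MB = 1_000_000  # MB here means 1,000,000 bytes, matching the formula

# Request size in bytes implied by each UNC_SOC_Memory_DDR_BW event name.
DDR_EVENT_SIZES = {
    "DDR_Chan0_Read32B": 32,
    "DDR_Chan0_Read64B": 64,
    "DDR_Chan0_Write32B": 32,
    "DDR_Chan0_Write64B": 64,
    "DDR_Chan1_Read32B": 32,
    "DDR_Chan1_Read64B": 64,
    "DDR_Chan1_Write32B": 32,
    "DDR_Chan1_Write64B": 64,
}


def event_mb_per_s(count: int, request_bytes: int, seconds_sampled: float) -> float:
    """event_count / seconds_sampled * request_bytes / 1,000,000 = MB/s"""
    return count / seconds_sampled * request_bytes / BYTES_PER_MB


def ddr_bandwidth(counts: Mapping[str, int], seconds_sampled: float) -> Dict[str, float]:
    """Total, read, write and per-channel MB/s from UNC_SOC_Memory_DDR_BW counts.

    Missing events are treated as zero (e.g. channel 1 on a one-channel SKU).
    """
    per_event = {
        name: event_mb_per_s(counts.get(name, 0), size, seconds_sampled)
        for name, size in DDR_EVENT_SIZES.items()
    }
    return {
        "total": sum(per_event.values()),
        "read": sum(v for n, v in per_event.items() if "_Read" in n),
        "write": sum(v for n, v in per_event.items() if "_Write" in n),
        "chan0": sum(v for n, v in per_event.items() if n.startswith("DDR_Chan0_")),
        "chan1": sum(v for n, v in per_event.items() if n.startswith("DDR_Chan1_")),
    }


# Illustrative input only, not measured data.
example = ddr_bandwidth({"DDR_Chan0_Read64B": 1_000_000, "DDR_Chan0_Write32B": 500_000}, 1.0)
```

Keeping the size of each event in a single table means the read, write and per-channel sums are just filters over the same per-event values.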

Known behaviors

  1. If the platform does not have two channels, the second channel counts will be zero.

UNC_SOC_DDR_Self_Refresh

The UNC_SOC_DDR_Self_Refresh group counts the cycles each memory channel spends in self-refresh. Self-refresh is a low-power state, so these counts can be used for power optimization of the SoC and the application.

The table below documents the events contained in the UNC_SOC_DDR_Self_Refresh group.

Name Counter Description
DDR_Chan0_Self_Refresh 0 Counts the number of cycles that memory channel 0 is in self-refresh.
DDR_Chan1_Self_Refresh 1 Counts the number of cycles that memory channel 1 is in self-refresh.
Clock_Counter 2 SoC clock counter

Analyzing Results

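Event metric formula: (event_count * 100) / (time_interval * base_DRAM_frequency) = DDR self-refresh residency (%), with time_interval in seconds and base_DRAM_frequency in Hz, so that the denominator is the number of DRAM clock cycles in the interval.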
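A minimal sketch of the residency formula, assuming time_interval is in seconds and base_DRAM_frequency is in Hz. The function name and the 400 MHz frequency in the usage line are hypothetical placeholders; substitute your platform's actual base DRAM frequency. Normalizing by the DRAM frequency rather than by Clock_Counter is consistent with the known behavior that counters 0 and 1 may run on a different clock than counter 2.

```python
def self_refresh_residency(event_count: int, time_interval_s: float,
                           base_dram_frequency_hz: float) -> float:
    """Percentage of DRAM clock cycles a channel spent in self-refresh.

    (event_count * 100) / (time_interval * base_DRAM_frequency)
    """
    if time_interval_s <= 0 or base_dram_frequency_hz <= 0:
        raise ValueError("time interval and DRAM frequency must be positive")
    return (event_count * 100) / (time_interval_s * base_dram_frequency_hz)


# Hypothetical values: 200 million self-refresh cycles in 1 s at a 400 MHz base clock.
chan0_residency = self_refresh_residency(200_000_000, 1.0, 400e6)  # 50.0 percent
```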

Known behaviors

  1. If the Rangeley platform has only one memory channel, the channel 1 counts will be zero.
  2. Counters 0 and 1 may run at a different source clock frequency than counter 2.

UNC_SOC_All_Reqs

The per-agent request count events in the UNC_SOC_All_Reqs group measure the total number of requests from all available SoC agents in a single, concurrent sample. Because it captures all agents concurrently, this group is key for studying any non-static or “bursty” workload. Unlike the other bandwidth groups, which sample one or two agents at a time, it provides insight into all agents within a single time window.

Per-agent bandwidth can be estimated by multiplying each request count by 64 bytes, and total bandwidth can be estimated by summing the bandwidth of all agents. It is critical to understand that the result is only an estimate: it assumes every request is 64 bytes, so it overcounts when transactions are 32-byte or partial-sized requests. Another disadvantage is that there is no read/write breakdown per agent.

For an exact bandwidth measurement with a read/write breakdown, use the per-agent bandwidth groups, which measure one or two agents at a time.

Name Counter Description
Mod0_Reqs 0 Counts the number of requests from Silvermont module 0.
Mod1_Reqs 1 Counts the number of requests from Silvermont module 1.
Mod2_Reqs 2 Counts the number of requests from Silvermont module 2.
Mod3_Reqs 3 Counts the number of requests from Silvermont module 3.
HighSpeedPF_Reqs 4 Counts the aggregate number of requests from the high speed peripheral fabric.
LowSpeedPF_Reqs 5 Counts the aggregate number of requests from the low speed peripheral fabric.
Clock_Counter 6 SoC clock counter

Analyzing Results

Event metric formula:

  • event_count / seconds_sampled * 64 bytes / 1,000,000 = Estimated agent MB/s
  • sum_of_request_counts (counters 0 to 5; exclude Clock_Counter) / seconds_sampled * 64 bytes / 1,000,000 = Estimated DDR MB/s
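The two estimates above can be sketched as follows, assuming the request counts and the sampling interval in seconds come from your monitoring tool. The function and constant names are hypothetical. Note that only the six request counters are summed; Clock_Counter counts cycles, not requests, so adding it would inflate the DDR estimate.

```python
from typing import Dict, Mapping, Tuple

BYTES_PER_MB = 1_000_000
ASSUMED_REQUEST_BYTES = 64  # the estimate treats every request as 64 bytes

# Request-count events of UNC_SOC_All_Reqs (Clock_Counter is not a request count).
REQUEST_EVENTS = (
    "Mod0_Reqs", "Mod1_Reqs", "Mod2_Reqs", "Mod3_Reqs",
    "HighSpeedPF_Reqs", "LowSpeedPF_Reqs",
)


def estimate_all_reqs(counts: Mapping[str, int],
                      seconds_sampled: float) -> Tuple[Dict[str, float], float]:
    """Return (estimated MB/s per agent, estimated total DDR MB/s).

    Both values tend to overstate real bandwidth: 32-byte and partial
    requests are counted as if they were 64 bytes.
    """
    def to_mb_s(count: float) -> float:
        return count / seconds_sampled * ASSUMED_REQUEST_BYTES / BYTES_PER_MB

    per_agent = {name: to_mb_s(counts.get(name, 0)) for name in REQUEST_EVENTS}
    # Sum only the request counters; Clock_Counter is deliberately excluded.
    total = to_mb_s(sum(counts.get(name, 0) for name in REQUEST_EVENTS))
    return per_agent, total
```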

Known behaviors

  1. These events count transactions of any request size, so multiplying them by 64 bytes to calculate MB/s is a generalization that may overcount the actual bandwidth observed at the memory channels.

UNC_SOC_Module0_BW

The UNC_SOC_Module0_BW group provides counts for computing the bandwidth of processor module 0 as seen by the system agent. The events break down requests by type (read or write) and size.

The image below illustrates the traffic flow being monitored for this group.

Name Counter Description
Mod0_ReadPartial 0 Counts all module 0 read transactions of any data size request. This event count is inclusive of partial, 32 byte and 64 byte transactions.
Mod0_Read32B 1 Counts memory read requests of size 32 bytes from Silvermont module 0.
Mod0_Read64B 2 Counts memory read requests of size 64 bytes from Silvermont module 0.
Mod0_WritePartial 3 Counts all module 0 write transactions of any data size request. This event count is inclusive of partial, 32 byte and 64 byte transactions.
Mod0_Write32B 4 Counts memory write requests of size 32 bytes from Silvermont module 0.
Mod0_Write64B 5 Counts memory write requests of size 64 bytes from Silvermont module 0.
Clock_Counter 6 SoC clock counter

Analyzing Results

Read and write bandwidth can be calculated from the 32-byte and 64-byte events, but the partial events are problematic for bandwidth computations because a partial request has an unknown payload size. It is also vital to understand that the partial events in this group count the sum of 64-byte, 32-byte and partial requests, so each partial event can also be thought of as the total read or write count.

Event metric formula:

  • partial_requests - 32_byte_requests - 64_byte_requests = actual partial request count
  • ((Mod0_Read32B_count * 32_bytes / seconds_sampled) + (Mod0_Read64B_count * 64_bytes / seconds_sampled)) / 1,000,000 = Read MB/s
  • ((Mod0_Write32B_count * 32_bytes / seconds_sampled) + (Mod0_Write64B_count * 64_bytes / seconds_sampled)) / 1,000,000 = Write MB/s
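A minimal sketch applying the three formulas above to one sample of UNC_SOC_Module0_BW counts. It assumes the counts and sampling interval in seconds are already available; the function name is hypothetical. Because each partial event includes the 32-byte and 64-byte requests, the sized counts are subtracted to isolate the true partial requests, and only the sized requests contribute to MB/s.

```python
from typing import Dict, Mapping

BYTES_PER_MB = 1_000_000


def module0_metrics(counts: Mapping[str, int], seconds_sampled: float) -> Dict[str, float]:
    """Apply the UNC_SOC_Module0_BW formulas to one sample of event counts."""
    def c(name: str) -> int:
        return counts.get(name, 0)  # missing events count as zero

    # *Partial events are totals (partial + 32B + 64B); subtract the sized ones.
    read_partial_only = c("Mod0_ReadPartial") - c("Mod0_Read32B") - c("Mod0_Read64B")
    write_partial_only = c("Mod0_WritePartial") - c("Mod0_Write32B") - c("Mod0_Write64B")

    # Partial payload sizes are unknown, so only sized requests give bytes.
    read_bytes = c("Mod0_Read32B") * 32 + c("Mod0_Read64B") * 64
    write_bytes = c("Mod0_Write32B") * 32 + c("Mod0_Write64B") * 64
    return {
        "total_reads": c("Mod0_ReadPartial"),
        "total_writes": c("Mod0_WritePartial"),
        "read_partial_only": read_partial_only,
        "write_partial_only": write_partial_only,
        "read_mb_s": read_bytes / seconds_sampled / BYTES_PER_MB,
        "write_mb_s": write_bytes / seconds_sampled / BYTES_PER_MB,
    }
```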

Known behaviors

  1. The module 0, 1, 2 and 3 partial event counts include 32-byte, 64-byte and partial requests and can be thought of as total request counts.

UNC_SOC_ModuleX_BW

UNC_SOC_Module1_BW, UNC_SOC_Module2_BW and UNC_SOC_Module3_BW are identical to the UNC_SOC_Module0_BW group, except that each counts events from its respective processor module.

UNC_SOC_Module0_1_BW

There are not enough counter resources to measure bandwidth from all processor modules concurrently, but the groups UNC_SOC_Module0_1_BW and UNC_SOC_Module2_3_BW provide counts to compute the bandwidth of two processor modules concurrently. The partial events are omitted to make this concurrent measurement possible.

The image below illustrates the traffic flow being monitored for UNC_SOC_Module0_1_BW.

Name Counter Description
Mod0_Read32B 0 Counts memory read requests of size 32 bytes from Silvermont module 0.
Mod0_Read64B 1 Counts memory read requests of size 64 bytes from Silvermont module 0.
Mod0_Write32B 2 Counts memory write requests of size 32 bytes from Silvermont module 0.
Mod0_Write64B 3 Counts memory write requests of size 64 bytes from Silvermont module 0.
Mod1_Read32B 4 Counts memory read requests of size 32 bytes from Silvermont module 1.
Mod1_Read64B 5 Counts memory read requests of size 64 bytes from Silvermont module 1.
Mod1_Write32B 6 Counts memory write requests of size 32 bytes from Silvermont module 1.
Mod1_Write64B 7 Counts memory write requests of size 64 bytes from Silvermont module 1.
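Because this group has no partial events, each module's read and write bandwidth comes only from its 32-byte and 64-byte counts. The sketch below splits the eight counts of UNC_SOC_Module0_1_BW by module using the ModN_ReadXXB / ModN_WriteXXB naming shown in the table above. Applying it to UNC_SOC_Module2_3_BW assumes that group's events follow the same naming, which this article does not list. The function and regular expression are illustrative and not part of any tool.

```python
import re
from collections import defaultdict
from typing import Dict, Mapping

BYTES_PER_MB = 1_000_000

# Matches sized request events such as "Mod1_Write64B": module, direction, size.
SIZED_EVENT = re.compile(r"^Mod(\d+)_(Read|Write)(32|64)B$")


def per_module_bandwidth(counts: Mapping[str, int],
                         seconds_sampled: float) -> Dict[int, Dict[str, float]]:
    """Read and write MB/s per processor module from a two-module BW group."""
    result = defaultdict(lambda: {"Read": 0.0, "Write": 0.0})
    for name, count in counts.items():
        match = SIZED_EVENT.match(name)
        if match is None:
            continue  # ignore any event that is not a sized read/write count
        module, direction, size = int(match.group(1)), match.group(2), int(match.group(3))
        result[module][direction] += count * size / seconds_sampled / BYTES_PER_MB
    return dict(result)
```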

UNC_SOC_Module0_1_Snoops

UNC_SOC_Module0_1_Snoops counts the number of snoop requests and snoop replies for Silvermont modules 0 and 1, as seen by the system agent. These counts can be used to cross-check other traffic counts and to correlate with core snoop event counts. Unlike the core-based snoop events, the uncore snoop counts cannot distinguish between cores within a module; they report the module total rather than per-core counts.

Name Counter Description
Mod0_Snoop_Replies 0 Counts the number of snoop replies received from module 0.
Mod0_Snoop_Reqs 1 Counts the number of snoop requests sent to module 0.
Mod1_Snoop_Replies 2 Counts the number of snoop replies received from module 1.
Mod1_Snoop_Reqs 3 Counts the number of snoop requests sent to module 1.
Clock_Counter 4 SoC clock counter

UNC_SOC_Module2_3_Snoops is identical to UNC_SOC_Module0_1_Snoops, except that it provides snoop counts for processor modules 2 and 3. UNC_SOC_Module0_1_2_3_Snoops collects snoop data for all four processor modules concurrently but does not provide the SoC clock counter.

Analyzing Results

Analysis of snoop results is specific to the usage model.

Known behaviors

  1. Snoop counts are per processor module totals with no per core break-down available.

UNC_SOC_LowSpeedPF_BW

The UNC_SOC_LowSpeedPF_BW group provides events to determine the bandwidth of the low speed peripheral fabric, representing the aggregate bandwidth of all south cluster units, such as USB, SATA and GbE.

Name Counter Description
LowSpeedPF_ReadPartial 0 Counts all low speed peripheral fabric read transactions with partial sized data requests.
LowSpeedPF_Read32B 1 Counts memory read requests of size 32 bytes from the low speed peripheral fabric.
LowSpeedPF_Read64B 2 Counts memory read requests of size 64 bytes from the low speed peripheral fabric.
LowSpeedPF_WritePartial 3 Counts low speed peripheral fabric write transactions with partial sized data requests.
LowSpeedPF_Write32B 4 Counts memory write requests of size 32 bytes from the low speed peripheral fabric.
LowSpeedPF_Write64B 5 Counts memory write requests of size 64 bytes from the low speed peripheral fabric.
Clock_Counter 6 SoC Clock Counter

Analyzing Results

Read and write bandwidth can be calculated for the 32 byte and 64 byte events, but the partial event is problematic for bandwidth computations since a partial request has an unknown payload size.

Event metric formula:

  • ((LowSpeedPF_Read32B_count * 32_bytes / seconds_sampled) + (LowSpeedPF_Read64B_count * 64_bytes / seconds_sampled)) / 1,000,000 = Read MB/s
  • ((LowSpeedPF_Write32B_count * 32_bytes / seconds_sampled) + (LowSpeedPF_Write64B_count * 64_bytes / seconds_sampled)) / 1,000,000 = Write MB/s
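Because this group includes Clock_Counter, the sampling interval can in principle be derived from the group itself rather than from a tool-reported time. The sketch below does that under an explicit assumption: the caller supplies the SoC clock frequency (the soc_clock_hz parameter is hypothetical; this article does not state the frequency). As in the formulas above, the partial events are left out because their payload size is unknown. The function name is illustrative.

```python
from typing import Mapping, Tuple

BYTES_PER_MB = 1_000_000


def lowspeed_pf_bandwidth(counts: Mapping[str, int],
                          soc_clock_hz: float) -> Tuple[float, float, float]:
    """Return (seconds_sampled, read MB/s, write MB/s) for UNC_SOC_LowSpeedPF_BW.

    ASSUMPTION: Clock_Counter ticks at soc_clock_hz, so elapsed time is
    Clock_Counter / soc_clock_hz. Supply the real SoC clock for your SKU.
    """
    def c(name: str) -> int:
        return counts.get(name, 0)

    seconds_sampled = c("Clock_Counter") / soc_clock_hz
    if seconds_sampled <= 0:
        raise ValueError("Clock_Counter must be positive to derive a sampling interval")

    # Partial events are skipped: their payload size is unknown.
    read_bytes = c("LowSpeedPF_Read32B") * 32 + c("LowSpeedPF_Read64B") * 64
    write_bytes = c("LowSpeedPF_Write32B") * 32 + c("LowSpeedPF_Write64B") * 64
    return (seconds_sampled,
            read_bytes / seconds_sampled / BYTES_PER_MB,
            write_bytes / seconds_sampled / BYTES_PER_MB)
```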

Known behaviors

  1. None.

UNC_SOC_HighSpeedPF_BW

The UNC_SOC_HighSpeedPF_BW group provides events to determine the bandwidth of the high speed peripheral fabric, representing the aggregate bandwidth of all high speed connectables. It is identical to the UNC_SOC_LowSpeedPF_BW group in event-counter configuration and analysis metrics; only the event name prefix changes from LowSpeedPF to HighSpeedPF.


For more detailed information about compiler-based optimizations, see our Optimization Notice.