One of my Xeon D-1521 based servers (Linux 4.15 w/Ubuntu 18.04) is experiencing bizarre performance throttling and I am struggling to diagnose the cause. The four cores (hyperthreading disabled) normally run at 800 MHz in power saving mode with a modest application load. After 2-3 days of uptime, core frequencies suddenly plummet to under 250 MHz. On two occasions, performance mysteriously recovered after 2-3 days in the degraded state. I captured the following data during the most recent failure/recovery:
MSR_IA32_THERM_STATUS (0x19c): Bit 10 (Power Limitation Status) goes HIGH in all four cores for the event duration.
MSR_IA32_PACKAGE_THERM_STATUS (0x1b1): Bit 10 (Power Limitation Status) goes HIGH for the event duration.
MSR_CORE_PERF_LIMIT_REASONS (0x690): Bit 2 (Power Budget Management), Bit 13 (Core Frequency P1 Status) and Bit 15 (Core Frequency Limiting Status) all go HIGH for the event duration.
MSR_PKG_PERF_STATUS (0x613) and MSR_DRAM_PERF_STATUS (0x61b) both count up RAPL throttling during the event.
MSR_TURBO_ACTIVATION_RATIO (0x64c) is updated every second or so with values from 7-9 (vs 24 before/after an event).
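For anyone wanting to watch the same flags, here is a minimal sketch of how the bits listed above can be read and decoded. It assumes the `msr` kernel module is loaded (so `/dev/cpu/N/msr` exists) and that it runs as root. The helper names (`read_msr`, `decode_flags`, `report`) are mine, and the tables only contain the bits mentioned in this post.

```python
import os
import struct

# Only the bits named in the post are listed; every other bit is ignored.
THERM_STATUS_BITS = {10: "Power Limitation Status"}  # 0x19c and 0x1b1
CORE_PERF_LIMIT_REASONS_BITS = {                      # 0x690
    2: "Power Budget Management",
    13: "Core Frequency P1 Status",
    15: "Core Frequency Limiting Status",
}


def read_msr(cpu, reg):
    """Read one 64-bit MSR through the msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def decode_flags(value, bit_names):
    """Return the names of the listed bits that are set, lowest bit first."""
    return [name for bit, name in sorted(bit_names.items())
            if value & (1 << bit)]


def report(cpus=range(4)):
    """Print the decoded flags; four cores assumed, as on this server."""
    for cpu in cpus:
        core = decode_flags(read_msr(cpu, 0x19C), THERM_STATUS_BITS)
        print(f"cpu{cpu} THERM_STATUS: {core}")
    # Package-scope registers can be read through any core; cpu 0 is used here.
    print("PACKAGE_THERM_STATUS:",
          decode_flags(read_msr(0, 0x1B1), THERM_STATUS_BITS))
    print("CORE_PERF_LIMIT_REASONS:",
          decode_flags(read_msr(0, 0x690), CORE_PERF_LIMIT_REASONS_BITS))
```

Calling `report()` once a second and logging the output with a timestamp should make it easy to line up the start and end of an event with other system logs.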
From my reading of the SDM, this indicates the RAPL power (but not thermal) limits have been exceeded and the cores are being forcefully throttled. However, MSR_PKG_ENERGY_STATUS / MSR_DRAM_ENERGY_STATUS report power usage of 11.0-11.5 W / 0.4-0.6 W respectively. Package power consumption actually increases to the 12-14 W range during an event (likely due to the cores being overloaded). Core/package temperatures are always under 40 °C before, during and after an event, so it is hard to believe the cause is thermal. I have increased the power limits in MSR_PKG_POWER_LIMIT to the maximum, but with no change.
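To show how wattage figures like those above can be derived from the energy counter, here is a rough sketch for the package domain only. It assumes the energy unit comes from bits 12:8 of MSR_RAPL_POWER_UNIT (0x606) and that MSR_PKG_ENERGY_STATUS (0x611) is a 32-bit counter that wraps. The function names are hypothetical, and it again needs the `msr` module and root. The DRAM domain is left out on purpose, since its energy unit may differ on server parts.

```python
import os
import struct
import time

MSR_RAPL_POWER_UNIT = 0x606
MSR_PKG_ENERGY_STATUS = 0x611


def read_msr(cpu, reg):
    """Read one 64-bit MSR through /dev/cpu/N/msr (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def energy_unit_joules(power_unit_msr):
    """Energy Status Units field (bits 12:8): one count is 1 / 2**ESU joules."""
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)


def average_watts(count_start, count_end, seconds, unit_joules):
    """Average power between two counter samples, allowing one 32-bit wrap."""
    delta = (count_end - count_start) & 0xFFFFFFFF
    return delta * unit_joules / seconds


def sample_package_power(interval=1.0, cpu=0):
    """Take two samples 'interval' seconds apart and return average watts."""
    unit = energy_unit_joules(read_msr(cpu, MSR_RAPL_POWER_UNIT))
    t0 = time.monotonic()
    e0 = read_msr(cpu, MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFF
    time.sleep(interval)
    e1 = read_msr(cpu, MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFF
    return average_watts(e0, e1, time.monotonic() - t0, unit)
```

The interval has to be short enough that the counter cannot wrap more than once between samples; at roughly 10-15 W, a one-second interval is far inside that limit.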
The application load includes 12 processes performing significant AVX2 256-bit calculations (real-time image reprocessing from IP cameras). While AVX2 increases power draw and reduces the maximum frequency, could that be relevant when the cores are already running in power saving mode at 800 MHz on a part rated at 2.4 GHz (non-turbo)? The motherboard includes a BMC, but it reports no events. Any thoughts are welcome, as I am running out of ideas about what else to monitor to track down the cause.